Dynamic IPC/Clock Rate Optimization
- PROCEEDINGS OF THE 25TH INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, 1998
Cited by 73 (17 self)
Current microprocessor designs set the functionality and clock rate of the chip at design time based on the configuration that achieves the best overall performance over a range of target applications. The result may be poor performance when running applications whose requirements are not well-matched to the particular hardware organization chosen. We present a new approach called Complexity-Adaptive Processors (CAPs) in which the IPC/clock rate tradeoff can be altered at runtime to dynamically match the changing requirements of the instruction stream. By exploiting repeater methodologies used increasingly in deep sub-micron designs, CAPs achieve this flexibility with potentially no cycle time impact compared to a fixed architecture. Our preliminary results in applying this approach to on-chip caches and instruction queues indicate that CAPs have the potential to significantly outperform conventional approaches on workloads containing both general-purpose and scientific applications.
User Transparency: A Fully Sequential Programming Model for Efficient Data Parallel Image Processing
- Science, University of Amsterdam, The Netherlands, 2002
Cited by 15 (8 self)
Although many image processing applications are ideally suited for parallel implementation, most researchers in imaging do not benefit from high performance computing on a daily basis. Essentially, this is due to the fact that no parallelization tools exist that truly match the image processing researcher's frame of reference. As it is unrealistic to expect imaging researchers to become experts in parallel computing, tools must be provided to allow them to develop high performance applications in a highly familiar manner. In an attempt to provide such a tool, we have designed a software architecture that allows transparent (i.e., sequential) implementation of data parallel imaging applications for execution on homogeneous distributed memory MIMD-style multicomputers. This paper presents an extensive overview of the design rationale behind the software architecture, and gives an assessment of the architecture's effectiveness in providing significant performance gains. In particular, we describe the implementation and automatic parallelization of three well-known example applications that contain many fundamental imaging operations: (1) template matching, (2) multi-baseline stereo vision, and (3) line detection. Based on experimental results we conclude that our software architecture constitutes a powerful and user-friendly tool for obtaining high performance in many important image processing research areas.
Efficient applications in user transparent parallel image processing
- In: Proceedings of International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshop on Parallel and Distributed Computing in Image Processing, Video Processing, and Multimedia (PDIVM’2002), Fort Lauderdale, 2001
Cited by 5 (1 self)
Although many image processing applications are ideally suited for parallel implementation, most researchers in imaging do not benefit from high performance computing on a daily basis. Essentially, this is due to the fact that no parallelization tools exist that truly match the image processing researcher’s frame of reference. As it is unrealistic to expect imaging researchers to become experts in parallel computing, tools must be provided to allow them to develop high performance applications in a highly familiar manner. In an attempt to provide such a tool, we have designed a software architecture that allows transparent (i.e., sequential) implementation of data parallel imaging applications for execution on homogeneous distributed memory MIMD-style multicomputers. This paper gives an assessment of the architecture’s effectiveness in providing significant performance gains. In particular, we describe
Formal Derivation of Divide-and-Conquer Programs: A Case Study in the Multidimensional FFT's
- Formal Methods for Parallel Programming: Theory and Applications. Workshop at IPPS'97, 1997
Cited by 4 (3 self)
This paper reports a case study in the development of parallel programs in the Bird-Meertens formalism (BMF), starting from divide-and-conquer algorithm specifications. The contribution of the paper is two-fold: (1) we classify divide-and-conquer algorithms and formally derive a parameterized family of parallel implementations for an important subclass of divide-and-conquer, called DH (distributable homomorphisms); (2) we systematically adjust the mathematical specification of the Fast Fourier Transform (FFT) to the DH format and thereby obtain a generic SPMD program, well suited for implementation under MPI. The target program includes the efficient FFT solutions used in practice (the binary-exchange and the 2D- and 3D-transpose implementations) as its special cases.
A Comparative Analysis of Four Parallelisation Schemes
- in Proceedings of the 1999 ACM International Conference on Supercomputing, ACM, 1999
Cited by 3 (1 self)
An experimental study of four different schemes for parallelisation of FORTRAN codes is presented. One scheme is manual (performed by the programmer), the other three are automatic (performed entirely by software). The performance of code generated for two parallel computers from seven different test cases is compared, and reasons for differences in achieved performance between the four parallelisation schemes are analysed. It is concluded that, even using sophisticated techniques, high performance parallelised code cannot be generated by automatic tools unless they take into account feedback about execution-time behaviour. Both post-execution performance analysis and interaction with the programmer are necessary for success. This observation argues for user-centred, feedback-driven parallelisation tools that aid the manual process.
1 Introduction
Despite the ready availability of parallel computers over the last thirty years, they have not been taken up extensively as a means of red...
An Architectural And Circuit-Level Approach To Improving The Energy Efficiency Of Microprocessor Memory Structures
- In Proc. the 10th International Conference on VLSI, 1999
Cited by 3 (0 self)
We present a combined architectural and circuit technique for reducing the energy dissipation of microprocessor memory structures. This approach exploits the subarray partitioning of high speed memories and varying application requirements to dynamically disable partitions during appropriate execution periods. When applied to 4-way set associative caches, trading off a 2% performance degradation yields a combined 40% reduction in L1 Dcache and L2 cache energy dissipation.
1. INTRODUCTION
The continuing microprocessor performance gains afforded by advances in semiconductor technology have come at the cost of increased power consumption. Each new high performance microprocessor generation brings additional on-chip functionality, and thus an increase in switching capacitance, as well as increased clock speeds over the previous generation. For example, both transistor count and clock speed have roughly doubled in the three years separating the Alpha 21164 microprocessor [6, 11] and the...

