Program Analysis and Specialization for the C Programming Language
, 1994
Cited by 472 (0 self)
Software engineers are faced with a dilemma. They want to write general and well-structured programs that are flexible and easy to maintain. On the other hand, generality has a price: efficiency. A specialized program solving a particular problem is often significantly faster than a general program. However, the development of specialized software is time-consuming, and is likely to exceed the production of today’s programmers. New techniques are required to solve this so-called software crisis. Partial evaluation is a program specialization technique that reconciles the benefits of generality with efficiency. This thesis presents an automatic partial evaluator for the ANSI C programming language. The thesis concerns the analysis and transformation of C programs. We develop several analyses that support the transformation of a program into its generating extension. A generating extension is a program that produces specialized programs when executed on parts of the input. The thesis contains the following main results.
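The idea of a generating extension can be illustrated with a toy case. The sketch below is hand-written for illustration and is not taken from the thesis, which derives generating extensions automatically. The function names `power` and `power_gen` are hypothetical. `power` is the general program; `power_gen` takes only the static input (the exponent) and writes the C source of a specialized function in which the loop has been unrolled.

```c
#include <stdio.h>
#include <stddef.h>

/* General program: computes base^n for any exponent n. */
long power(long base, unsigned n)
{
    long result = 1;
    while (n-- > 0)
        result *= base;
    return result;
}

/* Hypothetical, hand-written generating extension for power().
   Given the static input n, it writes the source of a residual
   function power_<n>(base) into buf, with the loop fully unrolled.
   Returns the number of characters written, or -1 if buf is too small. */
int power_gen(unsigned n, char *buf, size_t size)
{
    size_t used;
    int w = snprintf(buf, size, "long power_%u(long base)\n{\n    return 1", n);
    if (w < 0 || (size_t)w >= size)
        return -1;
    used = (size_t)w;

    /* One multiplication per loop iteration of the general program. */
    for (unsigned i = 0; i < n; i++) {
        w = snprintf(buf + used, size - used, " * base");
        if (w < 0 || (size_t)w >= size - used)
            return -1;
        used += (size_t)w;
    }

    w = snprintf(buf + used, size - used, ";\n}\n");
    if (w < 0 || (size_t)w >= size - used)
        return -1;
    used += (size_t)w;
    return (int)used;
}
```

Calling `power_gen(3, buf, sizeof buf)` fills `buf` with the text of a function `power_3` whose body is `return 1 * base * base * base;`. The residual function has no loop and no exponent parameter, which is where the speed-up of a specialized program comes from.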
The Performance Impact of Flexibility in the Stanford FLASH Multiprocessor
- In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems
, 1994
Cited by 53 (9 self)
A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In the Stanford FLASH multiprocessor, flexibility is obtained by requiring all transactions in a node to pass through a programmable node controller, called MAGIC. In this paper, we evaluate the performance costs of flexibility by comparing the performance of FLASH to that of an idealized hardwired machine on representative parallel applications and a multiprogramming workload. To measure the performance of FLASH, we use a detailed simulator of the FLASH and MAGIC designs, together with the code sequences that implement the cache-coherence protocol. We find that for a range of optimized parallel applications the performance differences between the idealized machine and FLASH are small. For these programs, either the miss rates are small or the latency of the programmable protocol can be hidden behind the memory access time. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, performance is poor for both machines, though the increased remote access latencies or the occupancy of MAGIC lead to lower performance for the flexible design. In most cases, however, FLASH is only 2%-12% slower than the idealized machine.
A Code Generation Interface for ANSI C
, 1991
Cited by 42 (6 self)
machine code resembles assembly or machine language for a fictitious computer. A front end emits a stream of instructions (in a text or compressed binary encoding) to a logically separate back end. Each approach has strengths and weaknesses. Abstract machine codes permit the front and back ends, and perhaps an optimizer, to run as separate processes. Uni-process compilers are generally faster, but multi-process compilers might run faster on some multi-processor computers. If the compiler is complex, a multi-process compiler might simplify team development.
Components, Frameworks, Patterns
- COMMUNICATIONS OF THE ACM
, 1997
Cited by 38 (1 self)
Frameworks are an object-oriented reuse technique that is widely used in industry but not discussed much by the software engineering research community. They are a way of reusing design that is part of the reason that some object-oriented developers are so productive. This paper compares and contrasts frameworks with other reuse techniques, and describes how to use them, how to evaluate them, and how to develop them. It describes the tradeoffs involved in using frameworks, including the costs and pitfalls, and when frameworks are appropriate.
LISA - Machine Description Language and Generic Machine Model for HW/SW Co-Design
- in Proceedings of the IEEE Workshop on VLSI Signal Processing
, 1996
Cited by 35 (5 self)
In the paper a new machine description language is presented. The new language, LISA, and its generic machine model are able to produce bit- and cycle/phase-accurate processor models covering the specific needs of HW/SW co-design and co-simulation environments. The development of a new language was necessary in order to cover the gap between coarse ISA models used in compilers and instruction-set simulators on the one hand, and detailed models used for hardware design on the other. The main part of the paper is devoted to behavioral pipeline modeling. The pipeline controller of the generic machine model is represented as an ASAP (As Soon As Possible) sequencer parameterized by precedence and resource constraints of operations of each instruction. The standard pipeline description based on reservation tables and Gantt charts was extended by additional operation descriptors which enable the detection of data and control hazards, and permit modeling of pipeline flushes. Using the newly i...
Evaluating Runtime-Compiled Value-Specific Optimizations
, 1993
Cited by 34 (2 self)
Traditional compiler optimizations are either data-independent or optimize around common data values while retaining correct behavior for uncommon values. This paper examines value-specific data-dependent optimizations (VSO), where code is optimized at runtime around particular input values. Because VSO optimizes for the specific case, the resulting code is more efficient. However, since optimization is performed at runtime, the performance improvement must more than pay for the runtime compile costs. We describe two VSO implementation techniques and compare the performance of applications that have been implemented using both VSO and static code. The results demonstrate that VSO produces better code and often for reasonable input sizes. The machine-independent implementations showed speedups of up to 1.5 over static C code, and the machine-dependent versions showed speedups of up to 4.3 over static assembly code.
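The requirement that the improvement must pay for the runtime compile cost can be written as a break-even inequality. The symbols are illustrative assumptions, not notation from the paper: $C$ is the one-time runtime compile cost, $t_s$ and $t_v$ are the per-item running times of the static and the value-specific code, and $n$ is the input size.

```latex
C + n\,t_v < n\,t_s
\quad\Longleftrightarrow\quad
n > \frac{C}{t_s - t_v},
\qquad t_s > t_v .
```

With made-up numbers $C = 1000\,\mu\text{s}$, $t_s = 3\,\mu\text{s}$ and $t_v = 1\,\mu\text{s}$, the specialized code wins only for inputs of more than 500 items.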
Integrating Performance Monitoring and Communication in Parallel Computers
- in Proceedings of the SIGMETRICS International Conference
, 1996
Cited by 28 (6 self)
A large and increasing gap exists between processor and memory speeds in scalable cache-coherent multiprocessors. To cope with this situation, programmers and compiler writers must increasingly be aware of the memory hierarchy as they implement software. Tools to support memory performance tuning have, however, been hobbled by the fact that it is difficult to observe the caching behavior of a running program. Little hardware support exists specifically for observing caching behavior; furthermore, what support does exist is often difficult to use for making fine-grained observations about program memory behavior. Our work observes that in a multiprocessor, the actions required for memory performance monitoring are similar to those required for enforcing cache coherence. In fact, we argue that on several machines, the coherence/communication system itself can be used as machine support for performance monitoring. We have demonstrated this idea by implementing the FlashPoint memory performance monitoring tool. FlashPoint is implemented as a special performance-monitoring coherence protocol for the Stanford FLASH Multiprocessor. By embedding performance monitoring into a cache-coherence scheme based on a programmable controller, we can gather detailed, per-data-structure memory statistics with less than a 10% slowdown compared to unmonitored program executions. We present results on the accuracy of the data collected, and on how FlashPoint performance scales with the number of processors.
Signatures: A Language Extension for Improving Type Abstraction and Subtype Polymorphism in C++
- SOFTWARE–PRACTICE AND EXPERIENCE
, 1995
The effect of code expanding optimizations on instruction cache design
- IEEE TRANS. COMPUT
, 1993
Cited by 26 (2 self)
This paper shows that code expanding optimizations have strong and non-intuitive implications for instruction cache design. Three types of code expanding optimizations are studied in this paper: instruction placement, function inline expansion, and superscalar optimizations. Overall, instruction placement reduces the miss ratio of small caches. Function inline expansion improves the performance for small cache sizes, but degrades the performance of medium caches. Superscalar optimizations increase the cache size required for a given miss ratio. On the other hand, they also increase the sequentiality of instruction access so that a simple load-forward scheme effectively cancels the negative effects. Overall, we show that with load forwarding, the three types of code expanding optimizations jointly improve the performance of small caches and have little effect on large caches.

