Results 1 - 10
of
33
An enhanced access and cycle time model for on-chip caches
, 1994
"... research relevant to the design and application of high performance scientific computers. We test our ideas by designing, building, and using real systems. The systems we build are research prototypes; they are not intended to become products. There is a second research laboratory located in Palo Al ..."
Abstract
-
Cited by 230 (5 self)
- Add to MetaCart
research relevant to the design and application of high performance scientific computers. We test our ideas by designing, building, and using real systems. The systems we build are research prototypes; they are not intended to become products. There is a second research laboratory located in Palo Alto, the Systems Research Center (SRC). Other Digital research groups are located in Paris (PRL) and in Cambridge,
Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines
, 1989
"... Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruc ..."
Abstract
-
Cited by 192 (13 self)
- Add to MetaCart
Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining. This is a preprint of a paper that will be presented at the 3rd International Conference on Architectur...
Potential benefits of delta encoding and data compression for HTTP (Corrected version)
, 1997
"... ..."
Rewriting Executable Files to Measure Program Behavior
- SOFTWARE PRACTICE & EXPERIENCE
, 1994
"... ..."
Compacting Garbage Collection with Ambiguous Roots
, 1988
"... This paper introduces a copying garbage collection algorithm which is able to compact most of the accessible storage in the heap without having an explicitly defined set of pointers that contain the roots of all accessible storage. Using "hints" found in the processor's registers and stack, the algo ..."
Abstract
-
Cited by 98 (3 self)
- Add to MetaCart
This paper introduces a copying garbage collection algorithm which is able to compact most of the accessible storage in the heap without having an explicitly defined set of pointers that contain the roots of all accessible storage. Using "hints" found in the processor's registers and stack, the algorithm is able to divide heap allocated objects into two groups: those that might be referenced by a pointer in the stack or registers, and those that are not. The objects which might be referenced are left in place, and the other objects are copied into a more compact representation. A Lisp compiler and runtime system which uses such a collector need not have complete control of the processor in order to force a certain discipline on the stack and registers. A Scheme implementation has been done for the Digital WRL Titan processor which uses a garbage collector based on this "mostly copying" algorithm. Like other languages for the Titan, it uses the Mahler intermediate language as its targe...
Systems for Late Code Modification
- WRL Research Report 91/5
, 1991
"... Modifying code after the compiler has generated it can be useful for both optimization and instrumentation. This paper compares the code modification systems of Mahler and pixie, and describes two new systems we have built that are hybrids of the two. This paper covers material presented at the CODE ..."
Abstract
-
Cited by 88 (5 self)
- Add to MetaCart
Modifying code after the compiler has generated it can be useful for both optimization and instrumentation. This paper compares the code modification systems of Mahler and pixie, and describes two new systems we have built that are hybrids of the two. This paper covers material presented at the CODE '91 International Workshop on Code Generation, Schloss Dagstuhl, Germany, May 20-24, 1991. i 1. Introduction Late code modification is the process of modifying the output of a compiler after the compiler has generated it. The reasons one might want to do this fall into two categories, optimization and instrumentation. Some forms of optimization must be performed on assembly-level or machinelevel code. The oldest is peephole optimization [11], which acts to tidy up code that a compiler has generated; it has since been generalized to include transformations on more machine-independent code [2,3]. Reordering of code to avoid pipeline stalls [4,7,18] is most often done after the code is gene...
Scalable kernel performance for Internet servers under realistic loads
, 1998
"... UNIX Internet servers with an event-driven architecture often perform poorly under real workloads, even if they perform well under laboratory benchmarking conditions. We investigated the poor performance of event-driven servers. We found that the delays typical in wide-area networks cause busy serve ..."
Abstract
-
Cited by 86 (9 self)
- Add to MetaCart
UNIX Internet servers with an event-driven architecture often perform poorly under real workloads, even if they perform well under laboratory benchmarking conditions. We investigated the poor performance of event-driven servers. We found that the delays typical in wide-area networks cause busy servers to manage a large number of simultaneous connections. We also observed that the select system call implementation in most UNIX kernels scales poorly with the number of connections being managed by a process. The UNIX algorithm for allocating file descriptors also scales poorly. These algorithmic problems lead directly to the poor performance of event-driven servers. We implemented scalable versions of the select system call and the descriptor allocation algorithm. This led to an improvement of up to 58% in Web proxy and Web server throughput, and dramatically improved the scalability of the system.
Quantifying behavioral differences between C and C++ programs
- JOURNAL OF PROGRAMMING LANGUAGES
, 1994
"... Improving the performance of C programs has been a topic of great interest for many years. Both hardware technology and compiler optimization research has been applied in an effort to make C programs execute faster. In many application domains, the C++ language is replacing C as the programming lang ..."
Abstract
-
Cited by 83 (15 self)
- Add to MetaCart
Improving the performance of C programs has been a topic of great interest for many years. Both hardware technology and compiler optimization research has been applied in an effort to make C programs execute faster. In many application domains, the C++ language is replacing C as the programming language of choice. In this paper, we measure the empirical behavior of a group of significant C and C++ programs and attempt to identify and quantify behavioral differences between them. Our goal is to determine whether optimization technology that has been successful for C programs will also be successful in C++ programs. We furthermore identify behavioral characteristics of C++ programs that suggest optimizations that should be applied in those programs. Our results show that C++ programs exhibit behavior that is significantly different than C programs. These results should be of interest to compiler writers and architecture designers who are designing systems to execute object-oriented programs.
Memory-System Design Considerations For Dynamically-Scheduled Microprocessors
, 1997
"... Memory-System Design Considerations for Dynamically-Scheduled Microprocessors Keith Istvan Farkas Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto 1997 Dynamically-scheduled processors challenge hardware and software architects to develop designs ..."
Abstract
-
Cited by 66 (4 self)
- Add to MetaCart
Memory-System Design Considerations for Dynamically-Scheduled Microprocessors Keith Istvan Farkas Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto 1997 Dynamically-scheduled processors challenge hardware and software architects to develop designs that balance hardware complexity and compiler technology against performance targets. This dissertation presents a first thorough look at some of the issues introduced by this hardware complexity. The focus of the investigation of these issues is the register file and the other components of the data memory system. These components are: the lockup-free data cache, the stream buffers, and the interface to the lower levels of the memory system. The investigation is based on software models. These models incorporate the features of a dynamically-scheduled processor that affect the design of the data-memory components. The models represent a balance between accuracy and generality, and ar...
System Support for Automatic Profiling and Optimization
"... The Morph system provides a framework for automatic collection and management of profile information and application of profile-driven optimizations. In this paper, we focus on the operating system support that is required to collect and manage profile information on an end-user’s workstation in an ..."
Abstract
-
Cited by 59 (6 self)
- Add to MetaCart
The Morph system provides a framework for automatic collection and management of profile information and application of profile-driven optimizations. In this paper, we focus on the operating system support that is required to collect and manage profile information on an end-user’s workstation in an automatic, continuous, and transparent manner. Our implementation for a Digital Alpha machine running Digital UNIX 4.0 achieves run-time overheads of less than 0.3 % during profile collection. Through the application of three code layout optimizations, we further show that Morph can use statistical profiles to improve application performance. With appropriate system support, automatic profiling and optimization is both possible and effective.

