Optimistic Incremental Specialization: Streamlining a Commercial Operating System
Proc. of SOSP, 1995
Cited by 142 (44 self)
Conventional operating system code is written to deal with all possible system states, and performs considerable interpretation to determine the current system state before taking action. A consequence of this approach is that kernel calls which perform little actual work take a long time to execute. To address this problem, we use specialized operating system code that reduces interpretation for common cases, but still behaves correctly in the fully general case. We describe how specialized operating system code can be generated and bound incrementally as the information on which it depends becomes available. We extend our specialization techniques to include the notion of optimistic incremental specialization: a technique for generating specialized kernel code optimistically for system states that are likely to occur, but not certain. The ideas outlined in this paper allow the conventional kernel design tenet of “optimizing for the common case” to be extended to the domain of adaptive operating systems. We also show that aggressive use of specialization can produce in-kernel implementations of operating system functionality with performance comparable to user-level implementations. We demonstrate that these ideas are applicable in real-world operating systems by describing a re-implementation of the HP-UX file system. Our specialized read system call reduces the cost of a single byte read by a factor of 3, and an 8 KB read by 26%, while preserving the semantics of the HP-UX read call. By relaxing the semantics of HP-UX read we were able to cut the cost of a single byte read system call by more than an order of magnitude.
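The idea in this abstract, binding a fast path optimistically and falling back to general code when its assumption breaks, can be illustrated with a toy sketch. This is not the paper's HP-UX implementation: the names `OpenFile`, `open_file`, `share`, `generic_read`, and `specialized_read` are hypothetical, and the "file is not shared" invariant is an assumed example of a likely-but-not-certain system state.

```python
from dataclasses import dataclass


@dataclass
class OpenFile:
    data: bytes
    offset: int = 0
    shared: bool = False      # assumed invariant: the fast path needs this to stay False
    read_impl: object = None  # the read routine currently bound to this file
    fast_calls: int = 0
    generic_calls: int = 0


def generic_read(f, n):
    """General path: interprets the current state on every call."""
    f.generic_calls += 1
    if f.shared:
        pass  # placeholder for extra work a shared file would need (e.g. locking)
    chunk = f.data[f.offset:f.offset + n]
    f.offset += len(chunk)
    return chunk


def specialized_read(f, n):
    """Fast path: skips the state checks; valid only while f.shared is False."""
    f.fast_calls += 1
    chunk = f.data[f.offset:f.offset + n]
    f.offset += len(chunk)
    return chunk


def open_file(data):
    """Bind the specialized routine optimistically when the file is opened."""
    f = OpenFile(data)
    f.read_impl = specialized_read
    return f


def share(f):
    """Guard: the event that violates the assumption rebinds the general routine."""
    f.shared = True
    f.read_impl = generic_read


def read(f, n):
    """Entry point: dispatches to whichever routine is currently bound."""
    return f.read_impl(f, n)
```

In this sketch the cost of checking the invariant is moved from every `read` call to the rare `share` event, which is the trade the abstract describes as optimizing for the common case.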
Generating Efficient Protocol Code from an Abstract Specification
1996
Cited by 50 (0 self)
A protocol compiler takes as input an abstract specification of a protocol and generates an implementation of that protocol. Protocol compilers usually produce inefficient code both in terms of code speed and code size. In this paper, we show that by compiling a modular specification into an integrated automaton and by selectively optimizing its different transitions, it is possible to automatically generate efficient protocol code. Our protocol compiler takes as input a protocol specification in the synchronous language Esterel and compiles it into a C implementation. This process is divided into two stages. First, the specification is compiled into an integrated automaton by the Esterel front end. This automaton is then optimized and converted into an efficient C implementation by a protocol code optimizer called HIPPCO. HIPPCO improves performance and reduces code size by simultaneously optimizing the performance of common path whi...
Fast Concurrent Dynamic Linking for an Adaptive Operating System
In International Conference on Configurable Distributed Systems (ICCDS'96), 1996
Cited by 28 (9 self)
The need for customizable and application-specific operating systems has been recognized for many years. A customizable operating system is one that can adapt to some particular circumstance to gain some functional or performance benefits. Microkernels have attempted to address this problem, but suffer performance degradation due to the cost of inter-process protection barriers. Commercial operating systems that can efficiently adapt themselves to changing circumstances have failed to appear, in part due to the difficulty of providing an interface that is efficient to invoke, provides a protection barrier, and can be dynamically reconfigured. Providing such a safe, efficient, and dynamic interface in a concurrent operating system requires an effective concurrency control mechanism to prevent conflicts between system components proposing to execute specialized components, and those components responsible for dynamically replacing specialized components. This paper outlines our basic app...
Demultiplexed Architectures: A Solution for Efficient STREAMS Based Communication Stacks
IEEE Network Magazine, 1997
Cited by 19 (0 self)
This paper analyzes the efficiency of various high performance implementation techniques for the communication system of UNIX workstations. Using an Open System implies that a certain compatibility level is required from the protocol, user interface, and implementation framework. These constraints limit the opportunities to design a high performance communication system. We have designed an experimental platform around the TCP/IP protocol suite, using the STREAMS environment. A BSD TCP/IP stack and a classic STREAMS based TCP/IP stack serve as reference implementations for performance comparisons. We explain why the efficiency of some high performance implementation techniques we applied to this platform is limited. The impacts of the hardware architecture, of the operating system, and of the communication stack architecture on performance are analyzed. It is shown that the efficiency of data transmission would benefit from more simplicity and more synchronism in the communication e...
Integrated Layer Processing Can Be Hazardous to Your Performance
1996
Cited by 10 (2 self)
Integrated Layer Processing (ILP) has been presented as an implementation technique to improve communication protocol performance by reducing the number of memory references. However, previous research has not pointed out that in some circumstances ILP can significantly increase the number of memory references, resulting in lower communication throughput. We explore the performance effects of applying ILP to data manipulation functions with varying characteristics. The functions are generated from a set of parameters including input and output block size, state size and number of instructions. We present experimental data for varying function state sizes, number of integrated functions and instruction counts. The results clearly show that the aggregated state of the functions must fit in registers for ILP to be competitive. Keywords: ILP, Integrated Layer Processing, performance, protocol implementation. Supported by the Swedish National Board for Industrial and Technical Development,...
When does Dedicated Protocol Processing Make Sense?
Computer Sciences Department, University of Wisconsin-Madison, 1996
Cited by 7 (1 self)
Distributed-memory parallel computers and networks of workstations (NOWs) both rely on efficient communication over increasingly high-speed networks. Software communication protocols, from flow control and reliable delivery to multicasting and coherent distributed shared memory, are often the performance bottleneck. Several current and proposed parallel systems (e.g., the Intel Paragon) address this problem by dedicating one general-purpose processor (in a multiprocessor node) specifically for protocol processing. This operating system convention reduces communication latency and increases effective bandwidth, but also reduces the peak performance since the dedicated processor no longer performs "useful" computation. In this paper, we study a network of multiprocessor workstations and ask the question: "when does it make sense to dedicate a processor specifically for protocol processing?" We compare three protocol processing policies: Single, the baseline case with one processor ...
Towards Predictable ILP Performance - Controlling Communication Buffer Cache Effects
1996
Cited by 4 (1 self)
Cache memory behavior is becoming more and more important as the speed of CPUs is increasing faster than the speed of memories. The operation of caches is statistical, which means that the system level performance becomes unpredictable. In this paper we investigate the worst case behavior of cache line conflicts in the context of communication protocols implemented using Integrated Layer Processing. The goal of our work is to control the cache by placing communication buffers and code in non-conflicting positions in the cache. The result would be higher and more predictable performance. Our first results indicate that the worst case behavior can be up to almost four times slower than the best case. Supported by the Swedish National Board for Industrial and Technical Development, under Esprit Basic Research Action project HIPPARCH EC-AUS004. Author affiliations: Swedish Institute of Computer Science, Box 1263, S-164 28 Kista, Sweden; Uppsala University, Dept. of Computer Systems, Box 325, S-751 05...
Fine-Grain Protocol Execution Mechanisms & Scheduling Policies on SMP Clusters
1998
Cited by 3 (2 self)
Symmetric multiprocessor (SMP) clusters are emerging as the cost-effective medium- to large-scale parallel computers of choice, exploiting the superior cost-performance of SMP desktops and servers. These machines implement communication among SMP nodes by sending/receiving messages through an interconnection network. Many applications and systems use a variety of software protocols to coordinate this communication. As such, protocol performance can significantly impact communication time and overall system performance. This thesis proposes and evaluates techniques to improve fine-grain software protocol performance. Rather than provide embedded network interface processors, some systems schedule and execute the protocol code on the SMP processors to reduce hardware complexity and cost. This thesis evaluates when it is beneficial to dedicate one or more processors in every SMP to always execute the protocol code. Results from simulating a fine-grain software distributed shared memory (D...
HIPPCO: A High Performance Protocol Code Optimizer
1995
Cited by 2 (0 self)
This report presents HIPPCO, a High Performance Protocol Code Optimizer. HIPPCO belongs to the HIPPARCH compiler. HIPPARCH is a tool that aims to automatically generate, from the application communication requirements and the network characteristics, an efficient implementation of a customized protocol. HIPPCO is the last stage of this protocol compiler. It takes as input a description of the protocol automaton, optimizes it and generates an implementation in C. HIPPCO decomposes the protocol automaton into two parts: the common and uncommon paths. It then uses this decomposition to apply a set of optimizations toward a good code speed/code size tradeoff. In the first part of this report, the code speed optimizations are described. These optimizations reduce the number of executed instructions and improve the instruction cache and pipeline behaviors. In the second part, HIPPCO's automatically generated implementations of TCP are compared with the BSD implementation. ...
Function Outlining and Partial Inlining
Cited by 2 (0 self)
Frequently invoked large functions are common in non-numeric applications. These large functions present challenges to modern compilers not only because they require more time and resources at compilation time, but also because they may prevent optimizations such as function inlining. However, it is usually the case that large portions of the code in a hot function f_host are executed much less frequently than f_host itself. Partial inlining is a natural solution to the problems caused by including cold code segments that are seldom executed into hot functions that are frequently invoked. When applying partial inlining, a compiler outlines cold statements from a hot function f_host. After outlining, f_host becomes smaller and thus can be easily inlined. This paper presents a framework for function outlining and partial inlining that includes several innovations: (1) an abstract-syntax-tree-based analysis and transformation to form cold regions for outlining; (2) a set of flexible heuristics to control the aggressiveness of function outlining; (3) several possible function outlining strategies; (4) alias agent, a new technique that overcomes negative side-effects of function outlining. With the proper strategy, partial inlining improves performance by up to 5.75%. A performance study also suggests that partial inlining is not effective at enabling more aggressive inlining. The performance improvement from partial inlining actually comes from better code placement and better code generation.

