Quantifying behavioral differences between C and C++ programs
- JOURNAL OF PROGRAMMING LANGUAGES
, 1994
Cited by 83 (15 self)
Improving the performance of C programs has been a topic of great interest for many years. Both hardware technology and compiler optimization research have been applied in an effort to make C programs execute faster. In many application domains, the C++ language is replacing C as the programming language of choice. In this paper, we measure the empirical behavior of a group of significant C and C++ programs and attempt to identify and quantify behavioral differences between them. Our goal is to determine whether optimization technology that has been successful for C programs will also be successful in C++ programs. We furthermore identify behavioral characteristics of C++ programs that suggest optimizations that should be applied in those programs. Our results show that C++ programs exhibit behavior that is significantly different from that of C programs. These results should be of interest to compiler writers and architecture designers who are designing systems to execute object-oriented programs.
Reducing Branch Costs via Branch Alignment
- In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems
, 1994
Cited by 80 (13 self)
Several researchers have proposed algorithms for basic block reordering. We call these branch alignment algorithms. The primary emphasis of these algorithms has been on improving instruction cache locality, and the few studies concerned with branch prediction reported small or minimal improvements. As wide-issue architectures become increasingly popular, the importance of reducing branch costs will increase, and branch alignment is one mechanism that can effectively reduce these costs. In this paper, we propose an improved branch alignment algorithm that takes into consideration the architectural cost model and the branch prediction architecture when performing the basic block reordering. We show that branch alignment algorithms can improve a broad range of static and dynamic branch prediction architectures. We also show that a program's performance can be improved by approximately 5% even when using recently proposed, highly accurate branch prediction architectures. The programs are compi...
SPAID: Software Prefetching in Pointer- and Call-Intensive Environments
- In Proceedings of the 28th annual international symposium on Microarchitecture
, 1995
Cited by 58 (3 self)
Software prefetching, typically in the context of numeric or loop-intensive benchmarks, has been proposed as one remedy for the performance bottleneck imposed on computer systems by the cost of servicing cache misses. This paper proposes a new heuristic, SPAID, for utilizing prefetch instructions in pointer- and call-intensive environments. We use trace-driven cache simulation of a number of pointer- and call-intensive benchmarks to evaluate the benefits and implementation trade-offs of SPAID. Our results indicate that a significant proportion of the cost of data cache misses can be eliminated or reduced with SPAID without unduly increasing memory traffic. 1. Introduction It is well known that processor clock speeds are increasing exponentially over time, while memory speeds are not increasing nearly as rapidly [RD94]. The computing industry has reached the point where system performance is dominated by the cost of servicing cache misses. To address this problem, several instruction s...
Procedure Placement Using Temporal-Ordering Information
- ACM TRANSACTIONS ON PROGRAMMING LANGUAGES AND SYSTEMS
, 1997
Code Layout Optimizations for Transaction Processing Workloads
- IN PROC. 28TH ANNUAL INT. SYMP. COMPUTER ARCHITECTURE
, 2001
Cited by 18 (4 self)
Commercial applications such as databases and Web servers constitute the most important market segment for high-performance servers. Among these applications, on-line transaction processing (OLTP) workloads provide a challenging set of requirements for system designs since they often exhibit inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication miss rates. A number of recent studies have characterized the behavior of commercial workloads and proposed architectural features to improve their performance. However, there has been little research on the impact of software and compiler-level optimizations for improving the behavior of such workloads. This paper provides a detailed study of profile-driven compiler optimizations to improve the code layout in commercial workloads with
A Hypergraph Framework For Optimal Model-Based Decomposition Of Design Problems
- Computational Optimization and Applications
, 1997
Cited by 17 (11 self)
Decomposition of large engineering system models is desirable since increased model size reduces the reliability and speed of numerical solution algorithms. The article presents a methodology for optimal model-based decomposition (OMBD) of design problems, whether or not initially cast as optimization problems. The overall model is represented by a hypergraph and is optimally partitioned into weakly connected subgraphs that satisfy decomposition constraints. Spectral graph-partitioning methods together with iterative improvement techniques are proposed for hypergraph partitioning. A known spectral K-partitioning formulation, which accounts for partition sizes and edge weights, is extended to graphs that also have vertex weights. The OMBD formulation is robust enough to account for computational demands and resources and for the strength of interdependencies between the computational modules contained in the model. KEYWORDS: Model decomposition, multidisciplinary design, hypergraph partitioning, larges...
Local Area Network Traffic Locality: Characteristics and Application
, 1992
Cited by 4 (1 self)
Local area networks (LANs) are a popular means for connecting autonomous workstations together in a computational environment. LANs offer several advantages over traditional centralized computer systems, including better reliability, scalability, and cost. Certain limitations of LANs, such as their limited geographical span and the upper bound on the number of hosts that can be connected by a single LAN, can be overcome by interconnecting LANs into an extended LAN or internetwork. Interconnection devices, such as bridges or routers, forward packets as necessary between the LANs, creating the illusion of a single large network. The performance of network interconnection devices is key to the success of extended LANs. Interconnection devices must selectively forward packets between LANs with minimal delay, so as to transparently extend the LAN. As networks increase in size and bandwidth, the need for efficient network interconnection devices increases. One of the goals of this the...
Code Placement using Temporal Profile Information
, 1998
Cited by 3 (0 self)
Instruction cache performance is important to instruction fetch efficiency and overall processor performance. The layout of an executable has a substantial effect on the cache miss rate and the instruction working set size during execution. This means that the performance of an executable can be improved significantly by applying a code-placement algorithm that minimizes instruction cache conflicts and improves spatial locality. We describe an algorithm for procedure placement, one type of code-placement algorithm, that significantly differs from previous approaches in the type of information used to drive the placement algorithm. In particular, we gather temporal ordering information that summarizes the interleaving of procedures in a program trace. Our algorithm uses this information along with cache configuration and procedure size information to better estimate the conflict cost of a potential procedure ordering. It optimizes the procedure placement for single- and multi-level caches. In addition to reducing instruction cache conflicts, the algorithm simultaneously minimizes the instruction working set size of the program. We compare the performance of our algorithm with a particularly successful procedure-placement algorithm and show noticeable improvements in the instruction cache behavior, while maintaining the same instruction working set size.
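To make the notion of a conflict cost concrete, here is a minimal sketch, not the algorithm from the paper. It assumes a direct-mapped instruction cache and a hypothetical table of interleaving counts (how often two procedures alternate in a trace), and it scores a layout by how many cache lines each interleaved pair shares. The names `cache_lines`, `overlap`, `conflict_cost`, and all parameter values are illustrative assumptions.

```python
def cache_lines(start, size, line_size=32, num_lines=256):
    """Cache line indices occupied by a procedure in a direct-mapped cache.

    start and size are in bytes; line_size and num_lines are assumed values.
    """
    first_block = start // line_size
    last_block = (start + size - 1) // line_size
    return {block % num_lines for block in range(first_block, last_block + 1)}


def overlap(layout, a, b, line_size=32, num_lines=256):
    """Number of cache lines that procedures a and b both map to."""
    lines_a = cache_lines(*layout[a], line_size, num_lines)
    lines_b = cache_lines(*layout[b], line_size, num_lines)
    return len(lines_a & lines_b)


def conflict_cost(layout, interleavings, line_size=32, num_lines=256):
    """Estimated cost: interleaving count times shared lines, summed over pairs.

    layout maps procedure name -> (start address, size in bytes).
    interleavings maps (name, name) -> hypothetical interleaving count.
    """
    return sum(
        count * overlap(layout, a, b, line_size, num_lines)
        for (a, b), count in interleavings.items()
    )


# Hypothetical temporal information: A alternates with B often, with C less so.
interleavings = {("A", "B"): 10, ("A", "C"): 5}

# A and B are exactly one cache size (256 * 32 = 8192 bytes) apart, so they collide.
bad_layout = {"A": (0, 64), "B": (8192, 64), "C": (64, 64)}
# Moving B next to A and C removes the overlap.
good_layout = {"A": (0, 64), "B": (128, 64), "C": (64, 64)}

print(conflict_cost(bad_layout, interleavings))   # 20
print(conflict_cost(good_layout, interleavings))  # 0
```

A placement algorithm in this spirit would search over candidate orderings and prefer those with lower estimated cost; the search strategy itself is outside the scope of this sketch.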
Hardware And Software Mechanisms For Reducing Load Latency
, 1996
Cited by 2 (1 self)
As processor demands quickly outpace memory, the performance of load instructions becomes an increasingly critical component to good system performance. This thesis contributes four novel load latency reduction techniques, each targeting a different component of load latency: address calculation, data cache access, address translation, and data cache misses. The contributed techniques are as follows:
• Fast Address Calculation employs a stateless set index predictor to allow address calculation to overlap with data cache access. The design eliminates the latency of address calculation for many loads.
• Zero-Cycle Loads combine fast address calculation with an early-issue mechanism to produce pipeline designs capable of hiding the latency of many loads that hit in the data cache.
• High-Bandwidth Address Translation develops address translation mechanisms with better latency and area characteristics than a multi-ported TLB. The new designs provide multiple-issue processors with ...
Tailoring programs to models of program behavior
- IBM Journal of Research and Development
, 1975
Cited by 2 (0 self)
This paper considers the premise that, in addition to trying to solve the virtual-memory-system performance problem by devising a storage management strategy suitable for the broad spectrum of behavior exhibited by programs, efforts also be made to tailor the behavior of each program to the model underlying the storage management strategy under which the program will have to run. It is observed that a viable approach to program tailoring is offered by restructuring techniques. The application of dynamic off-line techniques to the tailoring problem is discussed, and an algorithm which may be used to fit program behavior to the working set model is described in detail as an example. The performance of this algorithm in dealing with two real-program traces is experimentally evaluated under a variety of conditions and found to be always satisfactory.
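For readers unfamiliar with the working set model, the following is a minimal sketch of how the working set size can be measured over a page reference trace. It is not the restructuring algorithm described in the paper. It assumes the usual definition in which the working set at time t with window tau is the set of distinct pages among the last tau references; the function name `working_set_sizes` and the sample trace are illustrative assumptions.

```python
from collections import Counter, deque


def working_set_sizes(trace, tau):
    """Return the working set size after each reference in the trace.

    The working set is taken to be the distinct pages among the most
    recent tau references (fewer at the start of the trace).
    """
    window = deque()    # the last tau page references, oldest first
    counts = Counter()  # occurrences of each page inside the window
    sizes = []
    for page in trace:
        window.append(page)
        counts[page] += 1
        if len(window) > tau:
            oldest = window.popleft()
            counts[oldest] -= 1
            if counts[oldest] == 0:
                del counts[oldest]  # page has left the working set
        sizes.append(len(counts))
    return sizes


# Hypothetical trace of page numbers, with a window of three references.
print(working_set_sizes([1, 2, 1, 3, 4, 4, 4], tau=3))
# [1, 2, 2, 3, 3, 2, 1]
```

A restructured program that groups related code or data onto the same pages would be expected to show smaller values in this measurement for the same window.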

