Results 11 - 20
of
258
Memory safety without garbage collection for embedded applications
- ACM Transactions on Embedded Computing Systems
, 2005
"... Traditional approaches to enforcing memory safety of programs rely heavily on run-time checks of memory accesses and on garbage collection, both of which are unattractive for embedded applications. The goal of our work is to develop advanced compiler techniques for enforcing memory safety with minim ..."
Abstract
-
Cited by 17 (5 self)
- Add to MetaCart
Traditional approaches to enforcing memory safety of programs rely heavily on run-time checks of memory accesses and on garbage collection, both of which are unattractive for embedded applications. The goal of our work is to develop advanced compiler techniques for enforcing memory safety with minimal runtime overheads. In this paper, we describe a set of compiler techniques that, together with minor semantic restrictions on C programs and no new syntax, ensure memory safety and provide most of the error-detection capabilities of type-safe languages, without using garbage collection, and with no run-time software checks (on systems with standard hardware support for memory management). The language permits arbitrary pointer-based data structures, explicit deallocation of dynamically allocated memory, and restricted array operations. One of the key results of this paper is a compiler technique that ensures that dereferencing dangling pointers to freed memory does not violate memory safety, without annotations, run-time checks or garbage collection, and works for arbitrary type-safe C programs. Furthermore, we present a new interprocedural analysis for static array bounds checking under certain assumptions. For a diverse set of embedded C programs we show that we are able to ensure memory safety of pointer and dynamic memory usage in all these programs with no run-time software checks (on systems with standard hardware memory protection), requiring only minor restructuring to conform to simple type restrictions. Static array bounds checking fails for roughly half the programs we study due to complex array references, and these are the only cases where explicit runtime software checks would be needed under our language and system assumptions.
Copy Or Discard Execution Model For Speculative Parallelization On Multicores
"... The advent of multicores presents a promising opportunity for speeding up sequential programs via profile-based speculative parallelization of these programs. In this paper we present a novel solution for efficiently supporting software speculation on multicore processors. We propose the Copy or Dis ..."
Abstract
-
Cited by 17 (3 self)
- Add to MetaCart
The advent of multicores presents a promising opportunity for speeding up sequential programs via profile-based speculative parallelization of these programs. In this paper we present a novel solution for efficiently supporting software speculation on multicore processors. We propose the Copy or Discard (CorD) execution model in which the state of speculative parallel threads is maintained separately from the nonspeculative computation state. If speculation is successful, the results of the speculative computation are committed by copying them into the non-speculative state. If misspeculation is detected, no costly state recovery mechanisms are needed as the speculative state can be simply discarded. Optimizations are proposed to reduce the cost of data copying between nonspeculative and speculative state. A lightweight mechanism that maintains version numbers for non-speculative data values enables misspeculation detection. We also present an algorithm for profile-based speculative parallelization that is effective in extracting parallelism from sequential programs. Our experiments show that the combination of CorD and our speculative parallelization algorithm achieves speedups ranging from 3.7 to 7.8 on a Dell PowerEdge 1900 server with two Intel Xeon quad-core processors.
Dataflow mini-graphs: Amplifying superscalar capacity and bandwidth
- In Proc. of the 37th Annual International Symposium on Microarchitecture
, 2004
"... A mini-graph is a dataflow graph that has an arbitrary internal size and shape but the interface of a singleton instruction: two register inputs, one register output, a maximum of one memory operation, and a maximum of one (terminal) control transfer. Previous work has exploited dataflow sub-graphs ..."
Abstract
-
Cited by 16 (3 self)
- Add to MetaCart
A mini-graph is a dataflow graph that has an arbitrary internal size and shape but the interface of a singleton instruction: two register inputs, one register output, a maximum of one memory operation, and a maximum of one (terminal) control transfer. Previous work has exploited dataflow sub-graphs whose execution latency can be reduced via programmable FPGA-style hardware. In this paper we show that mini-graphs can improve performance by amplifying the bandwidths of a superscalar processor’s stages and the capacities of many of its structures without custom latency-reduction hardware. Amplification is achieved because the processor deals with a complete mini-graph via a single quasi-instruction, the handle. By constraining mini-graph structure and forcing handles to behave as much like singleton instructions as possible, the number and scope of the modifications over a conventional superscalar microarchitecture is kept to a minimum. This paper describes mini-graphs, a simple algorithm for extracting them from basic block frequency profiles, and a microarchitecture for exploiting them. Cycle-level simulation of several benchmark suites shows that mini-graphs can provide average performance gains of 2–12 % over an aggressive baseline, with peak gains exceeding 40%. Alternatively, they can compensate for substantial reductions in register file and scheduler size, and in pipeline bandwidth. 1.
Exhaustive optimization phase order space exploration
- In The International Symposium on Code Generation and Optimization
, 2006
"... The phase-ordering problem is a long standing issue for compiler writers. Most optimizing compilers typically have numerous different code-improving phases, many of which can be applied in any order. These phases interact by enabling or disabling opportunities for other optimization phases to be app ..."
Abstract
-
Cited by 15 (4 self)
- Add to MetaCart
The phase-ordering problem is a long standing issue for compiler writers. Most optimizing compilers typically have numerous different code-improving phases, many of which can be applied in any order. These phases interact by enabling or disabling opportunities for other optimization phases to be applied. As a result, varying the order of applying optimization phases to a program can produce different code, with potentially significant performance variation amongst them. Complicating this problem further is the fact that there is no universal optimization phase order that will produce the best code, since the best phase order depends on the function being compiled, the compiler, and the target architecture characteristics. Moreover, finding the optimal optimization sequence for even a single function is hard
Qin Analysing memory resource bounds for low-level programs
- In ISMM 08
, 2008
"... Embedded systems are becoming more widely used but these systems are often resource constrained. Programming models for these systems should take into formal consideration resources such as stack and heap. In this paper, we show how memory resource bounds can be inferred for assembly-level programs. ..."
Abstract
-
Cited by 15 (0 self)
- Add to MetaCart
Embedded systems are becoming more widely used but these systems are often resource constrained. Programming models for these systems should take into formal consideration resources such as stack and heap. In this paper, we show how memory resource bounds can be inferred for assembly-level programs. Our inference process captures the memory needs of each method in terms of the symbolic values of its parameters. For better precision, we infer path-sensitive information through a novel guarded expression format. Our current proposal relies on a Presburger solver to capture memory requirements symbolically, and to perform fixpoint analysis for loops and recursion. Apart from safety in memory adequacy, our proposal can provide estimate on memory costs for embedded devices and improve performance via fewer runtime checks against memory bound. 1.
Improving program efficiency by packing instructions into registers
- In Proceedings of the 2005 ACM/IEEE International Symposium on Computer Architecture
, 2005
"... New processors, both embedded and general purpose, often have conflicting design requirements involving space, power, and performance. Architectural features and compiler optimizations often target one or more design goals at the expense of the others. This paper presents a novel architectural and c ..."
Abstract
-
Cited by 14 (5 self)
- Add to MetaCart
New processors, both embedded and general purpose, often have conflicting design requirements involving space, power, and performance. Architectural features and compiler optimizations often target one or more design goals at the expense of the others. This paper presents a novel architectural and compiler approach to simultaneously reduce power requirements, decrease code size, and improve performance by integrating an instruction register file (IRF) into the architecture. Frequently occurring instructions are placed in the IRF. Multiple entries in the IRF can be referenced by a single packed instruction in ROM or L1 instruction cache. Unlike conventional code compression, our approach allows the frequent instructions to be referenced in arbitrary combinations. The experimental results show significant improvements in space and power, as well as some improvement in execution time when using only 32 entries. These advantages make packing instructions into registers an effective approach for improving overall efficiency. 1.
Using DISE to Protect Return Addresses from Attack
- In Proceedings of the Workshop on Architectural Support for Security and Anti-Virus (WASSA
, 2004
"... Stack-smashing by buffer overflow is a common tactic used by viruses and worms to crash or hijack systems. Exploiting a bounds-unchecked copy into a stack buffer, an attacker can---by supplying a specially-crafted and unexpectedly long input--- overwrite a stored return address and trigger the exec ..."
Abstract
-
Cited by 14 (2 self)
- Add to MetaCart
Stack-smashing by buffer overflow is a common tactic used by viruses and worms to crash or hijack systems. Exploiting a bounds-unchecked copy into a stack buffer, an attacker can---by supplying a specially-crafted and unexpectedly long input--- overwrite a stored return address and trigger the execution of code of her choosing. In this paper, we propose to protect code from this common form of attack using dynamic instruction stream editing (DISE), a previously proposed hardware mechanism that implements binary rewriting in a transparent, efficient, and convenient way by rewriting the dynamic instruction stream rather than the static executable. Simply, we define productions (rewriting rules) that instrument program calls and returns to maintain and verify a "shadow" stack of return addresses in a protected region of memory. When invalid return addresses are detected, the application is terminated.
NpBench: A benchmark suite for control plane and data plane applications for network processors,” inProc.ofICCD’03
, 2003
"... Modern network interfaces demand highly intelligent traffic management in addition to the basic requirement of wire speed packet forwarding. Several vendors are releasing network processors in order to handle these demands. Network workloads can be classified into data plane and control plane worklo ..."
Abstract
-
Cited by 14 (0 self)
- Add to MetaCart
Modern network interfaces demand highly intelligent traffic management in addition to the basic requirement of wire speed packet forwarding. Several vendors are releasing network processors in order to handle these demands. Network workloads can be classified into data plane and control plane workloads, however most network processors are optimized for data plane. Also, existing benchmark suites for network processors primarily contain data plane workloads, which perform packet processing for a forwarding function. In this paper, we present a set of benchmarks, called NpBench, targeted towards control plane (e.g., traffic management, quality of service, etc.) as well as data plane workloads. The characteristics of NpBench workloads, such as instruction mix, parallelism, cache behavior and required processing capability per packet, are presented and compared with CommBench, an existing network processor benchmark suite [9]. We also discuss the architectural characteristics of the benchmarks having control plane functions, their implications to designing network processors and the significance of Instruction Level Parallelism (ILP) in network processors. 1.
Measuring Benchmark Similarity Using Inherent Program Characteristics,” Laboratory of Computer Architecture
, 2006
"... Abstract—This paper proposes a methodology for measuring the similarity between programs based on their inherent microarchitecture-independent characteristics, and demonstrates two applications for it: 1) finding a representative subset of programs from benchmark suites and 2) studying the evolution ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
Abstract—This paper proposes a methodology for measuring the similarity between programs based on their inherent microarchitecture-independent characteristics, and demonstrates two applications for it: 1) finding a representative subset of programs from benchmark suites and 2) studying the evolution of four generations of SPEC CPU benchmark suites. Using the proposed methodology, we find a representative subset of programs from three popular benchmark suites—SPEC CPU2000, MediaBench, and MiBench. We show that this subset of representative programs can be effectively used to estimate the average benchmark suite IPC, L1 data cache miss-rates, and speedup on 11 machines with different ISAs and microarchitectures—this enables one to save simulation time with little loss in accuracy. From our study of the similarity between the four generations of SPEC CPU benchmark suites, we find that, other than a dramatic increase in the dynamic instruction count and increasingly poor temporal data locality, the inherent program characteristics have more or less remained unchanged. Index Terms—Measurement techniques, modeling techniques, performance of systems, performance attributes. æ 1
Operation Tables for Scheduling in the Presence of Incomplete Bypassing
- in CODES+ISSS ’04: Proceedings of the 2nd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
, 2004
"... Register byp ssing is ap owerful and widely used feature in modernp rocessors to eliminate certain data hazards. Although comp lete byp assing is ideal forp erformance, byp assing has significantimp act on cycle time, area, andp ower consump4 on of the pq cessor. Due to the strict con ..."
Abstract
-
Cited by 12 (9 self)
- Add to MetaCart
Register byp ssing is ap owerful and widely used feature in modernp rocessors to eliminate certain data hazards. Although comp lete byp assing is ideal forp erformance, byp assing has significantimp act on cycle time, area, andp ower consump4 on of the pq cessor. Due to the strict constraints onp erformance, cost andp ower consump3 on in embedded p rocessors, architects need to evaluate and imp lement incompL [: register by p ssing mechanisms. However traditional data hazard detection and/or avoidance techniques used in retargetable schedulers break down in thep resence of incomp ete by p ssing. In thisp ap er, wep resent the concep of Op eration Tables, which can be used to detect data hazards, even in the pq sence of incomp ete by p ssing. Furthermore our technique integrates the detection of both data, as well as resource hazards, and can be easilyemp loyed in a comp iler to generate better schedules. Our exp erimental results on thep op4 ar Intel XScale embeddedp rocessorp latform show that even with a simp e intra-basic block scheduling technique, we achieveupP 20%p erformance impL vement over fully op timized GCC generated code on embedded ap p lications from the MiBench suite.

