Results 1 - 10
of
258
An Optimal Memory Allocation Scheme for Scratch-Pad Based Embedded Systems
, 2002
"... This paper presents a technique for the efficient compiler management of software-exposed heterogeneous memory. In many lower-end embedded chips, often used in micro-controllers and DSP processors, heterogeneous memory units such as scratch-pad SRAM, internal DRAM, external DRAM and ROM are visible ..."
Abstract
-
Cited by 66 (5 self)
- Add to MetaCart
This paper presents a technique for the efficient compiler management of software-exposed heterogeneous memory. In many lower-end embedded chips, often used in micro-controllers and DSP processors, heterogeneous memory units such as scratch-pad SRAM, internal DRAM, external DRAM and ROM are visible directly to the software, without automatic management by a hardware caching mechanism. Instead the memory units are mapped to different portions of the address space. Caches are avoided because of their cost and power consumption, and because they make it difficult to guarantee real-time performance. For this important class of embedded chips, the allocation of data to different memory units to maximize performance is the responsibility of the software
DELI: A New Run-Time Control Point
- In 35th International Symposium on Microarchitecture
, 2002
"... The Dynamic Execution Layer Interface (DELI) offers the following unique capability: it provides fine-grain control over the execution of programs, by allowing its clients to observe and optionally manipulate every single instruction ---at run time---just before it runs. DELI accomplishes this by op ..."
Abstract
-
Cited by 53 (1 self)
- Add to MetaCart
The Dynamic Execution Layer Interface (DELI) offers the following unique capability: it provides fine-grain control over the execution of programs, by allowing its clients to observe and optionally manipulate every single instruction ---at run time---just before it runs. DELI accomplishes this by opening up an interface to the layer between the execution of software and hardware. To avoid the slowdown, DELI caches a private copy of the executed code and always runs out of its own private cache.
Finding effective optimization phase sequences
- In Proceedings of the 2003 ACM SIGPLAN conference on Language, Compiler, and Tool for Embedded Systems
, 2003
"... It has long been known that a single ordering of optimization phases will not produce the best code for every application. This phase ordering problem can be more severe when generating code for embedded systems due to the need to meet conflicting constraints on time, code size, and power consumptio ..."
Abstract
-
Cited by 45 (11 self)
- Add to MetaCart
It has long been known that a single ordering of optimization phases will not produce the best code for every application. This phase ordering problem can be more severe when generating code for embedded systems due to the need to meet conflicting constraints on time, code size, and power consumption. Given that many embedded application developers are willing to spend time tuning an application, we believe a viable approach is to allow the developer to steer the process of optimizing a function. In this paper, we describe support in VISTA, an interactive compilation system, for finding effective sequences of optimization phases. VISTA provides the user with dynamic and static performance information that can be used during an interactive compilation session to gauge the progress of improving the code. In addition, VISTA provides support for automatically using performance information to select the best optimization sequence among several attempted. One such feature is the use of a genetic algorithm to search for the most efficient sequence based on specified fitness criteria. We hav e included a number of experimental results that evaluate the effectiveness of using a genetic algorithm in VISTA to find effective optimization phase sequences.
Methods for Evaluating and Covering the Design Space during Early Design Development
- Integration, the VLSI Journal
, 2003
"... This paper gives an overview of methods used for Design Space Exploration (DSE) at the system- and micro-architecture levels. The DSE problem is considered to be two orthogonal issues: (I) How could a single design point be evaluated, (II) how could the design space be covered during the explorat ..."
Abstract
-
Cited by 43 (0 self)
- Add to MetaCart
This paper gives an overview of methods used for Design Space Exploration (DSE) at the system- and micro-architecture levels. The DSE problem is considered to be two orthogonal issues: (I) How could a single design point be evaluated, (II) how could the design space be covered during the exploration process? The latter question arises since an exhaustive exploration of the design space by evaluating every possible design point is usually prohibitive due to the sheer size of the design space. We therefore reveal trade-o#s linked to the choice of appropriate evaluation and coverage methods. The designer has to balance the following issues: the accuracy of the evaluation, the time it takes to evaluate one design point (including the implementation of the evaluation model), the precision/granularity of the design space coverage, and last but not least the possibilities for automating the exploration process. We also list common representations of the design space and compare current system and micro-architecture level design frameworks. This review thus eases the choice of a decent exploration policy by providing a comprehensive survey and classification of recent related work. It is focused on System-on-a-Chip designs, particularly those used for network processors. These systems are heterogeneous in nature using multiple computation, communication, memory, and peripheral resources.
Memory Safety Without Runtime Checks or Garbage Collection
- In ACM SIGPLAN 2003 Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES’2003
, 2003
"... Traditional approaches to enforcing memory safety of programs rely heavily on runtime checks of memory accesses and on garbage collection, both of which are unattractive for embedded applications. The long-term goal of our work is to enable 100% static enforcement of memory safety for embedded progr ..."
Abstract
-
Cited by 39 (7 self)
- Add to MetaCart
Traditional approaches to enforcing memory safety of programs rely heavily on runtime checks of memory accesses and on garbage collection, both of which are unattractive for embedded applications. The long-term goal of our work is to enable 100% static enforcement of memory safety for embedded programs through advanced compiler techniques and minimal semantic restrictions on programs. The key result of this paper is a compiler technique that ensures memory safety of dynamically allocated memory without programmer annotations, runtime checks, or garbage collection, and works for a large subclass of type-safe C programs. The technique is based on a fully automatic pool allocation (i.e., region-inference) algorithm for C programs we developed previously, and it ensures safety of dynamically allocated memory while retaining explicit deallocation of individual objects within regions (to avoid garbage collection). For a diverse set of embedded C programs (and using a previous technique to avoid null pointer checks), we show that we are able to statically ensure the safety of pointer and dynamic memory usage in all these programs. We also describe some improvements over our previous work in static checking of array accesses. Overall, we achieve 100% static enforcement of memory safety without new language syntax for a significant subclass of embedded C programs, and the subclass is much broader if array bounds checks are ignored.
Scalable Vector Media-processors for Embedded Systems
, 2002
"... Over the past twenty years, processor designers have concentrated on superscalar and VLIW architectures that exploit the instruction-level parallelism (ILP) available in engineering applications for workstation systems. Recently, however, the focus in computing has shifted from engineering to multim ..."
Abstract
-
Cited by 26 (3 self)
- Add to MetaCart
Over the past twenty years, processor designers have concentrated on superscalar and VLIW architectures that exploit the instruction-level parallelism (ILP) available in engineering applications for workstation systems. Recently, however, the focus in computing has shifted from engineering to multimedia applications and from workstations to embedded systems. In this new computing environment, the performance, energy consumption, and development cost of ILP processors renders them ineffective despite their theoretical generality. This thesis
From Sequences of Dependent Instructions to Functions: An Approach for Improving Performance without ILP or Speculation
- ISCA'04
, 2004
"... In this article, we present an approach for improving the performance of sequences of dependent instructions. We observe that many sequences of instructions can be interpreted as functions. Unlike sequences of instructions, functions can be translated into very fast but exponentially costly two-leve ..."
Abstract
-
Cited by 23 (2 self)
- Add to MetaCart
In this article, we present an approach for improving the performance of sequences of dependent instructions. We observe that many sequences of instructions can be interpreted as functions. Unlike sequences of instructions, functions can be translated into very fast but exponentially costly two-level combinational circuits. We present an approach that exploits this principle, speeds up programs thanks to circuit-level parallelism/redundancy, but avoids the exponential costs. We analyze the potential of this approach, and then we propose an implementation that consists of a superscalar processor with a large specific functional unit associated with specific back-end transformations. The performance of the SpecInt2000 benchmarks and selected programs from the Olden and MiBench benchmark suites improves on average from 2.4 % to 12 % depending on the latency of the functional units, and up to 39.6%; more precisely, the performance of optimized code sections improves on average from 3.5 % to 19%, and up to 49%.
The ALPBench Benchmark Suite for Complex Multimedia Applications
- In Proc. of the IEEE Int. Symp. on Workload Characterization
, 2005
"... Multimedia applications are becoming increasingly important for a large class of general-purpose processors. Contemporary media applications are highly complex and demand high performance. A distinctive feature of these applications is that they have significant parallelism, including thread-, data- ..."
Abstract
-
Cited by 23 (0 self)
- Add to MetaCart
Multimedia applications are becoming increasingly important for a large class of general-purpose processors. Contemporary media applications are highly complex and demand high performance. A distinctive feature of these applications is that they have significant parallelism, including thread-, data-, and instruction-level parallelism, that is potentially well-aligned with the increasing parallelism supported by emerging multi-core architectures. Designing systems to meet the demands of these applications therefore requires a benchmark suite comprising these complex applications and that exposes the parallelism present in them. This paper makes two contributions. First, it presents ALPBench, a publicly available benchmark suite that pulls together five complex media applications from various sources: speech recognition (CMU Sphinx 3), face recognition (CSU), ray tracing (Tachyon), MPEG-2 encode (MSSG), and MPEG-2 decode (MSSG). We have modified the original applications to expose thread-level and datalevel parallelism using POSIX threads and sub-word SIMD (Intel’s SSE2) instructions respectively. Second, the paper provides a performance characterization of the ALPBench benchmarks, with a focus on parallelism. Such a characterization is useful for architects and compiler writers for designing systems and compiler optimizations for these applications. 1.
Rapidly selecting good compiler optimizations using performance counters
- In Proceedings of the 5th Annual International Symposium on Code Generation and Optimization (CGO
, 2007
"... Applying the right compiler optimizations to a particular program can have a significant impact on program performance. Due to the non-linear interaction of compiler optimizations, however, determining the best setting is nontrivial. There have been several proposed techniques that search the space ..."
Abstract
-
Cited by 22 (10 self)
- Add to MetaCart
Applying the right compiler optimizations to a particular program can have a significant impact on program performance. Due to the non-linear interaction of compiler optimizations, however, determining the best setting is nontrivial. There have been several proposed techniques that search the space of compiler options to find good solutions; however such approaches can be expensive. This paper proposes a different approach using performance counters as a means of determining good compiler optimization settings. This is achieved by learning a model off-line which can then be used to determine good settings for any new program. We show that such an approach outperforms the state-ofthe-art and is two orders of magnitude faster on average. Furthermore, we show that our performance counter based approach outperforms techniques based on static code features. Finally, we show that such improvements are stable across varying input data sets. Using our technique we achieve a 10 % improvement over the highest optimization setting of the commercial PathScale EKOPath 2.3.1 optimizing compiler on the SPEC benchmark suite on a recent AMD Athlon 64 3700+ platform in just three evaluations. 1
Ultra low-cost defect protection for microprocessor pipelines
- In Proceedings of the 12th International conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS
, 2006
"... The sustained push toward smaller and smaller technology sizes has reached a point where device reliability has moved to the forefront of concerns for next-generation designs. Silicon failure mechanisms, such as transistor wearout and manufacturing defects, are a growing challenge that threatens the ..."
Abstract
-
Cited by 18 (5 self)
- Add to MetaCart
The sustained push toward smaller and smaller technology sizes has reached a point where device reliability has moved to the forefront of concerns for next-generation designs. Silicon failure mechanisms, such as transistor wearout and manufacturing defects, are a growing challenge that threatens the yield and product lifetime of future systems. In this paper we introduce the BulletProof pipeline, the first ultra low-cost mechanism to protect a microprocessor pipeline and on-chip memory system from silicon defects. To achieve this goal we combine area-frugal on-line testing techniques and system-level checkpointing to provide the same guarantees of reliability found in traditional solutions, but at much lower cost. Our approach utilizes a microarchitectural checkpointing mechanism which creates coarse-grained epochs of execution, during which distributed on-line built in self-test (BIST) mechanisms validate the integrity of the underlying hardware. In case a failure is detected, we rely on the natural redundancy of instructionlevel parallel processors to repair the system so that it can still operate in a degraded performance mode. Using detailed circuit-level and architectural simulation, we find that our approach provides very high coverage of silicon defects (89%) with little area cost (5.8%). In addition, when a defect occurs, the subsequent degraded mode of operation was found to have only moderate performance impacts, (from 4 % to 18 % slowdown).

