Results 1 - 10 of 90
Heap Data Allocation To Scratch-Pad Memory In Embedded Systems
, 2007
"... This thesis presents the first-ever compile-time method for allocating a portion of a program’s dynamic data to scratch-pad memory. A scratch-pad is a fast directly addressed compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees vs ..."
Abstract
-
Cited by 43 (5 self)
- Add to MetaCart
This thesis presents the first-ever compile-time method for allocating a portion of a program’s dynamic data to scratch-pad memory. A scratch-pad is a fast, directly addressed, compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees vs. cache and by its significantly lower overheads in access time, energy consumption, area and overall runtime. Dynamic data refers to all objects allocated at run-time in a program, as opposed to static data objects, which are allocated at compile-time. Existing compiler methods for allocating data to scratch-pad are able to place only code, global and stack data (static data) in scratch-pad memory; heap and recursive-function objects (dynamic data) are allocated entirely in DRAM, resulting in poor performance for these dynamic data types. Runtime methods based on software caching can place data in scratch-pad, but because of their high overheads from software address translation, they have not been successful, especially for dynamic data. In this thesis we present a dynamic yet compiler-directed allocation method for dynamic data that, for the first time, (i) is able to place a portion of the dynamic ...
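The placement decision the abstract describes can be illustrated with a toy greedy policy: rank heap allocation sites by profiled accesses per byte and fill the scratch-pad capacity in that order. This is only a sketch of the general idea, not the thesis's actual algorithm; the site names and numbers are hypothetical.

```python
# Hedged sketch: a greedy compile-time policy that places heap allocation
# sites into a fixed-size scratch-pad by profiled accesses per byte.
# Site names and profile numbers below are hypothetical.

def place_sites(sites, capacity):
    """sites: list of (name, size_bytes, profiled_accesses).
    Returns (scratchpad_sites, dram_sites)."""
    # Hottest data per byte of scratch-pad first.
    ranked = sorted(sites, key=lambda s: s[2] / s[1], reverse=True)
    scratch, dram, used = [], [], 0
    for name, size, accesses in ranked:
        if used + size <= capacity:
            scratch.append(name)
            used += size
        else:
            dram.append(name)          # everything that does not fit
    return scratch, dram

sites = [("huff_table", 512, 90_000),    # hot and small
         ("image_buf", 64_000, 120_000), # large, cool per byte
         ("node_pool", 2_048, 50_000)]
print(place_sites(sites, 4_096))
```

A real compile-time method must also reason about liveness and program regions, which this per-site ranking ignores.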
Multiprocessor mapping of process networks: a JPEG decoding case study
- In Proceedings of the 15th Intl. Symp. on System Synthesis
, 2002
"... We present a system-level design and programming method for embedded multiprocessor systems. The aim of the method is to improve the design time and design quality by providing a structured approach for implementing process networks. We use process networks as re-usable and architecture-independent ..."
Abstract
-
Cited by 39 (1 self)
- Add to MetaCart
(Show Context)
We present a system-level design and programming method for embedded multiprocessor systems. The aim of the method is to improve the design time and design quality by providing a structured approach for implementing process networks. We use process networks as re-usable and architecture-independent functional specifications. The method facilitates the cost-driven and constraint-driven source code transformation of process networks into architecture-specific implementations in the form of communicating tasks. We apply the method to implement a JPEG decoding process network in software on a set of MIPS processors. We apply three transformations to optimize synchronization rates and data transfers and to exploit data parallelism for this target architecture. We evaluate the impact of the source code transformations and the performance of the resulting implementations in terms of design time, execution time, and code size. The results show that process networks can be implemented quickly and efficiently on embedded multiprocessor systems.
Data remapping for design space optimization of embedded memory systems
- ACM Transactions on Embedded Computing Systems
, 2003
"... In this article, we present a novel linear time algorithm for data remapping, that is, (i) lightweight; (ii) fully automated; and (iii) applicable in the context of pointer-centric programming languages with dynamic memory allocation support. All previous work in this area lacks one or more of these ..."
Abstract
-
Cited by 30 (8 self)
- Add to MetaCart
In this article, we present a novel linear time algorithm for data remapping, that is, (i) lightweight; (ii) fully automated; and (iii) applicable in the context of pointer-centric programming languages with dynamic memory allocation support. All previous work in this area lacks one or more of these features. We proceed to demonstrate a novel application of this algorithm as a key step in optimizing the design of an embedded memory system. Specifically, we show that by virtue of locality enhancements via data remapping, we may reduce the memory subsystem needs of an application by 50%, and hence concomitantly reduce the associated costs in terms of size, power, and dollar-investment (61%). Such a reduction overcomes key hurdles in designing high-performance embedded computing solutions. Namely, memory subsystems are very desirable from a performance standpoint, but their costs have often limited their use in embedded systems. Thus, our innovative approach offers the intriguing possibility of compilers playing a significant role in exploring and optimizing the design space of a memory subsystem for an embedded design. To this end and in order to properly leverage the improvements afforded by a compiler optimization, we identify a range of measures for quantifying the cost-impact of popular notions of locality, prefetching, regularity of memory access and others. The proposed methodology will ...
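The locality enhancement behind data remapping can be pictured with a much simpler transformation: fields that are accessed together get laid out contiguously (an array-of-structs split into a struct-of-arrays). The field names are hypothetical, and the paper's actual algorithm is a compiler pass over pointer-based code, not this runtime toy.

```python
# Hedged sketch of the layout idea: co-accessed ("hot") fields are packed
# into their own contiguous arrays; remaining fields stay together as
# "cold" records. Field names are hypothetical.

def remap_aos_to_soa(records, hot_fields):
    """Split a list of dict records into one contiguous list per hot
    field, plus a list of leftover-field tuples."""
    hot = {f: [r[f] for r in records] for f in hot_fields}
    cold_keys = [k for k in records[0] if k not in hot_fields]
    cold = [tuple(r[k] for k in cold_keys) for r in records]
    return hot, cold

recs = [{"x": 1, "y": 2, "debug": "a"},
        {"x": 3, "y": 4, "debug": "b"}]
hot, cold = remap_aos_to_soa(recs, ["x", "y"])
print(hot["x"], cold)
```

A hot loop touching only `x` and `y` now streams through dense arrays instead of striding past cold `debug` data, which is the cache-line economy the remapping exploits.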
Data dependency size estimation for use in memory optimization
- IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
, 2003
"... Abstract — A novel storage requirement estimation methodology is presented for use in the early system design phases when the data transfer ordering is only partly fixed. At that stage, none of the existing estimation tools are adequate, as they either assume a fully specified execution order or ign ..."
Abstract
-
Cited by 21 (6 self)
- Add to MetaCart
(Show Context)
A novel storage requirement estimation methodology is presented for use in the early system design phases when the data transfer ordering is only partly fixed. At that stage, none of the existing estimation tools are adequate, as they either assume a fully specified execution order or ignore it completely. This paper presents an algorithm for automated estimation of strict upper and lower bounds on the individual data dependency sizes in high-level application code given a partially fixed execution ordering. In the overall estimation technique, this is followed by a detection of the maximally combined size of simultaneously alive dependencies, resulting in the overall storage requirement of the application. Using representative application demonstrators, we show how our techniques can effectively guide the designer to achieve a transformed specification with low storage requirement.
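The final step described above, finding the maximally combined size of simultaneously alive dependencies, reduces in its simplest form to a weighted interval-overlap maximum. The sketch below assumes fully fixed lifetimes as time intervals, which is a simplification of the paper's partially fixed execution orderings.

```python
# Hedged sketch: peak combined size of simultaneously alive dependencies,
# computed with a sweep over lifetime start/end events. Lifetimes are
# given as abstract (start, end) intervals with end exclusive, which is
# a simplification of the paper's partially fixed orderings.

def peak_storage(deps):
    """deps: list of (start, end, size). Returns the peak total size."""
    events = []
    for start, end, size in deps:
        events.append((start, size))   # dependency becomes alive
        events.append((end, -size))    # dependency dies
    events.sort()                      # ends sort before same-time starts
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak

# three hypothetical dependencies with sizes 100, 50, and 80 words
print(peak_storage([(0, 4, 100), (2, 6, 50), (5, 9, 80)]))
```

With a partial order, the paper instead bounds each dependency's size from above and below before this combination step; the sweep itself is unchanged.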
Synthesis of Pipelined Memory Access Controllers for Streamed Data Applications on FPGA-based Computing Engines
- In ISSS
, 2001
"... Commercially available behavioral synthesis tools do not adequately support FPGA vendor-specific external memory interfaces making it extremely difficult to exploit pipelined memory access modes as well as application specific memory operations scheduling critical for high-performance solutions. Thi ..."
Abstract
-
Cited by 20 (1 self)
- Add to MetaCart
(Show Context)
Commercially available behavioral synthesis tools do not adequately support FPGA vendor-specific external memory interfaces, making it extremely difficult to exploit pipelined memory access modes as well as application-specific memory-operation scheduling, both critical for high-performance solutions. This lack of support substantially increases the complexity and the burden on designers in the mapping of applications to FPGA-based computing engines. In this paper we address the problem of external memory interfacing and aggressive scheduling of memory operations by proposing a decoupled architecture with two components: one component captures the specific target architecture timing while the other uses application-specific memory access pattern information. Our results support the claim that it is possible to exploit application-specific information and integrate that knowledge into custom schedulers that mix pipelined and non-pipelined access modes aimed at reducing the overhead associated with external memory accesses. The results also reveal that the additional design complexity of the scheduler, and its impact on the overall design, is minimal.
Influence of on-chip scratchpad memories on WCET prediction
, 2004
"... In contrast to standard PCs and many high-performance computer systems, systems that have to meet real-time requirements usually do not feature caches, since caches primarily improve the average case performance, whereas their impact on WCET is generally hard to predict. Especially in embedded syste ..."
Abstract
-
Cited by 20 (0 self)
- Add to MetaCart
(Show Context)
In contrast to standard PCs and many high-performance computer systems, systems that have to meet real-time requirements usually do not feature caches, since caches primarily improve the average case performance, whereas their impact on WCET is generally hard to predict. Especially in embedded systems, scratchpad memories have become popular. Since these small, fast memories can be controlled by the programmer or the compiler, their behavior is perfectly predictable. In this paper, we study for the first time the impact of scratchpad memories on worst-case execution time (WCET) prediction. Our results indicate that scratchpads can significantly improve WCET at no extra analysis cost.
Compiler-optimized Usage of Partitioned Memories
- In Proceedings of the 3rd Workshop on Memory Performance Issues (WMPI 2004)
, 2004
"... In order to meet the requirements concerning both performance and energy consumption in embedded systems, new memory architectures are being introduced. Beside the well-known use of caches in the memory hierarchy, processor cores today also include small onchip memories called scratchpad memories ..."
Abstract
-
Cited by 18 (2 self)
- Add to MetaCart
(Show Context)
In order to meet the requirements concerning both performance and energy consumption in embedded systems, new memory architectures are being introduced. Besides the well-known use of caches in the memory hierarchy, processor cores today also include small on-chip memories called scratchpad memories, whose usage is not controlled by hardware, but rather by the programmer or the compiler. Techniques for utilization of these scratchpads have been known for some time. Some new processors provide more than one scratchpad, making it necessary to enhance the workflow so that this complex memory architecture can be efficiently utilized. In this work, we present an energy model and an ILP formulation to optimally assign memory objects to different partitions of scratchpad memories at compile time, achieving energy savings of up to 22% compared to previous approaches.
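The assignment problem behind such an ILP formulation can be stated compactly: each memory object goes to exactly one partition (or main memory) subject to partition capacities, minimizing total access energy. A real flow would hand this model to an ILP solver; the exhaustive search below only illustrates the model on a hypothetical three-object, two-partition instance.

```python
# Hedged sketch of the partition-assignment model: minimize total access
# energy subject to per-partition capacity. Object sizes, access counts,
# and energy figures are hypothetical; a real flow uses an ILP solver.
import itertools

def best_assignment(objects, partitions, main_energy):
    """objects: list of (size, accesses); partitions: list of
    (capacity, energy_per_access). Returns (min_energy, assignment),
    where assignment[i] is a partition index or None for main memory."""
    slots = list(range(len(partitions))) + [None]
    best_energy, best_assign = float("inf"), None
    for assign in itertools.product(slots, repeat=len(objects)):
        used = [0] * len(partitions)
        energy = 0.0
        for (size, accesses), p in zip(objects, assign):
            if p is None:
                energy += accesses * main_energy
            else:
                used[p] += size
                energy += accesses * partitions[p][1]
        fits = all(u <= cap for u, (cap, _) in zip(used, partitions))
        if fits and energy < best_energy:
            best_energy, best_assign = energy, assign
    return best_energy, best_assign

objs = [(100, 500), (200, 300), (150, 400)]   # (size, access count)
parts = [(250, 1.0), (200, 1.5)]              # (capacity, energy/access)
print(best_assignment(objs, parts, main_energy=5.0))
```

The ILP view of the same model uses 0/1 variables x[i][p] with sum-to-one and capacity constraints, which scales far beyond this brute-force enumeration.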
Traffic shaping for an FPGA based SDRAM controller with complex QoS requirements
- In DAC ’05: Proceedings of the 42nd Annual Conference on Design Automation
, 2005
"... Today high-end video and multimedia processing applications require huge amounts of memory. For cost reasons, the usage of conventional dynamic RAM (SDRAM) is preferred. However, SDRAM access optimization is a complex task, especially if multistream access with different QoS requirements is involved ..."
Abstract
-
Cited by 14 (0 self)
- Add to MetaCart
(Show Context)
Today, high-end video and multimedia processing applications require huge amounts of memory. For cost reasons, the usage of conventional synchronous dynamic RAM (SDRAM) is preferred. However, SDRAM access optimization is a complex task, especially if multistream access with different QoS requirements is involved. In [8], a multi-stream DDR-SDRAM controller IP covering combinations of low latency requirements for processor cache access, hard real-time constraints for periodic video signals and hard real-time bursty accesses for video coprocessors was described. To handle these contradictory QoS requirements at high system performance, a combination of a 2-stage scheduling algorithm and static priorities was used. This paper describes an additional flow control which enhances the overall performance. Experiments with an FPGA-based high-end video platform demonstrate the superiority of this architecture.
Dynamic Platform Management for Configurable Platform-Based System-on-Chips
- in Proc. Int. Conf. Computer-Aided Design
, 2003
"... General-purpose System-on-Chip platforms consisting of configurable components are emerging as an attractive alternative to traditional, customized solutions (e.g., ASICs, custom SoCs), owing to their flexibility, time-to-market advantage, and low engineering costs. However, the adoption of such pla ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
(Show Context)
General-purpose System-on-Chip platforms consisting of configurable components are emerging as an attractive alternative to traditional, customized solutions (e.g., ASICs, custom SoCs), owing to their flexibility, time-to-market advantage, and low engineering costs. However, the adoption of such platforms in many high-volume markets (e.g., wireless handhelds) is limited by concerns about their performance and energy-efficiency. This paper addresses the problem of enabling the use of configurable platforms in domains where custom approaches have traditionally been used. We introduce Dynamic Platform Management, a methodology for customizing a configurable general-purpose platform at run-time, to help bridge the performance and energy efficiency gap with custom approaches. The proposed technique uses a software layer that detects time-varying processing requirements imposed by a set of applications, and dynamically optimizes architectural parameters and platform components. Dynamic platform management enables superior application performance, more efficient utilization of platform resources, and improved energy efficiency, as compared to a statically optimized platform, without requiring any modifications to the underlying hardware.
Compact procedural implementation in DSP software synthesis through recursive graph decomposition
- In Proceedings of the International Workshop on Software and Compilers for Embedded Processors
, 2004
"... Abstract. Synthesis of digital signal processing (DSP) software from dataflowbased formal models is an effective approach for tackling the complexity of modern DSP applications. In this paper, an efficient method is proposed for applying subroutine call instantiation of module functionality when syn ..."
Abstract
-
Cited by 13 (5 self)
- Add to MetaCart
(Show Context)
Synthesis of digital signal processing (DSP) software from dataflow-based formal models is an effective approach for tackling the complexity of modern DSP applications. In this paper, an efficient method is proposed for applying subroutine call instantiation of module functionality when synthesizing embedded software from a dataflow specification. The technique is based on a novel recursive decomposition of subgraphs in a cluster hierarchy that is optimized for low buffer size. Applying this technique, one can achieve significantly lower buffer sizes than are available with minimum-code-size inlined schedules, which have been the emphasis of prior software synthesis work. Furthermore, it is guaranteed that the number of procedure calls in the synthesized program is polynomially bounded in the size of the input dataflow graph, even though the number of module invocations may increase exponentially. This recursive decomposition approach provides an efficient means for integrating subroutine-based module instantiation into the design space of DSP software synthesis. The experimental results demonstrate a significant improvement in buffer cost, especially ...
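The buffer-size trade-off the abstract refers to can be seen already on a single synchronous dataflow (SDF) edge A --(produce p, consume c)--> B: a fully inlined flat schedule (fire all of A, then all of B) buffers a whole iteration, while an interleaved looped or procedural schedule can approach the known per-edge minimum p + c - gcd(p, c). This sketch only illustrates that gap on hypothetical rates; the paper's contribution is the recursive clustering that achieves such interleaving with polynomially many procedure calls.

```python
# Hedged sketch: iteration balance and buffer sizes for one SDF edge.
# Rates (p, c) below are hypothetical illustration values.
from math import gcd

def repetitions(p, c):
    """Smallest firing counts (qA, qB) balancing an edge with rates p, c,
    i.e. qA * p == qB * c."""
    g = gcd(p, c)
    return c // g, p // g

def flat_buffer(p, c):
    """Tokens buffered when all of A fires before any of B (= lcm(p, c))."""
    qA, _ = repetitions(p, c)
    return qA * p

def min_edge_buffer(p, c):
    """Minimum feasible buffer for a two-actor edge (classic SDF bound)."""
    return p + c - gcd(p, c)

print(repetitions(3, 2), flat_buffer(3, 2), min_edge_buffer(3, 2))
```

For rates (3, 2) the flat schedule needs a 6-token buffer while interleaved firing gets by with 4; on deep graphs this gap is what motivates trading inlined code for subroutine calls.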