Processor acceleration through automated instruction set customization
- In MICRO, 2003
- Cited by 70 (5 self)
Application-specific extensions to the computational capabilities of a processor provide an efficient mechanism to meet the growing performance and power demands of embedded applications. Hardware, in the form of new function units (or co-processors), and the corresponding instructions, are added to a baseline processor to meet the critical computational demands of a target application. The central challenge with this approach is the large degree of human effort required to identify and create the custom hardware units, as well as to port the application to the extended processor. In this paper, we present the design of a system to automate the instruction set customization process. A dataflow graph design space exploration engine efficiently identifies profitable computation subgraphs from which to create custom hardware, without artificially constraining their size or shape. The system also contains a compiler subgraph matching framework that identifies opportunities to exploit and generalize the hardware to support more computation graphs. We demonstrate the effectiveness of this system across a range of application domains and study the applicability of the custom hardware across the domain.
Scalable Custom Instructions Identification for Instruction-Set Extensible Processors
- In CASES, 2004
- Cited by 44 (7 self)
Extensible processors allow addition of application-specific custom instructions to the core instruction set architecture. However, it is computationally expensive to automatically select the optimal set of custom instructions. Therefore, heuristic techniques are often employed to quickly search the design space. In this paper, we present an efficient algorithm for exact enumeration of all possible candidate instructions given the dataflow graph (DFG) corresponding to a code fragment. Even though this is similar to the “subgraph enumeration” problem (which is exponential), we find that most subgraphs are not feasible candidates for various reasons. In fact, the number of candidates is quite small compared to the size of the DFG. Compared to previous approaches, our technique achieves orders of magnitude speedup in enumerating these candidate custom instructions for very large DFGs.
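The notion of a feasible candidate in the abstract above can be illustrated with a minimal sketch. It is not the paper's algorithm: it is a brute-force enumeration over a hypothetical four-node DFG, and it assumes the feasibility rules commonly used in this literature (the subgraph must be convex, and its input and output counts must fit the register-file ports). The names `OPS`, `LIVE_OUT`, `candidates`, and the port limits are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical toy DFG: operation -> producers of its operands.
# Names that are not keys of OPS ("a", "b", "c") are external inputs.
OPS = {"n1": ["a", "b"], "n2": ["n1", "c"], "n3": ["n1"], "n4": ["n2", "n3"]}
LIVE_OUT = {"n4"}  # assumed: values still needed after the code fragment

# Successor lists, derived from the operand lists.
SUCC = {n: [m for m in OPS if n in OPS[m]] for n in OPS}


def inputs(sub):
    """Distinct values produced outside the subgraph and consumed inside it."""
    return {p for n in sub for p in OPS[n] if p not in sub}


def outputs(sub):
    """Nodes whose value is used outside the subgraph or after the fragment."""
    return {n for n in sub if n in LIVE_OUT or any(s not in sub for s in SUCC[n])}


def is_convex(sub):
    """True if no path leaves the subgraph and later re-enters it."""
    stack = [s for n in sub for s in SUCC[n] if s not in sub]
    seen = set()
    while stack:
        node = stack.pop()
        if node in sub:
            return False  # a path through an outside node came back in
        if node not in seen:
            seen.add(node)
            stack.extend(SUCC[node])
    return True


def candidates(max_in, max_out, forbidden=frozenset()):
    """Exhaustively list subgraphs that satisfy all feasibility rules."""
    nodes = sorted(set(OPS) - set(forbidden))
    found = []
    for size in range(1, len(nodes) + 1):
        for combo in combinations(nodes, size):
            sub = frozenset(combo)
            if (is_convex(sub)
                    and len(inputs(sub)) <= max_in
                    and len(outputs(sub)) <= max_out):
                found.append(sub)
    return found
```

With two input ports and one output port, only 6 of the 15 non-empty subgraphs of this toy graph survive, which mirrors the abstract's observation that most subgraphs are not feasible candidates. The exhaustive loop is exponential in the number of nodes and is only meant to make the constraints concrete.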
Application-specific instruction generation for configurable processor architectures
- In Proc. ACM International Symposium on Field-Programmable Gate Arrays, 2004
- Cited by 42 (6 self)
Designing an application-specific embedded system in nanometer technologies has become more difficult than ever due to the rapid increase in design complexity and manufacturing cost. Efficiency and flexibility must be carefully balanced to meet different application requirements. The recently emerged configurable and extensible processor architectures offer a favorable tradeoff between efficiency and flexibility, and a promising way to minimize certain important metrics (e.g., execution time, code size, etc.) of the embedded processors. This paper addresses the problem of generating the application-specific instructions to improve the execution speed for configurable processors. A set of algorithms, including pattern generation, pattern selection, and application mapping, are proposed to efficiently utilize the instruction set extensibility of the target configurable processor. Applications of our approach to several real-life benchmarks on the Altera Nios processor show encouraging performance speedup (2.75X on average and up to 3.73X in some cases).
Exact and approximate algorithms for the extension of embedded processor instruction sets
- IEEE Trans. on CAD of Integrated Circuits and Systems
- Cited by 30 (14 self)
In embedded computing, cost, power, and performance constraints call for the design of specialized processors, rather than for the use of the existing off-the-shelf solutions. While the design of these application-specific CPUs could be tackled from scratch, a cheaper and more effective option is that of extending the existing processors and toolchains. Extensibility is indeed a feature now offered in real designs, e.g., by processors such as Tensilica Xtensa [T. R. Halfhill, Microprocess
Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization
- In Proceedings of the International Symposium on Microarchitecture, 2004
- Cited by 18 (0 self)
Application-specific instruction set extensions are an effective way of improving the performance of processors. Critical computation subgraphs can be accelerated by collapsing them into new instructions that are executed on specialized function units. Collapsing the subgraphs simultaneously reduces the length of computation as well as the number of intermediate results stored in the register file. The main problem with this approach is that a new processor must be generated for each application domain. While new instructions can be designed automatically, there is a substantial amount of engineering cost incurred to verify and to implement the final custom processor. In this work, we propose a strategy for transparent customization of the core computation capabilities of the processor without changing its instruction set. A configurable array of function units is added to the baseline processor that enables the acceleration of a wide range of dataflow subgraphs. To exploit the array, the microarchitecture performs subgraph identification at run-time, replacing them with new microcode instructions to configure and utilize the array. We compare the effectiveness of replacing subgraphs in the fill unit of a trace cache versus using a translation table during decode, and evaluate the tradeoffs between static and dynamic identification of subgraphs for instruction set customization.
Characterizing Embedded Applications for Instruction-Set Extensible Processors
- In DAC, 2004
- Cited by 16 (6 self)
Extensible processors, which allow customization for an application domain by extending the core instruction set architecture, are becoming increasingly popular for embedded systems. However, existing techniques restrict the set of possible candidates for custom instructions by imposing a variety of constraints. As a result, the true extent of performance improvement achievable by extensible processors for embedded applications remains unknown. Moreover, it is unclear how the interplay among these restrictions impacts the performance potential. Our careful examination of this issue shows that significant speedup can only be obtained by relaxing some of the constraints to a reasonable extent. In particular, to the best of our knowledge, ours is the first work that studies the impact of relaxing the control flow constraint by identifying instructions across basic blocks, and it indicates 5–148% relative speedup for different applications.
An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors
- In Proceedings of the 32nd Annual International Symposium on Computer Architecture, 2005
- Cited by 16 (1 self)
Instruction set customization is an effective way to improve processor performance. Critical portions of application dataflow graphs are collapsed for accelerated execution on specialized hardware. Collapsing dataflow subgraphs compresses the latency along critical paths and reduces the number of intermediate results stored in the register file. While custom instructions can be effective, the time and cost of designing a new processor for each application is immense. To overcome this roadblock, this paper proposes a flexible architectural framework to transparently integrate custom instructions into a general-purpose processor. Hardware accelerators are added to the processor to execute the collapsed subgraphs. A simple microarchitectural interface is provided to support a plug-and-play model for integrating a wide range of accelerators into a pre-designed and verified processor core. The accelerators are exploited using an approach of static identification and dynamic realization. The compiler is responsible for identifying profitable subgraphs, while the hardware handles discovery, mapping, and execution of compatible subgraphs. This paper presents the design of a plug-and-play transparent accelerator system and evaluates the cost/performance implications of the design.
Extracting and improving microarchitecture performance on reconfigurable architectures
- International Journal of Parallel Programming, 2005
- Cited by 8 (8 self)
We describe our experience using reconfigurable architectures to develop an understanding of an application’s performance and to enhance its performance with respect to customized, constrained logic. We begin with a standard ISA currently in use for embedded systems. We modify its core to measure performance characteristics, obtaining a system that provides cycle-accurate timings and presents results in the style of gprof, but with absolutely no software overhead. We then provide cache-behavior statistics that are typically unavailable in a generic processor. In contrast with simulation, our approach executes the program at full speed and delivers statistics based on the actual behavior of the cache subsystem. Finally, in response to the performance profile developed on our platform, we evaluate various uses of the FPGA-realized instruction and data caches in terms of the application’s performance.
ISEGEN: Generation of high-quality instruction set extensions by iterative improvement
- In DATE ’05, 2005
- Cited by 8 (5 self)
Customization of processor architectures through Instruction Set Extensions (ISEs) is an effective way to meet the growing performance demands of embedded applications. A high-quality ISE generation approach needs to obtain results close to those achieved by experienced designers, particularly for complex applications that exhibit regularity: expert designers are able to manually exploit such regularity in the data flow graphs to generate high-quality ISEs. In this paper, we present ISEGEN, an approach that identifies high-quality ISEs by iterative improvement following the basic principles of the well-known Kernighan-Lin (K-L) min-cut heuristic. Experimental results on a number of MediaBench, EEMBC and cryptographic applications show that our approach matches the quality of the optimal solution obtained by exhaustive search. We also show that our ISEGEN technique is on average faster than a genetic formulation that generates equivalent solutions. Furthermore, the ISEs identified by our technique exhibit more speedup than the genetic solution on a large cryptographic application (AES) by effectively exploiting its regular structure.
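The iterative-improvement idea named in the abstract above can be sketched in the Kernighan-Lin style: in each pass every node is toggled exactly once between software and hardware, the best-scoring toggle is always taken even when it lowers the score, and the pass keeps the best configuration it saw. This is a simplified illustration, not ISEGEN's gain function. The toy graph, the one-cycle latency assumptions, the port limits, and the penalty weight are all hypothetical, and the convexity check a real tool would need is omitted for brevity.

```python
# Hypothetical toy DFG: operation -> producers of its operands.
# Names that are not keys of OPS ("a".."d") are external inputs.
OPS = {"n1": ["a", "b"], "n2": ["n1", "c"], "n3": ["n1", "d"], "n4": ["n2", "n3"]}
LIVE_OUT = {"n4"}                  # assumed: values needed after the fragment
SW_CYCLES = {n: 1 for n in OPS}    # assumed: one cycle per operation in software
MAX_IN, MAX_OUT, PENALTY = 4, 1, 10  # assumed port limits and penalty weight


def merit(cut):
    """Cycles saved by moving `cut` to hardware, minus a port-violation penalty."""
    if not cut:
        return 0
    ins = {p for n in cut for p in OPS[n] if p not in cut}
    outs = {n for n in cut
            if n in LIVE_OUT or any(n in OPS[m] for m in OPS if m not in cut)}
    saved = sum(SW_CYCLES[n] for n in cut) - 1  # assume a 1-cycle custom instruction
    excess = max(0, len(ins) - MAX_IN) + max(0, len(outs) - MAX_OUT)
    return saved - PENALTY * excess


def kl_pass(cut):
    """One pass: toggle each node once, best move first; return the best cut seen."""
    current, best, best_merit = set(cut), set(cut), merit(cut)
    unlocked = set(OPS)
    while unlocked:
        # Take the best available toggle even if it makes the merit worse.
        node = max(sorted(unlocked), key=lambda n: merit(current ^ {n}))
        current ^= {node}
        unlocked.remove(node)  # lock the node for the rest of this pass
        if merit(current) > best_merit:
            best, best_merit = set(current), merit(current)
    return best


def improve(cut=frozenset()):
    """Repeat passes until a pass no longer improves the merit."""
    cut = set(cut)
    while True:
        new = kl_pass(cut)
        if merit(new) <= merit(cut):
            return cut
        cut = new
```

On this toy graph no single toggle from the empty cut improves the merit, so a purely greedy search would stop immediately. The pass instead moves through penalized intermediate cuts and ends with all four nodes in hardware, which is the property that makes K-L style passes attractive here.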
Instruction set extension with shadow registers for configurable processors
- In FPGA ’05: Proceedings of the 2005 ACM/SIGDA 13th international, 2005
- Cited by 7 (0 self)
Configurable processors, which allow customization and extension of the base instruction set architecture for a specific application or a domain of applications, are becoming increasingly popular for modern embedded systems (especially for the field-programmable system-on-a-chip). While steady progress has been made in the tools and methodologies of automatic instruction set extension for configurable processors, the limited data bandwidth available in the core processor (e.g., the number of simultaneous accesses to the register file) becomes a potential performance bottleneck. In this paper we first present a quantitative analysis of the data bandwidth limitation in configurable processors, and then propose a novel low-cost architectural extension and associated compilation techniques to address the problem. Specifically, shadow registers are introduced to selectively copy the execution results in the write-back stage, which can efficiently reduce the communication overhead due to the data transfers between the core processor and the custom logic. To take full advantage of the extension, an effective shadow-register binding algorithm is presented to minimize the communication overhead. The application of our approach results in a promising performance improvement.

