Results 1 - 10
of
22
Polynomial-time subgraph enumeration for automated instruction set extension
- In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition
, 2007
"... This paper proposes a novel algorithm that, given a data-flow graph and an input/output constraint, enumerates all convex subgraphs under the given constraint in polynomial time with respect to the size of the graph. These subgraphs have been shown to represent efficient Instruction Set Extensions f ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
This paper proposes a novel algorithm that, given a data-flow graph and an input/output constraint, enumerates all convex subgraphs under the given constraint in polynomial time with respect to the size of the graph. These subgraphs have been shown to represent efficient Instruction Set Extensions for customizable processors. The search space for this problem is inherently polynomial but, to our knowledge, this is the first paper to prove this and to present a practical algorithm for this problem with polynomial complexity. Our algorithm is based on properties of convex subgraphs that link them to the concept of multiple-vertex dominators. We discuss several pruning techniques that, without sacrificing the optimality of the algorithm, make it practical for data-flow graphs of a thousands nodes or more. 1.
P.: Rethinking custom ISE identification: A new processor-agnostic method
- In: Proceedings of the International Conference on Compilers, Architectures, and Synthesis for Embedded Systems
, 2007
"... The last decade has witnessed the emergence of the Application Specific Instruction-set Processor (ASIP) as a viable platform for embedded systems. Extensible ASIPs allow the user to augment a base processor with Instruction Set Extensions (ISEs) that execute on Application Specific Functional Units ..."
Abstract
-
Cited by 7 (6 self)
- Add to MetaCart
The last decade has witnessed the emergence of the Application Specific Instruction-set Processor (ASIP) as a viable platform for embedded systems. Extensible ASIPs allow the user to augment a base processor with Instruction Set Extensions (ISEs) that execute on Application Specific Functional Units (AFUs) − dedicated hardware that executes the ISEs. Due to the limited number of read and write ports in the register file of the base processor, the size and complexity of AFUs are generally limited. Recent work has focused on overcoming these constraints by serialising access to the register file. Apart from these complications, the primary challenge in the identification and selection of the best AFU is the modelling of AFU performance in the context of different base processors: once the base processor changes, the ISE identification and AFU selection process must be redone from scratch. Exhaustive ISE/AFU enumeration methods are not scalable and generally fail for larger applications. To address this concern, a new approach to ISE/AFU identification is proposed. In particular, we show that the speedup model of ISEs/AFUs is independent of the specific details of the base processor, under fairly reasonable assumptions. The approach presented here significantly prunes the list of best ISE/AFU candidates compared to previous approaches. Experimentally, we observe the new approach produces optimal results on larger applications where prior approaches either fail or produce inferior results.
Introduction of architecturally visible storage in instruction set extensions
- IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
, 2007
"... Abstract—Instruction set extensions (ISEs) can be used effectively ..."
Abstract
-
Cited by 6 (4 self)
- Add to MetaCart
Abstract—Instruction set extensions (ISEs) can be used effectively
Automatic identification of application-specific functional units with architecturally visible storage
- in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition
, 2006
"... Instruction Set Extensions (ISEs) can be used effectively to accelerate the performance of embedded processors. The critical, and difficult task of ISE selection is often performed manually by designers. A few automatic methods for ISE generation have shown good capabilities, but are still limited i ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
Instruction Set Extensions (ISEs) can be used effectively to accelerate the performance of embedded processors. The critical, and difficult task of ISE selection is often performed manually by designers. A few automatic methods for ISE generation have shown good capabilities, but are still limited in the handling of memory accesses, and so they fail to directly address the memory wall problem. We present here the first ISE identification technique that can automatically identify state-holding Application-specific Functional Units (AFUs) comprehensively, thus being able to eliminate a large portion of memory traffic from cache and main memory. Our cycle-accurate results obtained by the SimpleScalar simulator show that the identified AFUs with architecturally visible storage gain significantly more than previous techniques, and achieve an average speedup of 2.8 × over pure software execution. Moreover, the number of required memory-access instructions is reduced by two thirds on average, suggesting corresponding benefits on energy consumption. 1.
E.: Speculative DMA for Architecturally Visible Storage in Instruction Set Extensions
- In: Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis
, 2008
"... Instruction set extensions (ISEs) can accelerate embedded processor performance. Many algorithms for ISE generation have shown good potential; some of them have recently been expanded to include Architecturally Visible Storage (AVS)—compiler-controlled memories, similar to scratchpads, that are acce ..."
Abstract
-
Cited by 4 (4 self)
- Add to MetaCart
Instruction set extensions (ISEs) can accelerate embedded processor performance. Many algorithms for ISE generation have shown good potential; some of them have recently been expanded to include Architecturally Visible Storage (AVS)—compiler-controlled memories, similar to scratchpads, that are accessible only to ISEs. To achieve a speedup using AVS, Direct Memory Access (DMA) transfers are required to move data from the main memory to the AVS; unfortunately, this creates coherence problems between the AVS and the cache, which previous methods for ISEs with AVS failed to address; additionally, these methods need to leave many conservative DMA transfers in place, whose execution significantly limits the achievable speedup. This paper presents a memory coherence scheme for ISEs with AVS, which can ensure execution correctness and memory consistency with minimal area overhead. We also present a method that speculatively removes redundant DMA transfers. Cycle-accurate experimental results were obtained using an FPGA-emulation platform. These results show that the application-specific instruction-set extended processors with speculative DMA-enhanced AVS gain significantly over previous techniques, despite the overhead of the coherence mechanism.
N.: Resource sharing in custom instruction set extensions
- In: Proceedings of the 6th IEEE Symposium on Application Specific Processors
, 2008
"... Abstract—Customised processor performance generally increases as additional custom instructions are added. However, performance is not the only metric that modern systems must take into account; die area and energy efficiency are equally important. Resource sharing during synthesis of instruction se ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
Abstract—Customised processor performance generally increases as additional custom instructions are added. However, performance is not the only metric that modern systems must take into account; die area and energy efficiency are equally important. Resource sharing during synthesis of instruction set extensions (ISEs) can reduce significantly the die area and energy consumption of a customized processor. This may increase the number of custom instructions that can be synthesized with a given area budget. Resource sharing involves combining the graph representations of two or more ISEs which contain a similar sub-graph. This coupling of multiple sub-graphs, if performed naively, can increase the latency of the extension instructions considerably. And yet, as we show in this paper, an appropriate level of resource sharing provides a significantly simpler design with only modest increases in average latency for extension instructions. Based on existing resource-sharing techniques, this study presents a new heuristic that controls the degree of resource sharing between a given set of custom instructions. Our main contributions are the introduction of a parametric method for exploring the trade-offs that can be achieved between instruction latency and implementation complexity, and the coupling of design-space exploration with fast area-delay models for the operators comprising each ISE. We present experimental evidence that our heuristic exposes a broad range of design points, allowing advantageous trade-offs between die area and latency to be found and exploited. I.
Exhaustive Enumeration of Legal Custom Instructions for Extensible Processors
"... Today’s customizable processors allow the designer to augment the base processor with custom accelerators. By choosing appropriate set of accelerators, designer can significantly enhance the performance and power of an application. Due to the large number of accelerator choices and their complex tra ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Today’s customizable processors allow the designer to augment the base processor with custom accelerators. By choosing appropriate set of accelerators, designer can significantly enhance the performance and power of an application. Due to the large number of accelerator choices and their complex trade-offs among reuse, gain and area, manually deciding the optimal combination of accelerators is quite cumbersome and time consuming. This calls for CAD tools that select optimal combination of accelerators by thoroughly searching theentiredesignspace. Thetermpattern is commonly used to represent the computation performed by a custom accelerator. In this paper, we propose an algorithm for rapidly enumerating all the legal patterns taking into account several constraints posed by a typical micro-architecture. The proposed algorithm achieves significant reduction in run-time by a) enumerating the patterns in the increasing order of sizes and b) relating the characteristics of a (k +1) node pattern with the characteristics of its k node subgraphs. Also, in scenarios where I/O is not a bottleneck, designer can optionally relax the I/O constraint and our algorithm efficiently enumerates all legal I/O unbound legal patterns. The experimental evidence indicate an order of two run-time speedup over state of the art techniques. 1.
Efficient Custom Instruction Identification with Exact Enumeration
, 2007
"... Extensible processors allow addition of application-specific custom instructions to the core instruc-tion set architecture. These custom instructions are selected through an analysis of the program’s dataflow graphs. The characteristics of certain applications and the modern compiler optimization te ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Extensible processors allow addition of application-specific custom instructions to the core instruc-tion set architecture. These custom instructions are selected through an analysis of the program’s dataflow graphs. The characteristics of certain applications and the modern compiler optimization techniques (e.g., loop unrolling, region formation, etc.) have lead to substantially larger dataflow graphs. Hence, it is computationally expensive to automatically select the optimal set of custom instructions. Heuristic techniques are often employed to quickly search the design space. In order to leverage full potential of custom instructions, our previous work [P. Yu and T. Mitra, CASES, 2004, p.69] proposed an efficient algorithm for exact enumeration of all possible candidate instructions (or patterns) given the dataflow graphs. But the algorithm was restricted to connected computation patterns. In this report, we describe efficient algorithms to generate all feasible patterns (connected and disjoint) based on our previous algorithm. More optimization techniques are used in both connected and disjoint pattern enumeration, resulting in further reduction of search space.
Utilizing Custom Registers in Application-specific Instruction Set Processors for Register Spills Elimination ∗ ABSTRACT
"... Application-specific instruction set processor (ASIP) has become an important design choice for embedded systems. It can achieve both high flexibility offered by the base processor core and high performance and energy efficiency offered by the dedicated hardware extensions. Although a lot of efforts ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Application-specific instruction set processor (ASIP) has become an important design choice for embedded systems. It can achieve both high flexibility offered by the base processor core and high performance and energy efficiency offered by the dedicated hardware extensions. Although a lot of efforts have been devoted to computation acceleration, e.g., automatic custom instruction identification and synthesis, the limited on-chip data storage elements, including the register file and data cache, have become a potential performance bottleneck. In this paper, we propose a hardware/software cooperative approach and a linear scan register allocation algorithm to utilize the existing custom registers in ASIPs for eliminating register spills. The data traffic between the processor and memory can be reduced through efficient on-chip communications between the base processor core and custom hardware extensions. Our experimental results demonstrate that a promising performance gain can be achieved, which is orthogonal to improvements by any other technique in ASIP design.
Introducing Control-Flow Inclusion to Support Pipelining in Custom Instruction Set Extensions
"... Abstract—Multi-cycle Instruction set extensions (ISE) can be pipelined in order to increase their throughput; however, typical program traces seldom contain consecutive calls to the same ISE that would allow this temporal parallelism. Often, there are intermittent calls to branch instructions, at a ..."
Abstract
- Add to MetaCart
Abstract—Multi-cycle Instruction set extensions (ISE) can be pipelined in order to increase their throughput; however, typical program traces seldom contain consecutive calls to the same ISE that would allow this temporal parallelism. Often, there are intermittent calls to branch instructions, at a minimum, that prevent the pipelined execution of subsequent calls to the same ISE within a loop. What is needed is ISEs that cover an entire loop body, which can create a stream of repeated calls to the same ISE during program execution; this, in turn, permits the use of hardware pipelining. To address this concern, we introduce a new type of ISE that borrows ideas from zero-overhead loop instructions to permit pipelined execution of loops. To further expose instruction-level parallelism, the ISE supports loops whose bodies form hyperblocks, which are regions of program control flow that have multiple exits (including loop iterations and break points within loops). These ISEs broaden the scope of instructionlevel parallelism and obtain higher speed ups compared to traditional ISEs, primarily through pipelining, the exploitation of spatial parallelism, and reducing the overhead of control flow statements and branches. I.

