Scalable Custom Instructions Identification for Instruction-Set Extensible Processors
- In CASES, 2004
Cited by 44 (7 self)
Extensible processors allow addition of application-specific custom instructions to the core instruction set architecture. However, it is computationally expensive to automatically select the optimal set of custom instructions. Therefore, heuristic techniques are often employed to quickly search the design space. In this paper, we present an efficient algorithm for exact enumeration of all possible candidate instructions given the dataflow graph (DFG) corresponding to a code fragment. Even though this is similar to the “subgraph enumeration” problem (which is exponential), we find that most subgraphs are not feasible candidates for various reasons. In fact, the number of candidates is quite small compared to the size of the DFG. Compared to previous approaches, our technique achieves orders of magnitude speedup in enumerating these candidate custom instructions for very large DFGs.
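To make the idea of a "feasible candidate" concrete, here is a minimal sketch on a toy DFG. It assumes two feasibility rules that the abstract does not spell out (convexity, and limits on the number of inputs and outputs), and it is the naive exponential baseline rather than the paper's enumeration algorithm. The graph, the node names, and every function name are hypothetical.

```python
from itertools import combinations

def descendants(dfg, start):
    """All nodes reachable from `start` by following successor edges."""
    seen, stack = set(), list(dfg[start])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dfg[node])
    return seen

def is_convex(dfg, sub):
    """A subgraph is convex if no path between two of its nodes leaves it."""
    sub = set(sub)
    for outside in set(dfg) - sub:
        reached_from_sub = any(outside in descendants(dfg, s) for s in sub)
        reaches_sub = bool(descendants(dfg, outside) & sub)
        if reached_from_sub and reaches_sub:
            return False
    return True

def io_counts(dfg, sub):
    """Count outside producers feeding the subgraph and inside values used outside."""
    sub = set(sub)
    inputs = {p for p, succs in dfg.items() if p not in sub and sub & set(succs)}
    # Assumption: a node with no successors is treated as a live-out result.
    outputs = {n for n in sub if not dfg[n] or set(dfg[n]) - sub}
    return len(inputs), len(outputs)

def enumerate_candidates(dfg, ops, max_in, max_out):
    """Brute force over all non-empty subsets of the operation nodes."""
    found = []
    ops = sorted(ops)
    for size in range(1, len(ops) + 1):
        for combo in combinations(ops, size):
            n_in, n_out = io_counts(dfg, combo)
            if n_in <= max_in and n_out <= max_out and is_convex(dfg, combo):
                found.append(frozenset(combo))
    return found

# Toy DFG: x, y, z are input values; the other nodes are operations.
DFG = {
    "x": ["add"], "y": ["add", "mul"], "z": ["mul"],
    "add": ["shl"], "mul": ["sub"], "shl": ["sub"], "sub": [],
}
OPS = {"add", "mul", "shl", "sub"}
CANDIDATES = enumerate_candidates(DFG, OPS, max_in=2, max_out=1)
```

On this toy graph, `{add, sub}` is rejected as non-convex because the path from `add` to `sub` passes through `shl`, and most larger subsets are rejected for needing three inputs. That matches the abstract's observation that only a small fraction of subgraphs survive.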
Exact and approximate algorithms for the extension of embedded processor instruction sets
- IEEE Trans. on CAD of Integrated Circuits and Systems
Cited by 30 (14 self)
In embedded computing, cost, power, and performance constraints call for the design of specialized processors, rather than for the use of the existing off-the-shelf solutions. While the design of these application-specific CPUs could be tackled from scratch, a cheaper and more effective option is that of extending the existing processors and toolchains. Extensibility is indeed a feature now offered in real designs, e.g., by processors such as Tensilica Xtensa [T. R. Halfhill, Microprocess
Dataflow mini-graphs: Amplifying superscalar capacity and bandwidth
- In Proc. of the 37th Annual International Symposium on Microarchitecture, 2004
Cited by 16 (3 self)
A mini-graph is a dataflow graph that has an arbitrary internal size and shape but the interface of a singleton instruction: two register inputs, one register output, a maximum of one memory operation, and a maximum of one (terminal) control transfer. Previous work has exploited dataflow sub-graphs whose execution latency can be reduced via programmable FPGA-style hardware. In this paper we show that mini-graphs can improve performance by amplifying the bandwidths of a superscalar processor’s stages and the capacities of many of its structures without custom latency-reduction hardware. Amplification is achieved because the processor deals with a complete mini-graph via a single quasi-instruction, the handle. By constraining mini-graph structure and forcing handles to behave as much like singleton instructions as possible, the number and scope of the modifications over a conventional superscalar microarchitecture are kept to a minimum. This paper describes mini-graphs, a simple algorithm for extracting them from basic block frequency profiles, and a microarchitecture for exploiting them. Cycle-level simulation of several benchmark suites shows that mini-graphs can provide average performance gains of 2–12% over an aggressive baseline, with peak gains exceeding 40%. Alternatively, they can compensate for substantial reductions in register file and scheduler size, and in pipeline bandwidth.
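The interface constraints stated in this abstract can be written as a small predicate. The sketch below is only an illustration of those four limits: the `Instr` record, the register names, and the use of a caller-supplied `live_out` set to decide what counts as an output are all assumptions, not details from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record used only for this illustration."""
    dst: Optional[str]            # destination register, or None
    srcs: Tuple[str, ...]         # source registers
    is_mem: bool = False          # load or store
    is_branch: bool = False       # control transfer

def fits_minigraph_interface(instrs, live_out):
    """Check a straight-line group against the singleton-instruction interface."""
    defined, inputs = set(), set()
    for ins in instrs:
        # A source not produced earlier in the group is an external input.
        inputs.update(s for s in ins.srcs if s not in defined)
        if ins.dst is not None:
            defined.add(ins.dst)
    # Assumption: an output is a value defined inside and still live afterwards.
    outputs = defined & set(live_out)
    mem_ops = sum(ins.is_mem for ins in instrs)
    branch_at = [k for k, ins in enumerate(instrs) if ins.is_branch]
    # At most one control transfer, and only in the last position.
    terminal_ok = not branch_at or branch_at == [len(instrs) - 1]
    return (len(inputs) <= 2 and len(outputs) <= 1
            and mem_ops <= 1 and terminal_ok)

# Example group: r3 = r1 + r2; r4 = load [r3]; branch on r4
GROUP = [
    Instr("r3", ("r1", "r2")),
    Instr("r4", ("r3",), is_mem=True),
    Instr(None, ("r4",), is_branch=True),
]
```

With `live_out={"r4"}` the group passes. If `r3` is also live afterwards, it exposes two outputs and fails.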
Automated Custom Instruction Generation for Domain-Specific Processor Acceleration
- IEEE Transactions on Computers, 2005
Cited by 8 (0 self)
Application-specific extensions to the computational capabilities of a processor provide an efficient mechanism to meet the growing performance and power demands of embedded applications. Hardware, in the form of new function units (or coprocessors), and the corresponding instructions are added to a baseline processor to meet the critical computational demands of a target application. In this paper, the design of a system to automate the instruction set customization process is presented. A dataflow graph design space exploration engine efficiently identifies computation subgraphs to create custom hardware and a compiler subgraph matching framework seamlessly exploits this hardware. We demonstrate the effectiveness of this system across a range of application domains and study the applicability of the custom hardware across an entire application domain. Generalization techniques are presented which enable the application-specific hardware to be more effectively used across a domain.
P.: Rethinking custom ISE identification: A new processor-agnostic method
- In: Proceedings of the International Conference on Compilers, Architectures, and Synthesis for Embedded Systems, 2007
Cited by 7 (6 self)
The last decade has witnessed the emergence of the Application Specific Instruction-set Processor (ASIP) as a viable platform for embedded systems. Extensible ASIPs allow the user to augment a base processor with Instruction Set Extensions (ISEs) that execute on Application Specific Functional Units (AFUs), dedicated hardware that executes the ISEs. Due to the limited number of read and write ports in the register file of the base processor, the size and complexity of AFUs are generally limited. Recent work has focused on overcoming these constraints by serialising access to the register file. Apart from these complications, the primary challenge in the identification and selection of the best AFU is the modelling of AFU performance in the context of different base processors: once the base processor changes, the ISE identification and AFU selection process must be redone from scratch. Exhaustive ISE/AFU enumeration methods are not scalable and generally fail for larger applications. To address this concern, a new approach to ISE/AFU identification is proposed. In particular, we show that the speedup model of ISEs/AFUs is independent of the specific details of the base processor, under fairly reasonable assumptions. The approach presented here significantly prunes the list of best ISE/AFU candidates compared to previous approaches. Experimentally, we observe that the new approach produces optimal results on larger applications where prior approaches either fail or produce inferior results.
Pattern-Based Behavior Synthesis for FPGA Resource Reduction
Cited by 6 (4 self)
Pattern-based synthesis has drawn wide interest from researchers who tried to utilize the regularity in applications for design optimizations. In this paper we present a general pattern-based behavior synthesis framework which can efficiently extract similar structures in programs. Our approach is very scalable thanks to advanced pruning techniques that include locality sensitive hashing and characteristic vectors. The similarity of structures is captured by a mismatch-tolerant metric: graph edit distance. The edit distance between two graphs is the minimum number of vertex/edge insertion, deletion, and substitution operations needed to transform one graph into the other. Graph edit distance can naturally handle various program variations such as bit-width, structure, and port variations. In addition, we apply our pattern-based synthesis system to FPGA resource optimization with the observation that multiplexors are particularly expensive on FPGA platforms. Considering knowledge of discovered patterns, the resource binding step can intelligently generate the data-path to reduce interconnect costs. Experiments show our approach can, on average, reduce the total area by about 20% with 7% latency overhead on the Xilinx Virtex-4 FPGAs, compared to the traditional behavior synthesis flow.
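The graph edit distance defined in this abstract can be computed exactly for very small graphs by trying every node mapping. The sketch below assumes unit cost for every operation, labelled nodes, and unlabelled directed edges; the function name and the example graphs are hypothetical. Its cost is factorial in the graph size, so it is only a reference point and not the paper's method, which the abstract says relies on locality sensitive hashing and characteristic vectors for pruning.

```python
from itertools import permutations

def graph_edit_distance(labels1, edges1, labels2, edges2):
    """Exact unit-cost edit distance between two tiny labelled digraphs.

    labels*: list of node labels (must not be None); node i has label labels[i].
    edges*:  iterable of (src_index, dst_index) pairs.
    """
    n1, n2 = len(labels1), len(labels2)
    size = n1 + n2
    # Pad with dummy nodes (None) so that every mapping is a bijection:
    # real -> dummy is a deletion, dummy -> real is an insertion.
    pad1 = list(labels1) + [None] * n2
    pad2 = list(labels2) + [None] * n1
    e1, e2 = set(edges1), set(edges2)
    best = None
    for perm in permutations(range(size)):
        # Node cost: 1 for each substitution, insertion, or deletion.
        cost = sum(1 for i, j in enumerate(perm) if pad1[i] != pad2[j])
        # Edge cost: 1 for each edge present on one side of the mapping only.
        for u in range(size):
            for v in range(size):
                if ((u, v) in e1) != ((perm[u], perm[v]) in e2):
                    cost += 1
        if best is None or cost < best:
            best = cost
    return best

# Two patterns that differ in one operation (a substitution).
D_SUBST = graph_edit_distance(["add", "shl"], [(0, 1)],
                              ["add", "shr"], [(0, 1)])
# The second pattern has one extra node and one extra edge.
D_GROW = graph_edit_distance(["add", "shl"], [(0, 1)],
                             ["add", "shl", "mul"], [(0, 1), (1, 2)])
```

Here `D_SUBST` counts a single vertex substitution and `D_GROW` counts one vertex insertion plus one edge insertion. This illustrates how the metric tolerates small structural mismatches between otherwise similar patterns.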
S.: Automatic selection of application-specific instruction-set extensions
- In: Proceedings of CODES+ISSS ’06, 2006
Cited by 4 (2 self)
In this paper, we present a general and efficient algorithm for automatic selection of new application-specific instructions under hardware resource constraints. The instruction selection is formulated as an ILP problem and efficient solvers can be used for finding the optimal solution. An important feature of our algorithm is that it is not restricted to basic-block level nor does it impose any limitation on the number of the newly added instructions or on the number of the inputs/outputs of these instructions. The presented results show that a significant overall application speedup is achieved even for large kernels (for the ADPCM decoder the speedup ranges from 1.2× to 3.7×) and that our algorithm compares well with other state-of-the-art algorithms for automatic instruction set extensions.
S.: A linear complexity algorithm for the automatic generation of convex multiple input multiple output instructions
- In: Proceedings of ARC 2007
Cited by 3 (2 self)
The Instruction-Set Extensions problem has been one of the major topics in recent years and it consists of the addition of a set of new complex instructions to a given Instruction-Set. This problem in its general formulation requires an exhaustive search of the design space to identify the candidate instructions, which gives the solution exponential complexity. In this paper we propose an efficient linear complexity algorithm for the automatic generation of convex Multiple Input Multiple Output (MIMO) instructions, whose convexity is theoretically guaranteed. The proposed approach is not restricted to basic-block level and does not impose limitations either on the number of inputs and/or outputs, or on the number of new instructions generated. Our results show a significant overall application speedup (up to 2.9× for the ADPCM decoder) considering the linear complexity of the proposed solution, which therefore compares well with other state-of-the-art algorithms for automatic instruction set extensions.
Optimizing Instruction-set Extensible Processors under Data Bandwidth Constraints
- In Proceedings of the Design, Automation and Test in Europe Conference, 2007
Cited by 3 (0 self)
We present a methodology for generating optimized architectures for data bandwidth constrained extensible processors. We describe a scalable Integer Linear Programming (ILP) formulation that extracts the most profitable set of instruction-set extensions given the available data bandwidth and transfer latency. Unlike previous approaches, we differentiate between the number of inputs and outputs for instruction-set extensions and the number of register file ports. This differentiation makes our approach applicable to architectures that include architecturally visible state registers and dedicated data transfer channels. We support a comprehensive design space exploration to characterize the area/performance trade-offs for various applications. We evaluate our approach using actual ASIC implementations to demonstrate that our automatically customized processors meet timing within the target silicon area. For an embedded processor with only two register read ports and one register write port, we obtain up to 4.3× speed-up with extensions incurring only a 35% area overhead.
Organizing Pattern Libraries for ASIP Design
- 2003
Cited by 2 (1 self)
In this paper we propose a new method to arrange a library of application-graph patterns. Such libraries are employed in the design process for Application-Specific Instruction-set Processors (ASIPs) to find opportunities to specialize a processor instruction-set for an application domain. In current approaches, these libraries are unordered data collections. Therefore, to search a library for a specific pattern entails comparing the pattern with each entry in the library, which is O(n*p) with n the total number of operation nodes of all patterns in the library and p the size of the pattern sought. Our new method employs identity operations to organize a library in such a way that a directed search strategy with only O(d) and d ≤ p is possible. Furthermore, the organization reveals synergies between patterns for the ASIP design process and for code generation.

