Results 1 - 10
of
55
Processor acceleration through automated instruction set customization
- In MICRO
, 2003
"... Application-specific extensions to the computational capabilities of a processor provide an efficient mechanism to meet the growing performance and power demands of embedded applications. Hardware, in the form of new function units (or co-processors), and the corresponding instructions, are added to ..."
Abstract
-
Cited by 70 (5 self)
- Add to MetaCart
Application-specific extensions to the computational capabilities of a processor provide an efficient mechanism to meet the growing performance and power demands of embedded applications. Hardware, in the form of new function units (or co-processors), and the corresponding instructions, are added to a baseline processor to meet the critical computational demands of a target application. The central challenge with this approach is the large degree of human effort required to identify and create the custom hardware units, as well as porting the application to the extended processor. In this paper, we present the design of a system to automate the instruction set customization process. A dataflow graph design space exploration engine efficiently identifies profitable computation subgraphs from which to create custom hardware, without artificially constraining their size or shape. The system also contains a compiler subgraph matching framework that identifies opportunities to exploit and generalize the hardware to support more computation graphs. We demonstrate the effectiveness of this system across a range of application domains and study the applicability of the custom hardware across the domain. 1.
Methods for Evaluating and Covering the Design Space during Early Design Development
- Integration, the VLSI Journal
, 2003
"... This paper gives an overview of methods used for Design Space Exploration (DSE) at the system- and micro-architecture levels. The DSE problem is considered to be two orthogonal issues: (I) How could a single design point be evaluated, (II) how could the design space be covered during the explorat ..."
Abstract
-
Cited by 43 (0 self)
- Add to MetaCart
This paper gives an overview of methods used for Design Space Exploration (DSE) at the system- and micro-architecture levels. The DSE problem is considered to be two orthogonal issues: (I) How could a single design point be evaluated, (II) how could the design space be covered during the exploration process? The latter question arises since an exhaustive exploration of the design space by evaluating every possible design point is usually prohibitive due to the sheer size of the design space. We therefore reveal trade-o#s linked to the choice of appropriate evaluation and coverage methods. The designer has to balance the following issues: the accuracy of the evaluation, the time it takes to evaluate one design point (including the implementation of the evaluation model), the precision/granularity of the design space coverage, and last but not least the possibilities for automating the exploration process. We also list common representations of the design space and compare current system and micro-architecture level design frameworks. This review thus eases the choice of a decent exploration policy by providing a comprehensive survey and classification of recent related work. It is focused on System-on-a-Chip designs, particularly those used for network processors. These systems are heterogeneous in nature using multiple computation, communication, memory, and peripheral resources.
A Pipelined Memory Architecture for High Throughput Network Processors
- 2003. 30th Annual International Symposium on Computer Architecture
, 2003
"... Designing ASICs for each new generation of backbone routers is a time intensive and fiscally draining process. In this paper we focus on the design of a programmable architecture for backbone routers, based on the manipulation of wide irregular memory words, that can provide a feasible design altern ..."
Abstract
-
Cited by 27 (3 self)
- Add to MetaCart
Designing ASICs for each new generation of backbone routers is a time intensive and fiscally draining process. In this paper we focus on the design of a programmable architecture for backbone routers, based on the manipulation of wide irregular memory words, that can provide a feasible design alternative to custom ASICs. We propose a pipelined memory design that emphasizes worst-case throughput over latency, and co-explore architectural tradeoffs with the design of several important network algorithms. Through this co-exploration, we show that a programmable architecture can efficiently exploit behavior inherent to most common network algorithms to keep up with next generation network speeds.
Profiling tools for hardware/software partitioning of embedded applications
- Proc. ACM Symp. On Languages, Compilers and Tools for Embedded Systems (LCTES 2003
, 2003
"... Loops constitute the most executed segments of programs and therefore are the best candidates for hardware software partitioning. We present a set of profiling tools that are specifically dedicated to loop profiling and do support combined function and loop profiling. One tool relies on an instructi ..."
Abstract
-
Cited by 18 (2 self)
- Add to MetaCart
Loops constitute the most executed segments of programs and therefore are the best candidates for hardware software partitioning. We present a set of profiling tools that are specifically dedicated to loop profiling and do support combined function and loop profiling. One tool relies on an instruction set simulator and can therefore be augmented with architecture and microarchitecture features simulation while the other is based on compile-time instrumentation of gcc and therefore has very little slow down compared to the original program We use the results of the profiling to identify the compute core in each benchmark and study the effect of compile-time optimization on the distribution of cores in a program. We also study the potential speedup that can be achieved using a configurable system on a chip, consisting of a CPU embedded on an FPGA, as an example application of these tools in hardware/software partitioning.
Energy Savings and Speedups from Partitioning Critical Software Loops to Hardware in Embedded Systems
- IEEE TRANSACTIONS ON EMBEDDED COMPUTER SYSTEMS
, 2004
"... ..."
Bitwidth Aware Global Register Allocation
- In POPL 03: Principles of Programming Languages
, 2002
"... use of subword data. Since registers are capable of holding a full data word, when a subword variable is assigned a register, only part of the register is used. New embedded processors have started supporting instruction sets that allow direct referencing of bit sections within registers and therefo ..."
Abstract
-
Cited by 15 (4 self)
- Add to MetaCart
use of subword data. Since registers are capable of holding a full data word, when a subword variable is assigned a register, only part of the register is used. New embedded processors have started supporting instruction sets that allow direct referencing of bit sections within registers and therefore multiple subword variables can be made to simultaneously reside in the same register without hindering accesses to these variables. However, a new register allocation algorithm is needed that is aware of the bitwidths of program variables and is capable of packing multiple subword variables into a single register. This paper presents one such algorithm.
Warp processors
- ACM Transactions on Design Automation of Electronic Systems (TODAES
, 2006
"... We describe a new processing architecture, known as a warp processor, that utilizes a field-programmable gate array (FPGA) to improve the speed and energy consumption of a software binary executing on a microprocessor. Unlike previous approaches that also improve software using an FPGA but do so usi ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
We describe a new processing architecture, known as a warp processor, that utilizes a field-programmable gate array (FPGA) to improve the speed and energy consumption of a software binary executing on a microprocessor. Unlike previous approaches that also improve software using an FPGA but do so using a special compiler, a warp processor achieves those improvements completely transparently and operates from a standard binary. A warp processor dynamically detects the binary’s critical regions, re-implements those regions as a custom hardware circuit in the FPGA, and replaces the software region by a call to the new hardware implementation of that region. While not all benchmarks can be improved using warp processing, many can, and the improvements are dramatically better than achievable by more traditional architecture improvements. The hardest part of warp processing is that of dynamically re-implementing code regions on an FPGA, requiring partitioning, decompilation, synthesis, placement, and routing tools, all having to execute with minimal computation time and data memory so as to coexist on-chip with the main processor. We describe our results of developing a warp processor. We developed a custom FPGA fabric specifically designed to enable lean place and route tools, and we developed extremely fast and efficient versions of partitioning, decompilation, synthesis, technology mapping, placement, and routing. Warp processors achieve overall application speedups of 6.3X with energy savings of 66 % across a set of embedded benchmark applications. We
Versatility and versabench: A new metric and a benchmark suite for flexible architectures
, 2004
"... For the last several decades, computer architecture research has largely benefited from, and continues to be driven by ad-hoc benchmarking. Often the benchmarks are selected to represent workloads that architects believe should run on the computational platforms they design. For example, benchmark s ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
For the last several decades, computer architecture research has largely benefited from, and continues to be driven by ad-hoc benchmarking. Often the benchmarks are selected to represent workloads that architects believe should run on the computational platforms they design. For example, benchmark suites such as SPEC, Winstone, and MediaBench, which represent workstation, desktop and media workloads respectively, have influenced computer architecture innovation for the last decade. Recently, advances in VLSI technology have created an increasing interest within the computer architecture community to build a new kind of processor that is more flexible than extant general purpose processors. Such new processor architectures must efficiently support a broad class of applications including graphics, networking, and signal processing in addition to the traditional desktop workloads. Thus, given the new focus on flexibility demands, a new benchmark suite and new metrics are necessary to accurately reflect the goals of the architecture community. This paper thus proposes (i) VersaBench as a new benchmark suite, and (ii) a new Versatility measure to characterize architectural flexibility, or in other words, the ability of the architecture to effectively execute a wide array of workloads. The benchmark suite is composed of applications drawn from several domains including desktop, server, stream, and bit-level processing. The Versatility measure is a single scalar metric inspired by the SPEC paradigm. It normalizes processor performance on each benchmark by that of the highest-performing machine for that application. This paper reports the measured versatility for several existing processors, as well as for some new and emerging research processors. The benchmark suite is freely distributed, and we are actively cataloging and sharing results for various reference processors. 1.
A case for clumsy packet processors
- in International Symposium on Microarchitecture
, 2004
"... Hardware faults can occur in any computer system. Although faults cannot be tolerated for most systems (e.g., servers or desktop processors), many applications (e.g., networking applications) provide robustness in software. However, processors do not utilize this resiliency, i.e., regardless of the ..."
Abstract
-
Cited by 11 (2 self)
- Add to MetaCart
Hardware faults can occur in any computer system. Although faults cannot be tolerated for most systems (e.g., servers or desktop processors), many applications (e.g., networking applications) provide robustness in software. However, processors do not utilize this resiliency, i.e., regardless of the application at hand, a processor is expected to operate completely fault-free. In this paper, we will question this traditional approach of complete correctness and investigate possible performance and energy optimizations when this correctness constraint is released. We first develop a realistic model that estimates the change in the fault rates according to the clock frequency of the cache. Then, we present a scheme that dynamically adjusts the clock frequency of the data caches to achieve the desired optimization goal, e.g., reduced energy or reduced access latency. Finally, we present simulation results investigating the optimal operation frequency of the data caches, where reliability is compromised in exchange of reduced energy and increased performance. Our simulation results indicate that the clock frequency of the data caches can be increased as much as 4 times without incurring a major penalty on the reliability. This also results in 41 % reduction in the energy consumed in the data caches and a 24 % reduction in the energy-delay-fallibility product. 1.
PacketBench: A Tool for Workload Characterization of Network Processing
, 2003
"... Network processing is becoming an increasingly important paradigm as the Internet moves towards an architecture with more complex functionality inside the network. Modern routers not only forward packets, but also process headers and payloads to implement a variety of functions related to security, ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
Network processing is becoming an increasingly important paradigm as the Internet moves towards an architecture with more complex functionality inside the network. Modern routers not only forward packets, but also process headers and payloads to implement a variety of functions related to security, performance, and customization. It is important to get a detailed understanding of the workloads associated with this processing in order to be able to develop efficient network processing engines. We present a tool called PacketBench, which provides a framework for implementing network processing applications and obtaining an extensive set of workload characteristics. PacketBench provides the support functions to handle various packet traces and manage packet memory. For statistics collection, PacketBench provides the ability to derive a number of microarchitectural and networking related metrics. We present the results of such measurements for four different networking applications ranging from simple packet forwarding to complex packet payload encryption. The results show that such workload analysis has a range of uses from network processor design to application optimization.

