HPL-PD architecture specification: Version 1.1
, 2000
Cited by 52 (6 self)
Keywords: instruction-level parallelism, parametric architecture, EPIC, VLIW, superscalar, speculative execution, predicated execution, programmatic cache control, run-time memory disambiguation, branch architecture. HPL-PD is a parametric processor architecture conceived for research in instruction-level parallelism (ILP). Its main purpose is to serve as a vehicle to investigate processor architectures having significant parallelism and to investigate the compiler technology needed to effectively exploit such architectures. The architecture is parametric in that it admits machines of different composition and scale, especially with respect to the nature and amount of parallelism offered. The architecture admits EPIC, VLIW and superscalar implementations so as to provide a basis for understanding the merits and demerits of these different styles of implementation. This report describes those parts of the architecture that are common to all machines in the family. It introduces the basic concepts such as the structure of an instruction, instruction execution semantics, the types of register files, etc. and describes the semantics of the operation repertoire.
High-Level Synthesis of Nonprogrammable Hardware Accelerators
- JOURNAL OF VLSI SIGNAL PROCESSING
, 2000
Cited by 51 (5 self)
The PICO-N system automatically synthesizes embedded nonprogrammable accelerators to be used as co-processors for functions expressed as loop nests in C. The output is synthesizable VHDL that defines the accelerator at the register transfer level (RTL). The system generates a synchronous array of customized VLIW (very-long instruction word) processors, their controller, local memory, and interfaces. The system also modifies the user's application software to make use of the generated accelerator. The user indicates the throughput to be achieved by specifying the number of processors and their initiation interval. In experimental comparisons, PICO-N designs are slightly more costly than hand-designed accelerators with the same performance.
Instruction Scheduling for Clustered VLIW DSPs
- In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques
, 2000
Cited by 44 (0 self)
Recent digital signal processors (DSPs) show a homogeneous VLIW-like data path architecture, which allows C compilers to generate efficient code. However, some special restrictions must still be obeyed in code generation for VLIW DSPs. To reduce the number of register file ports needed to provide data for multiple functional units working in parallel, the DSP data path may be clustered into several sub-paths, with very limited capabilities for exchanging values between the different clusters. An example is the well-known Texas Instruments C6201 DSP. For such an architecture, the tasks of scheduling and partitioning instructions between the clusters are highly interdependent. This paper presents a new instruction scheduling approach which, in contrast to earlier work, integrates partitioning and scheduling into a single technique to achieve high code quality. We show experimentally that the proposed technique is capable of generating more efficient code than a commer...
Automatic architectural synthesis of VLIW and EPIC processors
Cited by 26 (5 self)
architecture specification Architecting a VLIW processor is considerably more complex than a sequential one. In addition to picking an operation repertoire, one must specify the extent and nature of the processor's ILP. A VLIW processor, when designed by an expert architect, exhibits certain features which we want PICO-VLIW to emulate. For example, the processor may use heterogeneous functional units -- although one might include the ability to issue two adds every cycle, which requires two integer units, only one unit may be capable of shifting and the other unit able to do multiplication. The register file ports may be shared -- a multiply-add operation, which requires three register read ports, may be accommodated by "borrowing" one of the ports of another functional unit which cannot, now, be used in parallel with the multiply-accumulate. Likewise, instruction bits may be shared -- a load or store operation, which requires a long displacement field, might use the instruction bits that would otherwise have been used to specify an operation on some other functional unit. In order for PICO-VLIW to yield well-architected processors, the Spacewalker needs to be able to specify such architectures to the VLIW synthesis sub-system.
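The heterogeneous functional unit example in this abstract can be made concrete with a small sketch. It assumes a hypothetical machine with two integer units, both able to add, where only one can shift and only the other can multiply. The unit names, the `UNITS` table and the function `can_issue_together` are illustrative assumptions, not part of PICO-VLIW.

```python
from itertools import permutations
from typing import Dict, Sequence, Set

# Hypothetical machine: two integer units that can both add, but only
# one of them can shift and only the other can multiply.
UNITS: Dict[str, Set[str]] = {
    "int0": {"add", "shift"},
    "int1": {"add", "mul"},
}


def can_issue_together(ops: Sequence[str],
                       units: Dict[str, Set[str]] = UNITS) -> bool:
    """Return True if every operation in `ops` can be given its own unit.

    Brute force: try every assignment of distinct units to the operations
    and accept the first one in which each unit supports its operation.
    """
    if len(ops) > len(units):
        return False  # more operations than units: cannot issue in one cycle
    for assignment in permutations(units, len(ops)):
        if all(op in units[unit] for op, unit in zip(ops, assignment)):
            return True
    return False
```

Under these assumptions two adds can issue in the same cycle, as can a shift paired with a multiply, whereas two shifts or two multiplies cannot. Brute force is adequate for a handful of units; a larger machine description would more likely use bipartite matching.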
Generating Cache Hints for Improved Program Efficiency
- JOURNAL OF SYSTEMS ARCHITECTURE
, 2004
Cited by 23 (4 self)
Cache hints are one of the new extensions in EPIC architectures. Two kinds of hints can be attached to each memory instruction: a source cache hint and a target cache hint. The source hint indicates the true latency of the instruction, which is used by the compiler to improve the instruction schedule. The target hint indicates at which cache levels it is profitable to retain data, which allows better cache replacement decisions at run time. A compile-time method is presented which calculates appropriate cache hints. Both kinds of hints are based on the locality of the instruction, measured by the reuse distance metric. Two
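As a rough illustration of the reuse distance metric named in this abstract, the sketch below computes, for each access in an address trace, how many distinct other addresses were touched since the previous access to the same address. The function name `reuse_distances`, the sample trace and the use of `None` for first-time accesses are assumptions made for illustration; the paper's hint-selection method itself is not reproduced here.

```python
from typing import Hashable, List, Optional, Sequence


def reuse_distances(trace: Sequence[Hashable]) -> List[Optional[int]]:
    """Return the reuse distance of every access in `trace`.

    The reuse distance of an access is the number of distinct addresses
    referenced since the previous access to the same address. The first
    access to an address has no previous access and is reported as None.
    """
    stack: List[Hashable] = []  # LRU stack, most recently used address last
    distances: List[Optional[int]] = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Everything above `addr` in the stack was touched since its last use.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(addr)  # `addr` becomes the most recently used entry
    return distances


if __name__ == "__main__":
    print(reuse_distances(["a", "b", "c", "a", "b", "b"]))
```

In a fully associative LRU cache, an access hits exactly when its reuse distance is smaller than the number of cache lines, which is why the metric is a natural basis for deciding at which levels data is worth retaining.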
Embedded Computing: New Directions in Architecture and Automation
- In 7th International Conference on High-Performance Computing (HiPC2000)
, 2000
Cited by 10 (0 self)
this report, we elaborate on these claims and provide, as an example, an overview of PICO, the architecture synthesis system that the authors and their colleagues have been developing over the past five years
Cycle-time Aware Architecture Synthesis of Custom Hardware Accelerators
- In: Proceedings of International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES)
, 2002
Cited by 8 (0 self)
We present the cycle-time aware architecture synthesis methodology used in PICO-NPA that automatically synthesizes minimal cost RT-level designs from high-level specifications to meet a given cycle-time. This allows subsequent physical synthesis to succeed on first pass with predictable performance. The core of the methodology is a static timing analysis engine that is used at multiple levels -- program-level, architecture-level and RT-level -- in order to identify, schedule and validate useful operator chains that are incorporated into the design automatically. We present architecture synthesis results for several embedded applications and evaluate the benefits of this technique.
Tailoring Pipeline Bypassing and Functional Unit Mapping to Application in Clustered VLIW Architectures
- In Proc. of CASES
, 2001
Cited by 7 (0 self)
paper we describe a design exploration methodology for clustered VLIW architectures. The central idea of this work is a set of three techniques aimed at reducing the cost of expensive inter-cluster copy operations. Instruction scheduling is performed using a list-scheduling algorithm that stores operand chains into the same register file. Functional units are assigned to clusters based on the application inter-cluster communication pattern. Finally, a careful insertion of pipeline bypasses is used to increase the number of data-dependencies that can be satisfied by pipeline register operands. Experimental results, using the SPEC95 benchmark and the IMPACT compiler, reveal a substantial reduction in the number of copies between 1.
Code Size Minimization and Retargetable Assembly for Custom EPIC and VLIW Instruction Formats
- HP Labs
, 2000
Cited by 6 (1 self)
this paper is to describe a series of code size minimization techniques used within PICO, some of which are applied during the automatic design of the instruction format, while others are applied during program assembly. The design of a retargetable assembler to support these techniques also poses certain novel challenges which constitute the second focus of this paper. Contrary to widely held perceptions, we demonstrate that it is entirely possible to design VLIW and EPIC processors, which are capable of issuing large numbers of operations per cycle, but whose code size is only moderately larger than that for a sequential CISC processor
Heuristic Tradeoffs between Latency and Energy Consumption in Register Assignment
, 2000
Cited by 3 (2 self)
One of the challenging tasks in code generation for embedded systems is register allocation and assignment, wherein one decides on the placement and lifetimes of variables in registers. When there are more live variables than registers, some variables need to be spilled to memory and restored later. In this paper we propose a policy that minimizes the number of spills, which is critical for portable embedded systems since it leads to a decrease in energy consumption. We argue, however, that schedules with a minimal number of spills do not necessarily have minimum latency. Accordingly, we propose a class of policies that explore tradeoffs between assignments leading to schedules with low latency and those leading to low energy consumption, and show how to tune them to particular datapath characteristics. Based on experimental results, we propose a criterion to select a register assignment policy that, for 99% of the cases we considered, minimizes both latency and energy consumption associated with spills to memory.
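To make the notion of a spill concrete, the sketch below counts spills for a straight-line sequence of variable uses and a fixed number of registers. It uses the classic furthest-next-use eviction heuristic, which is swapped in here purely for illustration and is not the policy proposed in the paper. The function name `count_spills` and the use sequence are assumptions.

```python
from typing import List, Sequence


def count_spills(uses: Sequence[str], num_registers: int) -> int:
    """Count spills for a sequence of variable uses.

    A variable is loaded into a register when it is used. If all registers
    are occupied, the resident variable whose next use lies furthest in the
    future is evicted. An eviction counts as a spill only when the evicted
    variable is used again later; dead variables are simply dropped.
    """
    if num_registers < 1:
        raise ValueError("num_registers must be at least 1")

    in_regs: List[str] = []
    spills = 0
    for i, var in enumerate(uses):
        if var in in_regs:
            continue  # already resident, nothing to do

        if len(in_regs) == num_registers:
            def next_use(v: str) -> int:
                # Index of the next use of `v` after position i,
                # or len(uses) if `v` is never used again.
                for j in range(i + 1, len(uses)):
                    if uses[j] == v:
                        return j
                return len(uses)

            victim = max(in_regs, key=next_use)
            if next_use(victim) < len(uses):
                spills += 1  # victim is still live, so it must be spilled
            in_regs.remove(victim)

        in_regs.append(var)
    return spills
```

For example, `count_spills(["a", "b", "c", "a", "b", "d", "a"], 2)` evicts `b` while it is still live and so reports one spill, whereas the same sequence with three registers needs none. The sketch models only the spill count; it says nothing about the latency side of the tradeoff the abstract describes.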

