Results 1 - 10 of 10
Automated Compilation of Concurrent Programs into Self-timed Circuits
1987
Cited by 19 (2 self)
Contents
1 The Programming Language
  1.1 Sequential Constructs
  1.2 Procedures
  1.3 Concurrency
  1.4 Definition and Instantiation
  1.5 Grammar Rules
  1.6 New Constructs
2 Decomposition
  2.1 Abbreviations
  2.2 Decomposition and Sub-Processes
  2.3 Syntactic Decomposition
Algorithmic Aspects of Symbolic Switch Network Analysis
IEEE Trans. CAD/IC, 1987
Cited by 16 (5 self)
A network of switches controlled by Boolean variables can be represented as a system of Boolean equations. The solution of this system gives a symbolic description of the conducting paths in the network. Gaussian elimination provides an efficient technique for solving sparse systems of Boolean equations. For the class of networks that arise when analyzing digital metal-oxide semiconductor (MOS) circuits, a simple pivot selection rule guarantees that most s-switch networks encountered in practice can be solved with O(s) operations. When represented by a directed acyclic graph, the set of Boolean formulas generated by the analysis has total size bounded by the number of operations required by the Gaussian elimination. This paper presents the mathematical basis for systems of Boolean equations, their solution by Gaussian elimination, and data structures and algorithms for representing and manipulating Boolean formulas.
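The idea of eliminating nodes to obtain symbolic conduction formulas can be illustrated with a minimal sketch. It is a dense, cubic-time elimination over the Boolean semiring, not the sparse O(s) method or the pivot rule the abstract describes. The formula representation (nested tuples) and the names `f_and`, `f_or`, `evaluate`, and `conduction_formulas` are assumptions made for illustration only.

```python
def f_and(a, b):
    """Symbolic AND with constant folding."""
    if a is False or b is False:
        return False
    if a is True:
        return b
    if b is True:
        return a
    return a if a == b else ("and", a, b)


def f_or(a, b):
    """Symbolic OR with constant folding."""
    if a is True or b is True:
        return True
    if a is False:
        return b
    if b is False:
        return a
    return a if a == b else ("or", a, b)


def evaluate(formula, env):
    """Evaluate a formula under an assignment of control variables."""
    if isinstance(formula, bool):
        return formula
    if isinstance(formula, str):
        return env[formula]
    op, left, right = formula
    if op == "and":
        return evaluate(left, env) and evaluate(right, env)
    return evaluate(left, env) or evaluate(right, env)


def conduction_formulas(num_nodes, switches):
    """switches: list of (node_i, node_j, control_variable_name).

    Returns a matrix whose entry [i][j] is a formula that is true
    exactly when a conducting path joins node i and node j.
    """
    a = [[i == j for j in range(num_nodes)] for i in range(num_nodes)]
    for i, j, var in switches:
        a[i][j] = f_or(a[i][j], var)
        a[j][i] = f_or(a[j][i], var)
    for k in range(num_nodes):  # eliminate node k (the pivot)
        for i in range(num_nodes):
            for j in range(num_nodes):
                a[i][j] = f_or(a[i][j], f_and(a[i][k], a[k][j]))
    return a


# Hypothetical three-node network: switches x (0-1), y (1-2), z (0-2).
formulas = conduction_formulas(3, [(0, 1, "x"), (1, 2, "y"), (0, 2, "z")])
```

Here `formulas[0][2]` is logically equivalent to z OR (x AND y); sharing subformulas in a DAG, as the abstract mentions, would keep the total formula size bounded by the number of elimination steps.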
A technique to determine power-efficient, high-performance superscalar processors
In Proceedings of the Twenty-Eighth Hawaii International Conference on System Sciences, 1995
Cited by 7 (0 self)
Processor performance advances are increasingly inhibited by limitations in thermal power dissipation. Part of the problem is the lack of architectural power estimates before implementation. Although high-performance designs exist that dissipate low power, the method for finding these designs has been through trial-and-error. This paper presents systematic techniques to find low-power, high-performance superscalar processors tailored to specific user benchmarks. The model of power is novel because it separates power into architectural and technology components. The architectural component is found via trace-driven simulation, which also produces performance estimates. An example technology model is presented that estimates the technology component, along with critical delay time and real estate usage. This model is based on case studies of actual designs. It is used to solve an important problem: increasing the duplication in superscalar execution units without excessive power consumption. Results are presented from runs using simulated annealing to maximize processor performance subject to power and area constraints. The major contributions of this paper are the separation of architectural and technology components of dynamic power, the use of trace-driven simulation for architectural power measurement, and the use of a near-optimal search to tailor a processor design to a benchmark.
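The constrained search described above can be sketched as a small simulated-annealing loop. Everything numeric here is a hypothetical toy model (unit types, per-unit power and area, the instruction mix, and the diminishing-returns performance function); the paper obtains its figures from trace-driven simulation and a technology model, neither of which is reproduced.

```python
import math
import random

# Hypothetical toy models, in arbitrary units (assumptions, not paper data).
UNITS = ("alu", "fpu", "load_store")
POWER_PER_UNIT = {"alu": 1.0, "fpu": 3.0, "load_store": 2.0}
AREA_PER_UNIT = {"alu": 1.0, "fpu": 4.0, "load_store": 2.5}
USAGE = {"alu": 0.5, "fpu": 0.2, "load_store": 0.3}  # made-up instruction mix


def performance(design):
    # Diminishing returns: each extra copy of a unit helps less.
    return sum(USAGE[u] * math.log2(1 + design[u]) for u in UNITS)


def power(design):
    return sum(POWER_PER_UNIT[u] * design[u] for u in UNITS)


def area(design):
    return sum(AREA_PER_UNIT[u] * design[u] for u in UNITS)


def feasible(design, max_power, max_area):
    return power(design) <= max_power and area(design) <= max_area


def anneal(max_power, max_area, steps=5000, seed=0):
    """Maximize performance subject to power and area limits."""
    rng = random.Random(seed)
    current = {u: 1 for u in UNITS}
    if not feasible(current, max_power, max_area):
        raise ValueError("budgets too small for one copy of each unit")
    best = dict(current)
    temperature = 1.0
    for _ in range(steps):
        candidate = dict(current)
        unit = rng.choice(UNITS)
        candidate[unit] = max(1, candidate[unit] + rng.choice((-1, 1)))
        if feasible(candidate, max_power, max_area):
            delta = performance(candidate) - performance(current)
            # Always accept improvements; accept losses with a
            # probability that shrinks as the temperature cools.
            if delta >= 0 or rng.random() < math.exp(delta / temperature):
                current = candidate
                if performance(current) > performance(best):
                    best = dict(current)
        temperature = max(1e-3, temperature * 0.999)
    return best


best_design = anneal(max_power=20.0, max_area=25.0)
```

Infeasible candidates are simply rejected, so every visited design meets both budgets; a penalty term would be an alternative way to handle the constraints.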
Minimizing Energy Dissipation in High-Speed Multipliers
Proc. IEEE Symp. on Low Power Electronics and Design, 1997
Cited by 5 (0 self)
This paper presents a new two-gate-delay implementation of the Booth encoder and partial product generator, which eliminates the unnecessary glitches associated with the Booth multiplier. In addition, modified signed/unsigned (MSU) and modified sign-generate (MSG) algorithms, suitable especially for signed/unsigned multipliers, were developed in order to reduce the compression level needed in the Wallace tree, and hence reduce the multiplier hardware. Using these features reduces the multiplier array energy dissipation by about 30% and increases speed by about 10%.
VLSI Implementation of a Run-time Configurable Computing Integrated Circuit - The Stallion Chip
1998
Cited by 2 (0 self)
Reconfigurable computing architectures are gaining popularity as a replacement for general-purpose architectures for many high-performance embedded applications. These machines support parallel computation and direct the data from the producers of an intermediate result to the consumers over custom pathways. The Wormhole Run-Time Reconfigurable (RTR) computing architecture is a concept developed at Virginia Tech to address the weaknesses of contemporary FPGAs for configurable computing. The Stallion chip is a full-custom configurable computing "FPGA"-like integrated circuit with a coarse-grained nature. Based on the result of the first generation device, the Colt chip, the Stallion chip is a follow-up configurable computing chip. This thesis focuses on the VLSI layout implementation of the Stallion chip. Effort has been made to explain many facts and advantages of the Wormhole Configurable Computing Machine (CCM). Design techniques, strategies, circuit characterization, performance estimation, and ways to solve problems when using CAD layout design tools are illustrated.
SCALABLE TEST GENERATORS FOR HIGH-SPEED DATAPATH CIRCUITS
Cited by 2 (1 self)
This paper explores the design of efficient test sets and test-pattern generators for on-line BIST. The target applications are high-performance, scalable datapath circuits for which fast and complete fault coverage is required. Because of the presence of carry-lookahead, most existing BIST methods are unsuitable for these applications. High-level models are used to identify potential test sets for a small version of the circuit to be tested. Then a regular test set is extracted and a test generator TG is designed to meet the following goals: scalability, small test set size, full fault coverage, and very low hardware overhead. TG takes the form of a twisted ring counter with a small decoder array. We apply our technique to various datapath circuits including a carry-lookahead adder, an arithmetic-logic unit, and a multiplier-adder.
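For readers unfamiliar with the twisted ring (Johnson) counter mentioned above, a minimal behavioral sketch follows. It models only the counter, a shift register whose last bit is inverted and fed back to the first stage; the decoder array that maps counter states to test patterns is specific to the paper and is not modeled. The function name `twisted_ring_sequence` is an assumption.

```python
def twisted_ring_sequence(width):
    """Return the 2*width states of a twisted ring counter, starting at all zeros."""
    state = [0] * width
    states = []
    for _ in range(2 * width):
        states.append(tuple(state))
        # Shift by one position, feeding back the complement of the last bit.
        state = [1 - state[-1]] + state[:-1]
    return states


for s in twisted_ring_sequence(3):
    print(s)
# (0, 0, 0)
# (1, 0, 0)
# (1, 1, 0)
# (1, 1, 1)
# (0, 1, 1)
# (0, 0, 1)
```

A counter of width n cycles through only 2n states, so the number of states grows linearly with the datapath width, which is in keeping with the scalability and small-test-set goals listed in the abstract.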
The HIPERLAN Equalizer ASIC Complexity and its Relationship With the Training Header
Cited by 1 (0 self)
In this document, an attempt is made to estimate the size of the HIPERLAN equalizer ASIC by extrapolating from existing adaptive equalizer ASICs and making certain assumptions regarding the process and methodology used to design the HIPERLAN equalizer. Two scenarios are considered: first, an equalizer using the LMS algorithm every baud to update the tap coefficients, and second, an equalizer using an RLS algorithm performing a set of updates every ten baud intervals. It was discovered that the computational complexity of both approaches is within the same order of magnitude; however, the LMS ASIC will occupy at most half the size of its RLS counterpart. The smaller IC will reduce the overall system cost at the expense of the longer convergence time required by the LMS algorithm. The paper also demonstrates that when the overall packet processing delay is taken into account, the slower-converging and cheaper LMS-type equalizer will actually produce faster turnaround times for short packets than its RLS counterpart. Thus, a new training header length is suggested that would allow vendors some flexibility in choosing the structure that best suits their product.
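The per-baud LMS tap update referred to above can be sketched as follows. This is the textbook real-valued LMS rule, shown on a toy training loop in which the taps learn to match a known reference filter; it is not the HIPERLAN equalizer itself. The step size, tap count, reference taps, and the names `lms_update` and `train` are assumptions for illustration.

```python
import random


def lms_update(taps, window, desired, step):
    """One LMS iteration: filter, form the error, nudge every tap."""
    output = sum(w * x for w, x in zip(taps, window))
    error = desired - output
    new_taps = [w + step * error * x for w, x in zip(taps, window)]
    return new_taps, error


def train(reference_taps, num_symbols=2000, step=0.05, seed=0):
    """Run one update per symbol on random +/-1 training symbols."""
    rng = random.Random(seed)
    n = len(reference_taps)
    taps = [0.0] * n
    window = [0.0] * n
    for _ in range(num_symbols):
        window = [rng.choice((-1.0, 1.0))] + window[:-1]
        desired = sum(h * x for h, x in zip(reference_taps, window))
        taps, _ = lms_update(taps, window, desired, step)
    return taps


learned = train([0.9, -0.3, 0.1])  # hypothetical reference taps
```

Each update costs a number of multiply-accumulates proportional to the tap count, and the taps approach the reference only gradually, which is the convergence-time cost the abstract weighs against the smaller ASIC.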
A New Recursive Multibit Recoding Algorithm for High-Speed and Low-Power Multiplier
ISSN 1546-1998, American Scientific Publishers (ASP), 2012
Cited by 1 (1 self)
In this paper, a new recursive multibit recoding multiplication algorithm is introduced. It provides a general space-time partitioning of the multiplication problem that not only enables a drastic reduction of the number of partial products (n/r), but also eliminates the need of pre-computing odd multiples of the multiplicand in higher radix (β ≥ 8) multiplication. Based on a mathematical proof that any higher radix β = 2^r can be recursively derived from a combination of two or a number of lower radices, a series of generalized radix β = 2^r multipliers are generated by means of primary radices: 2^1, 2^2, 2^5, and 2^8. A variety of higher-radix (2^3 to 2^32) two’s complement 64x64 bit serial/parallel multipliers are implemented on Virtex-6 FPGA and characterized in terms of multiply-time, energy consumption per multiply-operation, and area occupation for r values varying from 2 to 64. Compared to a reference algorithm, savings of 8%, 52%, 63% are respectively obtained in terms of speed, power, and area. In addition, a new low-power and highly flexible radix-2^r adapted technique for multi-precision multiplication is presented.
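The n/r reduction in partial products can be made concrete with a sketch of conventional radix-2^r signed-digit (Booth-style) recoding. This is the standard scheme, not the recursive algorithm the abstract introduces; in particular, it still produces odd digits such as ±3 for r = 3, which is the pre-computation the paper says it avoids. The function names are assumptions.

```python
def recode_radix_2r(multiplier, bits, r):
    """Recode a signed `bits`-wide multiplier into ceil(bits / r) digits.

    Each digit lies in [-2**(r-1), 2**(r-1)], and the digits satisfy
    multiplier == sum(d * 2**(r*j) for j, d in enumerate(digits)).
    """
    num_digits = -(-bits // r)                 # ceiling division
    width = num_digits * r
    pattern = multiplier & ((1 << width) - 1)  # sign-extended bit pattern
    digits = []
    carry_in = 0                               # top bit of the previous group
    for j in range(num_digits):
        group = (pattern >> (r * j)) & ((1 << r) - 1)
        top = group >> (r - 1)
        # Read the group as an r-bit signed value, then add the bit below it.
        digits.append(group - (top << r) + carry_in)
        carry_in = top
    return digits


def multiply(multiplicand, multiplier, bits, r):
    """Sum one partial product per recoded digit."""
    digits = recode_radix_2r(multiplier, bits, r)
    return sum(d * multiplicand * (1 << (r * j)) for j, d in enumerate(digits))


print(recode_radix_2r(-3, 4, 2))   # [1, -1], i.e. 1 - 4 = -3
print(len(recode_radix_2r(12345, 64, 8)))  # 8 partial products instead of 64
```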
6.2 Static CMOS Design
In-depth discussion of logic families in CMOS: static and dynamic, pass-transistor, non-ratioed and ratioed logic
Lecture #6: Memory Systems – Data distribution
In our discussion of vector architectures we noticed that a memory system that allows two loads and a store concurrently may significantly enhance performance. Most of today’s microprocessors have only a single port to memory. The first supercomputer, the Cray-1, also had only a single channel between the processor and main memory, while later Cray models have three memory channels for data and instructions. The two-load, one-store channel architecture nicely supports three-operand instructions such as x ← y × z, the AXPY operation y ← y + α × x where α is a constant, and instructions with fewer operands. But even two load channels and one store channel to memory are not sufficient for certain operations, such as y ← y + x × z, which in vectorized form is common for band matrix operations, as we will see later. If the memory cycle time is longer than the processor cycle time, as is indeed the case today for most commercial computer systems, then the number of concurrent requests the memory system must serve increases further. In our description of pipelining we noticed that the memory system for a three-operand instruction with a memory cycle time four times longer than the processor cycle time had to support 12 concurrent requests in order to fully support a single processor. In general, for P processors and K operands per instruction, the memory system must support P × K × (memory cycle time / processor cycle time) concurrent requests for a design to be balanced with respect to processing power and memory bandwidth. In addition, the memory system must be able to support instruction fetching. For vector architectures, one instruction fetch suffices for each vector instruction. Thus, for a vector instruction set, the bandwidth required for instructions may add on the order of 1% to the bandwidth required for data (one or two words for each instruction covering MVL to 4*MVL data items).
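The balance condition above is simple enough to check numerically. The following is a small sketch of the formula P × K × (memory cycle time / processor cycle time); the function and parameter names are assumptions, and the cycle times only need to share a unit.

```python
def concurrent_requests(processors, operands_per_instruction,
                        memory_cycle_time, processor_cycle_time):
    """Concurrent memory requests needed for a balanced design."""
    ratio = memory_cycle_time / processor_cycle_time
    return processors * operands_per_instruction * ratio


# One processor, three-operand instructions, memory four times slower:
print(concurrent_requests(1, 3, 4, 1))  # 12.0
```

The result matches the 12 concurrent requests quoted for the pipelining example, and it scales linearly in each of the three factors.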
For a 10,000-core system with all cores executing three-operand instructions and a ratio of memory cycle time to processor cycle time of 10, the memory system must concurrently support requests for 300,000 operands if there is no data reuse. For 64-bit operands,