Special Purpose Parallel Computing
Lectures on Parallel Computation, 1993
Cited by 77 (5 self)
A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing [365] demonstrated that, in principle, a single general purpose sequential machine could be designed which would be capable of efficiently performing any computation which could be performed by a special purpose sequential machine. The importance of this universality result for subsequent practical developments in computing cannot be overstated. It showed that, for a given computational problem, the additional efficiency advantages which could be gained by designing a special purpose sequential machine for that problem would not be great. Around 1944, von Neumann produced a proposal [66, 389] for a general purpose stored-program sequential computer which captured the fundamental principles of...
Communication-Efficient Parallel Algorithms for Distributed Random-Access Machines
Algorithmica, 1988
Cited by 38 (2 self)
This paper introduces a model for parallel computation, called the distributed random-access machine (DRAM), in which the communication requirements of parallel algorithms can be evaluated. A DRAM is an abstraction of a parallel computer in which memory accesses are implemented by routing messages through a communication network. A DRAM explicitly models the congestion of messages across cuts of the network. We introduce the notion of a conservative algorithm as one whose communication requirements at each step can be bounded by the congestion of pointers of the input data structure across cuts of a DRAM. We give a simple lemma that shows how to "shortcut" pointers in a data structure so that remote processors can communicate without causing undue congestion. We give O(lg n)-step, linear-processor, linear-space, conservative algorithms for a variety of problems on n-node trees, such as computing tree-walk numberings, finding the separator of a tree, and evaluating all subexpressions ...
A Family of Adders
In Proceedings of 14th IEEE Symposium on Computer Arithmetic, 1999
Cited by 37 (0 self)
Binary carry-propagating addition can be efficiently expressed as a prefix computation. Several examples of adders based on such a formulation have been published, and efficient implementations are numerous. Chief among the known constructions are those of Kogge & Stone and Ladner & Fischer. In this work we show that these are end cases of a large family of addition structures, all of which share the attractive property of minimum logical depth. The intermediate structures allow tradeoffs between the amount of internal wiring and the fanout of intermediate nodes, and can thus usually achieve a more attractive combination of speed and area/power cost than either of the known end cases. Rules for the construction of such adders are given, as are examples of realistic 32b designs implemented in an industrial 0u25 CMOS process. 1. Introduction There are many ways of formulating the process of binary addition. Each different way provides different insight and thus suggests different impl...
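The idea that carry-propagating addition is a prefix computation can be illustrated with a small software model. The sketch below is not taken from the paper: the function name `prefix_add`, the `width` parameter, and the choice of a Kogge & Stone style network (each level combines a node with the node `d` positions below it) are assumptions made for illustration only. It computes per-bit generate/propagate pairs, combines them with the associative operator (g, p) o (g', p') = (g | p & g', p & p'), and derives the sum bits from the resulting carries.

```python
def prefix_add(a: int, b: int, width: int = 32) -> int:
    """Add two unsigned integers modulo 2**width using a prefix network.

    Illustrative model only: a Kogge-Stone-style arrangement is assumed.
    """
    # Per-bit generate (both inputs 1) and propagate (exactly one input 1).
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]
    half_sum = p[:]  # bitwise XOR of the inputs, reused for the sum bits

    # Prefix phase. After the level with span d, the pair (g[i], p[i])
    # summarizes bits max(0, i - 2*d + 1) .. i.
    d = 1
    while d < width:
        new_g, new_p = g[:], p[:]
        for i in range(d, width):
            new_g[i] = g[i] | (p[i] & g[i - d])
            new_p[i] = p[i] & p[i - d]
        g, p = new_g, new_p
        d *= 2

    # g[i] is now the carry out of bit i, so the carry into bit i is g[i - 1].
    result = 0
    for i in range(width):
        carry_in = g[i - 1] if i > 0 else 0
        result |= (half_sum[i] ^ carry_in) << i
    return result
```

The loop over `d` runs about log2(width) times, which mirrors the logarithmic logical depth the abstract refers to; other members of the family would change which earlier node each position combines with, not the operator itself.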
Speculative completion for the design of high-performance asynchronous dynamic adders
In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems. IEEE Computer, 1997
Cited by 33 (8 self)
This paper presents an in-depth case study in high-performance asynchronous adder design. A recent method, called “speculative completion”, is used. This method uses single-rail bundled datapaths but also allows early completion. Five new dynamic designs are presented for Brent-Kung and Carry-Bypass adders. Furthermore, two new architectures are introduced, which target (i) small number addition, and (ii) hybrid operation. Initial SPICE simulation and statistical analysis show performance improvements up to 19% on random inputs and 14% on actual programs for 32-bit adders, and up to 29% on random inputs for 64-bit adders, over comparable synchronous designs.
Cost Reduction and Evaluation of a Temporary Faults Detecting Technique
2000
Cited by 31 (3 self)
IC technologies are approaching the ultimate limits of silicon in terms of channel width, power supply and speed. As they approach these limits, circuits are becoming increasingly sensitive to noise, which will result in unacceptable rates of soft errors. Furthermore, defect behavior is becoming increasingly complex, resulting in an increasing number of timing faults that can escape detection by fabrication testing. Thus, fault-tolerant techniques will become necessary even for commodity applications. This work considers the implementation and improvement of a new soft-error and timing-error detecting technique based on time redundancy. Arithmetic circuits were used as a test vehicle to validate the approach. Simulations and performance evaluations of the proposed detection technique were made using time and logic simulators. The obtained results show that detection of such temporal faults can be achieved by means of meaningful hardware and performance cost.
RSA Hardware Implementation
1995
Cited by 24 (1 self)
Introduction to Arithmetic for Digital System Designers. New York, NY: Holt, Rinehart and Winston, 1982.
[14] Ç. K. Koç and C. Y. Hung. Multi-operand modulo addition using carry save adders. Electronics Letters, 26(6):361-363, 15th March 1990.
[15] Ç. K. Koç and C. Y. Hung. Bit-level systolic arrays for modular multiplication. Journal of VLSI Signal Processing, 3(3):215-223, 1991.
[16] M. Kochanski. Developing an RSA chip. In H. C. Williams, editor, Advances in Cryptology - CRYPTO 85, Proceedings, Lecture Notes in Computer Science, No. 218, pages 350-357. New York, NY: Springer-Verlag, 1985.
[17] I. Koren. Computer Arithmetic Algorithms. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[18] D. C. Kozen. The Design and Analysis of Algorithms. New York, NY: Springer-Verlag, 1992.
[19] R. Ladner and M. Fischer. Parallel prefix computation. Journal of the ACM, 27(4):831-838, October 1980.
[20] S.
Digit-Set Conversions: Generalizations and Applications
IEEE Transactions on Computers, 1995
Cited by 22 (5 self)
The problem of digit set conversion for fixed radix is investigated for the case of converting into a nonredundant, as well as into a redundant digit set. Conversion may be from very general digit sets, and covers as special cases multiplier recodings, additions and certain multiplications. We generalize known algorithms for conversions into nonredundant digit sets, as well as apply conversion to generalize the O(log n) time algorithm for conditional sum addition using parallel prefix computation, and a comparison is made with standard carry-lookahead techniques. Examples on multi-operand addition are used to illustrate the generality of this approach. O(1) time algorithms for converting into redundant digit sets are generalized based on a very simple lemma, which provides a framework for all conversions into redundant digit sets. Applications in multiplier recoding and partial product accumulation are used here as exemplifications. Keywords: Computer arithmetic, digit set conversio...
Wired: Wire-aware circuit design
In Proc. of Conference on Correct Hardware Design and Verification Methods (CHARME), 2005
Cited by 19 (2 self)
Routing wires are dominant performance stoppers in deep sub-micron technologies, and there is an urgent need to take them into account already at higher levels of abstraction. However, the normal design flow gives the designer only limited control over the details of the lower levels, risking the quality of the final result. We propose a language, called Wired, which lets the designer express circuit function together with layout, in order to get more precise control over the result. The complexity of larger designs is managed by using parameterised connection patterns. The resulting circuit descriptions are compact, and yet capture detailed layout, including the size and positions of wires. We are able to analyse non-functional properties of these descriptions, by “running” them using non-standard versions of the wire and gate primitives. The language is relational, which means that we can build forwards, backwards and bi-directional analyses. Here, we show the description and analysis of various parallel prefix circuits, including a novel structure with small depth and low fanout.
Non-Heuristic Optimization and Synthesis of Parallel-Prefix Adders
In Proc. Int. Workshop on Logic and Architecture Synthesis, 1996
Cited by 16 (6 self)
The class of parallel-prefix adders comprises the most area-delay efficient adder architectures (such as the ripple-carry, the carry-increment, and the carry-lookahead adders) for the entire range of possible area-delay tradeoffs. The generic description of these adders as prefix structures allows their simple and consistent area optimization and synthesis under given timing constraints, including non-uniform input and output signal arrival times. This paper presents an efficient non-heuristic algorithm for the generation of size-optimal parallel-prefix structures under arbitrary depth constraints. Keywords: Parallel-prefix adders, non-heuristic synthesis algorithm, circuit timing and area optimization, computer arithmetic, cell-based VLSI. 1 Introduction Cell-based design techniques, such as standard cells and FPGAs, together with versatile hardware synthesis are prerequisites for a high productivity in ASIC design. For the implementation of arithmetic components, the designer ...
The Impact of 3-Dimensional Integration on the Design of Arithmetic Units
2006
Cited by 16 (3 self)
3-Dimensional integration technology stacks multiple die on top of each other with a dense die-to-die interface. This enables a circuit designer to replace long wires with short vertical interconnects, thus reducing wire-related delay and power consumption. In this research, we evaluate the impact of a 3D fabrication technology on the latency and power of arithmetic functional units. Specifically, we study integer adders and shifters as they have very different delay characteristics. An adder’s critical path latency is dominated by logic/gate delays, while a shifter’s latency is more greatly affected by wire delay. We demonstrate that the potential benefits of a 3D technology are the greatest when applied to wire-bound circuits. In particular, a barrel shifter implemented in 3D exhibits a 9% reduction in latency with a simultaneous 8% reduction in energy.