Results 1–10 of 143
Translating pseudo-Boolean constraints into SAT
 Journal on Satisfiability, Boolean Modeling and Computation, 2006
Abstract

Cited by 181 (2 self)
In this paper, we describe and evaluate three different techniques for translating pseudo-Boolean constraints (linear constraints over Boolean variables) into clauses that can be handled by a standard SAT solver. We show that by applying a proper mix of translation techniques, a SAT solver can perform on a par with the best existing native pseudo-Boolean solvers. This is particularly valuable in those cases where the constraint problem of interest is naturally expressed as a SAT problem, except for a handful of constraints. Translating those constraints to get a pure clausal problem will take full advantage of the latest improvements in SAT research. A particularly interesting result of this work is the efficiency of sorting networks for expressing pseudo-Boolean constraints. Although tangential to this presentation, the result suggests how synthesis tools may be modified to produce arithmetic circuits more suitable for SAT-based reasoning. Keywords: pseudo-Boolean, SAT solver, SAT translation, integer linear programming
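The flavour of such a clausal translation can be seen on the simplest pseudo-Boolean constraints, cardinality constraints. The sketch below uses the naive binomial encoding of "at most k of n variables" rather than the sorting-network translation highlighted in the abstract; `at_most_k` and the DIMACS-style signed-literal numbering are our own illustrative choices.

```python
from itertools import combinations, product

def at_most_k(variables, k):
    # Binomial encoding of sum(variables) <= k: for every subset of k+1
    # variables, at least one must be false (negative literal = negation).
    return [[-v for v in subset] for subset in combinations(variables, k + 1)]

def satisfied(assignment, clauses):
    # A CNF holds if every clause contains at least one true literal.
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)

# Brute-force check: the CNF for x1+x2+x3+x4 <= 2 is satisfied by exactly
# the assignments with at most two true variables.
clauses = at_most_k([1, 2, 3, 4], 2)
for bits in product([False, True], repeat=4):
    assignment = {i + 1: b for i, b in enumerate(bits)}
    assert satisfied(assignment, clauses) == (sum(bits) <= 2)
```

The binomial encoding needs C(n, k+1) clauses, which is precisely why network-based translations such as those evaluated in the paper matter for larger constraints.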
Optimal Circuits for Parallel Multipliers
 IEEE Transactions on Computers, 1998
Abstract

Cited by 45 (1 self)
Abstract—We present new design and analysis techniques for the synthesis of parallel multiplier circuits that have smaller predicted delay than the best current multipliers. In [4], Oklobdzija et al. suggested a new approach, the Three-Dimensional Method (TDM), for Partial Product Reduction Tree (PPRT) design that produces multipliers that outperform the current best designs. The goal of TDM is to produce a minimum-delay PPRT using full adders. This is done by carefully modeling the relationship of the output delays to the input delays in an adder and then interconnecting the adders in a globally optimal way. Oklobdzija et al. suggested a good heuristic for finding the optimal PPRT, but no proofs about the performance of this heuristic were given. We provide a formal characterization of optimal PPRT circuits and prove a number of properties about them. For the problem of summing a set of input bits within the minimum delay, we present an algorithm that produces a minimum-delay circuit in time linear in the size of the inputs. Our techniques allow us to prove tight lower bounds on multiplier circuit delays. These results are combined to create a program that finds optimal TDM multiplier designs. Using this program, we can show that, while the heuristic used in [4] does not always find the optimal TDM circuit, it performs very well in terms of overall PPRT circuit delay. However, our search algorithms find better PPRT circuits for reducing the delay of the entire multiplier. Index Terms—Multiplier design, partial product reduction, algorithms, circuit design.
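The abstract's core subproblem, summing a set of input bits arriving at different times with minimum output delay, can be sketched with a greedy heuristic: always feed the three earliest-arriving signals into the next full adder. This toy model assumes a single column and a uniform unit delay for both adder outputs (the paper's model distinguishes sum and carry delays and tracks the carry's move to the next column), so it conveys only the flavour of delay-driven interconnection; `min_delay_sum` is a hypothetical name.

```python
import heapq

def min_delay_sum(arrival_times, adder_delay=1.0):
    # Greedy single-column sketch: repeatedly combine the three
    # earliest-arriving signals in a full adder (or two, when only two
    # remain), whose output is ready one adder delay after its latest
    # input. Returns the arrival time of the final signal.
    heap = list(arrival_times)
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        latest = heapq.heappop(heap) if heap else b
        heapq.heappush(heap, max(a, b, latest) + adder_delay)
    return heap[0]

# Nine simultaneously-arriving bits need two levels of full adders.
print(min_delay_sum([0.0] * 9))  # 2.0
```

The interesting cases, which the paper characterizes exactly, are non-uniform arrival times, where a late bit should skip the early adder levels entirely.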
The Sum-Absolute-Difference Motion Estimation Accelerator
 In Proceedings of the 24th Euromicro Conference
Abstract

Cited by 29 (15 self)
In this paper we investigate the Sum Absolute Difference (SAD) operation, an operation frequently used by a number of algorithms for digital motion estimation. For this operation, we propose a single vector instruction that can be performed (in hardware) on an entire block of data in parallel. We investigate possible implementations for such an instruction. Assuming a machine cycle comparable to the cycle of a two-cycle multiply, we show that for a block of 16x1 or 16x16, the SAD operation can be performed in 3 or 4 machine cycles, respectively. The proposed implementation operates as follows: first, we determine in parallel which operand is the smaller in each pair of operands. Second, we compute the absolute value of the difference of each pair by subtracting the smaller value from the larger, and finally we compute the accumulation. The operations associated with the second and third steps are performed in parallel, resulting in a multiply (accumulate) type of operation. Our approach also covers the Mean Absolute Difference (MAD) operation, excluding only a final shifting (division) operation.
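A scalar reference model of the described three-step datapath (compare, subtract the smaller from the larger, accumulate) is easy to state in software; `sad` is our illustrative name, and the loop serialises what the proposed instruction performs on all pairs in parallel.

```python
def sad(block_a, block_b):
    # Sum of absolute differences over two equal-sized pixel blocks.
    total = 0
    for a, b in zip(block_a, block_b):
        smaller, larger = (a, b) if a <= b else (b, a)  # step 1: compare
        total += larger - smaller  # steps 2 and 3: subtract, accumulate
    return total

print(sad([10, 20, 30], [12, 18, 33]))  # 7  (= 2 + 2 + 3)
```

The MAD mentioned at the end is this result divided (shifted) by the block size, which is exactly the step the proposed instruction leaves out.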
Reduced Power Dissipation Through Truncated Multiplication
 In IEEE Alessandro Volta Memorial Workshop on Low Power Design, 1999
Abstract

Cited by 26 (7 self)
Reducing the power dissipation of parallel multipliers is important in the design of digital signal processing systems. In many of these systems, the products of parallel multipliers are rounded to avoid growth in word size. The power dissipation and area of rounded parallel multipliers can be significantly reduced by a technique known as truncated multiplication. With this technique, the least significant columns of the multiplication matrix are not used. Instead, the carries generated by these columns are estimated. This estimate is added to the most significant columns to produce the rounded product. This paper presents the design and implementation of parallel truncated multipliers. Simulations indicate that truncated parallel multipliers dissipate between 29 and 40 percent less power than standard parallel multipliers for operand sizes of 16 and 32 bits.
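The arithmetic effect of truncation can be modelled directly. The sketch below assumes a simple constant-correction scheme for unsigned operands (keep only partial-product columns of weight t and above, and add a fixed estimate equal to the average value of the dropped bits); the correction used in the paper may differ, so this only illustrates the accuracy trade-off.

```python
def truncated_multiply(a, b, n, t):
    # Sum only the partial-product bits in columns t .. 2n-1.
    kept = 0
    for i in range(n):
        for j in range(n):
            if i + j >= t:
                kept += (((a >> i) & 1) * ((b >> j) & 1)) << (i + j)
    # Constant correction: each dropped bit is 1 with probability 1/4 for
    # random inputs, so add a quarter of the dropped columns' total weight.
    correction = sum((s + 1) << s for s in range(t)) // 4
    return (kept + correction) >> t

# Compare against the correctly rounded product for 8-bit operands, t = 4:
# the error stays within a few units in the last place (analytically <= 3).
worst = max(abs(truncated_multiply(a, b, 8, 4) - ((a * b + 8) >> 4))
            for a in range(0, 256, 5) for b in range(0, 256, 7))
print(worst)
```

Dropping the t least significant columns removes roughly t(t+1)/2 partial-product bits and their adders, which is where the reported power savings come from.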
Design and implementation of the MorphoSys reconfigurable computing processor
 J. Very Large Scale Integr. Signal Process. Syst., 2000
Abstract

Cited by 24 (0 self)
In this paper, we describe the implementation of MorphoSys, a reconfigurable processing system targeted at data-parallel and computation-intensive applications. The MorphoSys architecture consists of a reconfigurable component (an array of reconfigurable cells) combined with a RISC control processor and a high-bandwidth memory interface. We briefly discuss the system-level model, array architecture, and control processor. Next, we present the detailed design implementation and the various aspects of physical layout of the different sub-blocks of MorphoSys. The physical layout was constrained for 100 MHz operation with low power consumption, and was implemented using 0.35 µm, four-metal-layer CMOS (3.3 V) technology. We provide simulation results for the MorphoSys architecture (based on a VHDL model) for some typical data-parallel applications (video compression and automatic target recognition). The results indicate that the MorphoSys system can achieve significantly better performance for most of these applications in comparison with other systems and processors.
The SNAP Project: Design of Floating Point Arithmetic Units
 In Proceedings of Arith-13, 1997
Abstract

Cited by 22 (2 self)
In recent years computer applications have increased in their computational complexity. The industry-wide usage of performance benchmarks, such as SPECmarks, and the popularity of 3D graphics applications force processor designers to pay particular attention to the implementation of the floating point unit, or FPU. This paper presents results of the Stanford sub-nanosecond arithmetic processor (SNAP) research effort in the design of hardware for floating point addition, multiplication and division. We show that one-cycle FP addition is achievable 32% of the time using a variable-latency algorithm. For multiplication, a binary tree is often inferior to a Wallace tree designed using an algorithmic layout approach for contemporary feature sizes (0.3 µm). Further, in most cases two-bit Booth encoding of the multiplier is preferable to non-Booth encoding for partial product generation. It appears that for division, optimum area-performance is achieved using functional iteration, ...
The design of a high speed ASIC unit for the hash function SHA-256 (384, 512)
 In Proc. DATE
Abstract

Cited by 17 (1 self)
After recalling the basic algorithms published by NIST for implementing the hash functions SHA-256 (384, 512), a basic circuit characterized by a cascade of full adder arrays is given. Implementation options are discussed and two methods for improving speed are presented: delay balancing and pipelining. An application of the former is given first, yielding a circuit that shortens the critical path by one full adder array. A pipelined version is then given, yielding a reduction of two full adder arrays in the critical path. The two methods are then combined, and the results obtained through hardware synthesis are presented, including a comparison between the new circuits.
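The "cascade of full adder arrays" is a carry-save structure: each array is a row of (3:2) counters that compresses three addends into two without propagating carries. Below is a behavioural sketch of this idea (the names are ours, and it models only the multi-operand modular addition inside a round, not the full SHA-256 datapath).

```python
MASK = 0xFFFFFFFF  # SHA-256 operates on 32-bit words

def csa(x, y, z):
    # One full adder array: (3:2) compression of three words into a sum
    # word and a carry word, with no carry propagation across bit positions.
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s & MASK, c & MASK

def csa_sum(addends):
    # Cascade of full adder arrays followed by a single carry-propagating
    # addition; the critical path grows by only one full adder per addend.
    s, c = addends[0], 0
    for a in addends[1:]:
        s, c = csa(s, c, a)
    return (s + c) & MASK
```

Delay balancing and pipelining, as described above, rearrange or cut this cascade; both leave the redundant (sum, carry) pair representation untouched.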
Modular range reduction: A new algorithm for fast and accurate computation of the elementary functions
 Journal of Universal Computer Science, 1995
Abstract

Cited by 13 (10 self)
A new range reduction algorithm, called Modular Range Reduction (MRR), briefly introduced by the authors in [Daumas et al. 1994], is analyzed in depth. It is used to reduce the arguments of exponential and trigonometric function algorithms to within the small range for which the algorithms are valid. MRR reduces the arguments quickly and accurately. A fast hardwired implementation of MRR operates in time O(log n), where n is the number of bits of the binary input value. For example, with MRR it becomes possible to accurately compute the sine and cosine of a very large number. We propose two possible architectures implementing this algorithm.
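The idea behind table-based modular reduction can be illustrated at the value level: precompute 2^i mod C for every bit position i, then reduce an n-bit integer argument by accumulating the table entries of its set bits, folding the running sum back into [0, C). The sketch below is a toy model in double precision, assumed to be in the spirit of MRR; the paper's hardware uses exact fixed-point tables and redundant addition, so this only conveys the structure.

```python
from math import fmod, pi

def mrr_sketch(x, n, C):
    # Table of 2^i mod C for each bit position (fmod is exact on doubles).
    table = [fmod(2.0 ** i, C) for i in range(n)]
    r = 0.0
    for i in range(n):
        if (x >> i) & 1:   # accumulate the residues of the set bits
            r += table[i]
            if r >= C:     # fold back into [0, C)
                r -= C
    return r

# Reduce the integer 1000 modulo pi/2; the result matches fmod closely.
print(abs(mrr_sketch(1000, 10, pi / 2) - fmod(1000.0, pi / 2)) < 1e-9)
```

Because the table lookups for all bits are independent, a hardware version can sum them with a tree of redundant adders, which is what gives the O(log n) operating time quoted in the abstract.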
Automatic synthesis of compressor trees: reevaluating large counters
 Design Automation and Test in Europe (DATE ’07)
Abstract

Cited by 12 (9 self)
Despite the progress of the last decades in electronic design automation, arithmetic circuits have always received far less attention than other classes of digital circuits. Logic synthesisers, which play a fundamental role in design today, play only a minor role for most arithmetic circuits, performing some local optimisations but hardly improving the overall structure of arithmetic components. Architectural optimisations have often been studied manually, and ad-hoc techniques have been developed only for very common building blocks such as fast adders and multi-input adders. A notable case is multi-input addition, which is at the core of many circuits such as multipliers. The most common technique for implementing multi-input addition is the compressor tree, often composed of carry-save adders (based on (3:2) counters, i.e., full adders). A large body of literature exists on implementing compressor trees using large counters; however, those large counters were built recursively from full and half adders. In this paper we give some definite answers to issues related to the use of large counters. We present a general technique to implement large counters whose performance is much better than that of counters composed of full and half adders. We also show that it is not always useful to use larger optimised counters, and that sometimes a combination of counters of various sizes gives the best performance. Our results show a 15% improvement in critical path delay; in some cases, hardware area is also reduced by using our counters.
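A value-level sketch of a compressor tree built from generalised counters: a column holding m bits feeds an (m : ceil(log2(m+1))) counter whose output bits land in columns of increasing weight, and levels are applied until every column holds at most one bit. The names and the one-counter-per-column policy are our illustrative choices (the paper mixes counters of several sizes per column), but the value-preservation invariant is the same.

```python
def compress(counts):
    # One compressor level: the m bits of column w go through an
    # (m : ceil(log2(m+1))) counter; the binary digits of the count land
    # in columns w, w+1, ...  The weighted sum of all bits is preserved.
    out = [0] * (len(counts) + 8)
    for w, m in enumerate(counts):
        k = 0
        while m:
            out[w + k] += m & 1
            m >>= 1
            k += 1
    return out

def compressor_tree(operands, width=32):
    # Per-column ones-counts of the addends' bits.
    counts = [sum((x >> w) & 1 for x in operands) for w in range(width)]
    levels = 0
    while any(m > 1 for m in counts):
        counts = compress(counts)
        levels += 1
    value = sum(bit << w for w, bit in enumerate(counts))
    return value, levels

print(compressor_tree([13, 7, 29, 5, 21]))  # (75, 3): correct sum, 3 levels
```

The number of levels is the model's stand-in for critical path delay; the paper's point is that well-designed large counters reduce it compared with trees built recursively from full and half adders.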