Results 1–6 of 6
A Mechanically Checked Proof of the Correctness of the Kernel of the AMD5K86 Floating-Point Division Algorithm
IEEE Transactions on Computers, 1996
Abstract
Cited by 30 (11 self)
We describe a mechanically checked proof of the correctness of the kernel of the floating point division algorithm used on the AMD5K86 microprocessor. The kernel is a nonrestoring division algorithm that computes the floating point quotient of two double extended precision floating point numbers, p and d (d ≠ 0), with respect to a rounding mode, mode. The algorithm is defined in terms of floating point addition and multiplication. First, two Newton-Raphson iterations are used to compute a floating point approximation of the reciprocal of d. The result is used to compute four floating point quotient digits in the 24,17 format (24 bits of precision and 17-bit exponents), which are then summed using appropriate rounding modes. We prove that if p and d are 64,15 (possibly denormal) floating point numbers, d ≠ 0, and mode specifies one of six rounding procedures and a desired precision 0 < n ≤ 64, then the output of the algorithm is p/d rounded according to mode. We prove that every int...
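The Newton-Raphson step the abstract refers to can be sketched in Python. This is a generic illustration of the iteration, not the AMD5K86 hardware kernel: the seed choice, iteration count, and absence of rounding-mode handling are all simplifications.

```python
def reciprocal(d, iterations=2):
    """Newton-Raphson reciprocal refinement: x <- x * (2 - d * x).

    Each iteration roughly doubles the number of correct bits.
    Assumes d has been scaled into (0, 2) so the crude seed x = 1.0
    converges; real hardware would seed from a small lookup table.
    """
    x = 1.0                       # crude initial guess
    for _ in range(iterations):
        x = x * (2.0 - d * x)     # quadratic convergence toward 1/d
    return x

# Division then reduces to a multiply: p / d ~= p * reciprocal(d),
# followed (in the actual algorithm) by correctly rounded digit sums.
```

With more iterations the error shrinks quadratically, e.g. `reciprocal(1.5, 5)` is within 1e-9 of 2/3.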
Adaptive Aggregation on Chip Multiprocessors
Abstract
Cited by 20 (3 self)
The recent introduction of commodity chip multiprocessors requires that the design of core database operations be carefully examined to take full advantage of on-chip parallelism. In this paper we examine aggregation in a multicore environment, the Sun UltraSPARC T1, a chip multiprocessor with eight cores and a shared L2 cache. Aggregation is an important aspect of query processing that is seemingly easy to understand and implement. Our research, however, demonstrates that a chip multiprocessor adds new dimensions to understanding hash-based aggregation performance: concurrent sharing of aggregation data structures and contentious accesses to frequently used values. We also identify a trade-off between private data structures assigned to each thread versus shared data structures for aggregation. Depending on input characteristics, different aggregation strategies are optimal, and choosing the wrong strategy can result in a performance penalty of over an order of magnitude. We provide a thorough explanation of the factors affecting aggregation performance on chip multiprocessors and identify three key input characteristics that dictate performance: (1) average run length of identical group-by values, (2) locality of references to the aggregation hash table, and (3) frequency of repeated accesses to the same hash table location. We then introduce an adaptive aggregation operator that performs lightweight sampling of the input to choose the correct aggregation strategy with high accuracy. Our experiments verify that our adaptive algorithm chooses the highest-performing aggregation strategy on a number of common input distributions.
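As an illustration of the sampling idea, the following Python sketch picks a strategy from a small input prefix. The heuristic, thresholds, and strategy names here are assumptions for illustration, not the operator described in the paper.

```python
def choose_strategy(sample, locality_threshold=0.5):
    """Pick an aggregation strategy from a lightweight input sample.

    Hypothetical heuristic (the paper's actual sampling rules and
    thresholds are not reproduced here): many distinct, non-repeating
    group-by keys make per-thread private tables expensive to replicate
    and merge, so a shared table is chosen; few hot keys or long runs
    favor private per-thread tables that avoid contention.
    """
    runs = 1
    for prev, cur in zip(sample, sample[1:]):
        if cur != prev:
            runs += 1
    avg_run_length = len(sample) / runs          # characteristic (1)
    distinct_ratio = len(set(sample)) / len(sample)  # proxy for (2)/(3)
    if distinct_ratio > locality_threshold and avg_run_length < 2:
        return "shared"    # one shared hash table, merged access
    return "private"       # per-thread tables, merged at the end
```

For example, a highly repetitive input like `[1]*50 + [2]*50` selects `"private"`, while 100 distinct keys select `"shared"`.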
Parallel Buffers for Chip Multiprocessors
Abstract
Cited by 4 (1 self)
Chip multiprocessors (CMPs) present new opportunities for improving database performance on large queries. Because CMPs often share execution, cache, or bandwidth resources among many hardware threads, implementing parallel database operators that efficiently share these resources is key to maximizing performance. A crucial aspect of this parallelism is managing concurrent, shared input and output to the parallel operators. In this paper we propose and evaluate a parallel buffer that enables intra-operator parallelism on CMPs by avoiding contention between hardware threads that need to concurrently read or write to the same buffer. The parallel buffer handles parallel input and output coordination as well as load balancing, so individual operators do not need to reimplement that functionality.
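A minimal sketch of such a buffer, assuming per-thread partitions with work stealing; the class name, API, and balancing policy are illustrative, not the paper's actual design.

```python
import threading
from collections import deque

class ParallelBuffer:
    """Contention-avoiding buffer sketch: each hardware thread writes
    to its own sub-buffer, so producers never contend with each other,
    and readers drain sub-buffers round-robin for load balancing.
    """
    def __init__(self, num_threads):
        self.parts = [deque() for _ in range(num_threads)]
        self.locks = [threading.Lock() for _ in range(num_threads)]

    def put(self, thread_id, item):
        # Writers touch only their own partition: no cross-thread contention.
        with self.locks[thread_id]:
            self.parts[thread_id].append(item)

    def get(self, thread_id):
        # Readers start at their own partition and steal from others
        # when it is empty, balancing load across sub-buffers.
        n = len(self.parts)
        for i in range(n):
            idx = (thread_id + i) % n
            with self.locks[idx]:
                if self.parts[idx]:
                    return self.parts[idx].popleft()
        return None  # buffer drained
```

The key design point is that the common-case `put` and `get` never take a lock another thread is holding, which is what avoids the contention the abstract describes.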
Acceleration and Energy Efficiency of Geometric Algebra Computations Using Reconfigurable Computers and GPUs
Abstract
Cited by 2 (2 self)
Geometric algebra (GA) is a mathematical framework that allows the compact description of geometric relationships and algorithms in many fields of science and engineering. The execution of these algorithms, however, requires significant computational power, which has made the use of GA impractical for many real-world applications. We describe how a GA-based formulation of the inverse kinematics problem from robotics can be accelerated using reconfigurable FPGA-based computing and on a graphics processing unit (GPU). The practical evaluation covers not only the sheer compute performance, but also the energy efficiency of the various solutions.
Concert'02 Architecture Specification and Implementation
Abstract
This document describes a RISC CPU architecture, Concert'02. The Concert'02 architecture is based on an older specification, but has been updated for the intended host system. We continue by implementing the architecture in VHDL. The implementation is called the JAM CPU core. It is a five-stage pipelined CPU core with multi-cycle operations, forwarding, and hazard checking. The CPU has been tested on an actual FPGA and shown to work properly. We have analysed the critical path, the synthesis timing and area reports, and current performance.
System Impact of 3D Processor-Memory Interconnect: A Limit Study
Abstract
3D integration with through-silicon vias (TSVs) can provide enormous bandwidth between processor die and memory die. The central goal of our work is to explore the limits of performance improvement that can be achieved with such integration. Towards this end, we propose a model of the impact of 3D TSVs on system performance. The model leads to several key observations: i) increased miss tolerance (smaller caches) and hence improved core scaling for a fixed die size, ii) higher sustained IPC per core, iii) significantly smaller, energy-efficient DRAM banks, iv) redistribution of system power to the cores and on-die interconnect, and v) TSV utilization is a function of the relationship between reference locality and the bandwidth properties of the intra-die network. These observations are reproduced in cycle-level simulations of a 64-tile architecture.
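A back-of-the-envelope CPI model illustrates why reduced memory latency translates into higher sustained IPC. This is a generic textbook-style model with made-up parameter values, not the authors' actual limit-study model.

```python
def speedup_from_3d(miss_rate, mem_latency_2d, mem_latency_3d,
                    cpi_base=1.0, mem_refs_per_instr=0.3):
    """Simple memory-stall model: CPI = base CPI plus stall cycles,
    where stalls = (memory refs per instruction) * miss rate * miss
    latency. All parameter values are illustrative assumptions.
    """
    cpi_2d = cpi_base + mem_refs_per_instr * miss_rate * mem_latency_2d
    cpi_3d = cpi_base + mem_refs_per_instr * miss_rate * mem_latency_3d
    return cpi_2d / cpi_3d   # IPC improvement from lower miss latency

# E.g. cutting miss latency from 200 to 50 cycles at a 2% miss rate
# yields roughly a 1.7x IPC improvement under these assumptions.
```

The same structure also shows the cache-size observation: if TSVs shrink the miss penalty, a higher miss rate (smaller cache) can be tolerated at equal CPI, freeing die area for more cores.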