Results 1–10 of 13
Sizing Router Buffers
, 2004
Abstract

Cited by 350 (18 self)
All Internet routers contain buffers to hold packets during times of congestion. Today, the size of the buffers is determined by the dynamics of TCP's congestion control algorithm. In particular, the goal is to make sure that when a link is congested, it is busy 100% of the time, which is equivalent to making sure its buffer never goes empty. A widely used rule-of-thumb states that each link needs a buffer of size B = RTT × C, where RTT is the average round-trip time of a flow passing across the link, and C is the data rate of the link. For example, a 10 Gb/s router line card needs approximately 250 ms × 10 Gb/s = 2.5 Gbits of buffers, and the amount of buffering grows linearly with the line rate. Such large buffers are challenging for router manufacturers, who must use large, slow, off-chip DRAMs. And queueing delays can be long, have high variance, and may destabilize the congestion control algorithms. In this paper we argue that the rule-of-thumb (B = RTT × C) is now outdated and incorrect for backbone routers. This is because of the large number of flows (TCP connections) multiplexed together on a single backbone link. Using theory, simulation and experiments on a network of real routers, we show that a link with n flows requires no more than B = (RTT × C)/√n, for long-lived or short-lived TCP flows. The consequences for router design are enormous: a 2.5 Gb/s link carrying 10,000 flows could reduce its buffers by 99% with negligible difference in throughput, and a 10 Gb/s link carrying 50,000 flows requires only 10 Mbits of buffering, which can easily be implemented using fast, on-chip SRAM.
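The arithmetic behind both sizing rules is easy to check. A minimal sketch (the function names are ours; the figures are the abstract's):

```python
import math

def rule_of_thumb(rtt_s, rate_bps):
    """Classic rule: B = RTT * C."""
    return rtt_s * rate_bps

def small_buffer(rtt_s, rate_bps, n_flows):
    """Proposed rule: B = (RTT * C) / sqrt(n)."""
    return rtt_s * rate_bps / math.sqrt(n_flows)

rtt = 0.250  # 250 ms average round-trip time

# 10 Gb/s line card: the classic rule gives 2.5 Gbits of buffering.
print(rule_of_thumb(rtt, 10e9) / 1e9)         # 2.5 (Gbits)

# With 50,000 flows the same link needs only ~11 Mbits.
print(small_buffer(rtt, 10e9, 50_000) / 1e6)  # ~11.2 (Mbits)

# A 2.5 Gb/s link with 10,000 flows: sqrt(10000) = 100, a 99% reduction.
print(small_buffer(rtt, 2.5e9, 10_000) / rule_of_thumb(rtt, 2.5e9))  # 0.01
```

The √n factor comes from the desynchronization of many multiplexed flows, so these numbers apply to aggregated backbone links, not to a link carrying one flow.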
A Mechanically Checked Proof of the Correctness of the Kernel of the AMD5K86 Floating-Point Division Algorithm
 IEEE Transactions on Computers
, 1996
Abstract

Cited by 37 (11 self)
We describe a mechanically checked proof of the correctness of the kernel of the floating point division algorithm used on the AMD5K86 microprocessor. The kernel is a non-restoring division algorithm that computes the floating point quotient of two double extended precision floating point numbers, p and d (d ≠ 0), with respect to a rounding mode, mode. The algorithm is defined in terms of floating point addition and multiplication. First, two Newton-Raphson iterations are used to compute a floating point approximation of the reciprocal of d. The result is used to compute four floating point quotient digits in the 24,17 format (24 bits of precision and 17-bit exponents), which are then summed using appropriate rounding modes. We prove that if p and d are 64,15 (possibly denormal) floating point numbers, d ≠ 0, and mode specifies one of six rounding procedures and a desired precision 0 < n ≤ 64, then the output of the algorithm is p/d rounded according to mode. We prove that every intermediate result is a floating point number in the format required by the resources allocated to it.
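For readers unfamiliar with the refinement step the abstract mentions, here is a toy double-precision illustration of Newton-Raphson reciprocal iteration. This is only a sketch of the mathematical idea: the actual kernel seeds the iteration from a hardware lookup table and works in the processor's own formats.

```python
# Each Newton-Raphson step x' = x * (2 - d * x) roughly doubles the
# number of correct bits in the reciprocal approximation. The starting
# guess 0.3 is arbitrary, chosen only to make the convergence visible.
def refine(x, d):
    return x * (2.0 - d * x)

d = 3.0
x = 0.3                 # crude approximation of 1/d = 0.333...
for _ in range(2):      # two iterations, as in the kernel
    x = refine(x, d)

error = abs(x - 1.0 / 3.0)
print(error)            # roughly 3.3e-5, down from the initial 3.3e-2
```

The quadratic convergence is why two iterations suffice before the quotient-digit computation takes over.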
Adaptive Aggregation on Chip Multiprocessors
Abstract

Cited by 30 (3 self)
The recent introduction of commodity chip multiprocessors requires that the design of core database operations be carefully examined to take full advantage of on-chip parallelism. In this paper we examine aggregation in a multicore environment, the Sun UltraSPARC T1, a chip multiprocessor with eight cores and a shared L2 cache. Aggregation is an important aspect of query processing that is seemingly easy to understand and implement. Our research, however, demonstrates that a chip multiprocessor adds new dimensions to understanding hash-based aggregation performance: concurrent sharing of aggregation data structures and contentious accesses to frequently used values. We also identify a trade-off between private data structures assigned to each thread versus shared data structures for aggregation. Depending on input characteristics, different aggregation strategies are optimal, and choosing the wrong strategy can result in a performance penalty of over an order of magnitude. We provide a thorough explanation of the factors affecting aggregation performance on chip multiprocessors and identify three key input characteristics that dictate performance: (1) the average run length of identical group-by values, (2) the locality of references to the aggregation hash table, and (3) the frequency of repeated accesses to the same hash table location. We then introduce an adaptive aggregation operator that performs lightweight sampling of the input to choose the correct aggregation strategy with high accuracy. Our experiments verify that our adaptive algorithm chooses the highest-performing aggregation strategy on a number of common input distributions.
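The sampling idea can be sketched in a few lines. Everything here is invented for illustration: the thresholds, the strategy names, and the decision order are ours, not the paper's.

```python
# Sample the input's group-by column and pick an aggregation strategy
# from simple statistics, echoing the paper's three key characteristics.
from collections import Counter
from itertools import groupby

def avg_run_length(sample):
    """Average length of runs of identical consecutive group-by values."""
    runs = [len(list(g)) for _, g in groupby(sample)]
    return sum(runs) / len(runs)

def choose_strategy(sample, run_threshold=4.0, hot_threshold=64):
    """Toy decision rule; real thresholds must be tuned empirically."""
    if avg_run_length(sample) >= run_threshold:
        return "local"    # long runs: pre-aggregate locally, touch the table rarely
    if len(Counter(sample)) <= hot_threshold:
        return "private"  # few hot groups: a shared table would see heavy contention
    return "shared"       # many distinct groups: one shared table avoids duplicated state
```

For example, a sorted input (long runs) selects local pre-aggregation, a two-value input (hot keys, no runs) selects private tables, and a high-cardinality input selects the shared table.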
Comparing the Programming Demands of Single-User and Multi-User Applications. In
 Proceedings of the Fourth Symposium on User Interface Software and Technology (UIST'91)
, 1991
Abstract

Cited by 29 (0 self)
Synchronous multi-user applications are designed to support two or more simultaneous users. The RENDEZVOUS™ system is an infrastructure for building such multi-user applications. Several multi-user applications, such as a tic-tac-toe game, a multi-user CardTable application, and a multi-user whiteboard, have been or are being constructed with the RENDEZVOUS system. We argue that there are at least three dimensions of programming complexity that are differentially affected by the programming of multi-user applications as compared to the programming of single-user applications. The first, concurrency, addresses the need to cope with parallel activities. The second dimension, abstraction, addresses the need to separate the user interface from an underlying application abstraction. The third dimension, roles, addresses the need to differentially characterize users and customize the user interface appropriately. Certainly, single-user applications often deal with these complexities; we argue that multi-user applications cannot avoid them.
Acceleration and Energy Efficiency of Geometric Algebra Computations Using Reconfigurable Computers and GPUs
, 2009
Abstract

Cited by 8 (4 self)
Geometric algebra (GA) is a mathematical framework that allows the compact description of geometric relationships and algorithms in many fields of science and engineering. The execution of these algorithms, however, requires significant computational power, which has made the use of GA impractical for many real-world applications. We describe how a GA-based formulation of the inverse kinematics problem from robotics can be accelerated on reconfigurable FPGA-based computers and on a graphics processing unit (GPU). The practical evaluation covers not only sheer compute performance but also the energy efficiency of the various solutions.
Parallel Buffers for Chip Multiprocessors
Abstract

Cited by 6 (1 self)
Chip multiprocessors (CMPs) present new opportunities for improving database performance on large queries. Because CMPs often share execution, cache, or bandwidth resources among many hardware threads, implementing parallel database operators that efficiently share these resources is key to maximizing performance. A crucial aspect of this parallelism is managing concurrent, shared input and output to the parallel operators. In this paper we propose and evaluate a parallel buffer that enables intra-operator parallelism on CMPs by avoiding contention between hardware threads that need to concurrently read from or write to the same buffer. The parallel buffer handles parallel input and output coordination as well as load balancing, so individual operators do not need to reimplement that functionality.
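One plausible reading of the contention-avoidance idea, sketched with invented names and no claim to match the paper's design: give each producer thread its own partition of the buffer so concurrent writers never compete for the same lock, and let a consumer drain all partitions.

```python
import threading

class ParallelBuffer:
    """Partitioned buffer: one sub-buffer per producer thread."""

    def __init__(self, n_threads):
        # one independent sub-buffer (and lock) per producer thread
        self.parts = [[] for _ in range(n_threads)]
        self.locks = [threading.Lock() for _ in range(n_threads)]

    def put(self, thread_id, item):
        # producers with distinct thread_ids never contend with each other
        with self.locks[thread_id]:
            self.parts[thread_id].append(item)

    def drain(self):
        # consumer collects every partition; locks are taken one at a time
        out = []
        for lock, part in zip(self.locks, self.parts):
            with lock:
                out.extend(part)
                part.clear()
        return out
```

A real CMP implementation would also worry about cache-line padding between partitions and about load balancing across consumers, which this sketch ignores.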
A Parallel Page Cache: IOPS and Caching for Multicore Systems
Abstract

Cited by 3 (2 self)
We present a set-associative page cache for scalable parallelism of IOPS in multicore systems. The design eliminates lock contention and hardware cache misses by partitioning the global cache into many independent page sets, each requiring a small amount of metadata that fits in a few processor cache lines. We extend this design with message passing among processors in a non-uniform memory architecture (NUMA). We evaluate the set-associative cache on 12-core processors and a 48-core NUMA system to show that it realizes the scalable IOPS of direct I/O (no caching) and matches the cache hit rates of Linux's page cache. Set-associative caching maintains IOPS at scale, in contrast to Linux, for which IOPS crash beyond eight parallel threads.
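The partitioning scheme can be illustrated in miniature. This is our own sketch under simplifying assumptions (per-set locks, LRU within a set), not the paper's code:

```python
import threading
from collections import OrderedDict

class SetAssociativeCache:
    """Pages hash to one of n_sets small sets; each set has its own lock."""

    def __init__(self, n_sets=256, ways=8):
        self.n_sets = n_sets
        self.ways = ways                          # pages per set
        self.sets = [OrderedDict() for _ in range(n_sets)]
        self.locks = [threading.Lock() for _ in range(n_sets)]

    def _index(self, page_no):
        return hash(page_no) % self.n_sets        # pick the page's set

    def get(self, page_no):
        i = self._index(page_no)
        with self.locks[i]:                       # contention confined to one set
            data = self.sets[i].get(page_no)
            if data is not None:
                self.sets[i].move_to_end(page_no) # LRU within the set only
            return data

    def put(self, page_no, data):
        i = self._index(page_no)
        with self.locks[i]:
            s = self.sets[i]
            s[page_no] = data
            s.move_to_end(page_no)
            if len(s) > self.ways:                # evict this set's LRU page
                s.popitem(last=False)
```

Because threads touching different sets never share a lock (or the set's metadata lines), lookups scale with the number of sets rather than serializing on one global lock.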
Determining software investment lag
 Journal of Universal Computer Science
, 2008
Abstract

Cited by 2 (0 self)
The investments needed to bring a software product to market are substantial and can extend over several years. Managing software development requires not only technical expertise but also communication with funders and economists. This paper presents methods to estimate a parameter, lag, which captures the effective investment time. The lag parameter is useful in assessing progress towards the goal of having a quality product, as well as in scheduling resources, assessing risk, considering options, capitalizing investments, and predicting taxation consequences. The paper presents lag estimation methods for a new product, for additional versions of a product, and for complete product replacement.
System Impact of 3D Processor-Memory Interconnect: A Limit Study
Abstract
3D integration with through-silicon vias (TSVs) can provide enormous bandwidth between a processor die and a memory die. The central goal of our work is to explore the limits of the performance improvement that can be achieved with such integration. Towards this end, we propose a model of the impact of 3D TSVs on system performance. The model leads to several key observations: (i) increased miss tolerance (smaller caches) and hence improved core scaling for a fixed die size; (ii) higher sustained IPC per core; (iii) significantly smaller, energy-efficient DRAM banks; (iv) redistribution of system power to the cores and on-die interconnect; and (v) TSV utilization is a function of the relationship between reference locality and the bandwidth properties of the intra-die network. These observations are repeated in cycle-level simulations of a 64-tile architecture.
Correctness of the AMD K
, 1996
Abstract
We describe a mechanically checked proof of the correctness of the kernel of the floating point division algorithm used on the AMD5K86 microprocessor. The kernel is a non-restoring division algorithm that computes the floating point quotient of two double extended precision floating point numbers, p and d (d ≠ 0), with respect to a rounding mode, mode. The algorithm is defined in terms of floating point addition and multiplication. First, two Newton-Raphson iterations are used to compute a floating point approximation of the reciprocal of d. The result is used to compute four floating point quotient digits in the 24,17 format (24 bits of precision and 17-bit exponents), which are then summed using appropriate rounding modes. We prove that if p and d are 64,15 (possibly denormal) floating point numbers, d ≠ 0, and mode specifies one of six rounding procedures and a desired precision 0 < n ≤ 64, then the output of the algorithm is p/d rounded according to mode. We prove that every intermediate result is a floating point number in the format required by the resources allocated to it. Our claims have been mechanically checked using the ACL2 theorem prover.