Results 1 - 10
of
104
Performance Analysis of k-ary n-cube Interconnection Networks
- IEEE Transactions on Computers
, 1988
"... VLSI communication networks are wire limited. The cost of a network is not a function of the number of switches required, but rather a function of the wiring density required to construct the network. This paper analyzes communication networks of varying dimension under the assumption of constant wi ..."
Abstract
-
Cited by 271 (14 self)
- Add to MetaCart
VLSI communication networks are wire limited. The cost of a network is not a function of the number of switches required, but rather a function of the wiring density required to construct the network. This paper analyzes communication networks of varying dimension under the assumption of constant wire bisection. Expressions for the latency, average case throughput, and hot-spot throughput of k-ary n- cube networks with constant bisection are derived that agree closely with experimental measurements. It is shown that low-dimensional networks (e.g., tori) have lower latency and higher hot-spot throughput than high-dimensional networks (e.g., binary n-cubes) with the same bisection width. Keywords Communication networks, interconnection networks, concurrent computing, message-passing multiprocessors, parallel processing, VLSI. 1 Introduction The critical component of a concurrent computer is its communication network. Many algorithms are communication rather than processing limited. Fi...
Algorithms for Parallel Memory I: Two-Level Memories
, 1992
"... We provide the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple secondary storage devices for the problems of sorting, FFT, matrix transposition, standard matrix multiplication, and related problems. Our two-level memory model is n ..."
Abstract
-
Cited by 226 (32 self)
- Add to MetaCart
We provide the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple secondary storage devices for the problems of sorting, FFT, matrix transposition, standard matrix multiplication, and related problems. Our two-level memory model is new and gives a realistic treatment of parallel block transfer, in which during a single I/O each of the P secondary storage devices can simultaneously transfer a contiguous block of B records. The model pertains to a large-scale uniprocessor system or parallel multiprocessor system with P disks. In addition, the sorting, FFT, permutation network, and standard matrix multiplication algorithms are typically optimal in terms of the amount of internal processing time. The difficulty in developing optimal algorithms is to cope with the partitioning of memory into P separate physical devices. Our algorithms' performance can be significantly better than those obtained by the well-known but nonopti...
Architecture of Field-Programmable Gate Arrays: The Effect of Logic Block Functionality on Area Efficiency
- Proceedings of the IEEE
, 1990
"... tures and the programming technologies used to customize them is presented. Programming technologies are compared on the basis of their vola fility, size, parasitic capacitance, resistance, and process technology complexity. FPGA architectures are divided into two constituents: logic block architect ..."
Abstract
-
Cited by 97 (12 self)
- Add to MetaCart
tures and the programming technologies used to customize them is presented. Programming technologies are compared on the basis of their vola fility, size, parasitic capacitance, resistance, and process technology complexity. FPGA architectures are divided into two constituents: logic block architectures and routing architectures. A classijcation of logic blocks based on their granularity is proposed and several logic blocks used in commercially available FPGA ’s are described. A brief review of recent results on the effect of logic block granularity on logic density and pe$ormance of an FPGA is then presented. Several commercial routing architectures are described in the contest of a general routing architecture model. Finally, recent results on the tradeoff between the fleibility of an FPGA routing architecture its routability and density are reviewed. I.
A Complexity Theory for VLSI
- TECHNICAL REPORT
, 1980
"... The established methodologies for studying computational complexity can be applied to the new problems posed by very large-scale integrated (VLSI) circuits. This thesis develops a “VLSI model of computation” and derives upper and lower bounds on the silicon area and time required to solve the proble ..."
Abstract
-
Cited by 92 (0 self)
- Add to MetaCart
The established methodologies for studying computational complexity can be applied to the new problems posed by very large-scale integrated (VLSI) circuits. This thesis develops a “VLSI model of computation” and derives upper and lower bounds on the silicon area and time required to solve the problems of sorting and discrete Fourier transformation. In particular, the area A and time T taken by any VLSI chip using any algorithm to perform an $N$-point Fourier transform must satisfy $AT^2 \geq c N^2 \log^2 N$, for some fixed $c > 0$. A more general result for both sorting and Fourier transformation is that $AT^{2x} = \Omega(N^{1+x} \log^{2x} N)$ for any $x$ in the range $0 < x < 1$. Also, the energy dissipated by a VLSI chip during the solution of either of these problems is at least $\Omega(N^{3/2} \log N)$. The tightness of these bounds is demonstrated by the existence of nearly optimal circuits for both sorting and Fourier transformation. The circuits based on the shuffle-exchange interconnection pattern are fast but large: $T = O(\log^2 N)$ for Fourier transformation, $T = O(\log^3 N)$ for sorting; both have area $A$ of at most $O(N^2 / \log{1/2} N)$. The circuits based on the mesh interconnection pattern are slow but small: $T = O(N^{1/2} \log\log N)$, $A = O(N \log^2 N)$.
Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors
, 1981
"... In this paper we implement several basic operating system primitives by using a "replace-add" operation, which can supersede the standard "test and set", and which appears to be a universal primitive for efficiently coordinating large numbers of independently acting sequential processors. We also pr ..."
Abstract
-
Cited by 84 (2 self)
- Add to MetaCart
In this paper we implement several basic operating system primitives by using a "replace-add" operation, which can supersede the standard "test and set", and which appears to be a universal primitive for efficiently coordinating large numbers of independently acting sequential processors. We also present a hardware implementation of replace-add that permits multiple replace-adds to be processed nearly as efficiently as loads and stores. Moreover, the crucial special case of concurrent replace-adds updating the same variable is handled particularly well: If every PE simultaneously addresses a replace-add at the same variable, all these requests are satisfied in the time required to process just one request.
Special Purpose Parallel Computing
- Lectures on Parallel Computation
, 1993
"... A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing ..."
Abstract
-
Cited by 77 (5 self)
- Add to MetaCart
A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing [365] demonstrated that, in principle, a single general purpose sequential machine could be designed which would be capable of efficiently performing any computation which could be performed by a special purpose sequential machine. The importance of this universality result for subsequent practical developments in computing cannot be overstated. It showed that, for a given computational problem, the additional efficiency advantages which could be gained by designing a special purpose sequential machine for that problem would not be great. Around 1944, von Neumann produced a proposal [66, 389] for a general purpose storedprogram sequential computer which captured the fundamental principles of...
A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures
- IEEE Trans. on Circuits and Systems
, 1990
"... ..."
Deterministic Sorting in Nearly Logarithmic Time on the Hypercube and Related Computers
- Journal of Computer and System Sciences
, 1996
"... This paper presents a deterministic sorting algorithm, called Sharesort, that sorts n records on an n-processor hypercube, shuffle-exchange, or cube-connected cycles in O(log n (log log n) 2 ) time in the worst case. The algorithm requires only a constant amount of storage at each processor. Th ..."
Abstract
-
Cited by 67 (10 self)
- Add to MetaCart
This paper presents a deterministic sorting algorithm, called Sharesort, that sorts n records on an n-processor hypercube, shuffle-exchange, or cube-connected cycles in O(log n (log log n) 2 ) time in the worst case. The algorithm requires only a constant amount of storage at each processor. The fastest previous deterministic algorithm for this problem was Batcher's bitonic sort, which runs in O(log 2 n) time. Supported by an NSERC postdoctoral fellowship, and DARPA contracts N00014--87--K--825 and N00014-- 89--J--1988. 1 Introduction Given n records distributed uniformly over the n processors of some fixed interconnection network, the sorting problem is to route the record with the ith largest associated key to processor i, 0 i ! n. One of the earliest parallel sorting algorithms is Batcher's bitonic sort [3], which runs in O(log 2 n) time on the hypercube [10], shuffle-exchange [17], and cube-connected cycles [14]. More recently, Leighton [9] exhibited a bounded-degree,...
MULTIPROCESSOR SCHEDULING TO ACCOUNT FOR INTERPROCESSOR COMMUNICATION
, 1991
"... Interprocessor communication (PC) overheads have emerged as the major performance limitation in parallel processing systems, due to the transmission delays, synchronization overheads, and conflicts for shared communication resources created by data exchange. Accounting for these overheads is essenti ..."
Abstract
-
Cited by 64 (11 self)
- Add to MetaCart
Interprocessor communication (PC) overheads have emerged as the major performance limitation in parallel processing systems, due to the transmission delays, synchronization overheads, and conflicts for shared communication resources created by data exchange. Accounting for these overheads is essential for attaining efficient hardware utilization. This thesis introduces two new compile-time heuristics for scheduling precedence graphs onto multiprocessor architectures, which account for interprocessor communication overheads and interconnection constraints in the architecture. These algorithms perform scheduling and routing simultaneously to account for irregular interprocessor interconnections, and schedule all communications as well as all computations to eliminate shared resource contention. The first technique, called dynamic-level scheduling, modifies the classical HLFET list scheduling strategy to account for IPC and synchronization overheads. By using dynamically changing priorities to match nodes and processors at each step, this technique attains an equitable tradeoff between load balancing and interprocessor communication cost. This method is fast, flexible, widely targetable, and displays promising perforrnance. The second technique, called declustering, establishes a parallelism hierarchy upon the precedence graph using graph-analysis techniques which explicitly address the tradeoff between exploiting parallelism and incurring communication cost. By systematically decomposing this hierarchy, the declustering process exposes parallelism instances in order of importance, assuring efficient use of the available processing resources. In contrast with traditional clustering schemes, this technique can adjust the level of cluster granularity to suit the characteristics of the specified architecture, leading to a more effective solution.
Local parallel biomolecular computing
- DNA Based Computers III, volume 48 of DIMACS
, 1999
"... Biomolecular Computation(BMC) is computation at the molecular scale, using biotechnology engineering techniques. Most proposed methods for BMC used distributed (molecular) parallelism (DP); where operations are executed in parallel on large numbers of distinct molecules. BMC done exclusively by DP r ..."
Abstract
-
Cited by 47 (13 self)
- Add to MetaCart
Biomolecular Computation(BMC) is computation at the molecular scale, using biotechnology engineering techniques. Most proposed methods for BMC used distributed (molecular) parallelism (DP); where operations are executed in parallel on large numbers of distinct molecules. BMC done exclusively by DP requires that the computation execute sequentially within any given molecule (though done in parallel for multiple molecules). In contrast, local parallelism (LP) allows operations to be executed in parallel on each given molecule. Winfree, et al [W96, WYS96]) proposed an innovative method for LP-BMC, that of computation by unmediated self-assembly of � arrays of DNA molecules, applying known domino tiling techniques (see Buchi [B62], Berger [B66], Robinson [R71], and Lewis and Papadimitriou [LP81]) in combination with the DNA self-assembly techniques of Seeman et al [SZC94]. The likelihood for successful unmediated self-assembly of computations has not been determined (we discuss a simple model of assembly where there may be blockages in self-assembly, but more sophisticated models may have a higher likelihood of success). We develop improved techniques to more fully exploit the potential power of LP-BMC. To increase

