Results 11 - 20 of 82
A Tutorial on Lava: A Hardware Description and Verification System
2000
Cited by 20 (6 self)
Contents
1 Introduction
2 Getting Started
  2.1 Your First Circuit
  2.2 The Lava Interpreter
  2.3 Your Second Circuit
  2.4 Generating VHDL
  2.5 Exercises
3 Bigger Circuits
  3.1 Recursion over Lists
  3.2 Connection Patterns
  3.3 Arithmetic
  3.4 Compilation
  3.5 Exercises
4 Verification
  4.1 Simple Properties
  4.2 Quanti
A Multilevel Algorithm For Solving Boundary Integral Equation
Micro. Opt. Tech. Lett., 1994
Cited by 20 (12 self)
In the solution of an integral equation using the Conjugate Gradient (CG) method, the most expensive part is the matrix-vector multiplication, requiring $O(N^2)$ floating-point operations. The Fast Multipole Method (FMM) reduced the operation count to $N^{1.5}$. In this paper, we apply a multilevel algorithm to this problem and show that the complexity of a matrix-vector multiplication is proportional to $N (\log N)^2$. This work was supported by NASA under contract NASA NAG 2-871, the Office of Naval Research under grant N00014-89-J1286, the Army Research Office under contract DAAL03-91-G-0339, and the National Science Foundation under grant NSF-ECS-9224466. The computer time was provided by the National Center for Supercomputing Applications (NCSA) at the University of Illinois, Urbana-Champaign. Published in Micro. Opt. Tech. Lett., Vol. 7, No. 10, pp. 466-470, July, 1994. File:mlfma1.tex, January 13, 1995 1. Introduction Multilevel algorithms have been used to generate fast algorit...
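To make the cost claim concrete, the following is a minimal sketch of a CG iteration in which the matrix-vector product is passed in as a callback. It assumes a real, symmetric positive definite system, which is a simplification and not necessarily the setting of the paper; the names `conjugate_gradient` and `dense_matvec` are hypothetical. The dense callback costs $O(N^2)$ per call, and a fast method such as FMM or a multilevel algorithm would replace only that callback.

```python
def dense_matvec(a):
    """Return a callback computing a @ v for a dense matrix: O(N^2) per call."""
    return lambda v: [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x with x = 0
    p = list(r)                      # first search direction
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        ap = matvec(p)               # the dominant cost of each iteration
        alpha = rs / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


# Small illustrative system (made-up values): exact solution is [1/11, 7/11].
solution = conjugate_gradient(dense_matvec([[4.0, 1.0], [1.0, 3.0]]), [1.0, 2.0])
```

Every other step in the loop is $O(N)$, so the per-iteration cost follows the cost of `matvec`.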
Scheduling Threads for Low Space Requirement and Good Locality
In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), 1999
Cited by 17 (1 self)
The running time and memory requirement of a parallel program with dynamic, lightweight threads depend heavily on the underlying thread scheduler. In this paper, we present a simple, asynchronous, space-efficient scheduling algorithm for shared memory machines that combines the low scheduling overheads and good locality of work stealing with the low space requirements of depth-first schedulers. For a nested-parallel program with depth D and serial space requirement $S_1$, we show that the expected space requirement is $S_1 + O(K \cdot p \cdot D)$ on p processors. Here, K is a user-adjustable runtime parameter, which provides a tradeoff between running time and space requirement. Our algorithm achieves good locality and low scheduling overheads by automatically increasing the granularity of the work scheduled on each processor. We have implemented the new scheduling algorithm in the context of a native, user-level implementation of POSIX standard threads or Pthreads, and evaluated its p...
An Adaptive Software Library for Fast Fourier Transforms
In Proceedings of the International Conference on Supercomputing, 2000
Cited by 17 (2 self)
In this paper we present an adaptive and portable software library for the fast Fourier transform (FFT). The library consists of a number of composable blocks of code called codelets, each computing a part of the transform. The actual FFT algorithm used by the code is determined at run-time by selecting the fastest strategy among all possible strategies, given available codelets, for a given transform size. We also present an efficient automatic method of generating the library modules by using a special-purpose compiler. The code generator is written in C and it generates a library of C codelets. The code generator is shown to be flexible and extensible and the entire library can be generated in a matter of seconds. We have evaluated the library for performance on the IBM-SP2, SGI-2000, HP-Exemplar and Intel Pentium systems. We use the results from these evaluations to build performance models for the FFT library on different platforms. The library is shown to be portable, adaptive and efficient.
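The run-time selection idea can be illustrated with a small sketch that times each candidate strategy on a given transform size and keeps the fastest. This is not the library's interface: the two strategies below (a naive DFT and a radix-2 recursion) stand in for combinations of codelets, and the names `pick_fastest`, `naive_dft`, and `radix2_fft` are hypothetical.

```python
import cmath
import time


def naive_dft(x):
    """Direct O(n^2) evaluation of the discrete Fourier transform."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def radix2_fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = radix2_fft(x[0::2])
    odd = radix2_fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def pick_fastest(strategies, n, trials=3):
    """Time every strategy on a length-n input and return the fastest one's name."""
    sample = [complex(i % 7, 0) for i in range(n)]
    best_name, best_time = None, float("inf")
    for name, fn in strategies.items():
        start = time.perf_counter()
        for _ in range(trials):
            fn(sample)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name


strategies = {"naive": naive_dft, "radix2": radix2_fft}
chosen = pick_fastest(strategies, 64)
```

Which strategy wins depends on the machine and the transform size, which is the reason for measuring rather than fixing the choice in advance.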
Portable High-Performance Programs
1999
Cited by 16 (0 self)
right notice and this permission notice are preserved on all copies.
Workload Analysis and Demand Prediction of Enterprise Data Center Applications
Cited by 15 (3 self)
the creation of resource pools of servers that permit multiple application workloads to share each server in the pool. Understanding the nature of enterprise workloads is crucial to properly designing and provisioning current and future services in such pools. This paper considers issues of workload analysis, performance modeling, and capacity planning. Our goal is to automate the efficient use of resource pools when hosting large numbers of enterprise services. We use a trace-based approach for capacity management that relies on i) the characterization of workload demand patterns, ii) the generation of synthetic workloads that predict future demands based on the patterns, and iii) a workload placement recommendation service. The accuracy of capacity planning predictions depends on our ability to characterize workload demand patterns, to recognize trends for expected changes in future demands, and to reflect business forecasts for otherwise unexpected changes in future demands. A workload analysis demonstrates the burstiness and repetitive nature of enterprise workloads. Workloads are automatically classified according to their periodic behavior. The similarity among repeated occurrences of patterns is evaluated. Synthetic workloads are generated from the patterns in a manner that maintains the periodic nature, burstiness, and trending behavior of the workloads. A case study involving six months of data for 139 enterprise applications is used to apply and evaluate the enterprise workload analysis and related capacity planning methods. The results show that when consolidating to 8-processor systems, we predicted future per-server required capacity to within one processor 95% of the time. The accuracy of predictions for required capacity suggests that such resource savings can be achieved with little risk.
A Modified Split-Radix FFT With Fewer Arithmetic Operations
2007
Cited by 14 (3 self)
Recent results by Van Buskirk et al. have broken the record set by Yavne in 1968 for the lowest exact count of real additions and multiplications to compute a power-of-two discrete Fourier transform (DFT). Here, we present a simple recursive modification of the split-radix algorithm that computes the DFT with asymptotically about 6% fewer operations than Yavne, matching the count achieved by Van Buskirk’s program-generation framework. We also discuss the application of our algorithm to real-data and real-symmetric (discrete cosine) transforms, where we are again able to achieve lower arithmetic counts than previously published algorithms.
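For orientation, here is a minimal sketch of the conventional split-radix recursion that the paper takes as its starting point. It is not the modified algorithm and does not reproduce the reduced operation counts; it only shows the decomposition of a length-n DFT into one half-size and two quarter-size DFTs. The function name `split_radix_fft` is hypothetical, and the input length is assumed to be a power of two.

```python
import cmath


def split_radix_fft(x):
    """Conventional split-radix DFT of a sequence whose length is a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n == 2:
        return [complex(x[0] + x[1]), complex(x[0] - x[1])]
    u = split_radix_fft(x[0::2])    # even-indexed samples, length n/2
    z = split_radix_fft(x[1::4])    # samples with index 1 mod 4, length n/4
    zp = split_radix_fft(x[3::4])   # samples with index 3 mod 4, length n/4
    out = [0j] * n
    for k in range(n // 4):
        w1 = cmath.exp(-2j * cmath.pi * k / n)        # twiddle factor w^k
        w3 = cmath.exp(-2j * cmath.pi * 3 * k / n)    # twiddle factor w^(3k)
        s = w1 * z[k] + w3 * zp[k]
        d = w1 * z[k] - w3 * zp[k]
        out[k] = u[k] + s
        out[k + n // 2] = u[k] - s
        out[k + n // 4] = u[k + n // 4] - 1j * d
        out[k + 3 * n // 4] = u[k + n // 4] + 1j * d
    return out


spectrum = split_radix_fft([1, 2, 3, 4, 0, 0, 0, 0])
```

The arithmetic savings discussed in the abstract come from how the twiddle-factor multiplications are organized, which this sketch leaves as plain complex multiplications.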
Oblivious algorithms for multicores and network of processors
2009
Cited by 14 (5 self)
We address the design of parallel algorithms that are oblivious to machine parameters for two dominant machine configurations: the chip multiprocessor (or multicore) and the network of processors. First, and of independent interest, we propose HM, a hierarchical multi-level caching model for multicores, and we propose a multicore-oblivious approach to algorithms and schedulers for HM. We instantiate this approach with provably efficient multicore-oblivious algorithms for matrix and prefix sum computations, FFT, the Gaussian Elimination paradigm (which represents an important class of computations including Floyd-Warshall’s all-pairs shortest paths, Gaussian Elimination and LU decomposition without pivoting), sorting, list ranking, Euler tours and connected components. We then use the network-oblivious framework proposed earlier as an oblivious framework for a network of processors, and we present provably efficient network-oblivious algorithms for sorting, the Gaussian Elimination paradigm, list ranking, Euler tours and connected components. Many of these network-oblivious algorithms also perform efficiently when executed on the Decomposable-BSP.
Cache-oblivious algorithms (Extended Abstract)
In Proc. 40th Annual Symposium on Foundations of Computer Science, 1999
Cited by 10 (1 self)
This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size $Z$ and cache-line length $L$ where $Z = \Omega(L^2)$, the number of cache misses for an $m \times n$ matrix transpose is $\Theta(1 + mn/L)$. The number of cache misses for either an $n$-point FFT or the sorting of $n$ numbers is $\Theta(1 + (n/L)(1 + \log_Z n))$. We also give a $\Theta(mnp)$-work algorithm to multiply an $m \times n$ matrix by an $n \times p$ matrix that incurs $\Theta(1 + (mn + np + mp)/L + mnp/(L\sqrt{Z}))$ cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We also provide preliminary empirical results on the effectiveness of cache-oblivious algorithms in practice.
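The abstract does not spell out the transpose algorithm. A common divide-and-conquer formulation, sketched below under that assumption, repeatedly halves the longer dimension until the block is small, so that no cache size or line length appears in the code. The function name `transpose` and the `base` cutoff are hypothetical; `base` only limits recursion overhead and is not a hardware parameter. Python lists do not expose real cache behaviour, so this shows the recursion structure only.

```python
def transpose(a, b, r0, r1, c0, c1, base=16):
    """Write the transpose of the block a[r0:r1][c0:c1] into b, splitting the longer side."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Small block: copy element by element.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        mid = r0 + rows // 2
        transpose(a, b, r0, mid, c0, c1, base)
        transpose(a, b, mid, r1, c0, c1, base)
    else:
        mid = c0 + cols // 2
        transpose(a, b, r0, r1, c0, mid, base)
        transpose(a, b, r0, r1, mid, c1, base)


# Illustrative 5 x 7 input with made-up values.
m, n = 5, 7
a = [[i * n + j for j in range(n)] for i in range(m)]
b = [[0] * m for _ in range(n)]
transpose(a, b, 0, m, 0, n, base=2)
```

Because the recursion eventually produces blocks that fit in any level of cache, the same code is intended to behave well across the hierarchy without tuning.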
On the Conversion between Binary Code and Binary-Reflected Gray Code on Binary Cubes
IEEE Trans. Computers, 1991
Cited by 9 (5 self)
We present a new algorithm for conversion between binary code and binary-reflected Gray code that requires approximately $2K/3$ element transfers in sequence for $K$ elements per node, compared to $K$ element transfers for previously known algorithms. For a binary cube of $n = 2$ dimensions the new algorithm degenerates to yield a complexity of $K/2 + 1$ element transfers, which is optimal. The new algorithm is optimal within a factor of $1/3$ with respect to the best known lower bound for any routing strategy. We show that the minimum number of element transfers for minimum path length routing is $K$ with concurrent communication on all channels of every node of a binary cube. 1 Introduction. Minimizing the required data motion in memory hierarchies has been crucial in achieving high performance almost since the beginning of modern computer technology. In conventional memory hierarchies, minimizing data motion takes the form of preserving temporal and spatial locality of reference in sche...
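As background for the two encodings, the sketch below shows the standard mapping between a binary index and its binary-reflected Gray code. It covers only the code conversion itself, not the paper's data-routing algorithm on the cube; the function names are hypothetical.

```python
def binary_to_gray(b: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return b ^ (b >> 1)


def gray_to_binary(g: int) -> int:
    """Invert binary_to_gray by XOR-ing together all right shifts of g."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


codes = [binary_to_gray(i) for i in range(8)]
# codes == [0, 1, 3, 2, 6, 7, 5, 4]
```

Consecutive Gray codes differ in exactly one bit, which is why consecutive indices land on neighbouring nodes when the Gray code is used to embed an array in a binary cube.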

