Results 1  10
of
83
Sparse matrix solvers on the GPU: conjugate gradients and multigrid
 ACM Trans. Graph
, 2003
"... Permission to make digital/hard copy of part of all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given ..."
Abstract

Cited by 216 (3 self)
 Add to MetaCart
Permission to make digital/hard copy of part of all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission
minimization in IC Design: Principles and applications
 ACM Transactions on Design Automation of Electronic Systems
, 1996
"... Low power has emerged as a principal theme in today’s electronics industry. The need for low power has caused a major paradigm shift in which power dissipation is as important as performance and area. This article presents an indepth survey of CAD methodologies and techniques for designing low powe ..."
Abstract

Cited by 158 (29 self)
 Add to MetaCart
Low power has emerged as a principal theme in today’s electronics industry. The need for low power has caused a major paradigm shift in which power dissipation is as important as performance and area. This article presents an indepth survey of CAD methodologies and techniques for designing low power digital CMOS circuits and systems and describes the many issues facing designers at architectural, logic and physical levels of design abstraction. It reviews some of the techniques and tools that have been proposed to overcome these difficulties and outlines the future challenges that must be met to design low power, high performance systems. 1.
PathBased Scheduling for Synthesis
 IEEE TRANSACTIONS ON COMPUTERAIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS
, 1991
"... In the context of synthesis, scheduling assigns operations to control steps. Operations are the atomic components used for describing behavior, for example, arithmetic and Boolean operations. They are ordered partially by data dependencies (dataflow graph) and by control constructs such as condit ..."
Abstract

Cited by 87 (0 self)
 Add to MetaCart
In the context of synthesis, scheduling assigns operations to control steps. Operations are the atomic components used for describing behavior, for example, arithmetic and Boolean operations. They are ordered partially by data dependencies (dataflow graph) and by control constructs such as conditional branches and loops (controlflow graph). A control step usually corresponds to one state, one clock cycle, or one microprogram step. This paper presents a new, pathbased scheduling algorithm. It yields solutions with the minimum number of control steps, taking into account arbitrary constraints that limit the amount of operations in each control step. The result is a finite state machine that implements the control. Although the complexity of the algorithm is proportional to the number of paths in the controlflow graph, it is shown to be practical for large examples with thousands of nodes.
MULTIPROCESSOR SCHEDULING TO ACCOUNT FOR INTERPROCESSOR COMMUNICATION
, 1991
"... Interprocessor communication (PC) overheads have emerged as the major performance limitation in parallel processing systems, due to the transmission delays, synchronization overheads, and conflicts for shared communication resources created by data exchange. Accounting for these overheads is essenti ..."
Abstract

Cited by 67 (11 self)
 Add to MetaCart
Interprocessor communication (PC) overheads have emerged as the major performance limitation in parallel processing systems, due to the transmission delays, synchronization overheads, and conflicts for shared communication resources created by data exchange. Accounting for these overheads is essential for attaining efficient hardware utilization. This thesis introduces two new compiletime heuristics for scheduling precedence graphs onto multiprocessor architectures, which account for interprocessor communication overheads and interconnection constraints in the architecture. These algorithms perform scheduling and routing simultaneously to account for irregular interprocessor interconnections, and schedule all communications as well as all computations to eliminate shared resource contention. The first technique, called dynamiclevel scheduling, modifies the classical HLFET list scheduling strategy to account for IPC and synchronization overheads. By using dynamically changing priorities to match nodes and processors at each step, this technique attains an equitable tradeoff between load balancing and interprocessor communication cost. This method is fast, flexible, widely targetable, and displays promising perforrnance. The second technique, called declustering, establishes a parallelism hierarchy upon the precedence graph using graphanalysis techniques which explicitly address the tradeoff between exploiting parallelism and incurring communication cost. By systematically decomposing this hierarchy, the declustering process exposes parallelism instances in order of importance, assuring efficient use of the available processing resources. In contrast with traditional clustering schemes, this technique can adjust the level of cluster granularity to suit the characteristics of the specified architecture, leading to a more effective solution.
Logic Decomposition during Technology Mapping. submitted to
 IEEE Trans. CAD
, 1995
"... A problem in technology mapping is that quality of the final implementation depends significantly on the initially provided circuit structure. To resolve this problem, conventional techniques iteratively but separately apply technology independent transformations and technology mapping. In this pape ..."
Abstract

Cited by 55 (0 self)
 Add to MetaCart
A problem in technology mapping is that quality of the final implementation depends significantly on the initially provided circuit structure. To resolve this problem, conventional techniques iteratively but separately apply technology independent transformations and technology mapping. In this paper, we propose a procedure which performs logic decomposition and technology mapping simultaneously. We show that the procedure effectively explores all possible algebraic decompositions. It finds an optimal tree implementation over all the circuit structures examined, while the run time is typically logarithmic in the number of decompositions. 1
Optimizing TwoPhase, LevelClocked Circuitry (Extended Abstract)
"... We investigate two strategies for reducing the clock period of a twophase, levelclocked circuit: clock tuning, which adjusts the waveforms that clock the circuit, and retiming, which relocates circuit latches. These methods can be used to convert a circuit with edgetriggered latches into a faster ..."
Abstract

Cited by 54 (16 self)
 Add to MetaCart
We investigate two strategies for reducing the clock period of a twophase, levelclocked circuit: clock tuning, which adjusts the waveforms that clock the circuit, and retiming, which relocates circuit latches. These methods can be used to convert a circuit with edgetriggered latches into a faster levelclocked one. We model a twophase circuit as a graph whose vertex set V is a collection of combinational logic blocks, and whose edge set E is a set of interconnections. Each interconnection passes through 0 or more latches, where each latch is clocked by one of two periodic, nonoverlapping waveforms, or phases. We give efficient polynomialtime algorithms for problems involving the timing verification and optimization of twophase circuitry. Included are algorithms for ffl verifyi...
Scheduling And Behavioral Transformations For Parallel Systems
, 1993
"... In a parallel system, either a VLSI architecture in hardware or a parallel program in software, the quality of the final design depends on the ability of a synthesis system to exploit the parallelism hidden in the input description of applications. Since iterative or recursive algorithms are usually ..."
Abstract

Cited by 39 (3 self)
 Add to MetaCart
In a parallel system, either a VLSI architecture in hardware or a parallel program in software, the quality of the final design depends on the ability of a synthesis system to exploit the parallelism hidden in the input description of applications. Since iterative or recursive algorithms are usually the most timecritical parts of an application, the parallelism embedded in the repetitive pattern of an iterative algorithm needs to be explored. This thesis studies techniques and algorithms to expose the parallelism in an iterative algorithm so that the designer can find an implementation achieving a desired execution rate. In particular, the objective is to find an efficient schedule to be executed iteratively. A form of dataflow graphs is used to model the iterative part of an application, e.g. a digital signal filter or the while/for loop of a program. Nodes in the graph represent operations to be performed and edges represent both intraiteration and interiteration precedence relat...
Understanding Retiming through Maximum AverageDelay Cycles
 Mathematical Systems Theory
, 1994
"... A synchronous circuit built of functional elements and registers is a simple implementation of the semisystolic model of computation that can be used to design parallel algorithms. Retiming is a wellknown technique that transforms a given circuit into a faster circuit by relocating its registers. W ..."
Abstract

Cited by 31 (8 self)
 Add to MetaCart
A synchronous circuit built of functional elements and registers is a simple implementation of the semisystolic model of computation that can be used to design parallel algorithms. Retiming is a wellknown technique that transforms a given circuit into a faster circuit by relocating its registers. We give tight bounds on the minimum clock period that can be achieved by retiming a synchronous circuit. These bounds are expressed in terms of the maximum delaytoregister ratio of the cycles in the circuit graph and the maximum propagation delay d max of the circuit components. Our bounds do not depend on the size of the circuit, and they are of theoretical as well as practical interest. They characterize exactly the minimum clock period that can be achieved by retiming a unitdelay circuit, and they lead to more efficient algorithms for several important problems related to retiming. Specifically, we give an O(V 1=2 E lg V ) algorithm for minimum clock period retiming of unitdelay circu...
Optimal Retiming of MultiPhase, LevelClocked Circuits
 In Advanced Research in VLSI and Parallel Systems: Proc. of the 1992 Brown/MIT Conference
, 1991
"... Using levelsensitive latches instead of edgetriggered registers for storage elements in a synchronous system can lead to faster and less expensive circuit implementations. This advantage derives from an increased flexibility in scheduling the computations performed by the circuit. In edgeclocked ..."
Abstract

Cited by 25 (2 self)
 Add to MetaCart
Using levelsensitive latches instead of edgetriggered registers for storage elements in a synchronous system can lead to faster and less expensive circuit implementations. This advantage derives from an increased flexibility in scheduling the computations performed by the circuit. In edgeclocked circuits the amount of time available for the computation between two registers is precisely the length of the clock cycle, while in circuits using levelsensitive latches a computation can borrow time across latches thus reducing the amount of dead time in the clock cycle. In either type of circuit, achieving maximum performance requires locating the storage elements in such a way as to spread the computation uniformly across a number of clock cycles. Retiming is the process of rearranging the storage elements in a circuit to reduce the cycle time or the number of storage elements without changing the interface behavior of the circuit as viewed by an outside host. Retiming in effect resched...
The Practical Application of Retiming to the Design of HighPerformance Systems
 In Proceedings of the 1993 IEEE/ACM International Conference on Computer Aided Design
, 1993
"... Many advances have been made recently in the theory of circuit retiming, especially for circuits that use levelsensitive latches. In spite of this, automatic retiming tools have seen relatively little use in practice. One reason for this is the lack of good speedup results when retiming has been ap ..."
Abstract

Cited by 25 (0 self)
 Add to MetaCart
Many advances have been made recently in the theory of circuit retiming, especially for circuits that use levelsensitive latches. In spite of this, automatic retiming tools have seen relatively little use in practice. One reason for this is the lack of good speedup results when retiming has been applied to real circuits. Another reason is that retiming has used a rather simple circuit model which reduces its utility in practice. This paper addresses both of these issues. We suggest that the reason for the poor results reported for retiming is that retiming has been applied too late in the design process when there is little flexibility for performance improvement. We give an example of using retiming early in the design process to achieve better performance while at the same time simplifying the design process itself. We then describe an extension to the retiming circuit model that includes clock skew as well as latch propagation delay, setup and hold parameters. Including these param...