Results 1 - 10 of 49
`C: A Language for High-Level, Efficient, and Machine-independent Dynamic Code Generation, 1996
Cited by 119 (9 self)
Dynamic code generation allows specialized code sequences to be crafted using runtime information. Since this information is by definition not available statically, the use of dynamic code generation can achieve performance inherently beyond that of static code generation. Previous attempts to support dynamic code generation have been low-level, expensive, or machine-dependent. Despite the growing use of dynamic code generation, no mainstream language provides flexible, portable, and efficient support for it. We describe ...
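The runtime-specialization idea this abstract describes can be sketched in Python (an illustrative analogy only; `C is a C extension, and nothing below is its actual API): code for a factor known only at runtime is generated, compiled, and returned as an ordinary function.

```python
# Illustrative sketch of dynamic code generation: Python's compile/exec
# stands in for a runtime code generator. All names are hypothetical;
# this is an analogy for the technique, not the `C language itself.

def make_scaler(factor):
    """Generate, at runtime, a function specialized for a factor
    that was not known statically."""
    src = f"def scale(x):\n    return x * {factor}\n"
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["scale"]

scale_by_3 = make_scaler(3)   # specialized code crafted from runtime information
print(scale_by_3(14))         # -> 42
```

A real dynamic code generator emits machine code rather than source text, which is where the cost, portability, and machine-independence trade-offs discussed in these papers arise.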
A Linear Algebra Framework for Static HPF Code Distribution, 1995
Cited by 81 (11 self)
High Performance Fortran (HPF) was developed to support data parallel programming for SIMD and MIMD machines with distributed memory. The programmer is provided a familiar uniform logical address space and specifies the data distribution by directives. The compiler then exploits these directives to allocate arrays in the local memories, to assign computations to elementary processors and to migrate data between processors when required. We show here that linear algebra is a powerful framework to encode HPF directives and to synthesize distributed code with space-efficient array allocation, tight loop bounds and vectorized communications for INDEPENDENT loops. The generated code includes traditional optimizations such as guard elimination, message vectorization and aggregation, overlap analysis... The systematic use of an affine framework makes it possible to prove the compilation scheme correct. An early version of this paper was presented at the Fourth International Workshop on Comp...
DCG: An Efficient, Retargetable Dynamic Code Generation System, in Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1994
Cited by 64 (9 self)
Dynamic code generation allows aggressive optimization through the use of runtime information. Previous systems typically relied on ad hoc code generators that were not designed for retargetability, and did not shield the client from machine-specific details. We present a system, DCG, that allows clients to specify dynamically generated code in a machine-independent manner. Our one-pass code generator is easily retargeted and extremely efficient (code generation costs approximately 350 instructions per generated instruction). Experiments show that dynamic code generation increases some application speeds by over an order of magnitude.
Spatial Computation, in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2004
Cited by 57 (12 self)
This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely distributed, with no centralized control. SC circuits are optimized for wires at the expense of computation units. In this paper we investigate a particular implementation of SC: ASH (Application-Specific Hardware). Under the assumption that computation is cheaper than communication, ASH replicates computation units to simplify interconnect, building a system which uses very simple, completely dedicated communication channels. As a consequence, communication on the datapath never requires arbitration; the only arbitration required is for accessing memory. ASH relies on very simple hardware primitives, using no associative structures, no multiported register files, no scheduling logic, no broadcast, and no clocks. As a consequence, ASH hardware is fast and extremely power efficient.
Operator Strength Reduction, 1995
Cited by 36 (9 self)
This paper presents a new algorithm for operator strength reduction, called OSR. OSR improves upon an earlier algorithm due to Allen, Cocke, and Kennedy [Allen et al. 1981]. OSR operates on the static single assignment (SSA) form of a procedure [Cytron et al. 1991]. By taking advantage of the properties of SSA form, we have derived an algorithm that is simple to understand, quick to implement, and, in practice, fast to run. Its asymptotic complexity is, in the worst case, the same as the Allen, Cocke, and Kennedy algorithm (ACK). OSR achieves optimization results that are equivalent to those obtained with the ACK algorithm. OSR has been implemented in several research and production compilers.
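The transformation itself is easy to see in a small before/after sketch. This illustrates the effect of operator strength reduction, not the OSR algorithm itself, which operates on SSA form inside a compiler: the per-iteration multiply by a loop induction variable is replaced by a running addition.

```python
# Before: one multiplication per iteration to compute each address.
def addresses_naive(base, stride, n):
    return [base + i * stride for i in range(n)]

# After strength reduction: the multiply becomes a repeated add on a
# new induction variable that tracks base + i * stride directly.
def addresses_reduced(base, stride, n):
    out = []
    addr = base
    for _ in range(n):
        out.append(addr)
        addr += stride
    return out

assert addresses_naive(0x1000, 4, 8) == addresses_reduced(0x1000, 4, 8)
```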
Primality proving using elliptic curves: An update, in Proceedings of ANTS III, 1998
Cited by 9 (1 self)
In 1986, following the work of Schoof on counting points on elliptic curves over finite fields, new algorithms for primality proving emerged, due to Goldwasser and Kilian on the one hand, and Atkin on the other. The latter algorithm uses the theory of complex multiplication. The algorithm, now called ECPP, has been used for nearly ten years. The purpose of this paper is to give an account of the recent theoretical and practical improvements of ECPP, as well as new benchmarks for integers of various sizes and a new primality record.
SSE Implementation of Multivariate PKCs on Modern x86 CPUs, in CHES 2009, LNCS, 2009
Cited by 8 (1 self)
Multivariate Public Key Cryptosystems (MPKCs) are often touted as future-proofing against the advent of the Quantum Computer. They are also known for efficiency compared to traditional alternatives. However, this advantage seems to be eroding with the increase of arithmetic resources in modern CPUs and improved algorithms, especially with respect to ECC. We show that the same hardware advances do not necessarily just favor ECC. The same modern commodity CPUs also have an overabundance of small integer arithmetic/logic resources, embodied by SSE2 or other vector instruction set extensions, that are also useful for MPKCs. On CPUs supporting Intel's SSSE3 instructions, we achieve a 4× speed-up over prior implementations of Rainbow-type systems (such as the ones implemented in hardware by Bogdanov et al. at CHES 2008) in both public and private map operations. Furthermore, if we want to implement MPKCs for all general-purpose 64-bit CPUs from Intel and AMD, we can switch to MPKCs over fields of relatively small odd prime characteristic. For example, by taking advantage of SSE2 instructions, Rainbow over F31 can be up to 2× faster than prior implementations of same-sized systems over F16. A key advance is in implementing Wiedemann instead of Gaussian system solvers. We explain the techniques and design choices in implementing our chosen MPKC instances over representative fields such as F31, F16 and F256. We believe that our results can easily carry over to modern FPGAs, which often contain a large number of multipliers in the form of DSP slices, offering superior computational power to odd-field MPKCs.
N-bit unsigned division via n-bit multiply-add, in 17th IEEE Symposium on Computer Arithmetic, 2005
Cited by 6 (0 self)
Integer division on modern processors is expensive compared to multiplication. Previous algorithms for performing unsigned division by an invariant divisor, via reciprocal approximation, suffer in the worst case from a common requirement for n+1 bit multiplication, which typically must be synthesized from n-bit multiplication and extra arithmetic operations. This paper presents, and proves, a hybrid of previous algorithms that replaces n+1 bit multiplication with a single fused multiply-add operation on n-bit operands, thus reducing any n-bit unsigned division to the upper n bits of a multiply-add, followed by a single right shift. An additional benefit is that the prerequisite calculations are simple and fast. On the Itanium® 2 processor, the technique is advantageous for as few as two quotients that share a common run-time divisor.
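The two prior schemes the paper hybridizes, a rounded-up reciprocal multiplier and a rounded-down multiplier combined with an increment (algebraically a multiply-add, since m*(n+1) = n*m + m), can be demonstrated exhaustively at a toy word size. This is a sketch of the idea, not the paper's construction or proof; for N = 8 it verifies by brute force that every divisor admits some N-bit scheme.

```python
# For word size N, search for an N-bit multiplier m and shift k such that
# either q = (n*m) >> k (round-down or round-up reciprocal) or
# q = (n*m + m) >> k (round-down reciprocal with the increment trick,
# i.e. a multiply-add on N-bit operands) computes n // d for all N-bit n.
N = 8
WORD = 1 << N

def find_scheme(d):
    for k in range(2 * N + 1):
        m_down = (1 << k) // d
        for m, add in ((m_down, False), (m_down + 1, False), (m_down, True)):
            if 0 < m < WORD and all(
                (n * m + (m if add else 0)) >> k == n // d for n in range(WORD)
            ):
                return m, k, add          # verified exhaustively above
    return None

m, k, add = find_scheme(7)
print((200 * m + (m if add else 0)) >> k)   # -> 28, i.e. 200 // 7
```

Some divisors only admit the multiply-add form, which is the paper's motivation for using a single fused multiply-add instead of synthesizing an (n+1)-bit multiply.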
Strength Reduction of Integer Division and Modulo Operations, 2001
Cited by 5 (0 self)
Integer division, modulo, and remainder operations are expressive and useful operations. They are logical candidates to express complex data accesses such as the wrap-around behavior in queues using ring buffers. In addition, they appear frequently in address computations as a result of compiler optimizations that improve data locality, perform data distribution, or enable parallelization. Experienced application programmers, however, avoid them because they are slow. Furthermore, while advances in both hardware and software have improved the performance of many parts of a program, few are applicable to division and modulo operations. This trend makes these operations increasingly detrimental to program performance.
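The ring-buffer case mentioned above is the canonical example. A hand-done sketch of the kind of transformation the paper applies automatically to loops (illustrative only): the per-iteration modulo becomes a counter with a conditional reset.

```python
# Before: a modulo on every iteration to wrap the ring-buffer index.
def sum_ring_mod(buf, steps):
    n = len(buf)
    return sum(buf[i % n] for i in range(steps))

# After strength reduction: the modulo becomes a counter with a
# conditional reset, far cheaper than a division-based i % n.
def sum_ring_reduced(buf, steps):
    n = len(buf)
    total, j = 0, 0
    for _ in range(steps):
        total += buf[j]
        j += 1
        if j == n:          # wrap-around without a modulo
            j = 0
    return total

assert sum_ring_mod([1, 2, 3], 7) == sum_ring_reduced([1, 2, 3], 7)  # both 13
```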
Improved Division by Invariant Integers, IEEE Transactions on Computers
Cited by 4 (0 self)
This paper considers the problem of dividing a two-word integer by a single-word integer, together with a few extensions and applications. Due to lack of efficient division instructions in current processors, the division is performed as a multiplication using a precomputed single-word approximation of the reciprocal of the divisor, followed by a couple of adjustment steps. There are three common types of unsigned multiplication instructions; we define full word multiplication (umul) which produces the two-word product of two single-word integers, low multiplication (umullo) which produces only the least significant word of the product, and high multiplication (umulhi), which produces only the most significant word. We describe an algorithm which produces a quotient and remainder using one umul and one umullo. This is an improvement over earlier methods, since the new method uses cheaper multiplication operations. It turns out we also get some additional savings from simpler adjustment conditions. The algorithm has been implemented in version 4.3 of the GMP library. When applied to the problem of dividing a large integer by a single word, the new algorithm gives a speedup of roughly 30%, benchmarked on AMD and Intel processors in the x86-64 family.
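The structure described in the abstract (one full multiply by the precomputed reciprocal, one low multiply for the remainder candidate, then a couple of adjustment steps) can be simulated with Python integers at an arbitrary word size W. This is a sketch reconstructed from the published description, not GMP's implementation; the exact adjustment conditions are the delicate part.

```python
W = 8                     # toy word size; GMP uses 64
BASE = 1 << W
MASK = BASE - 1

def reciprocal(d):
    # Precomputed single-word reciprocal approximation for a normalized
    # divisor (top bit set): v = floor((2^(2W) - 1) / d) - 2^W.
    assert BASE // 2 <= d < BASE
    return (BASE * BASE - 1) // d - BASE

def div2by1(u1, u0, d, v):
    # Divide the two-word value u1*2^W + u0 by d, requiring u1 < d.
    q = v * u1 + ((u1 << W) | u0)     # one full multiply (umul) plus add
    q1 = (q >> W) + 1                 # quotient estimate
    q0 = q & MASK
    r = (u0 - q1 * d) & MASK          # one low multiply (umullo), mod 2^W
    if r > q0:                        # estimate was one too high
        q1 -= 1
        r = (r + d) & MASK
    if r >= d:                        # rare: estimate was one too low
        q1 += 1
        r -= d
    return q1, r

v = reciprocal(200)
print(div2by1(3, 17, 200, v))         # 785 // 200, 785 % 200 -> (3, 185)
```

The point of the design is that both multiplications are the cheap kinds: the high half of q comes from the umul, and q1*d only needs its low word, which is what makes the adjustment conditions simpler than in earlier reciprocal methods.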