Results 1 - 10 of 17
T.: Exploiting the Power of GPUs for Asymmetric Cryptography
, 2008
Cited by 48 (0 self)
Abstract. Modern Graphics Processing Units (GPU) have reached a dimension with respect to performance and gate count exceeding conventional Central Processing Units (CPU) by far. Many modern computer systems include – beside a CPU – such a powerful GPU which runs idle most of the time and might be used as cheap and instantly available coprocessor for general purpose applications. In this contribution, we focus on the efficient realisation of the computationally expensive operations in asymmetric cryptosystems on such off-the-shelf GPUs. More precisely, we present improved and novel implementations employing GPUs as accelerator for RSA and DSA cryptosystems as well as for Elliptic Curve Cryptography (ECC). Using a recent Nvidia 8800GTS graphics card, we are able to compute 813 modular exponentiations per second for RSA or DSA-based systems with 1024 bit integers. Moreover, our design for ECC over the prime field P-224 even achieves the throughput of 1412 point multiplications per second.
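The throughput figure above counts modular exponentiations, the core operation of RSA and DSA. As a point of reference only, the operation can be sketched with the textbook square-and-multiply method. This is not the GPU implementation from the paper; the function name `modexp` is a hypothetical helper chosen for illustration.

```python
def modexp(base: int, exponent: int, modulus: int) -> int:
    """Left-to-right binary (square-and-multiply) modular exponentiation.

    Textbook sketch, not the paper's GPU kernel.
    """
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    result = 1 % modulus
    base %= modulus
    # Scan the exponent bits from most significant to least significant.
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus      # square for every bit
        if bit == "1":
            result = (result * base) % modulus    # multiply when the bit is set
    return result


print(modexp(4, 13, 497))  # 445
```

A 1024-bit exponent costs roughly a thousand squarings plus a multiplication for each set bit, which is why this operation dominates the running time of RSA and DSA.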
SSLShader: Cheap SSL Acceleration with Commodity Processors
 In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI’11
, 2011
Cited by 33 (7 self)
Secure end-to-end communication is becoming increasingly important as more private and sensitive data is transferred on the Internet. Unfortunately, today’s SSL deployment is largely limited to security or privacy-critical domains. The low adoption rate is mainly attributed to the heavy cryptographic computation overhead on the server side, and the cost of good privacy on the Internet is tightly bound to expensive hardware SSL accelerators in practice. In this paper we present high-performance SSL acceleration using commodity processors. First, we show that modern graphics processing units (GPUs) can be easily converted to general-purpose SSL accelerators. By exploiting the massive computing parallelism of GPUs, we accelerate SSL cryptographic operations beyond what state-of-the-art CPUs provide. Second, we build a transparent SSL proxy, SSLShader, that carefully leverages the trade-offs of recent hardware features such as AES-NI and NUMA and achieves both high throughput and low latency. In our evaluation, the GPU implementation of RSA shows a factor of 22.6 to 31.7 improvement over the fastest CPU implementation. SSLShader achieves 29K transactions per second for small files while it transfers large files at 13 Gbps on a commodity server machine. These numbers are comparable to high-end commercial SSL appliances at a fraction of their price.
Toward Acceleration of RSA Using 3D Graphics Hardware
Cited by 27 (1 self)
Abstract. Demand in the consumer market for graphics hardware that accelerates rendering of 3D images has resulted in commodity devices capable of astonishing levels of performance. These results were achieved by specifically tailoring the hardware for the target domain. As graphics accelerators become increasingly programmable however, this performance has made them an attractive target for other domains. Specifically, they have motivated the transformation of costly algorithms from a general purpose computational model into a form that executes on said graphics hardware. We investigate the implementation and performance of modular exponentiation using a graphics accelerator, with the view of using it to execute operations required in the RSA public key cryptosystem.
Efficient Acceleration of Asymmetric Cryptography on Graphics Hardware
 AFRICACRYPT 2009
, 2009
Cited by 18 (1 self)
Graphics processing units (GPU) are increasingly being used for general purpose computing. We present implementations of large integer modular exponentiation, the core of public-key cryptosystems such as RSA, on a DirectX 10 compliant GPU. DirectX 10 compliant graphics processors are the latest generation of GPU architecture, which provide increased programming flexibility and support for integer operations. We present high performance modular exponentiation implementations based on integers represented in both standard radix form and residue number system form. We show how a GPU implementation of a 1024-bit RSA decrypt primitive can outperform a comparable CPU implementation by up to 4 times and also improve the performance of previous GPU implementations by decreasing latency by up to 7 times and doubling throughput. We present how an adaptive approach to modular exponentiation involving implementations based on both a radix and a residue number system gives the best all-around performance on the GPU both in terms of latency and throughput. We also highlight the usage criteria necessary to allow the GPU to reach peak performance on public key cryptographic operations.
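The residue number system (RNS) form mentioned above can be illustrated with a minimal sketch. It assumes a toy base of four small pairwise coprime moduli (the constant `MODULI` and the helper names are hypothetical, not taken from the paper) and Python 3.8+ for `pow(x, -1, m)`. It shows why the representation suits parallel hardware: each residue channel is multiplied independently, with no carries between channels.

```python
from math import prod

# Hypothetical toy base: pairwise coprime moduli, dynamic range M = 53040.
MODULI = (13, 15, 16, 17)


def to_rns(x, moduli=MODULI):
    """Represent x by its residues modulo each base element."""
    return tuple(x % m for m in moduli)


def rns_mul(a, b, moduli=MODULI):
    """Channel-wise multiplication; every channel is independent."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI):
    """Reconstruct the integer in [0, M) with the Chinese Remainder Theorem."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)  # modular inverse, Python 3.8+
    return total % big_m


product = rns_mul(to_rns(123), to_rns(321))
print(from_rns(product))  # 39483
```

The result is only correct while the true product stays below the product of the moduli, so a real implementation sizes the base to the operand length.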
Modular Multiplication and Base Extensions in Residue Number Systems
 IN 15TH IEEE SYMPOSIUM ON COMPUTER ARITHMETIC
, 2001
Cited by 14 (6 self)
We present a new RNS modular multiplication for very large operands. The algorithm is based on Montgomery's method adapted to residue arithmetic. By choosing the moduli of the RNS system reasonably large, an effect corresponding to a redundant high-radix implementation is achieved, due to the carry-free nature of residue arithmetic. The actual computation in the multiplication takes place in constant time, where the unit of time is a few simple residue operations. However, it is necessary twice to convert values from one residue system into another, operations which take O(n) time on O(n) processors, where n is the number of moduli in the RNS systems. Thus these conversions are the bottlenecks of the method, and any future improvements in RNS base conversions, or the use of particular residue systems, can immediately be applied.
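For orientation, Montgomery's method in its classic binary-radix form can be sketched as follows. This is not the RNS algorithm of the paper and omits the base conversions the abstract identifies as the bottleneck; it only shows the reduction step that the paper adapts. The function names and the parameter `k` (with R = 2^k) are illustrative assumptions, and `pow(n, -1, r)` needs Python 3.8+.

```python
def montgomery_setup(n: int, k: int):
    """Precompute R = 2**k and n' with n * n' = -1 (mod R), for odd n < R."""
    if n % 2 == 0 or n >= (1 << k):
        raise ValueError("n must be odd and smaller than 2**k")
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r
    return r, n_prime


def mont_reduce(t: int, n: int, k: int, n_prime: int) -> int:
    """Return t * R^-1 mod n for 0 <= t < n * R, using only masks and shifts."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> k          # exact division: t + m*n is a multiple of R
    return u - n if u >= n else u


def mont_mul(a: int, b: int, n: int, k: int) -> int:
    """Compute a * b mod n by way of the Montgomery domain (a, b in [0, n))."""
    r, n_prime = montgomery_setup(n, k)
    a_bar = (a * r) % n                                    # into Montgomery form
    b_bar = (b * r) % n
    prod_bar = mont_reduce(a_bar * b_bar, n, k, n_prime)   # a*b*R mod n
    return mont_reduce(prod_bar, n, k, n_prime)            # back to normal form


print(mont_mul(7, 15, 17, 5))  # 3
```

In a long exponentiation the conversions into and out of Montgomery form are done once, so each intermediate multiplication costs a single `mont_reduce` rather than a division by the modulus.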
A Hardware Algorithm for Modular Multiplication/Division
 IEEE TRANSACTIONS ON COMPUTERS
, 2005
Cited by 6 (0 self)
A mixed radix-4/2 algorithm for modular multiplication/division suitable for VLSI implementation is proposed. The algorithm is based on Montgomery method for modular multiplication and on the extended Binary GCD algorithm for modular division. Both algorithms are modified and combined into the proposed algorithm so that almost all the hardware components are shared. The new algorithm carries out both calculations using simple operations such as shifts, additions, and subtractions. The radix-2 signed-digit representation is used to avoid carry propagation in all additions and subtractions. A modular multiplier/divider based on the algorithm performs an n-bit modular multiplication/division in O(n) clock cycles where the length of the clock cycle is constant and independent of n. The modular multiplier/divider has a linear array structure with a bit-slice feature and can be implemented with much smaller hardware than that necessary to implement both multiplier and divider separately.
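The idea of dividing modulo m with nothing but shifts, additions, and subtractions can be sketched in software with the standard extended binary GCD. This is a plain-integer illustration under the assumption of an odd modulus; it is not the mixed radix-4/2 signed-digit hardware algorithm of the paper, and `mod_div` is a hypothetical name.

```python
from math import gcd


def mod_div(b: int, a: int, m: int) -> int:
    """Return b / a mod m for odd m >= 3 and gcd(a, m) == 1.

    Invariants: a * x1 = u * b (mod m) and a * x2 = v * b (mod m).
    """
    if m < 3 or m % 2 == 0:
        raise ValueError("modulus must be odd and at least 3")
    a %= m
    if gcd(a, m) != 1:
        raise ValueError("divisor is not invertible modulo m")
    u, v = a, m
    x1, x2 = b % m, 0
    while u != 1 and v != 1:
        while u % 2 == 0:                      # halve u, keep x1 consistent
            u >>= 1
            x1 = x1 >> 1 if x1 % 2 == 0 else (x1 + m) >> 1
        while v % 2 == 0:                      # halve v, keep x2 consistent
            v >>= 1
            x2 = x2 >> 1 if x2 % 2 == 0 else (x2 + m) >> 1
        if u >= v:
            u -= v
            x1 -= x2
            if x1 < 0:
                x1 += m
        else:
            v -= u
            x2 -= x1
            if x2 < 0:
                x2 += m
    return x1 if u == 1 else x2


print(mod_div(3, 7, 11))  # 2
```

Halving an odd `x1` is made possible by first adding the odd modulus, which leaves the value unchanged modulo m; this is the same trick that lets a hardware design avoid any true division.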
Acceleration of composite order bilinear pairing on graphics hardware
 Information and Communications Security, volume 7618 of LNCS
, 2012
Cited by 4 (0 self)
Abstract. Recently, composite-order bilinear pairing has been shown to be useful in many cryptographic constructions. However, it is time-costly to evaluate. This is because the composite order should be at least 1024-bit and, hence, the elliptic curve group order n and base field become too large, rendering the bilinear pairing algorithm itself too slow to be practical (e.g., the Miller loop is Ω(n)). Thus, composite-order computation easily becomes the bottleneck of a cryptographic construction, especially, in the case where many pairings need to be evaluated at the same time. The existing solution to this problem that converts composite-order pairings to prime-order ones is only valid for certain constructions. In this paper, we leverage the huge number of threads available on Graphics Processing Units (GPUs) to speed up composite-order pairing computation. We investigate suitable SIMD algorithms for base field, extension field, elliptic curve and bilinear pairing computation as well as mapping these algorithms into GPUs with careful considerations. Experimental results show that our method achieves a record of 8.7 ms per pairing on a 1024-bit security level, which is a 20-fold speedup compared to state-of-the-art CPU implementation. This result also opens the road to adopting higher security levels and using rich-resource parallel platforms, which for example are available in cloud computing. In fact, we can achieve more than 24 times speedup on a 2048-bit security level and a record of 7 × 10^-6 USD per pairing on the Amazon cloud computing environment.
Leak Resistant Arithmetic
Cited by 4 (1 self)
In this paper we show how the usage of Residue Number Systems (RNS) can easily be turned into a natural defense against many side-channel attacks (SCA). We introduce a Leak Resistant Arithmetic (LRA), and present its capacities to defeat timing, power (SPA, DPA) and electromagnetic (EMA) attacks.
Improving Modular Inversion in RNS using the Plus-Minus Method
, 2013
Cited by 2 (0 self)
Abstract. The paper describes a new RNS modular inversion algorithm based on the extended Euclidean algorithm and the plus-minus trick. In our algorithm, comparisons over large RNS values are replaced by cheap computations modulo 4. Comparisons to an RNS version based on Fermat’s little theorem were carried out. The number of elementary modular operations is significantly reduced: a factor 12 to 26 for multiplications and 6 to 21 for additions. Virtex 5 FPGAs implementations show that for a similar area, our plus-minus RNS modular inversion is 6 to 10 times faster.
A high speed pairing coprocessor using RNS and lazy reduction. Cryptology ePrint Archive, Available from http://eprint.iacr.org
, 2011
Cited by 1 (0 self)
Abstract. In this paper, we present a high speed pairing coprocessor using Residue Number System (RNS) and lazy reduction. We show that combining RNS, which are naturally suitable for parallel architectures, and lazy reduction, which performs one reduction for more than one multiplication, the computational complexity of pairings can be largely reduced. The design is prototyped on a Xilinx Virtex-6 FPGA, which utilizes 7023 slices and 32 DSPs, and finishes one 254-bit optimal ate pairing computation in 0.664 ms.
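Lazy reduction, as described above, performs one modular reduction for several multiplications. A minimal sketch of the idea, assuming a quadratic extension field built as F_p[i]/(i^2 + 1) with a toy prime p = 103 (p ≡ 3 mod 4 so that the construction is a field), is given below. The field choice, the prime, and the function names are illustrative assumptions, not details from the paper, and the sketch uses plain integers rather than RNS.

```python
P = 103  # hypothetical toy prime with P % 4 == 3


def fp2_mul_eager(a, b, p=P):
    """Multiply (a0 + a1*i) * (b0 + b1*i), reducing after every product."""
    a0, a1 = a
    b0, b1 = b
    c0 = ((a0 * b0) % p - (a1 * b1) % p) % p
    c1 = ((a0 * b1) % p + (a1 * b0) % p) % p
    return c0, c1


def fp2_mul_lazy(a, b, p=P):
    """Same product, but accumulate unreduced products and reduce once
    per output coefficient."""
    a0, a1 = a
    b0, b1 = b
    c0 = (a0 * b0 - a1 * b1) % p   # one reduction for two multiplications
    c1 = (a0 * b1 + a1 * b0) % p   # one reduction for two multiplications
    return c0, c1


x, y = (5, 77), (98, 12)
assert fp2_mul_eager(x, y) == fp2_mul_lazy(x, y)
```

The saving comes at the price of wider intermediate values, which is cheap in RNS because extra dynamic range only means extra independent residue channels.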