Results 1  10
of
21
T.: Exploiting the Power of GPUs for Asymmetric Cryptography
, 2008
"... Abstract. Modern Graphics Processing Units (GPU) have reached a dimension with respect to performance and gate count exceeding conventional Central Processing Units (CPU) by far. Many modern computer systems include – beside a CPU – such a powerful GPU which runs idle most of the time and might be u ..."
Abstract

Cited by 30 (0 self)
 Add to MetaCart
Abstract. Modern Graphics Processing Units (GPU) have reached a dimension with respect to performance and gate count exceeding conventional Central Processing Units (CPU) by far. Many modern computer systems include – beside a CPU – such a powerful GPU which runs idle most of the time and might be used as cheap and instantly available coprocessor for general purpose applications. In this contribution, we focus on the efficient realisation of the computationally expensive operations in asymmetric cryptosystems on such offtheshelf GPUs. More precisely, we present improved and novel implementations employing GPUs as accelerator for RSA and DSA cryptosystems as well as for Elliptic Curve Cryptography (ECC). Using a recent Nvidia 8800GTS graphics card, we are able to compute 813 modular exponentiations per second for RSA or DSAbased systems with 1024 bit integers. Moreover, our design for ECC over the prime field P224 even achieves the throughput of 1412 point multiplications per second.
Practical symmetric key cryptography on modern graphics hardware
 In Proc. USENIX ’08
"... Graphics processors are continuing their trend of vastly outperforming CPUs while becoming more general purpose. The latest generation of graphics processors have introduced the ability handle integers natively. This has increased the GPU’s applicability to many fields, especially cryptography. This ..."
Abstract

Cited by 24 (3 self)
 Add to MetaCart
Graphics processors are continuing their trend of vastly outperforming CPUs while becoming more general purpose. The latest generation of graphics processors have introduced the ability handle integers natively. This has increased the GPU’s applicability to many fields, especially cryptography. This paper presents an application oriented approach to block cipher processing on GPUs. A new block based conventional implementation of AES on an Nvidia G80 is shown with 410x speed improvements over CPU implementations and 24x speed increase over the previous fastest AES GPU implementation. We outline a general purpose data structure for representing cryptographic client requests which is suitable for execution on a GPU. We explore the issues related to the mapping of this general structure to the GPU. Finally we present the first analysis of the main encryption modes of operation on a GPU, showing the performance and behavioural implications of executing these modes under the outlined general purpose data model. Our AES implementation is used as the underlying block cipher to show the overhead of moving from an optimised hardcoded approach to a generalised one. 1
ECM on Graphics Cards
"... Abstract. This paper reports recordsetting performance for the ellipticcurve method of integer factorization: for example, 604.99 curves/second for ECM stage 1 with B1 = 8192 for 280bit integers on a single PC. The stateoftheart GMPECM software handles 171.42 curves/second for ECM stage 1 with ..."
Abstract

Cited by 13 (4 self)
 Add to MetaCart
Abstract. This paper reports recordsetting performance for the ellipticcurve method of integer factorization: for example, 604.99 curves/second for ECM stage 1 with B1 = 8192 for 280bit integers on a single PC. The stateoftheart GMPECM software handles 171.42 curves/second for ECM stage 1 with B1 = 8192 for 280bit integers using all four cores of a 2.4GHz Core 2 Quad Q6600. The extra speed takes advantage of extra hardware, specifically two NVIDIA GTX 280 graphics cards, using a new ECM implementation introduced in this paper. Our implementation uses Edwards curves, relies on new parallel addition formulas, and is carefully tuned for the highly parallel GPU architecture. On a single GTX 280 the implementation performs 22.66 million modular multiplications per second for a general 280bit modulus. GMPECM, using all four cores of a Q6600, performs 17.91 million multiplications per second. This paper also reports speeds on other graphics processors: for example,
Efficient Acceleration of Asymmetric Cryptography on Graphics Hardware
 AFRICACRYPT 2009
, 2009
"... Graphics processing units (GPU) are increasingly being used for general purpose computing. We present implementations of large integer modular exponentiation, the core of publickey cryptosystems such as RSA, on a DirectX 10 compliant GPU. DirectX 10 compliant graphics processors are the latest gene ..."
Abstract

Cited by 11 (1 self)
 Add to MetaCart
Graphics processing units (GPU) are increasingly being used for general purpose computing. We present implementations of large integer modular exponentiation, the core of publickey cryptosystems such as RSA, on a DirectX 10 compliant GPU. DirectX 10 compliant graphics processors are the latest generation of GPU architecture, which provide increased programming flexibility and support for integer operations. We present high performance modular exponentiation implementations based on integers represented in both standard radix form and residue number system form. We show how a GPU implementation of a 1024bit RSA decrypt primitive can outperform a comparable CPU implementation by up to 4 times and also improve the performance of previous GPU implementations by decreasing latency by up to 7 times and doubling throughput. We present how an adaptive approach to modular exponentiation involving implementations based on both a radix and a residue number system gives the best allaround performance on the GPU both in terms of latency and throughput. We also highlight the usage criteria necessary to allow the GPU to reach peak performance on public key cryptographic operations.
Parallel Shortest Lattice Vector Enumeration on Graphics Cards
, 2010
"... In this paper we present an algorithm for parallel exhaustive search for short vectors in lattices. This algorithm can be applied to a wide range of parallel computing systems. To illustrate the algorithm, it was implemented on graphics cards using CUDA, a programming framework for NVIDIA graphics ..."
Abstract

Cited by 6 (2 self)
 Add to MetaCart
In this paper we present an algorithm for parallel exhaustive search for short vectors in lattices. This algorithm can be applied to a wide range of parallel computing systems. To illustrate the algorithm, it was implemented on graphics cards using CUDA, a programming framework for NVIDIA graphics cards. We gain large speedups compared to previous serial CPU implementations. Our implementation is almost 5 times faster in high lattice dimensions. Exhaustive search is one of the main building blocks for lattice basis reduction in cryptanalysis. Our work results in an advance in practical lattice reduction.
A gpu accelerated storage system
 In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (2010), Proceedings of the International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC’10
"... Massively multicore processors, like, for example, Graphics Processing Units (GPUs), provide, at a comparable price, a one order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, as any orderofmagnitude drop in the cost per unit of performance for a ..."
Abstract

Cited by 5 (1 self)
 Add to MetaCart
Massively multicore processors, like, for example, Graphics Processing Units (GPUs), provide, at a comparable price, a one order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, as any orderofmagnitude drop in the cost per unit of performance for a class of system components, triggers the opportunity to redesign systems and to explore new ways to engineer them to recalibrate the costtoperformance relation. In this context, we focus on data storage: We explore the feasibility of harnessing the GPUs ’ computational power to improve the performance, reliability, or security of distributed storage systems. In this context we present the design of a storage system prototype that uses GPU offloading to accelerate a number of computationally intensive primitives based on hashing. We evaluate the performance of this prototype under two configurations: as a content addressable storage system that facilitates online similarity detection between successive versions of the same file and as a traditional system that uses hashing to preserve data integrity. Further, we evaluate the impact of offloading to the GPU on competing applications’ performance. Our results show that this technique can bring tangible performance gains without negatively impacting the performance of concurrently running applications. Further, this work sheds light on the use of heterogeneous multicore processors for enhancing lowlevel system primitives, and introduces techniques to efficiently leverage the processing power of GPUs.
Acceleration of composite order bilinear pairing on graphics hardware
 Information and Communications Security, volume 7618 of LNCS
, 2012
"... Abstract. Recently, compositeorder bilinear pairing has been shown to be useful in many cryptographic constructions. However, it is timecostly to evaluate. This is because the composite order should be at least 1024bit and, hence, the elliptic curve group order n and base field become too large, r ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
Abstract. Recently, compositeorder bilinear pairing has been shown to be useful in many cryptographic constructions. However, it is timecostly to evaluate. This is because the composite order should be at least 1024bit and, hence, the elliptic curve group order n and base field become too large, rendering the bilinear pairing algorithm itself too slow to be practical (e.g., the Miller loop is Ω(n)). Thus, compositeorder computation easily becomes the bottleneck of a cryptographic construction, especially, in the case where many pairings need to be evaluated at the same time. The existing solution to this problem that converts compositeorder pairings to primeorder ones is only valid for certain constructions. In this paper, we leverage the huge number of threads available on Graphics Processing Units (GPUs) to speed up compositeorder pairing computation. We investigate suitable SIMD algorithms for base field, extension field, elliptic curve and bilinear pairing computation as well as mapping these algorithms into GPUs with careful considerations. Experimental results show that our method achieves a record of 8.7ms per pairing on a 1024bit security level, which is a 20fold speedup compared to stateoftheart CPU implementation. This result also opens the road to adopting higher security levels and using richresource parallel platforms, which for example are available in cloud computing. In fact, we can achieve more than 24 times speedup on a 2048bit security level and a record of 7 × 10 −6 USD per pairing on the Amazon cloud computing environment. 1
Speed records for NTRU
"... Abstract. In this paper NTRUEncrypt is implemented for the first time on a GPU using the CUDA platform. As is shown, this operation lends itself excellently for parallelization and performs extremely well compared to similar security levels for ECC and RSA giving speedups of around three to four ord ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
Abstract. In this paper NTRUEncrypt is implemented for the first time on a GPU using the CUDA platform. As is shown, this operation lends itself excellently for parallelization and performs extremely well compared to similar security levels for ECC and RSA giving speedups of around three to four orders of magnitude. The focus is on achieving a high throughput, in this case performing a large number of encryptions/decryptions in parallel. Using a modern GTX280 GPU a throughput of up to 200 000 encryptions per second can be reached at a security level of 256 bits. This gives a theoretical data throughput of 47.8 MB/s. Comparing this to a symmetric cipher (not a very common comparison), this is only around 20 times slower than a recent AES implementations on a GPU. 1
Solving Discrete Logarithms in SmoothOrder Groups with CUDA 1
"... This paper chronicles our experiences using CUDA to implement a parallelized variant of Pollard’s rho algorithm to solve discrete logarithms in groups with cryptographically large moduli but smooth order using commodity GPUs. We first discuss some key design constraints imposed by modern GPU archite ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
This paper chronicles our experiences using CUDA to implement a parallelized variant of Pollard’s rho algorithm to solve discrete logarithms in groups with cryptographically large moduli but smooth order using commodity GPUs. We first discuss some key design constraints imposed by modern GPU architectures and the CUDA framework, and then explain how we were able to implement efficient arbitraryprecision modular multiplication within these constraints. Our implementation can execute roughly 51.9 million 768bit modular multiplications per second — or a whopping 840 million 192bit modular multiplications per second — on a single Nvidia Tesla M2050 GPU card, which is a notable improvement over all previous results on comparable hardware. We leverage this fast modular multiplication in our implementation of the parallel rho algorithm, which can solve discrete logarithms modulo a 1536bit RSA number with a 2 55smooth totient in less than two minutes. We conclude the paper by discussing implications to discrete logarithmbased cryptosystems, and by pointing out how efficient implementations of parallel rho (or related algorithms) lead to trapdoor discrete logarithm groups; we also point out two potential cryptographic applications for the latter. Our code is written in C for CUDA and PTX; it is open source and freely available for download online. 1
Efficient Multiplication of Polynomials on Graphics Hardware
"... Abstract. We present the algorithm to multiply univariate polynomials with integer coefficients efficiently using the Number Theoretic transform (NTT) on Graphics Processing Units (GPU). The same approach can be used to multiply large integers encoded as polynomials. Our algorithm exploits fused mul ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
Abstract. We present the algorithm to multiply univariate polynomials with integer coefficients efficiently using the Number Theoretic transform (NTT) on Graphics Processing Units (GPU). The same approach can be used to multiply large integers encoded as polynomials. Our algorithm exploits fused multiplyadd capabilities of the graphics hardware. NTT multiplications are executed in parallel for a set of distinct primes followed by reconstruction using the Chinese Remainder theorem (CRT) on the GPU. Our benchmarking experiences show the NTT multiplication performance up to 77 GMul/s 1. We compared our approach with CPUbased implementations of polynomial and large integer multiplication provided by NTL and GMP 2 libraries.