Practical symmetric key cryptography on modern graphics hardware
- In Proc. USENIX ’08
Cited by 15 (2 self)
Graphics processors are continuing their trend of vastly outperforming CPUs while becoming more general purpose. The latest generation of graphics processors has introduced the ability to handle integers natively. This has increased the GPU’s applicability to many fields, especially cryptography. This paper presents an application-oriented approach to block cipher processing on GPUs. A new block-based conventional implementation of AES on an Nvidia G80 is shown with 4-10x speed improvements over CPU implementations and a 2-4x speed increase over the previous fastest AES GPU implementation. We outline a general purpose data structure for representing cryptographic client requests which is suitable for execution on a GPU. We explore the issues related to the mapping of this general structure to the GPU. Finally, we present the first analysis of the main encryption modes of operation on a GPU, showing the performance and behavioural implications of executing these modes under the outlined general purpose data model. Our AES implementation is used as the underlying block cipher to show the overhead of moving from an optimised hardcoded approach to a generalised one.
T.: Exploiting the Power of GPUs for Asymmetric Cryptography
, 2008
Cited by 11 (0 self)
Abstract. Modern Graphics Processing Units (GPUs) now exceed conventional Central Processing Units (CPUs) by far in both performance and gate count. Many modern computer systems include, besides a CPU, such a powerful GPU, which runs idle most of the time and might be used as a cheap and instantly available co-processor for general purpose applications. In this contribution, we focus on the efficient realisation of the computationally expensive operations in asymmetric cryptosystems on such off-the-shelf GPUs. More precisely, we present improved and novel implementations employing GPUs as accelerators for RSA and DSA cryptosystems as well as for Elliptic Curve Cryptography (ECC). Using a recent Nvidia 8800GTS graphics card, we are able to compute 813 modular exponentiations per second for RSA or DSA-based systems with 1024 bit integers. Moreover, our design for ECC over the prime field P-224 achieves a throughput of 1412 point multiplications per second.
Efficient Acceleration of Asymmetric Cryptography on Graphics Hardware
- AFRICACRYPT 2009
, 2009
Cited by 4 (0 self)
Graphics processing units (GPUs) are increasingly being used for general purpose computing. We present implementations of large integer modular exponentiation, the core of public-key cryptosystems such as RSA, on a DirectX 10 compliant GPU. DirectX 10 compliant graphics processors are the latest generation of GPU architecture, which provides increased programming flexibility and support for integer operations. We present high performance modular exponentiation implementations based on integers represented in both standard radix form and residue number system form. We show how a GPU implementation of a 1024-bit RSA decrypt primitive can outperform a comparable CPU implementation by up to 4 times and also improve the performance of previous GPU implementations by decreasing latency by up to 7 times and doubling throughput. We show how an adaptive approach to modular exponentiation involving implementations based on both a radix and a residue number system gives the best all-around performance on the GPU in terms of both latency and throughput. We also highlight the usage criteria necessary to allow the GPU to reach peak performance on public key cryptographic operations.
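To illustrate why a residue number system (RNS) suits parallel hardware, the following is a minimal Python sketch, not the paper's implementation. The moduli and the function names (`to_rns`, `rns_mul`, `from_rns`) are assumptions chosen for illustration. Each residue channel is multiplied independently, which is the property a GPU can exploit. The reduction modulo an RSA modulus (for example, RNS Montgomery reduction) is deliberately omitted.

```python
from math import prod

# Hypothetical, tiny pairwise-coprime moduli chosen for illustration only.
MODULI = (251, 253, 255, 256)
M = prod(MODULI)  # dynamic range: values in [0, M) are represented uniquely


def to_rns(x):
    """Convert an integer to its tuple of residues, one per modulus."""
    return tuple(x % m for m in MODULI)


def rns_mul(a, b):
    """Multiply channel by channel; no carries pass between channels."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def from_rns(residues):
    """Recover the integer in [0, M) via the Chinese Remainder Theorem."""
    total = 0
    for r_i, m_i in zip(residues, MODULI):
        M_i = M // m_i
        # pow(M_i, -1, m_i) is the modular inverse (Python 3.8+).
        total += r_i * M_i * pow(M_i, -1, m_i)
    return total % M


a, b = 12345, 67890
# The product fits in the dynamic range, so the round trip is exact.
assert from_rns(rns_mul(to_rns(a), to_rns(b))) == a * b
```

The round trip is exact only while the true product stays below `M`; a practical design would need enough moduli to cover the operand size in use.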
Speed records for NTRU
Cited by 2 (0 self)
Abstract. In this paper NTRUEncrypt is implemented for the first time on a GPU using the CUDA platform. As is shown, this operation lends itself well to parallelization and performs extremely well compared to ECC and RSA at similar security levels, giving speedups of around three to four orders of magnitude. The focus is on achieving a high throughput, in this case performing a large number of encryptions/decryptions in parallel. Using a modern GTX280 GPU, a throughput of up to 200 000 encryptions per second can be reached at a security level of 256 bits. This gives a theoretical data throughput of 47.8 MB/s. Comparing this to a symmetric cipher (not a very common comparison), this is only around 20 times slower than recent AES implementations on a GPU.
Parallel Shortest Lattice Vector Enumeration on Graphics Cards
Cited by 2 (0 self)
In this paper we present an algorithm for parallel exhaustive search for short vectors in lattices. This algorithm can be applied to a wide range of parallel computing systems. To illustrate the algorithm, it was implemented on graphics cards using CUDA, a programming framework for NVIDIA graphics cards. We gain large speedups compared to previous serial CPU implementations. Our implementation is almost 5 times faster in high lattice dimensions. Exhaustive search is one of the main building blocks for lattice basis reduction in cryptanalysis. Our work results in an advance in practical lattice reduction.
Solving Discrete Logarithms in Smooth-Order Groups with CUDA
Cited by 1 (0 self)
This paper chronicles our experiences using CUDA to implement a parallelized variant of Pollard’s rho algorithm to solve discrete logarithms in groups with cryptographically large moduli but smooth order using commodity GPUs. We first discuss some key design constraints imposed by modern GPU architectures and the CUDA framework, and then explain how we were able to implement efficient arbitrary-precision modular multiplication within these constraints. Our implementation can execute roughly 51.9 million 768-bit modular multiplications per second (or a whopping 840 million 192-bit modular multiplications per second) on a single Nvidia Tesla M2050 GPU card, which is a notable improvement over all previous results on comparable hardware. We leverage this fast modular multiplication in our implementation of the parallel rho algorithm, which can solve discrete logarithms modulo a 1536-bit RSA number with a 2^55-smooth totient in less than two minutes. We conclude the paper by discussing implications for discrete logarithm-based cryptosystems, and by pointing out how efficient implementations of parallel rho (or related algorithms) lead to trapdoor discrete logarithm groups; we also point out two potential cryptographic applications for the latter. Our code is written in C for CUDA and PTX; it is open source and freely available for download online.
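For orientation, here is a minimal single-threaded Python sketch of the textbook Pollard's rho method for discrete logarithms in a prime-order subgroup. It is not the paper's parallelized CUDA variant and does not reproduce its handling of smooth-order groups. The function names (`step`, `rho_dlog`) and the toy parameters (`p = 1019`, `n = 509`, `g = 4`) are assumptions chosen for illustration.

```python
import random


def step(x, a, b, g, h, p, n):
    """One pseudo-random walk step, keeping the invariant x == g^a * h^b mod p."""
    s = x % 3
    if s == 0:
        return (x * x) % p, (2 * a) % n, (2 * b) % n
    if s == 1:
        return (x * g) % p, (a + 1) % n, b
    return (x * h) % p, a, (b + 1) % n


def rho_dlog(g, h, p, n, seed=0):
    """Return x with g^x == h (mod p), where g has prime order n and h is in <g>."""
    rng = random.Random(seed)
    while True:
        a0, b0 = rng.randrange(n), rng.randrange(n)
        start = (pow(g, a0, p) * pow(h, b0, p) % p, a0, b0)
        tortoise = hare = start
        # Floyd cycle detection: the hare takes two steps per tortoise step.
        while True:
            tortoise = step(*tortoise, g, h, p, n)
            hare = step(*step(*hare, g, h, p, n), g, h, p, n)
            if tortoise[0] == hare[0]:
                break
        _, a, b = tortoise
        _, A, B = hare
        db = (B - b) % n
        if db == 0:
            continue  # useless collision; retry from a new random start
        # g^a h^b == g^A h^B implies x == (a - A) / (B - b) mod n.
        return (a - A) * pow(db, -1, n) % n


# Toy parameters (assumed): 4 generates the subgroup of prime order 509 mod 1019.
p, n, g = 1019, 509, 4
h = pow(g, 123, p)
x = rho_dlog(g, h, p, n)
assert pow(g, x, p) == h
```

Because `n` is prime in this sketch, any nonzero difference `B - b` is invertible modulo `n`; composite orders need extra care that is not shown here.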
Public Key Cryptography on Modern Graphics Hardware
Abstract. Graphics processing units (GPUs) are increasingly being used for general purpose processing. We present implementations of large integer modular exponentiation, the core of public-key cryptosystems such as RSA, on a DirectX 10 compliant GPU. DirectX 10 compliant graphics processors are the latest generation of GPU architecture, which provides increased programming flexibility and support for integer operations. We present high performance modular exponentiation implementations based on integers represented in both standard radix form and residue number system (RNS) form. We show how a GPU implementation of a 1024-bit RSA decrypt primitive can, for the first time, outperform a comparable CPU implementation by up to 4 times. We show how an adaptive approach to modular exponentiation involving implementations based on both a radix and a residue number system gives the best all-around performance on the GPU. We also highlight the criteria necessary to allow the GPU to improve the performance of public key cryptographic operations.
Shortest Lattice Vector Enumeration on Graphics Cards
Abstract. In this paper we make a first feasibility analysis for implementing lattice reduction algorithms on GPUs using CUDA, a programming framework for NVIDIA graphics cards. The enumeration phase of the BKZ lattice reduction algorithm is chosen as a good candidate for massive parallelization on a GPU. Given the nature of the problem, we gain large speedups compared to previous CPU implementations. Our implementation saves more than 50% of the time in high lattice dimensions. Among other impacts, this result influences the security of lattice-based cryptosystems.
GPU Accelerated Cryptography as an OS Service
Abstract. Graphics processing units (GPUs) have become popular devices for accelerating general purpose computing. In recent years there has been a surge in research involving the use of GPUs as cryptographic accelerators. Research has shown that contemporary GPU architectures can achieve higher throughput than a traditional CPU in both symmetric and asymmetric key cryptography. Despite the existence of these new approaches, there remains no way for OS kernel services or userspace applications to make use of these implementations in a practical manner. To overcome this shortcoming, this paper investigates the integration of GPU-accelerated cryptographic algorithms with an established service virtualisation layer within the Linux kernel, the OCF-Linux framework. This paper demonstrates that it is feasible to use a centralised kernel service to provide a standardised abstraction to GPU-accelerated cryptographic functions for both kernelspace and userspace components.
High-Speed Single-Database PIR Implementation
Abstract. In this HotPETs session we would like to present an implementation of a single-database Private Information Retrieval (PIR) scheme that can process a database at 2 Gbits/s using a commodity Graphics Processing Unit (GPU). This session will have three goals:
- Dispel the idea that single-database PIR schemes are unusable because they are too computationally expensive
- Provide a tool to do fast single-database PIR for higher-level applications and tests
- Highlight that "Lattices + GPUs = Huge speedup" compared to number-theory schemes
In order to do this we will first give a quick introduction to single-database PIR schemes and highlight the computational issues. Then, after a one-slide presentation of how GPUs can be used to do general purpose computations, we will present in a very schematic way the scheme implemented and why it is well adapted to GPUs. Finally, we will present a performance comparison over different database sizes with mean and variance values. One or two demos are possible if the organizers agree with them. IMPORTANT NOTE: Our implementation can be downloaded from

