## Exploiting the Power of GPUs for Asymmetric Cryptography (2008)

Citations: 25 (0 self)

### BibTeX

@MISC{Szerwinski08t.:exploiting,
    author = {Robert Szerwinski and Tim Güneysu},
    title = {Exploiting the Power of GPUs for Asymmetric Cryptography},
    year = {2008}
}

### Abstract

Modern Graphics Processing Units (GPUs) far exceed conventional Central Processing Units (CPUs) in both performance and gate count. Many modern computer systems include, besides a CPU, such a powerful GPU, which sits idle most of the time and could serve as a cheap and instantly available co-processor for general-purpose applications. In this contribution, we focus on the efficient realisation of the computationally expensive operations in asymmetric cryptosystems on such off-the-shelf GPUs. More precisely, we present improved and novel implementations employing GPUs as accelerators for RSA and DSA cryptosystems as well as for Elliptic Curve Cryptography (ECC). Using a recent Nvidia 8800GTS graphics card, we are able to compute 813 modular exponentiations per second for RSA or DSA-based systems with 1024-bit integers. Moreover, our design for ECC over the prime field P-224 achieves a throughput of 1412 point multiplications per second.

### Citations

2898 | A method for obtaining digital signatures and public key cryptosystems
- Rivest, Shamir, et al.
- 1978
Citation Context ...first publication making use of the CUDA framework for GPGPU processing of asymmetric cryptosystems. We will start with implementing the extremely wide-spread Rivest Shamir Adleman (RSA) cryptosystem [30]. The same implementation based on modular exponentiation for large integers can be used to implement the Digital Signature Algorithm (DSA), which has been published by the US National Institute of St...

411 | Modular Multiplication without Trial Division
- Montgomery
- 1985
Citation Context ...s several multiplication strategies to identify an optimal method for implementation on GPUs. 4.1 Modular Multiplication Using Montgomery’s Technique In 1985 Peter L. Montgomery proposed an algorithm [23] to remove the costly division operation from the modular reduction. Koç et al. [6] give a survey of different implementation options. As all multi-precision Montgomery multiplication algorithms featu...

371 | A Guide to Elliptic Curve Cryptography
- Hankerson, Menezes, et al.
- 2004
Citation Context ... the GPU for use with RSA, DSA and similar systems. Second, for ECC-based cryptosystems we present an efficient point multiplication method which is the fundamental operation, e.g., for ECDSA or ECDH [16]. 5.1 Modular Exponentiation Using the CIOS Method We implemented the CIOS Method as introduced in Algorithm 1 for sequential execution since it does not include any inherent parallelism. Fan et al. d...

107 | Handbook of elliptic and hyperelliptic curve cryptography. Discrete mathematics and its applications
- Cohen, Frey, et al.
- 2006
Citation Context ...5.3 Point Multiplication Using Generalised Mersenne Primes For realising the elliptic curve group operation, we chose mixed affine-Jacobian coordinates [8] to avoid costly inversions in the underlying field and thus concentrated on efficient implementation of modular multiplication, the remaining time critical operation. For this, we used a straightforw...

31 | The Hessian form of an elliptic curve
- Smart
- 2001
Citation Context ...rithmetic with Kawamura’s base extension mechanism. 6.3 Further Work Elliptic curves in Hessian form feature highly homogeneous formulae to compute all three projective coordinates in point additions [19,34]. However, the curves standardised by ANSI and NIST cannot be transformed to Hessian form. Furthermore, point doublings can be converted to point additions by simple coordinate rotations. Thus, it is ...

25 | CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography
- Manavski
- 2007
Citation Context ...rate cryptographic algorithms using the GPU. For example, various authors looked at the feasibility of the current industry standard for symmetric cryptography, the Advanced Encryption Standard (AES) [21,31,18,9]. Only two groups, namely Moss et al. and Fleissner, have aimed for the efficient implementation of modular exponentiation on the GPU [24,14]. Their results were not promising, as they were limited by...

20 | The mpFq library and implementing curve-based key exchanges
- Gaudry, Thomé
- 2007
Citation Context ...RNS arithmetic 413.9 [36] Suzuki Xilinx xc4fx12 FPGA, using DSPs 584.8 79.4 [26] Nozaki 0.25μm CMOS, 80 MHz, 221k GE 238.1 34.2 [11] eBATS Intel Core2 2.13 GHz 1447.5 300.4 2623.4 a 1868.5 a 1494.8 a [15] Gaudry Intel Core2 2.66 GHz 6900 b a Performance for ECDSA operation including additional modular inversion and multiplication operation. b Special elliptic curve in Montgomery form, non-compliant to...

19 | Toward acceleration of RSA using 3D graphics hardware. volume 4887
- Moss, Page, et al.
- 2007
Citation Context ... cryptography, the Advanced Encryption Standard (AES) [21,31,18,9]. Only two groups, namely Moss et al. and Fleissner, have aimed for the efficient implementation of modular exponentiation on the GPU [24,14]. Their results were not promising, as they were limited by the legacy GPU architecture and interface (cf. the next section). To the best of our knowledge there are neither publications about the impl...

15 | Cox-Rower architecture for fast parallel Montgomery multiplication
- Kawamura, Koike, et al.
- 2000
Citation Context ...three different options into account: the method based on a Mixed Radix System (MRS) according to Szabó and Tanaka [37], as well as CRT-based methods due to Shenoy and Kumaresan [33], Kawamura et al. [20] and Bajard et al. [3]. We present a brief introduction of these methods, but for more detailed information about base extensions, please see the recent survey at [5]. 2 Inner-RNS operations still con...

15 | Fast base extension using a redundant modulus in RNS
- Shenoy, Kumaresan
- 1989
Citation Context ...m, is needed. We take three different options into account: the method based on a Mixed Radix System (MRS) according to Szabó and Tanaka [37], as well as CRT-based methods due to Shenoy and Kumaresan [33], Kawamura et al. [20] and Bajard et al. [3]. We present a brief introduction of these methods, but for more detailed information about base extensions, please see the recent survey at [5]. 2 Inner-RN...

12 | GPU-accelerated Montgomery exponentiation
- Fleissner
- 2007
Citation Context ... cryptography, the Advanced Encryption Standard (AES) [21,31,18,9]. Only two groups, namely Moss et al. and Fleissner, have aimed for the efficient implementation of modular exponentiation on the GPU [24,14]. Their results were not promising, as they were limited by the legacy GPU architecture and interface (cf. the next section). To the best of our knowledge there are neither publications about the impl...

11 | How to Maximize the Potential of FPGA Resources for Modular Exponentiation
- Suzuki
- 2007
Citation Context ...thmetic 175.4 [10] Costigan Sony Playstation 3, 1 PPU, 6 SPUs 909.2 401.4 [22] Mentens Xilinx xc2vp30 FPGA 471.7 1724.1 235.8 1000.0 440.5 [32] Schinianakis Xilinx xc2vp125 FPGA, RNS arithmetic 413.9 [36] Suzuki Xilinx xc4fx12 FPGA, using DSPs 584.8 79.4 [26] Nozaki 0.25μm CMOS, 80 MHz, 221k GE 238.1 34.2 [11] eBATS Intel Core2 2.13 GHz 1447.5 300.4 2623.4 a 1868.5 a 1494.8 a [15] Gaudry Intel Core2 2...

10 | AES Encryption Implementation and Analysis on Commodity Graphics Processing Units
- Harrison, Waldron
- 2007
Citation Context ...rate cryptographic algorithms using the GPU. For example, various authors looked at the feasibility of the current industry standard for symmetric cryptography, the Advanced Encryption Standard (AES) [21,31,18,9]. Only two groups, namely Moss et al. and Fleissner, have aimed for the efficient implementation of modular exponentiation on the GPU [24,14]. Their results were not promising, as they were limited by...

9 | Accelerating SSL Using the Vector Processors in IBM’s Cell Broadband Engine for Sony’s PlayStation 3. Cryptology ePrint Archive
- Costigan, Scott
Citation Context ...nce solely to the CPU of his host system. Costigan and Scott implemented modular exponentiation on IBM’s Cell platform, i.e., a Sony Playstation 3 and an IBM MPM blade server, both running at 3.2 GHz [10]. We only quote the best figures for the Playstation 3 as they call the results for the MPM blade preliminary. The Playstation features one PowerPC core (PPU) and 6 Synergistic Processing Elements (SP...

7 | Modular Multiplication and Base Extensions in Residue Number Systems
- Bajard, Didier, et al.
- 2001
Citation Context ... into account: the method based on a Mixed Radix System (MRS) according to Szabó and Tanaka [37], as well as CRT-based methods due to Shenoy and Kumaresan [33], Kawamura et al. [20] and Bajard et al. [3]. We present a brief introduction of these methods, but for more detailed information about base extensions, please see the recent survey at [5]. (Footnote 2: Inner-RNS operations still contain carries.) ...

7 | Implementation of RSA algorithm based on RNS Montgomery multiplication
- Nozaki, Motoyama, et al.
- 2001
Citation Context ...6 SPUs 909.2 401.4 [22] Mentens Xilinx xc2vp30 FPGA 471.7 1724.1 235.8 1000.0 440.5 [32] Schinianakis Xilinx xc2vp125 FPGA, RNS arithmetic 413.9 [36] Suzuki Xilinx xc4fx12 FPGA, using DSPs 584.8 79.4 [26] Nozaki 0.25μm CMOS, 80 MHz, 221k GE 238.1 34.2 [11] eBATS Intel Core2 2.13 GHz 1447.5 300.4 2623.4 a 1868.5 a 1494.8 a [15] Gaudry Intel Core2 2.66 GHz 6900 b a Performance for ECDSA operation includ...

5 | Efficient RNS bases for cryptography
- Bajard, Meloni, et al.
- 2005
Citation Context ...ing pre-computed constants $\tilde{c}_{(k,i)} = \left| \prod_{l=0}^{i-1} m_l \right|_{\tilde{m}_k}$. But instead of creating a table for all $\tilde{c}_k$, a recursive approach is more efficient in our situation, eliminating the need for table-lookups [4], and allowing to compute all residues in the target base in parallel: $|x|_{\tilde{m}_k} = \left| (\ldots((x'_{n-1} m_{n-2} + x'_{n-2}) m_{n-3} + x'_{n-3}) m_{n-4} + \cdots + x'_1) m_0 + x_0 \right|_{\tilde{m}_k}$ (5). 4.4 Base Extension Using the Chinese Re...

4 | A new approach to elliptic curve cryptography: An RNS architecture
- Schinianakis, Kakarountas, et al.
- 2006
Citation Context ...PU, ECC NIST-224 1412.6 [24] Moss Nvidia 7800GTX GPU, RNS arithmetic 175.4 [10] Costigan Sony Playstation 3, 1 PPU, 6 SPUs 909.2 401.4 [22] Mentens Xilinx xc2vp30 FPGA 471.7 1724.1 235.8 1000.0 440.5 [32] Schinianakis Xilinx xc2vp125 FPGA, RNS arithmetic 413.9 [36] Suzuki Xilinx xc4fx12 FPGA, using DSPs 584.8 79.4 [26] Nozaki 0.25μm CMOS, 80 MHz, 221k GE 238.1 34.2 [11] eBATS Intel Core2 2.13 GHz 1447...

4 | Residue Arithmetic and its Applications to Computer Technology
- Szabó, Tanaka
- 1967
Citation Context ...method to convert between both bases, a base extension mechanism, is needed. We take three different options into account: the method based on a Mixed Radix System (MRS) according to Szabó and Tanaka [37], as well as CRT-based methods due to Shenoy and Kumaresan [33], Kawamura et al. [20] and Bajard et al. [3]. We present a brief introduction of these methods, but for more detailed information about b...

3 | eBATS: ECRYPT Benchmarking of Asymmetric Systems
- ECRYPT
- 2007
Citation Context ...he results for the MPM blade preliminary. The Playstation features one PowerPC core (PPU) and 6 Synergistic Processing Elements (SPUs). Software results have been attained from ECRYPT’s eBATS project [11]. Here, we picked a recent Intel Core2 Duo with 2.13 GHz clock frequency. Since mostly all figures for software relate to cycles, we assumed that repeated computations can be...

3 | Using Graphic Processing Unit in Block Cipher Calculations
- Rosenberg
- 2007
Citation Context ...rate cryptographic algorithms using the GPU. For example, various authors looked at the feasibility of the current industry standard for symmetric cryptography, the Advanced Encryption Standard (AES) [21,31,18,9]. Only two groups, namely Moss et al. and Fleissner, have aimed for the efficient implementation of modular exponentiation on the GPU [24,14]. Their results were not promising, as they were limited by...

2 | Analyzing and comparing Montgomery multiplication algorithms
- Koç, Acar, Kaliski
- 1996
Citation Context ... on GPUs. 4.1 Modular Multiplication Using Montgomery’s Technique In 1985 Peter L. Montgomery proposed an algorithm [23] to remove the costly division operation from the modular reduction. Koç et al. [6] give a survey of different implementation options. As all multi-precision Montgomery multiplication algorithms feature no inherent parallelism except the possibility to pipeline, we do not consider t...

2 | Montgomery modular multiplication algorithm on multi-core systems
- Fan, Sakiyama, et al.
- 2007
Citation Context ...troduced in Algorithm 1 for sequential execution since it does not include any inherent parallelism. Fan et al. describe efficient ways to pipeline such an algorithm for the use on multi-core systems [13]. This would however need fairly complex coordination and memory techniques and thus will not be considered further for our implementation, ...

2 | Secure and efficient coprocessor design for cryptographic applications on FPGAs
- Mentens
- 2007
Citation Context ....3 Nvidia 8800GTS GPU, RNS arithmetic 439.8 57.9 Nvidia 8800GTS GPU, ECC NIST-224 1412.6 [24] Moss Nvidia 7800GTX GPU, RNS arithmetic 175.4 [10] Costigan Sony Playstation 3, 1 PPU, 6 SPUs 909.2 401.4 [22] Mentens Xilinx xc2vp30 FPGA 471.7 1724.1 235.8 1000.0 440.5 [32] Schinianakis Xilinx xc2vp125 FPGA, RNS arithmetic 413.9 [36] Suzuki Xilinx xc4fx12 FPGA, using DSPs 584.8 79.4 [26] Nozaki 0.25μm CMOS...

2 | Nvidia Compute Unified Device Architecture (CUDA). http://www.nvidia.com/object/cuda_home.html
- Nvidia Corporation
- 2009
Citation Context ...ake advantage of the presented hierarchical memory model. In the following, we enumerate the key criteria necessary for gaining the most out of the GPU by loosely following the CUDA programming guide [27] and a talk given by Mark Harris of Nvidia [17]. A. Maximise use of available processing power A1. Maximise independent parallelism in the algorithm to enable easy partitioning in threads and blocks. ...

1 | RNS bases and conversions
- Bajard, Plantard
- 2004
Citation Context ... Kumaresan [33], Kawamura et al. [20] and Bajard et al. [3]. We present a brief introduction of these methods, but for more detailed information about base extensions, please see the recent survey at [5]. (Footnote 2: Inner-RNS operations still contain carries.) Algorithm 2. Modular Multiplication Algorithm for Residue Number Systems [20] Require: Modulus M, two RNS bases A and B comp...

1 | Optimizing CUDA. In: Supercomputing 2007 Tutorial
- Harris
Citation Context ...ory model. In the following, we enumerate the key criteria necessary for gaining the most out of the GPU by loosely following the CUDA programming guide [27] and a talk given by Mark Harris of Nvidia [17]. A. Maximise use of available processing power A1. Maximise independent parallelism in the algorithm to enable easy partitioning in threads and blocks. A2. Keep resource usage low to allow concurrent...

1 | Faster group operations on special elliptic curves. Cryptology ePrint Archive
- Hisil, Carter, et al.
Citation Context ...rithmetic with Kawamura’s base extension mechanism. 6.3 Further Work Elliptic curves in Hessian form feature highly homogeneous formulae to compute all three projective coordinates in point additions [19,34]. However, the curves standardised by ANSI and NIST cannot be transformed to Hessian form. Furthermore, point doublings can be converted to point additions by simple coordinate rotations. Thus, it is ...

1 | seccure – SECCURE elliptic curve crypto utility for reliable encryption, version 0.3
- Poettering
- 2006
Citation Context ...porary values, nailed to 28 bits to allow schoolbook multiplication without carry propagation. Thus, we need 8 words per coordinate. Point addition and doubling algorithms were inspired by libseccure [29]. With this approach shared memory turns out to be the limiting factor. Precisely, we require 111 words per point multiplication to store 7 temporary coordinates for point addition and modulo arithmet...

1 | Cryptography: Theory and Practice, 3rd edn
- Stinson
- 2005
Citation Context ... message to values that are eligible for global memory coalescing (cf. Criteria B1 and B4). For modular exponentiation based on Algorithm 1, we applied the straightforward binary right-to-left method [35]. During exponentiation, each thread needs three temporary values of (n+2) words each that get used as input and output of Algorithm 1 in a round-robin fashion by pointer arithmetic. Thus, 3(n+2) word...
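
The citation contexts above refer to two building blocks: Montgomery's division-free modular multiplication and the binary right-to-left exponentiation method. The following is a minimal Python sketch of how the two fit together. It is not the authors' GPU code: it uses Python's arbitrary-precision integers instead of the word-wise CIOS method (Algorithm 1 in the paper), and the function names `montgomery_setup`, `mont_mul` and `mod_exp_right_to_left` as well as the 32-bit word size are assumptions made for illustration. It needs Python 3.8 or later for `pow(x, -1, m)`.

```python
def montgomery_setup(modulus: int, word_bits: int = 32):
    """Precompute Montgomery constants for an odd modulus (hypothetical helper)."""
    if modulus % 2 == 0:
        raise ValueError("Montgomery reduction needs an odd modulus")
    # R = 2^r_bits is the smallest power of two spanning whole words above the modulus.
    n_words = (modulus.bit_length() + word_bits - 1) // word_bits
    r_bits = n_words * word_bits
    r = 1 << r_bits
    # m_prime = -modulus^{-1} mod R
    m_prime = (-pow(modulus, -1, r)) % r
    return r_bits, m_prime


def mont_mul(a: int, b: int, modulus: int, r_bits: int, m_prime: int) -> int:
    """Return a * b * R^{-1} mod modulus using only shifts and masks for reduction."""
    mask = (1 << r_bits) - 1
    t = a * b
    q = ((t & mask) * m_prime) & mask   # makes t + q * modulus divisible by R
    u = (t + q * modulus) >> r_bits     # exact division by R
    return u - modulus if u >= modulus else u


def mod_exp_right_to_left(base: int, exponent: int, modulus: int) -> int:
    """Binary right-to-left exponentiation on Montgomery-form operands."""
    r_bits, m_prime = montgomery_setup(modulus)
    r_mod = (1 << r_bits) % modulus      # Montgomery form of 1
    r2_mod = (r_mod * r_mod) % modulus   # used to convert into Montgomery form
    power = mont_mul(base % modulus, r2_mod, modulus, r_bits, m_prime)
    result = r_mod
    while exponent > 0:
        if exponent & 1:                 # scan exponent bits from least significant
            result = mont_mul(result, power, modulus, r_bits, m_prime)
        power = mont_mul(power, power, modulus, r_bits, m_prime)
        exponent >>= 1
    # Multiplying by 1 converts the result back out of Montgomery form.
    return mont_mul(result, 1, modulus, r_bits, m_prime)
```

Calling `mod_exp_right_to_left(base, exponent, modulus)` with an odd modulus should agree with Python's built-in `pow(base, exponent, modulus)`. The sketch keeps only two running values (`result` and `power`), whereas the excerpt above describes three temporaries per thread rotated by pointer arithmetic; that buffer layout is specific to the paper's implementation and is not reproduced here.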