## Efficient Arithmetic on ARM-NEON and Its Application for High-Speed RSA Implementation

### Citations

214 | Multiplication of multidigit numbers on automata
- Karatsuba, Ofman
- 1963
Citation Context ...sults available in the non-redundant representation over ARM-NEON. 3 Karatsuba's Multiplication. One of the multiplication techniques with sub-quadratic complexity is called Karatsuba's multiplication [14]. Karatsuba's method reduces a multiplication of two n-word operands to three multiplications, each of length n/2 words. These three half-size multiplications can be performed with any multipli...

189 | Comparing Elliptic Curve Cryptography and RSA on 8-bit CPUs
- Gura, Patel, et al.
- 2004
Citation Context ... words. These three half-size multiplications can be performed with any multiplication technique (e.g. the operand-scanning, product-scanning, hybrid-scanning, or operand-caching method [21, 6, 10, 12, 25, 26]). The Karatsuba method can also be scheduled recursively, and its asymptotic complexity is $\Theta(n^{\log_2 3})$. There are two typical ways to describe Karatsuba's multiplication, such as additive Karatsub...
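
The recursive scheduling mentioned in this context can be sketched as follows. This is a minimal illustration of the additive variant on Python integers, not the paper's ARM-NEON code; the function name `karatsuba`, the `bits` parameter, and the 32-bit base-case threshold are assumptions made for the example.

```python
def karatsuba(a: int, b: int, bits: int) -> int:
    """Multiply two non-negative integers of at most `bits` bits each."""
    # Base case (assumed threshold): use the built-in multiply for small operands.
    if bits <= 32:
        return a * b
    half = bits // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    # Three half-size products instead of four.
    low = karatsuba(a_lo, b_lo, half)
    high = karatsuba(a_hi, b_hi, bits - half)
    # Additive variant: the sums may need one extra bit.
    mid = karatsuba(a_lo + a_hi, b_lo + b_hi, bits - half + 1)
    # Recover the cross term a_hi*b_lo + a_lo*b_hi.
    mid -= low + high
    return (high << (2 * half)) + (mid << half) + low
```

Each level replaces one multiplication by three of roughly half the size, which is where the $\Theta(n^{\log_2 3})$ bound comes from.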

33 | Batch binary Edwards.
- Bernstein
- 2009
Citation Context ...uba's method: $A_H \cdot B_H \cdot 2^{n} + [A_H \cdot B_H + A_L \cdot B_L - |A_H - A_L| \cdot |B_H - B_L|] \cdot 2^{n/2} + A_L \cdot B_L$ (2). Karatsuba's method turns one multiplication of size n into three multiplications and eight additions of size n/2. In [1], a variant of Karatsuba's method named refined Karatsuba's method was introduced, which saves one addition operation of length n/2. Recently, Hutter and Schwabe achieved the speed records on AV...
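
One level of the subtractive form can be sketched as below. The sketch is an illustration under assumptions: the name `subtractive_karatsuba` is hypothetical, `n` is taken to be an even bit length, and the sign of the product of differences is tracked explicitly, a detail the extracted equation (2) leaves implicit. The branch on the sign means this version is not constant-time.

```python
def subtractive_karatsuba(a: int, b: int, n: int) -> int:
    """One level of subtractive Karatsuba for n-bit operands (n even)."""
    half = n // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    high = a_hi * b_hi
    low = a_lo * b_lo
    # Absolute differences fit in `half` bits, unlike the additive sums.
    da = abs(a_hi - a_lo)
    db = abs(b_hi - b_lo)
    # (a_hi - a_lo) * (b_hi - b_lo) is negative when exactly one difference is negative.
    negative = (a_hi < a_lo) != (b_hi < b_lo)
    cross = da * db
    mid = high + low + cross if negative else high + low - cross
    return (high << n) + (mid << half) + low
```

Because the differences never exceed `half` bits, the middle multiplication has the same operand size as the other two, which is the usual reason for preferring this form on fixed-width registers.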

32 | Exponentiation cryptosystems on
- Comba
- 1990
Citation Context ...rm). Secondly, the multiplication of A[0] with the re-organized operands ((A[0]; A[4]); (A[2]; A[6]); (A[1]; A[5]); (A[3]; A[7])) is computed, generating the partial product pairs (C[0]; C[4]); (C[2]; C[6]); (C[1]; C[5]); (C[3]; C[7]), where the results range from 0 to $2^{64} - 2^{33} + 1$, namely 0xffff fffe 0000 0001. Third, partial products are divided into higher bits (64 ~ 33) and lower bits (32 ~ 1...
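
The bound quoted in this context can be checked directly. The short sketch below multiplies two maximal 32-bit words and splits the 64-bit product into its upper and lower 32-bit halves; the helper name `split_product` is an assumption for illustration.

```python
MAX32 = (1 << 32) - 1  # largest 32-bit word, 0xffff ffff


def split_product(product: int) -> tuple[int, int]:
    """Split a 64-bit partial product into (higher 32 bits, lower 32 bits)."""
    return product >> 32, product & MAX32


# Largest possible 32-bit by 32-bit partial product.
largest = MAX32 * MAX32
```

The largest product equals $2^{64} - 2^{33} + 1$, so its upper half is 0xffff fffe and its lower half is 0x0000 0001.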

21 | NEON crypto.
- Bernstein, Schwabe
- 2012
Citation Context ...SIMD in redundant representations. In the case of the ARM-NEON architecture, squaring has only been considered over the redundant representation for small integers (below 500 bits) in specific ECC implementations [3, 2]. Over the redundant representation, squaring is easily implemented by doubling the operands or intermediate results, because the redundant representation can store carry bits in spare c...

18 | Energy-efficient software implementation of long integer modular arithmetic.
- Großschädl, Avanzi, et al.
- 2005

6 | Fast multi-precision multiplication for public-key cryptography on embedded microprocessors
- Hutter, Wenger

5 | GMP: The GNU Multiple Precision Arithmetic Library. Available for download at http://www.gmplib.org
- Free Software Foundation, Inc.
- 2015
Citation Context ...ly, we re-organize the operands with a transpose operation, which efficiently shuffles the inner vector in 32-bit units. Instead of the normal order ((A[0]; A[1]); (A[2]; A[3]); (A[4]; A[5]); (A[6]; A[7])), we group the operands as ((A[0]; A[4]); (A[2]; A[6]); (A[1]; A[5]); (A[3]; A[7])) to compute two 32-bit-wise multiplications, where each operand ranges from 0 to $2^{32} - 1$ (i.e. 0xffff fff...

4 | Montgomery multiplication on the Cell. In
- Bos, Kaihara
- 2010
Citation Context ...t operation by 1 bit. Since the operation may output a 1-bit carry (the 257th bit), we store the doubled operands in nine 32-bit registers (A_CARRY; A_DBL[8 ~ 15]). Secondly, the multiplication of A_DBL[8] with (A[0]; A[4]); (A[2]; A[6]); (A[1]; A[5]); (A[3]; A[7]) is computed, generating the partial product pairs (C[8]; C[12]); (C[9]; C[13]); (C[10]; C[14]); (C[11]; C[15]). Third, partial products are separa...

4 | Software implementation of modular exponentiation, using advanced vector instructions architectures.
- Gueron, Krasnov
- 2012
Citation Context ...emaining capacity of a register can avoid carry propagations. In [4], vector instructions on the CELL microprocessor are used to perform multiplication on operands represented with a radix of $2^{16}$. In [9], RSA implementations for the Intel-AVX platform use 256-bit wide vector instructions and the reduced-radix representation for faster accumulation of partial products. At CHES 2012, Bernstein and Sch...

4 | Using streaming SIMD extensions (SSE2) to perform big multiplications. Whitepaper AP-941,
- Intel Corporation
- 2000
Citation Context ...ctions, traditional cryptography software needs to be rewritten into a vectorized format. The most well-known approach is a reduced-radix representation for better handling of carry propagation [13]. The redundant representation reduces the number of active bits per register. Keeping the final result within the remaining capacity of a register can avoid carry propagations. In [4], vector instructions ...
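
The idea of a reduced-radix (redundant) representation can be illustrated with a small sketch. The 28-bit limb width, the 32-bit container size it implies, and all function names are assumptions chosen for the example; none of them come from the cited works.

```python
RADIX_BITS = 28                  # assumed limb width: 4 spare bits in a 32-bit word
MASK = (1 << RADIX_BITS) - 1


def to_limbs(x: int, count: int) -> list[int]:
    """Split x into `count` limbs of RADIX_BITS bits, least significant first."""
    return [(x >> (RADIX_BITS * i)) & MASK for i in range(count)]


def from_limbs(limbs: list[int]) -> int:
    """Recombine limbs (normalized or not) into one integer."""
    return sum(limb << (RADIX_BITS * i) for i, limb in enumerate(limbs))


def add_lazy(x: list[int], y: list[int]) -> list[int]:
    """Limb-wise addition with no carry propagation between limbs."""
    return [a + b for a, b in zip(x, y)]


def normalize(limbs: list[int]) -> list[int]:
    """Propagate the deferred carries once, at the end."""
    out, carry = [], 0
    for limb in limbs:
        total = limb + carry
        out.append(total & MASK)
        carry = total >> RADIX_BITS
    out.append(carry)
    return out
```

With 4 spare bits per word, up to 16 normalized values can be accumulated limb by limb before any limb could exceed 32 bits, so carries only need to be resolved once in `normalize`.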

3 | Curve41417: Karatsuba revisited
- Bernstein, Chuengsatiansup, et al.
- 2014
Citation Context ...ion by 1 bit. Since the operation may output a 1-bit carry (the 257th bit), we store the doubled operands in nine 32-bit registers (A_CARRY; A_DBL[8 ~ 15]). Secondly, the multiplication of A_DBL[8] with (A[0]; A[4]); (A[2]; A[6]); (A[1]; A[5]); (A[3]; A[7]) is computed, generating the partial product pairs (C[8]; C[12]); (C[9]; C[13]); (C[10]; C[14]); (C[11]; C[15]). Third, partial products are separated into...

3 | Montgomery multiplication using vector instructions
- Bos, Montgomery, et al.
- 2014
Citation Context ...C'13, Bos et al. flipped the sign of the precomputed Montgomery constant and accumulated the result in two separate intermediate values that are computed concurrently in the non-redundant representation [5]. However, the performance of their implementation suffers from Read-After-Write (RAW) dependencies in the instruction flow. Such dependencies cause pipeline stalls since the instruction to be executed ...

2 | Reverse product-scanning multiplication and squaring on 8-bit AVR processors.
- Liu, Seo, et al.
- 2015
Citation Context ...d in Tables 1 and 2. 2. Fast constant-time Karatsuba multiplication/squaring for ARM-NEON processors. Inspired by subtractive Karatsuba multiplication [11] and constant-time Karatsuba algorithms on AVR [17], we proposed constant-time Karatsuba multiplication and squaring on ARM-NEON, which integrate the additive/subtractive Karatsuba algorithms and COS/DOS operations. These carefully chosen methods allo...

1 | Multiprecision multiplication on AVR revisited
- Hutter, Schwabe
- 2014
Citation Context ...formance comparison with related works can be found in Tables 1 and 2. 2. Fast constant-time Karatsuba multiplication/squaring for ARM-NEON processors. Inspired by subtractive Karatsuba multiplication [11] and constant-time Karatsuba algorithms on AVR [17], we proposed constant-time Karatsuba multiplication and squaring on ARM-NEON, which integrate the additive/subtractive Karatsuba algorithms and COS/D...

1 | Improved multi-precision squaring for low-end RISC microcontrollers
- Lee, Kim, et al.
Citation Context ...thods over both SISD and SIMD architectures. Squaring on SISD. There are several optimal squaring methods developed by introducing an efficient order of partial products. The Lazy-Doubling (LD) method [15] delays the doubling process to the end of each inner partial product and then doubles it at once. The method reduces the number of arithmetic operations by conducting doubling computations on accumula...
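
The saving from deferred doubling can be shown with a simplified sketch. It accumulates each off-diagonal product A[i]·A[j] (i < j) once and doubles the accumulated sum a single time, whereas the method described above doubles at the end of each inner partial product; that simplification, the 32-bit word size, and the name `square_lazy_doubling` are assumptions for illustration.

```python
WORD_BITS = 32  # assumed word size


def square_lazy_doubling(words: list[int]) -> int:
    """Square a multi-word integer given as little-endian 32-bit words."""
    n = len(words)
    cross = 0
    # Accumulate each off-diagonal product A[i]*A[j] (i < j) only once.
    for i in range(n):
        for j in range(i + 1, n):
            cross += (words[i] * words[j]) << (WORD_BITS * (i + j))
    # Deferred doubling: one shift on the accumulated sum instead of one per product.
    result = cross << 1
    # Diagonal products A[i]*A[i] are not doubled.
    for i in range(n):
        result += (words[i] * words[i]) << (2 * WORD_BITS * i)
    return result
```

For an n-word operand this computes n(n-1)/2 cross products plus n squares, instead of the n² products of a general multiplication.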

1 | New speed records for Montgomery modular multiplication on 8-bit AVR microcontrollers
- Liu, Großschädl
- 2014
Citation Context ...ral registers or memory storages. This process is iterated 7 more times to complete the second inner loop for the partial products (A[0 ~ 7] × A_DBL[8 ~ 15]). The intermediate results are retained in (C[16]; C[20]); (C[17]; C[21]); (C[18]; C[22]); (C[19]; C[23]), bounded by $2^{33} - 2$ (i.e. 0x1 ffff fffe). In the third inner loop, we firstly conduct the carry handling by masking the operands with the carry bit (...