0% found this document useful (0 votes)
8 views4 pages

FPGA Multiplier Accumulator for SIDH

This document discusses hardware implementations of the arithmetic operations needed for the SIDH key exchange protocol on FPGA platforms. It surveys three previous FPGA implementations of SIDH that optimized addition and multiplication units separately. The goal of the new implementation discussed is to create a single hardware unit that can efficiently support both addition and multiplication to better reuse resources. This approach aims to reduce the costs of the arithmetic unit compared to previous works that used separate units for each operation.

Uploaded by

Mircea Petrescu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views4 pages

FPGA Multiplier Accumulator for SIDH

This document discusses hardware implementations of the arithmetic operations needed for the SIDH key exchange protocol on FPGA platforms. It surveys three previous FPGA implementations of SIDH that optimized addition and multiplication units separately. The goal of the new implementation discussed is to create a single hardware unit that can efficiently support both addition and multiplication to better reuse resources. This approach aims to reduce the costs of the arithmetic unit compared to previous works that used separate units for each operation.

Uploaded by

Mircea Petrescu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Hardware multiplier accumulator for SIDH protocol

Pedro Maat C. Massolino and Lejla Batina


Radboud University
Nijmegen, the Netherlands
Email: [Link]@[Link]
Email: lejla@[Link]

Abstract—Supersingular Isogeny Diffie Hellman is a key ex- This short paper is split into Section 2 explaining about
change protocol candidate for post quantum cryptography. It the characteristics on SIDH finite field arithmetic. Then, in
is a recent cryptosystem with keys smaller than the other post Section 3 will highlight some previous works. Finally, in
quantum cryptosystems, but high latency. In case of FPGA Section 4 our work in progress is redacted.
platforms, there have been only three full implementations in
the literature and one focused on the finite field arithmetic. This
paper survey those FPGA’s implementations and set goals for
a new implementation. In this new implementation the core
2. SIDH arithmetic characteristics
idea is to explore different techniques in order to improve
state-of-the-art performance results.
The first elliptic curve isogeny based cryptosystem with
1. Introduction ordinary elliptic curves was proposed by Rostovtsev and
Stolbunov [5], but later a sub-exponential quantum attack
was found by Andrew et al. [6]. However, the attack
Quantum computers might become standard in the near does not work in case of super singular elliptic curves,
future, and cryptographic algorithms have to withstand their which were later proposed by Jao and De Feo [7] for
capabilities. Even though they are only experienced today, key exchange, called Supersingular Isogeny Diffie Hellman
today’s sensitive and/or confidential information could be (SIDH). Later, several propositions were made to accelerate
stored to later be exploited. Therefore, the standardization the cryptosystems, focused on the prime, the respective
body NIST asked the community to submit cryptographic reduction algorithm and the curves coordinate system as
proposals for public key encryption, key encapsulation and seen by Costello et al. [8]. Finally, the same authors grouped
signatures taken into account quantum adversaries [1]. Even together with some others and created a proposal called
though such adversaries will be able to exploit the future SIKE, which incorporates some discoveries made along the
technologies, the algorithms are expected to execute on way [9], [10]. However, because the cryptosystem is recent
today’s technology. when compared with elliptic curve discrete log or integer
Between all proposals, an interesting one is based on factorization, most likely new contributions will then appear.
Supersingular Isogeny Diffie Hellman (SIDH), and is called
SIKE [2]. This key encapsulation method relies on elliptic The arithmetic of SIDH has changed very little between
curve arithmetic, which can be reused during a transition Costello et al. [8] and the current SIKE proposal [2]. They
between old devices with Elliptic Curve Diffie Hellman are being based on the finite field Fp2 with primes p =
(ECDH) to SIDH. It can also be applied on those called “Hy- 2a 3b − 1 and extension field polynomial x2 + 1. However,
brid modes” where both algorithms are applied together [1]. in the work of Costello et al. [8] some computations were
There exists performance results for SIDH and SIKE done in the base field Fp , while all SIKE operations are
protocols on FPGA and ASIC platforms [2], [3], [4]. In defined in Fp2 . This should be taken into account when
those works, the authors created different optimized units designing hardware architectures or even software libraries,
for addition and multiplication in the underlying finite field. even though adding support for Fp should not be difficult.
As a consequence they can be optimized in the datapath Primes of the shape p = k · 2αw − 1, where α, k ≥ 1
for only one operation. On the other hand, the resources on and w is the target multiplication word size, can be called
each unit could be reused into one single arithmetic unit. Montgomery friendly [13], [14], [15]. Of course, in case
The idea is to approach the problem with a different of w = 1 all primes, except 2, are Montgomery friendly,
strategy. Instead of splitting the addition and multiplication but the term is usually applied in the case of software
units, we propose to create one unit that can easily support implementations where w = 32 or 64. When applying
both operations. In this way resources can be more focused such primes it is possible to jump over Step 3 during the
on one unit, therefore reducing the costs of such unit. multiplication in Figure 1 algorithm. This happens because
Require: z, y ≤ 2p, r = 2wf , p0 = −p−1 (mod 2w ), reduction algorithm. However, because the applied primes
wf > (dlog2 (p)e + 2) do not have the proposed shape then it is not possible to
z = (z[0], . . . , z[f − 1]), y = (y[0], . . . , y[f − 1]) apply the same algorithm or architecture. It is unclear how
Ensure: o = z · y/r (mod p), o = (o[0], . . . , o[f − 1]) to adapt the architecture for the Montgomery multiplication
1: for i ← 0 to f − 1 by 1 do algorithm.
2: h ← z[i] · y Koziel et al. [4] presented the first FPGA implementation
3: q ← p0 · h (mod 2w ) of the SIDH protocol. In a high level view, the entire design
4: o ← (h + q · p)/2w is one processor, with ALU, control unit and it is own
5: end for program. Inside the ALU, the design is more complex,
6: return o because it has multiple modular multiplication units, one
modular addition/subtraction unit and one special inversion
Figure 1. Montgomery multiplication algorithm [11], without final subtrac- unit. The inversion unit was required because the applied
tion [12].
algorithm needed several inversions.
The modular multiplication unit is based on McIvor et
the value of p0 in those cases is 1, since: al. [19], which is a systolic array unit. The systolic array op-
erates with several processing elements, where each element
p = k · 2αw − 1 (mod 2w ) = −1
compute one Montgomery multiplication and reduction. The
p0 = −p−1 (mod 2w ) processing elements transfer the computation between each
p0 = −(−1)−1 (mod 2w ) other, until the last stage, where it goes back to the first
p0 = 1 (mod 2w ) element. Therefore in the final cycles, it is possible to start
the next multiplication while the current is being finalized.
Given p0 = 1, there is another optimization for Montgomery The modular addition unit of Koziel et al. [4] is the
multiplication proposed by Orup [16]. The optimization same one as proposed in Rogawski et al. [20]. It is specially
replaces p by p + 1 in Step 4 of Figure 1 algorithm and as tailored for the FPGA’s internal logic, and optimized for low
consequence Figure 2 algorithm is obtained. In case of mul- latency at the expense of more LUT’s than the manufacturer
tiple words algorithms, as studied by Koç et al. [17], some automatic generated one. It works as a carry-lookahead
multiplications can be avoided because (p + 1) = k · 2αw adder, therefore it works by first solving the carries for
therefore the first αw bits are 0. This final optimization has each bit addition. The carries are solved with “generate”(the
been applied in the SIDH proposal by Costello et al. [8]. logical AND of the addition inputs) and “propagate”(the
logical XOR of the addition inputs) signals, which tells
Require: z, y ≤ 2p, r = 2wf , p0 = −p−1 (mod 2w ), if a certain input will generate a carry, propagate or kill
wf > (dlog2 (p)e + 2) in case of none. Later, the carries are added together with
z = (z[0], . . . , z[f − 1]), y = (y[0], . . . , y[f − 1]) the input bits, thus producing the final addition. With this
Ensure: o = z · y/r (mod p), o = (o[0], . . . , o[f − 1]) optimized adder in hands, it is possible to design a modular
1: for i ← 0 to f − 1 by 1 do addition/subtraction unit through two of these adders and
2: h ← z[i] · y some more resources.
3: o ← (h · (p + 1))/2w The first SIDH design was updated in Koziel et al. [3],
4: end for where the special inversion unit was removed because the
5: return o new algorithm did not needed so many inversions as before.
The most important parts on the design, the adder and the
Figure 2. Montgomery multiplication algorithm [11] with quotient simpli-
fication [16], without final subtraction [12].
modular multiplier were kept the same. Also, results for 2
different primes were added, but the hardware architecture
In case of the most recent proposal SIKE, the primes is proportional with the prime, therefore is necessary to
to be used are 2250 3159 − 1, 2372 3239 − 1, 2486 3301 − 1. reprogram the FPGA according to the chosen parameter.
Therefore, all SIKE primes are Montgomery friendly since In case of both results of Koziel et al. [4] and [3],
a is greater than 64. the target FPGA is a Virtex 7 XC7VX690T which has
108300 Slices, 3600 DSP’s and 1470 BRAM’s. The FPGA
3. Previous works itself has enough space to fit the SIDH implementation,
which makes a good candidate for a first implementation.
Karmakar et al. [18] investigated the SIDH field arith- However, given it is a expensive model [21] than the Spartan
metic. Unfortunately, they proposed it for a different kind 6 family, it would be more interesting to have results on a
of primes p = 2a 3b − 1, where b has to be even, while cheaper FPGA’s, therefore making it more accessible for
in the SIKE proposal it is odd. The authors exploited the other researchers.
prime characteristic to craft an optimized modular reduction
algorithm based on Barret reduction. That option, when 4. Proposal
compared with Montgomery reduction algorithm in soft-
ware, is shown to be better. Also, they proposed a hard- Unlike the work of Koziel et al. [3] where the basic
ware modular multiplication unit based on the new modular units are optimized separately, our idea is to create one
for the 34 bits multiplier was done with the schoolbook
algorithm instead. This was opted to increase the circuit
operating frequency, at the cost of 4 DSP’s instead of 3,
and be able to fit all 8 pipeline stages. Another choice was
to give not only support for unsigned integers, but signed as
well. Signed integers are easier to handle modular additions
and subtractions, since it allows the internal values be on
the negative range.
The optimized adders in Carmela are from Nguyen et
Figure 3. Karatsuba algorithm dataflow graph. (|| means concatenation)
al. [25], more specifically the Add-Add-Multiplex (AAM)
carry select architecture. This architecture is easy to un-
derstand, has a very low latency, and similar results to
multiplier accumulator unit that can support this and other Rogawski et al. [20] architecture [26].
operations. The multiplier accumulator can easily reuse the Carmela in Figure 4 works as a multiplier accumula-
accumulator logic for the modular addition and subtraction, tor and adder/subtractor. During the multiplier accumula-
and be easily programmed to support multiple words multi- tor mode the second stage compressor is not used, and
plication. On the other hand, one disadvantage is that some instead has 0’s as inputs. Then the multiplication happens
direct arithmetic optimizations in modular multiplication or in a Karatsuba algorithm, before the final stage, where the
addition, might not be possible to be done directly in the multiplication output can be added or not to the accumulator
unit. Anyway, since there is already a work with different value. By not adding, the multiplication output can then be
units, we explore this one. registered in “Y” and the accumulator is kept during the
Another goal is to aim for a cheaper family of FPGA’s, entire operation.
such as the Spartan 6. The Spartan 6 family is seen in Carmela in addition/subtraction mode operates with 4
different development boards, and one of them is the Sakura- stages, instead of 8 stages, thus reducing the operation
G [22] which has the XC6SLX75. Unlike the XC7VX690T, latency. In this setting, the compressor in the second stage is
the XC6SLX75 model only has 11662 slices, 132 DSP’s performs an incomplete addition/subtraction of the registers
and 172 BRAM’s, which can be challenging when doing “A” and “B” with the accumulator. After the compression,
the architecture design. The Sakura-G board is made for the values are then transferred to two registers stages, before
side channel attacks, therefore opening future side channel being finalized on the final adder that is also used during
evaluation of SIDH [23]. the multiplication accumulation.
To provide support for primes from at least 503 primes, There is still missing several parts for the SIDH protocol
it is necessary a multiplier of the same size or bigger. or even SIKE itself. Carmela provides only the multiplier
One option is to split the input into small words that can accumulator for 272 bits. To get the required field arithmetic
be multiplied and then summed all together. In case of in Fp2 is still necessary to call Carmela several times itself.
the XC6SLX75 model, the DSP’s has a 17 bits unsigned After giving support to high level arithmetic, it will be
multiplier, which can then make a (4 · 4)17 = 272 bits required to add regular processor instructions for the entire
multiplier with 44 = 256 DSP’s, which is more than the SIDH protocol and some high level functions.
target FPGA supply. The necessary number can be reduce by
apply a different construction algorithm, such as Karatsuba References
multiplication algorithm [24]. The Karatsuba algorithm can
reduce the necessary number of DSP’s to 34 = 81, at [1] National Institute of Standards and Technology,
the cost of more intermediate adders which are then done “Post-quantum cryptography standardization,” 2017,
through the FPGA slices. Also, because of the algorithm [Link]
Post-Quantum-Cryptography-Standardization.
structure it is easy to add intermediate registers for pipelin-
ing, as shown in Figure 3. The register stages can be added [2] D. Jao, R. Azarderakhsh, M. Campagna, C. Costello, L. D.
Feo, B. Hess, A. Jalali, B. Koziel, B. LaMacchia, P. Longa,
before each multiplication and after the final addition. M. Naehrig, J. Renes, V. Soukharev, and D. Urbanik, “SIKE-
One idea is to make the entire 272 bits multiplier accu- Supersingular Isogeny Key Encapsulation,” 2017, [Link]
mulator into a 8 pipeline stages. With 8 stages it is easy to gov/Projects/Post-Quantum-Cryptography/Round-1-Submissions.
perform 8 Fp multiplications in parallel, and as consequence [3] B. Koziel, R. Azarderakhsh, and M. Mozaffari-Kermani, “Fast hard-
2 Fp2 . Since the number of parallel operations seemed ware architectures for supersingular isogeny diffie-hellman key ex-
reasonable, it was opted to aim for a 8 stages multiplier change on fpga,” in Progress in Cryptology – INDOCRYPT 2016.
Springer International Publishing, 2016, pp. 191–206.
accumulator. Other options would be to have 12 stages or
4 stages, but then 3 Fp2 might be more difficult to fit all [4] B. Koziel, R. Azarderakhsh, M. M. Kermani, and D. Jao, “Post-
quantum cryptography on fpga based on isogenies on elliptic curves,”
operations or the lower frequency with less stages would IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 64,
make the control instructions too slow. no. 1, pp. 86–99, 2017.
Figure 4 shows the final multiplier accumulator architec- [5] A. Rostovtsev and A. Stolbunov, “Public-key cryptosystem based on
ture, called Carmela. It requires 108 DSP’s, instead of the isogenies,” Cryptology ePrint Archive, Report 2006/145, 2006, https:
envisioned 81. This is because the first level of Karatasuba //[Link]/2006/145.
Figure 4. Carmela multiplier accumulator, with 272 bits signed multiplication

[6] C. Andrew, J. David, and S. Vladimir, “Constructing elliptic curve [19] C. McIvor, M. McLoone, and J. V. McCanny, “High-radix systolic
isogenies in quantum subexponential time,” Journal of Mathematical modular multiplication on reconfigurable hardware,” in Proceedings.
Cryptology, vol. 8, 1 2013. 2005 IEEE International Conference on Field-Programmable Tech-
nology, 2005., Dec 2005, pp. 13–18.
[7] D. Jao and L. De Feo, “Towards quantum-resistant cryptosystems
from supersingular elliptic curve isogenies,” in Post-Quantum Cryp- [20] M. Rogawski, E. Homsirikamol, and K. Gaj, “A novel modular adder
tography. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. for one thousand bits and more using fast carry chains of modern
19–34. fpgas,” in 2014 24th International Conference on Field Programmable
Logic and Applications (FPL), Sept 2014, pp. 1–8.
[8] C. Costello, P. Longa, and M. Naehrig, “Efficient algorithms for
supersingular isogeny diffie-hellman,” in Advances in Cryptology – [21] Xilinx, “Xilinx Virtex-7 FPGA VC709 Connectivity Kit,” 2011, https:
CRYPTO 2016. Berlin, Heidelberg: Springer Berlin Heidelberg, //[Link]/products/boards-and-kits/[Link].
2016, pp. 572–601. [22] SAKURA Hardware Security Project, “SAKURA-G Side-channel
[9] C. Costello, D. Jao, P. Longa, M. Naehrig, J. Renes, and D. Urbanik, AttacK User Reference Architecture,” MORITA TECH CO.
“Efficient compression of sidh public keys,” in Advances in Cryptol- LTD., Tech. Rep., 2013, [Link]
ogy – EUROCRYPT 2017. Cham: Springer International Publishing, SAKURA-G Spec Ver1.0 [Link].
2017, pp. 679–706. [23] B. Koziel, R. Azarderakhsh, and D. Jao, “Side-channel attacks on
[10] C. Costello and H. Hisil, “A simple and compact algorithm for quantum-resistant supersingular isogeny diffie-hellman,” in Selected
sidh with arbitrary degree isogenies,” in Advances in Cryptology – Areas in Cryptography – SAC 2017. Springer International Publish-
ASIACRYPT 2017. Springer International Publishing, 2017, pp. 303– ing, 2018, pp. 64–81.
329. [24] A. Karatsuba and Y. Ofman, “Multiplication of many-digital numbers
[11] P. L. Montgomery, “Modular Multiplication without Trial Division,” by automatic computers,” in Proceedings of the USSR Academy of
Mathematics of Computation, vol. 44, no. 170, pp. 519–521, 1985. Sciences, 1962, pp. 293–294.

[12] C. Walter, “Montgomery exponentiation needs no final subtractions,” [25] H. D. Nguyen, B. Pasca, and T. B. Preuer, “Fpga-specific arithmetic
Electronics Letters, vol. 35, no. 21, pp. 1831–1832, Oct 1999. optimizations of short-latency adders,” in 2011 21st International
Conference on Field Programmable Logic and Applications, Sept
[13] M. Hamburg, “Fast and compact elliptic-curve cryptography,” Cryp- 2011, pp. 232–237, [Link]
tology ePrint Archive, Report 2012/309, 2012, [Link]
2012/309. [26] T. B. Preuer and M. Krause, “Survey on and re-evaluation of wide
adder architectures on fpgas,” in 2016 International Conference on
[14] J. W. Bos, C. Costello, H. Hisil, and K. Lauter, “Fast cryptography in ReConFigurable Computing and FPGAs (ReConFig), Nov 2016, pp.
genus 2,” in Advances in Cryptology – EUROCRYPT 2013. Berlin, 1–6.
Heidelberg: Springer Berlin Heidelberg, 2013, pp. 194–210.
[15] M. Knežević, F. Vercauteren, and I. Verbauwhede, “Speeding up bi-
partite modular multiplication,” in Arithmetic of Finite Fields. Berlin,
Heidelberg: Springer Berlin Heidelberg, 2010, pp. 166–179.
[16] H. Orup, “Simplifying quotient determination in high-radix modular
multiplication,” in Proceedings of the 12th Symposium on Computer
Arithmetic, Jul 1995, pp. 193–199.
[17] Ç. K. Ko, T. Acar, and B. S. Kaliski Jr., “Analyzing and comparing
Montgomery multiplication algorithms,” Micro, IEEE, vol. 16, no. 3,
pp. 26–33, Jun 1996.
[18] A. Karmakar, S. S. Roy, F. Vercauteren, and I. Verbauwhede, “Ef-
ficient finite field multiplication for isogeny based post quantum
cryptography,” in Arithmetic of Finite Fields. Springer International
Publishing, 2016, pp. 193–207.

You might also like