0% found this document useful (0 votes)
9 views12 pages

Machine Learning in Side Channel Attacks

This conference paper discusses a novel approach to side channel attacks in cryptography using machine learning techniques. The authors explore the relationship between power consumption and encryption keys, formalizing it as a supervised learning task and demonstrating the effectiveness of their methods on the 3DES protocol. Their results indicate that machine learning can significantly improve the speed and accuracy of cryptanalysis compared to traditional statistical techniques.
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views12 pages

Machine Learning in Side Channel Attacks

This conference paper discusses a novel approach to side channel attacks in cryptography using machine learning techniques. The authors explore the relationship between power consumption and encryption keys, formalizing it as a supervised learning task and demonstrating the effectiveness of their methods on the 3DES protocol. Their results indicate that machine learning can significantly improve the speed and accuracy of cryptanalysis compared to traditional statistical techniques.
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

See discussions, stats, and author profiles for this publication at: [Link]

net/publication/269463480

Side channel attack: An approach based on machine learning

Conference Paper · February 2011

CITATIONS READS

124 1,996

3 authors:

Olivier Markowitch Liran Lerman


Université Libre de Bruxelles Université Libre de Bruxelles
117 PUBLICATIONS 2,331 CITATIONS 22 PUBLICATIONS 746 CITATIONS

SEE PROFILE SEE PROFILE

Gianluca Bontempi
Université Libre de Bruxelles
397 PUBLICATIONS 20,451 CITATIONS

SEE PROFILE

All content following this page was uploaded by Gianluca Bontempi on 29 September 2015.

The user has requested enhancement of the downloaded file.


Side channel attack :
an approach based on machine learning

Liran Lerman & Gianluca Bontempi & Olivier Markowitch

Département d’informatique,
Université Libre de Bruxelles,
Boulevard du Triomphe,
1050 Brussels, Belgium.

Abstract. In cryptography, a side channel attack is any attack based on


the analysis of measurements related to the physical implementation of
a cryptosystem. Nowadays, the possibility of collecting a large amount
of observations paves the way to the adoption of machine learning tech-
niques, i.e. techniques able to extract information and patterns from
large datasets. The use of statistical techniques for side channel attacks
is not new. Techniques like Template Based DPA have shown their ef-
fectiveness in recent years. However these techniques rely on parametric
assumptions and are often limited to small dimensionality setting, which
limits their range of application. This paper explores the use of machine
learning techniques to relax such assumption and to deal which large
dimensional feature vectors.
For this purpose, we first formalize the problem of studying the relation
between power consumption and encryption key as a supervised learning
task. Then we compare and assess several classifier and dimensionality
reduction techniques in a large experimental setting. Our promising re-
sults for the 3DES protocol confirms the importance of adopting machine
learning approaches in cryptanalysis.

Keywords: Cryptanalysis, side channel attack, template Based DPA,


machine learning.

1 Introduction

Side channel attacks [8] take advantage of the fact that instantaneous power
consumption, encryption time [7] or/and electromagnetic leaks [5] of a cryp-
tographic device depend on the processed data and the performed operations.
Power analysis attacks are an instance of side-channel attacks which assume that
different encryption keys imply a different power consumptions. The evolution
of the techniques proposed for power analysis attacks along the years has been
characterised by an increase of the sophistication of the statistical analysis.
SPA is the first approach proposed for power attack and it relies on an
interpretation of the power consumption in order to deduce information about
the used key. In other words, it tries to detect a pattern in a trace linked to
2 Liran Lerman & Gianluca Bontempi & Olivier Markowitch

a information about the operation executed. For exemple, Kocher [7] in 1996
showed that RSA implemented with a square and multiply algorithm allows to
recover the key.
Differential Power Analysis (DPA) [8] uses a more advanced statistical anal-
ysis than SPA by modelling theoretic power consumption for each key. The
likelihood of the observed power consumption for each model is then used to
predict the key.
Template Based DPA [3] makes an additional step by estimating the condi-
tional probability of the trace for each key . This method relies on a parametric
Gaussian estimation approach which, though effective in some cases [10], has
several limitations. The method is ill-conditioned if the number of observed
traces is smaller than the number of features used to describe the trace. This is
the case if the trace is modelled by the entire sequence of collected observations.
This paper intends to make one step further in the improvement of the sta-
tistical analysis of power consumption data by taking advantage of machine
learning techniques. The role of machine learning in cryptanalysis has already
been discussed in [13]. An application of machine learning to cryptanalysis is
presented in [2] where a machine learning algorithm is used to find information
about the printed characters of a printer by exploiting the information hidden
into the acoustic noise.
Here we focus on two aspects of power consumption analysis which has been
neglected so far in the litterature: the issue of dimensionality reduction and the
one of model selection. The first aims to extract from the observed data a mini-
mal number of features able to take into account the information that the trace
brings about the key. The second aims to go beyond the parametric assumptions
made in TDPA by using techniques of model assessment and selection to find
in a nonparametric and data-driven way the technique which provides the best
accuracy in predicting the key.
We will show that a machine learning procedure based on dimensionality re-
duction and model selection is able to outperform conventional TDPA by finding
a byte of the 3DES encryption key twice faster.
This paper is organized as follows: Section 2 formalizes the problem as a
supervised learning task. Section 3 presents the machine learning approach.
Section 4 looks the implementation of our method, compare our new method
with others. Section 5 concludes the paper and discusses future work.

2 The TDPA approach

Let us consider a crypto device executing a decryption/encryption algorithm


with the key Ok , k = 1, . . . , K, where K is the number of keys. Let us observe
N times over a time interval of length n the power consumption of such device
(k)
and let us denote by trace the series of observations Ti ∈ ℜn , i = 1, . . . , N .
The state-of-the-art Template Based DPA approaches models the stochastic de-
pendency between the key and a trace by the following multivariate normal
Side channel attack : an approach based on machine learning 3

conditional density
1 1 (k) (k) T
P (T (k) |Ok ; µk , Σk ) = p e− 2 (T −µk )Σk (T −µk )
−1

n
(2π) |Σk |
where µk ∈ ℜn and Σk , k = 1, . . . , K, are respectively the expected value and
the covariance of the n variate trace associated to the kth key.
(k)
Once a set of N traces Ti , i = 1, . . . , N , is observed for each key the TDPA
estimates the expected value µk and the covariance Σk by the sampled mean
N
1 X (k)
µ̂k = T
N i=1 i

and the sample covariance


N
1 X (k) (k)
Σ̂k = (T − µ̂k )T (Ti − µ̂k )
N i=1 i

Then, once an unlabeled trace T is observed, the techniques returns the key
which maximises the likelihood

k̂ = arg max P̂ (T |Ok ) = P (T |Ok ; µ̂k , Σ̂k )


k

This approach makes implicitly the assumption that the distribution of the traces
for a given key follows a parametric Gaussian distribution, whose number of
parameters is equal to (n2 + 3n)/2. Note that the number of parameters can
rapidly become very large, and much larger than N , for observation intervals of
moderate size (e.g. n > 20).

3 Our approach
This paper proposed the adoption of a machine learning approach to estimate
from a set of annotated traces the conditional distribution P (Ok |T ). In order to
learn this model from data we implement a procedure which relies on two steps:
dimensionality reduction and a model selection.
Dimensionality reduction is necessary in order to deal with experimental
settings where the number n of time steps is comparable or larger than the
number N of collected traces. Model selection is used in order to avoid the
parametric assumption made in TDPA and find in a data driven perspective
the model which best fits the stochastic dependency between key and power
consumption.
In particular, three techniques of dimensionality reduction are considered
(i.e. PCA, Ranking, mRMR) and three different learning algorithms (i.e. Self
Organizing Map (SOM), Support Vector Machine (SVM), Random Forest (RF)).
In order to assess the predictive power of our models, we use the method of
leave-one-out.
4 Liran Lerman & Gianluca Bontempi & Olivier Markowitch

4 Experiments and discussion

The device attacked is a FPGA. The product encrypted a random constant


message of 64 bits, with the block cipher 3DES which was used with 3 different
keys.
We used oscilloscope to collect traces of all the deferent keys. Each trace
collected was linked with a key. Each key was linked to 400 traces to reduce
noise through a preprocessing (section 5.1). To determine the best model, a
single byte of key was attacked (section 5.3). The models used are those of
supervised classification. After choosing the best model, we attacked all the
other bytes of the 3 key 3DES (section 5.4). It allowed to compare our model
with the template Based DPA (section 5.5). Finally we discussed on the result
(section 5.6).

4.1 Preprocessing

To find the best model, we collected traces T.,1...N from an oscilloscope Agile
infiniium DSO80204B 2Ghz 40GSa/s.
The measured traces collected were noisy which made it difficult to be certain
of the key used. To reduce the noise, we realized an average of 400 traces linked
to the same key: T1...400,1...N ∼ Oj . We end up with a single trace T̄j,1...N ,
consisted of 20000 points (N = 20000) that were sample by the oscilloscope, for
each key Oj :
1 X
T̄j,1...20000 = T1...20000
400
T1...20000 ∼Oj

There are 8 bits in each byte of a key but we did not attack the parity bit so we
end up with 7 bits to find. We had only 128 (= 27 ) traces in total to find the
best model.

4.2 3D visualization

Our first idea was to see if there is a dependence between Tj,1...N and Oj and
we realized it by projecting each trace in 3D. This visualization provides a first
idea of the clustering of traces linked to keys using the same value of a certain
bit. To project the traces, we used the method PCA explained in the paper
[11] written by Pearson in 1901. All subfigures are in figure 1. Each subfigure
focuses on a bit of last byte of the first key of 3DES. Points of the same color
are linked to traces connected to keys that share the same value for a bit. With
these visualizations, we saw that the more we approach the parity bit, the more
complex it is to predict the value. This way, we can predict more accurately
forecasting the last bit to the 8th byte of the first key.
Side channel attack : an approach based on machine learning 5
.
Figure 1: This figure shows the 128 traces, from the 8th byte of the first key,
projected in 3D. The black dots represent a bit value 1 and the others symbolize a bit
value 0. The different colors indicate a different value of: the 7th bit in A, of the 6th
bit in B, of the 5th bit in C, of the 4th bit in D, of the 3rd bit in E, of the 2nd bit in
F and of the 1st bit in G.

4.3 Select a Model


By knowing the period of encryption (between times 8000 and 17399), we used
only a subset of each trace : T̄.,8000...17399 .
We constructed a model of machine learning for each bit of byte attacked.
Because a small number of traces, we used the method of leave-one-out for
estimating the quality of our models.
We used 3 different types of models and 4 types of feature selection. In the
following, the notation A / B is that we used the model A with the feature
selection B.
6 Liran Lerman & Gianluca Bontempi & Olivier Markowitch

The different types are used:

– SOM(8 × 5) /: a self organizing map with size 8 × 5


– SOM(9 × 5) /: a self organizing map with size 9 × 5
– SOM(8 × 6) /: a self organizing map with size 8 × 6
– SOM(9 × 6) /: a self organizing map with size 9 × 6
– SVM / Rank: a support vector machine with feature selection rank
– SVM / Nosel: a support vector machine with feature selection nosel
– RF / Rank: a random forest with feature selection rank
– RF / Nosel: a random forest with feature selection nosel
– RF / SOM: a random forest with feature selection self organizing map
– RF / PCA: a random forest with feature selection principal component anal-
ysis

Results are available in the table below. It shows the probability to estimate a
bit of the key by each model.

7th bit 6th bit 5th bit 4th bit 3rd bit 2nd bit 1st bit
SOM(8 × 5) 96.09 90.23 87.50 74.22 53.52 50.00 50.00
SOM(9 × 5) 97.27 92.19 83.98 69.53 57.81 51.17 50.00
SOM(8 × 6) 96.48 89.45 83.59 73.44 57.42 50.00 50.00
SOM(9 × 6) 95.70 92.97 85.94 78.52 58.20 51.56 50.00
SVM / Rank 94.53 80.47 72.66 62.5 50.00 50.78 50.00
SVM / Nosel 96.48 90.23 82.81 73.05 64.06 53.52 50.00
RF / Rank 97.66 83.98 81.64 77.34 61.33 57.42 50.00
RF / Nosel 96.09 92.58 89.06 83.98 59.77 55.47 50.00
RF / SOM 96.48 89.06 82.81 76.17 60.94 50.00 50.00
RF / PCA 96.09 92.58 90.63 85.55 75.39 58.98 50.00

We noticed that the fastest method with the most efficient results is the
RF/PCA where the first bit is estimate with 96.09% accuracy, the second bit is
estimate with 92.58, ...

4.4 Implementation of the attack on all bytes

The next step was to obtain new traces on device in order to attack the other
bytes of the 3DES encryption key. Then, a computer sent 400 times a request
for encryption of a random constant message using the same key. This was done
for all of the key bytes, so 128 times per byte, for 21 bytes. In the same time of
each request, an oscilloscope1 measured the power consumption of the FPGA.
Therefore, we measured 1075200 (= 21×128×400) traces, each containing 6000
moments of encryption. After that, a preprocessing was made as it can be seen
in section 5.1. In total, we had more than 2688 (= 21×128) traces. The results,
were attack on each byte individually as you can see in the table below.
1
We used this time another oscilloscope Agile infiniium 1GHz 4GSa/s.
Side channel attack : an approach based on machine learning 7

7th bit 6th bit 5th bit 4th bit 3rd bit 2nd bit 1st bit Dim
1st key
1st byte 78.13 65.63 77.34 60.16 60.16 53.13 50.00 61
2nd byte 85.16 75.00 67.97 50.00 57.03 50.00 50.00 17
3rd byte 78.91 67.97 70.31 69.53 67.97 50.00 51.56 44
4th byte 85.16 73.44 60.94 57.81 50.00 50.00 54.69 25
5th byte 89.84 78.91 65.63 60.16 64.84 52.34 50.00 28
6th byte 82.03 73.44 60.16 59.38 50.78 54.69 60.94 40
7th byte 69.53 67.19 61.72 50.78 54.69 50.00 50.00 24
8th byte 78.91 72.66 56.25 50.00 53.91 50.00 50.00 39
2nd key
1st byte 95.31 67.19 70.31 59.38 53.91 55.47 50.00 18
2nd byte 78.13 75.00 67.19 59.38 50.00 50.00 57.03 51
3rd byte 97.66 85.94 65.63 57.81 50.00 50.00 50.00 28
4th byte 93.75 84.38 63.28 52.34 57.03 52.34 50.00 41
5th byte 92.19 82.81 67.97 63.28 50.00 62.50 50.00 43
6th byte 75.00 71.88 64.06 65.63 50.00 50.00 54.69 68
7th byte 90.63 69.53 70.31 61.72 56.25 51.56 50.00 2
8th byte 91.41 83.59 82.81 67.19 64.84 50.00 50.00 32
3rd key
1st byte 89.84 74.22 62.50 54.69 60.94 50.00 54.69 23
2nd byte 96.09 82.81 64.06 60.16 65.63 50.00 50.00 31
3rd byte 95.31 84.38 76.56 54.69 60.94 50.00 50.00 17
4th byte 84.38 74.22 68.75 64.06 57.03 50.00 50.00 6
5th byte 93.75 81.25 60.94 54.69 57.81 50.00 50.00 16
6th byte 90.63 89.84 72.66 68.75 60.16 50.00 50.00 56
7th byte 96.88 87.50 64.06 61.72 61.72 50.00 50.00 30
8th byte 71.09 66.41 64.06 65.63 50.00 60.94 50.00 5
average 86.66 76.47 66.89 59.54 56.90 51.79 51.40 31.04
s.d.2 8.44 7.39 6.09 5.71 5.71 3.45 2.88 17.38

We find that in average the last bits are more predictable than others. More-
over, on average, the number of variables to consider is about 31 with a standard
deviation of 17.38.

Seeing that the last bits are systematically more predictable than others, we
proposed a strategy of attack called the improved brute force. We execute RF
/ PCA to predict the encryption key. In the case where the key is not predicted
correctly, we substitute the value of the most complex bits to predict. So it is
a brute force but improved by the way that we take into account the rate of
correct predictions for each bit of the key.
8 Liran Lerman & Gianluca Bontempi & Olivier Markowitch

Example
Suppose we need to predict a key of 8 bits and our model has predicted the value
0011 1101. Suppose that the first bits are more complex to predict than others.
The following table summarizes the probabilities to predict each bit:

8th bit 7th bit 6th bit 5th bit 4th bit 3rd bit 2nd bit 1st bit
90.42 86.66 76.47 66.89 59.54 56.90 51.79 51.40

If the model has not correctly predicted the key we first must change the first
bit. Then we will test first 0011 1101, then 0011 1100, then 0011 1111, then
0011 1110, then 0011 1001, ...

4.5 Comparison between RF / PCA and template Based DPA


We will evaluate our model the RF / PCA with the template based DPA. For
this we realize attacks in a byte of the key in the same conditions involving the
use of the same:
1. Oscilloscope
2. Device that implements the 3DES encryption
3. Probes
4. Number of traces
5. Traces
6. Computer that performs the calculations of effectiveness of each attack.
We reduce the number of points for each trace through a feature selection
method.
The mRMR explained in the paper [12] written by Peng, Long & Ding in
2005 was used for the template Based DPA. The result of the template Based
DPA / mRMR are summarized in figure 2.

Figure 2: Probability of predicting the byte according to the amount of dimensions


considered.

We found that with 35 dimensions there is 5.80% chances of finding the byte.
The rate of correct predictions is below the rate of model RF / PCA. There for
the template Based DPA is less effective than our model in this case. The table
below shows the probability of finding each bit considering 35 variables.

7th bit 6th bit 5th bit 4th bit 3rd bit 2nd bit 1st bit
94.53 78.13 78.13 67.19 53.13 55.47 50.78

According to the results of template Based DPA / mRMR we note that be-
yond 35 dimensions we were not able to build a reliable model. Worse, beyond 59
variables taken into account, we get zero value for the determinant of the covari-
ance matrix Σj . This leads to an infinite value in the calculation of probabilities
Side channel attack : an approach based on machine learning 9

Figure 3: Probability of predicting the byte according to the amount of dimensions


considered.

that lead to impossibility of calculation. To resolve the problem of estimating


the covariance matrix we used a method called “Shrinkage estimation cov ”. The
results are summarized in figure 3.
Aside from the fact that we have no problems for the calculation of proba-
bilities, we did not improve the result.

4.6 Discussion

During our research, we use the encryption device FPGA that implement 3DES
with three different keys. Thanks to the method PCA, we showed in 3D a link
between traces and the value of the key. After that, we collected a large set of
traces to estimate the best model. The best one was RF/PCA. It allowed us to
attack all bytes of the 3DES encryption key. We find that in average the last
bits of each byte are more predictable than others. It allowed us to suggest a
strategy of attack called the improved brute force.
After this, we compare our model with template Based DPA. When we use
RF/PCA we see that we can find some of the key byte in 15.33% cases while
it is 5.80% with the best attack. When we use the concept of guessing entropy
explained in the paper [15] written by Standaert, Malkin & Yung in 2009, we
show that to find the right value of one byte of the key:

1. We need to make in average 11 tests with our strategy “improved brute


force” if we use the RF / PCA prediction.
2. We need to make in average 21 tests with our strategy “improved brute
force” if we use the template Based DPA / mRMR prediction.

Besides the increased number of keys in average to test and a bigger problem
of dimensionality, we note that beyond 35 dimensions template Based DPA /
mRMR were not able to build a reliable model. Worse, beyond 59 variables
taken into account we get an infinite value in the calculation of probabilities
that lead to impossibility of calculation. To resolve the problem of estimating
the covariance matrix we used a method called “Shrinkage estimation cov ”. Aside
from the fact that we have no problems for the calculation of probabilities, we
did not improve the result. Furthermore, our model takes less dimensions than
the template Based DPA.
A question that intrigued us since the beginning of the study was to know
whether there is any important information outside the period of encryption. In
other words, is there information about the encryption key outside the period
of encryption?
To be able to answer this question, a machine learning method mRMR was
used that allows to sort moments of the encryption. The sorting done so the
first are those with the greatest dependence with the encryption key and less
redundancy between them.
10 Liran Lerman & Gianluca Bontempi & Olivier Markowitch

The results are that there was a lot of information beyond the period that
the encryption procedure. The reason would be that encryption key is sent
unencrypted to the FPGA before encryption by a co-processor Intel 386.

5 Conclusion

We found that the combination of machine learning and cryptanalysis gave sur-
prising results. Indeed, we found an improvement of the best attack DPA,
namely template Based DPA.
Our model seems to be more effective than the most effective attack method
against cryptographic device. When we compare our model with template Based
DPA, we saw that our model are twice faster than the template Based DPA and
have a smaller problem of dimensionality.

References
[1] D. Agrawal & J.R. Rao & P. Rohatgi, (2003), “Multi-Channel Attacks”, in the
proceedings of Cryptographic Hardware and Embedded Systems (CHES) 2003,
LNCS, vol 2779, pp 2-16, Cologne, Germany.
[2] M. Backes & M. Dürmuth & S. Gerling & M. Pinkal & C. Sporleder, (August 11-
13, 2010), “Acoustic side-channel attacks on printers”, to appear in; Proceedings of
the 19th USENIX Security Symposium, Washington, DC, USA
[3] S. Chari & J. R. Rao & P. Rohatgi, (2002), “Template Attacks ”, in CHES, volume
2523 of LNCS, pages 13–28. Springer.
[4] M.A. El Aabid & S. Guilley & P. Hoogvorst, (2007), “Template Attacks with a
Power Model ”, Cryptology ePrint Archive, Report.
[5] K. Gandolfi & C. Mourtel & F. Olivier, (2001), “Electromagnetic analysis: Concrete
results”, CHES 2001, C¸ . K. Koc¸, D. Naccache, and C. Paar, Eds., vol. 2162 of
LNCS, pp. 255–265, Springer-Verlag.
[6] P. C. Kocher & J. Jaffe & B. Jun. (1998), “Introduction to differential power anal-
ysis and related attacks”, Rapport Technique, Cryptography Research.
[7] P. C. Kocher. (1996), “Timing attacks on implementations of Diffie-Hellman, RSA,
DSS, and other systems”, Neal Koblitz, Advances in Cryptology – CRYPTO’96,
volume 1109 de Lecture Notes in Computer Science, pages 104–113. Springer-
Verlag.
[8] P. C. Kocher & J. Jaffe & B. Jun, (1999), “Differential Power Analysis: Leaking
Secrets”, In Proc. Crypto ’99, Springer-Verlag, LNCS 1666, pages 388–397.
[9] H. Liu & H. Motoda, (2007), “Computational Methods of Feature Selection”, edi-
tors, Chapman and Hall/CRC Press.
[10] S. Mangard & E. Oswald & T. Popp, (2007), “Power Analysis Attacks: Revealing
the Secrets of Smart Cards”, Springer, Cambridge, Massachusetts.
[11] K. Pearson, (1901), “On Lines and Planes of Closest Fit to Systems of Points in
Space”, Philosophical Magazine 2 (6), 559–572.
[12] H. Peng & F. Long & C. Ding, (2005), “Feature Selection based on Mutual In-
formation: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy”,
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 27, No 8,
pp 1226-1238.
Side channel attack : an approach based on machine learning 11

[13] R. L. Rivest, (1993), “Cryptography and Machine learning”, Laboratory for Com-
puter Science, Massachusetts Insitute of Technology, Cambridge.
[14] S.S. Shapiro & M.B. Wilk, (1965), “An analysis of variance test for normality
(complete samples)”, Biometrika, 52, 591-611.
[15] F.-X. Standaert & T. G. Malkin & M. Yung, (2009), “A Unified Framework for
the Analysis of Side-Channel Key Recovery Attacks”, in International Conference
on the Theory and Applications of Cryptographic Techniques - Eurocrypt 2009,
ser. LNCS, vol. 5479. Springer, pp. 443–461.

View publication stats

You might also like