Machine Unlearning
Stefan Schoepf
arXiv:2402.01401v4 [[Link]] 2 Dec 2024
ss2823@[Link]
Department of Engineering
University of Cambridge, UK
Abstract
To comply with AI and data regulations, the need to forget private or copyrighted information
from trained machine learning models is increasingly important. The key challenge in
unlearning is forgetting the necessary data in a timely manner, while preserving model
performance. In this work, we address the zero-shot unlearning scenario, whereby an
unlearning algorithm must be able to remove data given only a trained model and the
data to be forgotten. We explore unlearning from an information theoretic perspective,
connecting the influence of a sample to the information gain a model receives by observing
it. From this, we derive a simple but principled zero-shot unlearning method based on
the geometry of the model. Our approach takes the form of minimising the gradient of a
learned function with respect to a small neighbourhood around a target forget point. This
induces a smoothing effect, causing forgetting by moving the boundary of the classifier. We
explore the intuition behind why this approach can jointly unlearn forget samples while
preserving general model performance through a series of low-dimensional experiments.
We perform extensive empirical evaluation of our method over a range of contemporary
benchmarks, verifying that our method is competitive with state-of-the-art performance
under the strict constraints of zero-shot unlearning. Code for the project can be found at
[Link]
1 Introduction
Regulations such as the General Data Protection Regulation (GDPR) enshrine an individual’s data autonomy
rights, including the right to be forgotten. While deleting an entry from a database is relatively straightforward,
removing the influence of that data from a trained model is a challenging open problem. The process of
removal, referred to as unlearning, is difficult for several reasons. It is known that neural networks memorise
instance-level information (Arpit et al., 2017; Zhang et al., 2021; Feldman, 2020), and it is practically
intractable to ascribe parameter changes to a specific training sample post-hoc (Kurmanji et al., 2023). At its
core, unlearning is a multi-objective optimization problem with three key desiderata. An effective unlearning
algorithm must remove the influence of the selected subset of data, maintain model performance on retained
data, and minimise computational cost. These goals are antagonistic, since inducing forgetting inevitably
disrupts the model’s performance; balancing these objectives is therefore key. Naïvely, one may achieve perfect
forgetting by retraining a model on the training data, without the forget samples, every time there is a forget
request. However, this is prohibitively expensive, thus violating the third desideratum.
Many proposed unlearning methods are effective; however, they make strong assumptions about
the problem setting that simplify the task considerably. Primarily, existing methods typically assume
access to all, or a subset of, the training data. This data is used in different ways, such as to fine-tune the
model post-forgetting (Chundawat et al., 2023a; Graves et al., 2021), or to conduct parameter importance
calculations after the initial training period (Foster et al., 2023). In reality, there are many reasons why this
data could be unavailable, such as cost of storage, limited duration access to datasets, or an oversight in
considering machine unlearning during model development. As such, Chundawat et al. (2023b) introduce a
novel problem setting for unlearning, termed zero-shot (ZS) unlearning, whereby only the data to be forgotten
and the trained model are available (Figure 1). This is extremely challenging, since the remaining data is not
available to protect model performance, and thus more delicate methods are required. An insightful treatment
of the ZS scenario can be found in Chen et al. (2023), where unlearning is formulated as reconstructing a
decision boundary that could be reasonably learnt by a model trained without the forget data, achieved
through learning the nearest false label for each forget sample.
In this work, we approach ZS unlearning from an information theoretic perspective. Golatkar et al. (2020)
consider information leakage when observing model weights, whereas we consider the information gained by a
model by training on a given sample. Data points offer a classifier different amounts of information gain
when included in the training data (Lindley, 1956; Houlsby et al., 2011). If a data point can be inferred from
other training data, then it offers little information gain (Jeong and Qiu, 2018). In ZS unlearning, knowing
the contribution a sample has made to a model is hard as we have access to only the model and the sample
to be forgotten. We postulate that the information gain of a sample can serve as a good indicator for how
much influence it has over a model. From this hypothesis, we derive a principled loss to induce forgetting
based on how the geometry of the problem space changes with respect to the softmax classifier. If a sample
offers little information gain then it may be inferred from other data, and the classifier’s output should therefore
change minimally over similar samples. This minimal change can be effectively measured via the curvature
of the model, which should be low in the region surrounding such a point. In contrast, a high information gain
sample cannot be inferred, and one would therefore expect it to lie in a high-curvature region of space. We
present Just in Time (JiT) unlearning, a novel ZS unlearning algorithm based on minimising the gradients
of a classifier with respect to a local neighbourhood around each forget sample. We show in low-dimensional
experiments that removing training samples from low-curvature regions yields minimal changes to the learned
decision boundary, whereas removing samples from high curvature regions has a more pronounced effect.
We therefore propose that, with reference to Feldman (2020), samples with low information gain may be
predicted with more generalised knowledge and thus do not infringe on privacy. In contrast, samples with
large information gain have significant impact on the learned classifier, are more likely memorised, and do
infringe on privacy. We demonstrate empirically that following these principles, JiT causes the removal of
influence from the forget set while preserving generalisation performance across the wider space.
Figure 1: Visualization of the zero-shot unlearning scenario. Contrary to traditional unlearning there is no
access to, or prior knowledge of, any data other than the forget set or the model at any point beyond its
current state. These constraints make the problem considerably more challenging.
• To the best of our knowledge, JiT is the first unlearning algorithm to be directly informed by the
information gain of a sample.
• We provide extensive empirical analysis of the geometry of JiT unlearning in low dimensions.
• We show our method is competitive with existing SOTA in the zero-shot domain.
2 Related Work
Information theory is concerned with the transmission, quantification, and storage of information (Shannon,
1948), and has seen widespread use in machine learning. Most relevant here is its use in determining the
information gain of an experiment (Lindley, 1956). This notion has seen uses such as determining splits in
decision trees (Quinlan, 1986), and active learning (Tong and Koller, 2001; Houlsby et al., 2011). Here we use
this concept as a proxy to a training sample’s influence on a learned function.
Machine unlearning was first introduced in Cao and Yang (2015), and a probabilistic perspective of
unlearning was explored in Ginart et al. (2019); Sekhari et al. (2021); Gupta et al. (2021); Neel et al. (2021);
Triantafillou et al. (2023). Here we focus on post-hoc unlearning methods that operate on models that are
already trained. Methods exist that alter the initial training scheme (Bourtoule et al., 2021; Mehta et al., 2022;
Shah et al., 2023), but these are considered out of scope as they do not satisfy the ZS problem constraints.
Current SOTA methods rely on accessing all or a subset of the original dataset that is not to be forgotten
(i.e. the retain set), thus violating the ZS constraints. Bad Teacher unlearning (Chundawat et al., 2023a) and
SCRUB (Kurmanji et al., 2023) leverage a student-teacher framework while Amnesiac unlearning (Graves
et al., 2021) trains with randomised labels for forget data before fine-tuning on the retained data to repair
the model. UNSIR (Tarun et al., 2023) learns an error-maximising noise to induce forgetting of the necessary
data, before also employing a finetuning step. Warnecke et al. (2021) minimise the divergence in model
output over a sample and its noisy perturbations and then finetune. A key limitation of these methods
is that protecting model performance necessitates access to retain-set data for the entire duration of the
model’s lifetime. To address this, Golatkar et al. (2020) and Foster et al. (2023) propose methods that do not
require fine-tuning or repair steps. Golatkar et al. (2020) derive an unlearning algorithm that minimises
information gained about the training data when observing model weights. However, this scales quadratically
with dataset size and often performs considerably worse than state-of-the-art (Tarun et al., 2023; Foster et al.,
2023). Selective Synaptic Dampening is a scalable retrain-free approach, based on inducing forgetting by
selectively dampening parameters that are disproportionately important to the forget-set (Foster et al., 2023).
This requires access to the whole dataset at least once, to calculate the importance over the retained data.
Chundawat et al. (2023b) introduce two methods to address ZS unlearning. The first method, an extension
of Tarun et al. (2023), replaces the repair step with an error-minimising noise. The second utilises a generator
network and an attention loss to distil knowledge from an expert teacher, with a band-pass filter preventing
the flow of knowledge for specific classes. Both methods are slow, do not scale well to large problem spaces,
and can only forget entire classes. Chen et al. (2023) present boundary shrinking and boundary expanding.
Shrinking causes unlearning by training over the nearest false label for forget samples, found via a fast
gradient sign attack (Goodfellow et al., 2014). While performant, shrinking scales poorly with model and
input size. Boundary expanding is faster but less performant, remapping forget samples by training them to
fit a new output neuron, before removing the neuron leaving the forget samples in high entropy states.
Membership inference attacks (MIA) are a way of measuring information leakage of a machine learning
model (Shokri et al., 2017). An auxiliary model (e.g. a logistic regression) is trained to infer whether a given
data point was included in a model’s training data. MIAs are used as typical evaluation metrics in machine
unlearning; if a MIA cannot recognise a forgotten sample as an element from the train set, then this presents
empirical evidence that the sample has indeed been forgotten.
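As an illustration of the attack described above, the following is a minimal sketch of an entropy-based MIA with a logistic-regression attack model. The helper names (`fit_logistic`, `predicted_member`), the single entropy feature, and the synthetic entropy values are all assumptions made for this example; they stand in for a real target model's outputs and are not the attack used in the benchmarks.

```python
import numpy as np

def fit_logistic(features, labels, lr=0.5, steps=2000):
    """Fit a one-feature logistic regression by plain gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * features + b)))
        w -= lr * np.mean((p - labels) * features)
        b -= lr * np.mean(p - labels)
    return w, b

def predicted_member(w, b, features):
    """True where the attack model believes the sample was in the training set."""
    return (w * features + b) > 0.0

rng = np.random.default_rng(0)
# Synthetic stand-ins for the target model's output entropies (assumed values):
# training members tend to receive confident, low-entropy predictions.
member_entropy = rng.normal(0.2, 0.1, size=200)
non_member_entropy = rng.normal(1.0, 0.2, size=200)

features = np.concatenate([member_entropy, non_member_entropy])
labels = np.concatenate([np.ones(200), np.zeros(200)])
w, b = fit_logistic(features, labels)

# Forget samples after a hypothetically successful unlearning step look like non-members.
forgotten_entropy = rng.normal(1.0, 0.2, size=200)
mia_score = float(np.mean(predicted_member(w, b, forgotten_entropy)))
```

Here `mia_score` is the fraction of forget samples that the attack still flags as training members; a low value is the empirical evidence of forgetting referred to above.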
3 Preliminaries
We introduce the notation for machine unlearning in a supervised classification task, consistent with the
approach outlined in Chen et al. (2023). Consider some input space X ⊂ Rd and some output label space
Y ⊂ Rc , where d is the dimensionality of the input and c is the number of classes. We define a training
dataset D = {(xi, yi)}_{i=1}^{N} ⊆ X × Y, where xi is a training input sample with the label yi. We denote a subset
Df ⊆ D as the forget set, and Dr = D \ Df as the retain set.
Let fθ : X → Y be a neural network with parameters θ. We assume that fθ is well trained and generalises well
to in-distribution test samples. The objective of ZS unlearning is, given only the forget set Df and the model fθ
trained on D, to remove the influence of Df from the learned model such that the unlearnt model fθ′ is approximately
equivalent to a model retrained on only Dr, which we define as the optimal solution, fθ∗. Since direct access
to fθ∗ is, by definition of the unlearning problem, impossible, existing works construct approximate heuristics
to induce forgetting and use a membership inference attack to measure forgetting. These attacks typically
evaluate the difference in the output distributions of the model over train and test samples.
4 Proposed Method
In this section, we introduce our JiT unlearning method and provide intuition for its effectiveness by examining
the geometry of a learned classifier and analyzing how the location of a forget sample in the input space
can impact the decision boundary of a model retrained without it. This section considers the case where
|Df | = 1, noting that larger subsets have an additive effect.
Consider the hypothetical of removing a single image of a black cat from a dataset comprised of 1 million
black cats and 1 million white dogs. This likely has minimal effects on the learning process, since a well
trained classifier should generalise and infer the class of the sample easily. If a sample may be removed from
the training dataset without significant changes in the resultant model, we posit that an unlearning algorithm
should also have minimal effect on the model when unlearning such a sample. As such, it is logical to design
an unlearning algorithm that accounts for the information gain of a sample. However, directly measuring this
quantity is difficult, especially in a ZS setting where there is no access to other data points. We therefore
seek to derive a heuristic that can approximate how much information gain a sample may have offered the
model, based on only that sample and the model itself. We begin by formally introducing the notion of a
neighbourhood of a target sample:
Definition 4.1 (Neighbourhood of a sample). For a data point x ∈ X , let its neighbourhood Br (x) be the
bounded subspace of X containing x, such that ∥x̂ − x∥2 ≤ r for all x̂ ∈ Br(x), for some bound r ∈ R.
From this definition, we introduce the concept of the amount of information contained within a target sample
depending upon its neighbourhood.
Figure 2: Demonstration of how the boundary of a classifier moves during unlearning. Retrained model is
the gold standard. Removing a sample from a low-curvature region has almost no effect on the retrained
model, whereas removing a sample from high curvature space has significant impact. In this low-dimensional
setting, JiT successfully reconstructs the retrained boundary, whereas naively training to mislabel the forget
sample completely destroys the trained model.
Definition 4.2. Let X be a random variable corresponding to a sample x belonging to class C. Define Br (x)
as the neighbourhood of x ∈ X and let Y be a random variable corresponding to a sample Br (x) belonging
to class C. Finally, let H(X | Y ) denote the conditional entropy of X given Y . Then we say a sample x
has low information if H(X | Y ) ≈ 0 , meaning that x can be well inferred from its neighbourhood Br (x).
Conversely, we say a sample is high information if H(X | Y ) ≫ 0.
Plainly, a data point is low information if it can be inferred from its
neighbourhood, and high information if it cannot. Consider a low information training sample xl. From
definitions 4.1 and 4.2, we can expect that for some bound r, fθ(xl) ≈ fθ(x̂l) ∀x̂l ∈ Br(xl). In other words,
a model’s predictions over a low information sample and a set of similar data should have similar output
distributions. As such, the curvature of the model in this space will be low. However, for a high information
gain sample, this would not necessarily hold. From this we can describe an unlearning objective; if the
classifier is smooth with respect to a forget sample, then the model’s prediction over this sample can be
viewed as being interpolated or inferred from other data. Hence, we present a method based on minimising
the gradient of the classifier with respect to the forget set. Since taking the gradient of the model with respect
to the input is extremely expensive for larger problems, we instead construct a first order approximation to
the gradient at the target via considering noisy perturbations within its neighbourhood. Formally, ∀x ∈ Df ,
we seek to minimise the loss given below:
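\[
\ell = \mathbb{E}\left[\frac{\lVert f_\theta(x) - f_\theta(x+\xi)\rVert_2}{\lVert x - (x+\xi)\rVert_2}\right] \approx \frac{1}{N}\sum_{j=1}^{N}\frac{\lVert f_\theta(x) - f_\theta(x+\xi_j)\rVert_2}{\lVert \xi_j\rVert_2}. \tag{1}
\]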
where ξ is a noise vector of equivalent dimensionality to x, and each component ξi of ξ is independently drawn
from a Gaussian distribution such that ξi ∼ N(0, σ²). For samples that are highly influential, minimising
this loss will smooth the local region and remove its influence from the model. For low-information samples
that are generalised knowledge, the neighbourhood will be rather smooth, resulting in minimal changes. A
full algorithm for JiT is given in appendix 10.1.
Figure 3: Change in sigmoid after unlearning with JiT. Red dots are unlearnt samples, black dots are
the location on the new sigmoid post-JiT.
Figure 4: Entropy, H(x), of the Df output distributions for full-class unlearning on CIFAR-10, showing
JiT exhibits performance similar to the retrained model. Axis: log(H(x)), from lower to higher uncertainty.
Series: Baseline, Retrained, JiT Unlearning.
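As an illustration only, the Monte Carlo estimate in equation 1 can be sketched in NumPy for a single forget sample. The function name `jit_loss`, its parameters `sigma` and `n_perturb`, and the toy linear-softmax classifier are assumptions made for this sketch and are not taken from the released code; the sketch only evaluates the loss, whereas unlearning would minimise it with respect to θ using an automatic differentiation framework.

```python
import numpy as np

def jit_loss(f, x, sigma=0.1, n_perturb=32, rng=None):
    """Monte Carlo estimate of equation 1 for a single forget sample x.

    f maps a 1-D input array to a 1-D output array (e.g. softmax scores).
    """
    rng = np.random.default_rng() if rng is None else rng
    fx = f(x)
    ratios = []
    for _ in range(n_perturb):
        xi = rng.normal(0.0, sigma, size=x.shape)  # each component ~ N(0, sigma^2)
        num = np.linalg.norm(f(x + xi) - fx)       # ||f(x) - f(x + xi)||_2
        den = np.linalg.norm(xi)                   # ||x - (x + xi)||_2 = ||xi||_2
        ratios.append(num / den)
    return float(np.mean(ratios))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical toy classifier: a fixed linear layer followed by softmax.
# Its decision boundary is the line x2 = 2 * x1.
W = np.array([[2.0, -1.0], [-2.0, 1.0]])

def toy_classifier(x):
    return softmax(W @ x)

rng = np.random.default_rng(0)
near_boundary = jit_loss(toy_classifier, np.array([0.0, 0.0]), rng=rng)
far_from_boundary = jit_loss(toy_classifier, np.array([5.0, -5.0]), rng=rng)
```

Under these assumptions the estimate is large for the point on the decision boundary and close to zero for the point deep inside a class region, which matches the distinction drawn above between highly influential and low-information samples.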
We now present JiT from a geometric perspective, providing insight into why it causes forgetting and how
it protects the wider function. Consider a simple 2D classification task, as visualised in figure 2. We pose
a simple question: does it make sense to forget all regions of space equally? Naturally, the answer is no.
Completely forgetting a sample from within a low-curvature region of space would necessitate the forgetting
of almost the entire class, even if they are not part of the forget set. Furthermore, removing a sample
from this region would not have significant ramifications on a model retrained from scratch, nor is it likely
infringing on the privacy of an individual. Unlearning in this instance often requires minimal alterations to
the model. In contrast, a sample that lies in a high-curvature region may not only have significant influence
over the position of the learned boundary, but may have been misclassified had it not been included in the
training data. Figure 2 shows that this intuition holds in low dimensions; when the forget sample is within
the low-curvature region, a model retrained on Dr exhibits almost no change, whereas when the sample in
Df is in a high-curvature region, the boundary is shifted considerably. Enshrining such behaviour into a ZS
unlearning algorithm is paramount: as protection through fine-tuning or regularization is not possible, a ZS
algorithm must be surgical in its forgetting methodology. Figure 2 shows that unlearning using JiT yields a
classifier almost identical to the retrained model in this low-dimensional setting whereas greedily training
over Df with a false label causes complete destruction of the model.
The heuristic behind JiT’s performance is based on the gradient field of the classifier. The crux of this rests
upon the inherent non-linearity of neural networks. By definition, the model will experience a large rate of
change at the decision boundary. As such, given two unit noise vectors ξi, ξj, where ξi points towards the
decision boundary and ξj points away, the gradient of the classifier between x and x + ξi will be larger than
that between x and x + ξj. Consequently, minimising equation 1 for samples near a boundary will be biased
towards moving the boundary towards x. This has the consequence of increasing the uncertainty of the prediction
and potentially changing the sample’s predicted class. To further highlight this phenomenon, figure 3 shows how
unlearning forget samples (red dots) from a learned sigmoid function (red line) changes the learned function. Two
things should be observed here: first, samples that lie in low-curvature regions undergo relatively small changes; and
second, the updates to the function have the effect of pulling the forget sample towards the centre of the
sigmoid, which is the decision boundary. Unlearning in this way increases model uncertainty over forgotten
samples, without destroying the wider function.
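The directional argument above can be checked numerically on a one-dimensional sigmoid, whose decision boundary sits at 0. This is a small sketch under assumed values: the sample location x = 2 and the unit step size are chosen purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 2.0      # forget sample on the positive side of the boundary at 0
step = 1.0   # unit-length perturbation

# Finite-difference rate of change, as in the summand of equation 1.
towards_boundary = abs(sigmoid(x - step) - sigmoid(x)) / step  # roughly 0.15
away_from_boundary = abs(sigmoid(x + step) - sigmoid(x)) / step  # roughly 0.07
```

The perturbation pointing towards the boundary contributes about twice as much to the loss as the one pointing away, so reducing the loss is dominated by the boundary-facing direction.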
4.2 Entropy Similarity
In the previous section, we demonstrated how in low dimensions JiT can induce forgetting of a single sample
in a way similar to retraining the model from scratch. Now, we demonstrate that the same loss can be used to
forget arbitrary subsets Df in higher dimensions, including full classes. Since visualising decision boundaries in
high dimensions is challenging, we instead evaluate the entropy of the model output. We train a 2-layer CNN
on the CIFAR-10 dataset, focusing specifically on the task of forgetting class 0. We compare the entropy of
the unlearned model over class 0 with that of the original model, and a model retrained from scratch without
class 0. Intuitively, low-entropy predictions indicate higher model confidence, and therefore we expect that
the entropy of the model after JiT unlearning is applied will be higher, aligning closely with that of the
retrained model.
Figure 4 shows the entropy of the forget-set output distributions for a CNN trained on CIFAR-10. Our
unlearning approach increases the entropy over the forget set, reducing the divergence between it and that
of a model retrained from scratch on Dr . In fact, under a Wilcoxon signed-rank test (Woolson, 2007), we
find there is no statistically significant difference between the model unlearned with JiT and the
retrained model for p = 0.10. JiT and retraining both increase the entropy over the forget set, suggesting the
resultant models behave in a similar way, possessing less knowledge of the forget samples compared to the
baseline model. Alongside matching the entropy over the forget set, JiT preserves model performance, as the
unlearned model drops only 2% accuracy (from 99% to 97%) on Dr. Our algorithm demonstrates promising
characteristics that are indicative of an effective unlearning algorithm.
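For reference, the entropy compared in this section can be computed from a model's softmax outputs as sketched below. The function name `prediction_entropy` and the two example probability vectors are assumptions for illustration, not values from the experiment.

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy H = -sum_c p_c log p_c for each row of class probabilities."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return -np.sum(probs * np.log(probs), axis=-1)

# Hypothetical softmax outputs over three classes.
confident = np.array([[0.98, 0.01, 0.01]])   # low entropy: the model is sure
uncertain = np.array([[1 / 3, 1 / 3, 1 / 3]])  # maximal entropy: log(3)

h_confident = prediction_entropy(confident)[0]
h_uncertain = prediction_entropy(uncertain)[0]
```

Per-sample entropies from two models over the same forget set form paired observations, so a paired test such as `scipy.stats.wilcoxon` could then be applied to them.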
5 Experimental Setup
5.1 Benchmarks
We implement the same benchmarks as Foster et al. (2023), which are similar to those of Chundawat et al.
(2023a), Golatkar et al. (2020) and Kurmanji et al. (2023). We run experiments 10 times, reporting the
mean and standard deviation of these performances. Where classes or sub-classes are forgotten, we show
performance over the same class/sub-class as in Foster et al. (2023); performance on additional classes can be
found in the appendix (10.3). We perform a hyper-parameter search across a single forget class/sub-class,
then use these parameters for all classes. This is more realistic, as it cannot be known a priori what future
forget sets may be presented to the method. The reported Dr accuracy refers to accuracy over a test set of
samples from the classes in Dr .
Unlearning scenarios: Typically the three unlearning scenarios are: i) Full-class forgetting, where a full
class from the dataset must be unlearned, ii) Sub-class forgetting, where a related subset from a class (e.g.
all rockets from class vehicle) is forgotten, and iii) Random forgetting, where a subset is sampled uniformly
from the entire training distribution. We evaluate our method in all three scenarios.
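The three scenarios differ only in how the forget set Df is selected from D, which can be sketched as a boolean mask over the training set. The function `forget_mask`, its parameters, and the toy labels below are hypothetical and serve only to make the selection rules concrete.

```python
import numpy as np

def forget_mask(labels, scenario, *, target=None, sub_labels=None, n_random=0, rng=None):
    """Return a boolean mask that is True for samples in the forget set Df."""
    labels = np.asarray(labels)
    if scenario == "full_class":   # forget every sample of one class
        return labels == target
    if scenario == "sub_class":    # forget one sub-class inside a coarser class
        return np.asarray(sub_labels) == target
    if scenario == "random":       # forget a uniformly sampled subset
        rng = np.random.default_rng() if rng is None else rng
        mask = np.zeros(labels.shape[0], dtype=bool)
        mask[rng.choice(labels.shape[0], size=n_random, replace=False)] = True
        return mask
    raise ValueError(f"unknown scenario: {scenario}")

# Toy dataset: coarse labels and hypothetical fine-grained sub-labels.
labels = np.array([0, 0, 1, 1, 2, 2])
sub_labels = np.array(["rocket", "bus", "cat", "dog", "oak", "elm"])

full = forget_mask(labels, "full_class", target=0)
sub = forget_mask(labels, "sub_class", sub_labels=sub_labels, target="rocket")
rand = forget_mask(labels, "random", n_random=2, rng=np.random.default_rng(0))
retain = ~full  # Dr = D \ Df
```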
Comparison methods: We compare JiT against the following methods: i) Baseline (BSLN), the trained model that has
not been unlearnt, ii) Retrain (RTRN), a model trained on only the retain data, iii) Finetune (FNTN), where the
model is fine-tuned on Dr for 5 epochs, iv) Selective Synaptic Dampening (SSD) (Foster et al., 2023), v) GKT
(Chundawat et al., 2023b), vi) EMMN (Chundawat et al., 2023b), vii) SCRUB (Kurmanji et al., 2023), viii)
Bad Teacher (BT) (Chundawat et al., 2023a), ix) Amnesiac (AMN) (Graves et al., 2021), x) UNSIR (Tarun
et al., 2023), and xi) Boundary Shrinking (BDSH) (Chen et al., 2023). Since GKT, EMMN, and UNSIR are
theoretically limited to forgetting a full class, they cannot be evaluated in the sub-class or random scenarios.
Finally, we note that due to VRAM constraints, we could not benchmark SCRUB on ViT.
Datasets: As with previous work, we benchmark JiT on a range of image classification benchmarks. We
make use of the CIFAR suite (Krizhevsky and Hinton, 2010), and the Pins Facial Recognition dataset (Burak,
2020), which consists of 17,534 images of 105 celebrity faces.
Models: We evaluate methods on Vision Transformer (ViT) (Dosovitskiy et al., 2021) and VGG11
(Simonyan and Zisserman, 2014), trained on an NVidia RTX 4090 using Stochastic Gradient Descent with an
initial learning rate of 0.1, and the OneCycle learning rate scheduler (Smith and Topin, 2019). Additionally,
we compare the performance of JiT to BDSH on a ViT-L (∼300M parameters) trained on the ILSVRC
ImageNet dataset to demonstrate our method can scale to larger problem spaces.
Evaluation metrics: We evaluate model performance according to four key metrics: i) Dr accuracy, ii)
Df accuracy, iii) MIA score, and iv) method runtime. For all metrics bar runtime, the objective is not to
minimise/maximise them, but rather to be as close to the retrained model as possible. This is important, as
performing worse than the retrained model implies insufficient performance, but as noted in Chundawat et al.
(2023a), Foster et al. (2023) and Kurmanji et al. (2023), significant deviation from the retrained model [...]
[...] forget the necessary data. Chundawat et al. (2023b) note that EMMN performance is “sub-optimal [...]”
GKT fails to protect the model performance entirely. GKT’s poor performance makes intuitive sense, as
it requires a model to be retrained from scratch via distillation using only learned noise matrices (no
ground truth data), which does not scale well with model size. Furthermore, scaling the number
of classes is also a barrier. A single generator model is used in GKT [...] matrices for all classes. The original
pap[...] model achieves just ∼55% accuracy. [...] sufficiently representative of the large [...] there is little
incentive in the generator learning a single challenging sample [...] significant barrier, since the value of [...]
train, or large datasets that are expensive [...] well, they do not extend to sub-class or [...] performs
reasonably; dropping only 2% [...] while outperforming both UNSIR and [...] MIA is higher than desired,
the retain [...] aggressive unlearning could have been [...] unlearning, JiT generalizes well, [...]
METHOD   Dr ACC.      Df ACC.      MIA          ZS
BSLN     66.3±0.0     77.0±0.0     97.4±0.0     ✗
RTRN     63.2±0.5     0.0±0.0      10.4±1.1     ✗
FNTN     59.7±0.4     3.9±3.0      13.2±4.2     ✗
AMN      64.3±0.4     0.0±0.0      1.8±0.8      ✗
SCRUB    66.2±0.1     0.0±0.0      8.2±1.7      ✗
SSD      63.79±0.0    0.0±0.0      8.6±0.0      ✗
BT       65.5±0.2     0.1±0.3      0.0±0.1      ✗
UNSIR    64.6±0.4     42.9±14.3    40.7±12.1    ✗
GKT      2.3±0.2      0.0±0.0      56.2±20.0    ✓
EMMN     26.9±7.7     24.3±23.7    58.2±14.5    ✓
BDSH     66.2±0.1     13.0±0.0     2.9±0.1      ✓
OURS     66.2±0.3     14.2±0.6     2.9±0.3      ✓
Table 4: VGG Full-class unlearning performance on PinsFaceRecognition class 1.
METHOD   Dr ACC.      Df ACC.      MIA          ZS
BSLN     94.0±0.0     93.9±0.0     13.82±0.0    ✗
RTRN     100.0±0.0    0.0±0.0      2.6±0.8      ✗
FNTN     97.6±0.7     36.9±9.9     4.3±2.7      ✗
AMN      99.7±0.1     0.0±0.0      1.4±1.33     ✗
SCRUB    98.8±0.0     97.1±0.0     8.8±0.76     ✗
SSD      55.8±0.0     0.0±0.0      4.0±0.0      ✗
BT       93.7±0.3     0.0±0.0      0.0±0.0      ✗
UNSIR    99.5±0.1     74.4±9.2     13.6±8.9     ✗
GKT      2.0±0.6      0.0±0.0      23.9±30.3    ✓
EMMN     51.0±13.5    69.3±25.7    26.9±17.8    ✓
BDSH     93.6±0.4     79.4±0.0     42.4±0.4     ✓
OURS     91.4±0.1     1.9±0.2      4.7±0.5      ✓
Sub-class Unlearning. Tables 5 and 6 show the performance of [...] JiT is able to comfortably unlearn. For
Vi[...] problem to correct, since more conserv[...] drops more than usual, by 4%, which [...] however, the
method is comfortably [...] access to the retained data. [...]
Table 5: ViT Sub-class unlearn[...]
Table 6: VGG16 Sub-class unlearning performance on CIFAR-20 sub-class Rocket.
METHOD   Dr ACC.      Df ACC.      MIA          ZS
BSLN     75.3±0.0     79.0±0.0     83.1±0.0     ✗
RTRN     72.9±0.2     11.5±2.8     14.1±1.3     ✗
FNTN     65.5±0.7     6.2±3.7      22.3±5.5     ✗
Tables 7 and 8 show method performance when forgetting samples uniformly distributed across
the training set. As with sub-class, JiT is able to comfortably rival existing non-ZS SOTA methods,
and despite over-forgetting slightly, the accuracy is almost unchanged (< 1% for both models).
Compute Comparison. Figure 5 shows the runtime of JiT compared to other methods. JiT adds very little
overhead, especially in comparison to the other zero-shot methods. Sufficiently short runtimes are an important
desideratum of unlearning, one which JiT empirically satisfies. JiT has a computational complexity
of O(N |Df|), where N is the number of perturbed samples and |Df| is the cardinality of the forget
set. Requiring only Df and processing each sample just once makes JiT efficient.
Figure 5: Median method runtime for ViT full-class forgetting on class rocket in seconds. For visual clarity
we exclude GKT (∼3000 seconds).
7 Discussion
JiT achieves state-of-the-art performance for the zero-shot unlearning problem. The entropy experiments
highlight that it is able to effectively imitate the output entropy of a retrained model over a
forget set, while preserving retain set performance. When compared to existing zero-shot methods,
JiT has the following advantages: JiT achieves considerably better performance across all full-class
benchmarks, is applicable to random and sub-class problems, and has a significantly faster runtime.
In other words, JiT dominates previous zero-shot methods across almost every metric.
Naturally, JiT is not quite as performant as non-ZS methods; however, it remains competitive. This is
particularly impressive given the massive advantage fine-tuning offers. We draw attention to JiT’s
ability to preserve model performance, which, given that no repair steps or quadratic penalty are used, is
noteworthy.
Finally, we note that models behave very differently to the additive Gaussian noise; some are far more
robust than others. This leads to the need for different hyper-parameters for different models. We
chose additive noise as our perturbat[...] modalities. Changing the perturbatio[...] random crops with color
jitter) would [...]
8 Limitations
As discussed in Gulrajani et al. [...] layers, since the loss is calculated [...] This limitation can be mitigated
by [...] from a model, or by replacing them [...] The appendix provides a sensitivity [...] hyper-parameter
selection. [...] protection of model performance, [...] JiT, like most SOTA methods, is not [...] but may
preclude its use in [...] Lipschitz continuity and our method, [...]
9 Conclusion
modelto ⇥ ([Link]
create representative
RTRN 94.6nois
85.7 ± 30
over-forgetting) may leak information304 about
matricesthe forfact a sample
all classes. The has beenpaper
original forgotten.
only To remain
studied 10-class consistent
problems,
Unlearning⇥is an AMN with
and FNTN 85.7
still a smallchalleng
important, ResNet ±3
AMN 73.8±0.2 2.4±2.4 3623.0±0.9 93.5 ± 0
existing unlearning literature we use model
305 the sameachieves
logistic just
regression
⇠
SCRUB 55% accuracy.
MIA Using
evaluation a single,
as
62.4±28.4 10.1±22.48 16.7±21.7 small,
Chundawat
362
363 generator et
Unlearning
existing [Link] create
(2023a)
⇥ is
methods an noise samples
important,
are
AMN
SSD restricted that
challeng
95.1 in
93.5 ± ar0
the
and Foster et al. (2023). 306 sufficiently representative
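The evaluation criterion above (closeness to the retrained model rather than a minimum or maximum) can be illustrated with a small sketch. The helper name `gap_to_retrained` and the dictionary keys are hypothetical and not part of the paper's code. The numbers are the mean values for RTRN, BDSH and JiT from the ViT sub-class table in this document, used purely as an illustration, and the absolute-gap measure is an assumption rather than a metric the authors define.

```python
# Illustrative sketch only: scores a method by its absolute gap to the retrained model.
# Helper name, keys, and the gap measure itself are assumptions for illustration.

def gap_to_retrained(method_scores, retrained_scores):
    """Return the absolute per-metric distance between a method and the retrained model."""
    return {
        name: abs(method_scores[name] - retrained_scores[name])
        for name in retrained_scores
    }

# Mean values (in %) taken from the ViT sub-class table: RTRN, OURS (JiT), BDSH.
retrained = {"Dr_acc": 94.6, "Df_acc": 22.3, "MIA": 3.4}
jit = {"Dr_acc": 92.2, "Df_acc": 0.0, "MIA": 14.66}
bdsh = {"Dr_acc": 95.7, "Df_acc": 48.4, "MIA": 0.1}

jit_gap = gap_to_retrained(jit, retrained)
bdsh_gap = gap_to_retrained(bdsh, retrained)

for label, gap in (("JiT", jit_gap), ("BDSH", bdsh_gap)):
    print(label, {k: round(v, 2) for k, v in gap.items()})
```

Under this reading, a forget-set accuracy far below the retrained model (over-forgetting) is penalised in the same way as one far above it (under-forgetting).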
JiT hyper-parameters: We conduct a hyper-parameter search for η and σ using 250 runs of the TPE
search from Optuna (Akiba et al., 2019), for each unlearning scenario. For VGG11, we use the following
parameters: full-class unlearning uses η = 0.0003, σ = 0.5, sub-class and random both use η = 0.0003, σ = 0.01.
For ViT, the selected parameters are: full-class η = 1.5, σ = 0.8, sub-class η = 0.5, σ = 1.5, and random
η = 0.01, σ = 0.5. ViT and VGG use considerably different learning rates, since only a single epoch is used
during the unlearning step. If minimising the runtime is a looser constraint, a smaller learning rate can also be
used for ViT with extra epochs of training. We stress that when selecting hyper-parameters, we selected
values that yielded promising results, without rigorously fitting our results to the retrained model.
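For convenience, the selected hyper-parameters reported above can be collected in a small lookup table. This is a sketch only: the dictionary layout, the key names `eta` and `sigma`, and the helper `jit_hyperparams` are hypothetical and do not come from the paper's released code. Only the numeric values are taken from the text.

```python
# Selected JiT hyper-parameters as reported in the text (eta: learning rate, sigma: noise scale).
# The data structure and function name are illustrative assumptions.
JIT_HYPERPARAMS = {
    ("VGG11", "full-class"): {"eta": 0.0003, "sigma": 0.5},
    ("VGG11", "sub-class"): {"eta": 0.0003, "sigma": 0.01},
    ("VGG11", "random"): {"eta": 0.0003, "sigma": 0.01},
    ("ViT", "full-class"): {"eta": 1.5, "sigma": 0.8},
    ("ViT", "sub-class"): {"eta": 0.5, "sigma": 1.5},
    ("ViT", "random"): {"eta": 0.01, "sigma": 0.5},
}


def jit_hyperparams(model, scenario):
    """Look up the reported (eta, sigma) pair for a model and unlearning scenario."""
    try:
        params = JIT_HYPERPARAMS[(model, scenario)]
    except KeyError:
        raise KeyError(f"No reported hyper-parameters for {model!r} / {scenario!r}") from None
    return params["eta"], params["sigma"]


vit_full = jit_hyperparams("ViT", "full-class")
vgg_random = jit_hyperparams("VGG11", "random")
```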
6 Results
6.1 Benchmark Evaluation
Compute Comparison. Figure 5 shows the runtime of JiT compared to other methods. JiT is very fast,
especially in comparison to the other ZS methods, performing more than 5 times faster than Boundary
Shrinking. Sufficiently short runtimes are an important desideratum of unlearning, one which JiT empirically
satisfies. JiT has a computational complexity of O(N|Df|), where N is the number of perturbed samples and
|Df| is the cardinality of the forget set. Requiring only Df and processing each sample just once makes JiT
efficient.
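The O(N|Df|) cost can be made concrete with a short counting sketch. This is not the authors' implementation. It assumes a finite-difference sensitivity estimate, ||f(x) − f(x + ξ)|| / ||ξ|| averaged over N Gaussian perturbations, as a stand-in for the neighbourhood-gradient objective described in the abstract. The toy linear model and the names `local_sensitivity`, `sweep_forget_set` and `CountingModel` are all hypothetical.

```python
import numpy as np


def local_sensitivity(f, x, n_perturb, sigma, rng):
    """Average finite-difference sensitivity of f around x over n_perturb noisy copies."""
    fx = f(x)  # one evaluation on the clean forget sample
    ratios = []
    for _ in range(n_perturb):
        xi = rng.normal(0.0, sigma, size=x.shape)  # additive Gaussian perturbation
        ratios.append(np.linalg.norm(fx - f(x + xi)) / np.linalg.norm(xi))
    return float(np.mean(ratios))


def sweep_forget_set(f, forget_set, n_perturb, sigma, seed=0):
    """Visit every forget sample exactly once: |Df| * (N + 1) model evaluations in total."""
    rng = np.random.default_rng(seed)
    return [local_sensitivity(f, x, n_perturb, sigma, rng) for x in forget_set]


class CountingModel:
    """Toy linear model that counts how often it is evaluated (illustration only)."""

    def __init__(self, weight):
        self.weight = weight
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.weight @ x


init_rng = np.random.default_rng(1)
model = CountingModel(init_rng.normal(size=(3, 5)))
forget_set = [init_rng.normal(size=5) for _ in range(4)]  # |Df| = 4
scores = sweep_forget_set(model, forget_set, n_perturb=10, sigma=0.5)  # N = 10
print("model evaluations:", model.calls)
```

The evaluation count grows linearly in both N and |Df|, and no retain-set data is touched, which matches the efficiency argument in the paragraph above.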
Table 5: ViT Sub-class unlearning performance on CIFAR-20 sub-class Rocket.

METHOD   Dr ACC.       Df ACC.        MIA            ZS
BSLN     95.7 ± 0.0    94.5 ± 0.0     80.4 ± 0.0     ✗
RTRN     94.6 ± 0.1    22.3 ± 8.3     3.4 ± 1.1      ✗
FNTN     85.7 ± 3.1    6.2 ± 6.0      16.0 ± 2.7     ✗
AMN      93.5 ± 0.2    0.8 ± 1.7      0.8 ± 0.3      ✗
SSD      95.1 ± 0.0    5.12 ± 0.0     5.4 ± 0.0      ✗
BT       93.6 ± 0.3    3.3 ± 2.9      0.0 ± 0.1      ✗
UNSIR    93.3 ± 0.4    74.9 ± 10.1    27.3 ± 13.8    ✗
BDSH     95.7 ± 0.0    48.4 ± 0.0     0.1 ± 0.0      ✓
OURS     92.2 ± 0.0    0.0 ± 0.0      14.66 ± 8.8    ✓

Full-class Unlearning. We begin by comparing full class performance to that of the existing ZS methods.
As seen in Tables 2 (a), 2 (b), and 1, JiT demonstrates significantly superior performance over GKT and
EMMN, and is competitive with Boundary Shrinking. The authors of Chundawat et al. (2023b) note the
poor scalability of both EMMN and GKT, which is evident in our results. Failing to scale to large problems
or models is a significant barrier, since the value of unlearning is found mostly in large models that are
expensive to train, or large datasets that are expensive to store. JiT performance is competitive with
Boundary Shrinking despite having a fraction of the compute cost and, even compared to non-ZS SOTA,
JiT performs reasonably; dropping only 0.6% retain set performance compared to the baseline on ViT and
outperforming both UNSIR and SSD on the MIA. Our performance also holds for VGG and, when using the
same hyper-parameters for the face dataset, JiT generalizes well, outperforming SCRUB, SSD, and UNSIR.
Sub-class Unlearning. Tables 3 (a) and 3 (b) show the performance of JiT on sub-class unlearning for ViT
and VGG11. For both, JiT is able to comfortably unlearn. For ViT, it actually over-forgets; however, this is
typically an easier problem to correct, since more conservative hyper-parameter values can always be selected. Dr performance
also drops slightly more than usual, which a more conservative parameter set could also correct. For VGG11,
however, the method is comfortably amongst the SOTA, outperforming methods that are granted access to
the retained data. For ViT, JiT better minimises Df compared to BDSH.
Random Unlearning. Tables 4 (a) and 4 (b) show method performance when forgetting 100 samples
uniformly distributed across the training set. JiT is able to comfortably rival existing ZS and non-ZS SOTA
methods; despite slight over-forgetting, Dr accuracy is almost unchanged (∼ 1% for both models).

Figure 3: Entropy, H(x), of the Df output distributions for full-class unlearning on CIFAR-10, showing
JiT exhibits performance similar to the retrained model. For p = 0.1, there is no statistically significant
difference between JiT and the retrained model. Note that we plot log(H(x)) ∀x ∈ Df to aid visualisation,
thus a larger value on the horizontal axis indicates a smaller entropy.
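The quantity plotted in Figure 3 can be sketched as follows. This is a minimal illustration, assuming H(x) is the Shannon entropy of the softmax output for each forget sample. The function names and toy logits are hypothetical, and the significance test mentioned in the caption is not reproduced here.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def output_entropy(logits, eps=1e-12):
    """Shannon entropy H(x) of the predicted class distribution, one value per row."""
    probs = softmax(logits)
    return -(probs * np.log(probs + eps)).sum(axis=-1)


# Two toy forget samples over 4 classes: a uniform prediction and a confident one.
logits = np.array([
    [0.0, 0.0, 0.0, 0.0],   # maximally uncertain output
    [10.0, 0.0, 0.0, 0.0],  # confident output
])
h = output_entropy(logits)
log_h = np.log(h)  # log-entropy, the transformation applied before plotting
```

A model that has forgotten a class would be expected to produce entropies on Df that are distributed like those of the retrained model, which is the comparison the figure makes.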
ImageNet Evaluation. Finally, Table 5 validates our method on a larger scale problem, with JiT achieving
SOTA performance for ZS methods. As larger pretrained models can be robust to noise, we found larger
perturbations were required to induce forgetting. To keep the input in-domain, we apply normalization to
the noised image via: $\frac{x + \xi}{\sqrt{1 + \sigma^2}}$.
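The normalisation step can be checked numerically. The sketch below assumes that ξ is zero-mean Gaussian noise with standard deviation σ and that the input x has roughly unit variance. Under those assumptions, dividing by √(1 + σ²) returns the noised input to unit variance. The function name `perturb_and_normalise` is hypothetical.

```python
import numpy as np


def perturb_and_normalise(x, sigma, rng):
    """Add Gaussian noise xi ~ N(0, sigma^2) and rescale by sqrt(1 + sigma^2)."""
    xi = rng.normal(0.0, sigma, size=x.shape)
    return (x + xi) / np.sqrt(1.0 + sigma ** 2)


rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200_000)  # stand-in for a standardised input
noised = perturb_and_normalise(x, sigma=1.5, rng=rng)
print("input std:", float(x.std()), "noised std:", float(noised.std()))
```

Without the rescaling, the variance of x + ξ would be 1 + σ², which for the larger perturbations used here would push inputs well outside the range the pretrained model was trained on.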
7 Discussion
JiT is by far the fastest ZS unlearning method benchmarked, a critical characteristic for satisfying the
unlearning task. JiT is competitive with state-of-the-art performance in the ZS unlearning domain, as well
as competing with non-ZS methods in the sub-class and random unlearning tasks despite their easier task.
The entropy experiments highlight that JiT is able to replicate the output entropy of a retrained model over
a forget set, while preserving retain set performance. When compared to existing ZS methods, JiT can be
considered a strong baseline. It is fast and performant, and performed acceptably across all benchmarks
implemented. If time constraints are ignored, BDSH is more stable and less sensitive to hyper-parameter
selection, on account of taking the true gradients of the model with respect to the input, rather than the
approximation we employ with JiT. However, in practice the poor time complexity of BDSH will likely
prove prohibitive when trying to unlearn from internet-scale models, whereas JiT is amongst the fastest
methods we benchmarked. Future work could explore the efficacy of using JiT with exact gradients, or a
more specialised gradient approximation. JiT has the potential for positive societal impacts, aiding the
preservation of individual privacy. However, due to a lack of certification, poor use of JiT could result in
organizations believing they have removed the influence of an individual's data when they haven't.

Figure 4: Plot of method runtime for VGG16 full-class forgetting on class xxxxxx in seconds.
desideratum
methods.
current As of
zero-shot unlearning,
seen in Tablesone
unlearning 3,which
2,methods.
and 4,JiTJiT
Theempirically
authors satisfies.
demonstrates JiT hasetasuperior
significantly
of Chundawat computational
al. [2023b] complexity
performance
GKTnote over
1.0±0.6
the poor 0.0±0.0 60.0±51.6 X
of O(N zero-shot
current where
|Dfof|),both
scalability N is the
unlearning
EMMN number
andmethods.
GKT, whichof The
perturbed samples
isauthors
evident ourand
ofinChundawat |Df | is
results. et
Thethe cardinality
[Link]
[2023b]EMMN of to
note
is able the
theforget
84.6±0.4
poor
protect 94.3±1.5 93.7±2.2 X
set. Requiring
scalability
model onlyEMMN
of both
performance, Dbut and
f is
Table processing
2:and
unable GKT,
(a) toViT each
dowhich
so sample
is evident
while
Full-class justinthe
reducing once
our
unlearning makes
results.
MIA, JiT
The
resulting
performanceefficient.
former onBDSH
isCIFAR-100
in a complete ableinabiltiy87.6±0.0
to protectto
class Rocket. 0.0±0.0
(b) VGG115.0±0.0 X
Full-class
model performance,unlearning
but is unableperformance
to do so whileon reducing the MIA, resulting in a O URS inabiltiy
complete 87.5±0.0
to 51.9±2.13
(a) 4.3±0.38 X
CIFAR-100 class Rocket.
7 Discussion Table class
Table 2: ViT Full-class unlearning performance on CIFAR-100 3: VGG16 Full-class unlearning performance on CIFAR-100 class Rocket.
Rocket.
Table 7: 2:
Table ViT Random
ViT unlearning
Full-class (a) for 100
performance
unlearning performance Tablefrom
samples
on CIFAR-100 4:class
VGG
CIFAR-10
Rocket. (b)
Full-class unlearning performance on PinsFaceRecognition class 1.
JiT achieves state-of-the-art performance for the zero-shot unlearning problem. The entropy experi-
ments highlight thatMM it ETHOD
is able to DDr ACC . imitate
effectively D f A
(b) .
CCoutput
the MIA of aZSretrained
entropy M ETHOD
model D Dr A
over a CC . DD f ACC . MIA ZS
METHOD
ETHOD
forget set, while preserving Dr AACC .
CC .
DDf AACC .
CC .
MIA ZS M
ZS
MIA to existing ETHOD r ACC . f ACC . MIA ZS
BSLN retain88.9setr performance.
± 0.0 f When
94.7 ± 0.0 compared
94.4 ± 0.0 ⇥ zero-shot
BSLN methods, 66.3±0.0 77.0±0.0 97.4±0.0 ⇥
JiT has the followingBSLN
advantages:
RTRN
BSLN 98.9
90.1
88.9 ±0.0
0.0 100.0
JiT±achieves ±0.0
0.0 ± 0.0 90.8
considerably
94.7 ± 0.0 3.2 ±
better
94.4 ± 0.0 ⇥⇥ BSLN
3.5
±performance
0.5 across 94.0±0.0
63.2±0.5 93.9±0.0
RTRNall full-class 0.0±0.0 13.82±0.0
10.4±1.1 ⇥⇥
RTRN
FNTNto random
benchmarks, is applicable
RTRN 98.6
90.1 ±0.1
80.8±and 1.4 98.8
0.6±±
sub-class
0.0 0.0 0.8
0.7 91.8
problems
0.0 and±
19.0
3.2 1.8
has
±± 8.7
0.5 ⇥⇥ RTRNFNTN
a significantly 100.0±0.0
59.7±0.4 0.0±0.0
faster runtime. 3.9±3.0 2.6±0.8
13.2±4.2 ⇥⇥
FNTN
FNTN 97.3 ± 0.3 97.2
80.8 ±zero-shot
1.4 0.6± 1.0 86.1
± 0.7 across ± 2.1
19.0 almost
± 8.7 ⇥ FNTN 97.6±0.7 36.9±9.9 4.3±2.7 ⇥
In other words, JiT dominates
AMN previous
87.9 ± 0.9 methods
0.0 ± 0.0 1.4 ± 0.9 every
⇥ metric.
⇥ AMN 64.3±0.4 0.0±0.0 1.8±0.8 ⇥
AMN 97.6 ± 0.3
SSD as performant
AMN 88.90
87.9 ±±as0.0
0.9 73.5 ± 5.1
0.0methods, 10.4 ± 4.9
1.8 ± 0.9
± 0.0 however
1.4 0.0 ⇥ ⇥
SCRUB This AMN66.2±0.1 99.7±0.1
0.0±0.0 8.2±1.70.0±0.0 1.4±1.33 ⇥⇥
Naturally, JiT is not SSD
quite 98.0 ±±1.6 non-ZS
98.1 ±±2.4 85.5 ± 0.1it remains
⇥⇥ competitive.
SCRUB is
98.8±0.0 97.1±0.0 8.8±0.76 ⇥⇥
particularly impressiveBT
SSDgiven the 87.5
88.90 ± 0.5
0.0 4.2
0.0
massive advantage 5.2
0.0 0.0
1.8
fine-tuning ± 0.1
0.0
offers. We SSD
draw attention to 63.79±0.0
JiT’s 0.0±0.0 8.6±0.0
BT
UNSIR
BT 97.6
88.5±±0.4
87.5 0.4 86.7
0.5 4.2±±±3.6
65.3 9.1 33.5
5.2 29.1± 6.1 ⇥⇥ SSD
±±5.6 BT 55.8±0.0
65.5±0.2 0.0±0.0
0.1±0.3 4.0±0.0
0.0±0.1 ⇥⇥
ability to preserve model performance, which given no repair0.0
steps0.1
or quadraticBT
penalty
UNSIR
is used, is
93.7±0.3
64.6±0.4 0.0±0.0 0.0±0.0
42.9±14.3 40.7±12.1 ⇥⇥
UNSIR
BDSH 88.5 ± 0.4 97.9±1.6
98.0±0.29 65.3 ± 9.1 78.8±0.0
29.1 ± 6.1 X⇥
noteworthy. GKT 1.0±0.6 0.0±0.0 60.0±51.6 X UNSIR 99.5±0.1 74.4±9.2 13.6±8.9 ⇥
OGKT
URS
EMMN 98.0±0.3
84.6±0.4
1.0±0.6 98.0±1.5
94.3±1.5
0.0±0.0 78.8±4.0
93.7±2.2
60.0±51.6 XX GKT 2.3±0.2 0.0±0.0 56.2±20.0 X
Finally we note that models
BDSH behave84.6±0.4
EMMN very differently
87.6±0.0 to the additive
0.0±0.0
94.3±1.5 Gaussian noise,
5.0±0.0
93.7±2.2 X GKT some
EMMN are far2.0±0.6
more
26.9±7.7 0.0±0.0
24.3±23.7 23.9±30.3
58.2±14.5 XX
robust than others; this leads to the
O URS
BDSH need for different
87.5±0.0
87.6±0.0 51.9±2.13
0.0±0.0 hyper-parameters
4.3±0.38
5.0±0.0 for EMMN
X different
BDSHmodels. 51.0±13.5
We
66.2±0.1 69.3±25.7
13.0±0.0 26.9±17.8
2.9±0.1 XX
X BDSH OURS most93.6±0.4
66.2±0.3 79.4±0.0
14.2±0.6 42.4±0.4
(a) 2.9±0.3 XX
hose additive noise asOour 87.5±0.0
URS perturbation method 51.9±2.13
as it is simple4.3±0.38
and has analogues across
O URS data
91.4±0.1 1.9±0.2 4.7±0.5 X
Table
modalities.
Table 6:3:VGG16
ChangingVGG16 Full-classunlearning
the perturbation
Sub-class unlearning
to somethingperformance
specific toonon
performance the CIFAR-100
modality
CIFAR-20 class
and
Table 7: more
ViT
sub-class Rocket.
effectiveunlearning
Random
Rocket. (e.g. performance for 100 samples from CIFAR-10
Table
andom crops 3: VGG16
with Full-class
color jitter) would unlearning
likely lead performance
to(a)
more stable, onand
CIFAR-100
more class
effective,
Table 7: ViT Rocket.
forgetting. (b)
Random unlearning performance for 100 samples from CIFAR-10
Table 7: ViT Random unlearning 8
performance for 100 samples from CIFAR-10
M ETHOD D ACC . Df ACC . MIA ZS 9
M ETHOD Drr ACC . Df A299CC . forget MIA ZS
the necessary M ETHOD Dr ACC . Df ACC . MIA ZS
M ETHOD Dr ACC . D(b)
f ACC . MIA ZS data. Chundawat et al. [2023b] notes that EMMN performance is “sub-optimal"
Limitations BSLN BSLN 66.3±0.0
75.3±0.0
77.0±0.0
300
79.0±0.0 GKT 97.4±0.0
83.1±0.0
⇥ the M
fails to protect
⇥ model
ETHOD
BSLN
Dr ACCentirely.
performance
98.9
.
± 0.0 100.0 f ACC . poor performance
DGKT’s MIA
± 0.0 90.8 ± 3.5
ZS
makes
⇥ intuitive sense;
RTRN
BSLN 63.2±0.5
66.3±0.0 0.0±0.0
77.0±0.0 it 10.4±1.1
97.4±0.0
requires a ⇥
model be M ETHODfromDscratch
retrained r ACC . via distillation
Df ACC . using only
MIA learned
ZSnoise matrices (no
RTRN 72.9±0.2 11.5±2.8
301 14.1±1.3 ⇥ RTRN
BSLN 98.6
98.9 ± 0.1 98.8 ±
0.0 100.0 ±0.8
0.0 91.8
90.8 ± 1.8
3.5 ⇥
benchmarks, is applicableFNTN to random and 97.6±0.7
sub-class 36.9±9.9
problems 315 350
316 and84.3±2.7
has Limitations
aggressive
unlearning, unlearning
a significantly ⇥
JiT generalizes could
faster have
runtime.
well, been applied. SCRUB,
outperforming When applying SSD, and RTRN
theUNSIR. 87.7±0.2 91.0±2.5
same hyper-parameters to face
In other words, JiT dominates AMNprevious99.7±0.1 zero-shot methods 0.0±0.0 316across
350 8almost
unlearning,
1.4±1.33 LimitationsJiT generalizes
every metric.
⇥ well, outperforming SCRUB, SSD, and FNTN UNSIR.84.4±0.8 86.4±4.4
317 350
Figure Sub-class
3:8 Entropy, Unlearning.
Limitations of the et
Naturally, JiT is not quite SSD
SCRUB
as performant
98.8±0.0
as non-ZS methods,
97.1±0.0 8.8±0.76
As
Sub-class
however discussed
itand Unlearning.
remains
⇥
H(x),
in Gulrajani
competitive.
Dal.
This f output
is[2017],
distributions
Lipschitz for full-class
regularization unlearning
86.8±0.3 onwith
AMN is incompatible CIFAR-10
51.3±4.4
batchno
55.8±0.0 0.0±0.0 317 351
318 Tables 5
4.0±0.0 6 show ⇥ the performance of JiT on sub-class unlearning for ViT87.7±0.1
and VGG16. For both
particularly impressive given the massive advantage showing
fine-tuning 351 TablesAs JiT
layers,
offers.5 exhibits
discussed
and since
We 6 performance
show
draw in
the Gulrajani
loss
the
attentionis
performancesimilar
et
calculated
to al.
JiT’s ofto
[2017],
with the
JiT retrained
Lipschitz
respect
on to
sub-class model.
regularization
each sample
unlearning For
SCRUBp =is 0.1,
independently,
for ViT there
incompatible
and is
not
[Link]
92.7±2.9
with batchno
entire
For both ba
BT 93.7±0.3 0.0±0.0 318
319 352 JiT is able
0.0±0.0 to comfortably
⇥ unlearn. For ViT, it actually over-forgets, SSD however this is
85.6±2.7 typically an
90.8±3.7easier
significant
351 This As difference
layers, discussed since
limitation between
theinpenalty
Gulrajani
loss JiT
beisunlearn. and
et
calculated al. the retrained
[2017],
with itrespect model.
Lipschitz Note
regularization
antoalternative
each that
sample we isplot
byincompatible
independently, log(H(x)) with8x
notbatchnorm
the batchn
2also
entire Dba
ability to preserve model UNSIR performance, which given74.4±9.2
99.5±0.1 no repair
319 352
320 353 JiT
steps
problem is able
or
13.6±8.9 to to
quadratic
correct, ⇥can
comfortably since mitigated
is used,
more For isby
conservative selecting
ViT, actually
values over-forgets,
can always model,
behowever
selected.
BT independently,
removing
this D
86.9±0.2is
r typically
performance an
82.5±4.9
lay
easier
noteworthy. to
320
321
aid
352 visualisation,
353
354 problem layers,
This
from a to
drops more than usual, since
limitation
model,
correct,thus
the
orcan a
loss
by
since larger
be is
replacing
more value
calculated
mitigated them on
by
conservative the
with
selecting
with horizontal
respect
another
values an to axis
each
alternative
normalisation
can
by 4%, which a more conservative parameter set could correct. alwaysindicates
sample model,
be a smaller
by
strategy,
selected. removing
such
D entropy.
r as not the entire
batchnorm
layer
performance norms.
For VGG16 ba
lay
also
GKT 2.0±0.6 0.0±0.0354 353 23.9±30.3
This
from alimitation
model, Xorcan by be mitigated
replacing them by selecting
with another an alternative
normalisation model,
BDSH by
strategy, removing
such
86.9±0.1 asthat batchnorm
layer norms.
92.2±3.4 la
Table 7: ViT Random unlearning performance for
321 100 dropssamples
however,
The more the
appendix from
than
method CIFAR-10
usual,
Xprovides is bycomfortably
4%, which a more
amongst conservative
the SOTA, parameter
outperforming set could
methods correct. For
are VGG16
granted
Finally we note that models behave very
EMMN differently 69.3±25.7
51.0±13.5 322
to the additive Gaussian
355 26.9±17.8
from athe noise,
model, orsomeby area farsensitivity
more analysis (10.1). It demonstrates that JiTascan benorms.
sensitiv
354
however,
accessThe to the
appendix method
retained
Xprovides is replacing
[Link]
a sensitivity
them with
is aamongst analysis
another
the(10.1).normalisation
SOTA, OURS strategy,
Itoutperforming
demonstrates to Dmethods
such
86.3±0.3 layer
that 88.7±3.9
are sensitiv
granted
rthat JiTthis
canprevents
be
322
robust than others;Table BDSH
this leads3: (a)to the
VGG-16 93.6±0.4
need forSub-class
different 79.4±0.0 323
hyper-parameters
355
356
unlearning hyper-parameter
42.4±0.4performance for selection.
different on models.
CIFAR-20ThisWe by-product
sub-class of having
Rocket. no
(b) access
ViT Sub-class , since exp
O URS 91.4±0.1 291
1.9±0.2 an
323 effective
355
356
access The tounlearning
protection
4.7±0.5 the retained
appendix
hyper-parameter of model
algorithm.
Xprovides data.
selection.
performance,
Next,
a sensitivity
This is a weanalysis
evaluate
by-product
and thus the of more
(10.1).
having
multi-objectiveItcontemporary
demonstrates
no access to
problem D models
that
, since
becomes
against
JiTthis
can prevents
more
existin
beprecariou
sensitiv
exp
Munlearning
ETHOD Dr ACC .
performance Df A292
onCIFAR-20CC . SOTA.
357
MIA
sub-class ZS
Rocket (a) r
hyper-parameter
356 protection
357 of model selection.
performance, This is and a by-product of having no access
thus the multi-objective to Drbecomes
problem , since this prevents
more exp
precariou
JiT, like most SOTA methods,unlearningis not 11
Table 6: VGG16 BSLNSub-class 98.9unlearning
± 0.0 100.0 performance
± 0.0 on 358
90.8
357 protection
CIFAR-20
± 3.5 Table of5:model
sub-class
⇥ ViTRocket. performance,
Sub-class andcertified.
thus the The
performance impact
multi-objective of this
on CIFAR-20 will
problem vary
becomes
sub-class byRocket.
application
more precario dom
RTRN 98.6 ± 0.1 (a) ± 0.8 358
98.8 359
91.8 JiT,
but
± 1.8 like
may most
Table 5:SOTA
preclude methods,
ViTitsSub-class
use is not certified.
in especially
unlearning sensitive (b)The
performanceareas. impactFinally,
on of this
while
CIFAR-20 will wevary have
sub-class bydrawn
application
Rocket. dom
links betw
293 6.2 358 Benchmark
JiT,may like preclude
⇥
mostEvaluation
SOTA methods, is not we certified. The impact of this will varyhaveby application dom
99 forget the necessary FNTNdata. Chundawat97.3 ± 0.3et al.97.2 [2023b]± 1.0 notes359 that but
86.1
360 Lipschitz
EMMN
± 2.1 performance continuity
⇥ itsanduse
is in especially
our method,
“sub-optimal". sensitive areas.
leave rigorous Finally,
connectionwhile forwefuture drawn
work. links betw
GKT fails to protect M ETHOD r ACC .
the modelDperformance D f ACC .GKT’s
entirely. 359 MIA
360 poorbutperformance
may preclude
Lipschitz continuity
ZS M
makesits use
and
ETHOD in especially
our
intuitive method,
D A
sense;CCwe . sensitive
leaveD A areas.
rigorous
CC . Finally,
connection
MIA while for we ZS havework.
future drawn links betw
00 (b) 294 Full-class Unlearning. We begin by comparing full class performance to that of the existing ZS
r f
it requires a model AMNbe retrained 97.6 from
± 0.3 73.5via 5.1
± distillation 10.4using ± 4.9onlycontinuity
Lipschitz ⇥ Mnoise and matrices
ETHOD our method,
Dr ACC .we leave Df rigorous
ACC . connection
MIA for future ZS work.
75.3±0.0scratch learned 95.74,±(no
360
01 BSLN 79.0±0.0
2952.4methods. 83.1±0.0 BSLN 0.0demonstrates
94.5 ± 0.0 significantly 80.4 ± 0.0 superior
ground truth data), SSD which 98.0
does ±
not 1.6scale 98.1
well ±
with model 85.5 9
size.±As 0.1 seen⇥
Conclusion
Furthermore, ⇥ in Tables
scaling
2, 3, and
the number
JiT ⇥ performance ove
02 RTRN 72.9±0.2 11.5±2.8
2963.6current
361 14.1±1.3zero-shot ⇥ BSLN
RTRN 95.7
94.6 ± 0.0
0.1
unlearning methods. The authors of Chundawat et al. [2023b] 94.5
22.3 ± 0.0
8.3 80.4
3.4 ±± 0.0
1.1 ⇥ note the poo
BT 33.59 Conclusion
03 of classes is also a barrier. 97.6
FNTN
± 0.4
A 65.5±0.7
single generator 86.7 ±
model
6.2±3.7 is used
361 in±GKT
22.3±5.5
5.6
to ⇥⇥
createFNTN representative
RTRN 94.6noise
85.7 ± 3.10.1 22.3
6.2 ±±6.0 8.3 3.4 ±
16.0 ± 1.1
2.7 ⇥
matrices for allBDSH classes.
297 scalability
361 9 of Conclusion
both EMMN FNTN and GKT, 85.7 which 3.1 is evident
6.2 in
6.0 our results.
16.0 The
2.7 former is able to protec
AMN The98.0±0.29 original paper only97.9±1.6 studied 10-class 78.8±0.0 problems, Xisand still a smallchallenging
ResNet9
Unlearning an is important, ± ± ± ⇥
0.2 problem. The zero-shot settinginis⇥aamongst theinabiltiy
hardest,t
04
73.8±0.2 2.4±2.4
298 model 362 3.0±0.9
performance, ⇥ but AMN unable to 93.5
do so ±while 0.8 ± 1.7
reducing the MIA, 0.8resulting
± 0.3 complete
05 model achievesOjust URS ⇠ 55% accuracy.
SCRUB 98.0±0.3
62.4±28.4 Using10.1±22.48
a single, small,
98.0±1.5 362
363 generator
78.8±4.0
Unlearning
existing
16.7±21.7 Xis annoise
tomethods
create
⇥ important,
are
AMN
SSD samples
restricted that
challenging
93.5
95.1 in ± are
their0.2
0.0 problem.
performance
0.8
5.12 ±
± The
1.7
0.0 and zero-shot
scope.
0.8
5.4 ± setting
To
0.3
0.0 addressis⇥amongst
this, we the hardest,
present a no
3: sufficiently
Entropy, H(x), of the Doff the
representative output
largedistributions
training for full-class
distribution Unlearning
of existing
CIFAR-100 unlearning is an BT
proves on CIFAR-10,
important,
unreliable. challenging
Indeed, problem. The zero-shot setting is⇥amongst the hardest,
06 SSD 75.0±0.0 4.2±0.0 362
363
364 zero-shot
11.0±0.0 methods
unlearning
⇥ are
SSD restricted
algorithm, 95.1
93.6 inJiT,
their
± 0.0
0.3 performance
which induces
5.12
3.3 ± ±2.90.0 and scope.
5.4 ± To
forgetting
0.0 by
0.0
0.1 address
smoothing this,theweregion
present a no
betw
ng 3:JiTEntropy,
exhibits
there H(x),
is little of the similar
performance
incentive Df output to the distributions
retrained for
model. full-class
For zero-shot
pexisting
to=Tableunlearning
[Figure: entropy of model outputs, JiT versus the retrained model.] JiT exhibits similar performance to the retrained model. For p = 0.1, there is no statistically significant difference between JiT and the retrained model. Note that we plot log(H(x)) for visualisation, thus a larger value on the horizontal axis indicates a smaller entropy.
Benchmark Evaluation
Full-class Unlearning. We begin by comparing full-class performance to that of the existing ZS methods. As seen in Tables 2, 3, and 4, JiT demonstrates significantly superior performance over existing zero-shot unlearning methods. The authors of Chundawat et al. [2023b] note the poor scalability of both EMMN and GKT, which is evident in our results. The former is able to protect performance, but is unable to do so while reducing the MIA, resulting in a complete inability to forget the necessary data. Chundawat et al. [2023b] notes that EMMN performance is "sub-optimal". GKT fails to protect the model performance entirely. GKT's poor performance makes intuitive sense; it requires a model be retrained from scratch via distillation using only learned noise matrices (no ground truth data), which does not scale well with model size. Furthermore, scaling the number of classes is also a barrier. A single generator model is used in GKT to create representative noise matrices for all classes. The original paper only studied 10-class problems, and still a small ResNet9 model achieves just ∼55% accuracy. Using a single, small generator to create noise samples that are sufficiently representative of the large training distribution of CIFAR-100 proves unreliable. Indeed, there is little incentive in the generator loss for the generator to learn a representative distribution; learning a single challenging sample would suffice. Failing to scale to large problems or models is a significant barrier, since the value of unlearning is found mostly in large models that are expensive to train, or large datasets that are expensive to store. Even if these methods could scale and performed well, they do not extend to sub-class or random subset unlearning.
Compared to non-ZS SOTA, JiT performs reasonably, dropping only 2% retain set performance compared to the baseline on ViT, while outperforming both UNSIR and SSD on the MIA. Performance also holds for VGG; while MIA is higher than desired, the retain accuracy is higher than the retrained model, indicating more aggressive unlearning could have been applied. When applying the same hyper-parameters to face unlearning, JiT generalizes well, outperforming SCRUB, SSD, and UNSIR.
Table 2: ViT Full-class unlearning performance on CIFAR-100 class Rocket.
METHOD Dr ACC. Df ACC. MIA ZS
BSLN 88.9 ± 0.0 94.7 ± 0.0 94.4 ± 0.0 ×
RTRN 90.1 ± 0.0 0.0 ± 0.0 3.2 ± 0.5 ×
FNTN 80.8 ± 1.4 0.6 ± 0.7 19.0 ± 8.7 ×
AMN 87.9 ± 0.9 0.0 ± 0.0 1.4 ± 0.9 ×
SSD 88.90 ± 0.0 0.0 ± 0.0 1.8 ± 0.0 ×
BT 87.5 ± 0.5 4.2 ± 5.2 0.0 ± 0.1 ×
UNSIR 88.5 ± 0.4 65.3 ± 9.1 29.1 ± 6.1 ×
GKT 1.0 ± 0.6 0.0 ± 0.0 60.0 ± 51.6 ✓
EMMN 84.6 ± 0.4 94.3 ± 1.5 93.7 ± 2.2 ✓
BDSH 87.6 ± 0.0 0.0 ± 0.0 5.0 ± 0.0 ✓
OURS 87.5 ± 0.0 51.9 ± 2.13 4.3 ± 0.38 ✓
Table 3: VGG16 Full-class unlearning performance on CIFAR-100 class Rocket.
METHOD Dr ACC. Df ACC. MIA ZS
BSLN 66.3 ± 0.0 77.0 ± 0.0 97.4 ± 0.0 ×
RTRN 63.2 ± 0.5 0.0 ± 0.0 10.4 ± 1.1 ×
FNTN 59.7 ± 0.4 3.9 ± 3.0 13.2 ± 4.2 ×
AMN 64.3 ± 0.4 0.0 ± 0.0 1.8 ± 0.8 ×
SCRUB 66.2 ± 0.1 0.0 ± 0.0 8.2 ± 1.7 ×
SSD 63.79 ± 0.0 0.0 ± 0.0 8.6 ± 0.0 ×
BT 65.5 ± 0.2 0.1 ± 0.3 0.0 ± 0.1 ×
UNSIR 64.6 ± 0.4 42.9 ± 14.3 40.7 ± 12.1 ×
GKT 2.3 ± 0.2 0.0 ± 0.0 56.2 ± 20.0 ✓
EMMN 26.9 ± 7.7 24.3 ± 23.7 58.2 ± 14.5 ✓
BDSH 66.2 ± 0.1 13.0 ± 0.0 2.9 ± 0.1 ✓
OURS 66.2 ± 0.3 14.2 ± 0.6 2.9 ± 0.3 ✓
Table 4: VGG Full-class unlearning performance on PinsFaceRecognition class 1.
METHOD Dr ACC. Df ACC. MIA ZS
BSLN 94.0 ± 0.0 93.9 ± 0.0 13.82 ± 0.0 ×
RTRN 100.0 ± 0.0 0.0 ± 0.0 2.6 ± 0.8 ×
FNTN 97.6 ± 0.7 36.9 ± 9.9 4.3 ± 2.7 ×
AMN 99.7 ± 0.1 0.0 ± 0.0 1.4 ± 1.33 ×
SCRUB 98.8 ± 0.0 97.1 ± 0.0 8.8 ± 0.76 ×
SSD 55.8 ± 0.0 0.0 ± 0.0 4.0 ± 0.0 ×
BT 93.7 ± 0.3 0.0 ± 0.0 0.0 ± 0.0 ×
UNSIR 99.5 ± 0.1 74.4 ± 9.2 13.6 ± 8.9 ×
GKT 2.0 ± 0.6 0.0 ± 0.0 23.9 ± 30.3 ✓
EMMN 51.0 ± 13.5 69.3 ± 25.7 26.9 ± 17.8 ✓
BDSH 93.6 ± 0.4 79.4 ± 0.0 42.4 ± 0.4 ✓
OURS 91.4 ± 0.1 1.9 ± 0.2 4.7 ± 0.5 ✓
Sub-class Unlearning. Tables 5 and 6 show the performance of JiT on sub-class unlearning for ViT and VGG16. For both, JiT is able to comfortably unlearn. For ViT, it actually over-forgets; however, this is typically an easier problem to correct, since more conservative values can always be selected. The retain accuracy drops more than usual, by 4%, which a more conservative parameter set could correct. For VGG16, the method is comfortably amongst the SOTA, outperforming methods that are granted access to the retained data.
Table 5: ViT Sub-class unlearning performance on CIFAR-20 sub-class Rocket.
METHOD Dr ACC. Df ACC. MIA ZS
BSLN 95.7 ± 0.0 94.5 ± 0.0 80.4 ± 0.0 ×
RTRN 94.6 ± 0.1 22.3 ± 8.3 3.4 ± 1.1 ×
FNTN 85.7 ± 3.1 6.2 ± 6.0 16.0 ± 2.7 ×
AMN 93.5 ± 0.2 0.8 ± 1.7 0.8 ± 0.3 ×
SSD 95.1 ± 0.0 5.12 ± 0.0 5.4 ± 0.0 ×
BT 93.6 ± 0.3 3.3 ± 2.9 0.0 ± 0.1 ×
UNSIR 93.3 ± 0.4 74.9 ± 10.1 27.3 ± 13.8 ×
BDSH 95.7 ± 0.0 48.4 ± 0.0 0.1 ± 0.0 ✓
OURS 92.2 ± 0.0 0.0 ± 0.0 14.66 ± 8.8 ✓
Random Unlearning. Tables 7 and 8 show method performance when forgetting 100 samples uniformly distributed across the training set. As with sub-class, JiT is able to comfortably rival existing non-ZS SOTA methods. While over-forgetting slightly, the Dr accuracy is almost unchanged (< 1% for both models).
Table 7: ViT Random unlearning performance for 100 samples from CIFAR-10.
METHOD Dr ACC. Df ACC. MIA ZS
BSLN 98.9 ± 0.0 100.0 ± 0.0 90.8 ± 3.5 ×
RTRN 98.6 ± 0.1 98.8 ± 0.8 91.8 ± 1.8 ×
FNTN 97.3 ± 0.3 97.2 ± 1.0 86.1 ± 2.1 ×
AMN 97.6 ± 0.3 73.5 ± 5.1 10.4 ± 4.9 ×
SSD 98.0 ± 1.6 98.1 ± 2.4 85.5 ± 0.1 ×
BT 97.6 ± 0.4 86.7 ± 3.6 33.5 ± 5.6 ×
BDSH 98.0 ± 0.29 97.9 ± 1.6 78.8 ± 0.0 ✓
OURS 98.0 ± 0.3 98.0 ± 1.5 78.8 ± 4.0 ✓
Table 8: VGG16 Random unlearning performance for 100 samples from CIFAR-10.
METHOD Dr ACC. Df ACC. MIA ZS
BSLN 87.0 ± 0.0 92.0 ± 3.6 70.1 ± 5.4 ×
RTRN 87.7 ± 0.2 91.0 ± 2.5 78.9 ± 3.5 ×
FNTN 84.4 ± 0.8 86.4 ± 4.4 70.8 ± 4.7 ×
AMN 86.8 ± 0.3 51.3 ± 4.4 13.1 ± 2.9 ×
SCRUB 87.7 ± 0.1 92.7 ± 2.9 71.8 ± 5.2 ×
SSD 85.6 ± 2.7 90.8 ± 3.7 66.7 ± 5.9 ×
BT 86.9 ± 0.2 82.5 ± 4.9 40.8 ± 6.3 ×
BDSH 86.9 ± 0.1 92.2 ± 3.4 69.8 ± 5.1 ✓
OURS 86.3 ± 0.3 88.7 ± 3.9 64.2 ± 5.2 ✓
Runtime Comparison. The runtime figure shows the runtime of JiT compared to other methods. JiT adds very little overhead in comparison to the other zero-shot methods. Sufficiently short runtimes are an important requirement of unlearning, which JiT empirically satisfies. JiT has a computational complexity of O(N |Df|), where N is the number of perturbed samples and |Df| is the cardinality of the forget set. Requiring only Df and processing each sample just once makes JiT efficient.
[Figure: Plot of method runtime for VGG16 full-class forgetting on class xxxxxx in seconds.]
Discussion
JiT achieves state-of-the-art performance for the zero-shot unlearning problem. The entropy experiments highlight that it is able to effectively imitate the output entropy of a retrained model while preserving retain set performance. When compared to existing zero-shot methods, JiT achieves considerably better performance across all full-class settings.
We chose additive noise as our perturbation method as it is simple and has analogues across most data modalities. Changing the perturbation to something specific to the modality and more effective (e.g. random crops with color jitter) would likely lead to more stable, and more effective, forgetting.
8 Limitations
As discussed in Gulrajani et al. [2017], Lipschitz regularization is incompatible with batchnorm layers, since the loss is computed with respect to each sample independently, not the entire batch. This limitation can be mitigated by selecting an alternative model, by removing batchnorm layers from a model, or by replacing them with another normalisation strategy, such as layer norms.
The appendix provides a sensitivity analysis (10.1). It demonstrates that JiT can be sensitive to hyper-parameter selection. This is a by-product of having no access to Dr, since this prevents explicit protection of model performance, and thus the multi-objective problem becomes more precarious; it is also due to our gradient approximation method. Exact gradients are slower, but may prove more stable. Better stability can be achieved with gradient clipping, though this came at the cost of method performance.
JiT, like most SOTA methods, is not certified. The impact of this will vary by application domain, but may preclude its use in especially sensitive areas. While we have drawn links between Lipschitz continuity and our method, we leave rigorous connection for future work. Finally, we note that our method is specifically tailored to classification tasks. While it is possible a variation of this approach could work for large generative models, we restrict our focus and our claims to larger classifiers and leave generative applications to future work.
Table 5: Zero-shot methods performance on a ViT-L trained on ILSVRC Imagenet.
9 Conclusion
Unlearning is an important, challenging problem. The ZS setting is amongst the hardest, requiring delicate
treatment of the unlearning process to ensure model performance is protected. In this work, we approached
this challenge from an information theoretic perspective, deriving an unlearning algorithm directly from the
notion of minimising information gained from a sample. We demonstrate empirically the geometric insights
behind why JiT can effectively tackle the ZS unlearning problem, alongside showing experimentally that JiT
can reconstruct behaviour analogous to that of a model retrained from scratch. JiT achieves performance
competitive with state-of-the-art ZS and non-ZS methods. We evaluate JiT on a range of benchmarks,
demonstrating its efficacy in full-class, sub-class, and random unlearning, across multiple models. Future
work is needed to establish a stronger theoretical relationship between forgetting and information theory,
as well as exploring whether this can be formalized to provide guarantees on forgetting using information
theoretic approaches.
References
Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-
generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international
conference on knowledge discovery & data mining, pages 2623–2631, 2019.
Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal,
Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in
deep networks. In International conference on machine learning, pages 233–242. PMLR, 2017.
Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu
Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and
Privacy (SP), pages 141–159. IEEE, 2021.
Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In 2015 IEEE
symposium on security and privacy, pages 463–480. IEEE, 2015.
Min Chen, Weizhuo Gao, Gaoyang Liu, Kai Peng, and Chen Wang. Boundary unlearning: Rapid forgetting
of deep networks via shifting the decision boundary. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 7766–7775, 2023.
Vikram S Chundawat, Ayush K Tarun, Murari Mandal, and Mohan Kankanhalli. Can bad teaching induce
forgetting? unlearning in deep networks using an incompetent teacher. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 37, pages 7210–7217, 2023a.
Vikram S Chundawat, Ayush K Tarun, Murari Mandal, and Mohan Kankanhalli. Zero-shot machine
unlearning. IEEE Transactions on Information Forensics and Security, 2023b.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby.
An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
Vitaly Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the
52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 954–959, 2020.
Jack Foster, Stefan Schoepf, and Alexandra Brintrup. Fast machine unlearning without retraining through
selective synaptic dampening. arXiv preprint arXiv:2308.07707, 2023.
Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making ai forget you: Data deletion in
machine learning. Advances in neural information processing systems, 32, 2019.
Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective
forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 9304–9312, 2020.
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples.
arXiv preprint arXiv:1412.6572, 2014.
Laura Graves, Vineel Nagisetty, and Vijay Ganesh. Amnesiac machine learning. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 35, pages 11516–11524, 2021.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved
training of wasserstein gans. Advances in neural information processing systems, 30, 2017.
Varun Gupta, Christopher Jung, Seth Neel, Aaron Roth, Saeed Sharifi-Malvajerdi, and Chris Waites. Adaptive
machine unlearning. Advances in Neural Information Processing Systems, 34:16319–16330, 2021.
Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification
and preference learning. arXiv preprint arXiv:1112.5745, 2011.
Jenny E Jeong and Peng Qiu. Quantifying the relative importance of experimental data points in parameter
estimation. BMC systems biology, 12:1–14, 2018.
Alex Krizhevsky and Geoff Hinton. Convolutional deep belief networks on cifar-10. Unpublished manuscript,
40(7):1–9, 2010.
Meghdad Kurmanji, Peter Triantafillou, and Eleni Triantafillou. Towards unbounded machine unlearning.
arXiv preprint arXiv:2302.09880, 2023.
Dennis V Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical
Statistics, 27(4):986–1005, 1956.
Ronak Mehta, Sourav Pal, Vikas Singh, and Sathya N Ravi. Deep unlearning via randomized conditionally
independent hessians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 10422–10431, 2022.
Seth Neel, Aaron Roth, and Saeed Sharifi-Malvajerdi. Descent-to-delete: Gradient-based methods for machine
unlearning. In Algorithmic Learning Theory, pages 931–962. PMLR, 2021.
J. Ross Quinlan. Induction of decision trees. Machine learning, 1:81–106, 1986.
Ayush Sekhari, Jayadev Acharya, Gautam Kamath, and Ananda Theertha Suresh. Remember what you
want to forget: Algorithms for machine unlearning. Advances in Neural Information Processing Systems,
34:18075–18086, 2021.
Vedant Shah, Frederik Träuble, Ashish Malik, Hugo Larochelle, Michael Mozer, Sanjeev Arora, Yoshua
Bengio, and Anirudh Goyal. Unlearning via sparse representations. arXiv preprint arXiv:2311.15268, 2023.
Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical journal, 27
(3):379–423, 1948.
Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against
machine learning models. In 2017 IEEE symposium on security and privacy (SP), pages 3–18. IEEE, 2017.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556, 2014.
Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large
learning rates. In Artificial intelligence and machine learning for multi-domain operations applications,
volume 11006, pages 369–386. SPIE, 2019.
Ayush K Tarun, Vikram S Chundawat, Murari Mandal, and Mohan Kankanhalli. Fast yet effective machine
unlearning. IEEE Transactions on Neural Networks and Learning Systems, 2023.
Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification.
Journal of machine learning research, 2(Nov):45–66, 2001.
Eleni Triantafillou, Fabian Pedregosa, Jamie Hayes, Peter Kairouz, Isabelle Guyon, Meghdad Kurmanji,
Gintare Karolina Dziugaite, Peter Triantafillou, Kairan Zhao, Lisheng Sun Hosoya, Julio C. S. Jacques Junior,
Vincent Dumoulin, Ioannis Mitliagkas, Sergio Escalera, Jun Wan, Sohier Dane, Maggie Demkin, and
Walter Reade. Neurips 2023 - machine unlearning, 2023. URL [Link]
neurips-2023-machine-unlearning.
Alexander Warnecke, Lukas Pirch, Christian Wressnegger, and Konrad Rieck. Machine unlearning of features
and labels. arXiv preprint arXiv:2108.11577, 2021.
Robert F Woolson. Wilcoxon signed-rank test. Wiley encyclopedia of clinical trials, pages 1–3, 2007.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep
learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
10 Appendix
$$\|f_\theta(x) - f_\theta(y)\|_p \le k \, \|x - y\|_p \qquad (4)$$
parameterized network that has been trained on a set D, and on which we aim to perform unlearning. It is crucial to note that we presuppose $f_\theta(x)$ to already demonstrate
of magnitude, performance varies significantly. For σ, in general increasing the noise actually reduced the
forgetting/increased Dr accuracy. This is because VGG was so sensitive to noise, and so small values of σ
did little to reduce the divergence of model output, and thus for σ < 1 the loss increases due to the division
in the loss term. For models that are robust to additive noise, the relationship between σ and performance is
often parabolic.
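The effect of the division for small σ can be illustrated with a minimal sketch. It assumes the per-sample loss is a Lipschitz-style ratio ‖f(x) − f(x + ε)‖ / ‖ε‖ with ε drawn from N(0, σ²I), averaged over the perturbed variants. The toy model `f`, the helper names, and the saturating output are hypothetical stand-ins for a noise-sensitive classifier, not the authors' implementation.

```python
import math
import random


def f(x):
    """Hypothetical noise-sensitive model: the output saturates for tiny input changes."""
    return [math.tanh(50.0 * v) for v in x]


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def ratio_loss(x, sigma, n_perturbed, rng):
    """Mean of ||f(x) - f(x + eps)|| / ||eps|| over Gaussian perturbations (assumed loss form)."""
    fx = f(x)
    total = 0.0
    for _ in range(n_perturbed):
        x_pert = [v + rng.gauss(0.0, sigma) for v in x]
        total += l2(fx, f(x_pert)) / l2(x, x_pert)
    return total / n_perturbed


rng = random.Random(0)
x = [0.0, 0.0, 0.0, 0.0]

# The output divergence is already saturated at both noise levels,
# so the loss is dominated by the 1 / ||eps|| factor.
loss_small_sigma = ratio_loss(x, sigma=0.5, n_perturbed=200, rng=rng)
loss_large_sigma = ratio_loss(x, sigma=2.0, n_perturbed=200, rng=rng)
print(loss_small_sigma, loss_large_sigma)
```

In this toy setting the numerator barely changes between the two noise levels, so shrinking σ only shrinks the denominator and the loss grows, which matches the behaviour described for VGG.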
The choice of how many perturbed variants to use is simple: the more samples, the more stable the forgetting, and the only trade-off is compute time.
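The stability claim can be checked on a toy problem. The sketch below assumes the loss is estimated as an average of ratio terms ‖f(x) − f(x + ε)‖ / ‖ε‖ over N Gaussian perturbations, and measures how much that estimate fluctuates across repeated draws. The model `toy_model`, the sample `x`, and all constants are hypothetical choices for illustration only.

```python
import math
import random
import statistics


def toy_model(x):
    """Hypothetical smooth model standing in for a trained classifier."""
    return [math.tanh(v) for v in x]


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def ratio_loss(x, sigma, n_perturbed, rng):
    """Monte Carlo estimate of the assumed loss using n_perturbed noisy copies of x."""
    fx = toy_model(x)
    total = 0.0
    for _ in range(n_perturbed):
        x_pert = [v + rng.gauss(0.0, sigma) for v in x]
        total += l2(fx, toy_model(x_pert)) / l2(x, x_pert)
    return total / n_perturbed


def estimate_spread(n_perturbed, trials=200, seed=0):
    """Standard deviation of the loss estimate across independent trials."""
    rng = random.Random(seed)
    x = [0.3, -0.2, 0.5]
    estimates = [ratio_loss(x, 0.5, n_perturbed, rng) for _ in range(trials)]
    return statistics.stdev(estimates)


spread_few = estimate_spread(n_perturbed=4)
spread_many = estimate_spread(n_perturbed=64)
print(spread_few, spread_many)
```

The cost of each estimate grows linearly with the number of perturbed variants, while its spread shrinks, which is the compute-versus-stability trade-off described above.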
Below are the forget-class breakdowns of each method’s performance for full-class and sub-class unlearning.
For each unlearning scenario, the same parameters are used across all classes; this can lead to significant
variance in method performance.
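As a small worked example of this variance, the snippet below summarises the forget-set accuracy of the "Ours" rows in the ViT CIFAR-100 full-class breakdown. The values are copied from that table (reading the mushroom entry as 77.90); the variable names are illustrative.

```python
import statistics

# Forget-set accuracy (%) per forgotten class for "Ours", ViT CIFAR-100 full-class breakdown.
df_acc_ours = {
    "baby": 38.70,
    "lamp": 93.90,
    "mushroom": 77.90,  # assumed reading of the garbled table entry
    "rocket": 51.90,
    "sea": 24.00,
}

values = list(df_acc_ours.values())
mean_df_acc = statistics.mean(values)
spread_df_acc = statistics.pstdev(values)  # population std across the five classes
print(f"mean={mean_df_acc:.2f}, std={spread_df_acc:.2f}, range={min(values)}-{max(values)}")
```

With a single shared hyper-parameter setting, the forget accuracy ranges from 24.0% to 93.9% across classes, which is why the per-class breakdown is more informative than a single aggregate.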
Table 6: ViT CIFAR-100 full-class unlearning breakdown
Amnesiac baby 88.4 ± 0.7 0.0 ± 0.0 1.8 ± 0.3 1132.0 ± 147.1 ×
Amnesiac lamp 88.4 ± 0.6 0.0 ± 0.0 2.7 ± 0.4 1146.5 ± 97.5 ×
Amnesiac mushroom 88.3 ± 0.7 0.0 ± 0.0 0.5 ± 0.2 1071.9 ± 98.1 ×
Amnesiac rocket 87.9 ± 0.9 0.0 ± 0.0 1.0 ± 0.6 1029.7 ± 70.8 ×
Amnesiac sea 88.3 ± 0.3 0.0 ± 0.0 0.8 ± 0.2 1176.2 ± 136.7 ×
SSD baby 88.59 ± 0.0 0.0 ± 0.0 0.60 ± 0.0 685.40 ± 97.35 ×
SSD lamp 89.06 ± 0.0 36.89 ± 0.0 0.40 ± 0.0 741.23 ± 170.57 ×
SSD mushroom 88.82 ± 0.0 0.0 ± 0.0 3.80 ± 0.0 650.93 ± 76.43 ×
SSD rocket 88.90 ± 0.0 0.0 ± 0.0 1.80 ± 0.0 655.64 ± 65.36 ×
SSD sea 87.95 ± 0.0 0.0 ± 0.0 3.20 ± 0.0 767.35 ± 177.77 ×
Teacher baby 87.5 ± 0.4 23.8 ± 22.5 0.0 ± 0.0 610.0 ± 98.1 ×
Teacher lamp 87.5 ± 0.4 25.2 ± 12.5 0.1 ± 0.2 668.6 ± 135.1 ×
Teacher mushroom 87.4 ± 0.4 12.8 ± 5.9 0.0 ± 0.1 602.9 ± 33.2 ×
Teacher rocket 87.5 ± 0.5 4.2 ± 5.2 0.0 ± 0.1 602.6 ± 63.2 ×
Teacher sea 87.7 ± 0.2 51.1 ± 17.4 0.0 ± 0.0 661.8 ± 117.1 ×
UNSIR baby 88.8 ± 0.4 2.0 ± 1.2 14.3 ± 6.1 954.6 ± 110.3 ×
UNSIR lamp 88.5 ± 0.4 70.9 ± 4.4 29.4 ± 4.8 1002.2 ± 135.1 ×
UNSIR mushroom 88.4 ± 0.6 83.9 ± 2.9 21.3 ± 2.7 891.8 ± 69.9 ×
UNSIR rocket 88.5 ± 0.4 65.3 ± 9.1 29.1 ± 6.1 868.8 ± 47.9 ×
UNSIR sea 88.8 ± 0.2 13.9 ± 6.2 9.1 ± 4.7 986.1 ± 149.6 ×
GKT baby 1.00 ± 0.07 0.00 ± 0.00 70.00 ± 48.30 1074.86 ± 791.54 ✓
GKT lamp 1.01 ± 0.06 0.00 ± 0.00 60.00 ± 51.64 1924.50 ± 1269.59 ✓
GKT mushroom 1.01 ± 0.06 0.00 ± 0.00 60.00 ± 51.64 1793.25 ± 1100.70 ✓
GKT rocket 1.00 ± 0.06 0.00 ± 0.00 60.00 ± 51.64 1943.99 ± 1227.35 ✓
GKT sea 1.01 ± 0.06 0.00 ± 0.00 70.00 ± 48.30 1788.32 ± 1150.91 ✓
EMMN baby 84.63 ± 0.30 87.86 ± 2.98 74.82 ± 6.54 1064.73 ± 202.59 ✓
EMMN lamp 84.99 ± 0.37 93.03 ± 2.04 87.58 ± 2.95 1100.77 ± 212.08 ✓
EMMN mushroom 84.65 ± 0.33 93.11 ± 1.84 91.14 ± 2.23 1068.89 ± 132.31 ✓
EMMN rocket 84.62 ± 0.40 94.33 ± 1.53 93.68 ± 2.16 1115.30 ± 179.26 ✓
EMMN sea 84.68 ± 0.33 89.47 ± 4.26 74.64 ± 7.44 1075.82 ± 216.62 ✓
BDSH baby 83.87 ± 0.0 0.0 ± 0.0 8.6 ± 0.0 1388.14 ± 310.5 ✓
BDSH lamp 86.94 ± 0.0 2.0 ± 0.0 28.44 ± 0.0 1386.44 ± 304.9 ✓
BDSH mushroom 87.8 ± 0.0 0.0 ± 0.0 0.8 ± 0.0 1398.19 ± 305.3 ✓
BDSH rocket 87.62 ± 0.0 0.0 ± 0.0 5.0 ± 0.0 1396.85 ± 306.4 ✓
BDSH sea 87.34 ± 0.0 3.0 ± 0.0 0.8 ± 0.0 1414.41 ± 146.2 ✓
Ours baby 87.16 ± 0.03 38.70 ± 0.95 0.40 ± 0.00 607.94 ± 130.32 ✓
Ours lamp 88.39 ± 0.01 93.90 ± 0.32 49.86 ± 2.09 596.99 ± 143.57 ✓
Ours mushroom 87.70 ± 0.03 77.90 ± 0.57 4.78 ± 0.37 656.20 ± 141.64 ✓
Ours rocket 87.46 ± 0.02 51.90 ± 2.13 4.26 ± 0.38 629.38 ± 116.30 ✓
Ours sea 83.78 ± 0.08 24.00 ± 1.15 16.32 ± 0.36 596.28 ± 151.89 ✓
Table 7: ViT CIFAR-20 sub-class unlearning breakdown
Amnesiac baby 93.3 ± 0.3 38.8 ± 7.4 0.9 ± 0.7 1025.0 ± 35.2 ×
Amnesiac lamp 93.7 ± 0.5 0.6 ± 1.5 2.0 ± 1.0 1009.0 ± 37.7 ×
Amnesiac mushroom 93.4 ± 0.5 0.2 ± 0.4 1.5 ± 0.5 1131.8 ± 137.6 ×
Amnesiac rocket 93.5 ± 0.2 0.8 ± 1.7 0.8 ± 0.3 1186.6 ± 107.5 ×
Amnesiac sea 93.3 ± 0.2 21.4 ± 8.5 0.4 ± 0.2 1070.4 ± 138.1 ×
SSD baby 95.54 ± 0.0 94.10 ± 0.0 77.20 ± 0.0 736.00 ± 12.01 ×
SSD lamp 95.54 ± 0.0 14.58 ± 0.0 3.2 ± 0.0 728.98 ± 73.07 ×
SSD mushroom 95.51 ± 0.0 6.68 ± 0.0 0.40 ± 0.0 718.83 ± 73.41 ×
SSD rocket 95.13 ± 0.0 5.12 ± 0.0 5.40 ± 0.0 699.33 ± 72.47 ×
SSD sea 95.57 ± 0.0 97.05 ± 0.0 82.20 ± 0.0 645.74 ± 53.38 ×
Teacher baby 93.0 ± 0.5 46.7 ± 17.9 0.0 ± 0.1 553.9 ± 65.9 ×
Teacher lamp 93.6 ± 0.7 8.2 ± 7.1 0.1 ± 0.2 558.1 ± 63.9 ×
Teacher mushroom 93.6 ± 0.4 13.0 ± 9.1 0.0 ± 0.0 620.1 ± 111.7 ×
Teacher rocket 93.6 ± 0.3 3.3 ± 2.9 0.0 ± 0.1 631.9 ± 115.2 ×
Teacher sea 93.6 ± 0.3 26.0 ± 14.0 0.2 ± 0.1 586.7 ± 89.7 ×
UNSIR baby 93.2 ± 0.3 94.5 ± 0.8 88.0 ± 3.1 871.1 ± 60.7 ×
UNSIR lamp 93.4 ± 0.5 76.5 ± 5.2 36.5 ± 11.7 899.8 ± 72.0 ×
UNSIR mushroom 93.1 ± 0.6 79.8 ± 7.6 19.0 ± 7.4 925.7 ± 117.5 ×
UNSIR rocket 93.3 ± 0.4 74.9 ± 10.1 27.3 ± 13.8 983.1 ± 143.2 ×
UNSIR sea 93.3 ± 0.3 94.3 ± 2.3 77.0 ± 7.2 1024.5 ± 144.8 ×
BDSH baby 95.36 ± 0.0 93.32 ± 0.0 18.8 ± 0.0 1163 ± 49.76 ✓
BDSH lamp 95.76 ± 0.0 89.58 ± 0.0 80.8 ± 0.0 1152.48 ± 53.9 ✓
BDSH mushroom 95.72 ± 0.0 88.37 ± 0.0 2.6 ± 0.0 1087.87 ± 200.2 ✓
BDSH rocket 95.66 ± 0.0 48.44 ± 0.0 1.4 ± 0.0 1087.15 ± 212.6 ✓
BDSH sea 95.09 ± 0.0 78.91 ± 0.0 4.6 ± 0.0 1101.91 ± 215.9 ✓
Ours baby 87.40 ± 0.02 0.00 ± 0.00 0.80 ± 0.00 532.20 ± 112.52 ✓
Ours lamp 90.20 ± 0.02 0.00 ± 0.00 22.86 ± 0.19 560.02 ± 58.54 ✓
Ours mushroom 93.73 ± 0.01 0.00 ± 0.00 1.40 ± 0.00 563.76 ± 69.07 ✓
Ours rocket 92.15 ± 0.01 0.00 ± 0.00 14.66 ± 0.10 535.14 ± 105.65 ✓
Ours sea 87.39 ± 0.02 0.00 ± 0.00 3.40 ± 0.00 542.19 ± 87.80 ✓
Table 8: VGG11 CIFAR-100 class unlearning breakdown
Amnesiac baby 64.70 ± 0.36 0.00 ± 0.00 4.30 ± 1.10 99.50 ± 3.40 ×
Amnesiac lamp 64.58 ± 0.37 0.00 ± 0.00 4.80 ± 1.20 99.90 ± 3.00 ×
Amnesiac mushroom 64.49 ± 0.52 0.00 ± 0.00 3.10 ± 1.10 100.80 ± 3.30 ×
Amnesiac rocket 64.34 ± 0.38 0.00 ± 0.00 1.80 ± 0.80 99.70 ± 2.80 ×
Amnesiac sea 64.39 ± 0.40 0.00 ± 0.00 1.20 ± 0.30 100.20 ± 3.50 ×
SCRUB baby 67.05 ± 0.11 0.00 ± 0.00 9.30 ± 1.50 109.10 ± 2.60 ×
SCRUB lamp 66.81 ± 0.10 0.00 ± 0.00 8.70 ± 0.90 108.50 ± 3.50 ×
SCRUB mushroom 67.05 ± 0.08 0.00 ± 0.00 9.40 ± 0.50 107.60 ± 2.80 ×
SCRUB rocket 66.19 ± 0.14 0.00 ± 0.00 8.20 ± 1.70 108.70 ± 2.50 ×
SCRUB sea 66.71 ± 0.12 0.00 ± 0.00 6.30 ± 2.40 107.70 ± 2.70 ×
SSD baby 52.68 ± 0.01 0.00 ± 0.00 7.40 ± 0.00 81.30 ± 1.30 ×
SSD lamp 65.44 ± 0.00 0.00 ± 0.00 5.80 ± 0.00 81.60 ± 0.80 ×
SSD mushroom 62.19 ± 0.00 0.00 ± 0.00 14.60 ± 0.00 81.20 ± 1.00 ×
SSD rocket 63.79 ± 0.01 0.00 ± 0.00 8.60 ± 0.00 81.50 ± 0.90 ×
SSD sea 32.75 ± 0.01 0.00 ± 0.00 7.00 ± 0.00 81.30 ± 1.20 ×
Teacher baby 66.05 ± 0.33 1.00 ± 1.10 0.10 ± 0.10 79.60 ± 0.80 ×
Teacher lamp 65.96 ± 0.19 0.40 ± 1.00 0.00 ± 0.00 79.10 ± 1.30 ×
Teacher mushroom 65.67 ± 0.27 0.90 ± 1.50 0.10 ± 0.10 79.60 ± 1.40 ×
Teacher rocket 65.51 ± 0.24 0.10 ± 0.30 0.00 ± 0.10 79.10 ± 1.00 ×
Teacher sea 65.57 ± 0.24 5.20 ± 4.60 0.00 ± 0.00 86.70 ± 24.00 ×
UNSIR baby 64.88 ± 0.43 3.80 ± 2.30 29.40 ± 11.00 111.10 ± 2.00 ×
UNSIR lamp 64.72 ± 0.35 25.90 ± 6.10 17.00 ± 5.90 111.60 ± 2.50 ×
UNSIR mushroom 64.79 ± 0.27 21.10 ± 10.70 11.30 ± 5.20 111.30 ± 2.90 ×
UNSIR rocket 64.58 ± 0.39 42.90 ± 14.30 40.70 ± 12.10 112.10 ± 3.50 ×
UNSIR sea 64.51 ± 0.33 13.90 ± 7.50 22.90 ± 4.50 111.30 ± 2.90 ×
GKT baby 2.31 ± 0.27 0.00 ± 0.00 47.90 ± 26.50 634.30 ± 9.20 ✓
GKT lamp 2.42 ± 0.37 0.00 ± 0.00 45.30 ± 23.70 630.80 ± 8.60 ✓
GKT mushroom 2.23 ± 0.29 0.00 ± 0.00 47.40 ± 11.80 628.90 ± 6.00 ✓
GKT rocket 2.31 ± 0.24 0.00 ± 0.00 56.20 ± 20.00 629.00 ± 6.90 ✓
GKT sea 2.42 ± 0.39 0.00 ± 0.00 60.50 ± 28.60 632.00 ± 5.60 ✓
EMMN baby 30.11 ± 8.56 7.80 ± 9.20 54.00 ± 11.20 274.70 ± 4.40 ✓
EMMN lamp 33.02 ± 9.02 21.50 ± 17.00 54.70 ± 14.60 276.00 ± 4.60 ✓
EMMN mushroom 31.87 ± 11.81 13.50 ± 12.40 53.40 ± 12.80 275.40 ± 4.60 ✓
EMMN rocket 26.91 ± 7.74 24.30 ± 23.70 58.20 ± 14.50 274.80 ± 3.90 ✓
EMMN sea 30.28 ± 8.81 33.30 ± 21.40 69.10 ± 16.00 275.40 ± 4.70 ✓
BDSH baby 66.88 ± 0.0 16.9 ± 0.3 4.3 ± 0.0 85.16 ± 1.1 ✓
BDSH lamp 66.31 ± 0.0 16.7 ± 0.5 14.68 ± 0.0 85.4 ± 2.2 ✓
BDSH mushroom 66.83 ± 0.0 18± 21.0 0.0 12.64 ± 0.0 85.27 ± 1.8 ✓
BDSH rocket 66.17 ± 0.0 13.0 ± 0.0 2.9 ± 0.0 85.01 ± 1.6 ✓
Ours baby 67.06 ± 0.04 11.10 ± 1.40 9.90 ± 0.30 78.70 ± 2.40 ✓
Table 9: VGG11 face unlearning class breakdown
Amnesiac baby 73.70 ± 0.25 53.55 ± 6.44 4.96 ± 0.90 97.95 ± 1.11 ×
Amnesiac lamp 74.18 ± 0.24 9.20 ± 2.38 7.08 ± 1.62 97.62 ± 0.90 ×
Amnesiac mushroom 74.05 ± 0.24 2.52 ± 1.60 4.28 ± 0.94 97.81 ± 1.00 ×
Amnesiac rocket 73.80 ± 0.19 2.43 ± 2.43 2.98 ± 0.86 98.39 ± 0.85 ×
Amnesiac sea 73.65 ± 0.16 64.00 ± 17.71 1.22 ± 0.53 97.64 ± 1.23 ×
SCRUB baby 75.04 ± 0.26 76.99 ± 2.26 71.00 ± 2.92 130.78 ± 2.81 ×
SCRUB lamp 75.44 ± 0.16 41.22 ± 2.77 40.22 ± 6.66 130.45 ± 2.12 ×
SCRUB mushroom 75.40 ± 0.13 15.52 ± 7.61 11.82 ± 0.85 130.79 ± 2.64 ×
SCRUB rocket 62.44 ± 28.38 10.12 ± 22.48 16.71 ± 21.68 131.41 ± 3.75 ×
SCRUB sea 75.05 ± 0.12 91.18 ± 1.56 83.13 ± 3.26 130.69 ± 2.52 ×
SSD baby 71.97 ± 0.00 0.00 ± 0.00 8.80 ± 0.00 83.78 ± 1.04 ×
SSD lamp 73.98 ± 0.00 3.56 ± 0.00 9.00 ± 0.00 84.39 ± 1.09 ×
SSD mushroom 71.99 ± 0.00 0.00 ± 0.00 6.20 ± 0.00 84.03 ± 1.25 ×
SSD rocket 74.96 ± 0.00 4.17 ± 0.00 11.00 ± 0.00 83.91 ± 0.81 ×
SSD sea 74.80 ± 0.00 69.62 ± 0.00 56.20 ± 0.00 83.54 ± 1.42 ×
Teacher baby 74.87 ± 0.17 77.90 ± 1.25 0.38 ± 0.71 78.74 ± 0.59 ×
Teacher lamp 75.18 ± 0.11 35.55 ± 13.07 0.40 ± 0.38 79.47 ± 1.10 ×
Teacher mushroom 74.99 ± 0.13 27.47 ± 14.49 0.25 ± 0.09 90.58 ± 36.91 ×
Teacher rocket 74.86 ± 0.20 48.36 ± 16.87 0.07 ± 0.10 91.51 ± 38.05 ×
Teacher sea 74.71 ± 0.14 85.28 ± 2.58 0.08 ± 0.14 78.36 ± 0.90 ×
UNSIR baby 73.85 ± 0.20 77.66 ± 2.12 77.60 ± 2.52 110.77 ± 0.90 ×
UNSIR lamp 74.17 ± 0.27 44.52 ± 7.40 55.06 ± 6.21 110.70 ± 0.73 ×
UNSIR mushroom 74.03 ± 0.32 47.14 ± 4.00 41.70 ± 8.15 110.58 ± 1.31 ×
UNSIR rocket 74.08 ± 0.24 57.47 ± 10.25 57.44 ± 8.61 111.27 ± 0.66 ×
UNSIR sea 73.80 ± 0.17 90.75 ± 1.83 85.32 ± 3.83 109.55 ± 1.01 ×
BDSH baby 74.25 ± 0.0 59.12 ± 0.0 31.2 ± 0.0 86.05 ± 1.1 ✓
BDSH lamp 75.3 ± 0.0 36.46 ± 0.0 36.8 ± 0.0 85.24 ± 1.3 ✓
BDSH mushroom 75.06 ± 0.0 36.63 ± 0.0 25.74 ± 0.0 86.47 ± 2.5 ✓
BDSH rocket 74.41 ± 0.0 17.54 ± 0.0 12.88 ± 0.0 85.91 ± 1.6 ✓
BDSH sea 72.88 ± 0.0 39.15 ± 0.0 11.84 ± 0.0 85.254 ± 1.2 ✓
Ours baby 73.57 ± 1.28 49.82 ± 19.43 33.32 ± 17.53 79.47 ± 3.63 ✓
Ours lamp 74.05 ± 1.34 32.46 ± 15.76 32.38 ± 19.76 79.73 ± 3.22 ✓
Ours mushroom 74.20 ± 0.69 33.67 ± 12.27 23.52 ± 12.84 79.14 ± 4.00 ✓
Ours rocket 73.68 ± 0.84 19.31 ± 18.31 11.20 ± 7.78 80.14 ± 3.27 ✓
Ours sea 73.07 ± 1.02 37.82 ± 22.10 17.29 ± 20.20 79.04 ± 3.25 ✓