Personalized Federated Learning Survey

In parallel with the rapid adoption of artificial intelligence (AI) empowered by advances in AI research, there has been growing awareness and concerns of data privacy. Recent significant developments in the data regulation landscape have prompted a seismic shift in interest toward privacy-preserving AI. This has contributed to the popularity of Federated Learning (FL), the leading paradigm for the training of machine learning models on data silos in a privacy-preserving manner. In this survey, we explore the domain of personalized FL (PFL) to address the fundamental challenges of FL on heterogeneous data, a universal characteristic inherent in all real-world datasets. We analyze the key motivations for PFL and present a unique taxonomy of PFL techniques categorized according to the key challenges and personalization strategies in PFL. We highlight their key ideas, challenges, opportunities, and envision promising future trajectories of research toward a new PFL architectural design, realistic PFL benchmarking, and trustworthy PFL approaches.


IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 12, DECEMBER 2023 9587

Towards Personalized Federated Learning


Alysa Ziying Tan, Han Yu, Member, IEEE, Lizhen Cui, Member, IEEE, and Qiang Yang, Fellow, IEEE

Abstract: In parallel with the rapid adoption of artificial intelligence (AI) empowered by advances in AI research, there has been growing awareness and concerns of data privacy. Recent significant developments in the data regulation landscape have prompted a seismic shift in interest toward privacy-preserving AI. This has contributed to the popularity of Federated Learning (FL), the leading paradigm for the training of machine learning models on data silos in a privacy-preserving manner. In this survey, we explore the domain of personalized FL (PFL) to address the fundamental challenges of FL on heterogeneous data, a universal characteristic inherent in all real-world datasets. We analyze the key motivations for PFL and present a unique taxonomy of PFL techniques categorized according to the key challenges and personalization strategies in PFL. We highlight their key ideas, challenges, opportunities, and envision promising future trajectories of research toward a new PFL architectural design, realistic PFL benchmarking, and trustworthy PFL approaches.

Index Terms: Edge computing, federated learning (FL), non-IID data, personalized FL (PFL), privacy preservation, statistical heterogeneity.

Manuscript received 12 October 2021; revised 17 January 2022 and 9 March 2022; accepted 16 March 2022. Date of publication 28 March 2022; date of current version 1 December 2023. This work was supported in part by the AI Singapore Program, National Research Foundation, Singapore, under Award AISG2-RP-2020-019; in part by the Alibaba Group through the Alibaba Innovative Research Program and the Alibaba-NTU Singapore Joint Research Institute under Grant Alibaba-NTU-AIR2019B1; in part by the Nanyang Technological University, Singapore; in part by the RIE 2020 Advanced Manufacturing and Engineering Programmatic Fund, Singapore, under Grant A20G8b0102; in part by the Nanyang Assistant Professorship; in part by the Joint SDU-NTU Centre for Artificial Intelligence Research under Grant NSC-2019-011; in part by NSFC under Grant 91846205; in part by the National Key Research and Development Program of China under Grant 2021YFF0900800 and Grant 2018AAA0101100; in part by SDNSFC under Grant ZR2019LZH008; in part by the Shandong Provincial Key Research and Development Program through the Major Scientific and Technological Innovation Project under Grant 2021CXGC010108; and in part by Hong Kong RGC TRS under Grant T41-603/20-R. (Corresponding authors: Han Yu; Lizhen Cui; Qiang Yang.)

Alysa Ziying Tan is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, also with the Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University, Singapore 637335, and also with the Alibaba Group, Hangzhou 310052, China.

Han Yu is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [Link]@[Link]).

Lizhen Cui is with the School of Software, Shandong University, Jinan 250101, China, and also with the Joint SDU-NTU Centre for Artificial Intelligence Research, Shandong University, Jinan 250101, China (e-mail: clz@[Link]).

Qiang Yang is with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, and also WeBank, Shenzhen 518052, China (e-mail: qyang@[Link]).

Color versions of one or more figures in this article are available at [Link]

Digital Object Identifier 10.1109/TNNLS.2022.3160699

2162-237X © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See [Link] for more information.

I. INTRODUCTION

THE pervasiveness of edge devices in modern society, such as mobile phones and wearable devices, has led to the rapid growth of private data originating from distributed sources. In this digital age, organizations are using big data and artificial intelligence (AI) to optimize their processes and performance. While the wealth of data offers tremendous opportunities for AI applications, most of these data are highly sensitive in nature and they exist in the form of isolated islands. This is especially relevant in the healthcare industry where medical data are highly sensitive and they are often collected and reside across different healthcare institutions [1]–[4]. Such circumstances pose huge challenges for AI adoption as data privacy issues are not well addressed by conventional AI approaches. With the recent introduction of data privacy preservation laws, such as the general data protection regulation (GDPR) [5], there is an increasing demand for privacy-preserving AI [6] in order to meet regulatory compliance.

In view of these data privacy challenges, federated learning (FL) [7], [8] has seen growing popularity in recent years. FL is a learning paradigm that enables collaborative training of machine learning models involving multiple data silos in a privacy-preserving manner. The prevailing FL setting assumes a federation of data owners (a.k.a. clients), which may be as small as individual mobile devices to as large as entire organizations, that collaboratively train a model under the orchestration of a central parameter server (a.k.a. the FL server) [7], [8]. The training data are stored locally and are not directly shared during the training process. Most of the existing FL training approaches are derived from the federated averaging (FedAvg) algorithm introduced in [9]. The goal is to train a global model that performs well on most FL clients.

A. Categorization of Federated Learning

FL can be categorized into horizontal FL (HFL), vertical FL (VFL), and federated transfer learning (FTL), according to how data are distributed in terms of feature and sample spaces among participating entities [7]. HFL refers to scenarios whereby participants share the same feature space but have different data samples. It is the most commonly adopted FL setting popularized by Google, which applied HFL to train language models in mobile devices [9]. In VFL, participants have overlapping data samples but differ in the feature space. A typical application scenario would involve the collaboration of multiple organizations from different industry sectors (e.g., a bank and an e-commerce company) which have different data features but may have a large number of shared users. FTL is applicable when participants have little overlap in both the feature space and the sample space. For example, organizations from different industry sectors serving markets in different regions can leverage FTL to collaboratively


build models. Existing PFL works mainly focus on the HFL setting, which makes up the majority of the FL application scenarios [8]. The HFL setting is the focus of this article. For brevity, we use the terms HFL and FL interchangeably in the rest of this survey.

Fig. 1. Concept, motivations, and proposed taxonomy for PFL. (a) CML, which pools data together to train a central ML model. (b) FL, which trains a global model under the orchestration of a central parameter server. Data resides in different data silos. (c) PFL, which addresses the limitations of FL through global model personalization and personalized models learning. The four categories of PFL approaches are: 1) data-based; 2) model-based; 3) architecture-based; and 4) similarity-based.

B. Motivations for Personalized Federated Learning

Fig. 1 shows the key concepts and motivations for centralized machine learning (CML) [10], FL, and PFL. We consider a cloud-based CML setting, where data are pooled together in the cloud server to train an ML model. In this setting, the CML model achieves good generalization from the rich amount of data. However, CML faces bandwidth and latency challenges due to the sheer amount of data transferred to the cloud. It also neither preserves data privacy nor personalizes well.

The FL setting assumes a federation of distributed clients, each with its own private local dataset. As these clients face data scarcity that limits their capacities to train effective local models, they are motivated to join the FL process to obtain a better performing model. FL enables collaborative model training on data silos in a privacy-preserving manner, which sets it apart from the CML setting. In addition, FL is communication-efficient as it only transfers model parameters, which are a fraction of the size of the raw data. By considering privacy and communication constraints, FL can support a wide range of application scenarios, such as the Internet of Things (IoT), that entail privacy, connectivity, bandwidth, and latency challenges in varying edge computing environments [11].

However, the general FL approach faces several fundamental challenges: 1) poor convergence on highly heterogeneous data and 2) lack of solution personalization. These issues deteriorate the performance of the global FL model on individual clients in the presence of heterogeneous local data distributions and may even disincentivize affected clients from joining the FL process. Compared with traditional FL, PFL research seeks to address these two challenges.

1) Poor Convergence on Heterogeneous Data: When learning on nonindependent and identically distributed (non-IID) data, the accuracy of FedAvg is significantly reduced. This performance degradation is attributed to the phenomenon of client drift [12], which results from rounds of local training and synchronization on local data distributions that are non-IID. Fig. 2 shows the effect of client drift on IID and non-IID data. In FedAvg, the server updates move toward the average of client optima. When data are IID, the averaged model is close to the global optimum $w^*$ as it is equidistant to both local optima $w_1^*$ and $w_2^*$. However, when data are non-IID, the global optimum $w^*$ is not equidistant to the local optima. In this illustration, $w^*$ is closer to $w_2^*$. The averaged model $w^{t+1}$ will, therefore, be far from the global optimum $w^*$, and the global model does not converge to its true global optimum. As the FedAvg algorithm experiences convergence issues on


non-IID data, careful tuning of hyperparameters (e.g., learning rate decay) is required to improve learning stability [13].

Fig. 2. Illustration of client drift in FedAvg for two clients with two local steps. (a) IID data setting. (b) Non-IID data setting.

2) Lack of Solution Personalization: In the vanilla FL setting, a single globally shared model is trained to fit the “average client.” As a result, the global model will not generalize well for a local distribution that is very different from the global distribution. Having a single model is often insufficient for practical applications, which often face non-IID local datasets. Taking the example of applying FL to develop language models for mobile keyboards, users from different demographics are likely to have divergent usage patterns due to diverse generational, linguistic, and cultural nuances. Certain words or emojis are likely to be used predominantly by specific groups of users. For such a scenario, a more tailored prediction pattern is needed for each individual user in order for the word suggestions to be meaningful.

C. Contributions

There are several surveys on the general concepts, methods, and applications of FL [7], [14]. Others review FL from the perspectives of privacy [15] and robustness [16]. Our survey focuses on PFL, which studies the problem of learning personalized models to handle statistical heterogeneity under the FL setting. A comprehensive survey on PFL that provides a systematic perspective on this important topic for new researchers is lacking. In this article, we bridge this gap in the current FL literature. Our main contributions are summarized as follows.

1) We provide a succinct overview of FL and its categorization. A detailed analysis of the key motivations for PFL in the current FL settings is also included.
2) We identify personalization strategies to address key FL challenges and offer a unique data-based, model-based, architecture-based, and similarity-based perspective for guiding the review of the PFL literature. Based on this perspective, we propose a hierarchical taxonomy to present existing works on PFL, highlighting the challenges they face, their main ideas, and the assumptions they make which could introduce potential limitations.
3) We discuss commonly adopted public datasets and evaluation metrics in the current literature for PFL benchmarking and offer suggestions on enhancing PFL experimental evaluation techniques.
4) We envision promising future trajectories of research toward new architectural design, realistic benchmarking, and trustworthy approaches toward building PFL systems.

II. STRATEGIES FOR PERSONALIZED FEDERATED LEARNING

In this section, we provide an overview of the PFL strategies which are the basis for our systematic and comprehensive review of existing PFL approaches. We organize the literature around the proposed taxonomy [Fig. 1(c)] that divides PFL methods according to the key challenges and personalization strategies involved.

A. Strategy I: Global Model Personalization

The first strategy addresses the performance issues in training a globally shared FL model on heterogeneous data. When learning on non-IID data, the accuracy of FedAvg-based approaches is significantly reduced due to client drift. Under global model personalization, the PFL setup closely follows the general FL training procedure where a single global FL model is trained. The trained global FL model is then personalized for each FL client through a local adaptation step that involves additional training on each local dataset. This two-step “FL training + local adaptation” approach is commonly regarded as an FL personalization strategy by the FL community [8], [17]. As personalization performance directly depends on the generalization performance of the global model, many PFL approaches aim to improve the performance of the global model under data heterogeneity in order to improve the performance of subsequent personalization on local data. Personalization techniques for this category are classified into data-based and model-based approaches. Data-based approaches aim to mitigate the client drift problem by reducing the statistical heterogeneity among the clients’ datasets, while model-based approaches aim to learn a strong global model for future personalization on individual clients or improve the adaptation performance of the local model.

B. Strategy II: Learning Personalized Models

The second strategy addresses the challenge of solution personalization. In contrast to the global model personalization strategy, which trains a single global model, approaches in this category train individual personalized FL (PFL) models. The goal is to build personalized models by modifying the FL model aggregation process. This is achieved through applying different learning paradigms in the FL setting. Personalization techniques are classified into architecture-based and similarity-based approaches. Architecture-based approaches aim to provide a personalized model architecture tailored to each client, while similarity-based approaches aim to leverage client relationships to improve personalized model performance, where similar personalized models are built for related clients.

In PFL model training, the optimization objective is formulated differently from the vanilla FL setting, as an individual


personalized model is learned for each client. Here, we provide formulations of the optimization objectives under the FL setting and the local learning setting in order to highlight the positioning of PFL approaches. The standard FL objective is given as

$$\min_{w \in \mathbb{R}^d} F(w) := \frac{1}{C} \sum_{c=1}^{C} f_c(w) \tag{1}$$

where $C$ is the number of participating clients, $w \in \mathbb{R}^d$ encodes the parameters of the global model, and

$$f_c(w) := \mathbb{E}_{(x,y) \sim D_c}\left[ f_c(w; x, y) \right] \tag{2}$$

represents the expected loss over the data distribution $D_c$ of client $c$. The prevailing FL formulation minimizes the aggregation of local functions and entails a common output for all clients using the global model without any personalization. In the presence of data heterogeneity (i.e., the underlying data distributions across the clients are not identical), simply minimizing the average local loss with no personalization will result in poor performance.

At the opposite end of the spectrum, we consider a local learning setting where each client $c$ trains its own model $\theta_c$ locally without any communication with other clients. The objective is given as

$$\min_{\theta_1, \ldots, \theta_C \in \mathbb{R}^d} F(\theta) := \frac{1}{C} \sum_{c=1}^{C} f_c(\theta_c) \tag{3}$$

where $\theta_c \in \mathbb{R}^d$ encodes the parameters of the local model of client $c$. In this setting, the resulting models may not achieve good generalization performance as the number of training examples that the local models are exposed to is limited. Stronger generalization guarantees can be obtained with more collaboration amongst clients to exploit the pool of knowledge for model training.

Comparing the formulations of the standard FL and local learning settings, standard FL facilitates collaboration and knowledge sharing amongst clients but does not entail personalized outputs as it relies on a shared global model for client inference. On the other hand, local learning entails a fully personalized model for each client but fails to leverage potential performance gains from interclient collaboration. Given the need to achieve a balance between generalization and personalization performance, PFL approaches fall between the standard FL setting and the local learning setting.

III. STRATEGY I: GLOBAL MODEL PERSONALIZATION

In this section, we survey PFL approaches following the global model personalization strategy. The main setup and configurations for these approaches are shown in Fig. 3. Based on our proposed taxonomy, they are divided into data-based approaches and model-based approaches as follows.

Fig. 3. Setup and configurations of approaches that fall under Strategy I: global model personalization. Data-based approaches: (a) data augmentation and (b) client selection. Model-based approaches: (c) regularized local loss; regularization can be performed: 1) between global and local models and 2) between historical local model snapshots; (d) meta-learning; and (e) TL.

A. Data-Based Approaches

Motivated by the client drift problem arising from federated training on heterogeneous data, data-based approaches aim to


reduce the statistical heterogeneity of client data distributions. This helps to improve the generalization performance of the global FL model.

1) Data Augmentation: As the IID property of training data is a fundamental assumption in statistical learning theory, data augmentation methods to enhance the statistical homogeneity of the data have been extensively studied in the field of machine learning. Oversampling techniques involving synthetic data generation (e.g., SMOTE [18] and ADASYN [19]), and undersampling techniques (e.g., Tomek links [20]) have been proposed to reduce data imbalance. These techniques, however, cannot be directly applied under the FL setting, where data residing at the clients in the federation are distributed and private.

Data augmentation in FL [Fig. 3(a)] is highly challenging as it often requires some form of data sharing or relies on the availability of a proxy dataset that is representative of the overall data distribution. Zhao et al. [21] proposed a data sharing strategy that distributes a small amount of global data balanced by classes to each client. Their experiments show that there is potential for significant accuracy gains (∼30%) with the addition of a small amount of data. Jeong et al. [22] proposed FAug, a federated augmentation approach that involves training a generative adversarial network (GAN) model in the FL server. Some data samples of the minority classes are uploaded to the server to train the GAN model. The trained GAN model is then distributed to each client to generate additional data to augment its local data to produce an IID dataset. Duan et al. [23] proposed Astraea, a self-balancing FL framework to handle class imbalance by using Z-score-based data augmentation and down-sampling of local data. The FL server requires statistical information about clients’ local data distributions (e.g., class sizes, mean, and standard deviation values). Wu et al. [24] proposed the FedHome algorithm that trains a generative convolutional autoencoder (GCAE) model using FL. At the end of the FL procedure, each client performs further personalization on a locally augmented class-balanced dataset. This dataset is generated by executing the SMOTE algorithm on the low-dimensional features of the encoder network based on the local data.

2) Client Selection: Another line of work focuses on designing FL client selection mechanisms to enable sampling from a more homogeneous data distribution, with the aim of improving model generalization performance [Fig. 3(b)]. Wang et al. [25] proposed FAVOR, which selects a subset of participating clients for each training round in order to mitigate the bias introduced by non-IID data. A deep Q-learning formulation for client selection was designed with the objective of maximizing accuracy while minimizing the number of communication rounds. In a similar approach, a client selection algorithm based on the multi-armed bandit formulation was proposed in [26] to select the subset of clients with minimal class imbalance. The local class distributions are estimated by comparing the similarity between the local gradient updates submitted to the FL server with the gradients inferred from a balanced proxy dataset residing on the server.

More recently, an emerging line of work focuses on developing client selection strategies to tackle data and resource heterogeneity challenges that are prevalent in edge computing applications. For cross-device FL, there is often significant variability in hardware capabilities in terms of computation and communication capacities. Heterogeneity also exists in data, whereby the quantity and distribution of data differ among clients. Such diversity exacerbates challenges, such as communication costs, stragglers, and model accuracy. Chai et al. [27] proposed a tier-based FL system (TiFL) that groups clients into tiers based on training performance. The algorithm adaptively selects participating clients from the same tier for each training round by optimizing both accuracy and training time. This helps alleviate the performance issues caused by data and resource heterogeneity. Li et al. [28] proposed FedSAE, a self-adaptive FL system that adaptively selects clients with larger local training loss in each training round to accelerate the convergence of the global model. A prediction mechanism of the affordable workload of each client is also proposed to enable the dynamic adjustment of the number of local training epochs for each client in order to improve device reliability.

B. Model-Based Approaches

Although data-based approaches improve the convergence of the global FL model by mitigating the client drift problem, they generally need to modify the local data distributions. This may result in the loss of valuable information associated with the inherent diversity of client behaviors. Such information can be useful for personalizing the global model for each client. In this section, we cover model-based global model personalization FL approaches. The objective is either to learn a strong global FL model for future personalization on each individual client or to improve the adaptation performance of the local model.

1) Regularized Local Loss: Model regularization is a common strategy for preventing overfitting and improving convergence when training machine learning models. In FL, regularization techniques can be applied to limit the impact of local updates. This improves convergence stability and the generalization of the global model, which, in turn, can be used to produce better personalized models. Instead of just minimizing the local function $f_c(\theta)$, each client $c$ minimizes the following objective:

$$\min_{\theta \in \mathbb{R}^d} h_c(\theta; w) := f_c(\theta) + \ell_{\mathrm{reg}}(\theta; w) \tag{4}$$

where $\ell_{\mathrm{reg}}(\theta; w)$ is the regularization loss, which is generally formulated as a function of the global model $w$ and the local model $\theta_c$ of client $c$. Regularization can be applied in the following ways, as shown in Fig. 3(c).

2) Between Global and Local Models: Several works implement regularization between the global and local models to tackle the client drift problem that is prevalent in FL due to statistical data heterogeneity. FedProx [29] introduced a proximal term to the local subproblem which considers the dissimilarity between the global FL model and local models to adjust the impact of local updates. Along with model dissimilarity, FedCL [30] further considers parameter importance in the regularized local loss function using elastic weight


consolidation (EWC) [31] from the field of continual learning. which aims to learn a global model that performs well on
The importance of the weights to the global model is estimated most participating clients, the new goal is transformed to
on a proxy dataset in the FL server. They are then transferred to learn a good initial global model that performs well on a
the clients where penalization steps are carried out to prevent new heterogeneous task after it is updated with a few steps
important parameters of the global model from being changed of gradient descent. This problem formulation is suitable for
when adapting the global model to clients’ local data. Doing so learning an improved global model initialization for stronger
alleviates the weight divergence between the local and global personalization on local data silos with heterogeneous distri-
models while preserving the knowledge of the global model butions. However, this approach is computationally expensive
to improve generalization. Recently, SCAFFOLD [12] uses due to the need to compute second-order gradients. To reduce
variance reduction to alleviate the effect of client drifting that computational overhead, the authors evaluated 2 forms of
causes weight divergence between the local and global models. gradient approximations: 1) FO-MAML [34], which replaces
The update directions of the global (v) and local (v c ) models the gradient estimate with its first-order approximation where
are estimated. The difference, (v−v c ), is added as a component the Hessian term is ignored and 2) HF-MAML [38], which
of the local loss function to correct local updates. replaces the Hessian-vector product with gradient differences.
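
As an illustration of the regularized local objective in (4), the sketch below instantiates the regularization loss with a FedProx-style proximal term, (mu / 2) * ||theta - w||^2, and runs plain gradient descent on one client. The least-squares local loss, the helper names `local_loss_grad` and `fedprox_local_update`, the synthetic data, and the values of `mu`, `lr`, and `steps` are all assumptions made for this example; they are not taken from the cited works.

```python
import numpy as np

def local_loss_grad(theta, X, y):
    """Gradient of an assumed local loss f_c(theta) = mean((X @ theta - y)**2) / 2."""
    residual = X @ theta - y
    return X.T @ residual / len(y)

def fedprox_local_update(w_global, X, y, mu=0.1, lr=0.1, steps=50):
    """Gradient descent on h_c(theta; w) = f_c(theta) + (mu / 2) * ||theta - w||^2.

    The proximal term pulls the local model theta back toward the global
    model w, which limits how far local training can drift.
    """
    theta = w_global.copy()
    for _ in range(steps):
        grad = local_loss_grad(theta, X, y) + mu * (theta - w_global)
        theta = theta - lr * grad
    return theta

# Synthetic client data (hypothetical): the local optimum is far from w_global.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w_global = np.zeros(3)

theta_free = fedprox_local_update(w_global, X, y, mu=0.0)  # no regularization
theta_prox = fedprox_local_update(w_global, X, y, mu=5.0)  # strong proximal term

# Distance of each local model from the global model after local training.
drift_free = np.linalg.norm(theta_free - w_global)
drift_prox = np.linalg.norm(theta_prox - w_global)
```

Setting `mu=0.0` recovers unregularized local training, so comparing `drift_free` with `drift_prox` gives a rough sense of how the proximal term trades local fit for closeness to the global model.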
3) Between Historical Local Model Snapshots: Recently, It has been found that HF-MAML achieves better gradient
a contrastive learning-based FL—MOON [32] has been pro- approximation.
posed. The goal of MOON is to reduce the distance between The idea of Per-FedAvg has been extended in [39] to
the representations learned by the local models and the global propose a federated meta-learning formulation using Moreau
model (i.e., to alleviate weight divergence), and increase the envelopes (pFedMe). It incorporates an l2 -norm regularization
distance between the representations learned between a given loss which can control the balance between personalization
local model and its previous local model (i.e., to speed up con- and generalization performance. It has achieved improved
vergence). This emerging approach enables each client to learn accuracy and convergence over FedAvg and Per-FedAvg.
a representation close to the global model to minimize local Khodak et al. [40] proposed the ARUBA framework. It is
model divergence. It also speeds up learning by encouraging based on online learning to achieve adaptive meta-learning
the local model to improve from its previous version. under FL settings. When combined with FedAvg, it improves
a) Meta-learning: Commonly known as “learning to model generalization performance and eliminates the need for
learn,” meta-learning aims to improve the learning algorithm hyperparameter optimization during personalization.
through exposure to a variety of tasks (i.e., datasets) [33]. This b) Transfer learning: TL is commonly used for model
enables the model to learn a new task quickly and effectively. personalization in nonfederated settings [41]. It aims to trans-
Optimization-based meta-learning algorithms, such as model- fer knowledge from a source domain to a target domain,
agnostic meta-learning (MAML) [34] and Reptile [35], are where both domains are often different but related. TL is an
known for their good generalization and fast adaptation on new efficient approach that leverages knowledge transfer from a
heterogeneous tasks. They are also model agnostic and can be pretrained model, thereby avoiding the need to build models
applied to any gradient descent-based approaches, enabling from scratch. TL-based PFL approaches have also emerged.
applications in supervised learning and reinforcement learning. FedMD [42] is an FL framework based on TL and knowledge
Jiang et al. [36] drew parallels between meta-learning and distillation (KD) for clients to design independent models
FL. Meta-learning algorithms run in two phases: meta-training using their own private data. Before the FL training and KD
and meta-testing. The authors mapped the meta-training step phases, TL is first carried out using a model pretrained on a
in MAML to the FL global model training process, and public dataset. Each client then fine-tunes this model on its
the meta-testing step to the FL personalization process in private data.
which a few steps of gradient descent are performed on local Domain adaptation TL techniques are commonly adopted
data during local adaptation. They also show that FedAvg is to achieve PFL. These techniques aim to reduce the domain
analogous to the Reptile algorithm and is in fact equivalent discrepancy between the trained global FL model (i.e., the
when all clients possess equal amounts of local data. Given the source domain) and a given local model (i.e., the target
similarities in the formulations of meta-learning and FL algo- domain) for improved personalization. There are several stud-
rithms, meta-learning techniques can be applied to improve the ies in FL that uses TL in the healthcare domain for model
global FL model, while achieving fast personalization on the personalization (e.g., FedHealth [43] and FedSteg [44]). The
clients [Fig. 3(d)]. training procedure generally involves three steps: 1) training
Per-FedAvg [37], which is a variant of FedAvg built on top a global model via FL; 2) training local models by adapting
of the MAML formulation, has also been proposed the global model on local data; and 3) training personalized
models by refining the local model using the global model
1 
C
via TL. In order to enable domain adaptation, an alignment
min F(w) := f c (w − α∇ fc (w)) (5)
w∈Rd C c=1 layer, such as the correlation alignment (CORAL) layer [45],
is often added before the softmax layer for adaptation of
where α > 0 is the step size. The cost function can be written the second-order statistics of the source and target domains
as the average of meta-functions F1 , . . . , Fc , where Fc (w) := [Fig. 3(e)].
f c (w−α∇ f c (w)) is the meta-function associated with client c. To reduce training overhead in deep neural networks, the
In contrast to the optimization objective of FedAvg in (1) lower layers of the global model are often transferred and

Authorized licensed use limited to: UNIV OF MASS-LOWELL. Downloaded on April 13,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
TAN et al.: TOWARDS PERSONALIZED FEDERATED LEARNING 9593

TABLE I
SUMMARY OF PERSONALIZATION TECHNIQUES IN GLOBAL MODEL PERSONALIZATION
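
To make the Per-FedAvg objective in (5) and the first-order approximation concrete, the following is a minimal sketch rather than the authors' implementation. It assumes hypothetical one-dimensional quadratic client losses f_c(w) = 0.5 a_c (w - b_c)^2, chosen only because their gradient and Hessian are available in closed form. All function names are illustrative.

```python
# Assumption: each client c has a toy loss f_c(w) = 0.5 * a_c * (w - b_c)^2,
# described by the pair (a_c, b_c). This is for illustration only.

def loss(w, a, b):
    return 0.5 * a * (w - b) ** 2

def grad(w, a, b):
    return a * (w - b)

def fedavg_objective(w, clients):
    """Average of the plain client losses f_c(w)."""
    return sum(loss(w, a, b) for a, b in clients) / len(clients)

def per_fedavg_objective(w, clients, alpha):
    """Average of the meta-functions F_c(w) = f_c(w - alpha * grad f_c(w)), as in (5)."""
    return sum(loss(w - alpha * grad(w, a, b), a, b) for a, b in clients) / len(clients)

def exact_meta_grad(w, a, b, alpha):
    """Exact gradient of F_c; the Hessian of the toy loss is simply a."""
    adapted = w - alpha * grad(w, a, b)
    return (1.0 - alpha * a) * grad(adapted, a, b)

def first_order_meta_grad(w, a, b, alpha):
    """First-order approximation: the Hessian factor is ignored."""
    adapted = w - alpha * grad(w, a, b)
    return grad(adapted, a, b)

clients = [(1.0, 0.0), (1.0, 2.0)]  # two clients with different optima
print(fedavg_objective(1.0, clients))            # loss before local adaptation
print(per_fedavg_objective(1.0, clients, 0.5))   # loss after one local step
```

In this toy setting the meta-objective is lower than the FedAvg objective at the same w because it scores the model after one personalization step, and the first-order gradient differs from the exact one only by the dropped factor (1 - alpha * a_c).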

reused directly in the local models as low-level generic features are learned. Other layers of the local model are fine-tuned with the local data to learn task-specific features for personalization.

c) Summary: In this section, we have discussed data-based approaches and model-based approaches for global model personalization. We now summarize and compare the personalization techniques in terms of their advantages and disadvantages (as shown in Table I).

Data-based approaches aim to reduce the statistical heterogeneity of client data distributions to tackle the problem of client drift. Data augmentation methods are easy to implement in the general FL training procedure. However, the applicability of these data augmentation methods is limited because the possibility of privacy leakage from data sharing has not been adequately addressed in existing designs. Data samples [21], [22] or data statistics about the clients' data distributions [23] are often shared during the training process. Client selection methods improve the model generalization performance by optimizing the subset of participating clients for each FL communication round. As this requires computationally intensive algorithms, such as deep Q-learning [25] and multi-armed bandits [26], it incurs higher computational overhead than FedAvg. In addition, many of these data-based approaches assume the availability of a proxy dataset that is representative of the global data distribution [21], [22], [26]. In order to construct such a proxy dataset, it is necessary to understand the global data distribution, which is challenging under FL settings due to privacy-preservation concerns.

Model-based approaches closely follow the general FL training procedure in which a single global model is trained. Regularization methods, such as FedProx [29] and MOON [32], are easy to implement and they only require a slight modification to the FedAvg algorithm. Meta-learning optimizes the global model for fast personalization. However, gradient approximations are needed as it is expensive to compute second-order gradients [34], [37]. TL improves personalization by reducing the domain discrepancy between the global and local models. As the above-mentioned approaches assume a single-global-model setup where one global model is learned over heterogeneous data silos, they are not well-suited for solution personalization when there are significant differences among the client data distributions. In addition, model-based approaches generally assume that all clients and the FL server share a common model architecture. This assumption requires all clients to have sufficient computation and communication resources. However, edge computing FL clients are often resource-constrained [11], making such approaches unsuitable.

IV. STRATEGY II: LEARNING PERSONALIZED MODELS

In this section, we survey PFL approaches following the strategy of learning personalized models. The main setup and configurations for these approaches are shown in Fig. 4. Based on our proposed taxonomy, they are divided into architecture-based approaches and similarity-based approaches as follows.

A. Architecture-Based Approaches

Architecture-based PFL approaches aim to achieve personalization through a customized model design that is tailored to each client. Parameter decoupling methods implement personalization layers for each client, while KD methods support personalized model architectures for each client.

1) Parameter Decoupling: Parameter decoupling aims to achieve PFL by decoupling the local private model parameters from the global FL model parameters. Private parameters are trained locally on the clients and not shared with the FL server. This enables task-specific representations to be learned for enhanced personalization.

The division between private and federated model parameters is an architectural design decision. There are generally two configurations used in parameter decoupling for deep feed-forward neural networks [Fig. 4(a)]. The first is a "base layers + personalized layers" design proposed by [46]. In this setting, personalized deep layers are kept private by the clients for local training to learn personalized task-specific

9594 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 12, DECEMBER 2023

Fig. 4. Setup and configurations of approaches that fall under Strategy II: learning personalized models. Architecture-based approaches: (a) parameter decoupling, where parameter privatization designs include 1) personalized layers and 2) personalized feature representations; (b) KD, where knowledge can be distilled 1) toward clients, 2) toward server, 3) toward both clients and server, and 4) amongst clients. Similarity-based approaches: (c) MTL, (d) model interpolation, and (e) clustering.
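
The "base layers + personalized layers" configuration of parameter decoupling in Fig. 4(a) can be sketched as follows. This is an illustrative toy, not the design of [46]: models are assumed to be plain dictionaries of parameter lists, and the key names `base` and `head` as well as both helper functions are hypothetical.

```python
def aggregate_base_layers(client_models, base_keys):
    """Server side: average only the shared (base) parameters across clients."""
    n = len(client_models)
    averaged = {}
    for key in base_keys:
        length = len(client_models[0][key])
        averaged[key] = [
            sum(model[key][i] for model in client_models) / n for i in range(length)
        ]
    return averaged

def apply_global_base(client_model, global_base):
    """Client side: overwrite base layers; personalized layers stay untouched."""
    updated = dict(client_model)
    for key, value in global_base.items():
        updated[key] = list(value)
    return updated

# Two toy clients: "base" is federated, "head" is private and never uploaded.
clients = [
    {"base": [1.0, 3.0], "head": [10.0]},
    {"base": [3.0, 5.0], "head": [-10.0]},
]
global_base = aggregate_base_layers(clients, base_keys=["base"])
clients = [apply_global_base(model, global_base) for model in clients]
print(global_base)
print(clients)
```

After one round, both clients hold the same averaged base parameters while each keeps its own head, which is the property that lets the private layers specialize to local data.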

representations, while the base layers are shared with the FL server to learn low-level generic features.

The second design considers personalized feature representations for each client. In [47], a document classification model using a bidirectional LSTM architecture is trained via FL by treating user embeddings as the private model parameters, and character embeddings (i.e., LSTM and MLP layers) as the FL model parameters. In [48], local global FedAvg (LG-FedAvg) has been proposed to combine local representation learning and global federated training. Learning lower dimensional local representations improves communication and computational efficiency for federated global model training. It also offers flexibility as specialized encoders can be designed based on the source data modality (e.g., images and texts). The authors also demonstrated how fair and unbiased representations that are invariant to protected attributes (e.g., race and gender) can be learned by incorporating adversarial learning into FL model training.

As the idea of parameter decoupling has some similarities to split learning (SL) [49], [50], a distributed and private machine learning paradigm, we briefly discuss their differences in this section. In SL, the deep network is split layerwise between the server and the clients. Unlike parameter decoupling, the server model in SL is not transferred to the client for model training. Instead, only the weights of the split layer of the client model are shared during the forward propagation and the gradients from the split layer are shared with the client during backpropagation. SL, therefore, has a privacy advantage over FL as the server and clients do not have full access to the global and local models [51]. However, training is less efficient due to the sequential client training process. SL also performs worse than FL on non-IID data and has higher communication overheads [52].

2) Knowledge Distillation: In server-based HFL [53], the same model architecture is adopted by both the FL server and the FL clients. The underlying assumption is that there is sufficient communication bandwidth and computation capacity at the clients. However, in practical applications with a large number of edge devices as FL clients, the clients are often resource-constrained. Clients may also choose to have different model architectures due to different training objectives. The key motivation for KD in FL is to enable a greater degree of flexibility to accommodate personalized model architectures for the clients. At the same time, it also seeks to tackle communication and computation capacity challenges by reducing resource requirements.

KD for neural networks was introduced in [54] as a paradigm for transferring the knowledge from an ensemble of teacher models to a lightweight student model. Knowledge is commonly represented as class scores or logit outputs in


existing FL distillation approaches. In general, there are four main types of distillation-based FL architectures: 1) distillation of knowledge to each FL client to learn stronger personalized models; 2) distillation of knowledge to the FL server to learn stronger server models; 3) bidirectional distillation to both the FL clients and the FL server; and 4) distillation amongst clients [Fig. 4(b)].

Li and Wang [42] proposed FedMD, a distillation-based FL framework, which allows clients to design diverse models using their own private data via KD. Learning occurs through a consensus computed using the average class scores on a public dataset. For every communication round, each client trains its model using the public dataset based on the updated consensus and fine-tunes its model on its private dataset thereafter. This enables each client to obtain its own personalized model while leveraging knowledge from other clients. Zhu et al. [55] proposed FedGen, a data-free distillation framework that distills knowledge to the FL clients. A generative model is trained in the FL server and broadcast to the clients. Each client then generates augmented representations over the feature space using the learned knowledge as the inductive bias to regulate its local learning.

Lin et al. [56] proposed the FedDF algorithm. It assumes a setting in which the edge clients require different model architectures due to diverse computational capabilities. The FL server constructs p distinct prototype models, each representing clients with identical model architectures (e.g., ResNet and MobileNet). For each communication round, FedAvg is first performed among clients from the same prototype group to initialize a student model. Cross-architecture learning is then performed via ensemble distillation, in which the client (teacher) model parameters are evaluated on an unlabeled public dataset to generate logit outputs that are used to train each student model in the FL server.

Knowledge may also be distilled in a bidirectional manner between the FL client and FL server within the same FL training procedure. He et al. [57] proposed group knowledge transfer (FedGKT) to improve model personalization performance for resource-constrained edge devices. It uses alternating minimization to train small edge models and a large server model through a bidirectional distillation approach. The large server model takes extracted features from the local models as inputs and uses the KL-divergence loss to minimize the difference between the ground truth and soft labels predicted by the local models. By doing so, the server model absorbs the knowledge transferred from the local models. Similarly, each local model calculates the KL-divergence loss using its private dataset and the predicted soft labels transferred from the server. This facilitates knowledge transfer from the server model to the local models. Using this bidirectional distillation framework, the computation burden is shifted from the edge clients to the more powerful FL server. However, there is a potential privacy risk as the ground truth labels from each client are uploaded to the FL server.

KD-based PFL may also be carried out in distributed settings where knowledge is transferred amongst neighboring clients in a network. Bistritz et al. [58] proposed an architecture-agnostic distributed algorithm for on-device learning, D-Distillation. It assumes an IoT edge FL setting, in which every edge device is connected to only a few neighboring devices. Only connected devices can communicate with each other. The learning algorithm is semisupervised, with local training performed on private data and federated training on an unlabeled public dataset. For each communication round, each client broadcasts its soft decisions to its neighbors while receiving their soft decision broadcasts. Each client then updates its soft decisions based on its neighbors' soft decisions via a consensus algorithm. The updated soft decisions are then used to update the client's model weights by regularizing its local loss. This procedure facilitates model learning via knowledge transfer amongst neighboring FL clients in a network.

B. Similarity-Based Approaches

Similarity-based approaches aim to achieve personalization by modeling client relationships. A personalized model is learned for each client, with related clients learning similar models. Different types of client relationships have been studied in PFL. Multitask learning (MTL) and model interpolation consider pairwise client relationships while clustering considers group-level client relationships.

1) Multitask Learning: The goal of MTL is to train a model that jointly performs several related tasks. This improves generalization by leveraging domain-specific knowledge across the learning tasks. By treating each FL client as a task in MTL, there is potential to learn and capture relationships among the clients exhibited by their heterogeneous local data [Fig. 4(c)]. The MOCHA algorithm [59] has been proposed to extend distributed MTL into the FL settings. MOCHA uses a primal-dual formulation to optimize the learned models. The algorithm addresses communication and system challenges prevalent in FL which are not considered in the field of MTL. Unlike the conventional FL design which learns a single global model, MOCHA learns a personalized model for each FL client. While MOCHA improves personalization, it is not suitable for cross-device FL applications as all clients are required to participate in every round of FL model training. Another drawback of MOCHA is that it is only applicable to convex models and is, thus, unsuitable for deep learning implementations. This motivated [60] to propose the VIRTUAL federated MTL algorithm that performs variational inference using a Bayesian approach. Although it can handle nonconvex models, it is computationally expensive for large-scale FL networks.

Huang et al. [61] proposed FedAMP, an attention-based mechanism that enforces stronger pairwise collaboration amongst FL clients with similar data distributions. In contrast to the standard FL framework in which a single global model is maintained by the server, FedAMP maintains a personalized cloud model $u_c$ for each client in the server. The personalized cloud model $u_c = \xi_{c,1}\theta_1 + \cdots + \xi_{c,m}\theta_m$ is the linear combination of the local client models $m \in C$, where $\sum_{m \in C} \xi_{c,m} = 1$. In each communication round t, the personalized cloud model $u_c$ is transferred to client c to perform local training on its


own dataset. The local weights are computed as

$$\theta_c^{*} = \arg\min_{\theta \in \mathbb{R}^d} f_c(\theta) + \frac{\mu}{2\alpha} \lVert \theta - u_c \rVert^2 \tag{6}$$

where $\alpha$ is the step size of gradient descent.

FedCurv [62] uses EWC to prevent catastrophic forgetting when moving across learning tasks. Parameter importance is estimated using the Fisher information matrix and penalization steps are carried out to preserve important parameters. At the end of each communication round, each client sends its updated local parameters and the diagonal of its Fisher information matrix to the server. These parameters will be shared among all clients to perform local training in the next round.

2) Model Interpolation: In [63], a new formulation that learns personalized models using a mixture of global and local models has been proposed to balance generalization with personalization [Fig. 4(d)]. Each FL client learns an individual local model. A penalty parameter λ is used to discourage the local models from being too dissimilar from the mean model. Pure local model learning occurs when λ is set to zero. This is equivalent to the fully PFL setting in (3), where each client trains its own model locally without any communication with other clients. As λ increases, mixed model learning occurs and the local models become increasingly similar to each other. The setting approximates global model learning in which all local models are forced to be identical when λ approaches infinity. In this way, the degree of personalization can be controlled. In addition, the authors proposed a communication-efficient variant of SGD known as the loopless local gradient descent (L2GD). Through a probabilistic framework that determines whether a local GD step or a model aggregation step is to be performed, the number of communication rounds is reduced significantly.

In a related line of work, [64] proposed the APFL algorithm with the goal of finding the optimal combination of global and local models in a communication-efficient manner. They introduced a mixing parameter for each client which is adaptively learned during the FL training process to control the weights of the global and local models. This enables the optimal degree of personalization for each client to be learned. The weighting factor on a particular local model is expected to be larger if the local and global data distributions are not well-aligned, and vice versa. A similar formulation involving the joint optimization of local and global models to determine the optimal interpolation weight has been proposed in [17].

Recently, [65] proposed the HeteroFL framework which trains local models with diverse computational complexities, based on a single global model. By adaptively allocating local models of different complexity levels according to the computation and communication capabilities of each client, it achieves PFL to address system heterogeneity in edge computing scenarios.

3) Clustering: For applications in which there are inherent partitions among clients or data distributions that are significantly different, adopting a client-server FL architecture to train a shared global model is not optimal. A multimodel approach in which an FL model is trained for each homogeneous group of clients is more suitable [Fig. 4(e)]. Several recent works focus on clustering for FL personalization. The underlying assumption of clustering-based FL is the existence of a natural grouping of clients based on their local data distributions.

In [66], hierarchical clustering has been incorporated into FL as a postprocessing step. An optimal bipartitioning algorithm based on the cosine similarity of the gradient updates from the clients is used to divide the FL clients into clusters. As multiple communication rounds are needed to separate all incongruent clients, the recursive bipartitioning clustering framework incurs high computation and communication costs that limit practical feasibility for large-scale settings. Another hierarchical clustering framework for FL has been proposed in [67]. It uses an agglomerative hierarchical clustering formulation that reduces clustering to a single step to lower computation and communication loads. The procedure begins by first training a global FL model for t communication rounds. The global model is then fine-tuned on the private datasets of all clients to determine the difference $\Delta w$ between the global model parameters $w$ and the local model parameters $\theta_c$. The $\Delta w$ values for all clients are used as inputs to the agglomerative hierarchical clustering algorithm to generate multiple client clusters. FL training is then performed independently for each client cluster to produce multiple federated models. This approach is designed for a wider range of non-IID settings and allows training on a subset of clients during each round of FL model training. However, computing the pairwise distance between all clients in agglomerative clustering can be computationally intensive when there are a large number of clients.

Other clustering approaches require a fixed number of clusters to be set at the beginning of FL training. Ghosh et al. [68] proposed the iterative federated clustering algorithm (IFCA). Instead of a single global model, the server constructs K global models and broadcasts these models to all clients for local loss computation. Each client is assigned to the one of the K clusters whose global model achieves the lowest loss value on the client's data. Cluster-based FL model aggregation within the cluster partition is then performed by the server. Compared with FedAvg, the communication overhead of IFCA is K times higher as the server needs to broadcast K cluster models to all clients in every communication round.

Huang et al. [69] proposed community-based FL (CBFL) to predict patient hospitalization time and mortality. They trained a denoising autoencoder and performed K-means clustering with a predetermined number of clusters to cluster patients based on the encoded features of their private data. An FL model is then trained for each cluster.

Duan et al. [70] proposed FedGroup, an FL clustering framework that implements a static client clustering strategy and a newcomer client cold start mechanism. FedGroup performs clustering on the local client updates using the K-means++ algorithm [71] based on the Euclidean distance of the decomposed cosine similarity (EDC).

Xie et al. [72] proposed a multicenter formulation that learns multiple global models. It introduces a new


TABLE II
SUMMARY OF PERSONALIZATION TECHNIQUES IN LEARNING PERSONALIZED MODELS
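
As an illustration of the clustering approach, one round of the cluster assignment and within-cluster aggregation described for IFCA [68] might look like the sketch below. It is a simplified toy under stated assumptions, not the published algorithm: each model is a single scalar, the loss is a mean squared error, local training is replaced by taking the mean of the client's data, and all function names are hypothetical.

```python
def mean_squared_loss(model, data):
    """Toy loss: how well a scalar model explains a client's samples."""
    return sum((x - model) ** 2 for x in data) / len(data)

def assign_clusters(client_data, cluster_models, loss_fn):
    """Each client picks the cluster whose model has the lowest loss on its data."""
    return [
        min(range(len(cluster_models)), key=lambda k: loss_fn(cluster_models[k], data))
        for data in client_data
    ]

def aggregate_by_cluster(client_models, assignments, cluster_models):
    """Average client models within each cluster; empty clusters stay unchanged."""
    new_models = list(cluster_models)
    for k in range(len(cluster_models)):
        members = [m for m, a in zip(client_models, assignments) if a == k]
        if members:
            new_models[k] = sum(members) / len(members)
    return new_models

client_data = [[0.9, 1.1], [1.2, 0.8], [5.1, 4.9]]
cluster_models = [0.0, 4.0]  # K = 2 models broadcast by the server

assignments = assign_clusters(client_data, cluster_models, mean_squared_loss)
# Stand-in for local training: each client fits the mean of its own data.
local_models = [sum(d) / len(d) for d in client_data]
cluster_models = aggregate_by_cluster(local_models, assignments, cluster_models)
print(assignments, cluster_models)
```

The sketch also shows where the K-fold communication overhead comes from: every client needs all K models in order to evaluate its loss against each of them.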

distance-based multicenter loss function

$$\mathcal{L} = \frac{1}{C} \sum_{k=1}^{K} \sum_{c=1}^{C} r_c^{(k)} \, \mathrm{Dist}\big(\theta_c, w^{(k)}\big) \tag{7}$$

where $r_c^{(k)}$ indicates that client $c$ is assigned to cluster $k$, and $w^{(k)}$ denotes the model parameters of cluster $k$. Expectation maximization is used to solve the distance-based objective clustering problem and derive the optimal matching of clients to each cluster center. In the E-step, the cluster assignment $r_c^{(k)}$ is updated by fixing $w_c$. $r_c^{(k)}$ is set to 1 if $k = \arg\min_j \mathrm{Dist}(\theta_c, w^{(j)})$. Otherwise, it is set to 0. In the M-step, the cluster centers $w^{(k)}$ are updated with $w^{(k)} = \big(1 / \sum_{c=1}^{C} r_c^{(k)}\big) \sum_{c=1}^{C} r_c^{(k)} w_c$. Finally, $w^{(k)}$ is sent to all clients in cluster $k$ to perform fine-tuning of the local model parameters $\theta_c$ on their private training data. The above-mentioned steps are repeated until convergence.

4) Summary: In this section, we have discussed architecture-based approaches and similarity-based approaches for learning personalized models. We now summarize and compare the personalization techniques in terms of their advantages and disadvantages (as shown in Table II).

Architecture-based approaches aim to achieve personalization through a customized model design that is tailored to each client. As parameter decoupling methods have a simple formulation that implements personalized layers for each client [46], [47], they are limited in their ability to support a high degree of model design personalization. In contrast, KD-based PFL methods provide clients with a greater degree of flexibility to accommodate personalized model architectures. They are also advantageous in communication and computation constrained edge FL settings [56], [57]. However, a representative proxy dataset is often required in the KD process [42], [56]. For both methods, there are some challenges in model building. In parameter decoupling, the classification of private and federated parameters is an architectural design decision, which controls the balance between generalization and personalization performance [46]. Determining the optimal privatization strategy is a research challenge. In KD, the effectiveness of knowledge transfer depends not only on model parameters but also on model architecture. As it may be difficult for the student model to learn well if there is a huge capacity gap between the large teacher model and the small student model [73], [74], it is imperative to determine an optimal design for both the server and client models.

Similarity-based approaches aim to achieve personalization by modeling client relationships. MTL methods, such as FedAMP [61], excel in capturing pairwise client relationships to learn similar models for related clients. As a result, they may be sensitive to poor data quality, which results in the segregation of clients based on their data quality. Model interpolation methods have a simple formulation that learns personalized models using a mixture of global and local models. However, they are likely to experience a degradation in performance in highly non-IID scenarios as they use a single global model as a basis for personalization [64], [65]. Clustering methods are advantageous when there are inherent partitions among clients. However, they incur high computation and communication costs that limit practical feasibility for large-scale settings [66], [67]. Additional architectural components for the management and deployment of the clustering mechanism are also required [67].

V. PFL BENCHMARK AND EVALUATION METRICS

Another important factor for the long-term advancement of the PFL research field is performance benchmarking. In this section, we review and discuss the benchmarks and evaluation metrics used by the existing PFL literature.

A. FL Benchmark Datasets

There are several FL benchmarking frameworks developed in recent years, including FLBench [75], edge AIBench [76], OARF [77], and FedGraphNN [78]. LEAF [79] is one of the earliest and most popular benchmarking frameworks proposed for FL. At the time of writing, it provides six FL datasets covering a range of machine learning tasks, including image classification, language modeling, and sentiment analysis, under


TABLE III
TYPES OF NON-IID DATA CONSIDERED IN PFL RESEARCH
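
One of the non-IID settings covered in this section, label distribution skew simulated with a Dirichlet distribution Dir(α), can be sketched as follows. This is one common variant and an assumption on our part, not a procedure taken from the cited works: for every class, the class's samples are split across clients in proportions drawn from a symmetric Dirichlet. The function names are hypothetical, and only the Python standard library is used.

```python
import random

def dirichlet_sample(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Return one list of sample indices per client.

    Smaller alpha concentrates each class on fewer clients (stronger skew);
    larger alpha spreads every class more evenly across clients.
    """
    rng = random.Random(seed)
    client_indices = [[] for _ in range(num_clients)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        proportions = dirichlet_sample(alpha, num_clients, rng)
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += proportions[c]
            # The last client takes the remainder so that no sample is dropped.
            end = len(idx) if c == num_clients - 1 else round(cumulative * len(idx))
            client_indices[c].extend(idx[start:end])
            start = end
    return client_indices

labels = [i % 10 for i in range(1000)]  # toy dataset: 10 balanced classes
parts = dirichlet_partition(labels, num_clients=5, alpha=0.1)
print([len(p) for p in parts])
```

Comparing the per-client class histograms for a small and a large α is a quick way to check how much heterogeneity a given setting introduces before running any PFL experiment.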

both IID and non-IID settings. Example datasets include the extended MNIST [80] dataset split according to the writers of the character digits, the CelebA [81] dataset split according to the celebrity, and the Shakespeare [9] dataset split according to the characters in the play. A set of accuracy and communication metrics, along with implementation references for well-known approaches, such as FedAvg, SGD, and MOCHA, are also provided. As LEAF extends existing public datasets from traditional machine learning settings, it does not fully reflect the data heterogeneity in FL scenarios. Although there are a few real-world federated datasets, such as a street image dataset for object detection [82] and a species dataset for image classification [83], they are often limited in size.

B. PFL Experimental Evaluation Design

Despite the release of benchmark datasets for FL, they are not widely adopted in PFL research. The vast majority of PFL studies choose to simulate the non-IID setting by performing their own partitioning on a public benchmark dataset used in machine learning (e.g., MNIST [84], EMNIST [80], CIFAR-100 [85]), or creating a synthetic dataset [17], [64], [68]. Here, we survey the different types of non-IID settings simulated in PFL literature and summarize them according to the personalization methods in Table III.

1) Quantity Skew: FL clients hold local datasets of different sizes, with some clients having considerably larger amounts of data than others. Data size heterogeneity is pervasive in real-world FL environments due to diverse usage patterns across FL clients. To simulate data size heterogeneity, data from an imbalanced dataset are used directly without further sampling [23], [72]. Alternatively, data can be distributed to FL clients according to a power law [28], [29], [39].

2) Feature Distribution Skew: The feature distribution $P_c(x)$ varies across clients, while the conditional distribution $P(y|x)$ is the same across clients. For example, in health monitoring applications, the distributions of users' activity data

from different demographics as there are diverse linguistic and cultural nuances that result in certain words or emojis being used predominantly by different users. To model label distribution skew, the dataset is partitioned based on labels, where each client draws samples from a fixed number of label classes k. A smaller k value would mean stronger data heterogeneity [9], [27], [48], [64]. Different levels of label distribution imbalance can be simulated by using a Dirichlet distribution Dir(α), where α controls the degree of data heterogeneity. An α of 100 is equivalent to the IID setting, while a smaller α value means that each client is more likely to hold data from only one class, resulting in high data heterogeneity [55], [56], [86].

4) Label Preference Skew: The conditional distribution $P_c(x|y)$ varies across clients, while the label distribution $P(y)$ is the same across clients. Due to personal preferences, there may be variations in the labels. To model label preference skew, a proportion of labels are often swapped to increase variance in the ground truth labels [66], [67].

From Table III, the evaluation of PFL algorithms is limited to a single type of non-IID setting in most existing studies. Feature distribution and label distribution skew are most commonly considered to simulate the non-IID setting in PFL studies. Label preference skew settings have only been adopted by clustering-based PFL approaches. Other PFL approaches have not been studied under this type of non-IID setting. A collective effort by the FL research community is needed to align and adopt benchmarks in order to standardize experimental evaluation design in PFL research.

5) PFL Evaluation Metrics: We categorize the evaluation metrics adopted in PFL research into: 1) model performance-related; 2) system performance-related; and 3) trustworthy AI-related (Table IV).

Model performance can be measured in terms of accuracy and convergence. Most PFL works adopt the average test accuracy of personalized models to measure model accuracy.
vary considerably according to their habits and lifestyle pat- While using an aggregated accuracy metric may be adequate
terns [24], [43]. To model feature distribution skew, a dataset to evaluate the performance of vanilla FL which trains a
that is partitioned by users is often used with each user single globally shared model, such a metric cannot reflect the
associated with a different client [24], [59]. It can also be performance of individual personalized models. As such, there
simulated by augmenting datasets via rotations [68]. are PFL works that use distribution-based evaluation frame-
3) Label Distribution Skew: The label distribution Pc (y) works, such as histogram profiling [61], [87], variance metrics
varies across clients, while the conditional distribution P(x|y) [37], [55], [88], and metrics at the individual client level
is the same across clients. For example, in software mobile [24], [43], to evaluate the performance of personalized models.
keyboards, label distribution skew is a likely problem for users As each client experiences a different baseline accuracy due to

TAN et al.: TOWARDS PERSONALIZED FEDERATED LEARNING 9599

TABLE IV
EVALUATION METRICS ADOPTED BY PFL RESEARCH

statistical data heterogeneity, measuring the changes in model accuracy before and after personalization is a useful approach to assess the benefits of personalization [30], [87], [89]. Model convergence is measured by training loss [28], [32], [64], [66], [67], number of communication rounds [12], [32], number of local training epochs [23], [32], [39], and formalization of convergence bounds [12], [29], [37].
System performance metrics focus on communication efficiency, computational efficiency, system heterogeneity, system scalability, and fault tolerance. Communication efficiency is evaluated by the number of communication rounds [12], [32], the number of parameters [23], [57], [65], and message sizes [58], [90]. Computational efficiency is evaluated in terms of the number of FLOPs [57], [65] and training time [27], [57]. System heterogeneity is assessed by simulating variations in hardware capabilities and network conditions. This can be achieved by varying the number of local training epochs [59], [61], CPU resources [27], and local model complexity [56], [65]. System scalability is evaluated in terms of performance on a large number of clients [32], total elapsed time [23], [29], and total memory consumption [29], [65]. Fault tolerance is measured in terms of performance under different ratios of dropped out clients [59], [61] and stragglers [29], [55].
Trustworthy AI metrics have not been extensively adopted to evaluate PFL approaches. There are a few emerging works that consider these metrics [89]. In [48], local model fairness and robustness against adversarial attacks have been used to evaluate the performance of the proposed approach.
The current direction for the evaluation of personalization performance in PFL research focuses primarily on accuracy gains in terms of model performance. However, the costs for achieving PFL should also be considered. While seeking accurate models, there are often tradeoffs in terms of system scalability, communication, and computation overheads. The fulfillment of trustworthy AI attributes is also not sufficiently considered. It is important to design an effective PFL framework that jointly optimizes these cost-benefit objectives that are important in real-world FL applications. Given that PFL faces unique challenges and application scenarios, it is imperative to strengthen the development of evaluation metrics that are tailored to PFL.

VI. PROMISING FUTURE RESEARCH DIRECTIONS
The field of PFL is starting to gain traction as practical FL applications begin to demand models with better personalization performance. Based on our review of the existing PFL literature, we envision promising future trajectories of research toward new PFL architectural design, realistic benchmarking, and trustworthy PFL approaches.

A. Opportunities for PFL Architectural Design
1) Client Data Heterogeneity Analytics: The heterogeneity of data among FL clients is a key consideration when assessing the type of PFL required. For example, a multimodel approach, such as clustering, is preferred for applications where there are inherent partitions or data distributions that are significantly different. In order to facilitate experimentation on non-IID data, recent works in PFL have proposed metrics, such as total-variation, 1-Wasserstein [37], and Earth mover's distance (EMD) [21] to quantify the statistical heterogeneity of data distributions. However, these metrics can only be calculated with access to raw data. The problem of FL client data heterogeneity analysis in a privacy-preserving manner remains open.
2) Aggregation Procedure: In more complex PFL scenarios, averaging-based model aggregation may not be an ideal approach in handling data heterogeneity. Model averaging is adopted in most prevailing FL architectures, and its effectiveness as an aggregation method has not been well-studied for PFL from a theoretical perspective [91]. Recently, [92] proposed a layerwise matched averaging formulation for CNN and LSTM architectures. Specialized aggregation procedures for PFL are to be explored.
3) PFL Architecture Search: In the presence of statistical heterogeneity, federated neural architectures are highly sensitive to hyperparameter choices and may, therefore, experience poor learning performance if not tuned carefully [13]. The choice of the FL model architecture also needs to fit the underlying non-IID distribution well. Neural architecture search (NAS) [93] is a promising technique to help PFL reduce manual design effort to optimize the model architecture based on given scenarios. It will be particularly beneficial for parameter decoupling and KD-based PFL methods.
4) Spatial Adaptability: It refers to the ability of PFL systems to handle variations across client datasets as a result of: 1) the addition of new clients and/or 2) dropouts and stragglers. These are practical issues prevalent in complex edge computing-based FL environments, where there is significant variability in hardware capabilities in terms of computation, memory, power, and network connectivity [94].
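As a rough illustration of two ideas discussed in this survey, the Dirichlet-based simulation of label distribution skew and the use of total variation to quantify statistical heterogeneity, the following sketch partitions a labeled dataset across clients and reports how far each client's label distribution lies from the global one. It is a minimal sketch under stated assumptions, not the procedure of any cited work: the function names (`dirichlet_label_partition`, `mean_tv_to_global`) are hypothetical, and the partitioning shown is one common variant in which, for every class, the share of samples assigned to each client is drawn from Dir(α). Note that the metric is computed from raw labels, which is exactly the access requirement noted under Client Data Heterogeneity Analytics.

```python
import numpy as np


def dirichlet_label_partition(labels, num_clients, alpha, rng):
    """Assign sample indices to clients; per-class client shares ~ Dir(alpha).

    Assumption: this per-class variant is one common way to simulate label
    distribution skew; smaller alpha concentrates each class on fewer clients.
    """
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        shares = rng.dirichlet(alpha * np.ones(num_clients))
        # Cumulative shares give the cut points for this class's samples.
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(part.tolist())
    return client_indices


def label_distribution(labels, num_classes):
    """Empirical label distribution as a probability vector."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    return counts / counts.sum()


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()


def mean_tv_to_global(labels, client_indices, num_classes):
    """Average TV distance between client and global label distributions."""
    global_dist = label_distribution(labels, num_classes)
    distances = []
    for idx in client_indices:
        if len(idx) == 0:  # a client may receive no samples for small alpha
            continue
        local = label_distribution(labels[np.asarray(idx, dtype=int)], num_classes)
        distances.append(total_variation(local, global_dist))
    return float(np.mean(distances))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.repeat(np.arange(10), 200)  # synthetic, balanced 10-class labels
    for alpha in (100.0, 1.0, 0.1):
        parts = dirichlet_label_partition(labels, num_clients=10, alpha=alpha, rng=rng)
        print(alpha, mean_tv_to_global(labels, parts, num_classes=10))
```

Reporting such a heterogeneity score alongside the chosen α would make simulated non-IID settings easier to compare across studies, though the score itself is only a proxy for the difficulty of the resulting learning problem.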

9600 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 12, DECEMBER 2023

   1) Existing PFL approaches commonly assume a fixed client pool at the start of an FL training cycle and that new clients cannot join the training process midway [22], [67]. Other approaches involve a pretraining step [42] that requires time for local computation. Besides meta-learning approaches [37] that encourage fast learning on a new client, there is limited work addressing the cold-start problem in PFL. Current deep FL techniques are also prone to catastrophic forgetting of previously learned knowledge when new clients join, due to the stability-plasticity dilemma in neural networks [95]. As a result, existing clients may experience a degradation in performance. A promising direction is to incorporate continual learning [96] into FL to mitigate catastrophic forgetting.
   2) With the prevalence of dropouts and stragglers in large-scale federated systems due to network, communication, and computation constraints, it is necessary to design for robustness in FL systems. Developing communication-efficient algorithms to mitigate the problem of stragglers is an ongoing research direction, where gradient compression [97] and asynchronous model updates [98] are common strategies for addressing FL communication bottlenecks. These issues require further study in PFL to formalize the trade-offs between overhead and performance.
5) Temporal Adaptability: It refers to the ability of a PFL system to learn from nonstationary data. In dynamic real-world systems, we may expect changes in the underlying data distributions over time. This phenomenon is known as concept drift. Learning in the presence of concept drift often involves three steps: (i) drift detection (whether drift has occurred); (ii) drift understanding (when, how, and where the drift occurs); and (iii) drift adaptation (response to drift) [99]. Casado et al. [100] is one of the few works that study the problem of concept drift in FL. It extends FedAvg with the Change Detection Technique (CDT) for drift detection. It remains an open direction to leverage existing drift detection and adaptation algorithms to improve learning on dynamic real-world data in PFL systems.

B. Opportunities for PFL Benchmarking
1) Realistic Datasets: Realistic datasets are important for the development of a field. To facilitate PFL research, datasets that include more modalities, such as audio, video, and sensor signals, and involve a broader range of machine learning tasks from real-world applications are required.
2) Realistic Non-IID Settings: In most existing studies, the evaluation of PFL algorithms is limited to a single type of non-IID setting. Experiments are performed by either leveraging an existing prepartitioned public dataset (e.g., LEAF) or partitioning a public dataset to fit the target non-IID setting. For a fairer comparison, it is imperative for the research community to develop a deeper understanding of the different non-IID settings in real-world FL in order to simulate realistic non-IID settings. Possible scenarios include: 1) temporal skew (changes in the data distributions over time) and 2) the presence of adversarial attackers. Such an effort requires wider collaboration among researchers and industry practitioners and will be beneficial for building a healthy PFL research ecosystem.
3) Holistic Evaluation Metrics: The establishment of systematic evaluation methodologies and metrics is important for PFL research. Model performance, system performance, and trustworthy AI attributes are important aspects to consider when evaluating the performance of an FL system. Methodologies that can provide a holistic cost-benefit analysis on a given PFL approach are needed for potential adopters to gain deeper insight into its real-world impact.

C. Opportunities for Trustworthy PFL
1) Open Collaboration: Besides algorithmic challenges, future PFL research can explore promoting collaboration among self-interested data owners. For instance, data owners with PFL models may need to collaborate by sharing their models with other suitable data owners in order to adapt to changes in the learning task over time in dynamic real-world applications [101]. Incentive mechanism design is a promising research direction toward this vision. Game theory, pricing, and auction mechanisms [102] may be applied to build suitable incentive schemes to support the emergence of open collaborative PFL systems.
2) Fairness: As machine learning technologies become more widely adopted by businesses to support decision-making, there has been a growing interest in developing methods to ensure fairness in order to avoid undesirable ethical and social implications [103], [104]. Current approaches do not adequately address the unique set of fairness-related challenges presented in PFL. These include new sources of bias introduced by the diversity of participating FL clients due to unequal local data sizes, activity patterns, location, and connection quality [8]. The study of fairness in PFL is still in its infancy and the framing of fairness in PFL has not yet been well-defined. The study of fairness in FL is mostly focused on the prevailing server-based FL paradigm [105]–[107], although new work on fairness in alternative FL paradigms is emerging [108]. As FL approaches maturity, advances in improving fairness for PFL, in particular, will become increasingly important in order for FL to be adopted at scale.
3) Explainability: Explainable AI (XAI) [109] is an active research area that has attracted significant interest recently, driven by pressure from government agencies and the general public for interpretable models [110]. It is important for models in high-stakes applications, such as healthcare, to be explainable, where there is a strong need to justify decisions made [111]. Explainability has not yet been systematically explored in the FL literature. There are complex challenges unique to achieving explainability in PFL due to the scale and heterogeneity of distributed datasets. Striving for FL model explainability may also be associated with potential privacy risks from inadvertent data leakage, as demonstrated in [112], where certain gradient-based explanation methods are prone to privacy leakage. There are a few works addressing both explainability and privacy objectives simultaneously. Developing an FL framework that balances the tradeoff between


explainability and privacy is an important future research direction. One possible approach to achieve this tradeoff is to incorporate explainability into the global FL model but not the personalization component of the FL model.
4) Robustness: Although FL offers better privacy protection compared with traditional centralized model training approaches, recent research has exposed vulnerabilities of FL that could potentially compromise data privacy [16]. It is, therefore, of paramount importance to study FL attack methods and develop defensive strategies to counteract these attacks in order to ensure the robustness of the FL system. With more complex protocols and architectures developed for PFL, more work is needed to study related forms of attacks and defenses to enable robust PFL approaches to emerge.

VII. CONCLUSION
In this survey, we provide an overview of FL and discuss the key motivations for PFL. We propose a unique taxonomy of PFL techniques categorized according to the key challenges and personalization strategies in PFL and highlight key ideas, challenges, and opportunities for these PFL approaches. Finally, we discuss commonly adopted public datasets and evaluation metrics in the PFL literature and outline open problems and directions that would inspire further research in PFL. We believe that the discussions in this survey based on our proposed PFL taxonomy will serve as a useful roadmap for aspiring researchers and practitioners to enter the field of PFL and contribute to its long-term development.

ACKNOWLEDGMENT
Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not reflect the views of the funding agencies.

REFERENCES
[1] G. A. Kaissis, M. R. Makowski, D. Rückert, and R. F. Braren, “Secure, privacy-preserving and federated machine learning in medical imaging,” Nature Mach. Intell., vol. 2, no. 6, pp. 305–311, Jun. 2020.
[2] S. Warnat-Herresthal et al., “Swarm learning for decentralized and confidential clinical machine learning,” Nature, vol. 594, no. 7862, pp. 265–270, 2021.
[3] M. J. Sheller et al., “Federated learning in medicine: Facilitating multi-institutional collaborations without sharing patient data,” Sci. Rep., vol. 10, no. 1, pp. 1–12, Dec. 2020.
[4] I. Dayan et al., “Federated learning for predicting clinical outcomes in patients with COVID-19,” Nat. Med., pp. 1–9, 2021.
[5] P. Voigt and A. von dem Bussche, The EU General Data Protection Regulation (GDPR). Springer, 2017.
[6] Y. Cheng, Y. Liu, T. Chen, and Q. Yang, “Federated learning for privacy-preserving AI,” Commun. ACM, vol. 63, no. 12, pp. 33–36, Nov. 2020.
[7] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning: Concept and applications,” ACM Trans. Intell. Syst. Technol., vol. 10, no. 2, pp. 1–19, 2019.
[8] E. B. P. Kairouz and H. B. McMahan, “Advances and open problems in federated learning,” Found. Trends Mach. Learn., vol. 14, no. 1, pp. 1–210, 2021.
[9] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Y. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. AISTATS, 2017, pp. 1273–1282.
[10] G. Drainakis, K. V. Katsaros, P. Pantazopoulos, V. Sourlas, and A. Amditis, “Federated vs. centralized machine learning under privacy-elastic users: A comparative analysis,” in Proc. IEEE NCA, Nov. 2020, pp. 1–8.
[11] W. Y. B. Lim et al., “Federated learning in mobile edge networks: A comprehensive survey,” IEEE Commun. Surveys Tuts., vol. 22, no. 3, pp. 2031–2063, 3rd Quart., 2020.
[12] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning,” in Proc. ICML, 2020, pp. 5132–5143.
[13] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of FedAvg on non-IID data,” in Proc. ICLR, 2020, pp. 1–26.
[14] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Process. Mag., vol. 37, no. 3, pp. 50–60, May 2020.
[15] V. Mothukuri, R. M. Parizi, S. Pouriyeh, Y. Huang, A. Dehghantanha, and G. Srivastava, “A survey on security and privacy of federated learning,” Future Gener. Comput. Syst., vol. 115, pp. 619–640, Feb. 2021.
[16] L. Lyu et al., “Privacy and robustness in federated learning: Attacks and defenses,” 2020, arXiv:2012.06337.
[17] Y. Mansour, M. Mohri, J. Ro, and A. T. Suresh, “Three approaches for personalization with applications to federated learning,” 2020, arXiv:2002.10619.
[18] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, no. 1, pp. 321–357, 2002.
[19] H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” in Proc. IEEE Int. Joint Conf. Neural Netw. (IEEE World Congr. Comput. Intelligence), Jun. 2008, pp. 1322–1328.
[20] M. Kubat and S. Matwin, “Addressing the curse of imbalanced training sets: One-sided selection,” in Proc. ICML, 1997, pp. 179–186.
[21] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-IID data,” 2018, arXiv:1806.00582.
[22] E. Jeong, S. Oh, H. Kim, J. Park, M. Bennis, and S.-L. Kim, “Communication-efficient on-device machine learning: Federated distillation and augmentation under non-IID private data,” 2018, arXiv:1811.11479.
[23] M. Duan, D. Liu, X. Chen, R. Liu, and Y. Tan, “Self-balancing federated learning with global imbalanced data in mobile systems,” IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 1, pp. 59–71, Jul. 2021.
[24] Q. Wu, X. Chen, Z. Zhou, and J. Zhang, “FedHome: Cloud-edge based personalized federated learning for in-home health monitoring,” IEEE Trans. Mobile Comput., early access, Dec. 16, 2020, doi: 10.1109/TMC.2020.3045266.
[25] H. Wang, Z. Kaplan, D. Niu, and B. Li, “Optimizing federated learning on non-IID data with reinforcement learning,” in Proc. IEEE Conf. Comput. Commun. (INFOCOM), Jul. 2020, pp. 1698–1707.
[26] M. Yang, X. Wang, H. Zhu, H. Wang, and H. Qian, “Federated learning with class imbalance reduction,” in Proc. 29th Eur. Signal Process. Conf. (EUSIPCO), Aug. 2021, pp. 2174–2178.
[27] Z. Chai et al., “TiFL: A tier-based federated learning system,” in Proc. 29th Int. Symp. High-Performance Parallel Distrib. Comput., Jun. 2020, pp. 125–136.
[28] L. Li et al., “FedSAE: A novel self-adaptive federated learning framework in heterogeneous systems,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2021.
[29] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” in Proc. Mach. Learn. Syst., vol. 2, 2020, pp. 429–450.
[30] X. Yao and L. Sun, “Continual local training for better initialization of federated models,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2020, pp. 1736–1740.
[31] J. Kirkpatrick et al., “Overcoming catastrophic forgetting in neural networks,” Proc. Nat. Acad. Sci. USA, vol. 114, no. 13, pp. 3521–3526, Mar. 2017.
[32] Q. Li, B. He, and D. Song, “Model-contrastive federated learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 10713–10722.
[33] T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey, “Meta-learning in neural networks: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., p. 1, 2020.
[34] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. ICML, 2017, pp. 1126–1135.
[35] A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learning algorithms,” 2018, arXiv:1803.02999.
[36] Y. Jiang, J. Konečný, K. Rush, and S. Kannan, “Improving federated learning personalization via model agnostic meta learning,” 2019, arXiv:1909.12488.


[37] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach,” in Proc. NIPS, vol. 33, 2020, pp. 3557–3568.
[38] A. Fallah, A. Mokhtari, and A. Ozdaglar, “On the convergence theory of gradient-based model-agnostic meta-learning algorithms,” in Proc. AISTATS, 2020, pp. 1082–1092.
[39] C. T. Dinh, N. Tran, and J. Nguyen, “Personalized federated learning with Moreau envelopes,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 33, 2020, pp. 21394–21405.
[40] M. Khodak, M.-F. Balcan, and A. Talwalkar, “Adaptive gradient-based meta-learning methods,” in Proc. NIPS, vol. 32, 2019, pp. 5917–5928.
[41] S. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, pp. 1345–1359, Nov. 2010.
[42] D. Li and J. Wang, “FedMD: Heterogenous federated learning via model distillation,” 2019, arXiv:1910.03581.
[43] Y. Chen, X. Qin, J. Wang, C. Yu, and W. Gao, “FedHealth: A federated transfer learning framework for wearable healthcare,” IEEE Intell. Syst., vol. 35, no. 4, pp. 83–93, Jul. 2020.
[44] H. Yang, H. He, W. Zhang, and X. Cao, “FedSteg: A federated transfer learning framework for secure image steganalysis,” IEEE Trans. Netw. Sci. Eng., vol. 8, no. 2, pp. 1084–1094, Apr. 2021.
[45] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in Proc. AAAI Conf. Artif. Intell., 2016, pp. 2058–2065.
[46] M. G. Arivazhagan, V. Aggarwal, A. K. Singh, and S. Choudhary, “Federated learning with personalization layers,” 2019, arXiv:1912.00818.
[47] D. Bui et al., “Federated user representation learning,” 2019, arXiv:1909.12535.
[48] P. Pu Liang et al., “Think locally, act globally: Federated learning with local and global representations,” 2020, arXiv:2001.01523.
[49] O. Gupta and R. Raskar, “Distributed learning of deep neural network over multiple agents,” J. Netw. Comput. Appl., vol. 116, pp. 1–8, Aug. 2018.
[50] P. Vepakomma, O. Gupta, T. Swedish, and R. Raskar, “Split learning for health: Distributed deep learning without sharing raw patient data,” 2018, arXiv:1812.00564.
[51] C. Thapa, M. A. P. Chamikara, S. Camtepe, and L. Sun, “SplitFed: When federated learning meets split learning,” 2020, arXiv:2004.12088.
[52] Y. Gao et al., “End-to-end evaluation of federated learning and split learning for Internet of Things,” in Proc. IEEE SRDS, Sep. 2020, pp. 91–100.
[53] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning: Concept and applications,” ACM TIST, vol. 10, no. 2, pp. 1–19, 2019.
[54] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” 2015, arXiv:1503.02531.
[55] Z. Zhu, J. Hong, and J. Zhou, “Data-free knowledge distillation for heterogeneous federated learning,” in Proc. ICML, 2021.
[56] T. Lin, L. Kong, S. U. Stich, and M. Jaggi, “Ensemble distillation for robust model fusion in federated learning,” in Proc. NIPS, vol. 33, 2020, pp. 2351–2363.
[57] C. He, M. Annavaram, and S. Avestimehr, “Group knowledge transfer: Federated learning of large CNNs at the edge,” in Proc. NIPS, vol. 33, 2020, pp. 14068–14080.
[58] I. Bistritz, A. Mann, and N. Bambos, “Distributed distillation for on-device learning,” in Proc. NIPS, vol. 33, 2020, pp. 22593–22604.
[59] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 30, 2017, pp. 4427–4437.
[60] L. Corinzia, A. Beuret, and J. M. Buhmann, “Variational federated multi-task learning,” 2019, arXiv:1906.06268.
[61] Y. Huang et al., “Personalized cross-silo federated learning on non-IID data,” in Proc. AAAI Conf. Artif. Intell., vol. 35, no. 9, 2021, pp. 7865–7873.
[62] N. Shoham et al., “Overcoming forgetting in federated learning on non-IID data,” 2019, arXiv:1910.07796.
[63] F. Hanzely and P. Richtárik, “Federated learning of a mixture of global and local models,” 2020, arXiv:2002.05516.
[64] Y. Deng, M. M. Kamani, and M. Mahdavi, “Adaptive personalized federated learning,” 2020, arXiv:2003.13461.
[65] E. Diao, J. Ding, and V. Tarokh, “HeteroFL: Computation and communication efficient federated learning for heterogeneous clients,” in Proc. ICLR, 2021, pp. 1–24.
[66] F. Sattler, K.-R. Müller, and W. Samek, “Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 8, pp. 3710–3722, Aug. 2021.
[67] C. Briggs, Z. Fan, and P. Andras, “Federated learning with hierarchical clustering of local updates to improve training on non-IID data,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2020, pp. 1–9.
[68] A. Ghosh, J. Chung, D. Yin, and K. Ramchandran, “An efficient framework for clustered federated learning,” in Proc. NIPS, vol. 33, 2020, pp. 19586–19597.
[69] L. Huang, A. L. Shea, H. Qian, A. Masurkar, H. Deng, and D. Liu, “Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records,” J. Biomed. Informat., vol. 99, Nov. 2019, Art. no. 103291.
[70] M. Duan et al., “FedGroup: Efficient federated learning via decomposed similarity-based clustering,” in Proc. IEEE Int. Conf. Parallel Distrib. Process. Appl., Big Data Cloud Comput., Sustain. Comput. Commun., Social Comput. Netw. (ISPA/BDCloud/SocialCom/SustainCom), Sep. 2021, pp. 228–237.
[71] S. Vassilvitskii and D. Arthur, “K-means++: The advantages of careful seeding,” in Proc. ACM-SIAM, 2006, pp. 1027–1035.
[72] M. Xie et al., “Multi-center federated learning,” 2020, arXiv:2005.01026.
[73] Y. Liu et al., “Search to distill: Pearls are everywhere but not the eyes,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 7539–7548.
[74] C. Li et al., “Block-wisely supervised neural architecture search with knowledge distillation,” in Proc. CVPR, Jun. 2020, pp. 1989–1998.
[75] Y. Liang, Y. Guo, Y. Gong, C. Luo, J. Zhan, and Y. Huang, “FLBench: A benchmark suite for federated learning,” in Proc. FICC, 2020, pp. 166–176.
[76] T. Hao et al., “Edge AIBench: Towards comprehensive end-to-end edge computing benchmarking,” in Bench, 2018, pp. 23–30.
[77] S. Hu, Y. Li, X. Liu, Q. Li, Z. Wu, and B. He, “The OARF benchmark suite: Characterization and implications for federated learning systems,” 2020, arXiv:2006.07856.
[78] C. He et al., “FedGraphNN: A federated learning system and benchmark for graph neural networks,” 2021, arXiv:2104.07145.
[79] S. Caldas et al., “LEAF: A benchmark for federated settings,” 2018, arXiv:1812.01097.
[80] G. Cohen, S. Afshar, J. Tapson, and A. Van Schaik, “EMNIST: Extending MNIST to handwritten letters,” in Proc. IJCNN, 2017, pp. 2921–2926.
[81] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 3730–3738.
[82] J. Luo et al., “Real-world image datasets for federated learning,” 2019, arXiv:1910.11089.
[83] T.-M. H. Hsu, H. Qi, and M. Brown, “Federated visual classification with real-world data distribution,” in Proc. ECCV, 2020, pp. 76–92.
[84] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[85] A. Krizhevsky, “Learning multiple layers of features from tiny images,” MIT and NYU, Tech. Rep., 2009.
[86] T.-M. Harry Hsu, H. Qi, and M. Brown, “Measuring the effects of non-identical data distribution for federated visual classification,” 2019, arXiv:1909.06335.
[87] K. Wang, R. Mathews, C. Kiddon, H. Eichner, F. Beaufays, and D. Ramage, “Federated evaluation of on-device personalization,” 2019, arXiv:1910.10252.
[88] T. Li, S. Hu, A. Beirami, and V. Smith, “Ditto: Fair and robust federated learning through personalization,” in Proc. ICML, 2021, pp. 6357–6368.
[89] S. Divi, Y.-S. Lin, H. Farrukh, and Z. Berkay Celik, “New metrics to evaluate the performance and fairness of personalized federated learning,” 2021, arXiv:2107.13173.
[90] D. Sui, Y. Chen, J. Zhao, Y. Jia, Y. Xie, and W. Sun, “FedED: Federated learning via ensemble distillation for medical relation extraction,” in Proc. EMNLP, 2020, pp. 2118–2128.
[91] P. Xiao, S. Cheng, V. Stankovic, and D. Vukobratovic, “Averaging is probably not the optimum way of aggregating parameters in federated learning,” Entropy, vol. 22, no. 3, p. 314, Mar. 2020.
[92] H. Wang, M. Yurochkin, Y. Sun, D. Papailiopoulos, and Y. Khazaeni, “Federated learning with matched averaging,” in Proc. ICLR, 2020, pp. 1–16.
[93] H. Zhu, H. Zhang, and Y. Jin, “From federated learning to federated neural architecture search: A survey,” Complex Intell. Syst., vol. 7, no. 2, pp. 639–657, Apr. 2021.

TAN et al.: TOWARDS PERSONALIZED FEDERATED LEARNING 9603

[94] Q. Wu, K. He, and X. Chen, “Personalized federated learning for intelligent IoT applications: A cloud-edge based framework,” IEEE Open J. Comput. Soc., vol. 1, pp. 35–44, 2020.
[95] R. Kemker, M. McClure, A. Abitino, T. Hayes, and C. Kanan, “Measuring catastrophic forgetting in neural networks,” in Proc. AAAI Conf. Artif. Intell., vol. 2018, pp. 3390–3398.
[96] M. Delange et al., “A continual learning survey: Defying forgetting in classification tasks,” IEEE Trans. Pattern Anal. Mach. Intell., early access, Feb. 5, 2021, doi: 10.1109/TPAMI.2021.3057446.
[97] Y. Lin, S. Han, H. Mao, Y. Wang, and W. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” in Proc. ICLR, 2018, pp. 1–14.
[98] Y. Chen, X. Sun, and Y. Jin, “Communication-efficient federated deep learning with layerwise asynchronous model update and temporally weighted aggregation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 10, pp. 4229–4238, Oct. 2020.
[99] J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, and G. Zhang, “Learning under concept drift: A review,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 12, pp. 2346–2363, Dec. 2019.
[100] F. E. Casado, D. Lema, R. Iglesias, C. V. Regueiro, and S. Barro, “Concept drift detection and adaptation for robotics and mobile devices in federated and continual settings,” in Proc. WAF, 2021, pp. 79–93.
[101] S. Zheng, Y. Cao, M. Yoshikawa, H. Li, and Q. Yan, “FL-market: Trading private models in federated learning,” 2021, arXiv:2106.04384.
[102] Y. Zhan, J. Zhang, Z. Hong, L. Wu, P. Li, and S. Guo, “A survey of incentive mechanism design for federated learning,” IEEE Trans. Emerg. Topics Comput., early access, Mar. 3, 2021, doi: 10.1109/TETC.2021.3063517.
[103] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, “A survey on bias and fairness in machine learning,” ACM Comput. Surv., vol. 54, no. 6, pp. 1–35, Jul. 2021.
[104] K. Holstein, J. W. Vaughan, H. Daumé III, M. Dudík, and H. Wallach, “Improving fairness in machine learning systems: What do industry practitioners need?” in Proc. CHI Conf. Hum. Factors Comput. Syst., 2019, pp. 1–16.
[105] M. Mohri, G. Sivek, and A. T. Suresh, “Agnostic federated learning,” in Proc. ICML, 2019, pp. 4615–4625.
[106] T. Li, M. Sanjabi, A. Beirami, and V. Smith, “Fair resource allocation in federated learning,” in Proc. ICLR, 2020, pp. 1–27.
[107] J. Zhang, C. Li, A. Robles-Kelly, and M. Kankanhalli, “Hierarchically fair federated learning,” 2020, arXiv:2004.10386.
[108] L. Lyu et al., “Towards fair and privacy-preserving federated deep models,” IEEE Trans. Parallel Distrib. Syst., vol. 31, no. 11, pp. 2524–2541, Nov. 2020.
[109] A. Adadi and M. Berrada, “Peeking inside the black-box: A survey on explainable artificial intelligence (XAI),” IEEE Access, vol. 6, pp. 52138–52160, 2018.
[110] A. B. Arrieta et al., “Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI,” Inf. Fusion, vol. 58, pp. 82–115, Jun. 2020.
[111] S. Tonekaboni, S. Joshi, M. D. McCradden, and A. Goldenberg, “What clinicians want: Contextualizing explainable machine learning for clinical end use,” in Proc. MLHC, 2019, pp. 359–380.
[112] R. Shokri, M. Strobel, and Y. Zick, “On the privacy risks of model explanations,” in Proc. AAAI/ACM Conf. AI, Ethics, Soc., Jul. 2021, pp. 231–241.

Alysa Ziying Tan received the master’s degree in intelligent systems from the National University of Singapore, Singapore. She is currently pursuing the Ph.D. degree with the Alibaba-NTU Joint Research Institute, Nanyang Technological University, Singapore.
She worked as a data scientist and built deep learning and optimization solutions across manufacturing, insurance, and supply chain domains. She received the inaugural IMDA Singapore Digital Postgraduate Scholarship. Her research interests include federated learning, deep learning, and optimization.

Han Yu (Member, IEEE) received the Ph.D. degree from the School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore.
He held the prestigious Lee Kuan Yew Post-Doctoral Fellowship from 2015 to 2018. He is currently a Nanyang Assistant Professor with the School of Computer Science and Engineering, NTU. He has authored or coauthored over 150 research papers and book chapters in leading international conferences and journals. He has coauthored the book Federated Learning, the first monograph on the topic of federated learning. His research interests include federated learning and algorithmic fairness.
Dr. Yu’s research works have received multiple awards from conferences and journals.

Lizhen Cui (Member, IEEE) is currently a Professor and the Vice Chair of the School of Software Engineering, Shandong University, Jinan, China. From 2013 to 2014, he was a Visiting Scholar with the Georgia Institute of Technology, Atlanta, GA, USA. His current research interests include data science and engineering, intelligent data analysis, service computing, and collaborative computing.

Qiang Yang (Fellow, IEEE) received the [Link]. degree in astrophysics from Peking University, Beijing, China, in 1982, and the Ph.D. degree in computer science and the [Link]. degree in astrophysics from the University of Maryland at College Park, College Park, MD, USA, in 1985 and 1989, respectively.
He was a Faculty Member with the University of Waterloo, Waterloo, ON, Canada, from 1989 to 1995, and Simon Fraser University, Burnaby, BC, Canada, from 1995 to 2001. He was the Founding Director of the Noah’s Ark Laboratory, Huawei, Hong Kong, from 2012 to 2014, and a Co-Founder of 4Paradigm Corporation, Beijing, an AI platform company. He is currently the Head (Chief AI Officer) with the AI Department, WeBank, Shenzhen, China, and the Chair Professor with the Department of Computer Science and Engineering (CSE), The Hong Kong University of Science and Technology, Hong Kong, where he was the Former Head of the Department of CSE and the Founding Director of the Big Data Institute, Hong Kong, from 2015 to 2018. He has authored several books, including Intelligent Planning (Springer), Crafting Your Research Future (Morgan and Claypool), and Constraint-Based Design Recovery for Software Engineering (Springer). His research interests include artificial intelligence, machine learning, and data mining, with an emphasis on transfer learning, automated planning, federated learning, and case-based reasoning.
Dr. Yang is a fellow of several international societies, including the ACM, AAAI, IAPR, and AAAS. He served as an Executive Council Member for the Association for the Advancement of AI from 2016 to 2020 and the President for the International Joint Conference on AI from 2017 to 2019. He was a recipient of several awards, including the 2004/2005 ACM KDDCUP Championship, the AAAI Innovative AI Applications Award in 2016, and the ACM SIGKDD Distinguished Service Award in 2017. He was the Founding Editor-in-Chief of the ACM Transactions on Intelligent Systems and Technology and the IEEE Transactions on Big Data.
