
This paper presents an experimental study on federated learning (FL) in non-IID data settings, addressing the challenges posed by fragmented data across multiple organizations and countries. The authors propose six comprehensive data partitioning strategies and conduct extensive experiments on four state-of-the-art FL algorithms, revealing that none consistently outperforms the others across all scenarios. The findings emphasize the significance of understanding non-IID data distributions for improving FL effectiveness and provide insights for future research in this area.


2022 IEEE 38th International Conference on Data Engineering (ICDE)

Federated Learning on Non-IID Data Silos: An Experimental Study

Qinbin Li∗, National University of Singapore, Singapore, qinbin@[Link]
Yiqun Diao∗ and Quan Chen, Shanghai Jiao Tong University, Shanghai, China, {diaoyiqun, chen-quan}@[Link]
Bingsheng He, National University of Singapore, Singapore, hebs@[Link]

∗ Equal contribution.

2022 IEEE 38th International Conference on Data Engineering (ICDE) | 978-1-6654-0883-7/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICDE53745.2022.00077

Abstract: Due to increasing privacy concerns and data regulations, training data have become increasingly fragmented, forming distributed databases of multiple “data silos” (e.g., within different organizations and countries). To develop effective machine learning services, it is necessary to exploit data from such distributed databases without exchanging the raw data. Recently, federated learning (FL) has attracted growing interest as a solution that enables multiple parties to collaboratively train a machine learning model without exchanging their local data. A key and common challenge on distributed databases is the heterogeneity of the data distribution among the parties. The data of different parties are usually non-independently and identically distributed (i.e., non-IID). Many FL algorithms have been proposed to address learning effectiveness under non-IID data settings. However, an experimental study that systematically examines their advantages and disadvantages is lacking, as previous studies use very rigid data partitioning strategies among parties, which are hardly representative or thorough. In this paper, to help researchers better understand and study the non-IID data setting in federated learning, we propose comprehensive data partitioning strategies to cover the typical non-IID data cases. Moreover, we conduct extensive experiments to evaluate state-of-the-art FL algorithms. We find that non-IID data does bring significant challenges to the learning accuracy of FL algorithms, and none of the existing state-of-the-art FL algorithms outperforms the others in all cases. Our experiments provide insights for future studies on addressing the challenges in “data silos”.

I. INTRODUCTION

In recent years, we have witnessed promising advances in leveraging machine learning services, such as learned index structures [12], [53] and learned cost estimation [24], [54]. As such, machine learning services have become emerging data-intensive workloads, such as [Link] [45], Machine Learning Bazaar [68] and Rafiki [73]. Despite the success of machine learning services, their effectiveness relies heavily on large-volume, high-quality training data. However, due to increasing privacy concerns and data regulations such as GDPR [69], training data have become increasingly fragmented, forming distributed databases of multiple “data silos” (e.g., within different organizations and countries). Due to the deployed data regulations, raw data are usually not allowed to be transferred across organizations or countries. For example, a multinational corporation (MNC) provides services to users in multiple nations, whose personal data usually cannot be centralized in a single country due to the data regulations in many countries.

To develop effective machine learning services, it is necessary to exploit data from such distributed databases without exchanging the raw data. While many studies work on privacy-preserving data management and data mining [3], [30], [59], [63], [65] in a centralized setting, they cannot handle the case of distributed databases. Thus, how to conduct data mining and machine learning on distributed databases without exchanging local data has become an emerging topic.

To address the above challenge, we borrow the federated learning (FL) [32], [42], [43], [76] approach from the machine learning community. Originally proposed by Google, FL is a promising solution that enables many parties to jointly train a machine learning model while keeping their local data decentralized. Here we focus on horizontal federated learning, where the parties share the same feature space but have different sample spaces. Instead of exchanging data and conducting centralized training, each party sends its model to the server, which updates the global model and sends it back to the parties in each round. Since the raw data are not exposed, FL is an effective way to address privacy concerns. It has attracted much research interest [9], [25], [34], [41], [44], [51], [74] and has been widely used in practice [5], [23], [33]. Thus, we consider FL for developing machine learning services on distributed databases.

One key and common data challenge in such distributed databases is that the data in different parties are usually non-independently and identically distributed (non-IID). For example, different areas can have very different disease distributions. Due to the ozone hole, countries in the Southern Hemisphere may have more skin cancer patients than those in the Northern Hemisphere. In this case, the label distributions differ across parties. Another example is that people have different writing styles even for the same word. In such a case, the feature distributions differ across parties. According to previous studies [27], [34], [46], non-IID data settings can degrade the effectiveness of machine learning services.

Several studies have tried to develop effective FL algorithms under non-IID data, including FedProx [44], SCAFFOLD [34], and FedNova [71]. However, an experimental study that systematically examines their advantages and disadvantages is still lacking, as previous studies use very rigid data partitioning strategies among parties, which

are hardly representative and thorough. In their experiments, these studies try only one or two partitioning strategies to simulate the non-IID data setting, which does not sufficiently cover the different non-IID cases. For example, in FedAvg [55], each party only has samples of two classes. In FedNova [71], the number of samples of each class in each party follows a Dirichlet distribution. These partitioning strategies cover only the label skew case. Thus, it is necessary to evaluate those algorithms with a systematic exploration of different non-IID scenarios.

In this paper, we break the barrier of experiments on non-IID data distribution challenges in FL by proposing NIID-Bench. Specifically, we introduce six non-IID data partitioning strategies which thoroughly consider different cases including label distribution skew, feature distribution skew, and quantity skew. Moreover, we conduct extensive experiments on nine datasets to evaluate the accuracy of four state-of-the-art FL algorithms: FedAvg [55], FedProx [44], SCAFFOLD [34], and FedNova [71]. The experimental results provide insights for the future development of FL algorithms. Last, our code is publicly available.¹ Researchers can easily use our code to try different partitioning strategies for the evaluation of existing algorithms or a new algorithm. We also maintain a leaderboard along with our code to rank state-of-the-art federated learning algorithms on different non-IID settings, which can benefit the federated learning community.

Through extensive studies, we have the following key findings. First, we find that non-IID data does bring significant challenges to the learning accuracy of FL algorithms, and none of the existing state-of-the-art FL algorithms outperforms the others in all cases. Second, the effectiveness of FL is highly related to the kind of data skew, e.g., the label distribution skew setting is more challenging than the quantity skew setting. This indicates the importance of having a more comprehensive benchmark on non-IID distributions. Last, in the non-IID data setting, instability of the learning process widely exists due to techniques such as batch normalization and partial sampling. This can severely hurt the effectiveness of machine learning services on distributed data silos.

Our main contributions are as follows:

• We identify non-IID data distributions as a key and common challenge in designing effective federated learning algorithms for distributed data silos and develop a benchmark for researchers’ study of federated learning on non-IID data.
• We summarize six different partitioning strategies to generate comprehensive non-IID data distribution cases. Among the six partitioning strategies, four simple and effective strategies are designed by our study, while the other two are adopted from existing studies due to their popularity. We also demonstrate the significance of those strategies. None of the previous studies [27], [34], [44], [71] is as comprehensive as ours. For example, paper [27] only covers a single partitioning strategy to generate the label distribution skew setting.
• Using the proposed partitioning strategies, we conduct an extensive experimental study on four state-of-the-art algorithms: FedAvg [55], FedProx [44], SCAFFOLD [34], and FedNova [71]. Moreover, we provide insightful findings and future directions for data management and learning for distributed data silos, which we believe will become more and more common in the future.

¹ [Link]

II. PRELIMINARIES

A. Notations
Let $D = \{(x, y)\}$ denote the global dataset. Suppose there are $N$ parties, denoted as $P_1, ..., P_N$. The local dataset of $P_i$ is denoted as $D_i = \{(x_i, y_i)\}$. We use $w^t$ and $w_i^t$ to denote the global model and the local model of party $P_i$ in round $t$, respectively. Thus, $w^T$, the global model after $T$ rounds, is the output model of the federated learning process.

B. FedAvg
FedAvg [55] has been a de facto approach for FL. The framework of FedAvg is shown in Figure 1. In each round, first, the server sends the global model to the randomly selected parties. Second, each party updates the model with its local dataset. Then, the updated models are sent back to the server. Last, the server averages the received local models to obtain the updated global model. Unlike traditional distributed SGD, the parties update their local models for multiple epochs, which decreases the number of communication rounds and is much more communication-efficient. However, the local updates may lead to poor accuracy, as shown in previous studies [27], [34], [46].

Fig. 1. The FedAvg framework.

C. Effect of Non-IID Data
A key challenge in FL is the non-IID data among the parties [32], [42]. Non-IID data can strongly influence the accuracy of FedAvg. Since the distribution of each local dataset is highly different from the global distribution, the local objective of each party is inconsistent with the global optimum. Thus, there

exists a drift in the local updates [34]. In other words, in the local training stage, each model is updated towards its own local optimum, which can be far from the global optimum. The averaged model may also be far from the global optimum, especially when the local updates are large (e.g., with a large number of local epochs) [34], [44], [70], [71]. Eventually, the converged global model has much worse accuracy than in the IID setting. Figure 2 demonstrates the issue of FedAvg under the non-IID data setting. Under the IID setting, the global optimum $w^*$ is close to the local optima $w_1^*$ and $w_2^*$. Thus, the averaged model $w^{t+1}$ is also close to the global optimum. However, under the non-IID setting, since $w^*$ is far from $w_1^*$, $w^{t+1}$ can be far from $w^*$. It is challenging to design an effective FL algorithm under the non-IID setting. We will present the FL algorithms for handling non-IID data in the next section.

Fig. 2. Example of a drift under the non-IID setting. Panels: IID setting, non-IID setting. Legend: local model, global model, local optima, global optima.

III. FL ALGORITHMS ON NON-IID DATA

There have been some studies [34], [44], [71] trying to address the drift issue in FL. Here we summarize several state-of-the-art and popular approaches as shown in Algorithm 1 (FedAvg [55], FedProx [44], FedNova [71]) and Algorithm 2 (SCAFFOLD [34]). These approaches are all based on FedAvg, and we use colors to mark the parts that are specially designed in FedProx (red), SCAFFOLD (blue), and FedNova (orange). Note that the studied approaches have the same objective, i.e., learning an effective global model under the non-IID data setting. There are also other FL studies related to the non-IID data setting, such as personalizing the local models for each party [13], [15], [22] and designing robust algorithms against different combinations of local distributions [10], [56], [62], which are out of the scope of this paper.

Algorithm 1: A summary of FL algorithms including FedAvg/FedProx/FedNova. We use red and orange colors to mark the parts specially included in FedProx and FedNova, respectively.
Input: local datasets $D_i$, number of parties $N$, number of communication rounds $T$, number of local epochs $E$, learning rate $\eta$
Output: the final model $w^T$
1  Server executes:
2    initialize $w^0$
3    for $t = 0, 1, ..., T-1$ do
4      sample a set of parties $S_t$
5      $n \leftarrow \sum_{i \in S_t} |D_i|$
6      for $i \in S_t$ in parallel do
7        send the global model $w^t$ to party $P_i$
8        $\Delta w_i^t, \tau_i \leftarrow$ LocalTraining($i$, $w^t$)
9      For FedAvg/FedProx: $w^{t+1} \leftarrow w^t - \eta \sum_{i \in S_t} \frac{|D_i|}{n} \Delta w_i^t$
10     For FedNova: $w^{t+1} \leftarrow w^t - \eta \, \frac{\sum_{i \in S_t} |D_i| \tau_i}{n} \sum_{i \in S_t} \frac{|D_i| \, \Delta w_i^t}{n \tau_i}$
11   return $w^T$
12 Party executes:
13   For FedAvg/FedNova: $L(w; b) = \sum_{(x,y) \in b} \ell(w; x; y)$
14   For FedProx: $L(w; b) = \sum_{(x,y) \in b} \ell(w; x; y) + \frac{\mu}{2} \| w - w^t \|^2$
15   LocalTraining($i$, $w^t$):
16     $w_i^t \leftarrow w^t$
17     $\tau_i \leftarrow 0$
18     for epoch $k = 1, 2, ..., E$ do
19       for each batch $b = \{x, y\}$ of $D_i$ do
20         $w_i^t \leftarrow w_i^t - \eta \nabla L(w_i^t; b)$
21         $\tau_i \leftarrow \tau_i + 1$
22     $\Delta w_i^t \leftarrow w^t - w_i^t$
23     return $\Delta w_i^t, \tau_i$ to the server

A. FedProx
FedProx [44] improves the local objective based on FedAvg. It directly limits the size of local updates. Specifically, as shown in Line 14 of Algorithm 1, it introduces an additional L2 regularization term in the local objective function to limit the distance between the local model and the global model. This is a straightforward way to limit the local updates so that the averaged model is not so far from the global optimum. A hyper-parameter $\mu$ is introduced to control the weight of the L2 regularization. Overall, the modification to FedAvg is lightweight and easy to implement. FedProx introduces additional computation overhead but no additional communication overhead. However, one drawback is that users may need to carefully tune $\mu$ to achieve good accuracy. If $\mu$ is too small, then the regularization term has almost no effect. If $\mu$ is too big, then the local updates are very small and the convergence speed is slow.

B. FedNova
Another recent study, FedNova [71], improves FedAvg in the aggregation stage. It considers that different parties may conduct different numbers of local steps (i.e., the number of mini-batches in the local training) each round. This can happen when parties have different computation power given the same time constraint, or when parties have different local dataset sizes given the same number of local epochs and batch size. Intuitively, parties with a larger number of local steps will have a larger local update, which will have a more significant influence on the global update if simply averaged. Thus, to ensure that the global updates are not biased, FedNova normalizes and scales

the local updates of each party according to their number of local steps before updating the global model (see Line 10 of Algorithm 1). FedNova also introduces only lightweight modifications to FedAvg, with negligible computation overhead when updating the global model.

C. SCAFFOLD
SCAFFOLD [34] models non-IID data as introducing variance among the parties and applies the variance reduction technique [31], [64]. It introduces control variates for the server (i.e., $c$) and the parties (i.e., $c_i$), which are used to estimate the update direction of the server model and the update direction of each client. Then, the drift of local training is approximated by the difference between these two update directions. Thus, SCAFFOLD corrects the local updates by adding the drift in the local training (Line 20 of Algorithm 2). SCAFFOLD proposes two approaches to update the local control variates (Line 23 of Algorithm 2): computing the gradient of the local data at the global model, or reusing the previously computed gradients. The second approach has a lower computation cost while the first one may be more stable. Compared with FedAvg, intuitively, SCAFFOLD doubles the communication size per round due to the additional control variates.

Algorithm 2: The SCAFFOLD algorithm. We use blue color to mark the parts specially included in SCAFFOLD compared with FedAvg.
Input: same as Algorithm 1
Output: the final model $w^T$
1  Server executes:
2    initialize $w^0$
3    $c^t \leftarrow 0$
4    for $t = 0, 1, ..., T-1$ do
5      randomly sample a set of parties $S_t$
6      $n \leftarrow \sum_{i \in S_t} |D_i|$
7      for $i \in S_t$ in parallel do
8        send the global model $w^t$ to party $P_i$
         $\Delta w_i^t, \Delta c \leftarrow$ LocalTraining($i$, $w^t$, $c^t$)
9      $w^{t+1} \leftarrow w^t - \eta \sum_{i \in S_t} \frac{|D_i|}{n} \Delta w_i^t$
10     $c^{t+1} \leftarrow c^t + \frac{1}{N} \Delta c$
11   return $w^T$
12 Party executes:
13   $L(w; b) = \sum_{(x,y) \in b} \ell(w; x; y)$
14   $c_i \leftarrow 0$
15   LocalTraining($i$, $w^t$, $c^t$):
16     $w_i^t \leftarrow w^t$
17     $\tau_i \leftarrow 0$
18     for epoch $k = 1, 2, ..., E$ do
19       for each batch $b = \{x, y\}$ of $D_i$ do
20         $w_i^t \leftarrow w_i^t - \eta (\nabla L(w_i^t; b) - c_i^t + c)$
21         $\tau_i \leftarrow \tau_i + 1$
22     $\Delta w_i^t \leftarrow w^t - w_i^t$
23     $c_i^* \leftarrow$ (i) $\nabla L(w_i^t)$, or (ii) $c_i - c + \frac{1}{\tau_i \eta} (w^t - w_i^t)$
24     $\Delta c \leftarrow c_i^* - c_i$
25     $c_i \leftarrow c_i^*$
26     return $\Delta w_i^t, \Delta c$ to the server

D. Other Studies
While we were preparing this paper, other contemporary works [2], [39], [47], [72] on federated learning under the non-IID setting appeared. [2] proposes FedDyn, which adds a regularization term in the local training based on the global model and the model from the previous round. [47] proposes FedBN for the feature shift non-IID setting, where the client batch-norm layers are updated locally without being communicated to the server. [72] applies a monitor to detect class imbalance in the training process and proposes a new loss function to address it. [39] proposes model-contrastive learning. Their approach corrects the local training by comparing the representations learned by the current local model, the local model from the previous round, and the global model. We leave the comparison with these studies as future work.

E. Motivation of this study
Non-IID data is a key and common challenge for developing effective federated learning algorithms. Although previous studies [34], [44], [71] have demonstrated preliminary and promising results over FedAvg on non-IID data, as we will summarize in Table I in a later section, all of the above studies evaluated only one or two non-IID distributions and used rigid data partitioning strategies in their experiments. There is still no standard benchmark or systematic study to evaluate the effectiveness of these FL algorithms. This motivates us to develop a benchmark with more comprehensive data distributions and data partitioning strategies, so that we can evaluate the pros and cons of existing algorithms and outline the challenges and opportunities for future federated learning on non-IID data.

IV. SIMULATING NON-IID DATA SETTING

As existing studies only adopt limited partitioning strategies, they cannot represent a comprehensive view of non-IID cases. To bridge this gap, we develop a benchmark named NIID-Bench.

A. Research Problems
We need to address two key research problems. The first one is on datasets: whether to use real-world non-IID datasets or synthetic datasets. The second one is on how to design comprehensive non-IID scenarios.

For the first problem, we choose to synthesize the distributed non-IID datasets by partitioning a real-world dataset into multiple smaller subsets. Many existing studies [34], [55], [71] use the partitioning approach to simulate the non-IID federated setting. Compared with using real federated datasets [6], [28], adopting partitioning strategies has the following advantages. First, while it is challenging to evaluate the imbalance properties (e.g., imbalanced level and imbalanced case) in real

federated datasets, partitioning strategies can easily quantify and control the imbalance properties of the local data. Thus, researchers can easily investigate the behavior of algorithms by trying different imbalanced settings, which is essential to the development of FL algorithms. Second, when using synthetic datasets, one can easily set different factors (e.g., number of parties, size of data) that are important in FL experiments. In contrast, a real federated dataset usually corresponds to a fixed federated setting. Last, due to data regulations and privacy concerns, meaningful real federated datasets are difficult to obtain [28]. Even if we can obtain such real datasets, they do not have the previous two advantages of synthetic datasets. It is more flexible to develop partitioning strategies on existing widely used public datasets, which already have lots of centralized training knowledge as reference, as well as to simulate different non-IID scenarios. There are also limitations of using generated datasets compared with using real federated datasets. The generated datasets may not fully capture the real data distributions, which can be complicated and challenging to quantify. Note that the use of generated federated datasets and the use of real federated datasets are orthogonal. It is an interesting future study to find and study meaningful real-world datasets and application scenarios.

For the second problem, an existing study [32] gives a very good and comprehensive summary of non-IID data cases from a distribution perspective. Specifically, considering the local data distribution $P(x_i, y_i) = P(x_i \mid y_i) P(y_i)$ or $P(x_i, y_i) = P(y_i \mid x_i) P(x_i)$, the previous study [32] summarizes five different non-IID cases: (1) label distribution skew (i.e., $P(y_i)$ is different among parties); (2) feature distribution skew (i.e., $P(x_i)$ is different among parties); (3) same label but different features (i.e., $P(x_i \mid y_i)$ is different among parties); (4) same features but different labels (i.e., $P(y_i \mid x_i)$ is different among parties); (5) quantity skew (i.e., $P(x_i, y_i)$ is the same but the amount of data is different among parties). Here the third case is mainly related to vertical FL (the parties share the same sample IDs but have different features). As mentioned in the third paragraph of Section I, we focus on horizontal FL in this paper, where each party shares the same feature space but owns different samples. The fourth case is not applicable in most FL studies, which assume there is a common knowledge $P(y \mid x)$ among the parties to learn. Otherwise, techniques such as domain adaptation [60] or personalized federated learning (i.e., each party learns a personalized local model) [13], [15] can be applied in federated learning, which is out of the scope of our paper. Thus, we consider label distribution skew, feature distribution skew, and quantity skew as possible non-IID data distribution cases in this paper. While the five non-IID data cases cover all possible single types of skew, there may be mixed types of skew, which we will discuss in Section V-G.

We use two real-world datasets, Criteo [11] and Digits [60], to demonstrate the non-IID properties. Criteo contains feature values and click feedback for millions of display ads, which can be used for clickthrough rate prediction. Digits contains multiple subsets for digit classification. In Criteo, taking each user as a party, we select ten parties and draw the label distribution as shown in Figure 3a. We can observe that there exists both label distribution skew (e.g., Party 0 and Party 4) and quantity skew (e.g., Party 0 and Party 8) among the parties. In Digits, taking each subset (e.g., MNIST and SVHN) as a party, we train a model using these subsets and draw the feature distribution using t-SNE [52] as shown in Figure 3b. For each class, although MNIST and SVHN have the same label, the feature distributions of MNIST and SVHN are significantly different from each other. Feature skew exists in the Digits dataset. These two examples show that the considered non-IID data cases are reasonable and practical.

Fig. 3. The non-IID properties of Criteo and Digits. (a) The label distribution for Criteo. The value in cell (a, b) is the number of data samples of class b belonging to Party a. (b) The feature distribution for Digits. The triangles are the visualized features of SVHN and the circles are the visualized features of MNIST.

B. Label Distribution Skew
In label distribution skew, the label distributions $P(y_i)$ vary across parties. Such a case is common in practice. For example, some hospitals are more specialized in several specific kinds of diseases and have more patient records on them. To simulate label distribution skew, we introduce two different label imbalance settings: quantity-based label imbalance and distribution-based label imbalance.

a) Quantity-based label imbalance: Here each party owns data samples of a fixed number of labels. This was first introduced in the experiments of FedAvg [55], where the data samples with the same label are divided into subsets and each party is only assigned 2 subsets with different labels. Following FedAvg, such a setting is also used in many other studies [19], [44]. [16] considers a highly extreme case, where

each party only has data samples with a single label. We introduce a general partitioning strategy to set the number of labels that each party has. Suppose each party only has data samples of $k$ different labels. We first randomly assign $k$ different label IDs to each party. Then, for the samples of each label, we randomly and equally divide them among the parties which own the label. In this way, the number of labels in each party is fixed, and there is no overlap between the samples of different parties. For ease of presentation, we use $\#C = k$ to denote such a partitioning strategy.

b) Distribution-based label imbalance: Another way to simulate label imbalance is to allocate to each party a proportion of the samples of each label according to a Dirichlet distribution. The Dirichlet distribution is commonly used as a prior distribution in Bayesian statistics [29] and is an appropriate choice to simulate real-world data distributions. Specifically, we sample $p_k \sim \mathrm{Dir}_N(\beta)$ and allocate a $p_{k,j}$ proportion of the instances of class $k$ to party $j$. Here $\mathrm{Dir}(\cdot)$ denotes the Dirichlet distribution and $\beta$ is a concentration parameter ($\beta > 0$). This partitioning strategy was first used in [77] and has been used in many recent studies [40], [49], [70], [71]. An advantage of this approach is that we can flexibly change the imbalance level by varying the concentration parameter $\beta$. If $\beta$ is set to a smaller value, then the partition is more unbalanced. An example of such a partitioning strategy is shown in Figure 4. For ease of presentation, we use $p_k \sim \mathrm{Dir}(\beta)$ to denote such a partitioning strategy.

Fig. 4. An example of distribution-based label imbalance partition on the MNIST [37] dataset with $\beta = 0.5$. The value in each rectangle is the number of data samples of a class belonging to a certain party.

C. Feature Distribution Skew
In feature distribution skew, the feature distributions $P(x_i)$ vary across parties although the knowledge $P(y_i \mid x_i)$ is the same. For example, cats may vary in coat colors and patterns in different areas. Here we introduce three different settings to simulate feature distribution skew: noise-based feature imbalance, synthetic feature imbalance, and real-world feature imbalance.

a) Noise-based feature imbalance: We first divide the whole dataset among multiple parties randomly and equally. For each party, we add a different level of Gaussian noise to its local dataset to achieve different feature distributions. We choose Gaussian noise due to its popularity, especially in images [78]. Specifically, given a user-defined noise level $\sigma$, we add noise $\hat{x} \sim \mathrm{Gau}(\sigma \cdot i/N)$ for party $P_i$, where $\mathrm{Gau}(\sigma \cdot i/N)$ is a Gaussian distribution with mean 0 and variance $\sigma \cdot i/N$. Users can change $\sigma$ to increase the feature dissimilarity among the parties. Figure 5 is an example of noise-based feature imbalance on the FMNIST dataset [75]. For ease of presentation, we use $\hat{x} \sim \mathrm{Gau}(\sigma)$ to denote such a partitioning strategy.

Fig. 5. An example of adding noise on the FMNIST [75] dataset. (a) Noise from Gau(0.001); (b) noise from Gau(0.01). On party $P_1$, noise sampled from Gau(0.001) is added to its images. On party $P_2$, noise sampled from Gau(0.01) is added to its images.

b) Synthetic feature imbalance: We generate a synthetic feature imbalance federated dataset named FCUBE. Suppose the data points are distributed in a cube in three dimensions (i.e., $(x_1, x_2, x_3)$) and have two different labels separated by the plane $x_1 = 0$. As shown in Figure 6, we divide the cube into 8 parts by the planes $x_1 = 0$, $x_2 = 0$, and $x_3 = 0$. Then, we allocate two parts that are symmetric about the origin (0, 0, 0) to a subset for each party. In this way, the feature distribution varies among parties while the labels are still balanced.

Fig. 6. The visualization of our FCUBE dataset. The data points within the upper four cubes have label 0 and those within the lower four cubes have label 1. There are a total of eight cubes with four colors. The data points with the same color are assigned to a party.

c) Real-world feature imbalance: The EMNIST dataset [8] collects handwritten characters and digits from different writers. Then, like [6], it is natural to partition the dataset into different parties according to the writers. Since the character features usually differ among writers (e.g., stroke width, slant), there is a natural feature distribution skew among different parties. Specifically, for the digit images of EMNIST, we divide and assign the writers (and their digits) to the parties randomly and equally. Since each party has different writers, the feature distributions are different among the parties. Like [6], we call this federated dataset FEMNIST.
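Two of the strategies described in Sections IV-B and IV-C are short enough to sketch directly. The following NumPy code is a minimal illustration, not the NIID-Bench implementation: the function names, arguments, and seed handling are hypothetical, parties are assumed to be indexed from 1 to N for the noise level, and no safeguard against empty or very small parties is included. It shows the distribution-based label imbalance ($p_k \sim \mathrm{Dir}_N(\beta)$) and the noise-based feature imbalance (zero-mean Gaussian noise with variance $\sigma \cdot i/N$ on party $P_i$).

```python
import numpy as np


def dirichlet_label_partition(labels, num_parties, beta, seed=0):
    """Split sample indices so that, for every class k, party j receives a
    p_{k,j} share of that class, with p_k ~ Dir_N(beta).
    Hypothetical helper, written for illustration only."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    party_indices = [[] for _ in range(num_parties)]
    for k in np.unique(labels):
        # Shuffle the indices of class k before cutting them into shares.
        idx = rng.permutation(np.flatnonzero(labels == k))
        p_k = rng.dirichlet(np.full(num_parties, beta))
        # Turn cumulative proportions into split points over this class.
        cuts = (np.cumsum(p_k)[:-1] * len(idx)).astype(int)
        for j, chunk in enumerate(np.split(idx, cuts)):
            party_indices[j].extend(chunk.tolist())
    return [np.array(sorted(ix), dtype=int) for ix in party_indices]


def add_noise_feature_skew(features, party_indices, sigma, seed=0):
    """Return one noisy copy of the local features per party. Party P_i
    (assumed i = 1..N) gets zero-mean Gaussian noise with variance
    sigma * i / N. Hypothetical helper, written for illustration only."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    n = len(party_indices)
    noisy = []
    for i, idx in enumerate(party_indices, start=1):
        # Generator.normal expects a standard deviation, not a variance.
        std = np.sqrt(sigma * i / n)
        local = features[idx]
        noisy.append(local + rng.normal(0.0, std, size=local.shape))
    return noisy
```

A smaller `beta` concentrates each class on fewer parties, which matches the statement above that smaller $\beta$ yields a more unbalanced partition. In the paper the noise-based strategy starts from a random, equal split; here any list of index arrays can be passed as `party_indices`.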

TABLE I
The experimental settings in existing studies and our benchmark. Note that the quantity-based, noise-based, and quantity skew partitioning strategies in the existing studies are different from the strategies proposed in our study.
Partitioning strategies FedAvg FedProx SCAFFOLD FedNova NIID-Bench
Label distribution skew, quantity-based
Label distribution skew, distribution-based
Feature distribution skew, noise-based
Feature distribution skew, synthetic
Feature distribution skew, real-world
Quantity skew

TABLE II
The statistics of datasets in the experiments.

| Datasets | #training instances | #test instances | #features | #classes |
|---|---|---|---|---|
| MNIST | 60,000 | 10,000 | 784 | 10 |
| FMNIST | 60,000 | 10,000 | 784 | 10 |
| CIFAR-10 | 50,000 | 10,000 | 1,024 | 10 |
| SVHN | 73,257 | 26,032 | 1,024 | 10 |
| adult | 32,561 | 16,281 | 123 | 2 |
| rcv1 | 15,182 | 5,060 | 47,236 | 2 |
| covtype | 435,759 | 145,253 | 54 | 2 |
| FCUBE | 4,000 | 1,000 | 3 | 2 |
| FEMNIST | 341,873 | 40,832 | 784 | 10 |

D. Quantity Skew

In quantity skew, the size of the local dataset |D_i| varies across parties. Although the data distribution may still be consistent among the parties, it is interesting to see the effect of the quantity imbalance in FL. As in the distribution-based label imbalance setting, we use a Dirichlet distribution to allocate different amounts of data samples to each party. We sample q ∼ Dir_N(β) and allocate a q_j proportion of the total data samples to P_j. The parameter β can be used to control the imbalance level of the quantity skew. For ease of presentation, we use q ∼ Dir(β) to denote such a partitioning strategy.
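The quantity-skew strategy q ∼ Dir_N(β) can be sketched in the same spirit. The helper name `partition_by_quantity_dirichlet`, the seed argument, and the rounding of q_j to integer cut points are assumptions for illustration only, not the authors' implementation. Because the indices are shuffled before being cut, each party receives a random subset, so only the sizes differ.

```python
import random


def partition_by_quantity_dirichlet(num_samples, num_parties, beta, seed=0):
    """Return one list of sample indices per party (hypothetical helper).

    A proportion vector q ~ Dir_N(beta) is sampled and party j receives
    roughly a q_j share of all samples, chosen uniformly at random.
    """
    rng = random.Random(seed)

    # Shuffle so that every party gets a random subset of the data.
    indices = list(range(num_samples))
    rng.shuffle(indices)

    # Symmetric Dirichlet sample via normalized Gamma(beta, 1) draws.
    draws = [rng.gammavariate(beta, 1.0) for _ in range(num_parties)]
    total = sum(draws)
    q = [d / total for d in draws]

    parties, start, cum = [], 0, 0.0
    for j in range(num_parties):
        cum += q[j]
        if j == num_parties - 1:
            end = num_samples  # last party takes the remainder
        else:
            end = min(num_samples, max(start, int(round(cum * num_samples))))
        parties.append(indices[start:end])
        start = end
    return parties


if __name__ == "__main__":
    split = partition_by_quantity_dirichlet(50_000, num_parties=10, beta=0.5)
    print(sorted(len(part) for part in split))
```

Note that with a small β some parties may receive very few samples or none; whether and how to enforce a minimum local dataset size is a design choice this sketch leaves open.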
E. Experiments in Existing Studies

Table I compares the partitioning strategies in NIID-bench with the experimental settings in existing studies. We can observe that each study covers only some of the non-IID cases, so it is impossible to directly compare the results presented in different papers. In contrast, NIID-bench consists of six partitioning strategies, which are more comprehensive and representative of different non-IID data cases.

V. EXPERIMENTS

To investigate the effectiveness of existing FL algorithms in non-IID data settings, we conduct extensive experiments on nine public datasets, including six image datasets (i.e., MNIST [37], CIFAR-10 [35], FMNIST [75], SVHN [57], FCUBE, FEMNIST [6]) and three tabular datasets (i.e., adult, rcv1, and covtype)2. The statistics of the datasets are summarized in Table II. For the image datasets, we use a CNN, which has two 5x5 convolution layers followed by 2x2 max pooling (the first with 6 channels and the second with 16 channels) and two fully connected layers with ReLU activation (the first with 120 units and the second with 84 units). For the tabular datasets, we use an MLP with three hidden layers. The numbers of hidden units of the three layers are 32, 16, and 8. The number of parties is set to 10 by default, except for FCUBE where the number of parties is set to 4. By default, all parties participate in every round to eliminate the effect of randomness brought by party sampling [55]. We use the SGD optimizer with learning rate 0.1 for rcv1 and learning rate 0.01 for the other datasets (tuned from {0.1, 0.01, 0.001}) and momentum 0.9. The batch size is set to 64 and the number of local epochs is set to 10 by default.

2 [Link]

Benchmark metrics. We use the top-1 accuracy on the test dataset as a metric to compare the studied algorithms. We run all the studied algorithms for the same number of rounds for fair comparison. The number of rounds is set to 50 by default unless otherwise specified.

Due to the page limit, for the experiments on the effect of batch size and model architecture, please refer to Appendix D and E of the technical report [38], respectively.

A. Overall Accuracy Comparison

The accuracy of existing approaches including FedAvg, FedProx, SCAFFOLD, and FedNova under different non-IID data settings is shown in Table III. For comparison, we also present the results for IID scenarios (i.e., homogeneous partitions). Next we show the insights from different perspectives.

1) Comparison among different non-IID settings:

Finding (1): The label distribution skew case where each party only has samples of a single class is the most challenging setting, while the feature distribution skew and quantity skew settings have little influence on the accuracy of FedAvg.

From Table III, we can observe that there is a gap between the accuracy of existing algorithms on several non-IID data settings and on the homogeneous setting. First, among different non-IID data settings, all studied FL algorithms perform worse on the label distribution skew case. Second, in the label distribution skew setting, the algorithms have the worst accuracy when each party only has data from a single label. As expected, the accuracy increases as the number of classes in each party increases. Third, for the feature distribution skew setting, except for CIFAR-10, existing algorithms have an accuracy very close to that of the IID setting. Last, in the quantity skew setting, FedAvg has almost no accuracy loss. Since weighted averaging is adopted in FedAvg, it can already handle the quantity imbalance well. Overall,

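TABLE III
The top-1 accuracy of different approaches. We run three trials and report the mean accuracy and standard deviation. For FedProx, we tune μ from {0.001, 0.01, 0.1, 1} and report the best accuracy.

| Category | Dataset | Partitioning | FedAvg | FedProx | SCAFFOLD | FedNova |
|---|---|---|---|---|---|---|
| Label distribution skew | MNIST | p_k ∼ Dir(0.5) | 98.9%±0.1% | 98.9%±0.1% | 99.0%±0.1% | 98.9%±0.1% |
| Label distribution skew | MNIST | #C = 1 | 29.8%±7.9% | 40.9%±23.1% | 9.9%±0.2% | 39.2%±22.1% |
| Label distribution skew | MNIST | #C = 2 | 97.0%±0.4% | 96.4%±0.3% | 95.9%±0.3% | 94.5%±1.5% |
| Label distribution skew | MNIST | #C = 3 | 98.0%±0.2% | 97.9%±0.4% | 96.6%±1.5% | 98.0%±0.3% |
| Label distribution skew | FMNIST | p_k ∼ Dir(0.5) | 88.1%±0.6% | 88.1%±0.9% | 88.4%±0.5% | 88.5%±0.5% |
| Label distribution skew | FMNIST | #C = 1 | 11.2%±2.0% | 28.9%±3.9% | 12.8%±4.8% | 14.8%±5.9% |
| Label distribution skew | FMNIST | #C = 2 | 77.3%±4.9% | 74.9%±2.6% | 42.8%±28.7% | 70.4%±5.1% |
| Label distribution skew | FMNIST | #C = 3 | 80.7%±1.9% | 82.5%±1.9% | 77.7%±3.8% | 78.9%±3.0% |
| Label distribution skew | CIFAR-10 | p_k ∼ Dir(0.5) | 68.2%±0.7% | 67.9%±0.7% | 69.8%±0.7% | 66.8%±1.5% |
| Label distribution skew | CIFAR-10 | #C = 1 | 10.0%±0.0% | 12.3%±2.0% | 10.0%±0.0% | 10.0%±0.0% |
| Label distribution skew | CIFAR-10 | #C = 2 | 49.8%±3.3% | 50.7%±1.7% | 49.1%±1.7% | 46.5%±3.5% |
| Label distribution skew | CIFAR-10 | #C = 3 | 58.3%±1.2% | 57.1%±1.2% | 57.8%±1.4% | 54.4%±1.1% |
| Label distribution skew | SVHN | p_k ∼ Dir(0.5) | 86.1%±0.7% | 86.6%±0.9% | 86.8%±0.3% | 86.4%±0.6% |
| Label distribution skew | SVHN | #C = 1 | 11.1%±0.0% | 19.6%±0.0% | 6.7%±0.0% | 10.6%±0.8% |
| Label distribution skew | SVHN | #C = 2 | 80.2%±0.8% | 79.3%±0.9% | 62.7%±11.6% | 75.4%±4.8% |
| Label distribution skew | SVHN | #C = 3 | 82.0%±0.7% | 82.1%±1.0% | 77.2%±2.0% | 80.5%±1.2% |
| Label distribution skew | adult | p_k ∼ Dir(0.5) | 78.4%±0.9% | 80.5%±0.7% | 76.4%±0.0% | 52.3%±26.7% |
| Label distribution skew | adult | #C = 1 | 82.5%±2.2% | 76.4%±0.0% | 23.6%±0.0% | 50.8%±0.9% |
| Label distribution skew | rcv1 | p_k ∼ Dir(0.5) | 48.2%±0.7% | 70.3%±13.3% | 64.4%±24.3% | 49.3%±2.1% |
| Label distribution skew | rcv1 | #C = 1 | 51.8%±0.7% | 51.8%±0.7% | 51.8%±0.7% | 51.8%±0.7% |
| Label distribution skew | covtype | p_k ∼ Dir(0.5) | 77.2%±7.4% | 70.9%±0.7% | 67.7%±14.9% | 74.8%±12.9% |
| Label distribution skew | covtype | #C = 1 | 48.8%±0.1% | 59.1%±2.1% | 49.6%±1.4% | 50.4%±1.4% |
| Label distribution skew | number of times that performs the best | | 8 | 11 | 4 | 3 |
| Feature distribution skew | MNIST | x̂ ∼ Gau(0.1) | 99.1%±0.1% | 99.1%±0.1% | 99.1%±0.1% | 99.1%±0.1% |
| Feature distribution skew | FMNIST | x̂ ∼ Gau(0.1) | 89.1%±0.3% | 89.0%±0.2% | 89.3%±0.0% | 89.0%±0.1% |
| Feature distribution skew | CIFAR-10 | x̂ ∼ Gau(0.1) | 68.9%±0.3% | 69.3%±0.2% | 70.1%±0.2% | 68.5%±1.3% |
| Feature distribution skew | SVHN | x̂ ∼ Gau(0.1) | 88.1%±0.5% | 88.1%±0.2% | 88.1%±0.4% | 88.1%±0.4% |
| Feature distribution skew | FCUBE | synthetic | 99.8%±0.2% | 99.8%±0.0% | 99.7%±0.3% | 99.7%±0.1% |
| Feature distribution skew | FEMNIST | real-world | 99.4%±0.0% | 99.3%±0.1% | 99.4%±0.1% | 99.3%±0.1% |
| Feature distribution skew | number of times that performs the best | | 4 | 3 | 5 | 2 |
| Quantity skew | MNIST | q ∼ Dir(0.5) | 99.2%±0.1% | 99.2%±0.1% | 99.1%±0.1% | 99.1%±0.1% |
| Quantity skew | FMNIST | q ∼ Dir(0.5) | 89.4%±0.1% | 89.7%±0.3% | 88.8%±0.4% | 86.1%±2.9% |
| Quantity skew | CIFAR-10 | q ∼ Dir(0.5) | 72.0%±0.3% | 71.2%±0.6% | 62.4%±4.1% | 10.0%±0.0% |
| Quantity skew | SVHN | q ∼ Dir(0.5) | 88.3%±1.0% | 88.4%±0.4% | 11.0%±7.4% | 41.3%±21.1% |
| Quantity skew | adult | q ∼ Dir(0.5) | 82.2%±0.1% | 84.8%±0.2% | 81.6%±4.5% | 43.2%±33.9% |
| Quantity skew | rcv1 | q ∼ Dir(0.5) | 96.7%±0.3% | 96.8%±0.4% | 49.0%±1.9% | 51.8%±0.7% |
| Quantity skew | covtype | q ∼ Dir(0.5) | 88.1%±0.2% | 84.6%±0.2% | 63.2%±20.8% | 51.2%±3.2% |
| Quantity skew | number of times that performs the best | | 3 | 5 | 0 | 0 |
| Homogeneous partition | MNIST | IID | 99.1%±0.1% | 99.1%±0.1% | 99.2%±0.0% | 99.1%±0.1% |
| Homogeneous partition | FMNIST | IID | 89.6%±0.3% | 89.5%±0.2% | 89.7%±0.2% | 89.4%±0.2% |
| Homogeneous partition | CIFAR-10 | IID | 70.4%±0.2% | 70.2%±0.1% | 71.5%±0.3% | 69.5%±1.0% |
| Homogeneous partition | SVHN | IID | 88.5%±0.5% | 88.5%±0.8% | 88.0%±0.8% | 88.4%±0.5% |
| Homogeneous partition | FCUBE | IID | 99.7%±0.1% | 99.6%±0.2% | 99.8%±0.1% | 99.9%±0.1% |
| Homogeneous partition | FEMNIST | IID | 99.3%±0.1% | 99.4%±0.1% | 99.4%±0.0% | 99.3%±0.0% |
| Homogeneous partition | adult | IID | 82.6%±0.4% | 84.8%±0.2% | 83.8%±2.5% | 82.6%±0.0% |
| Homogeneous partition | rcv1 | IID | 96.8%±0.4% | 96.6%±0.6% | 80.9%±27.8% | 96.6%±0.4% |
| Homogeneous partition | covtype | IID | 87.9%±0.1% | 85.2%±0.0% | 88.0%±2.3% | 87.9%±0.2% |
| Homogeneous partition | number of times that performs the best | | 2 | 3 | 5 | 1 |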
the label distribution skew influences the accuracy of FL algorithms most among all non-IID settings. There is room for existing algorithms to be improved to handle scenarios such as quantity-based label imbalance.

We draw a decision tree to summarize the suitable FL algorithm for each non-IID setting, as shown in Figure 7, according to our observations. This decision tree helps users choose the algorithm for their learning task according to the non-IID distribution and the datasets. For example, if the local datasets are likely to have feature distribution skew (e.g., the digits from different writers), then SCAFFOLD may be the best algorithm for FL. If the local datasets have almost the same data distribution but different sizes (e.g., databases with different capacities), then FedProx is likely the appropriate algorithm. If there is no prior knowledge on the local datasets, how to determine the distribution is a challenging problem and more research efforts are needed (see Section VI-A).

2) Comparison among different algorithms:

Finding (2): No algorithm consistently outperforms the other algorithms in all settings. The state-of-the-art algorithms significantly outperform FedAvg only in several cases.

We have the following observations regarding the different
Fig. 7. The decision tree to determine the (almost) best FL algorithm given the non-IID setting. Node labels: non-IID data setting; label distribution skew (distribution-based label imbalance, quantity-based label imbalance; image datasets, tabular datasets); feature distribution skew; quantity skew. Leaf labels: SCAFFOLD, FedProx, FedAvg/FedProx.

Fig. 8. The training curves of different approaches on CIFAR-10. Panels: (a) p_k ∼ Dir(0.5); (b) x̂ ∼ Gau(0.1).

algorithms. First, in the label distribution skew and quantity skew cases, FedProx usually achieves the best accuracy. In the feature distribution skew case, SCAFFOLD usually achieves the best accuracy. Second, in some cases (e.g., p_k ∼ Dir(0.5), feature distribution skew and quantity skew), the improvement of the three non-IID FL algorithms over FedAvg is insignificant (smaller than 1%). Third, when #C = 1, FedProx can significantly outperform FedAvg, SCAFFOLD and FedNova. Fourth, the accuracy of SCAFFOLD is quite unstable. It can significantly outperform the other approaches in some cases (e.g., Dir(0.5) and #C = 1 on CIFAR-10). However, it may also have much worse accuracy than the other approaches (e.g., #C = 1 and #C = 2 on SVHN). Last, FedNova does not show much superiority compared with other FL algorithms. Compared with the accuracy of FedAvg on the homogeneous partition, there is still a lot of room for improvement in the non-IID setting.

3) Comparison among different tasks:

Finding (3): CIFAR-10 and tabular datasets are challenging tasks under non-IID settings. MNIST is a simple task under most non-IID settings where the studied algorithms perform similarly well.

Among the nine datasets, heterogeneity significantly degrades the accuracy of FL algorithms on CIFAR-10 and the tabular datasets, while its influence is smaller on the other datasets. Among the image datasets, the classification task on CIFAR-10 is more complex than the others in a centralized setting. Thus, when each party only has a skewed subset, the task becomes more challenging and the accuracy is worse. Also, it is interesting that none of the four algorithms handles tabular datasets well in the non-IID setting. The accuracy loss is quite large, especially for the label distribution skew case. We suggest that challenging tasks like CIFAR-10 and rcv1 should be included in the benchmark for distributed data silos.

B. Communication Efficiency

Finding (4): FedProx has almost the same convergence speed compared with FedAvg, while SCAFFOLD and FedNova are more unstable in training.

Figure 8 shows the training curves of the studied algorithms on CIFAR-10. Here we try two different partitioning strategies that cover label skew and feature skew. For the results on other partitioning strategies and other datasets, please refer to Appendix A of our technical report [38]. For FedProx, we show the curve with the best μ. First, for the #C = 1 setting, FedAvg and FedProx are very unstable, while SCAFFOLD and FedNova fail to improve at all as the number of rounds increases. Second, for the q ∼ Dir(0.5) setting, FedNova is quite unstable and the accuracy changes rapidly as the number of communication rounds increases. Moreover, FedProx is very close to FedAvg during the whole training process in many cases. Since the best μ is always small, the regularization term in FedProx has little influence on the training. Thus, FedProx and FedAvg usually have similar convergence speed and final accuracy. How to achieve stable learning and fast convergence is still an open problem on non-IID data.

Fig. 9. The test accuracy with different numbers of local epochs on CIFAR-10. Panels: (a) p_k ∼ Dir(0.5); (b) x̂ ∼ Gau(0.1).

C. Robustness to Local Updates

Finding (5): The number of local epochs can have a large effect on the accuracy of existing algorithms. The optimal value of the number of local epochs is very sensitive to non-IID distributions.

We vary the number of local epochs from {10, 20, 40, 80} and report the final accuracy on CIFAR-10 in Figure 9. Please refer to Appendix B of our technical report [38] for the results of other settings and datasets. On the one hand, we can find that the number of local epochs has a large effect on the accuracy of FL algorithms. For example, when #C = 2, the accuracy of all algorithms generally degrades significantly when the number of local epochs is set to 80. On the other hand, the optimal number of local epochs differs in different settings. For example, when #C = 1 and #C = 2, the optimal number of local epochs is 20 for FedAvg, and is 10 on the

settings p_k ∼ Dir(0.5) and #C = 3. In summary, existing algorithms are not robust enough against large local updates. Non-IID distributions have to be considered to determine the best number of local epochs.

D. Party Sampling

Finding (6): In the partial participation setting, SCAFFOLD cannot work effectively, while the other FL algorithms have a very unstable accuracy during training.

In some scenarios, not all the data silos will participate in the entire training process. In such a setting, the sampling technique is usually applied (Line 6 of Algorithm 1). To simulate this scenario, we set the number of parties to 100 and the sample fraction to 0.1. We run experiments on CIFAR-10 and the results are shown in Figure 10. Please refer to Appendix C of the technical report [38] for the results with other partitioning strategies. We can find that the training curves are quite unstable in most non-IID settings. Due to the sampling technique, the local distributions among different rounds can vary, and thus the averaged gradients may have very different directions among rounds. Moreover, we can find that SCAFFOLD has poor accuracy in all settings. Since the frequency of updating local control variates (Lines 23-25 of Algorithm 2) is low, the estimation of the update direction using the control variates may be very inaccurate.

Fig. 10. The training curves of different approaches on CIFAR-10 with 100 parties and sample fraction 0.1. Panels: (a) p_k ∼ Dir(0.5); (b) q ∼ Dir(0.5).

E. Scalability

Finding (7): The accuracy of all approaches decreases when increasing the number of parties.

We study the effect of the number of clients on the studied approaches, as shown in Figure 11. Here we run all approaches for 50 rounds. We can observe that the accuracy decreases significantly when increasing the number of clients. When the number of parties is large, the amount of local data is small and it is easy to overfit in the local training stage. How to design effective and communication-efficient algorithms for a large-scale setting with small data in each client is still an open problem.

Fig. 11. The test accuracy with different numbers of parties on CIFAR-10. Panels: (a) p_k ∼ Dir(0.5); (b) x̂ ∼ Gau(0.1).

F. Efficiency

Finding (8): The computation overhead of FedProx is large compared with FedAvg. Moreover, the communication cost of SCAFFOLD is twice that of FedAvg.

To compare the efficiency of different FL algorithms, we show the overall computation time and communication costs of each approach in Table IV. We can observe that the computation costs of FedAvg, SCAFFOLD, and FedNova are close. FedProx has a much higher computation cost than the other algorithms. From Algorithm 1, FedProx directly modifies the objective, which causes additional computation overhead in the gradient descent of each batch. FedNova and SCAFFOLD only introduce a very small number of addition and multiplication operations each round, which is negligible. For the communication costs, since SCAFFOLD needs to communicate control variates in each round as shown in Algorithm 2, its communication cost is twice that of the other algorithms.

TABLE IV
The computation time (second) and communication size (MB) per round of different approaches.

| Metric | Approach | MNIST | CIFAR-10 | adult | rcv1 |
|---|---|---|---|---|---|
| Computation time | FedAvg | 73s | 193s | 15s | 66s |
| Computation time | FedProx | 133s | 233s | 44s | 76s |
| Computation time | SCAFFOLD | 77s | 197s | 14s | 66s |
| Computation time | FedNova | 73s | 189s | 17s | 65s |
| Communication size | FedAvg | 1.95MB | 2.73MB | 0.20MB | 66.54MB |
| Communication size | FedProx | 1.95MB | 2.73MB | 0.20MB | 66.54MB |
| Communication size | SCAFFOLD | 3.91MB | 5.46MB | 0.41MB | 133.08MB |
| Communication size | FedNova | 1.95MB | 2.73MB | 0.20MB | 66.54MB |

G. Mixed Types of Skew

Finding (9): FL is more challenging when there exist mixed types of skew among the local data.

In practice, there may exist mixed types of skew among parties. Here we combine multiple partitioning strategies to generate such cases. We try two different settings: 1) we first divide the whole dataset among the parties by the distribution-based label imbalance partitioning strategy. Then, we add noise to the data of each party according to the noise-based feature imbalance strategy. Therefore, there exist both label distribution skew and feature distribution skew among the local data of different parties. 2) we first divide the whole dataset among the parties by the quantity imbalance partitioning strategy. Then, we add noise to the data of each party according to the noise-based feature imbalance strategy. Therefore, there exist both feature distribution skew and quantity skew among

the local data of different parties. The results are shown in Table V.

TABLE V
The performance of different approaches with different imbalance cases on CIFAR-10.

| Case | Setting | FedAvg | FedProx | SCAFFOLD | FedNova |
|---|---|---|---|---|---|
| Case 1 | label skew | 68.2% | 67.9% | 69.8% | 66.8% |
| Case 1 | feature skew | 68.9% | 69.3% | 70.1% | 68.5% |
| Case 1 | label and feature skew | 66.1% | 64.8% | 67.8% | 65.9% |
| Case 2 | feature skew | 68.9% | 69.3% | 70.1% | 68.5% |
| Case 2 | quantity skew | 72.0% | 71.2% | 62.4% | 10.0% |
| Case 2 | feature and quantity skew | 69.1% | 69.2% | 62.2% | 10.0% |

For the first case, we can observe that the accuracies of all approaches degrade when there exist mixed types of skew compared with a single type of skew, which is reasonable since both label imbalance and feature imbalance bring challenges in the training process.

For the second case, since quantity skew does not affect the accuracy of FedAvg and FedProx, their accuracy in the combined feature and quantity skew setting is close to their accuracy in the feature skew setting. However, for SCAFFOLD and FedNova, the accuracy in the combined feature and quantity skew setting is poor since quantity skew degrades their accuracy significantly.

Overall, as we observe more significant model quality degradation in mixed non-IID settings, it is important to design algorithms for settings with mixed types of skew, which are common in reality. For example, the images taken in different areas have different label distributions, while the feature distributions also differ due to the cameras (e.g., contrast).

H. Insights on the Experimental Results

We summarize the insights from the experimental studies as follows.
• The design and evaluation of future FL algorithms should consider more comprehensive settings, including different non-IID data partitioning strategies and tasks. There is not a single studied algorithm that consistently outperforms the other algorithms or has a good performance in all settings. Thus, it is still a promising research direction to address issues in distributed data silos with FL.
• Accuracy and communication efficiency are two important metrics in the evaluation of FL algorithms under non-IID data settings. Our study demonstrates the trade-off between them, and also the stability of those two metrics in the training process.
• FL introduces new training factors (e.g., number of local epochs, batch normalization, party sampling, number of parties) compared with centralized training due to the non-IID data setting, while some training factors behave similarly to centralized training (e.g., batch size). These challenging factors deserve more attention in the evaluation of future FL studies.
• Mixed types of skew bring more challenges than a single type of skew. As we observe more significant model quality degradation in mixed non-IID settings, it is important to investigate effective algorithms that work on multiple types of skew, which is more practical in reality.

VI. FUTURE DIRECTIONS

We present the following promising future directions for data management and federated learning on non-IID distributed databases.

A. Opportunities for data management

Integration with learned database systems: Existing learned systems are mostly based on centralized databases, such as learned index structures [12], [53] and learned cost estimation [24], [54]. We believe that, as the concerns on data privacy and data regulation grow, we will see more distributed databases, and existing learned systems and algorithms will need to be revisited. For example, it could be very interesting to enable federated search and develop learned index structures for multiple “data silos” without exchanging the local data.

Light-weight data techniques for profiling non-IID data: From our experimental study, different non-IID distributions have a large effect on the accuracy and stability of FL algorithms. Thus, it would be helpful if we could know the non-IID distribution in advance, before conducting FL. This makes a decade of database research relevant, such as data sampling [7] and sketching [20]. Another potential approach is to use metadata to represent the non-IID distributions. However, it is still an open problem how to extend current statistics estimation (such as cardinality estimation) to non-IID distributions.

Non-IID resistant sampling for partial participation: As in Finding (6), the sampling approach can bring instability in FL. Instead of random sampling, selective sampling according to the data distribution features of the parties may significantly increase the learning stability. One inspiration is from the skew-resistant data techniques [18], [36], which can potentially be extended to partial participation in FL training. Moreover, stratified sampling [58] can be a good solution. By classifying the parties into subgroups, representative parties can be selected in each round in a more balanced way [1].

Privacy-preserving data mining: Although there is no raw data transfer in FL, the model may still leak sensitive information about the training data due to possible inference attacks [17], [67]. Thus, techniques such as differential privacy [14] are useful to protect the local databases. How to decrease the accuracy loss while ensuring the differential privacy guarantee is a challenging research direction.

Query on Federated Databases: As we focus on distributed databases due to privacy concerns, federated databases [66] also need to be revisited. On the one hand, how to combine SQL queries with machine learning on federated databases is an important problem. On the other hand, how to preserve data privacy while supporting both query and learning on federated databases also needs to be investigated.

B. Opportunities for better FL design

A Party with a Single Label: From Table III, the accuracy of FL algorithms is very bad if each party only has data of a

single label. This setting is seemingly unrealistic. However, it has many real-world applications in practice. For example, we can use FL to train a speaker recognition model, while each mobile device only has the voices of its single user.

Fast Training: From Figure 8, the training speeds of existing FL algorithms are usually close to each other. FedProx, SCAFFOLD, and FedNova do not show much superiority in communication efficiency. To improve the training speed, researchers can work on the following two directions. One possible solution is to develop communication-efficient FL algorithms with only a few rounds. There are some studies [21], [40] that propose FL algorithms using a single communication round. In their studies, a public dataset is needed, which may potentially limit the applications. Another possible solution is to develop a fast initialization approach to reduce the number of rounds while achieving the same accuracy for FL. In the experiments of a previous study [40], the authors show that their approach is also promising if applied as an initialization step.

Automated Parameter Tuning for FL: FL algorithms suffer from large local updates. The number of local epochs is an important parameter in FL. While one traditional way is to develop approaches that are robust to the local updates, another way is to design efficient parameter tuning approaches for FL. A previous paper [9] studies Bayesian optimization in the federated setting, which can be used to search hyper-parameters. Approaches for setting the number of local epochs need to be investigated.

Towards Robust Algorithms against Different Non-IID Settings: As in Finding (2), no algorithm consistently performs the best in all settings. It is a natural question whether and how we can develop a robust algorithm for different non-IID settings. We may have to first investigate the common characteristics of FL processes under different non-IID settings. The intuitions of existing algorithms are the same: the local model updates towards the local optima, and the averaged model is far from the global optima. We believe the design of FL algorithms under non-IID settings can be improved if we can observe more detailed and common behaviours in the training.

Aggregation of Heterogeneous Batch Normalization: From our Finding (7), simple averaging is not a good choice for batch normalization. Since the batch normalization in each party records the statistics of the local data distribution, there is also heterogeneity among the batch normalization layers of different parties. The averaged batch normalization layer may not capture the local distribution after being sent back to the parties. A possible solution is to only average the learned parameters but leave the statistics (i.e., mean and variance) alone [4]. More specialized designs for particular layers in deep learning need to be investigated.

VII. RELATED WORK

Although the existing study [32] provides non-IID data cases, it does not provide the partitioning strategies to generate the corresponding non-IID data distributions. We go beyond the previous study and summarize six different partitioning strategies to generate three non-IID data distribution cases. Among these six partitioning strategies, the two partitioning strategies in Section IV-A-b and Section IV-B-c are adopted from existing FL studies due to their popularity, while the other four effective partitioning strategies are designed by our study. Next, we introduce these partitioning strategies in detail.

There are some existing benchmarks for federated learning [6], [26], [28], [48]. LEAF [6] provides some realistic federated datasets including images and texts. Specifically, LEAF partitions the existing datasets according to their data sources, e.g., partitioning the data in Extended MNIST [8] based on the writer of the digit or character. OARF [28] proposes federated datasets by combining multiple related real-world public datasets. Moreover, it provides various metrics including utility, communication overhead, and privacy loss, and mimics the federated systems in the real world. However, neither LEAF nor OARF provides an algorithm-level comparison. FedML [26] provides reference implementations of federated learning algorithms such as FedAvg, FedNova [71] and FedOpt [61]. There are no new datasets, metrics, or settings in FedML. FLBench [48] is proposed for the isolated data island scenario. Its framework covers domains including medical, finance, and AIoT. However, currently, FLBench is not open-sourced and it does not provide any experiments. The above benchmarks do not provide analysis of existing federated learning algorithms on different non-IID settings, which is our focus in this paper. To the best of our knowledge, there is one existing benchmark [50] for federated learning on the non-IID data setting. However, it only provides two partitioning approaches: random split and split by labels. In this paper, we provide comprehensive partitioning strategies and datasets to cover different non-IID settings. Moreover, we conduct extensive experiments to compare and analyze existing federated learning algorithms.

VIII. CONCLUSION

There has been a growing interest in exploiting distributed databases (e.g., in different organizations and countries) to improve the effectiveness of machine learning services. In this paper, we study non-IID data as one key challenge in such distributed databases, and develop a benchmark named NIID-bench. Specifically, we introduce six data partitioning strategies which are much more comprehensive than the previous studies. Furthermore, we conduct comprehensive experiments to compare existing algorithms and demonstrate their strengths and weaknesses. This study sheds light on some future directions to build effective machine learning services on distributed databases.

ACKNOWLEDGEMENTS

This research is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-RP-2020-018). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of National Research Foundation, Singapore. Qinbin is also in part supported by a Google PhD Fellowship.

REFERENCES

[1] S. AbdulRahman, H. Tout, A. Mourad, and C. Talhi. Fedmccs: multicriteria client selection model for optimal iot federated learning. IEEE Internet of Things Journal, 8(6):4723–4735, 2020.
[2] D. A. E. Acar, Y. Zhao, R. Matas, M. Mattina, P. Whatmough, and V. Saligrama. Federated learning based on dynamic regularization. In International Conference on Learning Representations, 2021.
[3] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 439–450, 2000.
[4] M. Andreux, J. O. du Terrail, C. Beguier, and E. W. Tramel. Siloed federated learning for multi-centric histopathology datasets. In Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning, pages 129–139. Springer, 2020.
[5] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. M. Kiddon, J. Konečný, S. Mazzocchi, B. McMahan, T. V. Overveldt, D. Petrou, D. Ramage, and J. Roselander. Towards federated learning at scale: System design. In SysML, 2019.
[6] S. Caldas, S. M. K. Duddu, P. Wu, T. Li, J. Konečný, H. B. McMahan, V. Smith, and A. Talwalkar. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018.
[7] S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for histogram construction: How much is enough? ACM SIGMOD Record, 27(2):436–447, 1998.
[8] G. Cohen, S. Afshar, J. Tapson, and A. Van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2921–2926. IEEE, 2017.
[9] Z. Dai, B. K. H. Low, and P. Jaillet. Federated bayesian optimization via thompson sampling. Advances in Neural Information Processing Systems, 33, 2020.
[10] Y. Deng, M. M. Kamani, and M. Mahdavi. Distributionally robust federated averaging. Advances in Neural Information Processing Systems, 33, 2020.
[11] Diemert Eustache, Meynet Julien, P. Galland, and D. Lefortier. Attribution modeling increases efficiency of bidding in display advertising. In Proceedings of the AdKDD and TargetAd Workshop, KDD, Halifax, NS, Canada, August, 14, 2017, page To appear. ACM, 2017.
[12] J. Ding, U. F. Minhas, J. Yu, C. Wang, J. Do, Y. Li, H. Zhang, B. Chandramouli, J. Gehrke, D. Kossmann, D. Lomet, and T. Kraska. Alex: An updatable adaptive learned index. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD ’20, pages 969–984, New York, NY, USA, 2020. Association for Computing Machinery.
[13] C. T. Dinh, N. H. Tran, and T. D. Nguyen. Personalized federated learning with moreau envelopes. Advances in Neural Information Processing Systems, 2020.
[14] C. Dwork. Differential privacy. Encyclopedia of Cryptography and Security, pages 338–340, 2011.
[15] A. Fallah, A. Mokhtari, and A. Ozdaglar. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. Advances in Neural Information Processing Systems, 33, 2020.
[16] X. Y. Felix, A. S. Rawat, A. K. Menon, and S. Kumar. Federated learning with only positive labels. arXiv preprint arXiv:2004.10342, 2020.
[17] M. Fredrikson, S. Jha, and T. Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1322–1333. ACM, 2015.
[18] S. Ganguly, P. B. Gibbons, Y. Matias, and A. Silberschatz. Bifocal sampling for skew-resistant join size estimation. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD ’96, pages 271–281, New York, NY, USA, 1996. Association for Computing Machinery.
[19] R. C. Geyer, T. Klein, and M. Nabi. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557, 2017.
[20] A. C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Fast, small-space algorithms for approximate histogram maintenance. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 389–398, 2002.
[21] N. Guha, A. Talwalkar, and V. Smith. One-shot federated learning. arXiv preprint arXiv:1902.11175, 2019.
[22] F. Hanzely, S. Hanzely, S. Horváth, and P. Richtárik. Lower bounds and optimal algorithms for personalized federated learning. Advances in Neural Information Processing Systems, 2020.
[23] A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.
[24] S. Hasan, S. Thirumuruganathan, J. Augustine, N. Koudas, and G. Das. Deep learning models for selectivity estimation of multi-attribute queries. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD ’20, pages 1035–1050, New York, NY, USA, 2020. Association for Computing Machinery.
[25] C. He, M. Annavaram, and S. Avestimehr. Group knowledge transfer: Federated learning of large cnns at the edge. Advances in Neural Information Processing Systems, 33, 2020.
[26] C. He, S. Li, J. So, M. Zhang, H. Wang, X. Wang, P. Vepakomma, A. Singh, H. Qiu, L. Shen, et al. Fedml: A research library and benchmark for federated machine learning. arXiv preprint arXiv:2007.13518, 2020.
[27] T.-M. H. Hsu, H. Qi, and M. Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
[28] S. Hu, Y. Li, X. Liu, Q. Li, Z. Wu, and B. He. The oarf benchmark suite: Characterization and implications for federated learning systems. arXiv preprint arXiv:2006.07856, 2020.
[29] J. Huang. Maximum likelihood estimation of dirichlet distribution parameters. CMU Technique Report, 2005.
[30] N. Hynes, D. Dao, D. Yan, R. Cheng, and D. Song. A demonstration of sterling: A privacy-preserving data marketplace. Proceedings of the VLDB Endowment, 11(12):2086–2089, 2018.
[31] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems, 26:315–323, 2013.
[32] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
[33] G. A. Kaissis, M. R. Makowski, D. Rückert, and R. F. Braren. Secure, privacy-preserving and federated machine learning in medical imaging. Nature Machine Intelligence, pages 1–7, 2020.
[34] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh. Scaffold: Stochastic controlled averaging for on-device federated learning. In Proceedings of the 37th International Conference on Machine Learning. PMLR, 2020.
[35] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[36] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC ’10, pages 75–86, New York, NY, USA, 2010. Association for Computing Machinery.
[37] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[38] Q. Li, Y. Diao, Q. Chen, and B. He. Federated learning on non-iid data silos: An experimental study. arXiv preprint arXiv:2102.02079, 2021.
[39] Q. Li, B. He, and D. Song. Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[40] Q. Li, B. He, and D. Song. Practical one-shot federated learning for cross-silo setting. IJCAI, 2021.
[41] Q. Li, Z. Wen, and B. He. Practical federated gradient boosting decision trees. In AAAI, pages 4642–4649, 2020.
[42] Q. Li, Z. Wen, Z. Wu, S. Hu, N. Wang, and B. He. A survey on federated learning systems: Vision, hype and reality for data privacy and protection. arXiv preprint arXiv:1907.09693, 2019.
[43] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith. Federated learning: Challenges, methods, and future directions. arXiv preprint arXiv:1908.07873, 2019.
[44] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith. Federated optimization in heterogeneous networks. In MLSys, 2020.
[45] T. Li, J. Zhong, J. Liu, W. Wu, and C. Zhang. [Link]: Towards multi-tenant resource sharing for machine learning workloads. 11(5):607–620, Jan. 2018.

[46] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang. On the convergence of fedavg on non-iid data. In International Conference on Learning Representations, 2020.
[47] X. Li, M. Jiang, X. Zhang, M. Kamp, and Q. Dou. FedBN: Federated learning on non-IID features via local batch normalization. In International Conference on Learning Representations, 2021.
[48] Y. Liang, Y. Guo, Y. Gong, C. Luo, J. Zhan, and Y. Huang. An isolated data island benchmark suite for federated learning. arXiv preprint arXiv:2008.07257, 2020.
[49] T. Lin, L. Kong, S. U. Stich, and M. Jaggi. Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems, 33, 2020.
[50] L. Liu, F. Zhang, J. Xiao, and C. Wu. Evaluation framework for large-scale federated learning. arXiv preprint arXiv:2003.01575, 2020.
[51] Y. Liu, Y. Kang, C. Xing, T. Chen, and Q. Yang. A secure federated transfer learning framework. IEEE Intelligent Systems, 2020.
[52] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
[53] R. Marcus, A. Kipf, A. van Renen, M. Stoian, S. Misra, A. Kemper, T. Neumann, and T. Kraska. Benchmarking learned indexes. Proc. VLDB Endow., 14(1):1–13, Sept. 2020.
[54] R. Marcus, P. Negi, H. Mao, C. Zhang, M. Alizadeh, T. Kraska, O. Papaemmanouil, and N. Tatbul. Neo: A learned query optimizer. Proc. VLDB Endow., 12(11):1705–1718, July 2019.
[55] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
[56] M. Mohri, G. Sivek, and A. T. Suresh. Agnostic federated learning. In International Conference on Machine Learning, pages 4615–4625. PMLR, 2019.
[57] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[58] J. Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. In Breakthroughs in statistics, pages 123–150. Springer, 1992.
[59] C. Niu, Z. Zheng, F. Wu, X. Gao, and G. Chen. Trading data in good faith: Integrating truthfulness and privacy preservation in data markets. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pages 223–226. IEEE, 2017.
[60] X. Peng, Z. Huang, Y. Zhu, and K. Saenko. Federated adversarial domain adaptation. In International Conference on Learning Representations, 2020.
[61] S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
[62] A. Reisizadeh, F. Farnia, R. Pedarsani, and A. Jadbabaie. Robust federated learning: The case of affine distribution shifts. Advances in Neural Information Processing Systems, 2020.
[63] S. J. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In VLDB’02: Proceedings of the 28th International Conference on Very Large Databases, pages 682–693. Elsevier, 2002.
[64] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
[65] S. Shastri, V. Banakar, M. Wasserman, A. Kumar, and V. Chidambaram. Understanding and benchmarking the impact of gdpr on database systems. Proc. VLDB Endow., 13(7):1064–1077, Mar. 2020.
[66] A. P. Sheth and J. A. Larson. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys (CSUR), 22(3):183–236, 1990.
[67] R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE, 2017.
[68] M. J. Smith, C. Sala, J. M. Kanter, and K. Veeramachaneni. The machine learning bazaar: Harnessing the ml ecosystem for effective system development. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD ’20, pages 785–800, New York, NY, USA, 2020. Association for Computing Machinery.
[69] P. Voigt and A. Von dem Bussche. The eu general data protection regulation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International Publishing, 2017.
[70] H. Wang, M. Yurochkin, Y. Sun, D. Papailiopoulos, and Y. Khazaeni. Federated learning with matched averaging. In International Conference on Learning Representations, 2020.
[71] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems, 33, 2020.
[72] L. Wang, S. Xu, X. Wang, and Q. Zhu. Addressing class imbalance in federated learning. In AAAI, 2021.
[73] W. Wang, J. Gao, M. Zhang, S. Wang, G. Chen, T. K. Ng, B. C. Ooi, J. Shao, and M. Reyad. Rafiki: Machine learning as an analytics service system. Proc. VLDB Endow., 12(2):128–140, Oct. 2018.
[74] Y. Wu, S. Cai, X. Xiao, G. Chen, and B. C. Ooi. Privacy preserving vertical federated learning for tree-based models. Proceedings of the VLDB Endowment, 2020.
[75] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[76] Q. Yang, Y. Liu, T. Chen, and Y. Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019.
[77] M. Yurochkin, M. Agarwal, S. Ghosh, K. Greenewald, N. Hoang, and Y. Khazaeni. Bayesian nonparametric federated learning of neural networks. In Proceedings of the 36th International Conference on Machine Learning. PMLR, 2019.
[78] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE transactions on image processing, 26(7):3142–3155, 2017.

