Research Paper1
Research Paper1
Abstract—Due to the increasing privacy concerns and data centralized to a single country due to the data regulations in
2022 IEEE 38th International Conference on Data Engineering (ICDE) | 978-1-6654-0883-7/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICDE53745.2022.00077
966
Authorized licensed use limited to: SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECH. Downloaded on September 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
IID setting non-IID setting ∗ Algorithm 1: A summary of FL algorithms including
FedAvg/FedProx/FedNova. We use red and orange
∗
colors to mark the part specially included in FedProx
∗ ∗
∗ ∗ and FedNova, respectively.
Input: local datasets Di , number of parties N , number
local model global model local optima global optima of communication rounds T , number of local
epochs E, learning rate η
Fig. 2. Example of a drift under the non-IID setting. Output: The final model wT
1 Server executes:
exists a drift in the local updates [34]. In other words, in 2 initialize x0
the local training stage, each model is updated towards its 3 for t = 0, 1, ..., T − 1 do
own local optima, which can be far from the global optima. 4 Sample a set of parties St
The averaged model may also be far from the global optima 5 n ← i∈St |Di |
especially when the local updates are large (e.g., a large 6 for i ∈ St in parallel do
number of local epochs) [34], [44], [70], [71]. Eventually, the 7 send the global model wt to party Pi
converged global model has much worse accuracy than IID 8 Δwit , τi ← LocalTraining(i, wt )
setting. Figure 2 demonstrates the issue of FedAvg under the 9 For FedAvg/FedProx:
non-IID data setting. Under the IID setting, the global optima |D i |
wt+1 ← wt − η i∈St n Δwk
t
w∗ is close to the local optima w1∗ and w2∗ . Thus, the averaged
10 For FedNova:
model wt+1 is also close to the global optima. However, under
|D i |τi |D i |Δwit
the non-IID setting, since w∗ is far from w1∗ , wt+1 can be far wt+1 ← wt − η i∈St
n i∈St nτi
from w∗ . It is challenging to design an effective FL algorithm 11 return wT
under the non-IID setting. We will present the FL algorithms
on handling non-IID data in the next section. 12 Party executes:
13 For FedAvg/FedNova: L(w; b) = (x,y)∈b (w; x; y)
III. FL A LGORITHMS ON N ON -IID DATA 14 For FedProx:
2
There have been some studies [34], [44], [71] trying to L(w; b) = (x,y)∈b (w; x; y)+ μ2 w − wt
address the drift issue in FL. Here we summarize several state- 15 LocalTraining(i, wt ):
of-the-art and popular approaches as shown in Algorithm 1 16 wit ← wt
(FedAvg [55], FedProx [44], FedNova [71]) and Algorithm 17 τi ← 0
2 (SCAFFOLD [34]). These approaches are all based on 18 for epoch k = 1, 2, ..., E do
FedAvg, and we use colors to mark the parts that specially 19 for each batch b = {x, y} of Di do
designed in FedProx (red), SCAFFOLD (blue), and FedNova 20 wit ← wit − η∇L(wit ; b)
(orange). Note that the studied approaches have the same 21 τi ← τi + 1
objective, i.e., learning an effective global model under the
22 Δwit ← wt − wit
non-IID data setting. There are also other FL studies related
23 return Δwit , τi to the server
to non-IID data setting, such as personalizing the local models
for each party [13], [15], [22] and designing robust algorithms
against different combinations of local distributions [10], [56],
[62], which are out of the scope of this paper. is too small, then the regularization term has almost no effect.
If μ is too big, then the local updates are very small and the
A. FedProx convergence speed is slow.
FedProx [44] improves the local objective based on FedAvg.
It directly limits the size of local updates. Specifically, as B. FedNova
shown in Line 14 of Algorithm 1, it introduces an additional Another recent study, FedNova [71], improves FedAvg in
L2 regularization term in the local objective function to limit the aggregation stage. It considers that different parties may
the distance between the local model and the global model. conduct different numbers of local steps (i.e., the number of
This is a straightforward way to limit the local updates so mini-batches in the local training) each round. This can happen
that the averaged model is not so far from the global optima. when parties have different computation power given the same
A hyper-parameter μ is introduced to control the weight of time constraint or parties have different local dataset size given
the L2 regularization. Overall, the modification to FedAvg is the same number of local epochs and batch size. Intuitively, the
lightweight and easy to implement. FedProx introduces addi- parties with a larger number of local steps will have a larger
tional computation overhead and does not introduce additional local update, which will have a more significant influence on
communication overhead. However, one drawback is that users the global updates if simply averaged. Thus, to ensure that the
may need to carefully tune μ to achieve good accuracy. If μ global updates are not biased, FedNova normalizes and scales
967
Authorized licensed use limited to: SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECH. Downloaded on September 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
the local updates of each party according to their number of Algorithm 2: The SCAFFOLD algorithm. We use
local steps before updating the global model (see Line 10 blue color to mark the part specially included in
of Algorithm 1). FedNova also only introduces lightweight SCAFFOLD compared with FedAvg.
modifications to FedAvg, and negligible computation overhead Input: same as Algorithm 1
when updating the global model. Output: The final model wT
C. SCAFFOLD 1 Server executes:
2 initialize x0
SCAFFOLD [34] models non-IID as introducing variance
3 ct ← 0
among the parties and applies the variance reduction technique
4 for t = 0, 1, ..., T − 1 do
[31], [64]. It introduces control variates for the server (i.e., c)
and parties (i.e., ci ), which are used to estimate the update
5
sample a set of parties St
Randomly
6 n ← i∈St |Di |
direction of the server model and the update direction of each
7 for i ∈ St in parallel do
client. Then, the drift of local training is approximated by the
8 send the global model wt to party Pi
difference between these two update directions. Thus, SCAF-
Δwit , Δc ← LocalTraining(i, wt , ct )
FOLD corrects the local updates by adding the drift in the i
local training (Line 20 of Algorithm 2). SCAFFOLD proposes 9 wt+1 ← wt − η i∈St |Dn | Δwkt
two approaches to update the local control variates (Line 23 of 10 ct+1 ← ct + N1 Δc
Algorithm 2), by computing the gradient of the local data at the 11 return wT
global model or by reusing the previously computed gradients.
The second approach has a lower computation cost while 12 Party executes:
the first one may be more stable. Compared with FedAvg, 13 L(w; b) = (x,y)∈b (w; x; y)
intuitively, SCAFFOLD doubles the communication size per 14 ci ← 0
round due to the additional control variates. 15 LocalTraining(i, wt , ct ):
16 wit ← wt
D. Other Studies 17 τi ← 0
18 for epoch k = 1, 2, ..., E do
When preparing this paper, there are other contemporary
19 for each batch b = {x, y} of Di do
works [2], [39], [47], [72] on federated learning under non-IID
20 wit ← wit − η(∇L(wit ; b)−cti + c)
setting. [2] proposes FedDyn, which adds a regularization term
21 τi ← τi + 1
in the local training based on the global model and the model
from the previous round. [47] proposes FedBN for feature 22 Δwit ← wt − wit
1
shift non-IID setting, where the client batch-norm layers are 23 c∗i ← (i)∇L(wit ), or(ii)ci − c + τi η (w
t
− wit )
updated locally without communicating to the server. [72] 24 Δc ← c∗i − ci
applies a monitor to detect class imbalance in the training 25 ci ← c∗i
process, and proposes a new loss function to address it. [39] 26 return Δwit , Δc to the server
proposes model-contrastive learning. Their approach corrects
the local training by comparing the representations learned
by the current local model, the local model from the previous
IV. S IMULATING N ON -IID DATA S ETTING
round, and the global model. We leave the comparison between
these studies as future studies. As existing studies only adopt limited partitioning strategies,
they cannot represent a comprehensive view of non-IID cases.
E. Motivation of this study To bridge this gap, we develop a benchmark named NIID-
Bench.
Non-IID is a key and common data challenge for developing
effective federated learning algorithms. Although previous A. Research Problems
studies [34], [44], [71] have demonstrated preliminary and We need to address two key research problems. The first one
promising results over FedAvg on non-IID data, as we will is on data sets: whether to use real-world non-IID datasets
summarize in Table I in later section, all above studies have or synthetic datasets. The second one is on how to design
evaluated only one or two non-IID distributions, and tried comprehensive non-IID scenarios.
rigid data partitioning strategies in the experiments. There is For the first problem, we choose to synthesize the distributed
still no standard benchmark or a systematic study to evaluate non-IID datasets by partitioning a real-world dataset into
the effectiveness of these FL algorithms. This motivates us to multiple smaller subsets. Many existing studies [34], [55], [71]
develop a benchmark with more comprehensive data distribu- use the partitioning approach to simulate the non-IID federated
tions as well as data partitioning strategies, and then we can setting. Compared with using real federated datasets [6], [28],
evaluate the pros and cons of existing algorithms and outline adopting partitioning strategies has the following advantages.
the challenges and opportunities for future federated learning First, while it is challenging to evaluate the imbalance prop-
on non-IID data. erties (e.g., imbalanced level and imbalanced case) in real
968
Authorized licensed use limited to: SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECH. Downloaded on September 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
federated datasets, partitioning strategies can easily quantify
and control the imbalance properties of the local data. Thus,
researchers can easily investigate the behavior of algorithms by
trying different imbalanced settings, which is essential to the
development of FL algorithms. Second, when using synthetic
datasets, one can easily set different factors (e.g., number of
parties, size of data) that are important in the FL experiments.
However, a real federated dataset usually corresponds to a (a) The label distribution for Criteo. The value in cell (a, b) is the
fixed federated setting. Last, due to data regulation and privacy amount of data samples of class b belonging to Party a.
concerns, meaningful real federated datasets are difficult to
obtain [28]. Even if we can obtain such real datasets, they
do not have the previous two advantages of synthetic data
sets. It is more flexible to develop partitioning strategies on
existing widely used public datasets, which already have lots
of centralized training knowledge as reference, as well as to
simulate different non-IID scenarios. There are also limitations
of using generated datasets compared with using real federated
datasets. The generated datasets may not fully capture the real
data distributions, which can be complicated and challenging
to quantify. Note that the usage of generated federated datasets
and real federated datasets are orthogonal. It is an interesting
future study to find and study meaningful real-world data sets
and application scenarios.
For the second problem, an existing study [32] gives a very (b) The feature distribution for Digits. The triangles are the
good and comprehensive summary on non-IID data cases from visualized features of SVHN and the circles are the visualized
a distribution perspective. Specifically, considering the local features of MNIST.
data distribution P (xi , yi ) = P (xi |yi )P (yi ) or P (xi , yi ) = Fig. 3. The non-IID properties of Criteo and Digits.
P (yi |xi )P (xi ), the previous study [32] summaries five dif-
ferent non-IID cases: (1) label distribution skew (i.e., P (yi )
is different among parties); (2) feature distribution skew (i.e., the label distribution as shown in Figure 3a. We can observe
P (xi ) is different among parties); (3) same label but different that there exists both label distribution skew (e.g., Party 0
features (i.e., P (xi |yi ) is different among parties); (4) same and Party 4) and quantity skew (e.g., Party 0 and Party 8)
features but different labels (i.e., P (yi |xi ) is different among among the parties. In Digits, taking each subset (e.g., MNIST
parties); (5) quantity skew (i.e., P (xi , yi ) is same but the and SVHN) as a party, we train a model using these subsets
amount of data is different among parties). Here the third and draw the feature distribution using t-SNE [52] as shown
case is mainly related to vertical FL (the parties share the in Figure 3b. For each class, althougth MNIST and SVHN
same sample IDs but different features). As mentioned in the have the same label, the feature distributions of MNIST and
third paragraph of Section I, we focus on horizontal FL in SVHN are significantly different from each other. Feature skew
this paper, where each party shares the same feature space but exists in the Digits dataset. These two examples show that the
owns different samples. The fourth case is not applicable in considered non-IID data cases are reasonable and practical.
most FL studies, which assume there is a common knowledge
P (y|x) among the parties to learn. Otherwise, techniques such B. Label Distribution Skew
as domain adaption [60] or personalized federated learning In label distribution skew, the label distributions P (yi ) vary
(i.e., each party learns a personalized local model) [13], [15] across parties. Such a case is common in practice. For ex-
can be applied in federated learning, which is out of the scope ample, some hospitals are more specialized in several specific
of our paper. Thus, we consider label distribution skew, feature kinds of diseases and have more patient records on them. To
distribution skew, and quantity skew as possible non-IID data simulate label distribution skew, we introduce two different
distribution cases in this paper. While the five non-IID data label imbalance settings: quantity-based label imbalance and
cases cover all possible single type of skew, there may be distribution-based label imbalance.
mixed types of skew, which we will discuss in Section V-G. a) Quantity-based label imbalance: Here each party
We use two real-world datasets, Criteo [11] and Digits owns data samples of a fixed number of labels. This is first
[60], to demonstrate the non-IID properties. Criteo contains introduced in the experiments of FedAvg [55], where the data
feature values and click feedback for millions of display ads, samples with the same label are divided into subsets and
which can be used for clickthrough rate prediction. Digits each party is only assigned 2 subsets with different labels.
contains multiple subsets for digit classification. In Criteo, Following FedAvg, such a setting is also used in many other
taking each user as a party, we select ten parties and draw studies [19], [44]. [16] considers a highly extreme case, where
969
Authorized licensed use limited to: SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECH. Downloaded on September 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
Fig. 6. The visualization of our FCUBE dataset. The data points within the
upper four cubes have label 0 and within the lower four cubes have label 1.
There are a total of eight cubes with four colors. The data points with the
same color are assigned to a party.
Fig. 4. An example of distribution-based label imbalance partition on MNIST
[37] dataset with β = 0.5. The value in each rectangle is the number of data
samples of a class belonging to a certain party.
C. Feature Distribution Skew
970
Authorized licensed use limited to: SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECH. Downloaded on September 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
TABLE I
T HE EXPERIMENTAL SETTINGS IN EXISTING STUDIES AND OUR BENCHMARK . N OTE THAT THE QUANTITY- BASED ,NOISED - BASED , AND QUANTITY
SKEW PARTITIONING STRATEGIES IN THE EXISTING STUDIES ARE DIFFERENT FORM THE STRATEGIES PROPOSED IN OUR STUDY.
Partitioning strategies FedAvg FedProx SCAFFOLD FedNova NIID-Bench
quantity-based
Label distribution skew
distribution-based
noise-based
Feature distribution skew synthetic
real-world
Quantity skew
971
Authorized licensed use limited to: SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECH. Downloaded on September 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
TABLE III
T HE TOP -1 ACCURACY OF DIFFERENT APPROACHES . W E RUN THREE TRIALS AND REPORT THE MEAN ACCURACY AND STANDARD DERIVATION . F OR
F ED P ROX , WE TUNE μ FROM {0.001, 0.01, 0.1, 1} AND REPORT THE BEST ACCURACY.
the label distribution skew influences the accuracy of FL best algorithm for FL. If the local datasets have almost the
algorithms most among all non-IID settings. There is room same data distribution but different sizes (e.g., databases with
for existing algorithms to be improved to handle scenarios different capacities), then FedProx is likely the appropriate
such as quantity-based label imbalance. algorithm. If there is no prior knowledge on the local datasets,
We draw a decision tree to summarize the suitable FL how to determine the distribution is a challenging problem and
algorithm for each non-IID setting as shown in Figure 7 more research efforts are needed (see Section VI-A).
according to our observations. This decision tree is helpful 2) Comparison among different algorithms:
for users to choose the algorithm for their learning according Finding (2): No algorithm consistently outperforms the other
to the non-IID distribution and the datasets. For example, if the algorithms in all settings. The state-of-the-art algorithms
local datasets are likely to have feature distribution skew (e.g., significantly outperform FedAvg only in several cases.
the digits from different writers), then SCAFFOLD may be the We have the following observations in aspect of different
972
Authorized licensed use limited to: SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECH. Downloaded on September 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
Non-IID data setting
SCAFFOLD FedProx
FedAvg/
Image datasets Tabular datasets
FedProx
(a) pk ∼ Dir(0.5) (b) x̂ ∼ Gau(0.1)
SCAFFOLD FedProx
Fig. 8. The training curves of different approaches on CIFAR-10.
Fig. 7. The decision tree to determine the (almost) best FL algorithm given
the non-IID setting.
973
Authorized licensed use limited to: SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECH. Downloaded on September 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
(a) pk ∼ Dir(0.5) (b) q ∼ Dir(0.5) (a) pk ∼ Dir(0.5) (b) x̂ ∼ Gau(0.1)
Fig. 10. The training curves of different approaches on CIFAR-10 with 100 Fig. 11. The test accuracy with different number of parties on CIFAR-10.
parties and sample fraction 0.1.
TABLE IV
T HE COMPUTATION TIME ( SECOND ) AND COMMUNICATION SIZE (MB)
settings pk ∼ Dir(0.5) and #C = 3. In summary, existing PER ROUND OF DIFFERENT APPROACHES .
algorithms are not robust enough against large local updates.
MNIST CIFAR-10 adult rcv1
Non-IID distributions have to be considered to determine the FedAvg 73s 193s 15s 66s
best number of local epochs. FedProx 133s 233s 44s 76s
D. Party Sampling SCAFFOLD 77s 197s 14s 66s
FedNova 73s 189s 17s 65s
Finding (6): In the partial participation setting, SCAFFOLD
FedAvg 1.95MB 2.73MB 0.20MB 66.54MB
cannot work effectively, while the other FL algorithms have
FedProx 1.95MB 2.73MB 0.20MB 66.54MB
a very unstable accuracy during training. SCAFFOLD 3.91MB 5.46MB 0.41MB 133.08MB
In some scenarios, not all the data silos will participate FedNova 1.95MB 2.73MB 0.20MB 66.54MB
the entire training process. In such a setting, the sampling
technique is usually applied (Line 6 of Algorithm 1). To
simulate this scenario, we set the number of parties to 100 and To compare the efficiency of different FL algorithms, we
the sample fraction to 0.1. We run experiments on CIFAR- show the overall computation time and communication costs
10 and the results are shown in Figure 10. Please refer to of each approach in Table IV. We can observe that the
Appendix C of the technical report [38] for the results with computation costs of FedAvg, SCAFFOLD, and FedNova are
other partitioning strategies. We can find that the training close. FedProx has a much higher computation cost than
curves are quite unstable in most non-IID settings. Due to the other algorithms. From Algorithm 1, FedProx directly
the sampling technique, the local distributions among different modifies the objective, which causes additional computation
rounds can vary, and thus the averaged gradients may have overhead in the gradient descent of each batch. FedNova and
very different directions among rounds. Moreover, we can find SCAFFOLD only introduce very small number of addition
that SCAFFOLD has a bad accuracy on all settings. Since the and multiplication operations each round, which is negligible.
frequency of updating local control variates (Lines 23-25 of For the communication costs, since SCAFFOLD needs to
Algorithm 2) is low, the estimation of the update direction communicate control variates in each round as shown in
may be very inaccurate using the control variates. Algorithm 2, its communication cost is twice of that of the
other algorithms.
E. Scalability
Finding (7): The accuracy of all approaches decrease when G. Mixed Types of Skew
increasing the number of parties. Finding (9): FL is more challenging when there exists mixed
We study the effect of number of clients on studied ap- types of skew among the local data.
proaches as shown in Figure 11. Here we run all approaches In practice, there may exist mixed types of skew among
for 50 rounds. We can observe that the accuracy decreases parties. Here we combine multiple partitioning strategies to
significantly when increasing the number of clients. When the generate such cases. We try two different settings: 1) we first
number of parties is large, the amount of local data is small divide the whole dataset into each party by the distribution-
and it is easy to overfit in the local training stage. How to based label imbalanced partitioning strategy. Then, we add
design effective and communication-efficient algorithms on a noises to the data of each party according to the noise-based
large-scale setting with small data in the client is still an open feature imbalance strategy. Therefore, there exists both label
problem. distribution skew and feature distribution skew among the local
data of different parties. 2) we first divide the whole dataset
F. Efficiency
into each party by the quantity imbalanced partitioning strat-
Finding (8): The computation overhead of FedProx is large egy. Then, we add noises to the data of each party according
compared with FedAvg. Moreover, the communication cost to the noise-based feature imbalance strategy. Therefore, there
of SCAFFOLD is twice of that of FedAvg. exists both feature distribution skew and quantity skew among
974
Authorized licensed use limited to: SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECH. Downloaded on September 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
TABLE V important to investigate effective algorithms working on
T HE PERFORMANCE OF DIFFERENT APPROACHES WITH DIFFERENT
IMBALANCE CASES ON CIFAR-10.
multiple types of skew, which is more practical in reality.
975
Authorized licensed use limited to: SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECH. Downloaded on September 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
single label. This setting is seemingly unrealistic. However, it Among these six partitioning strategies, the two partitioning
has many real-world applications in practice. For example, we strategies in Section IV-A-b and Section IV-B-c are adopted
can use FL to train a speaker recognition model, while each from existing FL studies due to their popularity, while the
mobile device only has the voices of its single user. other four effective partitioning strategies are designed by our
Fast Training: From Figure 8, the training speed of existing study. Next, we introduce these partitioning strategies in detail.
FL algorithms are usually close to each other. FedProx, There are some existing benchmarks for federated learn-
SCAFFOLD, and FedNova do not show much superiority on ing [6], [26], [28], [48]. LEAF [6] provides some realistic
the communication efficiency. To improve the training speed, federated datasets including images and texts. Specifically,
researchers can work on the following two directions. One LEAF partitions the existing datasets according to its data
possible solution is to develop communication-efficient FL al- recourses, e.g., partitioning the data in Extended MNIST [8]
gorithms with only a few rounds. There are some studies [21], based on the writer of the digit or character. OARF [28] pro-
[40] that propose FL algorithms using a single communication poses federated datasets by combining multiple related real-
round. In their studies, a public dataset is needed, which may world public datasets. Moreover, it provides various metrics
potentially limit the applications. Another possible solution is including utility, communication overhead, privacy loss, and
to develop fast initialization approach to reduce the number mimics the federated systems in the real world. However,
of rounds while achieving the same accuracy for FL. In the both LEAF and OARF do not provide an algorithm-level
experiments of a previous study [40], they show that their comparison. FedML [26] provides reference implementations
approach is also promising if applied as an initialization step. of federated learning algorithms such as FedAvg, FedNOVA
Automated Parameter Tuning for FL: FL algorithms suffer [71] and FedOpt [61]. There are no new datasets, metrics,
from large local updates. The number of local epochs is an and settings in FedML. FLBench [48] is proposed for isolated
important parameter in FL. While one traditional way is to data island scenario. Its framework covers domains including
develop approaches robustness to the local updates, another medical, finance, and AIoT. However, currently, FLBench is
way is to design efficient parameter tuning approaches for not open-sourced and it does not provide any experiments.
FL. A previous paper [9] studies Bayesian optimization in The above benchmarks do not provide analysis of existing
the federated setting, which can be used to search hyper- federated learning algorithms on different non-IID settings,
parameters. Approaches for the setting of number of local which is our focus in this paper. To the best of our knowledge,
epochs need to be investigated. there is one existing benchmark [50] for federated learning
Towards Robust Algorithms against Different Non-IID Set- on the non-IID data setting. However, it only provides two
tings: As in Finding (2), no algorithm consistently performs partitioning approaches: random split and split by labels. In
the best in all settings. It is a natural question whether and this paper, we provide comprehensive partitioning strategies
how we can develop a robust algorithm for different non-IID and datasets to cover different non-IID settings. Moreover,
settings. We may have to first investigate the common charac- we conduct extensive experiments to compare and analyze
teristics of FL processes under different non-IID settings. The existing federated learning algorithms.
intuitions of existing algorithms are same: the local model
updates towards the local optima, and the averaged model VIII. C ONCLUSION
is far from the global optima. We believe the design of FL There has been a growing interest in exploiting distributed
algorithms under non-IID settings can be improved if we can databases (e.g., in different organizations and countries) to
observe more detailed and common behaviours in the training. improve the effectiveness of machine learning services. In this
Aggregation of Heterogeneous Batch Normalization: From paper, we study non-IID data as one key challenge in such
our Finding (7), simple averaging is not a good choice for distributed databases, and develop a benchmark named NIID-
batch normalization. Since the batch normalization in each bench. Specifically, we introduce six data partitioning strate-
party records the statistics of local data distribution, there is gies which are much more comprehensive than the previous
also heterogeneity among the batch normalization layers of studies. Furthermore, we conduct comprehensive experiments
different parties. The averaged batch normalization layer may to compare existing algorithms and demonstrate their strength
not catch the local distribution after sending back to the parties. and weakness. This study sheds light on some future directions
A possible solution is to only average the learned parameters to build effective machine learning services on distributed
but leave the statistics (i.e., mean and variance) alone [4]. More databases.
specialized designs for particular layers in deep learning need
to be investigated. ACKNOWLEDGEMENTS
This research is supported by the National Research Foun-
VII. R ELATED W ORK dation, Singapore under its AI Singapore Programme (AISG
Although the existing study [32] provides non-IID data Award No: AISG2-RP-2020-018). Any opinions, findings and
cases, it does not provide the partitioning strategies to generate conclusions or recommendations expressed in this material
the corresponding non-IID data distributions. We go beyond are those of the authors and do not reflect the views of
the previous study and summarize six different partitioning National Research Foundation, Singapore. Qinbin is also in
strategies to generate three non-IID data distribution cases. part supported by a Google PhD Fellowship.
976
Authorized licensed use limited to: SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECH. Downloaded on September 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
R EFERENCES [22] F. Hanzely, S. Hanzely, S. Horváth, and P. Richtárik. Lower bounds
and optimal algorithms for personalized federated learning. Advances
in Neural Information Processing Systems, 2020.
[1] S. AbdulRahman, H. Tout, A. Mourad, and C. Talhi. Fedmccs: [23] A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augen-
multicriteria client selection model for optimal iot federated learning. stein, H. Eichner, C. Kiddon, and D. Ramage. Federated learning for
IEEE Internet of Things Journal, 8(6):4723–4735, 2020. mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.
[2] D. A. E. Acar, Y. Zhao, R. Matas, M. Mattina, P. Whatmough, and [24] S. Hasan, S. Thirumuruganathan, J. Augustine, N. Koudas, and G. Das.
V. Saligrama. Federated learning based on dynamic regularization. In Deep learning models for selectivity estimation of multi-attribute
International Conference on Learning Representations, 2021. queries. In Proceedings of the 2020 ACM SIGMOD International
[3] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Conference on Management of Data, SIGMOD ’20, page 1035–1050,
Proceedings of the 2000 ACM SIGMOD international conference on New York, NY, USA, 2020. Association for Computing Machinery.
Management of data, pages 439–450, 2000. [25] C. He, M. Annavaram, and S. Avestimehr. Group knowledge transfer:
[4] M. Andreux, J. O. du Terrail, C. Beguier, and E. W. Tramel. Siloed Federated learning of large cnns at the edge. Advances in Neural
federated learning for multi-centric histopathology datasets. In Domain Information Processing Systems, 33, 2020.
Adaptation and Representation Transfer, and Distributed and Collabo- [26] C. He, S. Li, J. So, M. Zhang, H. Wang, X. Wang, P. Vepakomma,
rative Learning, pages 129–139. Springer, 2020. A. Singh, H. Qiu, L. Shen, et al. Fedml: A research library and bench-
[5] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, mark for federated machine learning. arXiv preprint arXiv:2007.13518,
V. Ivanov, C. M. Kiddon, J. Konečný, S. Mazzocchi, B. McMahan, T. V. 2020.
Overveldt, D. Petrou, D. Ramage, and J. Roselander. Towards federated [27] T.-M. H. Hsu, H. Qi, and M. Brown. Measuring the effects of non-
learning at scale: System design. In SysML, 2019. identical data distribution for federated visual classification. arXiv
[6] S. Caldas, S. M. K. Duddu, P. Wu, T. Li, J. Konečnỳ, H. B. McMahan, preprint arXiv:1909.06335, 2019.
V. Smith, and A. Talwalkar. Leaf: A benchmark for federated settings. [28] S. Hu, Y. Li, X. Liu, Q. Li, Z. Wu, and B. He. The oarf benchmark
arXiv preprint arXiv:1812.01097, 2018. suite: Characterization and implications for federated learning systems.
[7] S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for arXiv preprint arXiv:2006.07856, 2020.
histogram construction: How much is enough? ACM SIGMOD Record, [29] J. Huang. Maximum likelihood estimation of dirichlet distribution
27(2):436–447, 1998. parameters. CMU Technique Report, 2005.
[8] G. Cohen, S. Afshar, J. Tapson, and A. Van Schaik. Emnist: Extending [30] N. Hynes, D. Dao, D. Yan, R. Cheng, and D. Song. A demonstration
mnist to handwritten letters. In 2017 International Joint Conference on of sterling: A privacy-preserving data marketplace. Proceedings of the
Neural Networks (IJCNN), pages 2921–2926. IEEE, 2017. VLDB Endowment, 11(12):2086–2089, 2018.
[9] Z. Dai, B. K. H. Low, and P. Jaillet. Federated bayesian optimization [31] R. Johnson and T. Zhang. Accelerating stochastic gradient descent
via thompson sampling. Advances in Neural Information Processing using predictive variance reduction. Advances in neural information
Systems, 33, 2020. processing systems, 26:315–323, 2013.
[10] Y. Deng, M. M. Kamani, and M. Mahdavi. Distributionally robust fed- [32] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N.
erated averaging. Advances in Neural Information Processing Systems, Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al.
33, 2020. Advances and open problems in federated learning. arXiv preprint
[11] Diemert Eustache, Meynet Julien, P. Galland, and D. Lefortier. Attri- arXiv:1912.04977, 2019.
bution modeling increases efficiency of bidding in display advertising. [33] G. A. Kaissis, M. R. Makowski, D. Rückert, and R. F. Braren. Secure,
In Proceedings of the AdKDD and TargetAd Workshop, KDD, Halifax, privacy-preserving and federated machine learning in medical imaging.
NS, Canada, August, 14, 2017, page To appear. ACM, 2017. Nature Machine Intelligence, pages 1–7, 2020.
[12] J. Ding, U. F. Minhas, J. Yu, C. Wang, J. Do, Y. Li, H. Zhang, [34] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and
B. Chandramouli, J. Gehrke, D. Kossmann, D. Lomet, and T. Kraska. A. T. Suresh. Scaffold: Stochastic controlled averaging for on-device
Alex: An updatable adaptive learned index. In Proceedings of the federated learning. In Proceedings of the 37th International Conference
2020 ACM SIGMOD International Conference on Management of Data, on Machine Learning. PMLR, 2020.
SIGMOD ’20, page 969–984, New York, NY, USA, 2020. Association [35] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features
for Computing Machinery. from tiny images. 2009.
[13] C. T. Dinh, N. H. Tran, and T. D. Nguyen. Personalized federated [36] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skew-resistant parallel
learning with moreau envelopes. Advances in Neural Information processing of feature-extracting scientific user-defined functions. In
Processing Systems, 2020. Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC
[14] C. Dwork. Differential privacy. Encyclopedia of Cryptography and ’10, page 75–86, New York, NY, USA, 2010. Association for Computing
Security, pages 338–340, 2011. Machinery.
[15] A. Fallah, A. Mokhtari, and A. Ozdaglar. Personalized federated learning [37] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning
with theoretical guarantees: A model-agnostic meta-learning approach. applied to document recognition. Proceedings of the IEEE, 86(11):2278–
Advances in Neural Information Processing Systems, 33, 2020. 2324, 1998.
[16] X. Y. Felix, A. S. Rawat, A. K. Menon, and S. Kumar. Federated learning [38] Q. Li, Y. Diao, Q. Chen, and B. He. Federated learning on non-iid data
with only positive labels. arXiv preprint arXiv:2004.10342, 2020. silos: An experimental study. arXiv preprint arXiv:2102.02079, 2021.
[17] M. Fredrikson, S. Jha, and T. Ristenpart. Model inversion attacks [39] Q. Li, B. He, and D. Song. Model-contrastive federated learning. In
that exploit confidence information and basic countermeasures. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Proceedings of the 22nd ACM SIGSAC Conference on Computer and Pattern Recognition, 2021.
Communications Security, pages 1322–1333. ACM, 2015. [40] Q. Li, B. He, and D. Song. Practical one-shot federated learning for
[18] S. Ganguly, P. B. Gibbons, Y. Matias, and A. Silberschatz. Bifocal cross-silo setting. IJCAI, 2021.
sampling for skew-resistant join size estimation. In Proceedings of the [41] Q. Li, Z. Wen, and B. He. Practical federated gradient boosting decision
1996 ACM SIGMOD International Conference on Management of Data, trees. In AAAI, pages 4642–4649, 2020.
SIGMOD ’96, page 271–281, New York, NY, USA, 1996. Association [42] Q. Li, Z. Wen, Z. Wu, S. Hu, N. Wang, and B. He. A survey on
for Computing Machinery. federated learning systems: Vision, hype and reality for data privacy
[19] R. C. Geyer, T. Klein, and M. Nabi. Differentially private federated and protection. arXiv preprint arXiv:1907.09693, 2019.
learning: A client level perspective. arXiv preprint arXiv:1712.07557, [43] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith. Federated learning: Chal-
2017. lenges, methods, and future directions. arXiv preprint arXiv:1908.07873,
[20] A. C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and 2019.
M. J. Strauss. Fast, small-space algorithms for approximate histogram [44] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith.
maintenance. In Proceedings of the thiry-fourth annual ACM symposium Federated optimization in heterogeneous networks. In MLSys, 2020.
on Theory of computing, pages 389–398, 2002. [45] T. Li, J. Zhong, J. Liu, W. Wu, and C. Zhang. [Link]: Towards multi-
[21] N. Guha, A. Talwlkar, and V. Smith. One-shot federated learning. arXiv tenant resource sharing for machine learning workloads. 11(5):607–620,
preprint arXiv:1902.11175, 2019. Jan. 2018.
977
Authorized licensed use limited to: SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECH. Downloaded on September 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
[46] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang. On the convergence [70] H. Wang, M. Yurochkin, Y. Sun, D. Papailiopoulos, and Y. Khazaeni.
of fedavg on non-iid data. In International Conference on Learning Federated learning with matched averaging. In International Conference
Representations, 2020. on Learning Representations, 2020.
[47] X. Li, M. JIANG, X. Zhang, M. Kamp, and Q. Dou. Fed{bn}: [71] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor. Tackling the ob-
Federated learning on non-{iid} features via local batch normalization. jective inconsistency problem in heterogeneous federated optimization.
In International Conference on Learning Representations, 2021. Advances in Neural Information Processing Systems, 33, 2020.
[48] Y. Liang, Y. Guo, Y. Gong, C. Luo, J. Zhan, and Y. Huang. An isolated [72] L. Wang, S. Xu, X. Wang, and Q. Zhu. Addressing class imbalance in
data island benchmark suite for federated learning. arXiv preprint federated learning. In AAAI, 2021.
arXiv:2008.07257, 2020. [73] W. Wang, J. Gao, M. Zhang, S. Wang, G. Chen, T. K. Ng, B. C. Ooi,
[49] T. Lin, L. Kong, S. U. Stich, and M. Jaggi. Ensemble distillation J. Shao, and M. Reyad. Rafiki: Machine learning as an analytics service
for robust model fusion in federated learning. Advances in Neural system. Proc. VLDB Endow., 12(2):128–140, Oct. 2018.
Information Processing Systems, 33, 2020. [74] Y. Wu, S. Cai, X. Xiao, G. Chen, and B. C. Ooi. Privacy preserving
[50] L. Liu, F. Zhang, J. Xiao, and C. Wu. Evaluation framework for large- vertical federated learning for tree-based models. Proceedings of the
scale federated learning. arXiv preprint arXiv:2003.01575, 2020. VLDB Endowment, 2020.
[75] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image
[51] Y. Liu, Y. Kang, C. Xing, T. Chen, and Q. Yang. A secure federated
dataset for benchmarking machine learning algorithms. arXiv preprint
transfer learning framework. IEEE Intelligent Systems, 2020.
arXiv:1708.07747, 2017.
[52] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of [76] Q. Yang, Y. Liu, T. Chen, and Y. Tong. Federated machine learning:
machine learning research, 9(Nov):2579–2605, 2008. Concept and applications. ACM Transactions on Intelligent Systems and
[53] R. Marcus, A. Kipf, A. van Renen, M. Stoian, S. Misra, A. Kemper, Technology (TIST), 10(2):1–19, 2019.
T. Neumann, and T. Kraska. Benchmarking learned indexes. Proc. [77] M. Yurochkin, M. Agarwal, S. Ghosh, K. Greenewald, N. Hoang, and
VLDB Endow., 14(1):1–13, Sept. 2020. Y. Khazaeni. Bayesian nonparametric federated learning of neural
[54] R. Marcus, P. Negi, H. Mao, C. Zhang, M. Alizadeh, T. Kraska, networks. In Proceedings of the 36th International Conference on
O. Papaemmanouil, and N. Tatbul. Neo: A learned query optimizer. Machine Learning. PMLR, 2019.
Proc. VLDB Endow., 12(11):1705–1718, July 2019. [78] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a gaussian
[55] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. denoiser: Residual learning of deep cnn for image denoising. IEEE
Communication-efficient learning of deep networks from decentralized transactions on image processing, 26(7):3142–3155, 2017.
data. arXiv preprint arXiv:1602.05629, 2016.
[56] M. Mohri, G. Sivek, and A. T. Suresh. Agnostic federated learning.
In International Conference on Machine Learning, pages 4615–4625.
PMLR, 2019.
[57] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng.
Reading digits in natural images with unsupervised feature learning.
2011.
[58] J. Neyman. On the two different aspects of the representative method:
the method of stratified sampling and the method of purposive selection.
In Breakthroughs in statistics, pages 123–150. Springer, 1992.
[59] C. Niu, Z. Zheng, F. Wu, X. Gao, and G. Chen. Trading data in good
faith: Integrating truthfulness and privacy preservation in data markets.
In 2017 IEEE 33rd International Conference on Data Engineering
(ICDE), pages 223–226. IEEE, 2017.
[60] X. Peng, Z. Huang, Y. Zhu, and K. Saenko. Federated adversarial domain
adaptation. In International Conference on Learning Representations,
2020.
[61] S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečnỳ,
S. Kumar, and H. B. McMahan. Adaptive federated optimization. arXiv
preprint arXiv:2003.00295, 2020.
[62] A. Reisizadeh, F. Farnia, R. Pedarsani, and A. Jadbabaie. Robust
federated learning: The case of affine distribution shifts. Advances in
Neural Information Processing Systems, 2020.
[63] S. J. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule
mining. In VLDB’02: Proceedings of the 28th International Conference
on Very Large Databases, pages 682–693. Elsevier, 2002.
[64] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the
stochastic average gradient. Mathematical Programming, 162(1-2):83–
112, 2017.
[65] S. Shastri, V. Banakar, M. Wasserman, A. Kumar, and V. Chidambaram.
Understanding and benchmarking the impact of gdpr on database
systems. Proc. VLDB Endow., 13(7):1064–1077, Mar. 2020.
[66] A. P. Sheth and J. A. Larson. Federated database systems for managing
distributed, heterogeneous, and autonomous databases. ACM Computing
Surveys (CSUR), 22(3):183–236, 1990.
[67] R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership
inference attacks against machine learning models. In 2017 IEEE
Symposium on Security and Privacy (SP), pages 3–18. IEEE, 2017.
[68] M. J. Smith, C. Sala, J. M. Kanter, and K. Veeramachaneni. The machine
learning bazaar: Harnessing the ml ecosystem for effective system
development. In Proceedings of the 2020 ACM SIGMOD International
Conference on Management of Data, SIGMOD ’20, page 785–800, New
York, NY, USA, 2020. Association for Computing Machinery.
[69] P. Voigt and A. Von dem Bussche. The eu general data protection regu-
lation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International
Publishing, 2017.
978
Authorized licensed use limited to: SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECH. Downloaded on September 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.