MegaScale-MoE: Efficient MoE Training
MegaScale-MoE: Efficient MoE Training
Chao Jin1,2,◦,∗ , Ziheng Jiang1,◦ , Zhihao Bai1 , Zheng Zhong1 , Juncai Liu1 , Xiang Li1 ,
Ningxin Zheng1 , Xi Wang1 , Cong Xie1 , Qi Huang1 , Wen Heng1 , Yiyuan Ma1 , Wenlei
Bao1 , Size Zheng1 , Yanghua Peng1 , Haibin Lin1 , Xuanzhe Liu2 , Xin Jin2,† , Xin Liu1,†
arXiv:2505.11432v2 [[Link]] 19 May 2025
1
ByteDance Seed, 2 Peking University
◦
Equal Contribution, ∗ Work done at ByteDance Seed, † Corresponding authors
Abstract
We present MegaScale-MoE, a production system tailored for the efficient training of large-scale
mixture-of-experts (MoE) models. MoE emerges as a promising architecture to scale large language
models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing
MoE training systems experience a degradation in training efficiency, exacerbated by the escalating
scale of MoE models and the continuous evolution of hardware.
Recognizing the pivotal role of efficient communication in enhancing MoE training, MegaScale-MoE
customizes communication-efficient parallelism strategies for attention and FFNs in each MoE
layer and adopts a holistic approach to overlap communication with computation at both inter-
and intra-operator levels. Additionally, MegaScale-MoE applies communication compression with
adjusted communication patterns to lower precision, further improving training efficiency. When
training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training
throughput of 1.41M tokens/s, improving the efficiency by 1.88× compared to Megatron-LM.
We share our operational experience in accelerating MoE training and hope that by offering our
insights in system design, this work will motivate future research in MoE systems.
1
NVLink Bandwidth (TB/s) 2.0
MegaScale-MoE addresses the communication prob-
1.5
B200, BF16, 2024 B200, FP8, 2024 lem in MoE training from three key aspects. First,
H100, BF16, 2022
MegaScale-MoE reduces the communication volume
1.0 H100, FP8, 2022 by customizing parallelism strategies for the attention
0.5
A100, BF16, 2020 and FFN modules in each MoE layer. We compare
V100, FP16, 2017
the parallelism strategies in existing LLM training
0.0
0 2 4 6 8 10 frameworks, comprehensively considering their im-
Tensor Core (PFLOPS)
pact on large-scale training, including the communi-
Figure 1 Evolution of NVIDIA GPUs.
1
2
input LayerNorm gating
LayerNorm Router weights SP LayerNorm SP LayerNorm SP LayerNorm
All-Gather QKV Proj QKV Proj
LinearQKV expert TP QKV Proj All-to-All
LinearFC1 LinearFC3 Q, K, V Q All-Gather
Self Q, K, V K, V
Attention SwiGLU Attention TP Attention SP Attention
LinearFC2 Output Proj All-to-All
LinearOut Reduce-
Scatter Output Proj Output Proj
output SP SP SP
Figure 2 Mixture-of-Experts (MoE) layer. LayerNorm LayerNorm LayerNorm
(a) Megatron-LM (b) DeepSpeed Ulysses (c) Megatron-LM
Tensor parallelism. Sequence parallelism. Context parallelism.
lored quantization strategies and FP32 reduction to
decrease communication volume while preserving con- Figure 3 Different parallelism strategies for self-attention.
vergence stability. "TP" denotes partitioning along the dimension of hidden
size, while "SP" denotes partitioning along the dimension
MegaScale-MoE is deployed in our datacenters to of sequence length.
train MoE models for our products. Compared to
the state-of-the-art open-source LLM training frame-
work, Megatron-LM [46], MegaScale-MoE achieves approach has limitations that prevent relying on a
up to 1.88× higher MFU (Model FLOPs Utiliza- single method for effective scaling.
tion) when training a 352B MoE model on 1,440
NVIDIA Hopper GPUs. With comprehensive com- Data parallelism uniformly distributes the training
munication optimizations, MegaScale-MoE powers data across all devices, with each device replicat-
large-scale training in our production, efficiently scal- ing the model parameters and optimizer states. To
ing to trillions of parameters and thousands of GPUs synchronize the parameters after each training iter-
while saving millions of GPU hours. ation, data parallelism performs an all-reduce com-
munication operation. Zero Redundancy Optimizer
(ZeRO) [38] improves over data parallelism by dis-
2 Background tributing model states across all participating devices.
ZeRO unfolds across three progressive stages, each de-
2.1 Mixture-of-Experts for Transformer
signed to increasingly conserve memory, though this
The Mixture of Experts (MoE) mechanism is an ad- comes with the trade-off of elevated communication.
vanced approach designed to boost the performance
of Transformer [49] models, which are increasingly Tensor parallelism distributes compute-intensive ten-
pivotal in the realm of LLMs [2, 7, 16, 25]. It extends sor operations over multiple devices, enabling parallel
the Transformer architecture by integrating multi- computation and significantly accelerating the train-
ple expert networks within the feed-forward network ing process. The specific partitioning strategy and
(FFN) component. As illustrated in Figure 2, MoE the dependencies among operators within the model
models dynamically route input tokens to the most dictate that tensor parallelism may necessitate gath-
relevant experts based on their characteristics. This ering split inputs (all-gather) or merging outputs
routing is managed by a trainable gating mechanism (reduce-scatter). In LLM training, operators like Lay-
that selects the best-suited experts for each token. erNorm and Dropout, though less compute-intensive,
This architectural innovation enables MoE models to require substantial activation memory. To tackle this
scale in capacity without a proportional increase in problem, a variant of tensor parallelism known as
inference costs, as only a subset of experts is activated sequence parallelism [18] is proposed, which par-
for each input. titions these operators along the dimension of se-
quence length. For long-context training, several
2.2 Large-scale LLM Training works [1, 14, 46] apply sequence parallelism or ten-
sor parallelism to different operators in self-attention.
Training large language models at scale on tens of Figure 3 illustrates the mainstream parallelism strate-
thousands of GPUs is a complex system engineering gies for attention, namely tensor, sequence, and con-
challenge that requires multiple systems techniques. text parallelism (TP, SP, and CP), which we analyze
To distribute the training workload, a combination in §3.1.
of parallelism strategies such as data, tensor, and
pipeline parallelism is necessary [17, 41, 46], as each Pipeline parallelism enhances efficiency by dividing
3
Inter-node Parallelism Symbol Description
b micro-batch size
Inter-node: PP Inter-node: TP or EP
communication across nodes per layer, s sequence length
high overhead h hidden dimension size
Intra-node Parallelism n model parallelism (TP, SP, or EP) size
m the ratio between the number of query heads and
FFN: TP FFN: EP that of key-value heads
lower GEMM efficiency and k number of experts that each token is routed to
more communication
Table 1 Description of symbols.
Attention: DP Attention: TP Attention: Attention: CP
n× activations due comm. volume on Ulysses SP Imbalanced computation
to global batch
size limit,
critical path:
𝟐𝒃𝒔𝒉(𝒏 − 𝟏)/𝒏
comm. volume:
𝟐𝒃𝒔𝒉(𝒏 − 𝟏)/𝒏×
due to masked attention, reduce communication, and overlap communication
comm. volume:
poor scalability (𝟐 + 𝟐/𝒎)/𝒏
𝟐𝒃𝒔𝒉(𝒏 − 𝟏)/𝒏×𝟐/𝒎 of different micro-batches.
Figure 4 Design space for large-scale MoE training.
Prior large-scale MoE training systems, such as
Megatron-LM [46] and DeepSpeed-MoE [40], incorpo-
model layers into stages that are processed on dif- rate tensor parallelism to scale up training by parti-
ferent devices, enabling pipelined execution. Each tioning the model parameters within the node. How-
batch is split into several micro-batches for this pur- ever, in our practice, we observe two issues with this
pose. To minimize pipeline bubbles, various schedul- approach: (1) TP partitions the expert dimension,
ing strategies have been developed, e.g., GPipe [12], which negatively impacts GEMM efficiency; and (2)
PipeDream 1F1B [31] and Interleaved 1F1B [32], TP introduces significant communication overhead,
etc. Megatron-LM adopts Interleaved 1F1B pipeline which remains constant as the parallelism size in-
scheduling, further dividing each stage on one device creases, eventually causing communication to exceed
into multiple virtual stages to reduce the pipeline computation on modern hardware.
bubble rate. To address these issues, we tailor parallelism strate-
gies for MoE model components. For feed-forward
Expert parallelism is tailored for training MoE mod- networks (i.e., experts), we replace tensor parallelism
els by distributing experts across multiple devices, with expert parallelism and use custom communica-
alleviating memory pressure and enabling parallel tion modes optimized for varying top-k and expert
processing. To efficiently assign tokens to the appro- sizes, ensuring communication overhead stays lower
priate experts and retrieve their outputs, all-to-all than tensor parallelism. For other components, we
communication is typically employed. apply sequence parallelism, partitioning along the
sequence dimension instead of the batch dimension,
3 Communication-Efficient Paral- allowing scaling without increasing global batch size.
lelism This also reduces communication on critical paths
compared to tensor parallelism. The additional mem-
With the rise of MoE models and the evolution of ory and DP communication overhead remain man-
hardware compute capabilities, communication over- ageable due to the parameter asymmetry across com-
head has become increasingly critical in MoE training ponents. We detail the rationale and analysis of
in production. In this section, we delve into the paral- this intra-node parallelism strategy in the following
lelism strategies employed to reduce communication sections. Table 1 lists the key symbols.
volume and meet other training requirements, such
as high GEMM (General Matrix Multiplication) effi- 3.1 Sequence Parallelism for Attention
ciency.
Due to the inherent parallelizability of the expert
Figure 4 shows the design space of parallelism strate- components in MoE models, most prior work on
gies for large-scale MoE training, excluding the out- MoE training [20, 40] focuses on optimizing expert
ermost data parallelism. We start with inter-node parallelism, while data parallelism (DP) is typically
parallelism. Expert parallelism alleviates memory applied to the non-MoE components such as atten-
pressure from MoE models’ large parameter size tion. However, when scaling up MoE training, this
by distributing experts across nodes but incurs per- approach proves insufficient due to the n× activation
layer cross-node communication, harming training memory consumption. This issue arises because DP
efficiency. Similarly, tensor parallelism’s high commu- splits the batch dimension both across and within
nication overhead makes it more efficient to limit TP nodes. Compared to other intra-node parallelism
to a single node. Following prior work [17], we adopt strategies shown in Figure 4, applying DP to atten-
pipeline parallelism to distribute model parameters, tion forces each GPU within a node to process one
4
micro-batch simultaneously, increasing the activation Intra-node RS Inter-node RS Inter-node AG Intra-node AG
size by 8×, which often results in out-of-memory Node0
issues. Part 0
Part 1
To enable scalable MoE training, implementing intra- Part 2
node parallelism for the attention module is crucial. GPU0 GPU1 GPU2
Tensor parallelism (TP) is commonly employed to
parallelize attention operations within nodes. How- Part 0
Part 1
ever, it introduces inevitable communication costs Part 2
due to all-gathering and reduce-scattering activations GPU0 GPU1 GPU2 Node1
along the critical path. With the increasing gap (a) Four-step hierarchical parameter synchronization.
between computational FLOPs and communication
RS RS AG AG
bandwidth, we find that the TP communication over- Intra-node 1 2 3 4 1 234 1 2 1 3243 4
head can even surpass the computation time of self- Inter-node 1 234 1 234 pipelining 1 1 2 2 3 3 4 4
attention. This communication-dominated bottle- (b) Overlapping in hierarchical communication.
neck limits the ability to overlap communication and
Figure 5 Hierarchical communication for parameter
computation, ultimately reducing training efficiency. synchronization in SP attention.
We adopt sequence parallelism (SP), as proposed in
DeepSpeed-Ulysses [14], to scale MoE training and
effectively reduce communication along the critical inter-node bandwidth asymmetry and the adoption
path. SP is commonly used in long-context train- of hierarchical communication operations in modern
ing to address memory challenges associated with communication libraries [33] as shown in Figure 5
long inputs. We find it also works well in large-scale and analyzed in Appendix A.1, although SP atten-
MoE training. First, it significantly reduces commu- tion requires synchronization of n× more parameters
nication overhead compared to TP, especially when compared to TP attention, the difference in commu-
using grouped-query attention [4]. Second, while it nication overhead is minimal in practical scenarios.
introduces some parameter redundancy and increased
communication overhead during parameter synchro- On the other hand, the additional GPU memory
nization, the unique characteristics of MoE models consumption introduced by SP attention is minimal
make these trade-offs manageable and acceptable. in MoE training. For large-scale MoE models with
tens to hundreds of experts, the majority of GPU
Communication efficiency. When utilizing TP, the memory is consumed by the expert parameters. Our
communication volume in attention is experiments, detailed in §6.2, confirm that the extra
parameter synchronization and memory overhead of
2bsh(n − 1)/n. (1) SP attention remain manageable.
With SP, the communication volume decreases to
2bsh(n − 1)/n × (2 + 2/m)/n, (2) Balanced vs. imbalanced. In addition to the Ulysses-
style SP attention, we also explored other forms,
where m represents the ratio between the number of including context parallelism (CP) [1], which parti-
query heads and that of key-value heads. Assuming tions all activations along the sequence dimension.
the model is trained on a NVIDIA Hopper GPU CP attention, however, faces workload imbalance
workstation with an NVLink domain of size 8, the due to causal masking in attention, as each token
communication latency for sequence parallel attention only attends to previous tokens. To mitigate this,
can be reduced to about one-fourth of that required we attempted the zigzag strategy by grouping the
by tensor-parallel attention. head and tail partitions of the sequence on the same
GPU, although achieving perfect balance remains
Data communication & memory overhead. A notable challenging. Consequently, in large-scale training,
difference between SP and TP attention is how pa- the entire training process is often constrained by the
rameters are distributed across devices: TP shards most imbalanced data batch. Moreover, this imbal-
the attention weights, while SP replicates them. This ance disturbs the training pipeline, thereby reducing
raises the concern about the potential increase in com- overall training efficiency.
munication overhead for synchronizing gradients and
parameters. Counterintuitively, given the intra- and
5
All-Gather Reduce-Scatter All-to-All
1000
(a) Typical EP implementation.
800
Time (us)
600
400
(b)(b)
EPEP Implementation
implementation inin MoESurge when
MegaScale-MoE top-k
when > n. > n.
top-k 200
Figure 6 Communication-efficient expert parallelism. e 0
1 2 3 4 5 6 7 8
represents the number of tokens routed to the worker. Top-k
6
backward propagation, which limits the flexibility of / Heavy/light operators w Gating weights for each expert
/ Retained/discarded activations W Model weights
communication overlap. In contrast, MegaScale-MoE ln2_in
decomposes the attention and FFN modules of each hiddeni RMSNorm
MoE layer into operators that run as GPU kernels, SP RMSNorm SP ln2_out Router
ln1_out wgating
enabling fine-grained communication overlap through WQKV QKV Proj All-Gather
flexible scheduling. q k v Scatter & Drop
RoPE RoPE WFC1 ffn_in WFC3
q_rope k_rope GroupedGEMM GroupedGEMM
4.1 Inter-operator Overlap fc1_out fc3_out
All-to-All All-to-All All-to-All SiLU
We overlap communication operators with indepen- q_a2a k_a2a v_a2a w'gating
Self Attention WFC2 fc2_in
dent computation operators by executing them asyn- GroupedGEMM
TP attn
chronously on different CUDA streams. To achieve All-to-All EP Fillfc2_out & Gather
optimal performance during the training process, we SP attn_a2a Reduce-Scatter
adopt a specifically hand-tailored, holistic scheduling WO Out Proj SP ffn_out
strategy. attn_out
hiddeni+1
(a) Forward pass of a MoE layer.
Holistic scheduling. From the caller’s perspective,
we implement a unified macro module to execute Op(R) Recomputed operator Comm(R) Reperformed communication
∆ffn_out ln2_out
the entire MoE layer’s forward and backward passes, Stream0 All-Gather All-Gather(R)
thereby expanding our scheduling flexibility. For in- Stream1 RMSNorm(R) SiLU(R) Scatter GroupedGEMM'FC2
stance, during the backward pass, various communi- ln2_in fc1_out ∆fc2_out ∆fc2_in
cation operators can be overlapped with dependency- fc3_out fc2_in
free computations, such as activation recomputation, (b) Backward pass snippet with activation rematerialization.
Figure 8 Selective activation rematerialization.
to improve efficiency. From the runtime perspective,
a key challenge is efficiently managing concurrent
to overlap with other computations and communi-
communication tasks by resolving resource conflicts
cations, avoiding delays in the critical path. For
to prevent blocking and maximize throughput. This
example, as shown in Figure 8b, the backward pass
requires careful coordination, such as determining the
of the GroupedGEMM operator for FC2 requires
number of SMs allocated to each communication op-
the activation fc2_in and the gradient of fc2_out
erator, to minimize interference and optimize overall
(denoted as ∆fc2_out) as inputs. MegaScale-MoE re-
throughput.
computes fc2_in and overlaps this operator with gra-
dient communication (i.e., all-gather for ∆ffn_out).
Selective activation rematerialization. The holistic
Similarly, ffn_in is obtained through re-performing
scheduling strategy also helps reduce memory usage
RMSNorm and all-gather, with these operators hidden
without compromising training speed. Compared to
within the preceding communication and the FC2
dense models with equivalent computational require-
GroupedGEMM, respectively. MegaScale-MoE also
ments, MoE models exert significantly higher memory
places the weighted sum of ffn_out immediately af-
pressure during training due to their parameter count
ter the SwiGLU [43] activation function to eliminate
being several times larger. In addition to employing
the need to store ffn_out. This reordering ensures
ZeRO optimizations [38] to eliminate redundant op-
computational consistency by avoiding operators that
timizer states across DP groups, we further optimize
cross non-linear boundaries.
memory usage through selective activation remateri-
alization. This approach reduces activation memory Our analysis in Appendix A.2 and experiments in
requirements by re-performing computation and com- §6.2 show that MegaScale-MoE reduces the activa-
munication operators that can be overlapped with tion memory by ∼ 50% while maintaining the same
other necessary operators. training speed.
Figure 8a illustrates the forward pass of a Mixtral [16]
4.2 Intra-operator Overlap
MoE layer and highlights key activations produced
during this process. MegaScale-MoE strategically re- Although inter-operator overlap effectively hides com-
tains activations that are computationally expensive munication latency, squeezing all bubbles in the exe-
to recompute, while recalculating others generated by cution timeline remains non-trivial—especially in the
memory-intensive operations or communication op- forward pass, where no rematerialization or gradient
erations. This minimizes dependencies on backward computation operators exist to overlap with commu-
computation, enabling rematerialization operations nication. Some forward operators directly depend
7
GEMM Weight (Replicated)
luM
GEMM B
taM
h
...
c G c G c G c G G E
L L L L H H Expert 0
Rank 1
Time Tokens Input Tile GEMM Tile
(a) Data flow in A2A+GEMM (Output Proj). (b) Fine-grained overlap of A2A+GEMM. (c) Fine-grained overlap of AG+Scatter+GroupedGEMM.
on communication, such as token dispatch for ex- communication for remote data starts simultaneously.
pert computation, making overlap impossible unless We leverage dedicated GPU copy engines for data
another micro-batch is introduced, which increases transfer, ensuring that all SMs (streaming multipro-
memory pressure. cessors) are fully utilized for computation. Once a
remote data tile arrives at local memory, a signal no-
A widely adopted solution [17, 48, 50] is to decom- tifies the GEMM kernel to continue its computation
pose operators into smaller parallel ones to enable on the arrived tile. For GEMM+A2A, the all-to-
pipelining by executing them on separate CUDA all operation is fused into the GEMM kernel. Each
streams. However, this approach introduces non- tile of GEMM computation ends with a remote data
negligible overhead: (i) complex stream control, in- transfer that writes the output data tile to remote
volving host interference and causing random bubbles ranks. We also implement all-gather+GEMM and
due to the non-deterministic feature of CPU control; GEMM+reduce-scatter kernels for tensor parallelism,
(ii) imperfect tail computation, increasing overall which are similar to A2A+GEMM and GEMM+A2A.
computation latency.
For A2A+GEMM and GEMM+A2A, we allocate a
To address the above issues, we use intra-operator small number of SMs for communication as all-to-all
overlap to exploit the parallelism for communica- is more complex than all-gather and reduce-scatter.
tion and computation operators with direct depen- The number of SMs for communication is tuned to
dency. The core idea is to fuse these operators and make communication and computation exhibit sim-
break down the workloads into tiles. Following prior ilar latency. Moreover, multiple ranks may simul-
work [5, 15, 51, 53], we implement barriers in device taneously read from or write to the same device,
memory between communication and computation potentially causing contention in NVLink. To mit-
operators. These barriers allow fine-grained noti- igate this, we apply swizzling [5, 51, 53] to reorder
fications between computation and communication tile communication and computation so that the ar-
at the tile level. Additionally, as the barriers re- rival of communication tiles aligns with the pace of
side in device memory, they eliminate the need for computation tiles.
host interference, enabling further performance im-
provements. We implement two types of kernels, Overlapping with GroupedGEMMs For expert par-
overlapping with GEMMs and overlapping with MoE allelism with token dispatch and combine, we aim
GroupedGEMMs, for the attention and FFN mod- to overlap communication with GroupedGEMMs.
ules, respectively. We implement two types of overlapping ker-
nels: all-gather+scatter+GroupedGEMM and
Overlapping with GEMMs. We first introduce the GroupedGEMM+gather+reduce-scatter. Unlike the
intra-operator communication-computation overlap overlapping techniques for GEMM kernels, MoE
for GEMM kernels. Specifically, we implement all- GroupedGEMMs require token shuffling (scatter/-
to-all(A2A)+GEMM and GEMM+A2A kernels for gather). As a result, each computation tile may
Output and QKV Projections in SP attention, re- depend on tokens from multiple ranks. To effectively
spectively, where X+Y means Y executed after X. overlap computation with communication, we sort
Figure 9 shows the data flow and overlapping pat- the token order to minimize the number of dependent
tern in A2A+GEMM. The GEMM on local data and ranks for each computation tile. Additionally, since
8
Gradients Name #layers h #heads m hf f n #experts top-k
9
Iteration Throughput Training Time for Exposed Communication FlashAttention and GEMMs Others
System #GPUs
Time (s) (tokens/s) 1T Tokens (days) 20 100
MFU (%)
Megatron-LM 720 13.70 430.5k 26.88 10 50
960 10.82 550.2k 21.23
1440 7.90 746.6k 15.50 5 25
240 21.61 272.9k (1.81×) 42.41
0 0
480 11.83 498.6k (1.65×) 23.21 H800 A100 H20 H800 A100 H20
MegaScale-MoE 720 7.97 740.1k (1.72×) 15.64 (a) Iteration time breakdown. (b) MFU.
960 6.12 963.8k (1.77×) 12.01
1440 4.19 1407.7k (1.88×) 8.22 Figure 12 Performance breakdown of training Mixtral-
Table 3 Strong-scaling training performance for the 8×7B on different GPUs.
352B MoE model on NVIDIA H800 GPUs. The number
in parentheses in the throughput column represents the Compute Cap- Memory Spec. NVLink
GPU
ability (TFLOPS) Cap. (GB) Bw. (TB/s) Bw. (GB/s)
speedup of MegaScale-MoE compared to Megatron-LM. H800 989 80 3.4 400
A100 312 80 2.0 600
Megatron-LM MegaScale-MoE H20 148 96 4.0 900
Normalized Throughput
per-channel quantization for backward communica- Scalability. Table 3 compares the strong-scaling train-
tion. In backward propagation, we further group ing performance of Megatron-LM and MegaScale-
quantization along the token dimension using a small MoE on the 352B MoE model. We scale the number
group size (e.g., 128). of GPUs while keeping the global batch size fixed
at 720. Across all settings, MegaScale-MoE achieves
6 Evaluation 1.65–1.88× speedups over Megatron-LM. As the num-
ber of GPUs increases, the MFU (Model FLOPs Uti-
In this section, we present a comprehensive evaluation lization) of MegaScale-MoE declines from 32.48% to
of MegaScale-MoE, covering overall training perfor- 27.89%. This is expected, as the batch size is fixed
mance (§6.1), ablation studies of MegaScale-MoE’s and the number of micro-batches for each pipeline
key optimizations (§6.2), and the effectiveness of the decreases with more GPUs, leading to more bubbles.
precision-communication co-design (§6.3). Table 2
Figure 11 presents the weak-scaling training perfor-
lists the configurations of the MoE models used in our
mance of Megatron-LM and MegaScale-MoE on the
evaluation, detailing hidden size (h), FFN intermedi-
same model. We scale the global batch size from 360
ate size (hf f n ), number of experts, and top-k values.
to 1,080 in proportion to the number of GPUs (from
The evaluation is conducted on NVIDIA H800 GPUs
480 to 1,440). MegaScale-MoE achieves a 1.74-1.79×
unless otherwise specified, with the specifications
training throughput compared to Megatron-LM. As
provided in Table 4.
the scale increases, Megatron-LM’s throughput drops
by 2.74% due to increased communication overhead,
6.1 Training Performance whereas MegaScale-MoE exhibits near-linear scala-
MegaScale-MoE is built on top of Megatron-LM [46], bility, benefiting from comprehensive communication
a state-of-the-art open-source LLM training system overlap.
that supports 3D parallelism strategies and is con-
tinuously updated to incorporate the latest optimiza- Performance breakdown on different GPUs. We
tions from the community. Our evaluation uses the conduct a deep dive into MegaScale-MoE to further
Megatron-LM on GitHub [30] with commit hash understand the performance of training a MoE model
f1f03922, selected for its stability at the commence- in production environments. We train Mixtral-8×7B
ment of our experiments months ago. For fair compar- on 32 NVIDIA H800, H20, and A100 GPUs, respec-
ison, we use the same global batch size for Megatron- tively. The specifications of GPUs we used are listed
LM and MegaScale-MoE and choose the optimal in Table 4. We set the DP size as four, the TP size
parallelism configurations for the two systems, re- as eight for Megatron-LM, and the SP and EP size
10
TP+TP SP+TP TP+EP SP+EP (MegaScale-MoE) TP Attention SP Attention (MegaScale-MoE)
1.0
170 170
160 160
0.9
150 150
0.8
140 140
0.7
130 130
0.6 120 120
Internal- Mixtral- Mixtral- Hunyuan- Phi- DeepSeek-
352B 8×7B 8×22B Large 3.5-MoE MoE 500 750 1000 1250 1500 500 750 1000 1250 1500
Attention Parameter Size (MB) Attention Parameter Size (MB)
Figure 13 Parallelism efficiency for different models. (a) DP=4. (b) DP=8.
11
No intra-op communication-computation overlapping MegaScale-MoE Y-axis = Communication Time + Computation Time
0.4 0.4 3 3
0.3 0.3
Time (ms)
Time (ms)
Time (ms)
Time (ms)
2 2
0.2 0.2
1 1
0.1 0.1
0.0 0.0 0 0
M1 M2 M3 M4 M5 M6 M1 M2 M3 M4 M5 M6 M1 M2 M3 M4 M5 M6 M1 M2 M3 M4 M5 M6
(a) QKV Proj.+A2A. (b) A2A+Output Proj. (c) AG+Scatter+GEMM. (d) GEMM+Gather+RS.
Figure 15 Overlapped communication-computation time vs. non-overlapped time of each layer. M1-M6 represent the
six models listed from top to bottom in Table 2; A2A, AG, and RS refer to all-to-all, all-gather, and reduce-scatter,
respectively.
10
Activations Parameters, gradients and optimizer states Others FP32 Reduce-Scatter
Memory Usage (GB)
Training Loss
No SAR MegaScale-MoE No SAR MegaScale-MoE
75 75
MFU (%)
6
50 50
4
25 25
2
0 0 0.1 1 10 100 1000
Mixtral- Mixtral- Mixtral- Mixtral- Consumed Tokens (B)
8×7B 8×22B 8×7B 8×22B
(a) Memory usage. (b) MFU. Figure 17 The training loss curve of MegaScale-MoE
Figure 16 Ablation study of selective activation remate- with DP communication compression.
rialization (SAR). BF16 FP8
12 2.0
10
Training Loss
Training Loss
to the baseline lacking fine-grained overlap. And 8 1.5
Selective activation rematerailization. We compare Figure 18 The loss curve of MegaScale-MoE in FP8 and
MegaScale-MoE to a baseline that disables selective BF16.
activation rematerialization (No SAR), which stores
all activations in GPU memory during training. We 6.3 Model Convergence
evaluate both methods by training Mixtral-8×7B and
Mixtral-8×22B on 128 NVIDIA H800 GPUs. Fig- We evaluate model convergence with MegaScale-MoE.
ure 16 shows the memory usage breakdown and the Figure 18 demonstrates the loss curves of training a
training MFU. Compared to No SAR, MegaScale- 35B MoE model from scratch and continuing training
MoE reduces activation memory consumption by a 176B MoE model from a checkpoint, with results
45.5% and 57.2% for the two models, respectively, shown for both BF16 and FP8 precision. MegaScale-
resulting in overall memory reductions of 21.3% and MoE ensures stable convergence and consistent train-
35%, while maintaining the training performance dif- ing loss across BF16 and FP8 formats.
ference within 0.5%.
7 Experience
Data parallelism communication compression. We In this section, we describe our deployment and op-
validate the effectiveness of our communication com- erational experience of MegaScale-MoE.
pression technique by training a 7B MoE model using
BF16 all-to-all DP communication and FP32 reduce- Deployment experience. MegaScale-MoE has been
scatter communication, as described in §5. Figure 17 deployed in our production environment and is re-
illustrates the training loss curves, which are nearly sponsible for the majority of large-scale MoE training
identical. This optimization compresses only the ac- tasks within our company. It enables the training of
cumulated gradients of the batch and performs con- models with trillions of parameters, supports single
versions between BF16 and FP32 exclusively during training jobs scaling beyond 10,000 GPUs, with indi-
communication, introducing minimal risk. vidual training tasks running for several months. By
combining the aforementioned techniques, MegaScale-
12
MoE minimizes idle communication time and opti- 1.0
13
These manual interventions provide deeper insights as efficient cross-node all-to-all and DualPipe, to
into training dynamics, enabling targeted optimiza- enhance the efficiency of large-scale MoE training.
tions. As training progresses and experience accumu- We are also exploring algorithm-system co-design to
lates, we seek to automate operator scheduling within further reduce training costs.
the search space to optimize the training process at
a fine-grained level and achieve optimal performance. Long-context training. While Megatron-LM [18, 46]
We leave automatic optimization for future work. opts to partition only specific operations along the
sequence dimension, various methods of sequence par-
8 Related Work allelism [19, 22, 27] have been explored for training
models requiring long contexts. The Blockwise Par-
Large model training. LLM research has led to the allel Transformer [26] method implements blockwise
development of scalable, efficient, and robust train- computation of self-attention and the fusion of FFNs
ing techniques to meet the substantial computational based on online softmax calculations. Ring Atten-
demands of these models. DeepSpeed [41] features tion [22, 27] introduces a ring-style communication
the Zero Redundancy Optimizer (ZeRO) [38, 39, 42], mechanism integrated with self-attention calculations,
which shards model parameters, gradients, and opti- facilitating the exchange of key and value chunks. We
mizer states across participating GPUs in data paral- adopt the all-to-all style of SP attention from Deep-
lelism, enabling the scaling of LLMs with manageable Speed Ulysses [14], which partitions attention by
memory consumption. Megatron-LM [46] focuses on heads rather than sequence length, due to its reduced
intra-layer model parallelism techniques, partition- communication volume and balanced computation
ing the parameters and computation of each layer. pattern.
Pipeline parallelism assigns the parameters and com-
putation of a contiguous subset of layers to each
Communication-computation overlap. Several frame-
GPU[12, 31], breaks a batch into micro-batches, and
processes the micro-batches in a pipelined fashion. works [10, 21, 29, 37, 52] focus on overlapping com-
MegaScale [17] shows how combining tensor, pipeline, munication with computation in distributed deep
and data parallelism can be an efficient strategy to learning training with a single parallelism strategy.
train large multi-billion parameter models at unprece- Some compiler-style work [15, 35, 50] provides fine-
dented scale. grained overlap among kernels, but excessive par-
titioning of GEMM kernels can result in low GPU
Mixture-of-Expert training. To address the compu- utilization. Centauri [6] enhances communication
tational challenges of training advanced neural net- overlap for LLM training with 3D parallelism by com-
works, the machine learning field has increasingly munication partitioning and hierarchical scheduling.
adopted Mixture-of-Experts architectures. Subse- Similar to Centauri, our inter-operator communica-
quently, a number of parallel deep learning frame- tion overlap hides communication within independent
works have been proposed for training or running in- computation by reordering operators. We further con-
ference on MoEs on multi-GPU clusters. DeepSpeed- ceal communication on critical paths through intra-
MoE [40] significantly reduces training costs through operator overlap, without compromising GPU utiliza-
model architecture designs and compression tech- tion. DualPipe, proposed by DeepSeek-V3 [25], over-
niques. HetuMoE [34] utilizes a hierarchical all-to- laps communication and computation within different
all communication strategy to achieve performance forward and backward chunks and requires storing
speedup. SE-MoE [45] distinguishes itself by focus- 2× the model parameters. In contrast, MegaScale-
ing on scalable and efficient training with heteroge- MoE achieves this overlap within a single forward or
neous resources like CPU memory and SSDs. Fur- backward chunk without incurring additional memory
thermore, Tutel [13] offers a dynamic solution for overhead.
MoE models, employing adaptive parallelism and
pipelining. FasterMoE [11] introduces a comprehen- 9 Conclusion
sive suite of optimizations such as dynamic shadow-
ing, fine-grained scheduling, and congestion-avoiding In this paper, we offer an in-depth look at the de-
expert selection strategies. Janus [28] proposes a sign, implementation, and deployment of MegaScale-
data-centric paradigm shift for MoE models, aiming MoE, a production-grade system built to effi-
to lower communication demands and boost train- ciently train MoE models. MegaScale-MoE ex-
ing efficiency. Recently, DeepSeek-V3 [25] proposes ploits communication-efficient approaches, includ-
a series of algorithm-system co-optimizations, such ing parallelism strategies with lower communication
14
volume, inter- and intra-operator communication- Cui. GLaM: Efficient scaling of language models with
computation overlap, and communication compres- mixture-of-experts. In International Conference on
sion with adjusted communication patterns to un- Machine Learning (ICML), 2022.
leash the compute capabilities of high-performance
[9] William Fedus, Barret Zoph, and Noam Shazeer.
GPUs. MegaScale-MoE achieves 1.41M tokens/s in Switch transformers: Scaling to trillion parameter
throughput when training a 352B MoE model on models with simple and efficient sparsity. Journal of
1,440 NVIDIA Hopper GPUs, a 1.88× improvement Machine Learning Research, 2022.
over Megatron-LM. By sharing our insights on accel-
erating large-scale MoE training, we hope our work [10] Sayed Hadi Hashemi, Sangeetha Abdu Jyothi,
will inspire future research. and Roy Campbell. Tictac: Accelerating dis-
tributed deep learning with communication schedul-
ing. Proceedings of Machine Learning and Systems,
References 2019.
[1] Context parallelism in megatron-lm, 2025. [11] Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang,
[Link] Fuwen Luo, Shangfeng Shi, and Qin Li. Fastermoe:
developer-guide/latest/api-guide/context_ modeling and optimizing training of large-scale dy-
[Link]. namic pre-trained models. In ACM PPoPP, 2022.
[2] Introducing dbrx: A new state-of-the-art open llm, [12] Yanping Huang, Youlong Cheng, Ankur Bapna,
2025. URL [Link] Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong
introducing-dbrx-new-state-art-open-llm. Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu,
et al. Gpipe: Efficient training of giant neural net-
[3] Open release of grok-1, 2025. URL [Link] works using pipeline parallelism. Neural Information
blog/grok-os. Processing Systems, 2019.
[4] Joshua Ainslie, James Lee-Thorp, Michiel de Jong,
[13] Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang,
Yury Zemlyanskiy, Federico Lebrón, and Sumit Sang-
Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin
hai. Gqa: Training generalized multi-query trans-
Jose, Prabhat Ram, et al. Tutel: Adaptive mixture-
former models from multi-head checkpoints. arXiv
of-experts at scale. Proceedings of Machine Learning
preprint arXiv:2305.13245, 2023.
and Systems, 2023.
[5] Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan
Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun [14] Sam Ade Jacobs, Masahiro Tanaka, Chengming
Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam
et al. Flux: fast software-based communication over- Rajbhandari, and Yuxiong He. Deepspeed ulysses:
lap on gpus through kernel fusion. arXiv preprint System optimizations for enabling training of ex-
arXiv:2406.06858, 2024. treme long sequence transformer models. arXiv
preprint arXiv:2309.14509, 2023.
[6] Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei
Duan, Peng Sun, Xingcheng Zhang, and Chao [15] Abhinav Jangda, Jun Huang, Guodong Liu, Amir
Yang. Centauri: Enabling efficient scheduling for Hossein Nodehi Sabet, Saeed Maleki, Youshan Miao,
communication-computation overlap in large model Madanlal Musuvathi, Todd Mytkowicz, and Olli
training via communication partitioning. In ACM Saarikivi. Breaking the computation and commu-
ASPLOS, 2024. nication abstraction barrier in distributed machine
learning workloads. In ACM ASPLOS, 2022.
[7] Aakanksha Chowdhery, Sharan Narang, Jacob
Devlin, Maarten Bosma, Gaurav Mishra, Adam [16] Albert Q Jiang, Alexandre Sablayrolles, Antoine
Roberts, Paul Barham, Hyung Won Chung, Charles Roux, Arthur Mensch, Blanche Savary, Chris Bam-
Sutton, Sebastian Gehrmann, et al. Palm: Scal- ford, Devendra Singh Chaplot, Diego de las Casas,
ing language modeling with pathways. Journal of Emma Bou Hanna, Florian Bressand, et al. Mixtral
Machine Learning Research, 2023. of experts. arXiv preprint arXiv:2401.04088, 2024.
[8] Nan Du, Yanping Huang, Andrew M Dai, Simon [17] Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang,
Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li,
Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Fi- Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin
rat, Barret Zoph, Liam Fedus, Maarten P Bosma, Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou,
Zongwei Zhou, Tao Wang, Emma Wang, Kellie Web- Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei,
ster, Marie Pellat, Kevin Robinson, Kathleen Meier- Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao,
Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi
Quoc Le, Yonghui Wu, Zhifeng Chen, and Claire Ye, Xin Jin, and Xin Liu. MegaScale: Scaling large
15
language model training to more than 10,000 GPUs. [29] Kshiteej Mahajan, Ching-Hsiang Chu, Srinivas Srid-
In USENIX NSDI, 2024. haran, and Aditya Akella. Better together: Jointly
optimizing ml collective scheduling and execution
[18] Vijay Anand Korthikanti, Jared Casper, Sangkug planning using {SYNDICATE}. In USENIX NSDI,
Lym, Lawrence McAfee, Michael Andersch, Moham- 2023.
mad Shoeybi, and Bryan Catanzaro. Reducing ac-
tivation recomputation in large transformer mod- [30] Megatron-LM. GPU optimized techniques for train-
els. Proceedings of Machine Learning and Systems, ing transformer models at-scale, 2025. https://
2023. [Link]/NVIDIA/Megatron-LM.
[19] Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, [31] Deepak Narayanan, Aaron Harlap, Amar Phan-
Xuezhe Ma, Ion Stoica, Joseph E. Gonzalez, and ishayee, Vivek Seshadri, Nikhil R Devanur, Gre-
Hao Zhang. Distflashattn: Distributed memory- gory R Ganger, Phillip B Gibbons, and Matei Za-
efficient attention for long-context llms training. haria. Pipedream: generalized pipeline parallelism
arxiv preprint arXiv:2310.03294, 2024. for dnn training. In ACM SOSP, 2019.
[20] Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and [32] Deepak Narayanan, Mohammad Shoeybi, Jared
Hong Xu. Accelerating distributed {MoE} training Casper, Patrick LeGresley, Mostofa Patwary, Vijay
and inference with lina. In USENIX ATC, 2023. Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti,
Julie Bernauer, Bryan Catanzaro, et al. Efficient
[21] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar,
large-scale language model training on gpu clusters
Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith,
using megatron-lm. In International Conference for
Brian Vaughan, Pritam Damania, et al. Pytorch dis-
High Performance Computing, Networking, Storage
tributed: Experiences on accelerating data parallel
and Analysis, 2021.
training. arXiv preprint arXiv:2006.15704, 2020.
[22] Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, [33] NCCL. Optimized primitives for inter-GPU commu-
Yongbin Li, and Yang You. Sequence parallelism: nication, 2025. [Link]
Long sequence training from system perspective. [34] Xiaonan Nie, Pinxue Zhao, Xupeng Miao, Tong
arXiv preprint arXiv:2105.13120, 2021. Zhao, and Bin Cui. Hetumoe: An efficient trillion-
[23] Wanchao Liang, Tianyu Liu, Less Wright, Will scale mixture-of-expert distributed training system.
Constable, Andrew Gu, Chien-Chin Huang, Iris arXiv preprint arXiv:2203.14685, 2022.
Zhang, Wei Feng, Howard Huang, Junjie Wang, et al. [35] Suchita Pati, Shaizeen Aga, Mahzabeen Islam,
Torchtitan: One-stop pytorch native solution for Nuwan Jayasena, and Matthew D. Sinclair. T3:
production ready llm pre-training. arXiv preprint Transparent tracking & triggering for fine-grained
arXiv:2410.06511, 2024. overlap of compute & collectives. In ACM ASPLOS,
[24] Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, 2024.
Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong
[36] Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai
Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A
Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue
strong, economical, and efficient mixture-of-experts
Yang, Bolin Ni, Jingcheng Hu, et al. Fp8-lm:
language model. arXiv preprint arXiv:2405.04434,
Training fp8 large language models. arXiv preprint
2024.
arXiv:2310.18313, 2023.
[25] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang,
[37] Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao,
Bochao Wu, Chengda Lu, Chenggang Zhao,
Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong
Chengqi Deng, Chenyu Zhang, Chong Ruan, et al.
Guo. A generic communication scheduler for dis-
Deepseek-v3 technical report. arXiv preprint
tributed dnn training acceleration. In ACM SOSP,
arXiv:2412.19437, 2024.
2019.
[26] Hao Liu and Pieter Abbeel. Blockwise parallel trans-
[38] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase,
formers for large context models. Neural Information
and Yuxiong He. Zero: Memory optimiza-
Processing Systems, 2024.
tions toward training trillion parameter models.
[27] Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring at- In International Conference for High Performance
tention with blockwise transformers for near-infinite Computing, Networking, Storage and Analysis,
context. arXiv preprint arXiv:2310.01889, 2023. 2020.
[28] Juncai Liu, Jessie Hui Wang, and Yimin Jiang. Janus: [39] Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley,
A unified distributed training framework for sparse Shaden Smith, and Yuxiong He. Zero-infinity:
mixture-of-experts models. In ACM SIGCOMM, Breaking the gpu memory wall for extreme scale
2023. deep learning. In International Conference for High
16
Performance Computing, Networking, Storage and [50] Shibo Wang, Jinliang Wei, Amit Sabne, Andy
Analysis, 2021. Davis, Berkin Ilbeyi, Blake Hechtman, Dehao Chen,
Karthik Srinivasa Murthy, Marcello Maggioni, Qiao
[40] Samyam Rajbhandari, Conglong Li, Zhewei Yao,
Zhang, et al. Overlap communication with depen-
Minjia Zhang, Reza Yazdani Aminabadi, Am-
dent computation via decomposition in large deep
mar Ahmad Awan, Jeff Rasley, and Yuxiong He.
learning models. In ACM ASPLOS, 2022.
DeepSpeed-MoE: Advancing mixture-of-experts in-
ference and training to power next-generation AI [51] Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng
scale. In International Conference on Machine Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou,
Learning (ICML), 2022. Weihao Cui, Size Zheng, Li-Wen Chang, et al.
Comet: Fine-grained computation-communication
[41] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, overlapping for mixture-of-experts. arXiv preprint
and Yuxiong He. Deepspeed: System optimizations arXiv:2502.19811, 2025.
enable training deep learning models with over 100
billion parameters. In ACM SIGKDD, 2020. [52] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo,
Chien-Chin Huang, Min Xu, Less Wright, Hamid
[42] Jie Ren, Samyam Rajbhandari, Reza Yazdani Am- Shojanazeri, Myle Ott, Sam Shleifer, Alban Des-
inabadi, Olatunji Ruwase, Shuangyan Yang, Minjia maison, Can Balioglu, Pritam Damania, Bernard
Zhang, Dong Li, and Yuxiong He. Zero-offload: De- Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Math-
mocratizing billion-scale model training. In USENIX ews, and Shen Li. Pytorch fsdp: Experiences on
ATC, 2021. scaling fully sharded data parallel. Proceedings of
[43] Noam Shazeer. Glu variants improve transformer. the VLDB Endowment, 2023.
arXiv preprint arXiv:2002.05202, 2020. [53] Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wen-
[44] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, lei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang
Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Wang, Jianxi Ye, Haibin Lin, et al. Tilelink: Gener-
Dean. Outrageously large neural networks: The ating efficient compute-communication overlapping
sparsely-gated mixture-of-experts layer. arXiv kernels using tile-centric primitives. arXiv preprint
preprint arXiv:1701.06538, 2017. arXiv:2503.20313, 2025.
17
A Appendix Activation
hidden
Shape
[b, s/n, h]
Obtained From
# Input
ln1_out [b, s/n, h] # RMSNorm(hidden)
For parameter synchronization in TP attention, com- The ratio of inter-node communication latency and
munication involves data of size P/n across d devices intra-node communication latency is
in two primary steps in LLM training:
1 intra-node bandwidth n(d − 1)
• inter-node reduce-scatter operation, where the × × (10)
data size is P/n, on d devices. n inter-node bandwidth d(n − 1)
• inter-node all-gather operation, where the data Consider a typical training scenario involving an H100
size is P/n, on d devices. SXM machine, where the NVLink bandwidth is 450
leading to primarily inter-node communication, with GB/s, and the inter-device NIC communication band-
a communication volume of 2P/n(d − 1)/d. width is 50 GB/s. In this context, the latency of
inter-node communication can easily surpass that
With SP attention, the parameter synchronization of intra-node communication. This implies that the
involves the entire data of size P across n × d devices. communication within a node can overshadow that
Considering the discrepancy between intra-node and between nodes. Consequently, in such scenarios, the
inter-node network bandwidth, this process can be im- synchronization of gradients and parameters with SP
plemented by four-step hierarchical communication, attention is, in fact, consistent with TP attention.
where the replicated parameters are first reduced
within a node and then reduced across nodes, before A.2 Selective Activation Rematerializa-
being distributed back to each device. Figure 5a illus- tion
trates a hierarchical communication example where
Figure 20 illustrates the shapes of the key activations
n = 3 and d = 2. The detailed steps are as follows.
produced during forward propagation, with the high-
• intra-node reduce-scatter operation, where the lighted activations retained for backward propagation.
data size is P , on n devices. Let the model parallelism size within one MoE layer
be n and the intermediate hidden size of one expert
• inter-node reduce-scatter operation, where the be f h. The total activation of a single MoE layer is
data size is P/n, on d devices.
• inter-node all-gather operation, where the data (2n + 2k + 3kf + 12 + 5/m)bsh/n,
size is P/n, on d devices.
which we have reduced to
• intra-node all-gather operation, where the data
size is P , on n devices. (2kf + 4 + 2/m)bsh/n.
18
MegaScale-MoE achieves higher Model FLOPs Utilization (MFU) than Megatron-LM by implementing communication-efficient parallelism strategies and fine-grained overlapped communication, which reduce non-overlapped communication time. This approach along with communication compression strategies leads to a performance increase up to 1.88× higher MFU during the training of a 352B MoE model .
MegaScale-MoE overcomes the efficiency issues associated with tensor parallelism by employing expert and sequence parallelism strategies. While tensor parallelism incurs significant communication overhead and lower GEMM efficiency due to its partitioning approach, expert parallelism distributes experts across nodes to reduce memory pressure, and sequence parallelism optimizes communication by partitioning along the sequence dimension, thus improving overall system efficiency .
Pipeline parallelism in MegaScale-MoE contributes to training efficiency by distributing model parameters effectively, which helps reduce communication overhead and overlap communication of different micro-batches. This distribution allows for more efficient processing and scaling of models, especially for large-scale systems with complex engineering challenges .
MegaScale-MoE maintains convergence stability in MoE models by incorporating tailored quantization strategies, particularly in FP8 training. The transition from FP32 to BF16 reduces communication overhead without significantly affecting model performance, and the use of FP32 reduction techniques further helps in preserving convergence stability while minimizing communication volume .
Sequence parallelism and expert parallelism improve MoE training efficiency by reducing the communication volume and enhancing parallel processing. Sequence parallelism partitions the model along the sequence dimension, which decreases the communication on critical paths. Expert parallelism distributes experts across nodes, alleviating memory pressure and lowering per-layer cross-node communication overhead compared to tensor parallelism, which has higher communication costs and lower GEMM efficiency due to its partitioning approach .
MegaScale-MoE achieves efficient scaling in the training of transformer models with trillions of parameters by employing comprehensive communication optimizations, such as communication compression and fine-grained communication overlapping, which cut down on communication overhead. It also leverages massive parallelism across thousands of GPUs, using advanced techniques like selective activation rematerialization and expert parallelism, thereby optimizing GPU memory usage and reducing the need for proportional increases in inference costs .
Expert parallelism reduces memory pressure in MoE models by distributing the experts across multiple devices, which allows parallel processing and alleviates memory constraints. This setup enables efficient token assignment to experts and facilitates the retrieval of outputs through all-to-all communication, which is crucial for handling the large parameter sizes of MoE models .
MegaScale-MoE optimizes GPU memory usage by using selective activation rematerialization, storing only a subset of activations in GPU memory during the forward pass, and recomputing or re-communicating to obtain the required activations during the backward pass. This approach hides the rematerialization overhead efficiently. Furthermore, by employing a fine-grained approach to overlap communication with computation, it achieves comparable performance while storing only half of the activations .
MegaScale-MoE employs several communication optimizations to enhance training efficiency, including fine-grained communication overlap with computation by splitting it into tiles, communication compression by reducing inter-node parameter synchronization precision, and using FP8 communication tailored quantization strategies in FP8 training. These methods effectively minimize communication volume while ensuring convergence stability and enable efficient use of GPU resources across nodes .
As GPU compute capability increases, MegaScale-MoE faces challenges associated with memory-intensive operations like routing, local scatter, and gather, which become more time-consuming due to the slower scaling of memory bandwidth compared to compute capabilities. This results in decreased MFU since higher compute capabilities lead to a reliance on memory loading constrained by memory bandwidth, which affects operations like GEMMs .