
MegaScale-MoE: Efficient MoE Training

MegaScale-MoE is a production system designed for the efficient training of large-scale mixture-of-experts (MoE) models, addressing communication bottlenecks that hinder training efficiency. By employing customized communication-efficient parallelism strategies and overlapping communication with computation, MegaScale-MoE achieves a training throughput of 1.41M tokens/s on a 352B MoE model, improving efficiency by 1.88× compared to existing frameworks. This work aims to share insights on system design to inspire future research in MoE training systems.


MegaScale-MoE: Large-Scale Communication-Efficient

Training of Mixture-of-Experts Models in Production

Chao Jin1,2,◦,∗ , Ziheng Jiang1,◦ , Zhihao Bai1 , Zheng Zhong1 , Juncai Liu1 , Xiang Li1 ,
Ningxin Zheng1 , Xi Wang1 , Cong Xie1 , Qi Huang1 , Wen Heng1 , Yiyuan Ma1 , Wenlei
Bao1 , Size Zheng1 , Yanghua Peng1 , Haibin Lin1 , Xuanzhe Liu2 , Xin Jin2,† , Xin Liu1,†
arXiv:2505.11432v2 [[Link]] 19 May 2025

1 ByteDance Seed, 2 Peking University

◦ Equal Contribution, ∗ Work done at ByteDance Seed, † Corresponding authors

Abstract
We present MegaScale-MoE, a production system tailored for the efficient training of large-scale
mixture-of-experts (MoE) models. MoE emerges as a promising architecture to scale large language
models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing
MoE training systems experience a degradation in training efficiency, exacerbated by the escalating
scale of MoE models and the continuous evolution of hardware.
Recognizing the pivotal role of efficient communication in enhancing MoE training, MegaScale-MoE
customizes communication-efficient parallelism strategies for attention and FFNs in each MoE
layer and adopts a holistic approach to overlap communication with computation at both inter-
and intra-operator levels. Additionally, MegaScale-MoE applies communication compression with
adjusted communication patterns to lower precision, further improving training efficiency. When
training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training
throughput of 1.41M tokens/s, improving the efficiency by 1.88× compared to Megatron-LM.
We share our operational experience in accelerating MoE training and hope that by offering our
insights in system design, this work will motivate future research in MoE systems.

Correspondence: Xin Jin, Xin Liu

1 Introduction

As the size of Large Language Models (LLMs) [7, 16, 47] grows, so does the scale of their training regimes. The escalation in training scale has made efficiency improvements not just desirable but crucial [17]. As a company building AI products for billions of users, we remain committed to training LLMs with hundreds of billions of parameters on thousands of GPUs. Consequently, even marginal gains in training efficiency can significantly reduce computational resource consumption and training time, directly influencing the feasibility and sustainability of developing state-of-the-art LLMs.

Within the landscape of LLM architectures, Mixture-of-Experts (MoE) models stand out for their sparse activation [7, 9, 16, 44], which dynamically routes input tokens to a selected set of specialized network components, known as experts, rather than to all parameters. This design leads to sub-linear scaling of FLOPs required as the model size increases, thereby significantly reducing the computational cost. Recent industrial advancements [2, 3, 8, 25, 40] have demonstrated the potential of MoE models, achieving an order-of-magnitude reduction in training cost compared to dense models with equivalent model quality.

Despite the lower training costs of MoE models, we observe a critical performance bottleneck during training from a systems perspective: communication. For instance, when training an internal model on NVIDIA Hopper GPUs, communication accounts for 43.6% of the total time during the forward pass and 32% over the entire training process. Two primary factors contribute to this bottleneck. First, MoE models inherently introduce more communication overhead. Compared to dense model training, MoE model training requires distribution across more GPUs for model parallelism due to its larger parameter size. Second, enabling sparse computation requires two extra all-to-all communications in both the forward and backward passes to dispatch and aggregate tokens, respectively, which hinders ongoing computation.

Moreover, as hardware advances, the imbalance between computation and communication becomes increasingly pronounced, with communication overhead growing more dominant. Alongside improvements in model architectures, hardware capabilities have evolved rapidly, with GPUs achieving significantly higher processing speeds (Figure 1). Concurrently, reductions in training precision have been adopted to make training more efficient and cost-effective [25, 36]. These trends lead to a scenario where the raw computation time decreases, making the relative impact of communication overhead a more critical bottleneck. For instance, simply extending existing tensor parallelism to multi-node setups has been observed to push communication overhead beyond 50% in certain cases. As a result, optimizing communication is essential for sustaining and improving the scalability of MoE model training, particularly in distributed environments where frequent data synchronization across multiple GPUs is required.

Figure 1 Evolution of NVIDIA GPUs. [Plot of NVLink bandwidth (TB/s) against Tensor Core throughput (PFLOPS). Data points: V100 (FP16, 2017); A100 (BF16, 2020); H100 (BF16 and FP8, 2022); B200 (BF16 and FP8, 2024).]

In this paper, we present the design, implementation, and operational experience of MegaScale-MoE, a production system optimized for efficient large-scale MoE training. By meticulously addressing the communication bottleneck, MegaScale-MoE strives to push the boundaries of MoE training, achieving significant improvements in performance and efficiency.

MegaScale-MoE addresses the communication problem in MoE training from three key aspects. First, MegaScale-MoE reduces the communication volume by customizing parallelism strategies for the attention and FFN modules in each MoE layer. We compare the parallelism strategies in existing LLM training frameworks, comprehensively considering their impact on large-scale training, including the communication volume and whether communication can be effectively overlapped (i.e., whether it lies on the critical path). Based on this analysis, we select the optimal combination of parallelism strategies for MoE training.

Second, MegaScale-MoE fully overlaps communication with computation at the operator level. MegaScale-MoE partitions the forward and backward passes of each MoE layer into distinct computation and communication operators. For inter-operator overlap, MegaScale-MoE employs a holistic scheduling strategy that carefully reorders communication and computation operators during both forward and backward propagation, hiding communication within independent computations. This approach also optimizes GPU memory usage. MegaScale-MoE utilizes selective activation rematerialization, retaining only a subset of activations in GPU memory during the forward pass, and recomputing or re-communicating to obtain the required activations during the backward pass. With this holistic scheduling, MegaScale-MoE effectively hides the rematerialization overhead, achieving comparable performance while storing only half of the activations.

To overlap communication on the critical paths, MegaScale-MoE employs a fine-grained approach that splits communication into tiles and aligns with the GPU compute pattern, fusing these tile-level communications into the compute kernels. For MoE models with token dispatch, MegaScale-MoE fuses an efficient local scatter operation into the kernel and reorganizes the computation tasks along the scattered dimension to mitigate communication bottlenecks from multiple data sources. This fine-grained overlap occurs within each node, leveraging the high-bandwidth connectivity between GPUs.

Third, MegaScale-MoE leverages communication compression to further enhance MoE training efficiency. Specifically, for widely-used BF16 mixed-precision training, MegaScale-MoE reduces the inter-node parameter synchronization precision from FP32 to BF16, halving the associated overhead. In FP8 training, MegaScale-MoE replaces BF16 reduce-scatter with FP8 communication, incorporating tailored quantization strategies and FP32 reduction to decrease communication volume while preserving convergence stability.

Figure 2 Mixture-of-Experts (MoE) layer. [Diagram labels: input; LayerNorm; Self Attention (LinearQKV, Attention, LinearOut); LayerNorm; Router with gating weights; expert (LinearFC1, LinearFC3, SwiGLU, LinearFC2); output.]

Figure 3 Different parallelism strategies for self-attention. "TP" denotes partitioning along the dimension of hidden size, while "SP" denotes partitioning along the dimension of sequence length. [Panels: (a) Megatron-LM tensor parallelism; (b) DeepSpeed Ulysses sequence parallelism; (c) Megatron-LM context parallelism.]

MegaScale-MoE is deployed in our datacenters to
train MoE models for our products. Compared to the state-of-the-art open-source LLM training framework, Megatron-LM [46], MegaScale-MoE achieves up to 1.88× higher MFU (Model FLOPs Utilization) when training a 352B MoE model on 1,440 NVIDIA Hopper GPUs. With comprehensive communication optimizations, MegaScale-MoE powers large-scale training in our production, efficiently scaling to trillions of parameters and thousands of GPUs while saving millions of GPU hours.

2 Background

2.1 Mixture-of-Experts for Transformer

The Mixture of Experts (MoE) mechanism is an advanced approach designed to boost the performance of Transformer [49] models, which are increasingly pivotal in the realm of LLMs [2, 7, 16, 25]. It extends the Transformer architecture by integrating multiple expert networks within the feed-forward network (FFN) component. As illustrated in Figure 2, MoE models dynamically route input tokens to the most relevant experts based on their characteristics. This routing is managed by a trainable gating mechanism that selects the best-suited experts for each token. This architectural innovation enables MoE models to scale in capacity without a proportional increase in inference costs, as only a subset of experts is activated for each input.

2.2 Large-scale LLM Training

Training large language models at scale on tens of thousands of GPUs is a complex system engineering challenge that requires multiple systems techniques. To distribute the training workload, a combination of parallelism strategies such as data, tensor, and pipeline parallelism is necessary [17, 41, 46], as each approach has limitations that prevent relying on a single method for effective scaling.

Data parallelism uniformly distributes the training data across all devices, with each device replicating the model parameters and optimizer states. To synchronize the parameters after each training iteration, data parallelism performs an all-reduce communication operation. Zero Redundancy Optimizer (ZeRO) [38] improves over data parallelism by distributing model states across all participating devices. ZeRO unfolds across three progressive stages, each designed to increasingly conserve memory, though this comes with the trade-off of elevated communication.

Tensor parallelism distributes compute-intensive tensor operations over multiple devices, enabling parallel computation and significantly accelerating the training process. The specific partitioning strategy and the dependencies among operators within the model dictate that tensor parallelism may necessitate gathering split inputs (all-gather) or merging outputs (reduce-scatter). In LLM training, operators like LayerNorm and Dropout, though less compute-intensive, require substantial activation memory. To tackle this problem, a variant of tensor parallelism known as sequence parallelism [18] is proposed, which partitions these operators along the dimension of sequence length. For long-context training, several works [1, 14, 46] apply sequence parallelism or tensor parallelism to different operators in self-attention. Figure 3 illustrates the mainstream parallelism strategies for attention, namely tensor, sequence, and context parallelism (TP, SP, and CP), which we analyze in §3.1.

Pipeline parallelism enhances efficiency by dividing
model layers into stages that are processed on different devices, enabling pipelined execution. Each batch is split into several micro-batches for this purpose. To minimize pipeline bubbles, various scheduling strategies have been developed, e.g., GPipe [12], PipeDream 1F1B [31], and Interleaved 1F1B [32]. Megatron-LM adopts Interleaved 1F1B pipeline scheduling, further dividing each stage on one device into multiple virtual stages to reduce the pipeline bubble rate.

Expert parallelism is tailored for training MoE models by distributing experts across multiple devices, alleviating memory pressure and enabling parallel processing. To efficiently assign tokens to the appropriate experts and retrieve their outputs, all-to-all communication is typically employed.

3 Communication-Efficient Parallelism

With the rise of MoE models and the evolution of hardware compute capabilities, communication overhead has become increasingly critical in MoE training in production. In this section, we delve into the parallelism strategies employed to reduce communication volume and meet other training requirements, such as high GEMM (General Matrix Multiplication) efficiency.

Figure 4 shows the design space of parallelism strategies for large-scale MoE training, excluding the outermost data parallelism. We start with inter-node parallelism. Expert parallelism alleviates memory pressure from MoE models' large parameter size by distributing experts across nodes but incurs per-layer cross-node communication, harming training efficiency. Similarly, tensor parallelism's high communication overhead makes it more efficient to limit TP to a single node. Following prior work [17], we adopt pipeline parallelism to distribute model parameters, reduce communication, and overlap communication of different micro-batches.

Figure 4 Design space for large-scale MoE training. [Diagram contents, reconstructed from the extracted labels:
Inter-node parallelism: PP; or TP or EP (communication across nodes per layer, high overhead).
Intra-node parallelism, FFN: TP (lower GEMM efficiency and more communication); or EP.
Intra-node parallelism, attention: DP (n× activations due to global batch size limit, poor scalability); TP (comm. volume on critical path: 2bsh(n − 1)/n); Ulysses SP (comm. volume: 2bsh(n − 1)/n × (2 + 2/m)/n); CP (imbalanced computation due to masked attention, comm. volume: 2bsh(n − 1)/n × 2/m).]

Prior large-scale MoE training systems, such as Megatron-LM [46] and DeepSpeed-MoE [40], incorporate tensor parallelism to scale up training by partitioning the model parameters within the node. However, in our practice, we observe two issues with this approach: (1) TP partitions the expert dimension, which negatively impacts GEMM efficiency; and (2) TP introduces significant communication overhead, which remains constant as the parallelism size increases, eventually causing communication to exceed computation on modern hardware.

To address these issues, we tailor parallelism strategies for MoE model components. For feed-forward networks (i.e., experts), we replace tensor parallelism with expert parallelism and use custom communication modes optimized for varying top-k and expert sizes, ensuring communication overhead stays lower than that of tensor parallelism. For other components, we apply sequence parallelism, partitioning along the sequence dimension instead of the batch dimension, allowing scaling without increasing global batch size. This also reduces communication on critical paths compared to tensor parallelism. The additional memory and DP communication overhead remain manageable due to the parameter asymmetry across components. We detail the rationale and analysis of this intra-node parallelism strategy in the following sections. Table 1 lists the key symbols.

Table 1 Description of symbols.

Symbol  Description
b       micro-batch size
s       sequence length
h       hidden dimension size
n       model parallelism (TP, SP, or EP) size
m       the ratio between the number of query heads and that of key-value heads
k       number of experts that each token is routed to

3.1 Sequence Parallelism for Attention

Due to the inherent parallelizability of the expert components in MoE models, most prior work on MoE training [20, 40] focuses on optimizing expert parallelism, while data parallelism (DP) is typically applied to the non-MoE components such as attention. However, when scaling up MoE training, this approach proves insufficient due to the n× activation memory consumption. This issue arises because DP splits the batch dimension both across and within nodes. Compared to other intra-node parallelism strategies shown in Figure 4, applying DP to attention forces each GPU within a node to process one micro-batch simultaneously, increasing the activation size by 8×, which often results in out-of-memory issues.
To enable scalable MoE training, implementing intra-node parallelism for the attention module is crucial. Tensor parallelism (TP) is commonly employed to parallelize attention operations within nodes. However, it introduces inevitable communication costs due to all-gathering and reduce-scattering activations along the critical path. With the increasing gap between computational FLOPs and communication bandwidth, we find that the TP communication overhead can even surpass the computation time of self-attention. This communication-dominated bottleneck limits the ability to overlap communication and computation, ultimately reducing training efficiency.

We adopt sequence parallelism (SP), as proposed in DeepSpeed-Ulysses [14], to scale MoE training and effectively reduce communication along the critical path. SP is commonly used in long-context training to address memory challenges associated with long inputs. We find it also works well in large-scale MoE training. First, it significantly reduces communication overhead compared to TP, especially when using grouped-query attention [4]. Second, while it introduces some parameter redundancy and increased communication overhead during parameter synchronization, the unique characteristics of MoE models make these trade-offs manageable and acceptable.

Communication efficiency. When utilizing TP, the communication volume in attention is

    2bsh(n − 1)/n.    (1)

With SP, the communication volume decreases to

    2bsh(n − 1)/n × (2 + 2/m)/n,    (2)

where m represents the ratio between the number of query heads and that of key-value heads. Assuming the model is trained on an NVIDIA Hopper GPU workstation with an NVLink domain of size 8, the communication latency for sequence parallel attention can be reduced to about one-fourth of that required by tensor-parallel attention.

Data communication & memory overhead. A notable difference between SP and TP attention is how parameters are distributed across devices: TP shards the attention weights, while SP replicates them. This raises the concern about the potential increase in communication overhead for synchronizing gradients and parameters. Counterintuitively, although SP attention requires synchronization of n× more parameters compared to TP attention, the difference in communication overhead is minimal in practical scenarios, given the intra- and inter-node bandwidth asymmetry and the adoption of hierarchical communication operations in modern communication libraries [33], as shown in Figure 5 and analyzed in Appendix A.1.

Figure 5 Hierarchical communication for parameter synchronization in SP attention. [Panels: (a) Four-step hierarchical parameter synchronization (intra-node RS, inter-node RS, inter-node AG, intra-node AG) across Node0 and Node1, each with GPU0 to GPU2; (b) Overlapping in hierarchical communication, with intra-node and inter-node steps pipelined.]

On the other hand, the additional GPU memory consumption introduced by SP attention is minimal in MoE training. For large-scale MoE models with tens to hundreds of experts, the majority of GPU memory is consumed by the expert parameters. Our experiments, detailed in §6.2, confirm that the extra parameter synchronization and memory overhead of SP attention remain manageable.

Balanced vs. imbalanced. In addition to the Ulysses-style SP attention, we also explored other forms, including context parallelism (CP) [1], which partitions all activations along the sequence dimension. CP attention, however, faces workload imbalance due to causal masking in attention, as each token only attends to previous tokens. To mitigate this, we attempted the zigzag strategy by grouping the head and tail partitions of the sequence on the same GPU, although achieving perfect balance remains challenging. Consequently, in large-scale training, the entire training process is often constrained by the most imbalanced data batch. Moreover, this imbalance disturbs the training pipeline, thereby reducing overall training efficiency.
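To illustrate why causal masking unbalances context parallelism and what the zigzag grouping changes, here is a minimal sketch. It assumes a single sequence split into 2n equal chunks, plain causal masking, and a cost model that simply counts query-key pairs. The function names and the cost model are illustrative assumptions, not the paper's implementation. Real batches, for example with documents of varying lengths, need not balance this cleanly, which is in line with the remark that perfect balance remains challenging.

```python
def causal_work(chunk_ids, chunk_len):
    """Count query-key pairs a worker computes under causal masking.

    A query at global position p attends to positions 0..p, i.e. p + 1 keys.
    """
    total = 0
    for c in chunk_ids:
        start = c * chunk_len
        for p in range(start, start + chunk_len):
            total += p + 1
    return total


def contiguous_partition(n):
    """Worker g owns chunks 2g and 2g+1 (a contiguous slice of the sequence)."""
    return [[2 * g, 2 * g + 1] for g in range(n)]


def zigzag_partition(n):
    """Worker g owns chunk g (head side) and chunk 2n-1-g (tail side)."""
    return [[g, 2 * n - 1 - g] for g in range(n)]


def imbalance(partition, chunk_len):
    """Ratio of the busiest worker's work to the mean work (1.0 = balanced)."""
    work = [causal_work(chunks, chunk_len) for chunks in partition]
    return max(work) / (sum(work) / len(work))


# Hypothetical setup: 8 workers, 16 chunks of 64 tokens each.
n, chunk_len = 8, 64
print("contiguous:", round(imbalance(contiguous_partition(n), chunk_len), 3))
print("zigzag:    ", round(imbalance(zigzag_partition(n), chunk_len), 3))
```

Under this idealized cost model the zigzag grouping equalizes the per-worker work, while the contiguous split leaves the last worker with far more work than the average.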

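The four-step hierarchical synchronization named in Figure 5a (intra-node reduce-scatter, inter-node reduce-scatter, inter-node all-gather, intra-node all-gather) can be simulated on plain Python lists to check that it produces the same result as a flat all-reduce. This is a sketch under stated assumptions: 2 nodes with 3 GPUs each, as in the figure, gradients modeled as lists of floats, and helper names (`reduce_scatter`, `all_gather`, `hierarchical_all_reduce`) that are hypothetical and not taken from any communication library.

```python
def reduce_scatter(bufs):
    """Each of the len(bufs) ranks ends with one summed shard of the buffer."""
    world = len(bufs)
    shard = len(bufs[0]) // world
    return [
        [sum(b[r * shard + i] for b in bufs) for i in range(shard)]
        for r in range(world)
    ]


def all_gather(shards):
    """Every rank ends with the concatenation of all ranks' shards."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]


def hierarchical_all_reduce(grads, nodes, gpus_per_node):
    """grads[node][gpu] is a list of floats; returns the same layout, summed."""
    # Step 1: intra-node reduce-scatter (stays on the fast intra-node links).
    part = [reduce_scatter(grads[nd]) for nd in range(nodes)]
    # Step 2: inter-node reduce-scatter among GPUs with the same local index.
    for g in range(gpus_per_node):
        across = reduce_scatter([part[nd][g] for nd in range(nodes)])
        for nd in range(nodes):
            part[nd][g] = across[nd]
    # Step 3: inter-node all-gather among GPUs with the same local index.
    for g in range(gpus_per_node):
        across = all_gather([part[nd][g] for nd in range(nodes)])
        for nd in range(nodes):
            part[nd][g] = across[nd]
    # Step 4: intra-node all-gather restores the full buffer on every GPU.
    return [all_gather(part[nd]) for nd in range(nodes)]


nodes, gpus = 2, 3
size = nodes * gpus * 2  # buffer length must be divisible by nodes * gpus
grads = [[[float(nd * 10 + g + i) for i in range(size)] for g in range(gpus)]
         for nd in range(nodes)]
expected = [sum(grads[nd][g][i] for nd in range(nodes) for g in range(gpus))
            for i in range(size)]
result = hierarchical_all_reduce(grads, nodes, gpus)
print(all(result[nd][g] == expected for nd in range(nodes) for g in range(gpus)))
```

In this scheme each GPU enters the inter-node steps holding only 1/gpus_per_node of the buffer, so the slower inter-node links carry a fraction of the data that a flat exchange would send.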
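Equations (1) and (2) from the communication-efficiency discussion in this section can be evaluated directly to see where the "about one-fourth" estimate comes from. The sketch below only transcribes the two formulas. The values chosen for b, s, h, and m are hypothetical, and n = 8 matches the NVLink domain size mentioned in the text. It compares volumes, so it matches latency only to the extent that latency scales with volume.

```python
def tp_attention_volume(b, s, h, n):
    """Equation (1): TP attention communication volume, in elements."""
    return 2 * b * s * h * (n - 1) / n


def sp_attention_volume(b, s, h, n, m):
    """Equation (2): Ulysses-style SP attention communication volume."""
    return tp_attention_volume(b, s, h, n) * (2 + 2 / m) / n


# Hypothetical sizes; n = 8 GPUs in one NVLink domain.
b, s, h, n = 1, 8192, 4096, 8
# m = query heads / key-value heads (m = 1 is plain multi-head attention).
for m in (1, 4, 8):
    ratio = sp_attention_volume(b, s, h, n, m) / tp_attention_volume(b, s, h, n)
    print(f"m={m}: SP/TP volume ratio = {ratio:.3f}")
```

The ratio is (2 + 2/m)/n and does not depend on b, s, or h. For n = 8 it approaches 2/8 = 1/4 as m grows, which is why grouped-query attention makes SP particularly attractive.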
3.2 Expert Parallelism for Feed-forward Network

In the choice of parallelism strategies for the feed-forward network component, expert parallelism (EP) consistently outperforms tensor parallelism. TP partitions the hidden dimension of each expert, reducing GEMM efficiency, whereas EP maintains full expert computation on each device. Theoretically, the communication cost for EP is

    2k/n × bsh(n − 1)/n,    (3)

while for TP it is

    2bsh(n − 1)/n.    (4)

Although their relative efficiency depends on the ratio k/n, we design an adaptive communication strategy for different top-k values to minimize the communication volume of EP.

Efficient communication pattern. Figure 6 compares the typical EP implementation with MegaScale-MoE's approach. The standard EP implementation requires two all-to-all communications for token dispatch and aggregation. Additionally, a scatter operation may be required before sending and after receiving tokens to ensure that tokens assigned to the same expert reside in a contiguous memory space.

Figure 6 Communication-efficient expert parallelism. e represents the number of tokens routed to the worker. [Panels: (a) Typical EP implementation; (b) EP implementation in MegaScale-MoE when top-k > n.]

When the top-k value exceeds n, we replace traditional all-to-all communication with all-gather and reduce-scatter. First, an all-gather operation collects tokens from all workers. Then, a local scatter operation discards unneeded tokens, retaining only those required by the experts on the current worker. After expert computation, the tokens are assembled into a complete tensor. This approach enables a gather operation before communication, followed by a reduce-scatter to produce the final result, ensuring that EP's communication overhead remains equal to or lower than TP's.

In practical training, all-to-all communication is less efficient than all-gather and reduce-scatter, as it requires each worker to communicate with all others, whereas all-gather and reduce-scatter follow a ring-based communication pattern with only neighboring workers. As shown in Figure 7, the communication time for these three operations in Mixtral-8×7B reveals that when top-k > 6, the all-gather-based EP implementation is more efficient.

Figure 7 Comparison of AG, RS, and A2A for token dispatch. [Plot of time (us) against top-k from 1 to 8 for All-Gather, Reduce-Scatter, and All-to-All.]

Efficient operators. Instead of using torch.scatter_add and [Link] for tensor scattering and gathering like Megatron-LM, we develop efficient scatter and gather operators directly using CUDA. Based on the token routing results, we pre-calculate the mapping from each row of the input tensor (representing a token) to the corresponding row in the output tensor. The scatter and gather operators then perform data transfers efficiently according to this mapping.

Load balance. A well-known challenge in MoE model training is load balancing across experts [20, 24]. To address this, we use auxiliary loss and token dropping to balance the workload across GPUs within each node. Similar to DeepSeek-V2 [24], we treat the experts placed on the same GPU as a group and calculate the balance loss and computational capacity for each device rather than for each individual expert.

4 Communication-computation Overlap

After optimizing parallelism strategies to minimize communication volume, we further reduce the communication overhead to nearly zero using comprehensive communication-computation overlapping techniques. Training large models involves integrating various techniques, which increases the complexity of communication overlap. For instance, at any given moment, the device might concurrently handle computation and communication kernels, overlap PP and DP communications, and manage data transfers between the device and host. Existing frameworks like Megatron-LM assemble attention and FFN modules into MoE layers and rely on the [Link] package for backward propagation, which limits the flexibility of communication overlap. In contrast, MegaScale-MoE decomposes the attention and FFN modules of each MoE layer into operators that run as GPU kernels, enabling fine-grained communication overlap through flexible scheduling.
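As a rough illustration of what is at stake, the communication fractions quoted in the introduction (43.6% of the forward pass and 32% of the whole training process for an internal model) give an upper bound on the speedup that perfect overlap could deliver. The sketch assumes, simplistically, that all communication can be hidden and that hiding it leaves computation time unchanged. `ideal_overlap_speedup` is a hypothetical helper, not part of MegaScale-MoE.

```python
def ideal_overlap_speedup(comm_fraction):
    """Upper bound on speedup if all communication hides behind computation.

    With total time normalized to 1, computation takes 1 - comm_fraction, so a
    fully overlapped run takes that long, provided computation is at least as
    long as communication (comm_fraction <= 0.5).
    """
    if not 0.0 <= comm_fraction <= 0.5:
        raise ValueError("model assumes communication is at most half the time")
    return 1.0 / (1.0 - comm_fraction)


# Fractions reported in the introduction for an internal model.
for name, frac in (("forward pass", 0.436), ("full training", 0.32)):
    print(f"{name}: at most {ideal_overlap_speedup(frac):.2f}x faster")
```

These are bounds under the stated assumptions only. Measured gains also depend on reduced communication volume, kernel efficiency, and scheduling overhead.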
4.1 Inter-operator Overlap

We overlap communication operators with independent computation operators by executing them asynchronously on different CUDA streams. To achieve optimal performance during the training process, we adopt a specifically hand-tailored, holistic scheduling strategy.

Holistic scheduling. From the caller's perspective, we implement a unified macro module to execute the entire MoE layer's forward and backward passes, thereby expanding our scheduling flexibility. For instance, during the backward pass, various communication operators can be overlapped with dependency-free computations, such as activation recomputation, to improve efficiency. From the runtime perspective, a key challenge is efficiently managing concurrent communication tasks by resolving resource conflicts to prevent blocking and maximize throughput. This requires careful coordination, such as determining the number of SMs allocated to each communication operator, to minimize interference and optimize overall throughput.

Selective activation rematerialization. The holistic scheduling strategy also helps reduce memory usage without compromising training speed. Compared to dense models with equivalent computational requirements, MoE models exert significantly higher memory pressure during training due to their parameter count being several times larger. In addition to employing ZeRO optimizations [38] to eliminate redundant optimizer states across DP groups, we further optimize memory usage through selective activation rematerialization. This approach reduces activation memory requirements by re-performing computation and communication operators that can be overlapped with other necessary operators.

Figure 8a illustrates the forward pass of a Mixtral [16] MoE layer and highlights key activations produced during this process. MegaScale-MoE strategically retains activations that are computationally expensive to recompute, while recalculating others generated by memory-intensive operations or communication operations. This minimizes dependencies on backward computation, enabling rematerialization operations to overlap with other computations and communications, avoiding delays in the critical path. For example, as shown in Figure 8b, the backward pass of the GroupedGEMM operator for FC2 requires the activation fc2_in and the gradient of fc2_out (denoted as ∆fc2_out) as inputs. MegaScale-MoE recomputes fc2_in and overlaps this operator with gradient communication (i.e., all-gather for ∆ffn_out). Similarly, ffn_in is obtained through re-performing RMSNorm and all-gather, with these operators hidden within the preceding communication and the FC2 GroupedGEMM, respectively. MegaScale-MoE also places the weighted sum of ffn_out immediately after the SwiGLU [43] activation function to eliminate the need to store ffn_out. This reordering ensures computational consistency by avoiding operators that cross non-linear boundaries.

Figure 8 Selective activation rematerialization. [Panels: (a) Forward pass of a MoE layer; (b) Backward pass snippet with activation rematerialization. Legend: heavy/light operators; retained/discarded activations; w, gating weights for each expert; W, model weights; Op(R), recomputed operator; Comm(R), re-performed communication.]

Our analysis in Appendix A.2 and experiments in §6.2 show that MegaScale-MoE reduces the activation memory by ∼50% while maintaining the same training speed.

4.2 Intra-operator Overlap

Although inter-operator overlap effectively hides communication latency, squeezing all bubbles in the execution timeline remains non-trivial, especially in the forward pass, where no rematerialization or gradient computation operators exist to overlap with communication. Some forward operators directly depend
Figure 9 Fine-grained intra-operator communication-computation overlap. Panels: (a) Data flow in A2A+GEMM (Output Proj). (b) Fine-grained overlap of A2A+GEMM. (c) Fine-grained overlap of AG+Scatter+GroupedGEMM.

on communication, such as token dispatch for expert computation, making overlap impossible unless another micro-batch is introduced, which increases memory pressure.

A widely adopted solution [17, 48, 50] is to decompose operators into smaller parallel ones to enable pipelining by executing them on separate CUDA streams. However, this approach introduces non-negligible overhead: (i) complex stream control, involving host interference and causing random bubbles due to the non-deterministic nature of CPU control; (ii) imperfect tail computation, increasing overall computation latency.

To address the above issues, we use intra-operator overlap to exploit the parallelism for communication and computation operators with direct dependency. The core idea is to fuse these operators and break down the workloads into tiles. Following prior work [5, 15, 51, 53], we implement barriers in device memory between communication and computation operators. These barriers allow fine-grained notifications between computation and communication at the tile level. Additionally, as the barriers reside in device memory, they eliminate the need for host interference, enabling further performance improvements. We implement two types of kernels, overlapping with GEMMs and overlapping with MoE GroupedGEMMs, for the attention and FFN modules, respectively.

Overlapping with GEMMs. We first introduce the intra-operator communication-computation overlap for GEMM kernels. Specifically, we implement all-to-all (A2A)+GEMM and GEMM+A2A kernels for Output and QKV Projections in SP attention, respectively, where X+Y means Y executed after X. Figure 9 shows the data flow and overlapping pattern in A2A+GEMM. The GEMM on local data and communication for remote data starts simultaneously. We leverage dedicated GPU copy engines for data transfer, ensuring that all SMs (streaming multiprocessors) are fully utilized for computation. Once a remote data tile arrives at local memory, a signal notifies the GEMM kernel to continue its computation on the arrived tile. For GEMM+A2A, the all-to-all operation is fused into the GEMM kernel. Each tile of GEMM computation ends with a remote data transfer that writes the output data tile to remote ranks. We also implement all-gather+GEMM and GEMM+reduce-scatter kernels for tensor parallelism, which are similar to A2A+GEMM and GEMM+A2A.

For A2A+GEMM and GEMM+A2A, we allocate a small number of SMs for communication as all-to-all is more complex than all-gather and reduce-scatter. The number of SMs for communication is tuned to make communication and computation exhibit similar latency. Moreover, multiple ranks may simultaneously read from or write to the same device, potentially causing contention in NVLink. To mitigate this, we apply swizzling [5, 51, 53] to reorder tile communication and computation so that the arrival of communication tiles aligns with the pace of computation tiles.

Overlapping with GroupedGEMMs. For expert parallelism with token dispatch and combine, we aim to overlap communication with GroupedGEMMs. We implement two types of overlapping kernels: all-gather+scatter+GroupedGEMM and GroupedGEMM+gather+reduce-scatter. Unlike the overlapping techniques for GEMM kernels, MoE GroupedGEMMs require token shuffling (scatter/gather). As a result, each computation tile may depend on tokens from multiple ranks. To effectively overlap computation with communication, we sort the token order to minimize the number of dependent ranks for each computation tile. Additionally, since

Table 2 Model configurations in evaluation.

Name            #layers   h      #heads   m    h_ffn   #experts   top-k
Internal-352B   60        4096   32       4    14336   32         3
Mixtral-8×7B    32        4096   32       4    14336   8          2
Mixtral-8×22B   56        6144   48       6    16384   8          2
Hunyuan-Large   64        6400   80       10   18304   16         1
Phi-3.5-MoE     32        4096   32       4    6400    16         2
DeepSeekMoE     28        2048   16       1    1408    64         6

Figure 10 diagram labels: ① local cast, ② BF16 all-to-all, ③ FP32 accumulation.
Figure 10 DP communication compression.

each tile has its own dependencies, the signal control for each tile varies depending on the MoE routing, which is determined dynamically.

In detail, for AG+scatter+GroupedGEMM, we reorder tokens along the sequence dimension based on their routed expert index. Then, for each expert, we sort the routed tokens according to their source rank index. Finally, we slice the sorted sequence into blocks and perform GroupedGEMM using a sequence of computation tiles. Specifically, as shown in Figure 9c, we fuse the local scatter into the kernel by selecting rows of input data based on the index mapping. The GroupedGEMM computation for each expert is divided into tiles, with each tile depending on only a subset or even a single source rank. This reduces the overall waiting time for each computation block, avoids redundant loading of expert parameters, and improves the overlap between computation and communication tiles.

5 Communication Compression

We further reduce communication overhead by applying communication compression. To maintain convergence stability, mixed-precision training frameworks typically transfer tensors awaiting reduction in higher precision, such as FP32, to ensure more accurate accumulation. A common example of this is gradient reduce-scatter in data parallelism.

DP communication compression. As MoE model parameters increase, so does the communication overhead for parameter and gradient synchronization in data parallelism. Prior work has explored gradient compression to mitigate this cost. In our BF16 mixed-precision training, we carefully apply FP32-to-BF16 precision reduction for gradient synchronization, balancing efficiency and convergence stability.

Specifically, as shown in Figure 10, we retain the main gradients in FP32 during local gradient accumulation in pipeline parallelism. After each model stage completes accumulation, instead of relying solely on reduce-scatter for gradient synchronization, we cast gradients to BF16 and perform all-to-all communication within the data parallel group to gather the required gradient shards, which are then locally aggregated in FP32. Our results show that this approach introduces negligible precision loss compared to directly performing reduce-scatter with FP32, while reducing gradient communication overhead by 50%.

This approach minimizes risk for two key reasons. First, it performs a one-time conversion of accumulated gradients to BF16 during communication, while the local gradient accumulation is maintained in FP32 precision. Second, instead of using ring-style reduce for BF16 gradient communication, it employs all-to-all communication, with the final reduction computed using FP32 summation. This design prevents precision loss that could arise from repeated accumulation of BF16 values in ring-based reductions.

We observe that casting large gradients and performing all-to-all communication increases peak memory consumption, potentially causing out-of-memory errors. To mitigate this, we develop a memory-efficient operator that writes the BF16 gradients in place into half of the FP32 input buffer while using the remaining half as the output buffer for BF16 all-to-all communication, preventing peak memory growth.

Communication compression for FP8 training. In low-precision FP8 training, the proportion of communication time increases due to reduced computation time. To mitigate communication overhead, we explore compressing communication volume using FP8 precision with appropriate quantization techniques. Currently, we apply communication compression in FP8 MoE training with tensor parallelism, focusing on reduction scenarios prone to overflow or underflow. For example, we adopt the E4M3 format (4-bit exponent and 3-bit mantissa) for all tensors. Similar to DP reduce-scatter compression, we replace BF16 TP reduce-scatter with FP8 all-to-all in forward propagation and perform reduction in FP32 precision. In the corresponding backward propagation, we apply FP8 all-gather for gradients. Notably, simply reducing precision leads to loss misalignment with BF16 training. To mitigate this, we apply per-token activation quantization for forward communication and

Table 3 Strong-scaling training performance for the 352B MoE model on NVIDIA H800 GPUs. The number in parentheses in the throughput column represents the speedup of MegaScale-MoE compared to Megatron-LM.

System          #GPUs   Iteration Time (s)   Throughput (tokens/s)   Training Time for 1T Tokens (days)
Megatron-LM     240     39.94                151.1k                  76.61
Megatron-LM     480     19.56                301.1k                  38.38
Megatron-LM     720     13.70                430.5k                  26.88
Megatron-LM     960     10.82                550.2k                  21.23
Megatron-LM     1440    7.90                 746.6k                  15.50
MegaScale-MoE   240     21.61                272.9k (1.81×)          42.41
MegaScale-MoE   480     11.83                498.6k (1.65×)          23.21
MegaScale-MoE   720     7.97                 740.1k (1.72×)          15.64
MegaScale-MoE   960     6.12                 963.8k (1.77×)          12.01
MegaScale-MoE   1440    4.19                 1407.7k (1.88×)         8.22

Figure 11 Weak-scaling training performance for the 352B MoE model on NVIDIA H800 GPUs. Axes: Normalized Throughput vs. #GPUs (480, 720, 960, 1440); series: Megatron-LM and MegaScale-MoE. Bar labels in the extracted figure: 1.0, 1.0, 0.997, 0.998 and 0.574, 0.566, 0.564, 0.558.

Figure 12 Performance breakdown of training Mixtral-8×7B on different GPUs. Panels: (a) Iteration time breakdown (Exposed Communication; FlashAttention and GEMMs; Others). (b) MFU. Both panels compare Megatron-LM and MegaScale-MoE on H800, A100, and H20.

Table 4 Specifications of different NVIDIA GPUs.

GPU    Compute Capability (TFLOPS)   Memory Cap. (GB)   Memory Bw. (TB/s)   NVLink Bw. (GB/s)
H800   989                           80                 3.4                 400
A100   312                           80                 2.0                 600
H20    148                           96                 4.0                 900

per-channel quantization for backward communication. In backward propagation, we further group quantization along the token dimension using a small group size (e.g., 128).

6 Evaluation

In this section, we present a comprehensive evaluation of MegaScale-MoE, covering overall training performance (§6.1), ablation studies of MegaScale-MoE’s key optimizations (§6.2), and the effectiveness of the precision-communication co-design (§6.3). Table 2 lists the configurations of the MoE models used in our evaluation, detailing hidden size (h), FFN intermediate size (h_ffn), number of experts, and top-k values. The evaluation is conducted on NVIDIA H800 GPUs unless otherwise specified, with the specifications provided in Table 4.

6.1 Training Performance

MegaScale-MoE is built on top of Megatron-LM [46], a state-of-the-art open-source LLM training system that supports 3D parallelism strategies and is continuously updated to incorporate the latest optimizations from the community. Our evaluation uses the Megatron-LM on GitHub [30] with commit hash f1f03922, selected for its stability at the commencement of our experiments months ago. For fair comparison, we use the same global batch size for Megatron-LM and MegaScale-MoE and choose the optimal parallelism configurations for the two systems, respectively. Specifically, MegaScale-MoE employs SP attention and EP within each node, while Megatron-LM adopts TP within each node, with both systems configured with a PP size of 15. Sequence length is 8,192 and vocabulary size is 65,536.

Scalability. Table 3 compares the strong-scaling training performance of Megatron-LM and MegaScale-MoE on the 352B MoE model. We scale the number of GPUs while keeping the global batch size fixed at 720. Across all settings, MegaScale-MoE achieves 1.65–1.88× speedups over Megatron-LM. As the number of GPUs increases, the MFU (Model FLOPs Utilization) of MegaScale-MoE declines from 32.48% to 27.89%. This is expected, as the batch size is fixed and the number of micro-batches for each pipeline decreases with more GPUs, leading to more bubbles.

Figure 11 presents the weak-scaling training performance of Megatron-LM and MegaScale-MoE on the same model. We scale the global batch size from 360 to 1,080 in proportion to the number of GPUs (from 480 to 1,440). MegaScale-MoE achieves a 1.74–1.79× training throughput compared to Megatron-LM. As the scale increases, Megatron-LM’s throughput drops by 2.74% due to increased communication overhead, whereas MegaScale-MoE exhibits near-linear scalability, benefiting from comprehensive communication overlap.

Performance breakdown on different GPUs. We conduct a deep dive into MegaScale-MoE to further understand the performance of training a MoE model in production environments. We train Mixtral-8×7B on 32 NVIDIA H800, H20, and A100 GPUs, respectively. The specifications of GPUs we used are listed in Table 4. We set the DP size as four, the TP size as eight for Megatron-LM, and the SP and EP size

Figure 13 Parallelism efficiency for different models. Normalized MFU of TP+TP, SP+TP, TP+EP, and SP+EP (MegaScale-MoE) on Internal-352B, Mixtral-8×7B, Mixtral-8×22B, Hunyuan-Large, Phi-3.5-MoE, and DeepSeekMoE.

Figure 14 Parameter synchronization time under SP and TP attention. Panels: (a) DP=4. (b) DP=8. Axes: Sync. Time (ms) vs. Attention Parameter Size (MB); series: TP Attention and SP Attention (MegaScale-MoE).

as eight for MegaScale-MoE. As shown in Figure 12b, across the three kinds of GPUs, MegaScale-MoE consistently outperforms Megatron-LM by up to 1.58× in MFU. Figure 12a demonstrates the iteration time breakdown of Megatron-LM and MegaScale-MoE. Exposed communication time represents the communication time that is not overlapped with computation operations. FlashAttention and GEMMs are the operations we count when calculating MFU. The performance gain primarily results from MegaScale-MoE’s communication-efficient parallelism strategies and fine-grained overlapped communication.

Note that the MFU value decreases as GPU compute capability increases. This is because, unlike dense models, MoE models involve many memory-intensive operations like routing, local scatter, and gather, which remain time-consuming since memory bandwidth does not scale as quickly as compute capabilities. Additionally, GEMM efficiency declines with increasing compute capability, as it also relies on memory loading, constrained by memory bandwidth.

6.2 Ablation Study

Parallelism strategy. We first compare the training efficiency under various intra-node parallelism strategies using a single node with eight NVIDIA H800-SXM GPUs. We denote parallelism strategies as X+Y, where X represents the parallelism strategy for attention, and Y corresponds to that for experts. The available parallelism strategies for attention include TP and our SP, whereas for experts, the choices are TP and EP. To isolate the performance benefits of optimized parallelism, we disable other system optimizations.

We measure the training MFU of one internal and five open-source MoE models with diverse model configurations as listed in Table 2. The global batch size is set to 32, and we adjust the number of layers for each model to fit within the GPU memory. Figure 13 shows that MegaScale-MoE’s parallelism strategy, SP+EP, consistently outperforms the other three parallelism strategies, achieving 14.9%–32.9% higher MFU compared to TP+TP. The performance gains are attributed to two main factors. First, as discussed in §3, SP and EP effectively reduce the communication volume compared to TP, thereby decreasing communication overhead. Second, TP partitions the FFN module along the intermediate size dimension, which results in lower GEMM efficiency.

To provide a more comprehensive evaluation of the parallelism strategy, we also report the additional overhead introduced by the replicated attention parameters in SP. In terms of memory usage, SP incurs a 1.2%–5.4% higher memory footprint compared to TP, requiring 1.7%–8.1% more memory to store parameters, gradients, and optimizer states across all six models. This overhead is manageable considering the significant performance gains achieved by SP.

For the parameter synchronization time, we follow large-scale training setups and set the size of the TP or SP to 8, effectively parallelizing each layer within a single node. The attention parameter size on each GPU is varied from 384 MB to 1536 MB, while the FFN parameter size is fixed at 10 GB per GPU, reflecting typical real-world training setups. We run MegaScale-MoE with SP and TP attention, using 4 and 8 DP groups, which correspond to a total of 32 and 64 GPUs, respectively. Figure 14 shows that the synchronization times for SP and TP attention are consistently comparable, differing by only 0.3%–3.1%. This aligns with our hypothesis that SP and TP would exhibit similar performance characteristics in DP communication latency.

Intra-operator communication overlap. We then measure the duration of four key communication operators and their corresponding computation operators in the forward pass: (i) QKV Projection paired with all-to-all, (ii) all-to-all with Output Projection, (iii) all-gather with scatter and GroupedGEMM, and (iv) GroupedGEMM with gather and reduce-scatter, as depicted in Figure 8. Figure 15 demonstrates that across all six models, MegaScale-MoE achieves a 1.2–4.7× reduction in the combined time of communication and computation operators compared

Figure 15 Overlapped communication-computation time vs. non-overlapped time of each layer. M1-M6 represent the six models listed from top to bottom in Table 2; A2A, AG, and RS refer to all-to-all, all-gather, and reduce-scatter, respectively. Panels: (a) QKV Proj.+A2A. (b) A2A+Output Proj. (c) AG+Scatter+GEMM. (d) GEMM+Gather+RS. The y-axis is communication time plus computation time (ms); series: no intra-op communication-computation overlapping and MegaScale-MoE.

Figure 16 Ablation study of selective activation rematerialization (SAR). Panels: (a) Memory usage (GB), broken down into activations; parameters, gradients and optimizer states; and others. (b) MFU (%). Both panels compare No SAR and MegaScale-MoE on Mixtral-8×7B and Mixtral-8×22B.

Figure 17 The training loss curve of MegaScale-MoE with DP communication compression. Axes: Training Loss vs. Consumed Tokens (B); series: FP32 Reduce-Scatter and BF16 All-to-All (MegaScale-MoE).

Figure 18 The loss curve of MegaScale-MoE in FP8 and BF16. Panels: (a) Training from scratch (35B). (b) Training from checkpoints (176B). Axes: Training Loss vs. Consumed Tokens (B).

to the baseline lacking fine-grained overlap. In addition, MegaScale-MoE reduces the training iteration time by 7.1%–12.9% due to intra-operator communication-computation overlap.

Selective activation rematerialization. We compare MegaScale-MoE to a baseline that disables selective activation rematerialization (No SAR), which stores all activations in GPU memory during training. We evaluate both methods by training Mixtral-8×7B and Mixtral-8×22B on 128 NVIDIA H800 GPUs. Figure 16 shows the memory usage breakdown and the training MFU. Compared to No SAR, MegaScale-MoE reduces activation memory consumption by 45.5% and 57.2% for the two models, respectively, resulting in overall memory reductions of 21.3% and 35%, while maintaining the training performance difference within 0.5%.

Data parallelism communication compression. We validate the effectiveness of our communication compression technique by training a 7B MoE model using BF16 all-to-all DP communication and FP32 reduce-scatter communication, as described in §5. Figure 17 illustrates the training loss curves, which are nearly identical. This optimization compresses only the accumulated gradients of the batch and performs conversions between BF16 and FP32 exclusively during communication, introducing minimal risk.

6.3 Model Convergence

We evaluate model convergence with MegaScale-MoE. Figure 18 demonstrates the loss curves of training a 35B MoE model from scratch and continuing training a 176B MoE model from a checkpoint, with results shown for both BF16 and FP8 precision. MegaScale-MoE ensures stable convergence and consistent training loss across BF16 and FP8 formats.

7 Experience

In this section, we describe our deployment and operational experience of MegaScale-MoE.

Deployment experience. MegaScale-MoE has been deployed in our production environment and is responsible for the majority of large-scale MoE training tasks within our company. It enables the training of models with trillions of parameters and supports single training jobs scaling beyond 10,000 GPUs, with individual training tasks running for several months. By combining the aforementioned techniques, MegaScale-
combining the aforementioned techniques, MegaScale-

MoE minimizes idle communication time and optimizes memory usage in MoE training without compromising model performance, ultimately saving millions of GPU hours in large-scale MoE training. Figure 19 shows the model convergence from a real production run, which trains a proprietary MoE model with 200B parameters, 20B activated for each token. This run uses over 10,000 GPUs and lasts for months. The loss continues to converge with a stable training process.

Figure 19 The normalized training loss curve of a real production run on more than 10,000 GPUs for months, training a MoE model with 20B activated and 200B total parameters on multi-trillion tokens. Different colors indicate training restarts. Axes: Normalized Training Loss vs. Consumed Tokens Rate.

FP8 training. We have made extensive efforts to maintain the convergence stability of FP8 training. For example, we observe that the SwiGLU operator significantly expands the numerical range. To address this, we replace per-tensor quantization with higher-precision per-token quantization (1×h). Additionally, since multiplying SwiGLU with the gating weight further amplifies the dynamic numerical range, we shift the gating weight multiplication back to after the FC2 output, reducing quantization errors.

Beyond ensuring training convergence, we introduce additional engineering optimizations. Existing FP8 training implementations [23, 48] store model parameters in BF16, requiring frequent FP8 conversion for GEMM computations, adding casting and transpose overhead. To address this, we use a multi-precision optimizer to store model parameters directly in FP8, while keeping main parameters in FP32 with separate buffers for different data types. This lowers memory consumption and halves parameter all-gather communication in data parallelism.

Scale up. When training MoE models, an intriguing engineering question arises: can we indefinitely scale the training size by increasing model parameters without raising computational load? This approach is impractical in tensor parallelism, as scaling up the model necessitates a higher TP degree to accommodate additional parameters. While increased TP reduces per-GPU computation, the communication overhead remains constant, as shown in Formulas 1 and 4, leading to progressively longer communication times and reduced training efficiency. In other words, TP has inherent scalability limitations and often relies on high-speed intra-node links to mitigate communication delays.

In contrast, when scaling training with SP and EP, the communication volume decreases as the parallel size n increases, as shown in Formulas 2 and 3. This implies that, in theory, this parallelism strategy can scale to significantly larger sizes. However, in practical hierarchical infrastructures, a critical challenge emerges: can this approach maintain training efficiency when scaling beyond the NVLink domain, where bandwidth drops to RDMA levels?

Formally, for a SwiGLU structure incorporating a MoE mechanism, the ratio R between computation time and communication time is defined as:

\[ \mathrm{comm\_time} = \frac{2k \times bsh(n-1)/n/n}{\mathrm{bandwidth}}, \tag{5} \]

\[ \mathrm{comp\_time} = \frac{3k \times bsh \times h_{ffn}/n}{\mathrm{peak}}. \tag{6} \]

\begin{align}
R &= \frac{\mathrm{comp\_time}}{\mathrm{comm\_time}} \tag{7} \\
  &= 3/2 \times h_{ffn} \times \frac{\mathrm{bandwidth}}{\mathrm{peak}} \times n/(n-1) \tag{8} \\
  &\approx 3/2 \times h_{ffn} \times \frac{\mathrm{bandwidth}}{\mathrm{peak}} \tag{9}
\end{align}

To sustain training efficiency, the FFN’s computation time must exceed the communication time, ensuring effective overlap of communication overhead. Therefore, our goal is to maintain R > 1, leading to two key insights:

• The value of R is independent of the number of experts, top-k, hidden dimension, parallelism size, or input size, providing flexibility in selecting algorithm parameters.

• R is solely determined by the expert’s intermediate dimension, computational peak, and communication bandwidth. Consequently, on fixed hardware, as long as the expert dimension is sufficiently large, the MoE model can be scaled while maintaining training efficiency from an engineering perspective.

Holistic vs. automatic. We have invested substantial engineering efforts in inter-operator communication-computation overlap, including determining operator execution order, concurrency of communication and computation, and SM allocation for communication.
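As a brief numerical aside on the ratio R from Equations (5)–(9) in the scale-up discussion, the following minimal Python sketch evaluates R for a given FFN intermediate size and hardware. It is an illustration, not part of MegaScale-MoE: the function names `overlap_ratio` and `min_h_ffn` are hypothetical, and the sketch assumes that bandwidth (in bytes/s) and peak (in FLOP/s) can be substituted into the formulas exactly as written, since the source does not spell out units. The example values (h_ffn = 14336, 989 TFLOPS, 400 GB/s) are taken from Tables 2 and 4.

```python
def overlap_ratio(h_ffn, bandwidth, peak, n=None):
    """Ratio R = comp_time / comm_time following Equations (8) and (9).

    h_ffn:     expert FFN intermediate size
    bandwidth: communication bandwidth (assumed bytes/s)
    peak:      compute peak (assumed FLOP/s)
    n:         parallel size; if None, use the large-n approximation (9)
    """
    if h_ffn <= 0 or bandwidth <= 0 or peak <= 0:
        raise ValueError("h_ffn, bandwidth, and peak must be positive")
    ratio = 1.5 * h_ffn * bandwidth / peak  # Equation (9)
    if n is not None:
        if n < 2:
            raise ValueError("n must be at least 2")
        ratio *= n / (n - 1)                # exact factor from Equation (8)
    return ratio


def min_h_ffn(bandwidth, peak):
    """h_ffn at which the approximate R of Equation (9) equals 1."""
    return (2.0 / 3.0) * peak / bandwidth


if __name__ == "__main__":
    h_ffn = 14336    # Internal-352B and Mixtral-8x7B (Table 2)
    peak = 989e12    # H800 compute capability, 989 TFLOPS (Table 4)
    nvlink = 400e9   # H800 NVLink bandwidth, 400 GB/s (Table 4)
    print(f"R (approx.) = {overlap_ratio(h_ffn, nvlink, peak):.2f}")
    print(f"R (n = 8)   = {overlap_ratio(h_ffn, nvlink, peak, n=8):.2f}")
    print(f"h_ffn for R = 1: {min_h_ffn(nvlink, peak):.0f}")
```

Under these assumptions R exceeds 1 for the H800 NVLink figures above; substituting a lower inter-node bandwidth into `min_h_ffn` indicates how large the expert dimension would need to be to keep R > 1.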

These manual interventions provide deeper insights into training dynamics, enabling targeted optimizations. As training progresses and experience accumulates, we seek to automate operator scheduling within the search space to optimize the training process at a fine-grained level and achieve optimal performance. We leave automatic optimization for future work.

8 Related Work

Large model training. LLM research has led to the development of scalable, efficient, and robust training techniques to meet the substantial computational demands of these models. DeepSpeed [41] features the Zero Redundancy Optimizer (ZeRO) [38, 39, 42], which shards model parameters, gradients, and optimizer states across participating GPUs in data parallelism, enabling the scaling of LLMs with manageable memory consumption. Megatron-LM [46] focuses on intra-layer model parallelism techniques, partitioning the parameters and computation of each layer. Pipeline parallelism assigns the parameters and computation of a contiguous subset of layers to each GPU [12, 31], breaks a batch into micro-batches, and processes the micro-batches in a pipelined fashion. MegaScale [17] shows how combining tensor, pipeline, and data parallelism can be an efficient strategy to train large multi-billion parameter models at unprecedented scale.

Mixture-of-Experts training. To address the computational challenges of training advanced neural networks, the machine learning field has increasingly adopted Mixture-of-Experts architectures. Subsequently, a number of parallel deep learning frameworks have been proposed for training or running inference on MoEs on multi-GPU clusters. DeepSpeed-MoE [40] significantly reduces training costs through model architecture designs and compression techniques. HetuMoE [34] utilizes a hierarchical all-to-all communication strategy to achieve performance speedup. SE-MoE [45] distinguishes itself by focusing on scalable and efficient training with heterogeneous resources like CPU memory and SSDs. Furthermore, Tutel [13] offers a dynamic solution for MoE models, employing adaptive parallelism and pipelining. FasterMoE [11] introduces a comprehensive suite of optimizations such as dynamic shadowing, fine-grained scheduling, and congestion-avoiding expert selection strategies. Janus [28] proposes a data-centric paradigm shift for MoE models, aiming to lower communication demands and boost training efficiency. Recently, DeepSeek-V3 [25] proposes a series of algorithm-system co-optimizations, such as efficient cross-node all-to-all and DualPipe, to enhance the efficiency of large-scale MoE training. We are also exploring algorithm-system co-design to further reduce training costs.

Long-context training. While Megatron-LM [18, 46] opts to partition only specific operations along the sequence dimension, various methods of sequence parallelism [19, 22, 27] have been explored for training models requiring long contexts. The Blockwise Parallel Transformer [26] method implements blockwise computation of self-attention and the fusion of FFNs based on online softmax calculations. Ring Attention [22, 27] introduces a ring-style communication mechanism integrated with self-attention calculations, facilitating the exchange of key and value chunks. We adopt the all-to-all style of SP attention from DeepSpeed Ulysses [14], which partitions attention by heads rather than sequence length, due to its reduced communication volume and balanced computation pattern.

Communication-computation overlap. Several frameworks [10, 21, 29, 37, 52] focus on overlapping communication with computation in distributed deep learning training with a single parallelism strategy. Some compiler-style work [15, 35, 50] provides fine-grained overlap among kernels, but excessive partitioning of GEMM kernels can result in low GPU utilization. Centauri [6] enhances communication overlap for LLM training with 3D parallelism by communication partitioning and hierarchical scheduling. Similar to Centauri, our inter-operator communication overlap hides communication within independent computation by reordering operators. We further conceal communication on critical paths through intra-operator overlap, without compromising GPU utilization. DualPipe, proposed by DeepSeek-V3 [25], overlaps communication and computation within different forward and backward chunks and requires storing 2× the model parameters. In contrast, MegaScale-MoE achieves this overlap within a single forward or backward chunk without incurring additional memory overhead.

9 Conclusion

In this paper, we offer an in-depth look at the design, implementation, and deployment of MegaScale-MoE, a production-grade system built to efficiently train MoE models. MegaScale-MoE exploits communication-efficient approaches, including parallelism strategies with lower communication

volume, inter- and intra-operator communication-computation overlap, and communication compression with adjusted communication patterns to unleash the compute capabilities of high-performance GPUs. MegaScale-MoE achieves 1.41M tokens/s in throughput when training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, a 1.88× improvement over Megatron-LM. By sharing our insights on accelerating large-scale MoE training, we hope our work will inspire future research.

References

[1] Context parallelism in megatron-lm, 2025. [Link] developer-guide/latest/api-guide/context_ [Link].

[2] Introducing dbrx: A new state-of-the-art open llm, 2025. URL [Link] introducing-dbrx-new-state-art-open-llm.

[3] Open release of grok-1, 2025. URL [Link] blog/grok-os.

[4] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.

[5] Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, et al. Flux: fast software-based communication overlap on gpus through kernel fusion. arXiv preprint arXiv:2406.06858, 2024.

[6] Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei Duan, Peng Sun, Xingcheng Zhang, and Chao Yang. Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning. In ACM ASPLOS, 2024.

[7] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scal-

Cui. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning (ICML), 2022.

[9] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 2022.

[10] Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and Roy Campbell. Tictac: Accelerating distributed deep learning with communication scheduling. Proceedings of Machine Learning and Systems, 2019.

[11] Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models. In ACM PPoPP, 2022.

[12] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Neural Information Processing Systems, 2019.

[13] Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et al. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning and Systems, 2023.

[14] Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023.

[15] Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sabet, Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkowicz, and Olli Saarikivi. Breaking the computation and communication abstraction barrier in distributed machine learning workloads. In ACM ASPLOS, 2022.

[16] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas,
ing language modeling with pathways. Journal of Emma Bou Hanna, Florian Bressand, et al. Mixtral
Machine Learning Research, 2023. of experts. arXiv preprint arXiv:2401.04088, 2024.

[8] Nan Du, Yanping Huang, Andrew M Dai, Simon [17] Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang,
Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li,
Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Fi- Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin
rat, Barret Zoph, Liam Fedus, Maarten P Bosma, Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou,
Zongwei Zhou, Tao Wang, Emma Wang, Kellie Web- Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei,
ster, Marie Pellat, Kevin Robinson, Kathleen Meier- Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao,
Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi
Quoc Le, Yonghui Wu, Zhifeng Chen, and Claire Ye, Xin Jin, and Xin Liu. MegaScale: Scaling large

15
language model training to more than 10,000 GPUs. [29] Kshiteej Mahajan, Ching-Hsiang Chu, Srinivas Srid-
In USENIX NSDI, 2024. haran, and Aditya Akella. Better together: Jointly
optimizing ml collective scheduling and execution
[18] Vijay Anand Korthikanti, Jared Casper, Sangkug planning using {SYNDICATE}. In USENIX NSDI,
Lym, Lawrence McAfee, Michael Andersch, Moham- 2023.
mad Shoeybi, and Bryan Catanzaro. Reducing ac-
tivation recomputation in large transformer mod- [30] Megatron-LM. GPU optimized techniques for train-
els. Proceedings of Machine Learning and Systems, ing transformer models at-scale, 2025. https://
2023. [Link]/NVIDIA/Megatron-LM.
[19] Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, [31] Deepak Narayanan, Aaron Harlap, Amar Phan-
Xuezhe Ma, Ion Stoica, Joseph E. Gonzalez, and ishayee, Vivek Seshadri, Nikhil R Devanur, Gre-
Hao Zhang. Distflashattn: Distributed memory- gory R Ganger, Phillip B Gibbons, and Matei Za-
efficient attention for long-context llms training. haria. Pipedream: generalized pipeline parallelism
arxiv preprint arXiv:2310.03294, 2024. for dnn training. In ACM SOSP, 2019.
[20] Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and [32] Deepak Narayanan, Mohammad Shoeybi, Jared
Hong Xu. Accelerating distributed {MoE} training Casper, Patrick LeGresley, Mostofa Patwary, Vijay
and inference with lina. In USENIX ATC, 2023. Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti,
Julie Bernauer, Bryan Catanzaro, et al. Efficient
[21] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar,
large-scale language model training on gpu clusters
Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith,
using megatron-lm. In International Conference for
Brian Vaughan, Pritam Damania, et al. Pytorch dis-
High Performance Computing, Networking, Storage
tributed: Experiences on accelerating data parallel
and Analysis, 2021.
training. arXiv preprint arXiv:2006.15704, 2020.
[22] Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, [33] NCCL. Optimized primitives for inter-GPU commu-
Yongbin Li, and Yang You. Sequence parallelism: nication, 2025. [Link]
Long sequence training from system perspective. [34] Xiaonan Nie, Pinxue Zhao, Xupeng Miao, Tong
arXiv preprint arXiv:2105.13120, 2021. Zhao, and Bin Cui. Hetumoe: An efficient trillion-
[23] Wanchao Liang, Tianyu Liu, Less Wright, Will scale mixture-of-expert distributed training system.
Constable, Andrew Gu, Chien-Chin Huang, Iris arXiv preprint arXiv:2203.14685, 2022.
Zhang, Wei Feng, Howard Huang, Junjie Wang, et al. [35] Suchita Pati, Shaizeen Aga, Mahzabeen Islam,
Torchtitan: One-stop pytorch native solution for Nuwan Jayasena, and Matthew D. Sinclair. T3:
production ready llm pre-training. arXiv preprint Transparent tracking & triggering for fine-grained
arXiv:2410.06511, 2024. overlap of compute & collectives. In ACM ASPLOS,
[24] Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, 2024.
Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong
[36] Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai
Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A
Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue
strong, economical, and efficient mixture-of-experts
Yang, Bolin Ni, Jingcheng Hu, et al. Fp8-lm:
language model. arXiv preprint arXiv:2405.04434,
Training fp8 large language models. arXiv preprint
2024.
arXiv:2310.18313, 2023.
[25] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang,
[37] Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao,
Bochao Wu, Chengda Lu, Chenggang Zhao,
Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong
Chengqi Deng, Chenyu Zhang, Chong Ruan, et al.
Guo. A generic communication scheduler for dis-
Deepseek-v3 technical report. arXiv preprint
tributed dnn training acceleration. In ACM SOSP,
arXiv:2412.19437, 2024.
2019.
[26] Hao Liu and Pieter Abbeel. Blockwise parallel trans-
[38] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase,
formers for large context models. Neural Information
and Yuxiong He. Zero: Memory optimiza-
Processing Systems, 2024.
tions toward training trillion parameter models.
[27] Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring at- In International Conference for High Performance
tention with blockwise transformers for near-infinite Computing, Networking, Storage and Analysis,
context. arXiv preprint arXiv:2310.01889, 2023. 2020.

[28] Juncai Liu, Jessie Hui Wang, and Yimin Jiang. Janus: [39] Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley,
A unified distributed training framework for sparse Shaden Smith, and Yuxiong He. Zero-infinity:
mixture-of-experts models. In ACM SIGCOMM, Breaking the gpu memory wall for extreme scale
2023. deep learning. In International Conference for High

16
Performance Computing, Networking, Storage and [50] Shibo Wang, Jinliang Wei, Amit Sabne, Andy
Analysis, 2021. Davis, Berkin Ilbeyi, Blake Hechtman, Dehao Chen,
Karthik Srinivasa Murthy, Marcello Maggioni, Qiao
[40] Samyam Rajbhandari, Conglong Li, Zhewei Yao,
Zhang, et al. Overlap communication with depen-
Minjia Zhang, Reza Yazdani Aminabadi, Am-
dent computation via decomposition in large deep
mar Ahmad Awan, Jeff Rasley, and Yuxiong He.
learning models. In ACM ASPLOS, 2022.
DeepSpeed-MoE: Advancing mixture-of-experts in-
ference and training to power next-generation AI [51] Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng
scale. In International Conference on Machine Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou,
Learning (ICML), 2022. Weihao Cui, Size Zheng, Li-Wen Chang, et al.
Comet: Fine-grained computation-communication
[41] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, overlapping for mixture-of-experts. arXiv preprint
and Yuxiong He. Deepspeed: System optimizations arXiv:2502.19811, 2025.
enable training deep learning models with over 100
billion parameters. In ACM SIGKDD, 2020. [52] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo,
Chien-Chin Huang, Min Xu, Less Wright, Hamid
[42] Jie Ren, Samyam Rajbhandari, Reza Yazdani Am- Shojanazeri, Myle Ott, Sam Shleifer, Alban Des-
inabadi, Olatunji Ruwase, Shuangyan Yang, Minjia maison, Can Balioglu, Pritam Damania, Bernard
Zhang, Dong Li, and Yuxiong He. Zero-offload: De- Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Math-
mocratizing billion-scale model training. In USENIX ews, and Shen Li. Pytorch fsdp: Experiences on
ATC, 2021. scaling fully sharded data parallel. Proceedings of
[43] Noam Shazeer. Glu variants improve transformer. the VLDB Endowment, 2023.
arXiv preprint arXiv:2002.05202, 2020. [53] Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wen-
[44] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, lei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang
Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Wang, Jianxi Ye, Haibin Lin, et al. Tilelink: Gener-
Dean. Outrageously large neural networks: The ating efficient compute-communication overlapping
sparsely-gated mixture-of-experts layer. arXiv kernels using tile-centric primitives. arXiv preprint
preprint arXiv:1701.06538, 2017. arXiv:2503.20313, 2025.

[45] Liang Shen, Zhihua Wu, WeiBao Gong, Hongxi-


ang Hao, Yangfan Bai, HuaChao Wu, Xinxuan Wu,
Jiang Bian, Haoyi Xiong, Dianhai Yu, et al. Se-
moe: A scalable and efficient mixture-of-experts
distributed training and inference system. arXiv
preprint arXiv:2205.10034, 2022.
[46] Mohammad Shoeybi, Mostofa Patwary, Raul Puri,
Patrick LeGresley, Jared Casper, and Bryan Catan-
zaro. Megatron-lm: Training multi-billion parameter
language models using model parallelism. arXiv
preprint arXiv:1909.08053, 2019.
[47] Hugo Touvron, Louis Martin, Kevin Stone, Peter
Albert, Amjad Almahairi, Yasmine Babaei, Niko-
lay Bashlykov, Soumya Batra, Prajjwal Bhargava,
Shruti Bhosale, et al. Llama 2: Open founda-
tion and fine-tuned chat models. arXiv preprint
arXiv:2307.09288, 2023.
[48] TransformerEngine. A library for accelerating Trans-
former models on NVIDIA GPUs, including us-
ing 8-bit floating point (FP8) precision on Hop-
per and Ada GPUs, to provide better performance
with lower memory utilization in both training
and inference., 2025. [Link]
TransformerEngine.
[49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you
need. Neural Information Processing Systems, 2017.

17
A Appendix

A.1 Hierarchical Communication for Parameter Synchronization

Let the full attention weights size be P, the dimension of model parallelism (TP or SP)
be n, and the data parallel size be d. Typically, GPUs for model parallelism are located
on the same node, requiring intra-node communication, whereas data parallelism spans
across nodes, requiring inter-node communication. Consider a data parallelism group
containing d devices, each holding the identical partition of the parameter.

For parameter synchronization in TP attention, communication involves data of size P/n
across d devices in two primary steps in LLM training:

• inter-node reduce-scatter operation, where the data size is P/n, on d devices.

• inter-node all-gather operation, where the data size is P/n, on d devices.

leading to primarily inter-node communication, with a communication volume of
(2P/n)(d − 1)/d.

With SP attention, the parameter synchronization involves the entire data of size P
across n × d devices. Considering the discrepancy between intra-node and inter-node
network bandwidth, this process can be implemented by four-step hierarchical
communication, where the replicated parameters are first reduced within a node and then
reduced across nodes, before being distributed back to each device. Figure 5a
illustrates a hierarchical communication example where n = 3 and d = 2. The detailed
steps are as follows.

• intra-node reduce-scatter operation, where the data size is P, on n devices.

• inter-node reduce-scatter operation, where the data size is P/n, on d devices.

• inter-node all-gather operation, where the data size is P/n, on d devices.

• intra-node all-gather operation, where the data size is P, on n devices.

The inter-node communication volume in SP attention remains at (2P/n)(d − 1)/d, with
additional intra-node volume of 2P(n − 1)/n.

Moreover, due to the distinct resources for intra-node and inter-node communications,
these steps can be segmented into small chunks and pipelined to efficiently hide each
other as shown in Figure 5b.

The ratio of inter-node communication latency and intra-node communication latency is

\[
\frac{1}{n} \times \frac{\text{intra-node bandwidth}}{\text{inter-node bandwidth}} \times \frac{n(d - 1)}{d(n - 1)} \tag{10}
\]

Consider a typical training scenario involving an H100 SXM machine, where the NVLink
bandwidth is 450 GB/s, and the inter-device NIC communication bandwidth is 50 GB/s. In
this context, the latency of inter-node communication can easily surpass that of
intra-node communication. This implies that the communication within a node can
overshadow that between nodes. Consequently, in such scenarios, the synchronization of
gradients and parameters with SP attention is, in fact, consistent with TP attention.

A.2 Selective Activation Rematerialization

Figure 20 illustrates the shapes of the key activations produced during forward
propagation, with the highlighted activations retained for backward propagation. Let
the model parallelism size within one MoE layer be n and the intermediate hidden size
of one expert be fh. The total activation of a single MoE layer is

(2n + 2k + 3kf + 12 + 5/m)bsh/n,

which we have reduced to

(2kf + 4 + 2/m)bsh/n.

Activation      Shape                   Obtained From
hidden          [b, s/n, h]             # Input
ln1_out         [b, s/n, h]             # RMSNorm(hidden)
qkv             [b, s/n, h(1+2/m)]      # MatMul(ln1_out, qkv_weight)
q_rope          [b, s/n, h]             # RopeEmbedding(q)
k_rope          [b, s/n, h/m]           # RopeEmbedding(k)
qkv_a2a         [b, s, h(1+2/m)/n]      # All-to-All(q_rope, k_rope, v)
attn            [b, s, h/n]             # SelfAttention(qkv_a2a)
attn_a2a        [b, s/n, h]             # All-to-All(attn)
attn_out        [b, s/n, h]             # MatMul(attn_a2a, out_weight)
ln2_in          [b, s/n, h]             # Add(hidden, attn_out)
ln2_out         [b, s/n, h]             # RMSNorm(ln2_in)
ln2_out_ag      [b, s, h]               # All-Gather(ln2_out)
ffn_in          [b*s*k/n, h]            # Scatter(ln2_out_ag)
fc1_out         [b*s*k/n, fh]           # GroupedGEMM(ffn_in, fc1_weight)
fc3_out         [b*s*k/n, fh]           # GroupedGEMM(ffn_in, fc3_weight)
fc2_in          [b*s*k/n, fh]           # SiLU(fc1_out, fc3_out)
fc2_out         [b*s*k/n, h]            # GroupedGEMM(fc2_in, fc2_weight)
fc2_out_rs      [b, s, h]               # Gather(fc2_out)
ffn_out         [b, s/n, h]             # Reduce-Scatter(fc2_out_rs)
hidden(next)    [b, s/n, h]             # Add(ln2_in, ffn_out)

Figure 20 Activation shapes in rematerialization.
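
As a rough illustration of the hierarchical synchronization analysis in Appendix A.1, the sketch below evaluates the stated communication volumes and the latency ratio of Equation (10) with exact rational arithmetic. The function names and the sample values of P, n, and d are hypothetical choices for illustration only; the 450 GB/s and 50 GB/s bandwidths are the H100 SXM figures quoted in the text. Latency is modeled simply as volume divided by bandwidth, which is an assumption of this sketch.

```python
from fractions import Fraction


def tp_inter_node_volume(P, n, d):
    """TP attention: reduce-scatter plus all-gather of P/n over d devices."""
    return 2 * Fraction(P, n) * Fraction(d - 1, d)


def sp_volumes(P, n, d):
    """SP attention, four-step hierarchical scheme.

    Returns (inter_node_volume, intra_node_volume).
    """
    inter = 2 * Fraction(P, n) * Fraction(d - 1, d)
    intra = 2 * P * Fraction(n - 1, n)
    return inter, intra


def inter_to_intra_latency_ratio(n, d, intra_bw, inter_bw):
    """Equation (10): inter-node latency divided by intra-node latency.

    Arguments are expected to be integers so the result stays exact.
    """
    return (Fraction(1, n)
            * Fraction(intra_bw, inter_bw)
            * Fraction(n * (d - 1), d * (n - 1)))


if __name__ == "__main__":
    # Hypothetical setting: 10^9 attention parameters, 8 GPUs per node,
    # 16 data-parallel replicas. Bandwidths (GB/s) are taken from the text.
    P, n, d = 10**9, 8, 16
    inter, intra = sp_volumes(P, n, d)
    ratio = inter_to_intra_latency_ratio(n, d, intra_bw=450, inter_bw=50)
    print("inter-node volume:", inter)
    print("intra-node volume:", intra)
    print("latency ratio (inter / intra):", float(ratio))
```

Under these assumed values the ratio exceeds 1, so the inter-node steps take longer than the intra-node ones. The ratio depends on d, so smaller data-parallel groups can push it below 1.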

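
The total activation expression in Appendix A.2 can be cross-checked against the shapes listed in Figure 20. The sketch below sums the element counts of every row in the figure (including hidden(next)) and compares the result with (2n + 2k + 3kf + 12 + 5/m)bsh/n. The function names and the sample dimensions are hypothetical, and the counts are numbers of elements, not bytes.

```python
from fractions import Fraction


def activation_elements(b, s, h, n, m, k, f):
    """Element count of each activation, following the shapes in Figure 20."""
    unit = Fraction(b * s * h, n)          # size of a [b, s/n, h] tensor
    qkv = unit * (1 + Fraction(2, m))      # h(1 + 2/m) on the last dimension
    return {
        "hidden": unit,
        "ln1_out": unit,
        "qkv": qkv,
        "q_rope": unit,
        "k_rope": unit / m,
        "qkv_a2a": qkv,                    # [b, s, h(1+2/m)/n]
        "attn": unit,                      # [b, s, h/n]
        "attn_a2a": unit,
        "attn_out": unit,
        "ln2_in": unit,
        "ln2_out": unit,
        "ln2_out_ag": unit * n,            # [b, s, h]
        "ffn_in": unit * k,                # [b*s*k/n, h]
        "fc1_out": unit * k * f,           # [b*s*k/n, fh]
        "fc3_out": unit * k * f,
        "fc2_in": unit * k * f,
        "fc2_out": unit * k,
        "fc2_out_rs": unit * n,            # [b, s, h]
        "ffn_out": unit,
        "hidden(next)": unit,
    }


def total_activation(b, s, h, n, m, k, f):
    """(2n + 2k + 3kf + 12 + 5/m) * bsh/n, as stated in the text."""
    return (2 * n + 2 * k + 3 * k * f + 12 + Fraction(5, m)) * Fraction(b * s * h, n)


def reduced_activation(b, s, h, n, m, k, f):
    """(2kf + 4 + 2/m) * bsh/n, the retained amount stated in the text."""
    return (2 * k * f + 4 + Fraction(2, m)) * Fraction(b * s * h, n)


if __name__ == "__main__":
    # Hypothetical dimensions chosen only to exercise the formulas.
    dims = dict(b=2, s=4096, h=1024, n=8, m=4, k=2, f=4)
    rows = activation_elements(**dims)
    print("sum of Figure 20 rows:", sum(rows.values()))
    print("closed-form total:    ", total_activation(**dims))
    print("retained fraction:    ",
          float(reduced_activation(**dims) / total_activation(**dims)))
```

The row-by-row sum agrees with the closed-form total for any valid dimensions, which suggests the formula counts every listed activation once.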