
MoE-LLaVA: Efficient Sparse LVLMs

The document presents MoE-LLaVA, a novel architecture for Large Vision-Language Models (LVLMs) that utilizes a Mixture of Experts (MoE) approach to enhance performance while maintaining computational efficiency. It introduces a training strategy called MoE-Tuning, which prevents performance degradation during multi-modal sparsity learning by activating only a subset of model parameters. Experimental results demonstrate that MoE-LLaVA achieves competitive performance on visual understanding tasks with significantly fewer activated parameters compared to existing models.


MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Bin Lin 1 Zhenyu Tang 1 Yang Ye 2 Jinfa Huang 3 Junwu Zhang 1 Yatian Pang 4 1 Peng Jin 1 Munan Ning 1 5
Jiebo Luo 3 Li Yuan 1 5

1 Peking University 2 Sun Yat-sen University 3 University of Rochester 4 National University of Singapore 5 Peng Cheng Laboratory. Correspondence to: Li Yuan <yuanli-ece@[Link]>.

arXiv:2401.15947v5 [[Link]] 23 Dec 2024

Abstract

Recent advances demonstrate that scaling Large Vision-Language Models (LVLMs) effectively improves downstream task performance. However, existing scaling methods enable all model parameters to be active for each token in the calculation, which brings massive training and inference costs. In this work, we propose a simple yet effective training strategy MoE-Tuning for LVLMs. This strategy innovatively addresses the common issue of performance degradation in multi-modal sparsity learning, consequently constructing a sparse model with an outrageous number of parameters but a constant computational cost. Furthermore, we present the MoE-LLaVA, a MoE-based sparse LVLM architecture, which uniquely activates only the top-k experts through routers during deployment, keeping the remaining experts inactive. Extensive experiments show the significant performance of MoE-LLaVA in a variety of visual understanding and object hallucination benchmarks. Remarkably, with only approximately 3B sparsely activated parameters, MoE-LLaVA demonstrates performance comparable to the LLaVA-1.5-7B on various visual understanding datasets and even surpasses the LLaVA-1.5-13B in object hallucination benchmark. Through MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research in developing more efficient and effective multi-modal learning systems. Code is released at [Link] YuanGroup/MoE-LLaVA.

[Figure 1: scatter plot of Hallucination Average Performance (y-axis) against Number of Activated Parameters (Billions) (x-axis). Plotted models: MoE-LLaVA-1.8B×4, InternVL-Chat-19B, InternVL-Chat-14B, LION-12B, LLaMA-VID-7B, LLaMA-VID-13B, OtterHD-8B, LLaVA-1.5-7B, LLaVA-1.5-13B, mPLUG-Owl2-7B, BLIP-2-13B, LLaVA-Phi-2.7B, MobileVLM-2.7B, Chat-UniVi-1.5-7B, Shikra-7B.]
Figure 1. Comparison between MoE-LLaVA-1.8B×4 and open-source LVLMs on object hallucination benchmark. We report the average performance on the POPE (Li et al., 2023d) benchmark, which includes three subsets of Adversarial, Random, and Popular. The red dashed line represents the linear fit to the data points of all models except MoE-LLaVA.

1. Introduction

Large Vision-Language Models (LVLMs), such as LLaVA (Liu et al., 2023c) and MiniGPT-4 (Zhu et al., 2023), have shown promising results by leveraging an image encoder and several visual projection layers to enhance the visual perception capabilities of the Large Language Models (LLMs). Typically, increasing the model size (Zhang et al., 2023a; Bai et al., 2023b) and dataset scale (Zhang et al., 2023c; Zhao et al., 2023a; Chen et al., 2023d) can improve model performance. For instance, InternVL (Chen et al., 2023e) has extended the image encoder to 6B parameters. A series of works (Li et al., 2022; Dai et al., 2023; Liu et al., 2023b) have expanded the backend of LVLM to 13B parameters and achieved state-of-the-art performance on downstream tasks. IDEFICS (Laurençon et al., 2023) even trained an LVLM with 80B parameters. These methods have demonstrated superior performance even in LLMs, which are typically pretrained on 34B parameters (SUSTech-IDEA, 2023; 01-ai, 2023; FlagAI-Open, 2023) or 70B parameters (Touvron et al., 2023a;b; Bai et al., 2023a; DeepSeek-AI, 2024; Zhang & Yang, 2023), and models surpassing 100B parameters are common (Brown et al., 2020; Zeng et al., 2022; Zhang et al., 2022; Scao et al., 2022; Li et al., 2023c; falconry, 2023).

In practical applications, scaling models with high-quality training data is crucial for improving model performance (Lepikhin et al., 2020). However, training and deploying such large models demand significant computational costs and efficient implementation on parallel devices, which can be extremely expensive. This is because each token requires computations with all model parameters, called the dense model. In contrast, sparse Mixtures of Experts (MoE) (Jacobs et al., 1991; Eigen et al., 2013) effectively scale model capacity by using fixed activated parameters to process data, which has thrived in the field of NLP (Fedus et al., 2022; Zoph et al., 2022; Komatsuzaki et al., 2022). Recently, Mistral LLM (Jiang et al., 2023) equipped with the MoE layers has gained popularity in LLMs. Mixtral-MoE-8×7B (Jiang et al., 2024) achieves performance comparable to LLaMA 2-70B with fewer computational resources.

However, directly applying MoE to train sparse LVLMs is challenging. We observe that simultaneously converting LLM to LVLM and sparsifying the model leads to significant performance degradation. After multiple attempts, we find that proper initialization is crucial for sparsifying the LVLM. Therefore, we introduce a simple yet effective three-stage training strategy MoE-Tuning. Specifically, as shown in Figure 2, we first train an MLP that adapts visual tokens to the LLM in stage I. Then, we pre-empower the LVLM with a general multi-modal understanding capability by training the whole LLM's parameters in stage II. Furthermore, in stage III we replicate the FFN as the initialization weights for the experts and only train the MoE layers. Finally, the sparse model gradually transitions from a general LVLM initialization to sparse mixture of experts.

[Figure 2: two panels, (a) Stage I and Stage II, (b) Stage III. The diagram shows the Vision Encoder (VE) with MLP, the Word Embedding Layer (WEL), attention and FFN blocks, and N MoE layers whose four experts are initialized by copying the FFN weights; trainable and non-trainable parameters are marked.]
Figure 2. Illustration of MoE-Tuning. The MoE-Tuning consists of three stages. In stage I, only the MLP is trained. In stage II, all parameters are trained except for the Vision Encoder (VE). In stage III, FFNs are used to initialize the experts in MoE, and only the MoE layers are trained. For each MoE layer, only two experts are activated for each token, while the other experts remain silent.

In this work, we explore a baseline for the LVLM with mixture of experts called MoE-LLaVA, which incorporates mixture of experts and learnable routers. MoE-LLaVA consists of multiple sparse paths where each token is dispatched to different experts through the router. The activated experts collectively process the tokens, while the inactive paths remain silent. By iteratively stacking MoE encoder layers, MoE-LLaVA provides a sparse path toward a larger and more powerful LVLM.

As a result, in Figure 1, our MoE-LLaVA with only 2.2B sparse activated parameters outperforms models with similar activated parameters and LLaVA-1.5-13B, surpassing it by a large margin on the POPE object hallucination benchmark. Additionally, MoE-LLaVA achieves comparable performance to InternVL-Chat-19B, which has approximately 8 times the activated parameters. We further scale MoE-LLaVA to 3.6B sparse activated parameters, which outperform LLaVA-1.5-7B by 1.9%, 0.4%, 0.9%, 30.7%, and 3.8% in ScienceQA, POPE, MMBench, LLaVA^W, and MM-Vet, respectively. Extensive experiments validate the rationality of our MoE-LLaVA architecture and MoE-Tuning strategy.

We summarize our primary contributions as follows:

• We explore the MoE-Tuning, a novel three-stage training strategy for adapting MoE to LVLMs and preventing the model degradation caused by sparsity.

• We propose MoE-LLaVA, a MoE-based sparse LVLM framework, which significantly expands the number of parameters while maintaining computational costs.

• Extensive experiments demonstrate that our MoE-LLaVA has excellent multi-modal understanding and hallucination mitigation abilities. With only approximately 3B sparse activated parameters, our method achieves comparable performance with SOTA 7B models on the visual understanding benchmarks. It is worth noting that MoE-LLaVA outperforms LLaVA-1.5-13B by 1.1% on the POPE hallucination benchmark with 2.2B activated parameters.

2. Related Work

2.1. Large Vision-Language Models

Powerful LLMs (OpenAI, 2023; Touvron et al., 2023a; Wei et al., 2022; Touvron et al., 2023b; Zheng et al., 2023; Team, 2023; Sun et al., 2023; Du et al., 2021; Bai et al., 2023a; Yang et al., 2023; Penedo et al., 2023; Taori et al., 2023) with strong instruction-following and generalization capabilities have been applied to LVLMs. Early works such as BLIP-2 (Li et al., 2023b) and FROMAGe (Koh et al., 2023) encoded visual signals into a sequence of visual tokens, successfully adapting vision to LLMs through several projection layers. Subsequently, recent works have focused on improving performance through methods such as expanding the instruction-tuning dataset (Liu et al., 2023a;c; Zhang et al., 2023c; Zhao et al., 2023a; Chen et al., 2023d), optimizing training strategies (Bai et al., 2023b; Chen et al., 2023b), increasing image resolution (Liu et al., 2023b; Bai et al., 2023b; Wang et al., 2023d), enhancing image encoders (Chen et al., 2023e; Zhang et al., 2023a; Bai et al., 2023b), aligning the input (Lin et al., 2023) and projection layers (Cha et al., 2023; Alayrac et al., 2022; Bai et al., 2023b; Dai et al., 2023; Ye et al., 2023; Zhao et al., 2023a). These works empowered LVLMs with powerful visual understanding capabilities by expanding the visual instruction fine-tuning datasets and model scales.

Currently, some works have endowed LVLMs with fine-grained image understanding capabilities, such as region understanding (Chen et al., 2023c; Zhao et al., 2023b; Liu et al., 2023e), multi-region understanding (Wang et al., 2023c; Pi et al., 2023; Peng et al., 2023), and pixel-wise grounding (Rasheed et al., 2023; Lai et al., 2023). However, the cost of scaling up dense visual data and models is challenging to bear (Liu et al., 2022; Yin et al., 2023). In this work, we aim to make state-of-the-art LVLMs research more accessible by leveraging mixture of experts.

[Figure 3: three panels, (a) Stage I, (b) Stage II, (c) Stage III. Each panel shows an image passing through the Vision Encoder and MLP, a caption or instruction request passing through the Word Embedding layer, followed by Self-Attention, Add & Norm, and FFN blocks; in panel (c) a Router dispatches tokens to FFN 1 ... FFN E inside the MoE block, with expert weights copied from the FFN. Legend: activated forward, non-activated forward, copy weight, trainable parameter, non-trainable.]
Figure 3. Training framework and strategy. MoE-LLaVA adopts a three-stage training strategy. (a) We solely train the MLP to adapt the LLM to visual inputs. (b) Training the LLM backend empowers multi-modal understanding capability and MoE layers are not involved. (c) In this stage, we replicate the weights of the FFN to initialize each expert.

2.2. Mixture of Experts in Multi-modal Learning

Mixture of Experts (MoE) (Jacobs et al., 1991; Eigen et al., 2013) is a hybrid model consisting of multiple sub-models, known as experts, which are integrated together. The key concept of MoE is the use of a router to determine the token set that each expert handles, thereby reducing interference between different types of samples.

Hard Routers. In the hard router mode, each expert is typically pre-defined as a specific pattern. This is because multi-modal data naturally exhibit gaps (Liang et al., 2022), making it difficult for soft routers to learn the optimal patterns for assigning tokens to different experts. A series of works (Bao et al., 2022; Long et al., 2023; Satar et al., 2022; Wang et al., 2022; Shen et al., 2023) naturally decouple experts based on modal categories and pre-define each expert to handle a specific modality. An important feature of these hard-based routers is that they do not require learning the router. This mode is also widely applied in the task-specific MoE (Li et al., 2023e; Zhu et al., 2022; Ma et al., 2023; Kudugunta et al., 2021).

Soft Routers. Some works (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2022; Zoph et al., 2022; Komatsuzaki et al., 2022) in natural language processing have explored the MoE based on soft routers. Soft routers enable dynamic allocation of data among different experts, allowing each expert to focus on its expertise and achieve model sparsity. Therefore, our main focus is on leveraging soft routers in the MoE. Small-scale (million-level) models based on soft routers have also been explored in the context of multi-modal learning, such as EVE (Chen et al., 2023a) and LIMoE (Mustafa et al., 2022), which attempt a fusion of data by using soft routers. The work most relevant to ours is MoCLE (Gou et al., 2023). However, MoCLE clusters different instruction sets and distributes them to different experts, which compromises the flexibility and autonomy of the experts. Differently, MoE-LLaVA relies on knowledge-rich routers to distribute tokens to different paths.
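
The contrast between hard and soft routers described in this subsection can be illustrated with a small sketch. It is not taken from any of the cited systems: the names `EXPERT_FOR_MODALITY`, `hard_route`, `soft_route`, and the toy weights are hypothetical, and a real soft router would learn its weights during training.

```python
import math

# Hypothetical fixed assignment used by a hard router: one expert per modality.
EXPERT_FOR_MODALITY = {"image": 0, "text": 1}

def hard_route(modality):
    """Hard router: a predefined lookup, so there is nothing to learn."""
    return EXPERT_FOR_MODALITY[modality]

def soft_route(token, expert_weights, k=1):
    """Soft router: score the token against each expert, softmax, keep top-k.

    expert_weights holds one (toy, normally learned) weight vector per expert.
    """
    logits = [sum(w * t for w, t in zip(weights, token)) for weights in expert_weights]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k], probs

toy_weights = [[2.0, 0.0], [0.0, 2.0]]  # two experts, token dimension 2
chosen_a, probs_a = soft_route([1.0, 0.0], toy_weights)
chosen_b, probs_b = soft_route([0.0, 1.0], toy_weights)
```

The hard router depends only on the modality label, whereas the soft router's choice depends on the token content and on weights that can be trained, which is the property MoE-LLaVA relies on.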


Table 1. Architecture details of the MoE-LLaVA model. “FFN Factor” represents the number of linear layers in the FFN. “1.6B×4-Top2” represents a dense foundation model with 1.6B parameters, which is equipped with a total of four experts, two of them being activated.

| Name | Experts | Top-k | MoE Layers | Embedding | Width | Layers | FFN | FFN Factor | Heads | Activated Param | Total Param |
|---|---|---|---|---|---|---|---|---|---|---|---|
| StableLM-1.6B (Team) | - | - | - | 100352 | 2560 | 32 | 10240 | 2 | 32 | 1.6B | 1.6B |
| MoE-LLaVA-1.6B×4-Top2 | 4 | 2 | 16 | 100352 | 2560 | 32 | 10240 | 2 | 32 | 2.0B | 2.9B |
| Qwen-1.8B (Bai et al., 2023a) | - | - | - | 151936 | 2048 | 24 | 5504 | 3 | 16 | 1.8B | 1.8B |
| MoE-LLaVA-1.8B×4-Top2 | 4 | 2 | 12 | 151936 | 2048 | 24 | 5504 | 3 | 16 | 2.2B | 3.1B |
| Phi2-2.7B (Microsoft, 2023) | - | - | - | 51200 | 2560 | 32 | 10240 | 2 | 32 | 2.7B | 2.7B |
| MoE-LLaVA-2.7B×4-Top2 | 4 | 2 | 16 | 51200 | 2560 | 32 | 10240 | 2 | 32 | 3.6B | 5.3B |
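
As a rough sanity check on Table 1, the activated and total parameter counts of the Qwen and Phi rows can be approximated from the other columns. This is a back-of-the-envelope sketch, not the authors' accounting: it assumes each expert is a full copy of the dense FFN with `FFN Factor × Width × FFN` weights, starts from the rounded base sizes in the table, and ignores biases and router weights. The function name `moe_param_counts` is hypothetical.

```python
def moe_param_counts(base_b, width, ffn, ffn_factor, moe_layers, experts, top_k):
    """Return (activated, total) parameter counts in billions.

    Assumption: each expert is a full copy of the dense FFN, so one FFN holds
    ffn_factor * width * ffn weights; biases and router weights are ignored.
    """
    ffn_b = ffn_factor * width * ffn / 1e9
    # The dense model already contains one FFN per layer, so only the
    # additional experts add parameters on top of the base model.
    activated = base_b + moe_layers * (top_k - 1) * ffn_b
    total = base_b + moe_layers * (experts - 1) * ffn_b
    return activated, total

# Values taken from the Qwen-1.8B and Phi2-2.7B rows of Table 1.
qwen_act, qwen_total = moe_param_counts(1.8, 2048, 5504, 3, 12, 4, 2)
phi_act, phi_total = moe_param_counts(2.7, 2560, 10240, 2, 16, 4, 2)
print(f"Qwen-1.8B x4 Top2: activated ~{qwen_act:.2f}B, total ~{qwen_total:.2f}B")
print(f"Phi2-2.7B x4 Top2: activated ~{phi_act:.2f}B, total ~{phi_total:.2f}B")
```

Under these assumptions the estimates land within about 0.1B of the reported 2.2B / 3.1B and 3.6B / 5.3B, which shows why the total count grows with the number of experts while the activated count grows only with top-k.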

3. Method

3.1. Overview

As shown in Figure 3, MoE-LLaVA consists of a vision encoder, a visual projection layer (MLP), a word embedding layer, multiple stacked LLM blocks, and MoE blocks. We first introduce the model architecture of MoE-LLaVA in three stages in Section 3.2. Furthermore, in Section 3.3, we explain how to train MoE-LLaVA. Finally, in Section 3.4, we elaborate on the training objectives of MoE-LLaVA.

3.2. Architecture of MoE-LLaVA

As shown in Table 1, we present the detailed configuration of MoE-LLaVA and more details can be found in Appendix A.1. Given an RGB image $v \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ are the original resolution, the vision encoder processes the input image to obtain a visual token sequence $\mathcal{Z} = [z_1, z_2, \cdots, z_P] \in \mathbb{R}^{P \times C}$, where $P = \frac{H \times W}{14^2}$ represents the sequence length of visual tokens. A visual projection layer $f$ is used to map $\mathcal{Z} \in \mathbb{R}^{P \times C}$ to $\mathcal{V} \in \mathbb{R}^{P \times D}$, where $D$ represents the hidden size of LLM. Similarly, the text undergoes a word embedding layer $g$ and is projected to obtain the sequence tokens $\mathcal{T} = [t_1, t_2, \cdots, t_N] \in \mathbb{R}^{N \times D}$, where $N$ represents the sequence length of text tokens.

Subsequently, we concatenate the visual tokens and text tokens together and feed them into a large language model. In Stage I, only the visual projection layer is trained. The large language model consists of stacked multi-head self-attention (MSA) and feed-forward neural networks (FFN). Layer normalization (LN) and residual connections are applied within each block (Wang et al., 2019; Baevski & Auli, 2018). Therefore, we formulate this as:

$$x_0 = [v_1, v_2, \cdots, v_P, \cdots, t_1, t_2, \cdots, t_N], \tag{1}$$

$$x'_\ell = \mathrm{MSA}(\mathrm{LN}(x_{\ell-1})) + x_{\ell-1}, \quad \ell = 1 \ldots L, \tag{2}$$

$$x_\ell = \mathrm{MoE}(\mathrm{LN}(x'_\ell)) + x'_\ell, \quad \ell = 1 \ldots L, \tag{3}$$

$$\mathcal{Y} = \mathrm{LN}(x_L). \tag{4}$$

MoE Forward. Typically, a MoE layer consists of multiple FFNs. As an initialization step, we replicate the FFNs from stage II to form an ensemble of experts $\mathcal{E} = [e_1, e_2, \cdots, e_E]$. The router is a linear layer that predicts the probability of each token being assigned to each expert. We formulate this as:

$$\mathcal{P}(x)_i = \frac{e^{f(x)_i}}{\sum_{j}^{E} e^{f(x)_j}}, \tag{5}$$

where the router produces weight logits $f(x) = W \cdot x$, which are normalized by the softmax function. The $W \in \mathbb{R}^{D \times E}$ represents the lightweight training parameters and $E$ represents the number of experts. Therefore, each token is processed by the top-k experts with the highest probabilities, and the weighted sum is calculated based on the softmax results of the probabilities:

$$\mathrm{MoE}(x) = \sum_{i=1}^{k} \mathcal{P}(x)_i \cdot \mathcal{E}(x)_i. \tag{6}$$

3.3. MoE-Tuning

Stage I: In this stage, our objective is to adapt the image tokens to LLM, allowing the LLM to comprehend the instances in the images. To achieve this, we employ an MLP to project the image tokens into the input domain of the LLM, treating the image patches as pseudo-text tokens. During this stage, the model is trained to describe the images, with only the MLP being updated (Figure 3a). MoE layers are not applied to the LLM during this stage.

Stage II: Tuning with multi-modal instruction data is a key technique to enhance the capabilities and controllability of large models (Zhang et al., 2023b). In this stage, the LLM is adjusted to become an LVLM with multi-modal understanding. We use more complex instructions, including tasks such as image logical reasoning and text recognition, which require the model to have a stronger multi-modal understanding. Typically, for dense models, the LVLM training is considered complete at this stage. However, we encounter challenges in simultaneously transforming the LLM into an LVLM and sparsifying the LVLM. Therefore, MoE-LLaVA utilizes the weights from the second stage as initialization for the third stage to alleviate the learning difficulty of the sparse model.
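
The MoE forward pass of Equations (5) and (6) in Section 3.2 can be sketched in plain Python for a single token. This is a minimal illustration under stated assumptions rather than the released implementation: the helper names (`softmax`, `router_probs`, `moe_forward`) and the toy experts are hypothetical, and the top-k weights are the full-softmax probabilities as written in Equation (6), without renormalizing over the selected experts.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats (Equation 5)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]

def router_probs(x, W):
    """Router logits f(x) = W . x with W of shape D x E, then softmax."""
    num_experts = len(W[0])
    logits = [sum(x[d] * W[d][e] for d in range(len(x))) for e in range(num_experts)]
    return softmax(logits)

def moe_forward(x, W, experts, k=2):
    """Weighted sum of the top-k expert outputs (Equation 6)."""
    probs = router_probs(x, W)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)  # only the selected experts are evaluated
        out = [o + probs[i] * yj for o, yj in zip(out, y)]
    return out, top, probs

# Toy setup: D = 2, E = 4; expert i simply scales its input (hypothetical).
toy_experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
toy_W = [[0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 0.0, 0.0]]
output, top_experts, probs = moe_forward([1.0, 0.0], toy_W, toy_experts, k=2)
```

With these toy weights the router logits are 0, 1, 2, 3, so experts 4 and 3 (indices 3 and 2) are selected and the other two are never called, which is what keeps the activated computation fixed as the number of experts grows.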

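
Initializing the experts from the stage II weights amounts to copying the trained FFN once per expert. The sketch below shows that idea with a toy stand-in class; `FFN` and `init_experts_from_ffn` are hypothetical names, and a real implementation would copy framework tensors rather than nested lists.

```python
import copy

class FFN:
    """Toy stand-in for a trained feed-forward block (hypothetical)."""
    def __init__(self, w_in, w_out):
        self.w_in = w_in
        self.w_out = w_out

def init_experts_from_ffn(ffn, num_experts=4):
    """Create num_experts independent copies of the trained FFN.

    Deep copies are used so that each expert can diverge during stage III
    without changing the other experts or the original FFN.
    """
    return [copy.deepcopy(ffn) for _ in range(num_experts)]

dense_ffn = FFN(w_in=[[0.1, 0.2], [0.3, 0.4]], w_out=[[0.5], [0.6]])
experts = init_experts_from_ffn(dense_ffn, num_experts=4)
experts[0].w_in[0][0] = 9.9  # simulate a stage III update to one expert
```

All experts start from identical weights, so the sparse model initially behaves like the stage II dense model and specialization emerges only through subsequent training.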

Stage III: As an initialization, we replicate the FFN multiple times to initialize the experts. When image tokens and text tokens are fed into the MoE layers, the router calculates the matching weights between each token and the experts. Each token is then processed by the top-k experts, and the outputs are aggregated by weighted summation based on the router's weights. When the top-k experts are activated, the remaining experts remain silent. This modeling approach forms the MoE-LLaVA with infinitely possible sparse pathways, offering a wide range of capabilities.

3.4. Training Objectives

The $\mathcal{L}_{\text{total}}$ consists of auto-regressive loss $\mathcal{L}_{\text{regressive}}$ and auxiliary loss $\mathcal{L}_{\text{aux}}$, and the auxiliary loss is scaled by the balancing coefficient $\alpha$:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{regressive}} + \alpha \cdot \mathcal{L}_{\text{aux}}. \tag{7}$$

Auto-Regressive Loss. We optimize the output of LLM through a generative loss in an auto-regressive manner. Given an image and text, MoE-LLaVA generates the output sequence $\mathcal{Y} = [y_1, y_2, \cdots, y_K] \in \mathbb{R}^{K \times D}$ by progressively generating each element, where $K = P + N$ represents the length of the output sequence. The formula is:

$$\mathcal{L}_{\text{regressive}} = -\sum_{i=1}^{N} \log p_\theta\left(\mathcal{Y}^{[P+i]} \mid \mathcal{V}, \mathcal{T}^{[:i-1]}\right), \tag{8}$$

where $\theta$ is a trainable parameter and we only calculate the loss for the generated text.

Auxiliary Loss. Due to the presence of multiple experts, it is necessary to impose load balancing constraints on the MoE layer. We incorporate differentiable load balancing loss (Fedus et al., 2022) into each MoE layer to encourage experts to handle tokens in a balanced manner as follows:

$$\mathcal{L}_{\text{aux}} = E \cdot \sum_{i=1}^{E} \mathcal{F}_i \cdot \mathcal{G}_i, \tag{9}$$

where $\mathcal{F}$ represents the fraction of tokens processed by each expert $\mathcal{E}_i$, and $\mathcal{G}$ represents the average routing probability of $\mathcal{E}_i$, which can be expressed by the following formulas:

$$\mathcal{F} = \frac{1}{K} \sum_{i=1}^{E} 1\{\operatorname{argmax} \mathcal{P}(x) = i\}, \tag{10}$$

$$\mathcal{G} = \frac{1}{K} \sum_{i=1}^{K} \mathcal{P}(x)_i. \tag{11}$$

4. Experiments

4.1. Experimental Setup

Model Settings. Following LLaVA 1.5 (Liu et al., 2023b), we utilize CLIP-Large (Radford et al., 2021) as the vision encoder, and the MLP consists of two linear layers with GELU activation function (Hendrycks & Gimpel, 2016) between them. Unless otherwise specified, MoE-LLaVA employs an alternating replacement of FFN with MoE layers, meaning that the number of MoE layers is half of the total number of layers. The value of balancing coefficient α is 0.01. We provide additional training details in Appendix A.2.

Data Details. As shown in Table 2, we reorganize the currently available data for the three-stage training. For the first stage of pretraining, we use the pretrained data of LLaVA 1.5-558k (Liu et al., 2023b). For the second stage, we collect datasets from MIMIC-IT (Li et al., 2023a), LRV (Liu et al., 2023a), SViT (Zhao et al., 2023a) and LVIS (Wang et al., 2023b) to provide a robust initialization for MoE-LLaVA. For the third stage, we utilize the same data pipeline as LLaVA-mix-665k (Liu et al., 2023b).

Table 2. Composition of the data groups. For the MIMIC-IT and SViT datasets, we only use the LA split and core split, respectively.

| Data group | Usage | Source | #Sample |
|---|---|---|---|
| LLaVA-PT | Stage I | LLaVA 1.5-558k | 558k |
| Hybrid-FT | Stage II | SViT-157k, LVIS-220k, LRV-331k, MIMIC-IT-256k | 964k |
| LLaVA-FT | Stage III | LLaVA 1.5-mix-665k | 665k |

4.2. Image Understanding Evaluation

Zero-shot Image Question Answering. As shown in Table 3, since MoE-LLaVA is a sparse model equipped with a soft router based on LVLM, we categorize the previous models as dense models. We evaluate the performance of MoE-LLaVA on five image question-answering benchmarks and report the number of activated parameters. Compared to the state-of-the-art method LLaVA 1.5, MoE-LLaVA demonstrates powerful image understanding capabilities and performs very close to LLaVA-1.5 on five benchmarks. Specifically, MoE-LLaVA-Phi-2.7B×4 surpasses LLaVA-1.5-7B by 2.7% on SQA^I using 3.6B sparse activated parameters. Notably, MoE-LLaVA-StableLM-1.6B×4 achieves comprehensive superiority over IDEFICS-80B with only 2.0B activated parameters. Furthermore, we observe the recent small-scale vision-language model, LLaVA-Phi. MoE-LLaVA-Phi-2.7B×4 outperforms LLaVA-Phi by more than 6.2% on VQA^v2, highlighting the strong comprehension abilities of MoE-LLaVA in natural vision.

Evaluation under Benchmark Toolkits. To comprehensively evaluate the multi-modal understanding capabilities of MoE-LLaVA, we evaluate its performance on four benchmark toolkits. These benchmark toolkits typically involve open-ended answers, serving as tools to verify a model's ability to engage in natural language questioning. In Table 3, MoE-LLaVA-Qwen-1.8B×4 surpasses Qwen-VL-7B by 21.5% on MMBench, despite the latter utilizing higher image resolutions. These results collectively demonstrate that the sparse model MoE-LLaVA achieves comparable or even superior performance to dense models with fewer activated parameters.

Table 3. Comparison among different LVLMs on image understanding benchmarks. “Res.”, “Act.”, “L”, “V”, “S”, “Q”, “P”, “M” and “I” respectively represent the input image resolution, activated parameters, LLaMA (Touvron et al., 2023a), Vicuna (Chiang et al., 2023), StableLM (Team), Qwen (Bai et al., 2023a), Phi-2 (Microsoft, 2023), MobileLLaMA (Chu et al., 2023) and IDEFICS (Laurençon et al., 2023). Evaluation Benchmarks include VQA-v2 (Goyal et al., 2017); GQA (Hudson & Manning, 2019); VisWiz (Gurari et al., 2018); SQA^I: ScienceQA-IMG (Lu et al., 2022); VQA^T: TextVQA (Singh et al., 2019); POPE (Li et al., 2023d); MME (Fu et al., 2023); MMB: MMBench (Liu et al., 2023d); LLaVA^W: LLaVA-Bench (in-the-Wild) (Liu et al., 2023c); MM-Vet (Yu et al., 2023). ∗ denotes that there is some overlap in the training data. † denotes that the model is trained with an image resolution of 384. The best results and second best results are indicated by boldface and underline, respectively. Columns VQA^v2 through VQA^T are image question answering benchmarks; columns POPE through MM-Vet are benchmark toolkits.

| Methods | LLM | Act. | Res. | VQA^v2 | GQA | VisWiz | SQA^I | VQA^T | POPE | MME | MMB | LLaVA^W | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense Model | | | | | | | | | | | | | |
| I-80B (Laurençon et al., 2023) | L-65B | 65B | 224 | 60.0 | 45.2 | 36.0 | - | 30.9 | - | - | 54.5 | - | - |
| LLaVA-1.5 (Liu et al., 2023b) | V-13B | 13B | 336 | 80.0∗ | 63.3∗ | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 70.7 | 35.4 |
| Qwen-VL (Bai et al., 2023b) | Q-7B | 6.7B | 448 | 78.8∗ | 59.3∗ | 35.2 | 67.1 | 63.8 | - | - | 38.2 | - | - |
| LLaVA-1.5 (Liu et al., 2023b) | V-7B | 6.7B | 336 | 78.5∗ | 62.0∗ | 50.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 63.4 | 30.5 |
| TinyGPT-V (Yuan et al., 2023) | P-2.7B | 2.7B | 448 | - | 33.6∗ | 33.4 | - | - | - | - | - | - | - |
| MobileVLM (Chu et al., 2023) | M-2.7B | 2.7B | 336 | - | 59.0∗ | - | 61.0 | 47.5 | 84.9 | 1288.9 | 59.6 | - | - |
| LLaVA-Phi (Zhu et al., 2024) | P-2.7B | 2.7B | 336 | 71.4∗ | - | 35.9 | 68.4 | 48.6 | 85.0 | 1335.1 | 59.8 | - | 28.9 |
| Sparse Model | | | | | | | | | | | | | |
| MoE-LLaVA-1.6B×4-Top2 | S-1.6B | 2.0B | 336 | 76.7∗ | 60.3∗ | 36.2 | 62.6 | 50.1 | 85.7 | 1318.2 | 60.2 | 86.8 | 26.9 |
| MoE-LLaVA-1.8B×4-Top2 | Q-1.8B | 2.2B | 336 | 76.2∗ | 61.5∗ | 32.6 | 63.1 | 48.0 | 87.0 | 1291.6 | 59.7 | 88.7 | 25.3 |
| MoE-LLaVA-2.7B×4-Top2 | P-2.7B | 3.6B | 336 | 77.6∗ | 61.4∗ | 43.9 | 68.5 | 51.4 | 86.3 | 1423.0 | 65.2 | 94.1 | 34.3 |
| MoE-LLaVA-1.6B×4-Top2† | S-1.6B | 2.0B | 384 | 78.6∗ | 61.5∗ | 40.5 | 63.9 | 54.3 | 85.9 | 1335.7 | 63.3 | 90.3 | 32.3 |
| MoE-LLaVA-2.7B×4-Top2† | P-2.7B | 3.6B | 384 | 79.9∗ | 62.6∗ | 43.7 | 70.3 | 57.0 | 85.7 | 1431.3 | 68.0 | 97.3 | 35.9 |

4.3. Object Hallucination Evaluation

We adopt the evaluation pipeline of POPE (Li et al., 2023d), a polling-based query method, to evaluate object hallucination in MoE-LLaVA. The results are presented in Table 4, where MoE-LLaVA exhibits the best performance, indicating that MoE-LLaVA tends to generate objects consistent with the given image. Specifically, MoE-LLaVA-1.8B×4 surpasses LLaVA-1.5-13B by 1.0%, 1.5%, and 0.8% in adversarial sampling, popular sampling, and random sampling, respectively, with 2.2B activated parameters. Additionally, we observe that the yes ratio of MoE-LLaVA remains relatively balanced, indicating that our sparse model is capable of providing accurate feedback based on the given questions.

Table 4. Zero-shot object hallucination evaluation results. “Yes” indicates the proportion of positive responses to the given question. Each of the Adversarial, Popular, and Random subsets reports Acc, F1-Score, and Yes.

| Methods | LLM | Activated | Adversarial Acc | Adversarial F1-Score | Adversarial Yes | Popular Acc | Popular F1-Score | Popular Yes | Random Acc | Random F1-Score | Random Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense Model | | | | | | | | | | | |
| mPLUG-Owl (Ye et al., 2023) | L-7B | 6.7B | 82.4 | 81.6 | 45.2 | 85.5 | 84.3 | 42.1 | 86.3 | 85.3 | 42.3 |
| MM-GPT (Gong et al., 2023) | L-7B | 6.7B | 50.0 | 66.7 | 100.0 | 50.0 | 66.7 | 100.0 | 50.0 | 66.7 | 100.0 |
| LLaVA-1.5 (Liu et al., 2023b) | V-13B | 13B | 85.5 | 84.4 | 43.3 | 87.4 | 86.2 | 41.3 | 88.0 | 87.1 | 41.7 |
| Sparse Model | | | | | | | | | | | |
| MoE-LLaVA-1.6B×4-Top2 | S-1.6B | 2.0B | 86.9 | 85.7 | 41.7 | 85.3 | 84.2 | 43.5 | 88.0 | 87.1 | 41.6 |
| MoE-LLaVA-1.8B×4-Top2 | Q-1.8B | 2.2B | 86.1 | 85.4 | 44.9 | 88.6 | 87.7 | 42.5 | 88.7 | 88.0 | 43.0 |
| MoE-LLaVA-2.7B×4-Top2 | P-2.7B | 3.6B | 85.9 | 84.9 | 43.2 | 87.5 | 86.4 | 41.8 | 88.5 | 87.7 | 41.8 |
| MoE-LLaVA-1.6B×4-Top2† | S-1.6B | 2.0B | 86.9 | 85.6 | 41.5 | 85.7 | 84.6 | 43.0 | 88.4 | 87.5 | 41.5 |
| MoE-LLaVA-2.7B×4-Top2† | P-2.7B | 3.6B | 85.5 | 84.2 | 41.9 | 86.7 | 84.4 | 41.7 | 87.9 | 86.9 | 40.6 |

4.4. Quantitative Analysis

Routing Distributions. In Figure 4, we present the expert loads (leftmost plot) and the modality preferences of different experts (four subplots on the right) through MoE-LLaVA-2.7B×4-Top2 on ScienceQA. More visualizations can be found in Appendix B.3. To begin with, the expert loads in all MoE layers are totally balanced. However, as the model gradually becomes sparser, the expert 3 loads for layers 17 to 27 suddenly increase, and they even dominate the workload of almost all tokens. For the shallow layers (5-11), experts 2, 3, and 4 mainly collaborate. It is worth noting that expert 1 only works predominantly in the first few layers, and as the model becomes deeper, expert 1 gradually withdraws from the workload. Therefore, the experts in MoE-LLaVA have learned a certain pattern that allows them to divide their tasks in a specific manner.

[Figure 4: five panels titled All experts, Expert 1, Expert 2, Expert 3, Expert 4. x-axis: MoE layer (1 to 31); y-axis: Percentage (0% to 100%). The first panel is split by expert; the remaining panels are split into Text and Image.]
Figure 4. Distribution of expert loadings. The discontinuous lines represent a perfectly balanced distribution of tokens among different experts or modalities. The first figure on the left illustrates the workload among experts, while the remaining four figures depict the preferences of experts towards different modalities.

Furthermore, we show the distribution of modalities across different experts in Figure 5. Similarly, experts develop their own preferences. Additionally, we find that the routing distributions for text and image are highly similar. For example, when expert 3 is actively working in layers 17-27, the proportions of text and image that MoE-LLaVA processes are similar. Each expert in MoE-LLaVA is capable of handling both text tokens and image tokens simultaneously, which demonstrates that MoE-LLaVA does not exhibit a clear preference for any modality. This serves as evidence of its strong interaction in multimodal learning.

[Figure 5: two panels titled Text and Image, each split by Expert 1 to Expert 4. x-axis: MoE layer idx (1 to 31); y-axis: Percentage (0% to 100%).]
Figure 5. Distribution of modalities across different experts. Interrupted lines mean a perfectly balanced distribution of tokens.

Token Pathways. Furthermore, we examine the behavior of experts at the token level. More visualizations can be found in Appendix B.4 and Appendix B.5. We track the trajectories of all tokens on downstream tasks. For all activated pathways, we employ PCA (Pearson, 1901) to obtain the top-10 pathways, as shown in Figure 6. We found that for a given unseen text token or image tokens, MoE-LLaVA consistently tends to assign experts 2 and 3 to handle them in the deeper layers of the model. Regarding experts 1 and 4, they tend to handle the tokens during the initialization phase. These findings contribute to a better understanding of the behavior of sparse models in multi-modal learning.

[Figure 6: two panels titled Text and Image. x-axis: MoE layer idx (1 to 31); y-axis: Expert idx (1 to 4). Legend: Top-1, Top-2, Others.]
Figure 6. Visualization of activated pathways. We highlight the top-10 activated pathways on the text and image. Among them, the colorful paths represent the top-2 paths for text and image, respectively, while the gray paths represent the remaining 8 paths.

4.5. Ablation Study

In this section, we first validate the necessity of the three-stage training strategy. We then explore the impact of different base models and conduct ablation studies on the number of experts and active experts, and the MoE structure. We provide additional results in Appendix B.2.

Effect of Training Strategy. In Table 6, we conduct three variant experiments to demonstrate the rationale behind using the second-stage instruction tuning as the initialization for the third-stage MoE tuning. When adapting MoE to LVLMs, a straightforward approach is to replace the classic LLaVA's FFN with a MoE layer and train it according to the original second-stage script, denoted as variant (a). However, variant (a) performs the worst, suggesting that the current multi-modal instruction dataset is insufficient to support both the conversion from LLM to LVLM and the conversion from LVLM to a sparse model simultaneously. Therefore, we collect more data, referred to as Hybrid-FT,


and initially convert LLM to LVLM in the second stage. Subsequently, in the third stage, LVLM is sparsified by using the LLaVA-FT dataset, resulting in variant (b). Additionally, we expand the data of the original LLaVA’s second stage for fair comparison, denoted as variant (c). The results indicate that variant (b) outperforms variants (a) and (c). These findings demonstrate that providing a reasonable LVLM initialization allows the model to transition rapidly from a dense model to a sparse model, validating the principle behind our three-stage training strategy.

Table 6. Ablation study about different training strategies. “LV” and “Hb” represent LLaVA-FT and Hybrid-FT in Table 2.

Variant  MoE  Stage II  Stage III  GQA   SQA^I  POPE  LLaVA^W
(a)      ✔    -         LV+Hb      58.4  58.1   81.9  88.0
(b)      ✔    Hb        LV         61.5  63.1   87.0  88.7
(c)      ✗    LV+Hb     -          60.9  60.2   86.4  86.3
(d)      ✗    Hb        LV         60.9  62.5   86.9  90.1

Effect of Tuning the Parameters of Different Subsets. In Table 5a, we examine the performance of fine-tuning different parts of the parameters. “FFN” represents fine-tuning all FFN layers and MoE layers in the model. “All” indicates fine-tuning all parameters. The results indicate that tuning the FFN is sufficient to achieve results comparable to full-parameter tuning, while requiring only approximately 75% of the time (20h versus 27h). Therefore, to enhance generalization and reduce training costs, we only fine-tune FFN layers.

Table 5. Ablation study about training setting and architecture design decisions. Settings for results in Table 3 and Table 4 are highlighted in blue. We report the training time on 8 V100-32G.

(a) Tuning the parameters of different subsets.

Subset  GQA   VisWiz  VQA^T  POPE  LLaVA^W  Time
FFN     61.5  32.6    48.0   87.0  88.7     20h
All     61.3  31.9    47.6   87.0  88.1     27h

(b) The number of experts.

Experts  GQA   SQA^I  VQA^T  POPE  LLaVA^W  Time
1        60.9  60.2   48.3   86.4  86.3     13h
2        61.2  60.8   47.0   87.5  86.5     14h

(c) The value of top-k.

Top-k  VQAv2  GQA   SQA^I  VQA^T  POPE  Time
1      74.5   58.4  58.0   44.0   85.7  19h
2      76.2   61.5  63.1   48.0   88.7  20h

(d) The architectures of MoE-LLaVA.

Architecture  VQAv2  GQA   SQA^I  VQA^T  POPE  Time
First-Half    75.9   61.3  62.4   47.0   86.9  20h
Second-Half   76.3   61.2  62.6   47.2   86.9  20h
Interval      76.2   61.5  63.1   48.0   88.7  20h
All           74.5   61.5  62.1   47.1   87.0  32h

Effect of the Number of Experts. Typically, increasing the number of experts directly leads to higher performance (Lepikhin et al., 2020; Fedus et al., 2022). In Table 5b, we change the number of experts while keeping the number of activated experts the same, so the number of activated parameters for both models remains the same. More sparse experts outperform the single-expert dense model by 1.1% on POPE and 0.6% on SQA^I, respectively. The results demonstrate that sparse experts can deliver superior performance.

Effect of the Number of Activated Experts. To evaluate the effect of the number of activated experts, we compare the performance of different top-k strategies (Table 5c). Changing the number of activated experts from 1 to 2 brings a significant improvement with only 1h of additional training time. These results show that activating more experts can improve the ability of MoE-LLaVA. To leverage the advantages of the MoE scheme, we set the number of activated experts to 2.

Effect of the Architectures. In Table 5d, we explore four variations of MoE architecture. Specifically, “First-Half” indicates that MoE layers are applied only to the first half of the model while the second half retains the original dense architecture. “Second-Half” means that MoE layers are placed in the second half of the model while the first half remains dense. “Interval” represents alternating occurrences of MoE layers and dense layers. “All” indicates that all layers are sparse MoE layers. Intuitively, one would expect that making all layers MoE layers enhances performance. However, using “All” does not yield better results and leads to longer training times than the other architectures. Therefore, MoE-LLaVA alternates the insertion of MoE layers.

Effect of the Model Size. As shown in Table 7, we compare the performance of models with different parameter sizes as the foundation models for MoE-LLaVA. For smaller models such as Phi2-MoE and Qwen-MoE, the performance with MoE surpasses that of dense models. We provide additional results in Appendix B.1.

Table 7. Ablation study about the model size of MoE-LLaVA.

Model     MoE  VQAv2  SQA^I  VQA^T  MMB   LLaVA^W
StableLM  ✗    74.5   62.0   48.8   58.2  83.2
StableLM  ✔    76.7   62.6   50.1   60.2  86.8
Qwen      ✗    74.9   60.2   48.3   60.6  86.3
Qwen      ✔    76.2   63.1   48.0   59.7  88.7
Phi-2     ✗    75.6   67.8   50.0   65.0  91.3
Phi-2     ✔    77.6   68.5   51.4   65.2  94.1
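The four layer-placement variants described under “Effect of the Architectures” can be sketched as a small helper that returns which transformer layers receive an MoE layer. This is only an illustrative sketch: the function name and the 0-based indexing are hypothetical, and the choice to start the “Interval” pattern at the first layer is an assumption, since the text only states that MoE and dense layers alternate.

```python
def moe_layer_indices(num_layers: int, architecture: str) -> list[int]:
    """Return the 0-based indices of layers whose FFN is replaced by an MoE layer.

    Hypothetical helper for illustration; not taken from the MoE-LLaVA code base.
    """
    half = num_layers // 2
    if architecture == "first_half":
        # MoE in the first half, dense FFN in the second half.
        return list(range(half))
    if architecture == "second_half":
        # Dense FFN in the first half, MoE in the second half.
        return list(range(half, num_layers))
    if architecture == "interval":
        # Assumption: alternate MoE and dense layers, starting with layer 0.
        return list(range(0, num_layers, 2))
    if architecture == "all":
        # Every layer is a sparse MoE layer.
        return list(range(num_layers))
    raise ValueError(f"unknown architecture: {architecture!r}")


# A 32-layer backbone with alternating placement has 16 MoE layers.
interval_layers = moe_layer_indices(32, "interval")
```

Under this sketch, a 32-layer backbone gets 16 MoE layers with “Interval” and 32 with “All”, which is consistent with the MoE layer counts listed for the 32-layer models in Table 8.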


5. Conclusion and Future Directions


In this work, we propose MoE-Tuning for adapting the MoE architecture to LVLMs, and construct the MoE-based sparse model MoE-LLaVA, which can find a sparse pathway by simultaneously handling image and text features. Our framework demonstrates a strong ability for multi-modal understanding and rich potential for hallucination inhibition, achieving performance comparable to LLaVA-1.5-7B with only 3B activated parameters.
While MoE-LLaVA demonstrates competitive capabilities, we observe some difficulties in training stability, particularly with 16-bit float precision. Furthermore, due to the presence of multiple experts specializing in different abilities, MoE-LLaVA can easily be expanded to handle additional tasks such as detection, segmentation, and generation, or to handle more modalities such as video, depth, and thermal.


Impact Statements

Broader Impacts

While MoE-LLaVA holds great potential and application value in multi-modal understanding, it may also have some negative social impacts:

• Information credibility: MoE-LLaVA can generate realistic texts, including false information and misleading content.
• Bias and discrimination: The training data for MoE-LLaVA often comes from the internet, where various biases and discriminatory content may exist. If these unequal patterns are learned and amplified by the model, they may be reflected in the generated responses.
• Social influence: People may become overly reliant on MoE-LLaVA for information and problem-solving, instead of actively thinking and seeking multiple sources of information. This can lead to increased dependency, reduced autonomy in thinking, and weaker judgment skills.

Reproducibility

In Appendix A.2, we have provided a detailed list of all the training hyperparameters. We have open-sourced all models and code. Reproducibility can be achieved by using the code provided in the materials.

Compute

For the main results, we conduct experiments on 8 A800-80G. For the ablation study, we measure the time on 8 V100-32G.

Licenses

The majority of this project is released under the Apache 2.0 license.

• The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA (https://[Link]/facebookresearch/llama/blob/main/MODEL_CARD.md).
• Terms of Use of the data generated by OpenAI ([Link] terms-of-use).
• Privacy Practices of ShareGPT (https://[Link]/webstore/detail/sharegpt-share-your-chatg/daiacboceoaocpibfodeljbdfacokfjb).

References

01-ai. Building the next generation of open-source and bilingual llms. [Link] 2023.

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.

Baevski, A. and Auli, M. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853, 2018.

Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023a.

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023b.

Bao, H., Wang, W., Dong, L., Liu, Q., Mohammed, O. K., Aggarwal, K., Som, S., Piao, S., and Wei, F. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems, 35:32897–32912, 2022.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

Cha, J., Kang, W., Mun, J., and Roh, B. Honeybee: Locality-enhanced projector for multimodal llm. arXiv preprint arXiv:2312.06742, 2023.

Chen, J., Guo, L., Sun, J., Shao, S., Yuan, Z., Lin, L., and Zhang, D. Eve: Efficient vision-language pre-training with masked prediction and modality-aware moe. arXiv preprint arXiv:2308.11971, 2023a.

Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., and Elhoseiny, M. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023b.

Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., and Zhao, R. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023c.

Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., and Lin, D. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023d.


Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Muyan, Z., Zhang, Q., Zhu, X., Lu, L., et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023e.

Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See [Link] lmsys. org (accessed 14 April 2023), 2023.

Chu, X., Qiao, L., Lin, X., Xu, S., Yang, Y., Hu, Y., Wei, F., Zhang, X., Zhang, B., Wei, X., et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.

Dai, W., Li, J., Li, D., Tiong, A. M. H., Zhao, J., Wang, W., Li, B., Fung, P., and Hoi, S. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.

DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.

Du, Z., Qian, Y., Liu, X., Ding, M., Qiu, J., Yang, Z., and Tang, J. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021.

Eigen, D., Ranzato, M., and Sutskever, I. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314, 2013.

falconry. Falcon-180b. [Link] ae/, 2023.

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232–5270, 2022.

FlagAI-Open. Aquila2-34b. [Link] FlagAI-Open/Aquila2, 2023.

Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., and Ji, R. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.

Gong, T., Lyu, C., Zhang, S., Wang, Y., Zheng, M., Zhao, Q., Liu, K., Zhang, W., Luo, P., and Chen, K. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.

Gou, Y., Liu, Z., Chen, K., Hong, L., Xu, H., Li, A., Yeung, D.-Y., Kwok, J. T., and Zhang, Y. Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv preprint arXiv:2312.12379, 2023.

Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913, 2017.

Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., Luo, J., and Bigham, J. P. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3608–3617, 2018.

Hendrycks, D. and Gimpel, K. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.

Hudson, D. A. and Manning, C. D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6700–6709, 2019.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7b, 2023.

Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., de las Casas, D., Hanna, E. B., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L. R., Saulnier, L., Lachaux, M.-A., Stock, P., Subramanian, S., Yang, S., Antoniak, S., Scao, T. L., Gervet, T., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mixtral of experts, 2024.

Koh, J. Y., Salakhutdinov, R., and Fried, D. Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823, 2023.

Komatsuzaki, A., Puigcerver, J., Lee-Thorp, J., Ruiz, C. R., Mustafa, B., Ainslie, J., Tay, Y., Dehghani, M., and Houlsby, N. Sparse upcycling: Training mixture-of-experts from dense checkpoints. arXiv preprint arXiv:2212.05055, 2022.

Kudugunta, S., Huang, Y., Bapna, A., Krikun, M., Lepikhin, D., Luong, M.-T., and Firat, O. Beyond distillation: Task-level mixture-of-experts for efficient inference. arXiv preprint arXiv:2110.03742, 2021.


Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., and Jia, J. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.

Laurençon, H., Saulnier, L., Tronchon, L., Bekman, S., Singh, A., Lozhkov, A., Wang, T., Karamcheti, S., Rush, A. M., Kiela, D., Cord, M., and Sanh, V. Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023.

Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.

Li, B., Zhang, Y., Chen, L., Wang, J., Pu, F., Yang, J., Li, C., and Liu, Z. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023a.

Li, J., Li, D., Xiong, C., and Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pp. 12888–12900. PMLR, 2022.

Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023b.

Li, X., Yao, Y., Jiang, X., Fang, X., Meng, X., Fan, S., Han, P., Li, J., Du, L., Qin, B., et al. Flm-101b: An open llm and how to train it with 100 k budget. arXiv preprint arXiv:2309.03852, 2023c.

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023d.

Li, Y., Hui, B., Yin, Z., Yang, M., Huang, F., and Li, Y. Pace: Unified multi-modal dialogue pre-training with progressive and compositional experts. arXiv preprint arXiv:2305.14839, 2023e.

Liang, V. W., Zhang, Y., Kwon, Y., Yeung, S., and Zou, J. Y. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.

Lin, B., Zhu, B., Ye, Y., Ning, M., Jin, P., and Yuan, L. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.

Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., and Wang, L. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023a.

Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023b.

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023c.

Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023d.

Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12009–12019, 2022.

Liu, Z., He, Y., Wang, W., Wang, W., Wang, Y., Chen, S., Zhang, Q., Lai, Z., Yang, Y., Li, Q., et al. Interngpt: Solving vision-centric tasks by interacting with chatgpt beyond language. arXiv preprint arXiv:2305.05662, 3, 2023e.

Long, Z., Killick, G., McCreadie, R., and Camarasa, G. A. Multiway-adapater: Adapting large-scale multi-modal models for scalable image-text retrieval. arXiv preprint arXiv:2309.01516, 2023.

Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu, S.-C., Tafjord, O., Clark, P., and Kalyan, A. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.

Ma, G., Wu, X., Wang, P., and Hu, S. Cot-mote: Exploring contextual masked auto-encoder pre-training with mixture-of-textual-experts for passage retrieval. arXiv preprint arXiv:2304.10195, 2023.

Microsoft. Phi-2: The surprising power of small language models. [Link] [Link]/en-us/research/blog/phi-2-the-surprising-power-of-small-language-mode 2023.

Mustafa, B., Riquelme, C., Puigcerver, J., Jenatton, R., and Houlsby, N. Multimodal contrastive learning with limoe: the language-image mixture of experts. Advances in Neural Information Processing Systems, 35:9564–9576, 2022.

OpenAI. Gpt-4 technical report, 2023.

Pearson, K. Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin philosophical magazine and journal of science, 2(11):559–572, 1901.


Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.

Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., and Wei, F. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.

Pi, R., Gao, J., Diao, S., Pan, R., Dong, H., Zhang, J., Yao, L., Han, J., Xu, H., and Zhang, L. K. T. Detgpt: Detect what you need via reasoning. arXiv preprint arXiv:2305.14167, 2023.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021.

Rasheed, H., Maaz, M., Shaji, S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R. M., Xing, E., Yang, M.-H., and Khan, F. S. Glamm: Pixel grounding large multimodal model. arXiv preprint arXiv:2311.03356, 2023.

Riquelme, C., Puigcerver, J., Mustafa, B., Neumann, M., Jenatton, R., Susano Pinto, A., Keysers, D., and Houlsby, N. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583–8595, 2021.

Satar, B., Zhu, H., Zhang, H., and Lim, J. H. Rome: Role-aware mixture-of-expert transformer for text-to-video retrieval. arXiv preprint arXiv:2206.12845, 2022.

Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

Shen, S., Yao, Z., Li, C., Darrell, T., Keutzer, K., and He, Y. Scaling vision-language models with sparse mixture of experts. arXiv preprint arXiv:2303.07226, 2023.

Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., and Rohrbach, M. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8317–8326, 2019.

Sun, T., Zhang, X., He, Z., Li, P., Cheng, Q., Yan, H., Liu, X., Shao, Y., Tang, Q., Zhao, X., et al. Moss: Training conversational language models from synthetic data. arXiv preprint arXiv:2307.15020, 7, 2023.

SUSTech-IDEA. Sus-chat: Instruction tuning done right. [Link] SUS-Chat, 2023.

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. [Link] stanford. edu/2023/03/13/alpaca. html, 3(6):7, 2023.

Team, I. Internlm: A multilingual language model with progressively enhanced capabilities, 2023.

Team, S. A. L. Stable lm 2 1.6b. URL [[Link] stablelm-2-1.6b]([Link] co/stabilityai/stablelm-2-1.6b).

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.

Wang, G., Cheng, S., Zhan, X., Li, X., Song, S., and Liu, Y. Openchat: Advancing open-source language models with mixed-quality data. arXiv preprint arXiv:2309.11235, 2023a.

Wang, J., Meng, L., Weng, Z., He, B., Wu, Z., and Jiang, Y.-G. To see is to believe: Prompting gpt-4v for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023b.

Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D. F., and Chao, L. S. Learning deep transformer models for machine translation. arXiv preprint arXiv:1906.01787, 2019.

Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O. K., Singhal, S., Som, S., et al. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022.

Wang, W., Chen, Z., Chen, X., Wu, J., Zhu, X., Zeng, G., Luo, P., Lu, T., Zhou, J., Qiao, Y., et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023c.


Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023d.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

Yang, A., Xiao, B., Wang, B., Zhang, B., Bian, C., Yin, C., Lv, C., Pan, D., Wang, D., Yan, D., et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.

Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.

Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., and Chen, E. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.

Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.

Yuan, Z., Li, Z., and Sun, L. Tinygpt-v: Efficient multimodal large language model via small backbones. arXiv preprint arXiv:2312.16862, 2023.

Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.

Zhang, P., Wang, X. D. B., Cao, Y., Xu, C., Ouyang, L., Zhao, Z., Ding, S., Zhang, S., Duan, H., Yan, H., et al. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023a.

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Zhang, S., Dong, L., Li, X., Zhang, S., Sun, X., Wang, S., Li, J., Hu, R., Zhang, T., Wu, F., et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023b.

Zhang, X. and Yang, Q. Xuanyuan 2.0: A large chinese financial chat model with hundreds of billions parameters. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 4435–4439, 2023.

Zhang, Y., Zhang, R., Gu, J., Zhou, Y., Lipka, N., Yang, D., and Sun, T. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023c.

Zhao, B., Wu, B., and Huang, T. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023a.

Zhao, Y., Lin, Z., Zhou, D., Huang, Z., Feng, J., and Kang, B. Bubogpt: Enabling visual grounding in multi-modal llms. arXiv preprint arXiv:2307.08581, 2023b.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.

Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Zhu, J., Zhu, X., Wang, W., Wang, X., Li, H., Wang, X., and Dai, J. Uni-perceiver-moe: Learning sparse generalist models with conditional moes. Advances in Neural Information Processing Systems, 35:2664–2678, 2022.

Zhu, Y., Zhu, M., Liu, N., Ou, Z., Mou, X., and Tang, J. Llava-phi: Efficient multi-modal assistant with small language model, 2024.

Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N., and Fedus, W. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022.


Appendix for MoE-LLaVA

A. Implementation Details
A.1. More Model Architecture
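In Table 8, we present additional variants of MoE-LLaVA. The total number of parameters is calculated as in Equation (12). When the number of activated experts is 2, setting Experts = 2 yields the number of activated parameters.

\begin{equation}
\begin{aligned}
\text{Total Parameters} = {} & \text{Embedding} \cdot \text{Width} \\
& + \text{Layers} \cdot (4 \cdot \text{Width} \cdot \text{Width} + \text{Width} \cdot \text{FFN} \cdot \text{FFN Factor} + 2 \cdot \text{Width}) \\
& + \text{Width} + \text{Width} \cdot \text{Embedding} \\
& + \text{MoE Layers} \cdot (\text{Experts} - 1) \cdot (\text{Width} \cdot \text{FFN} \cdot \text{FFN Factor} + 2 \cdot \text{Width}) \\
& + \text{MoE Layers} \cdot (\text{Width} \cdot \text{Experts})
\end{aligned}
\tag{12}
\end{equation}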
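As a worked check of Equation (12), the following sketch evaluates the formula with the Qwen-1.8B configuration listed in Table 8. The function and variable names are hypothetical and chosen only for this illustration; the arithmetic follows the equation term by term.

```python
def total_parameters(embedding, width, layers, ffn, ffn_factor,
                     moe_layers=0, experts=1):
    """Evaluate Equation (12). With moe_layers=0 this is the dense model."""
    # Parameters of one FFN block as counted in Equation (12).
    ffn_block = width * ffn * ffn_factor + 2 * width
    dense = (
        embedding * width                              # input embedding
        + layers * (4 * width * width + ffn_block)     # attention + FFN per layer
        + width + width * embedding                    # final norm + output head
    )
    extra_experts = moe_layers * (experts - 1) * ffn_block  # experts beyond the first
    routers = moe_layers * (width * experts)                # one router per MoE layer
    return dense + extra_experts + routers


# Qwen-1.8B row of Table 8.
QWEN_1_8B = dict(embedding=151936, width=2048, layers=24, ffn=5504, ffn_factor=3)

dense_b = total_parameters(**QWEN_1_8B) / 1e9                              # dense model
total_b = total_parameters(**QWEN_1_8B, moe_layers=12, experts=4) / 1e9    # all 4 experts
active_b = total_parameters(**QWEN_1_8B, moe_layers=12, experts=2) / 1e9   # top-2 activated
print(f"{dense_b:.1f}B dense, {active_b:.1f}B activated, {total_b:.1f}B total")
```

Rounded to one decimal place, these values agree with the 1.8B, 2.2B, and 3.1B entries reported for Qwen-1.8B and MoE-LLaVA-1.8B×4-Top2 in Table 8.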

Table 8. More architecture details of the MoE-LLaVA model. “FFN Factor” represents the number of linear layers in the FFN. “*” denotes that the dimension of the hidden states for the keys (k) and values (v) is 1024. “1.6B×4-Top2” represents a dense foundation model with 1.6B parameters, which is equipped with a total of four experts, two of which are activated. “†” denotes that all layers are equipped with an MoE layer.

Name                               Experts  Top-k  MoE Layers  Embedding  Width  Layers  FFN    FFN Factor  Heads  Activated Param  Total Param
StableLM-1.6B (Team)               -        -      -           100352     2560   32      10240  2           32     1.6B             1.6B
MoE-LLaVA-1.6B×4-Top2              4        2      16          100352     2560   32      10240  2           32     2.0B             2.9B
MoE-LLaVA-1.6B×4-Top2†             4        2      32          100352     2560   32      10240  2           32     2.5B             4.1B
Qwen-1.8B (Bai et al., 2023a)      -        -      -           151936     2048   24      5504   3           16     1.8B             1.8B
MoE-LLaVA-1.8B×4-Top2              4        2      12          151936     2048   24      5504   3           16     2.2B             3.1B
MoE-LLaVA-1.8B×4-Top2†             4        2      24          151936     2048   24      5504   3           16     2.6B             4.3B
Phi2-2.7B (Microsoft, 2023)        -        -      -           51200      2560   32      10240  2           32     2.7B             2.7B
MoE-LLaVA-2.7B×4-Top2              4        2      16          51200      2560   32      10240  2           32     3.6B             5.3B
MoE-LLaVA-2.7B×4-Top2†             4        2      32          51200      2560   32      10240  2           32     4.5B             7.8B
OpenChat-7B (Wang et al., 2023a)   -        -      -           32000      4096*  32      14336  3           32     6.7B             6.7B
MoE-LLaVA-7B×4-Top2                4        2      16          32000      4096*  32      14336  3           32     9.6B             15.2B
MoE-LLaVA-7B×4-Top2†               4        2      32          32000      4096*  32      14336  3           32     12.4B            23.7B

A.2. Training Details

As shown in Table 9, we present the training hyperparameters for all models, which apply to Qwen, StableLM, Phi, and
OpenChat. In all stages, we consistently train for 1 epoch, as we find that the models overfit when training for 2 epochs.
The batch size is 256 for the first stage and 128 for the second and third stages. We use an image resolution of 336×336
for all three stages. Additionally, for smaller models like Qwen-1.8B, it is feasible to train them on 8 V100-32G GPUs.
However, using fp16 during training may sometimes cause the loss to become NaN. Since our models are smaller than 7B,
we can train them in zero2 mode. For stage 3, however, deepspeed temporarily does not support training the MoE
architecture in zero3 mode. Therefore, we choose zero2 offload to further reduce the memory requirements and enable
running on 8 A800-80G GPUs. We enable gradient checkpointing for all training stages.

Table 9. Training hyperparameters. A single value applies to all three stages.

Config                  Stage I    Stage II    Stage III
Experts                 -          -           4
Top-k                   -          -           2
Deepspeed               Zero2      Zero2       Zero2 offload
Data                    LLaVA-PT   Hybrid-PT   LLaVA-FT
Image resolution        336×336
Image encoder           CLIP-Large/336
Feature select layer    -2
Image projector         2 Linear layers with GeLU
Epoch                   1
Learning rate           1e-3       2e-5        2e-5
Learning rate schedule  Cosine
Weight decay            0.0
Text max length         2048
Batch size per GPU      32         16          16
GPU                     8 × A800-80G
Precision               Bf16
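
The relation between the per-GPU batch sizes in Table 9 and the global batch sizes quoted in the text can be checked with a short sketch. The code below is illustrative only and is not the authors' training script: the names `STAGES`, `global_batch_size`, and `deepspeed_config` are hypothetical, and a gradient accumulation of 1 is an assumption (it is the value that reproduces 256 and 128 on 8 GPUs). The dictionary keys follow the DeepSpeed JSON configuration format.

```python
N_GPUS = 8  # Table 9: 8 x A800-80G

# Per-stage values copied from Table 9; the dict layout itself is hypothetical.
STAGES = {
    "stage1": {"per_gpu_batch": 32, "lr": 1e-3, "zero": "zero2"},
    "stage2": {"per_gpu_batch": 16, "lr": 2e-5, "zero": "zero2"},
    "stage3": {"per_gpu_batch": 16, "lr": 2e-5, "zero": "zero2_offload"},
}


def global_batch_size(per_gpu_batch, n_gpus=N_GPUS, grad_accum=1):
    """Global batch = per-GPU batch x number of GPUs x accumulation steps."""
    return per_gpu_batch * n_gpus * grad_accum


def deepspeed_config(stage):
    """Build a DeepSpeed-style config dict for one stage (assumed layout)."""
    cfg = STAGES[stage]
    zero = {"stage": 2}
    if cfg["zero"] == "zero2_offload":
        # ZeRO-2 with optimizer state offloaded to CPU memory.
        zero["offload_optimizer"] = {"device": "cpu"}
    return {
        "train_micro_batch_size_per_gpu": cfg["per_gpu_batch"],
        "gradient_accumulation_steps": 1,  # assumption, see lead-in
        "bf16": {"enabled": True},         # Table 9: precision Bf16
        "zero_optimization": zero,
    }


if __name__ == "__main__":
    for name, cfg in STAGES.items():
        print(name, global_batch_size(cfg["per_gpu_batch"]), deepspeed_config(name))
```

Under these assumptions, stage I yields a global batch of 32 × 8 = 256, and stages II and III yield 16 × 8 = 128, matching the text.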


B. Additional Results and Visualization


B.1. Model Scaling

As shown in Table 10, for models smaller than 7B, we observe a strong scaling law. MoE-LLaVA exhibits improved
performance as the model size increases, as exemplified by StableLM-1.6B, Qwen-1.8B, and Phi-2.7B. Surprisingly,
however, the overall performance of OpenChat-MoE is significantly inferior to that of its dense counterpart. We speculate
that this may be because current multi-modal instruction tuning data is insufficient to support sparse pattern learning in
10B-level models, which should be addressed in future work when scaling up to larger MoE-LLaVA models.

Table 10. Ablation study about the model size of MoE-LLaVA.

Model     MoE  VQAv2  SQA^I  VQA^T  MMB   LLaVA^W
StableLM  ✗    74.5   62.0   48.8   58.2  83.2
StableLM  ✔    76.0   62.6   47.8   59.4  85.9
Qwen      ✗    74.9   60.2   48.3   60.6  86.3
Qwen      ✔    76.2   63.1   48.0   59.7  88.7
Phi-2     ✗    75.6   67.8   50.0   65.0  91.3
Phi-2     ✔    77.6   68.5   51.4   65.2  94.1
OpenChat  ✗    77.9   69.0   54.7   66.9  89.7
OpenChat  ✔    78.9   62.8   52.5   65.9  86.3
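
To make the comparison in Table 10 concrete, the following sketch computes the change in the unweighted mean over the five listed benchmarks when moving from the dense model to its MoE counterpart. The scores are copied from Table 10. The unweighted mean is an aggregate chosen here for illustration and is not a metric reported in the paper, and the names `TABLE_10` and `moe_gain` are hypothetical.

```python
# Scores from Table 10, in the order VQAv2, SQA^I, VQA^T, MMB, LLaVA^W.
TABLE_10 = {
    "StableLM": {"dense": [74.5, 62.0, 48.8, 58.2, 83.2],
                 "moe":   [76.0, 62.6, 47.8, 59.4, 85.9]},
    "Qwen":     {"dense": [74.9, 60.2, 48.3, 60.6, 86.3],
                 "moe":   [76.2, 63.1, 48.0, 59.7, 88.7]},
    "Phi-2":    {"dense": [75.6, 67.8, 50.0, 65.0, 91.3],
                 "moe":   [77.6, 68.5, 51.4, 65.2, 94.1]},
    "OpenChat": {"dense": [77.9, 69.0, 54.7, 66.9, 89.7],
                 "moe":   [78.9, 62.8, 52.5, 65.9, 86.3]},
}


def mean(values):
    return sum(values) / len(values)


def moe_gain(model):
    """Mean MoE score minus mean dense score over the five benchmarks."""
    row = TABLE_10[model]
    return mean(row["moe"]) - mean(row["dense"])


if __name__ == "__main__":
    for model in TABLE_10:
        print(f"{model:9s} {moe_gain(model):+.2f}")
```

Under this aggregate, the three smaller backbones gain from sparsification while OpenChat loses, which is the pattern the paragraph above describes.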

B.2. Training Capacity


For MoE layers, we employ the Batch Priority Routing (BPR) strategy (Riquelme et al., 2021). This strategy utilizes the
routing results to determine which tokens should be dropped, ensuring a more balanced workload among the experts. During
the training process, the BPR strategy dynamically adjusts the routing results for each expert based on their capacity. When
the tokens assigned to an expert exceed its predefined capacity, the excess tokens are dropped. We conduct an ablation
study on the capacity hyperparameter, as shown in Table 11. Increasing the capacity consistently improves the average
score (Avg) for different sizes of MoE-LLaVA.
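
The capacity mechanism described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the function name `bpr_top_k_dispatch` is hypothetical, the capacity formula (capacity factor × tokens × k / experts, rounded up) is a common convention assumed here, and priority is taken to be each token's highest routing probability, in the spirit of Batch Priority Routing.

```python
import math

import numpy as np


def bpr_top_k_dispatch(router_probs, k=2, capacity_factor=1.0):
    """Return (keep, capacity): keep[t, e] is True if token t is sent to expert e.

    router_probs: array of shape (num_tokens, num_experts) with routing scores.
    """
    num_tokens, num_experts = router_probs.shape
    # Assumed convention: each expert holds at most this many tokens.
    capacity = math.ceil(capacity_factor * num_tokens * k / num_experts)

    # Top-k experts for every token, best choice first.
    topk = np.argsort(-router_probs, axis=1)[:, :k]
    # Batch priority: tokens with a higher top-1 score are placed first.
    priority = np.argsort(-router_probs.max(axis=1), kind="stable")

    load = np.zeros(num_experts, dtype=int)
    keep = np.zeros((num_tokens, num_experts), dtype=bool)
    for choice in range(k):      # place all first choices, then all second choices
        for t in priority:
            e = topk[t, choice]
            if load[e] < capacity:
                keep[t, e] = True
                load[e] += 1
            # Otherwise the assignment is dropped: the expert is full.
    return keep, capacity


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(16, 4))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    for cf in (1.0, 1.5):
        keep, cap = bpr_top_k_dispatch(probs, k=2, capacity_factor=cf)
        print(f"capacity factor {cf}: capacity {cap}, kept {keep.sum()} of {16 * 2}")
```

A larger capacity factor lets each expert accept more tokens, so fewer assignments are dropped, which is consistent with the trend reported in Table 11.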

Table 11. Ablation study about the capacity of MoE-LLaVA. “Res.” represents the input image resolution. ∗ denotes that
there is some overlap in the training data. VQAv2, GQA, VisWiz, SQA^I, and VQA^T are image question answering
benchmarks; POPE, MMB, LLaVA^W, and MM-Vet are benchmark toolkits.

Methods                Res.  Capacity  VQAv2  GQA    VisWiz  SQA^I  VQA^T  POPE  MMB   LLaVA^W  MM-Vet  Avg
MoE-LLaVA-1.6B×4-Top2  336   1.5       76.7∗  60.3∗  36.2    62.6   50.1   85.7  60.2  86.8     26.9    60.6
MoE-LLaVA-1.6B×4-Top2  336   1.0       76.0∗  60.4∗  37.2    62.6   47.8   84.3  59.4  85.9     26.1    59.9
MoE-LLaVA-2.7B×4-Top2  336   1.5       77.6∗  61.4∗  43.9    68.5   51.4   86.3  65.2  94.1     34.3    64.7
MoE-LLaVA-2.7B×4-Top2  336   1.0       77.1∗  61.1∗  43.4    68.7   50.2   85.0  65.5  93.2     31.1    63.9
MoE-LLaVA-2.7B×4-Top2  384   1.5       79.9∗  62.6∗  43.7    70.3   57.0   85.7  68.0  97.3     35.9    66.7
MoE-LLaVA-2.7B×4-Top2  384   1.0       79.4∗  62.7∗  42.1    70.3   55.7   85.5  67.9  95.1     33.6    65.8

B.3. Routing Distributions


In this section, we present the routing distributions of MoE-LLaVA-OpenChat-7B×4-Top2, MoE-LLaVA-Phi-2.7B×4-Top2,
MoE-LLaVA-Qwen-1.8B×4-Top2, and MoE-LLaVA-StableLM-1.6B×4-Top2 on six benchmarks (ScienceQA-IMG (Lu
et al., 2022), TextVQA (Singh et al., 2019), POPE (Li et al., 2023d), MMBench (Liu et al., 2023d), VisWiz (Gurari et al.,
2018), MM-Vet (Yu et al., 2023)). These routing distributions are computed from the final training checkpoint.
MoE-LLaVA-OpenChat-7B×4-Top2 is a truly large model compared to our other settings. However, as shown in
Appendix B.1, its performance is not as good as expected. We provide the routing distribution of MoE-LLaVA-OpenChat
after sparsification in Figure 7. Even after three stages of training, the routing distributions of MoE-LLaVA-OpenChat
and MoE-LLaVA-Phi (Figure 8) differ significantly. MoE-LLaVA-OpenChat exhibits a relatively balanced distribution
overall, in terms of both expert loads and expert preferences for different modalities. In contrast, MoE-LLaVA-Phi, along
with other smaller models such as MoE-LLaVA-Qwen and MoE-LLaVA-StableLM, shows specific patterns; in other words,
their distributions are less uniform. For example, (1) in Figure 8, expert 3 of MoE-LLaVA-Phi dominates the workload
in layers 17-23. (2) In Figure 9, MoE-LLaVA-Qwen shows a strong preference for the image modality in expert 1.
(3) In Figure 10, experts 2 and 3 of MoE-LLaVA-StableLM are actively engaged in the middle layers of the model. We
believe the balanced distribution of MoE-LLaVA-OpenChat is highly likely due to the insufficient amount of current
multimodal fine-tuning data (655k in our setting) to enable sparsification for 10B-level models, even starting from a
well-initialized LVLM.
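
The two quantities discussed in this section, expert load and expert preference for a modality, can be computed from logged routing decisions roughly as follows. This is a sketch under assumptions rather than the authors' analysis code: the function name `routing_stats` and its inputs (a per-token list of selected expert indices for one MoE layer, plus a per-token image/text flag) are hypothetical.

```python
import numpy as np


def routing_stats(expert_ids, is_image, num_experts=4):
    """Compute expert load and image share for one MoE layer.

    expert_ids: (num_tokens, k) indices of the experts chosen for each token.
    is_image:   (num_tokens,) True for image tokens, False for text tokens.
    Returns (load, image_share), both of shape (num_experts,).
    """
    expert_ids = np.asarray(expert_ids)
    is_image = np.asarray(is_image, dtype=bool)
    k = expert_ids.shape[1]

    flat = expert_ids.ravel()                 # one entry per (token, choice)
    image_flag = np.repeat(is_image, k)       # modality of each entry

    total = np.bincount(flat, minlength=num_experts)
    image = np.bincount(flat[image_flag], minlength=num_experts)

    load = total / total.sum()                # share of all assignments per expert
    image_share = np.divide(                  # fraction of an expert's tokens that are image
        image, total, out=np.zeros(num_experts), where=total > 0
    )
    return load, image_share


if __name__ == "__main__":
    ids = [[0, 1], [0, 2], [1, 3], [0, 1]]    # 4 tokens, top-2 routing
    modality = [True, True, False, False]     # first two tokens are image tokens
    load, image_share = routing_stats(ids, modality)
    print("load:", load)
    print("image share:", image_share)
```

Repeating this per layer and per benchmark would give the kind of per-layer load and text/image breakdown that Figures 7 to 10 visualize.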


[Figure 7: six panels, (a) ScienceQA-IMG, (b) TextQA, (c) POPE, (d) MMBench, (e) Viswiz, (f) MM-Vet. Each panel plots Percentage (0% to 100%) against MoE layer (1 to 31): the load across all experts (Experts 1-4), and the Text/Image split for each of Experts 1-4.]

Figure 7. Distribution of expert loadings and expert preferences on MoE-LLaVA-OpenChat-7B×4-Top2.

It is worth reflecting on what behavior should be expected from a sparse MoE model. Should each expert exhibit specific
patterns, as in MoE-LLaVA-Phi, or should the experts behave similarly, as in MoE-LLaVA-OpenChat? At initialization,
the experts of a sparse model behave similarly, since they are all initialized from a shared FFN and the router has not
yet learned any inductive biases. If the routing distribution remains balanced as the network learns, the model stays
close to this initial state, which may lead to confusion in the model. Therefore, we speculate that the lack of sufficient
data may be a reason for the poor performance of MoE-LLaVA-OpenChat. However, due to the current limitations in data
and computational resources, we are unable to explore this further, and we hope that future work can make progress in
this direction.
Additionally, we provide more details in Figure 11, Figure 12, Figure 13, and Figure 14.
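
One possible way to quantify the distinction drawn above between a balanced routing distribution and one with specific patterns is the normalized entropy of the expert load. This metric is not used in the paper; it is offered only as an illustrative assumption, and the function name `load_balance_entropy` is hypothetical. A value of 1 means a perfectly uniform load, and a value of 0 means a single expert handles everything.

```python
import math


def load_balance_entropy(load):
    """Normalized Shannon entropy of an expert load vector, in [0, 1].

    load: non-negative token counts (or shares) per expert, not all zero.
    """
    total = sum(load)
    probs = [x / total for x in load if x > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    # Divide by the maximum entropy, reached when all experts share the load equally.
    return entropy / math.log(len(load))


if __name__ == "__main__":
    print(load_balance_entropy([25, 25, 25, 25]))  # balanced load
    print(load_balance_entropy([10, 10, 70, 10]))  # one dominant expert
```

Tracking this value per layer would show whether a model stays near the uniform state it has at initialization or moves toward specialized experts.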

[Figure 8: six panels, (a) ScienceQA-IMG, (b) TextQA, (c) POPE, (d) MMBench, (e) Viswiz, (f) MM-Vet. Each panel plots Percentage (0% to 100%) against MoE layer (1 to 31): the load across all experts (Experts 1-4), and the Text/Image split for each of Experts 1-4.]

Figure 8. Distribution of expert loadings and expert preferences on MoE-LLaVA-Phi-2.7B×4-Top2.

B.4. Token Pathways


In Figure 15, Figure 16, Figure 17, and Figure 18, we track the paths of each token for MoE-LLaVA-OpenChat-7B×4-Top2,
MoE-LLaVA-Phi-2.7B×4-Top2, MoE-LLaVA-Qwen-1.8B×4-Top2, and MoE-LLaVA-StableLM-1.6B×4-Top2, respectively.
In general, the overall trends of the token paths align with the analysis in Appendix B.3. The paths of
MoE-LLaVA-OpenChat-7B×4-Top2 appear more scattered and diverse, which we attribute to its more balanced expert
assignment. In contrast, MoE-LLaVA-Phi-2.7B×4-Top2, MoE-LLaVA-Qwen-1.8B×4-Top2, and
MoE-LLaVA-StableLM-1.6B×4-Top2 each exhibit their own specific patterns.


[Figure 9: six panels, (a) ScienceQA-IMG, (b) TextQA, (c) POPE, (d) MMBench, (e) Viswiz, (f) MM-Vet. Each panel plots Percentage (0% to 100%) against MoE layer (1 to 23): the load across all experts (Experts 1-4), and the Text/Image split for each of Experts 1-4.]

Figure 9. Distribution of expert loadings and expert preferences on MoE-LLaVA-Qwen-1.8B×4-Top2.

[Figure 10: six panels, (a) ScienceQA-IMG, (b) TextQA, (c) POPE, (d) MMBench, (e) Viswiz, (f) MM-Vet. Each panel plots Percentage (0% to 100%) against MoE layer (1 to 23): the load across all experts (Experts 1-4), and the Text/Image split for each of Experts 1-4.]

Figure 10. Distribution of expert loadings and expert preferences on MoE-LLaVA-StableLM-1.6B×4-Top2.

B.5. Exhibition Board
In Table 12, we present some classic examples using images from LLaVA (Liu et al., 2023c) and LLaVA-1.5 (Liu et al.,
2023b). We observe that MoE-LLaVA performs comparably to them on these classic images, despite using fewer parameters.


Table 12. Exhibition Board of MoE-LLaVA. MoE-LLaVA demonstrates the ability to detect and answer challenging questions when
prompted to verify them.

Visual input example, Tricky Question and Image:

User       If there are factual errors in the questions, point it out; if not, proceed answering the question.
           What’s happening in the desert?
LLaVA-1.5  There are no deserts in the image. The image features a beach with palm trees, a city skyline, and a large
           body of water.
MoE-LLaVA  There are no deserts in the image. The scene features a city with a beach and a large body of water, which is
           more likely to be a coastal city.

User       Do you know who drew this painting?
MoE-LLaVA  The painting is a portrait of the famous artist, Leonardo da Vinci.

User       Do you know who drew this painting?
MoE-LLaVA  The painting is a drawing of a dog, and it is not possible to definitively determine the artist who created it
           based on the image alone.

User       What is the name of the man in the photo?
MoE-LLaVA  The man in the photo is named Elon Musk.

User       What is the name of the man in the photo?
MoE-LLaVA  The man in the photo is named Elon Musk.

Visual input example, Common Question:

User       Can you explain this meme?
GPT-4 (OpenAI, 2023)
           This meme is a joke that combines two unrelated things: pictures of the earth from space and chicken
           nuggets. The text of the meme suggests that the image below is a beautiful picture of the earth from space.
           However, the image is actually of chicken nuggets arranged to vaguely resemble a map of the world. The
           humor in this meme comes from the unexpected juxtaposition of the text and the image. The text sets up an
           expectation of a majestic image of the earth, but the image is actually something mundane and silly.
MoE-LLaVA  The image features a pan filled with golden-brown fried chicken pieces arranged in the shape of a world
           map. There are numerous chicken pieces spread across the pan, with some pieces located near the center,
           while others are positioned towards the edges. The chicken pieces are placed in a way that resembles the
           continents and countries, creating a visually appealing and creative presentation. The pan itself is placed on a
           table, ready to be served and enjoyed.

User       What is unusual about this image?
MoE-LLaVA  The unusual aspect of this image is that a man is standing on a clothesline that is attached to the back of a
           moving taxi cab. This is not a typical scene, as clotheslines are usually found in residential areas, patios, or
           laundry rooms, and not on the back of a vehicle. The man is holding a shirt while standing on the clothesline,
           which adds to the unusualness of the situation. It is likely that the man is performing some task or trying
           to dry the shirt in an unconventional manner.

User       Describe the image and where’s the red arrow?
MoE-LLaVA  The image is a graph or chart with a red arrow pointing to the bottom. The graph is color-coded, with pink and
           green lines, and it appears to be a line graph. The red arrow is located at the bottom of the graph, indicating
           a specific point or value.


[Figure 11: six panels, (a) ScienceQA-IMG, (b) TextQA, (c) POPE, (d) MMBench, (e) Viswiz, (f) MM-Vet. Each panel plots, separately for Text and Image tokens, the Percentage (0% to 100%) handled by Experts 1-4 against MoE layer idx (1 to 31).]

Figure 11. Distribution of modalities across different experts on MoE-LLaVA-OpenChat-7B×4-Top2.

[Figure 12: six panels, (a) ScienceQA-IMG, (b) TextQA, (c) POPE, (d) MMBench, (e) Viswiz, (f) MM-Vet. Each panel plots, separately for Text and Image tokens, the Percentage (0% to 100%) handled by Experts 1-4 against MoE layer idx (1 to 31).]

Figure 12. Distribution of modalities across different experts on MoE-LLaVA-Phi-2.7B×4-Top2.

[Figure 13: six panels, (a) ScienceQA-IMG, (b) TextQA, (c) POPE, (d) MMBench, (e) Viswiz, (f) MM-Vet. Each panel plots, separately for Text and Image tokens, the Percentage (0% to 100%) handled by Experts 1-4 against MoE layer idx (1 to 23).]

Figure 13. Distribution of modalities across different experts on MoE-LLaVA-Qwen-1.8B×4-Top2.


[Figure 14: six panels, (a) ScienceQA-IMG, (b) TextQA, (c) POPE, (d) MMBench, (e) Viswiz, (f) MM-Vet. Each panel plots, separately for Text and Image tokens, the Percentage (0% to 100%) handled by Experts 1-4 against MoE layer idx (1 to 23).]

Figure 14. Distribution of modalities across different experts on MoE-LLaVA-StableLM-1.6B×4-Top2.

[Figure 15: six panels, (a) ScienceQA-IMG, (b) TextQA, (c) POPE, (d) MMBench, (e) Viswiz, (f) MM-Vet. Each panel shows, separately for Text and Image tokens, Expert idx (1 to 4) against MoE layer idx (1 to 31), with the Top-1, Top-2, and Others pathways marked.]

Figure 15. Visualization of activated pathways on MoE-LLaVA-OpenChat-7B×4-Top2.

[Figure 16: six panels, (a) ScienceQA-IMG, (b) TextQA, (c) POPE, (d) MMBench, (e) Viswiz, (f) MM-Vet. Each panel shows, separately for Text and Image tokens, Expert idx (1 to 4) against MoE layer idx (1 to 31), with the Top-1, Top-2, and Others pathways marked.]

Figure 16. Visualization of activated pathways on MoE-LLaVA-Phi-2.7B×4-Top2.


[Figure 17: six panels, (a) ScienceQA-IMG, (b) TextQA, (c) POPE, (d) MMBench, (e) Viswiz, (f) MM-Vet. Each panel shows, separately for Text and Image tokens, Expert idx (1 to 4) against MoE layer idx (1 to 23), with the Top-1, Top-2, and Others pathways marked.]

Figure 17. Visualization of activated pathways on MoE-LLaVA-Qwen-1.8B×4-Top2.

[Figure 18: six panels, (a) ScienceQA-IMG, (b) TextQA, (c) POPE, (d) MMBench, (e) Viswiz, (f) MM-Vet. Each panel shows, separately for Text and Image tokens, Expert idx (1 to 4) against MoE layer idx (1 to 23), with the Top-1, Top-2, and Others pathways marked.]

Figure 18. Visualization of activated pathways on MoE-LLaVA-StableLM-1.6B×4-Top2.
