MoE-LLaVA: Efficient Sparse LVLMs
Bin Lin 1 Zhenyu Tang 1 Yang Ye 2 Jinfa Huang 3 Junwu Zhang 1 Yatian Pang 4 1 Peng Jin 1 Munan Ning 1 5
Jiebo Luo 3 Li Yuan 1 5
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
[Figure: diagram. Legend: activated forward, non-activated forward, copy weight; trainable parameter, non-trainable parameter. Labeled blocks: Image, Vision Encoder, MLP, Word Embedding, MoE, Add & Norm. Inputs: Caption Request, Instruction Request. Outputs: Generated Text, Caption Response, Response.]
2023b), aligning the input (Lin et al., 2023) and projection layers (Cha et al., 2023; Alayrac et al., 2022; Bai et al., 2023b; Dai et al., 2023; Ye et al., 2023; Zhao et al., 2023a). These works empowered LVLMs with powerful visual understanding capabilities by expanding the visual instruction fine-tuning datasets and model scales.

Currently, some works have endowed LVLMs with fine-grained image understanding capabilities, such as region understanding (Chen et al., 2023c; Zhao et al., 2023b; Liu et al., 2023e), multi-region understanding (Wang et al., 2023c; Pi et al., 2023; Peng et al., 2023), and pixel-wise grounding (Rasheed et al., 2023; Lai et al., 2023). However, the cost of scaling up dense visual data and models is challenging to bear (Liu et al., 2022; Yin et al., 2023). In this work, we aim to make state-of-the-art LVLM research more accessible by leveraging mixture of experts.

2.2. Mixture of Experts in Multi-modal Learning

Mixture of Experts (MoE) (Jacobs et al., 1991; Eigen et al., 2013) is a hybrid model consisting of multiple sub-models, known as experts, which are integrated together. The key concept of MoE is the use of a router to determine the token set that each expert handles, thereby reducing interference between different types of samples.

Hard Routers. In the hard router mode, each expert is typically pre-defined as a specific pattern. This is because multi-modal data naturally exhibit gaps (Liang et al., 2022), making it difficult for soft routers to learn the optimal patterns for assigning tokens to different experts. A series of works (Bao et al., 2022; Long et al., 2023; Satar et al., 2022; Wang et al., 2022; Shen et al., 2023) naturally decouple experts based on modal categories and pre-define each expert to handle a specific modality. An important feature of these hard routers is that the router itself does not need to be learned. This mode is also widely applied in task-specific MoE (Li et al., 2023e; Zhu et al., 2022; Ma et al., 2023; Kudugunta et al., 2021).

Soft Routers. Some works (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2022; Zoph et al., 2022; Komatsuzaki et al., 2022) in natural language processing have explored MoE based on soft routers. Soft routers enable dynamic allocation of data among different experts, allowing each expert to focus on its expertise and achieve model sparsity. Therefore, our main focus is on leveraging soft routers in the MoE. Small-scale (million-level) models based on soft routers have also been explored in the context of multi-modal learning, such as EVE (Chen et al., 2023a) and LIMoE (Mustafa et al., 2022), which attempt a fusion of data by using soft routers. The work most relevant to ours is MoCLE (Gou et al., 2023). However, MoCLE clusters different instruction sets and distributes them to different experts, which compromises the flexibility and autonomy of the experts. In contrast, MoE-LLaVA relies on knowledge-rich routers to distribute tokens to different paths.
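To make the distinction between hard and soft routers concrete, the following minimal Python sketch contrasts the two. It is an illustration under stated assumptions, not the routing code of MoE-LLaVA or of any cited work: the linear router (`router_weights`), the renormalization of the top-k probabilities, and the function names `soft_route` and `hard_route` are all hypothetical choices made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_route(token, router_weights, k=2):
    """Soft router: a learned linear layer scores every expert for a token.

    Returns [(expert_index, weight), ...] for the k highest-scoring experts.
    Renormalizing the kept probabilities to sum to 1 is an assumption of
    this sketch, not something stated in the text above.
    """
    # One logit per expert: dot product of the token with that expert's row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def hard_route(modality, modality_to_expert):
    """Hard router: the expert is fixed in advance by modality, nothing is learned."""
    return [(modality_to_expert[modality], 1.0)]


# Hypothetical toy setup: 4 experts, 2-dimensional tokens.
router_weights = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
soft_assignment = soft_route([1.0, 0.0], router_weights, k=2)
hard_assignment = hard_route("image", {"text": 0, "image": 1})
```

In the soft case the assignment depends on the token content and on learned weights, so different tokens of the same modality can reach different experts; in the hard case every image token goes to the same pre-defined expert.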
Table 1. Architecture details of the MoE-LLaVA model. “FFN Factor” represents the number of linear layers in the FFN. “1.6B×4-Top2”
represents a dense foundation model with 1.6B parameters, which is equipped with a total of four experts, two of them being activated.
3.4. Training Objectives

The total loss $\mathcal{L}_{\text{total}}$ consists of the auto-regressive loss $\mathcal{L}_{\text{regressive}}$ and the auxiliary loss $\mathcal{L}_{\text{aux}}$, where the auxiliary loss is scaled by the balancing coefficient $\alpha$:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{regressive}} + \alpha \cdot \mathcal{L}_{\text{aux}}. \tag{7}$$

Auto-Regressive Loss. We optimize the output of the LLM through a generative loss in an auto-regressive manner. Given an image and text, MoE-LLaVA generates the output sequence $\mathcal{Y} = [y_1, y_2, \cdots, y_K] \in \mathbb{R}^{K \times D}$ by progressively generating each element, where $K = P + N$ represents the length of the output sequence. The formula is:

$$\mathcal{L}_{\text{regressive}} = -\sum_{i=1}^{N} \log p_\theta\left(\mathcal{Y}^{[P+i]} \mid \mathcal{V}, \mathcal{T}^{[:i-1]}\right), \tag{8}$$

where $\theta$ is a trainable parameter and we only calculate the loss for the generated text.

Auxiliary Loss. Due to the presence of multiple experts, it is necessary to impose load balancing constraints on the MoE layer. We incorporate a differentiable load balancing loss (Fedus et al., 2022) into each MoE layer to encourage experts to handle tokens in a balanced manner as follows:

$$\mathcal{L}_{\text{aux}} = E \cdot \sum_{i=1}^{E} \mathcal{F}_i \cdot \mathcal{G}_i, \tag{9}$$

where $\mathcal{F}$ represents the fraction of tokens processed by each expert $\mathcal{E}_i$, and $\mathcal{G}$ represents the average routing probability of $\mathcal{E}_i$, which can be expressed by the following formulas:

$$\mathcal{F} = \frac{1}{K} \sum_{i=1}^{E} 1\{\operatorname{argmax} \mathcal{P}(x) = i\}, \tag{10}$$

$$\mathcal{G} = \frac{1}{K} \sum_{i=1}^{K} \mathcal{P}(x)_i. \tag{11}$$

4. Experiments

4.1. Experimental Setup

Model Settings. Following LLaVA 1.5 (Liu et al., 2023b), we utilize CLIP-Large (Radford et al., 2021) as the vision encoder, and the MLP consists of two linear layers with a GELU activation function (Hendrycks & Gimpel, 2016) between them. Unless otherwise specified, MoE-LLaVA employs an alternating replacement of FFN with MoE layers, meaning that the number of MoE layers is half of the total number of layers. The value of the balancing coefficient α is 0.01. We provide additional training details in Appendix A.2.

Data Details. As shown in Table 2, we reorganize the currently available data for the three-stage training. For the first stage of pretraining, we use the pretraining data of LLaVA 1.5-558k (Liu et al., 2023b). For the second stage, we collect datasets from MIMIC-IT (Li et al., 2023a), LRV (Liu et al., 2023a), SViT (Zhao et al., 2023a) and LVIS (Wang et al., 2023b) to provide a robust initialization for MoE-LLaVA. For the third stage, we utilize the same data pipeline as LLaVA-mix-665k (Liu et al., 2023b).

4.2. Image Understanding Evaluation

Zero-shot Image Question Answering. As shown in Table 3, since MoE-LLaVA is a sparse model equipped with a soft router based on LVLM, we categorize the previous models as dense models. We evaluate the performance of MoE-LLaVA on five image question-answering benchmarks and report the number of activated parameters. Compared to the state-of-the-art method LLaVA 1.5, MoE-LLaVA demonstrates powerful image understanding capabilities and performs very close to LLaVA-1.5 on five benchmarks. Specifically, MoE-LLaVA-Phi-2.7B×4 surpasses LLaVA-1.5-7B by 2.7% on SQA^I using 3.6B sparsely activated parameters. Notably, MoE-LLaVA-StableLM-1.6B×4 achieves comprehensive superiority over IDEFICS-80B with only 2.0B activated parameters. We also compare against the recent small-scale vision-language model LLaVA-Phi: MoE-LLaVA-Phi-2.7B×4 outperforms LLaVA-Phi by more than 6.2% on VQAv2, highlighting the strong comprehension abilities of MoE-LLaVA in natural vision.

Evaluation under Benchmark Toolkits. To comprehensively evaluate the multi-modal understanding capabilities of MoE-LLaVA, we evaluate its performance on four benchmark toolkits. These benchmark toolkits typically involve open-ended answers, serving as tools to verify a model’s ability to engage in natural language questioning. In Table 3, MoE-LLaVA-Qwen-1.8B×4 surpasses Qwen-VL-7B by 21.5% on MMBench, despite the latter utilizing higher image resolutions. These results collectively demonstrate that the sparse model MoE-LLaVA achieves comparable or even superior performance to dense models with fewer activated parameters.

Table 3. Comparison among different LVLMs on image understanding benchmarks. “Res.”, “Act.”, “L”, “V”, “S”, “Q”, “P”, “M” and “I” respectively represent the input image resolution, activated parameters, LLaMA (Touvron et al., 2023a), Vicuna (Chiang et al., 2023), StableLM (Team), Qwen (Bai et al., 2023a), Phi-2 (Microsoft, 2023), MobileLLaMA (Chu et al., 2023) and IDEFICS (Laurençon et al., 2023). Evaluation benchmarks include VQA-v2 (Goyal et al., 2017); GQA (Hudson & Manning, 2019); VisWiz (Gurari et al., 2018); SQA^I: ScienceQA-IMG (Lu et al., 2022); VQA^T: TextVQA (Singh et al., 2019); POPE (Li et al., 2023d); MME (Fu et al., 2023); MMB: MMBench (Liu et al., 2023d); LLaVA^W: LLaVA-Bench (in-the-Wild) (Liu et al., 2023c); MM-Vet (Yu et al., 2023). ∗ denotes that there is some overlap in the training data. † denotes that the model is trained with an image resolution of 384. The best results and second best results are indicated by boldface and underline, respectively.

Table 4. Zero-shot object hallucination evaluation results. “Yes” indicates the proportion of positive responses to the given question.

surpasses LLaVA-1.5-13B by 1.0%, 1.5%, and 0.8% in adversarial sampling, popular sampling, and random sampling, respectively, with 2.2B activated parameters. Additionally, we observe that the yes ratio of MoE-LLaVA remains relatively balanced, indicating that our sparse model is capable of providing accurate feedback based on the given questions.
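As a rough illustration of the training objective in Section 3.4, the sketch below computes the load-balancing term of Eq. (9) and combines it with a regressive loss as in Eq. (7). It assumes the per-expert reading used by Fedus et al. (2022): F_i is the fraction of the K tokens whose highest routing probability falls on expert i, and G_i is the mean routing probability of expert i over those tokens. The function names and toy inputs are hypothetical; only α = 0.01 comes from the experimental setup.

```python
def load_balancing_loss(router_probs):
    """Auxiliary loss E * sum_i F_i * G_i for one MoE layer.

    router_probs: list of K rows (one per token), each holding E routing
    probabilities that sum to 1.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    frac_tokens = [0.0] * num_experts  # F_i: share of tokens whose argmax is expert i
    mean_prob = [0.0] * num_experts    # G_i: average routing probability of expert i
    for probs in router_probs:
        top1 = max(range(num_experts), key=lambda i: probs[i])
        frac_tokens[top1] += 1.0 / num_tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
    return num_experts * sum(f * g for f, g in zip(frac_tokens, mean_prob))


def total_loss(regressive_loss, aux_loss, alpha=0.01):
    """Eq. (7): regressive loss plus the alpha-scaled auxiliary loss."""
    return regressive_loss + alpha * aux_loss


# Toy check with 4 tokens and 4 experts (values chosen for illustration only).
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4  # every token prefers expert 0

balanced_aux = load_balancing_loss(balanced)
collapsed_aux = load_balancing_loss(collapsed)
```

Under this reading the term is smallest when tokens are spread evenly across experts and grows when routing collapses onto one expert, which is the behavior the auxiliary loss is meant to penalize.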
[Figure 4 plots: five panels; x-axis: MoE layer (1 to 31); y-axis: percentage.]
Figure 4. Distribution of expert loadings. The discontinuous lines represent a perfectly balanced distribution of tokens among different
experts or modalities. The first figure on the left illustrates the workload among experts, while the remaining four figures depict the
preferences of experts towards different modalities.
[Figure 5 plots: two panels, Text and Image; legend: Expert 1, Expert 2, Expert 3, Expert 4; x-axis: MoE layer idx (1 to 31); y-axis: percentage.]

Figure 5. Distribution of modalities across different experts. Interrupted lines mean a perfectly balanced distribution of tokens.

[Figure 6 plots: two panels, Text and Image; x-axis: MoE layer idx (1 to 31); y-axis: expert idx (1 to 4); legend: Top-1, Top-2, Others.]

Figure 6. Visualization of activated pathways. We highlight the top-10 activated pathways on the text and image. Among them, the colorful paths represent the top-2 paths for text and image, respectively, while the gray paths represent the remaining 8 paths.

the model gradually becomes sparser, the expert 3 loads for layers 17 to 27 suddenly increase, and they even dominate the workload of almost all tokens. For the shallow layers (5-11), experts 2, 3, and 4 mainly collaborate. It is worth noting that expert 1 only works predominantly in the first few layers, and as the model becomes deeper, expert 1 gradually withdraws from the workload. Therefore, the experts in MoE-LLaVA have learned a certain pattern that allows them to divide their tasks in a specific manner.

Furthermore, we show the distribution of modalities across different experts in Figure 5. Similarly, experts develop their own preferences. Additionally, we find that the routing distributions for text and image are highly similar. For example, when expert 3 is actively working in layers 17-27, the proportions of text and image that MoE-LLaVA processes are similar. Each expert in MoE-LLaVA is capable of handling both text tokens and image tokens simultaneously, which demonstrates that MoE-LLaVA does not exhibit a clear preference for any modality. This serves as evidence of its strong interaction in multimodal learning.

Token Pathways. Furthermore, we examine the behavior of experts at the token level. More visualizations can be found in Appendix B.4 and Appendix B.5. We track the trajectories of all tokens on downstream tasks. For all activated pathways, we employ PCA (Pearson, 1901) to obtain the top-10 pathways, as shown in Figure 6. We find that for a given unseen text token or image token, MoE-LLaVA consistently tends to assign experts 2 and 3 to handle it in the deeper layers of the model. Experts 1 and 4 tend to handle the tokens during the initialization phase. These findings contribute to a better understanding of the behavior of sparse models in multi-modal learning.

4.5. Ablation Study

In this section, we first validate the necessity of the three-stage training strategy. We then explore the impact of different base models and conduct ablation studies on the number of experts and active experts, and the MoE structure. We provide additional results in Appendix B.2.

Effect of Training Strategy. In Table 6, we conduct three variant experiments to demonstrate the rationale behind using the second-stage instruction tuning as the initialization for the third-stage MoE tuning. When adapting MoE to LVLMs, a straightforward approach is to replace the classic LLaVA’s FFN with a MoE layer and train it according to the original second-stage script, denoted as variant (a). However, variant (a) performs the worst, suggesting that the current multi-modal instruction dataset is insufficient to support both the conversion from LLM to LVLM and the conversion from LVLM to a sparse model simultaneously. Therefore, we collect more data, referred to as Hybrid-FT,
Table 5. Ablation study about training setting and architecture design decisions. Settings for results in Table 3 and Table 4 are highlighted in blue. We report the training time on 8 V100-32G.

(a) Tuning the parameters of different subsets.
Subset   GQA    VisWiz   VQA^T   POPE   LLaVA^W   Time
FFN      61.5   32.6     48.0    87.0   88.7      20h
All      61.3   31.9     47.6    87.0   88.1      27h

(b) The number of experts.
Experts  GQA    SQA^I    VQA^T   POPE   LLaVA^W   Time
1        60.9   60.2     48.3    86.4   86.3      13h
2        61.2   60.8     47.0    87.5   86.5      14h

(c)
Top-k    VQAv2  GQA    SQA^I   VQA^T   POPE   Time
1        74.5   58.4   58.0    44.0    85.7   19h
2        76.2   61.5   63.1    48.0    88.7   20h

(d)
Architecture   VQAv2  GQA    SQA^I   VQA^T   POPE   Time
First-Half     75.9   61.3   62.4    47.0    86.9   20h
Second-Half    76.3   61.2   62.6    47.2    86.9   20h
Interval       76.2   61.5   63.1    48.0    88.7   20h
All            74.5   61.5   62.1    47.1    87.0   32h
and initially convert LLM to LVLM in the second stage. Subsequently, in the third stage, LVLM is sparsified by using the LLaVA-FT dataset, resulting in variant (b). Additionally, we expand the data of the original LLaVA’s second stage for fair comparison, denoted as variant (c). The results indicate that variant (b) outperforms variants (a) and (c). These findings demonstrate that providing a reasonable LVLM initialization allows the model to transition rapidly from a dense model to a sparse model, validating the principle behind our three-stage training strategy.

Effect of Tuning the Parameters of Different Subsets. In Table 5a, we examine the performance of fine-tuning different parts of the parameters. “FFN” represents fine-tuning all FFN layers and MoE layers in the model. “All” indicates fine-tuning all parameters. The results indicate that tuning the FFN is sufficient to achieve results comparable to full-parameter tuning, while requiring only approximately 75% of the time. Therefore, to enhance generalization and reduce training costs, we only fine-tune FFN layers.

Effect of the Architectures. In Table 5d, we explore four variations of MoE architecture. Specifically, “First-Half” indicates that MoE layers are applied only to the first half of the model while the second half retains the original dense architecture. “Second-Half” means that MoE layers are placed in the second half of the model while the first half remains dense. “Interval” represents alternating occurrences of MoE layers and dense layers. “All” indicates that all layers are sparse MoE layers. Intuitively, one might expect that making all layers MoE layers would enhance performance. However, using “All” does not yield better results and leads to longer training times compared to the other architectures. Therefore, MoE-LLaVA alternates the insertion of MoE layers.

Effect of the Model Size. As shown in Table 7, we compare the performance of models with different parameter sizes as the foundation models for MoE-LLaVA. For smaller models such as Phi2-MoE and Qwen-MoE, the performance with MoE surpasses that of dense models. We provide additional results in Appendix B.1.
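The four architecture variants compared in Table 5d can be sketched as a per-layer mask saying which transformer layers use an MoE block instead of a dense FFN. This is a minimal illustration, not the authors' implementation: the function name `moe_layer_mask` is hypothetical, and whether “Interval” starts with an MoE layer or a dense layer is an assumption of this sketch.

```python
def moe_layer_mask(num_layers, architecture):
    """Return a list of booleans; True means the layer uses an MoE block.

    architecture: one of "First-Half", "Second-Half", "Interval", "All".
    """
    half = num_layers // 2
    if architecture == "First-Half":
        return [i < half for i in range(num_layers)]
    if architecture == "Second-Half":
        return [i >= half for i in range(num_layers)]
    if architecture == "Interval":
        # Assumption: the alternation starts with an MoE layer at index 0.
        return [i % 2 == 0 for i in range(num_layers)]
    if architecture == "All":
        return [True] * num_layers
    raise ValueError(f"unknown architecture: {architecture}")


# Hypothetical 32-layer model.
interval_mask = moe_layer_mask(32, "Interval")
all_mask = moe_layer_mask(32, "All")
```

With any of the first three variants, half of the layers are MoE layers, which matches the statement in the model settings that the number of MoE layers is half of the total; only “All” doubles the number of sparse layers, consistent with its longer training time in Table 5.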
Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Muyan, Z., Zhang, Q., Zhu, X., Lu, L., et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023e.
Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See [Link] lmsys. org (accessed 14 April 2023), 2023.
Chu, X., Qiao, L., Lin, X., Xu, S., Yang, Y., Hu, Y., Wei, F., Zhang, X., Zhang, B., Wei, X., et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.
Dai, W., Li, J., Li, D., Tiong, A. M. H., Zhao, J., Wang, W., Li, B., Fung, P., and Hoi, S. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
Du, Z., Qian, Y., Liu, X., Ding, M., Qiu, J., Yang, Z., and Tang, J. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021.
Eigen, D., Ranzato, M., and Sutskever, I. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314, 2013.
falconry. Falcon-180b. [Link] ae/, 2023.
Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232–5270, 2022.
FlagAI-Open. Aquila2-34b. [Link] FlagAI-Open/Aquila2, 2023.
Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., and Ji, R. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
Gong, T., Lyu, C., Zhang, S., Wang, Y., Zheng, M., Zhao, Q., Liu, K., Zhang, W., Luo, P., and Chen, K. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.
Gou, Y., Liu, Z., Chen, K., Hong, L., Xu, H., Li, A., Yeung, D.-Y., Kwok, J. T., and Zhang, Y. Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv preprint arXiv:2312.12379, 2023.
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913, 2017.
Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., Luo, J., and Bigham, J. P. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3608–3617, 2018.
Hendrycks, D. and Gimpel, K. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
Hudson, D. A. and Manning, C. D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6700–6709, 2019.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7b, 2023.
Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., de las Casas, D., Hanna, E. B., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L. R., Saulnier, L., Lachaux, M.-A., Stock, P., Subramanian, S., Yang, S., Antoniak, S., Scao, T. L., Gervet, T., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mixtral of experts, 2024.
Koh, J. Y., Salakhutdinov, R., and Fried, D. Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823, 2023.
Komatsuzaki, A., Puigcerver, J., Lee-Thorp, J., Ruiz, C. R., Mustafa, B., Ainslie, J., Tay, Y., Dehghani, M., and Houlsby, N. Sparse upcycling: Training mixture-of-experts from dense checkpoints. arXiv preprint arXiv:2212.05055, 2022.
Kudugunta, S., Huang, Y., Bapna, A., Krikun, M., Lepikhin, D., Luong, M.-T., and Firat, O. Beyond distillation: Task-level mixture-of-experts for efficient inference. arXiv preprint arXiv:2110.03742, 2021.
Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., and Jia, J. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.
Laurençon, H., Saulnier, L., Tronchon, L., Bekman, S., Singh, A., Lozhkov, A., Wang, T., Karamcheti, S., Rush, A. M., Kiela, D., Cord, M., and Sanh, V. Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023.
Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
Li, B., Zhang, Y., Chen, L., Wang, J., Pu, F., Yang, J., Li, C., and Liu, Z. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023a.
Li, J., Li, D., Xiong, C., and Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pp. 12888–12900. PMLR, 2022.
Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023b.
Li, X., Yao, Y., Jiang, X., Fang, X., Meng, X., Fan, S., Han, P., Li, J., Du, L., Qin, B., et al. Flm-101b: An open llm and how to train it with 100 k budget. arXiv preprint arXiv:2309.03852, 2023c.
Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023d.
Li, Y., Hui, B., Yin, Z., Yang, M., Huang, F., and Li, Y. Pace: Unified multi-modal dialogue pre-training with progressive and compositional experts. arXiv preprint arXiv:2305.14839, 2023e.
Liang, V. W., Zhang, Y., Kwon, Y., Yeung, S., and Zou, J. Y. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.
Lin, B., Zhu, B., Ye, Y., Ning, M., Jin, P., and Yuan, L. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., and Wang, L. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023a.
Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023b.
Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023c.
Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023d.
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12009–12019, 2022.
Liu, Z., He, Y., Wang, W., Wang, W., Wang, Y., Chen, S., Zhang, Q., Lai, Z., Yang, Y., Li, Q., et al. Interngpt: Solving vision-centric tasks by interacting with chatgpt beyond language. arXiv preprint arXiv:2305.05662, 3, 2023e.
Long, Z., Killick, G., McCreadie, R., and Camarasa, G. A. Multiway-adapater: Adapting large-scale multi-modal models for scalable image-text retrieval. arXiv preprint arXiv:2309.01516, 2023.
Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu, S.-C., Tafjord, O., Clark, P., and Kalyan, A. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
Ma, G., Wu, X., Wang, P., and Hu, S. Cot-mote: Exploring contextual masked auto-encoder pre-training with mixture-of-textual-experts for passage retrieval. arXiv preprint arXiv:2304.10195, 2023.
Microsoft. Phi-2: The surprising power of small language models. [Link] [Link]/en-us/research/blog/phi-2-the-surprising-power-of-small-language-mode 2023.
Mustafa, B., Riquelme, C., Puigcerver, J., Jenatton, R., and Houlsby, N. Multimodal contrastive learning with limoe: the language-image mixture of experts. Advances in Neural Information Processing Systems, 35:9564–9576, 2022.
OpenAI. Gpt-4 technical report, 2023.
Pearson, K. Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin philosophical magazine and journal of science, 2(11):559–572, 1901.
Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., and Wei, F. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
Pi, R., Gao, J., Diao, S., Pan, R., Dong, H., Zhang, J., Yao, L., Han, J., Xu, H., and Zhang, L. K. T. Detgpt: Detect what you need via reasoning. arXiv preprint arXiv:2305.14167, 2023.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
Rasheed, H., Maaz, M., Shaji, S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R. M., Xing, E., Yang, M.-H., and Khan, F. S. Glamm: Pixel grounding large multimodal model. arXiv preprint arXiv:2311.03356, 2023.
Riquelme, C., Puigcerver, J., Mustafa, B., Neumann, M., Jenatton, R., Susano Pinto, A., Keysers, D., and Houlsby, N. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583–8595, 2021.
Satar, B., Zhu, H., Zhang, H., and Lim, J. H. Rome: Role-aware mixture-of-expert transformer for text-to-video retrieval. arXiv preprint arXiv:2206.12845, 2022.
Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
Shen, S., Yao, Z., Li, C., Darrell, T., Keutzer, K., and He, Y. Scaling vision-language models with sparse mixture of experts. arXiv preprint arXiv:2303.07226, 2023.
Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., and Rohrbach, M. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8317–8326, 2019.
Sun, T., Zhang, X., He, Z., Li, P., Cheng, Q., Yan, H., Liu, X., Shao, Y., Tang, Q., Zhao, X., et al. Moss: Training conversational language models from synthetic data. arXiv preprint arXiv:2307.15020, 7, 2023.
SUSTech-IDEA. Sus-chat: Instruction tuning done right. [Link] SUS-Chat, 2023.
Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. [Link] stanford. edu/2023/03/13/alpaca. html, 3(6):7, 2023.
Team, I. Internlm: A multilingual language model with progressively enhanced capabilities, 2023.
Team, S. A. L. Stable lm 2 1.6b. URL [[Link] stablelm-2-1.6b]([Link] co/stabilityai/stablelm-2-1.6b).
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
Wang, G., Cheng, S., Zhan, X., Li, X., Song, S., and Liu, Y. Openchat: Advancing open-source language models with mixed-quality data. arXiv preprint arXiv:2309.11235, 2023a.
Wang, J., Meng, L., Weng, Z., He, B., Wu, Z., and Jiang, Y.-G. To see is to believe: Prompting gpt-4v for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023b.
Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D. F., and Chao, L. S. Learning deep transformer models for machine translation. arXiv preprint arXiv:1906.01787, 2019.
Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O. K., Singhal, S., Som, S., et al. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022.
Wang, W., Chen, Z., Chen, X., Wu, J., Zhu, X., Zeng, G., Luo, P., Lu, T., Zhou, J., Qiao, Y., et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023c.
Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023d.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Yang, A., Xiao, B., Wang, B., Zhang, B., Bian, C., Yin, C., Lv, C., Pan, D., Wang, D., Yan, D., et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., and Chen, E. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.
Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
Yuan, Z., Li, Z., and Sun, L. Tinygpt-v: Efficient multimodal large language model via small backbones. arXiv preprint arXiv:2312.16862, 2023.
Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 4435–4439, 2023.
Zhang, Y., Zhang, R., Gu, J., Zhou, Y., Lipka, N., Yang, D., and Sun, T. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023c.
Zhao, B., Wu, B., and Huang, T. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023a.
Zhao, Y., Lin, Z., Zhou, D., Huang, Z., Feng, J., and Kang, B. Bubogpt: Enabling visual grounding in multi-modal llms. arXiv preprint arXiv:2307.08581, 2023b.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Zhu, J., Zhu, X., Wang, W., Wang, X., Li, H., Wang, X., and Dai, J. Uni-perceiver-moe: Learning sparse generalist models with conditional moes. Advances in Neural Information Processing Systems, 35:2664–2678, 2022.
Zhu, Y., Zhu, M., Liu, N., Ou, Z., Mou, X., and Tang, J. Llava-phi: Efficient multi-modal assistant with small language model, 2024.
Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N., and Fedus, W. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022.
Zhang, P., Wang, X. D. B., Cao, Y., Xu, C., Ouyang, L.,
Zhao, Z., Ding, S., Zhang, S., Duan, H., Yan, H., et al.
Internlm-xcomposer: A vision-language large model for
advanced text-image comprehension and composition.
arXiv preprint arXiv:2309.15112, 2023a.
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M.,
Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V.,
et al. Opt: Open pre-trained transformer language models.
arXiv preprint arXiv:2205.01068, 2022.
Zhang, S., Dong, L., Li, X., Zhang, S., Sun, X., Wang, S.,
Li, J., Hu, R., Zhang, T., Wu, F., et al. Instruction tuning
for large language models: A survey. arXiv preprint
arXiv:2308.10792, 2023b.
A. Implementation Details
A.1. More Model Architecture
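Table 8 presents additional variants of MoE-LLaVA. The total number of parameters is calculated with Equation 12. When the number of activated experts is 2, setting Experts = 2 yields the number of activated parameters.
$$
\begin{aligned}
\text{Total Parameters} ={} & \text{Embedding} \cdot \text{Width} \\
& + \text{Layers} \cdot (4 \cdot \text{Width} \cdot \text{Width} + \text{Width} \cdot \text{FFN} \cdot \text{FFN Factor} + 2 \cdot \text{Width}) \\
& + \text{Width} + \text{Width} \cdot \text{Embedding} \\
& + \text{MoE Layers} \cdot (\text{Experts} - 1) \cdot (\text{Width} \cdot \text{FFN} \cdot \text{FFN Factor} + 2 \cdot \text{Width}) \\
& + \text{MoE Layers} \cdot (\text{Width} \cdot \text{Experts})
\end{aligned}
\tag{12}
$$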
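The parameter formula above can be evaluated directly. The following is a minimal sketch, not the authors' code: the function name, argument names, and the toy configuration are hypothetical, and the values are not taken from Table 8. It shows that the same expression gives the total count for the full number of experts and the activated count when Experts is set to 2.

```python
def total_parameters(embedding, width, layers, ffn, ffn_factor, moe_layers, experts):
    """Evaluate the total-parameter formula term by term (names are illustrative)."""
    # Width * FFN * FFN Factor + 2 * Width: the block that is repeated per expert.
    ffn_block = width * ffn * ffn_factor + 2 * width
    embedding_term = embedding * width
    per_layer_term = layers * (4 * width * width + ffn_block)
    output_term = width + width * embedding
    # Each MoE layer adds (Experts - 1) further copies of the FFN block ...
    extra_expert_term = moe_layers * (experts - 1) * ffn_block
    # ... plus a Width x Experts term (our reading: a linear router).
    router_term = moe_layers * (width * experts)
    return (embedding_term + per_layer_term + output_term
            + extra_expert_term + router_term)


# Hypothetical toy configuration, chosen only so the arithmetic is easy to follow.
toy = dict(embedding=10, width=4, layers=2, ffn=8, ffn_factor=2, moe_layers=1)
total = total_parameters(experts=4, **toy)      # all four experts counted
activated = total_parameters(experts=2, **toy)  # only two experts counted
print(total, activated)
```

In this sketch the gap between `total` and `activated` comes entirely from the two MoE terms, which is why adding experts increases the total count much faster than the activated count.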
Table 8. More architecture details of the MoE-LLaVA model. “FFN Factor” represents the number of linear layers in the FFN. “*” denotes that the dimension of the hidden states for the keys (k) and values (v) is 1024. “1.6B×4-Top2” represents a dense foundation model with 1.6B parameters that is equipped with a total of four experts, two of which are activated. “†” denotes that all layers are equipped with an MoE layer.
[Figure: percentage plots per MoE layer. Panels: All experts, Expert 1, Expert 2, Expert 3, Expert 4. Legend: Expert 1 to Expert 4; Text, Image. x-axis: MoE layer (1 to 31). y-axis: Percentage (0% to 100%).]
This raises the question of what behavior we should expect from a sparse MoE model. Should each expert exhibit a specific pattern, as in MoE-LLaVA-Phi, or should the experts behave similarly, as in MoE-LLaVA-OpenChat? At initialization, the experts of a sparse model should behave similarly, because they are initialized from a shared FFN and the router has not yet learned any inductive biases. If the routing distribution remains balanced as the network learns, the model stays close to its initialization, which may lead to confusion in the model. We therefore speculate that insufficient data may be one reason for the poor performance of MoE-LLaVA-OpenChat. However, given the current limitations in data and computational resources, we are unable to explore this further, and we hope that future work can make progress in this direction.
Additionally, we provide more details in Figure 11, Figure 12, Figure 13, and Figure 14.
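One way to make "balanced" versus "specialized" routing concrete is to summarize the per-expert token share of a layer with a single number. The sketch below is an illustration only: normalized entropy is our choice of summary and is not a metric reported in the paper, and the function names and token assignments are hypothetical.

```python
import math
from collections import Counter


def expert_load(assignments, num_experts):
    """Fraction of routed tokens handled by each expert.

    `assignments` is a non-empty list of expert indices, one per routed token.
    """
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]


def normalized_entropy(load):
    """1.0 for a perfectly uniform load, 0.0 when one expert takes every token.

    Assumes at least two experts, so that log(len(load)) is non-zero.
    """
    entropy = -sum(p * math.log(p) for p in load if p > 0)
    return entropy / math.log(len(load))


# Hypothetical routing decisions for eight tokens in one MoE layer.
balanced = [0, 1, 2, 3, 0, 1, 2, 3]
skewed = [0, 0, 0, 0, 0, 0, 1, 2]
print(normalized_entropy(expert_load(balanced, 4)))
print(normalized_entropy(expert_load(skewed, 4)))
```

Under this summary, a layer whose score stays near 1.0 throughout training would correspond to the balanced behavior discussed above, while a lower score indicates that some experts receive a disproportionate share of tokens.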
[Figure: percentage plots per MoE layer. Panels: All experts, Expert 1, Expert 2, Expert 3, Expert 4. Legend: Expert 1 to Expert 4; Text, Image. x-axis: MoE layer (1 to 31). y-axis: Percentage (0% to 100%).]
[Figure: percentage plots per MoE layer. Panels: All experts, Expert 1, Expert 2, Expert 3, Expert 4. Legend: Expert 1 to Expert 4; Text, Image. x-axis: MoE layer (1 to 23). y-axis: Percentage (0% to 100%).]
Table 12. Exhibition Board of MoE-LLaVA. MoE-LLaVA demonstrates the ability to detect and answer challenging questions when prompted to verify them.

Visual input example, Tricky Question and Image:
User: If there are factual errors in the questions, point it out; if not, proceed answering the question. What’s happening in the desert?
LLaVA-1.5: There are no deserts in the image. The image features a beach with palm trees, a city skyline, and a large body of water.
MoE-LLaVA: There are no deserts in the image. The scene features a city with a beach and a large body of water, which is more likely to be a coastal city.
User: Do you know who drew this painting?
MoE-LLaVA: The painting is a portrait of the famous artist, Leonardo da Vinci.
User: Do you know who drew this painting?
MoE-LLaVA: The painting is a drawing of a dog, and it is not possible to definitively determine the artist who created it based on the image alone.

Visual input example, Common Question:
User: Can you explain this meme?
GPT-4 (OpenAI, 2023): This meme is a joke that combines two unrelated things: pictures of the earth from space and chicken nuggets. The text of the meme suggests that the image below is a beautiful picture of the earth from space. However, the image is actually of chicken nuggets arranged to vaguely resemble a map of the world. The humor in this meme comes from the unexpected juxtaposition of the text and the image. The text sets up an expectation of a majestic image of the earth, but the image is actually something mundane and silly.
MoE-LLaVA: The image features a pan filled with golden-brown fried chicken pieces arranged in the shape of a world map. There are numerous chicken pieces spread across the pan, with some pieces located near the center, while others are positioned towards the edges. The chicken pieces are placed in a way that resembles the continents and countries, creating a visually appealing and creative presentation. The pan itself is placed on a table, ready to be served and enjoyed.
User: What is unusual about this image?
MoE-LLaVA: The unusual aspect of this image is that a man is standing on a clothesline that is attached to the back of a moving taxi cab. This is not a typical scene, as clotheslines are usually found in residential areas, patios, or laundry rooms, and not on the back of a vehicle. The man is holding a shirt while standing on the clothesline, which adds to the unusualness of the situation. It is likely that the man is performing some task or trying to dry the shirt in an unconventional manner.
[Figure: percentage plots per MoE layer. x-axis: MoE layer idx (1 to 31 in the first four rows of panels, 1 to 23 in the last two). y-axis: Percentage.]
[Figure: percentage plots per MoE layer. x-axis: MoE layer idx (1 to 23). y-axis: Percentage.]
[Figure: expert selection per MoE layer. Panel title: Image. Rows: Top-1, Top-2. Legend: Top-1, Top-2, Others. x-axis: MoE layer idx (1 to 31). y-axis: Expert idx (1 to 4).]
[Figure: expert selection per MoE layer. Panel title: Image. Rows: Top-1, Top-2. Legend: Top-1, Top-2, Others. x-axis: MoE layer idx (1 to 23). y-axis: Expert idx (1 to 4).]
A sparse mixture-of-experts model such as MoE-LLaVA reduces computational cost because only a subset of experts is activated for each token, which makes both inference and training more efficient. It also achieves performance comparable or superior to dense LVLMs with more activated parameters; for example, it outperforms LLaVA-1.5-13B on the POPE object hallucination benchmark.
MoE-LLaVA outperforms models with more activated parameters because its sparse architecture activates specialized experts, concentrating capacity on the task at hand while leaving the remaining experts inactive. This specialization mitigates object hallucination with fewer but more relevant parameters, allowing the model to surpass LLaVA-1.5-13B on the POPE benchmark.
MoE-LLaVA uses a sparse architecture that selectively activates experts for specific visual-language patterns, in contrast to earlier LVLMs that engage all parameters at once. This improves computational efficiency and gives the model an edge in both visual understanding and hallucination mitigation.
MoE-Tuning is a three-stage training strategy. It first adapts visual tokens to the LLM, then trains the LVLM's parameters to establish multi-modal understanding, and finally trains the MoE layers, whose experts are initialized from a shared FFN. This staged approach prevents the performance degradation typically caused by sparsity and yields a sparse but highly effective model, enhancing MoE-LLaVA's overall performance.
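The appendix notes that the experts are initialized from a shared FFN. A minimal sketch of that initialization step is shown below, using plain nested lists in place of real weight tensors. The function name, the weight layout, and the values are hypothetical and only illustrate the copying; they do not reproduce the authors' training code.

```python
import copy


def init_experts_from_ffn(ffn_weights, num_experts):
    """Return `num_experts` independent copies of a dense FFN's weights."""
    # deepcopy ensures that later updates to one expert do not affect the others.
    return [copy.deepcopy(ffn_weights) for _ in range(num_experts)]


# Hypothetical dense FFN weights (toy 2x2 matrices).
ffn = {"w_in": [[0.1, 0.2], [0.3, 0.4]], "w_out": [[0.5, 0.6], [0.7, 0.8]]}
experts = init_experts_from_ffn(ffn, num_experts=4)

# All experts start identical; they can only diverge through training updates.
experts[0]["w_in"][0][0] = 9.9
print(experts[1]["w_in"][0][0], ffn["w_in"][0][0])
```

Because every expert starts from the same weights, any difference between experts after training has to come from the tokens the router sends to each of them.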
MoE-LLaVA is trained with the MoE-Tuning strategy and, at deployment, activates only the top-k experts through learnable routers, keeping the remaining experts inactive. Because only a fraction of the experts processes each token, the computational cost stays constant while the number of parameters grows, which lets MoE-LLaVA scale to a large model size efficiently.
Extensive experiments validate MoE-LLaVA's hallucination mitigation ability on the POPE object hallucination benchmark, where it surpasses LLaVA-1.5-13B by 1.1% while using fewer activated parameters. This suggests that its specialized experts and sparse paths manage hallucination effectively.
In the MoE-LLaVA architecture, the router directs each token to the appropriate experts. It determines which experts to activate so that only the most relevant ones process the token. This selective activation is central to keeping the model computationally efficient while drawing on a large number of experts.
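The routing step can be sketched as a linear scoring layer followed by a softmax and a top-k selection. The code below is a simplified, pure-Python illustration under stated assumptions: the router is a single weight matrix, gate values are the softmax probabilities without renormalization over the selected experts, and the experts are toy functions. All names and numbers are hypothetical and are not taken from the MoE-LLaVA implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def route_top_k(token, router_weights, k):
    """Return [(expert_index, gate_probability)] for the k highest-scoring experts."""
    # One router row per expert; the logit is the dot product with the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]


def moe_forward(token, router_weights, experts, k=2):
    """Gate-weighted sum of the selected experts' outputs; other experts never run."""
    out = [0.0] * len(token)
    for i, gate in route_top_k(token, router_weights, k):
        y = experts[i](token)
        out = [o + gate * v for o, v in zip(out, y)]
    return out


# Toy experts: each one scales the token by a different constant.
experts = [lambda t, s=s: [s * v for v in t] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
token = [2.0, 1.0]

selected = route_top_k(token, router_weights, k=2)
output = moe_forward(token, router_weights, experts, k=2)
print(selected, output)
```

With k fixed, the number of expert calls per token does not depend on how many experts exist, which is the sense in which compute stays constant as experts are added.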
MoE-LLaVA uses significantly fewer activated parameters (approximately 3B) than models such as LLaVA-1.5-7B and LLaVA-1.5-13B, yet it achieves similar or superior performance on visual understanding and hallucination benchmarks. This indicates that MoE-LLaVA is highly efficient, maintaining comparable capability at reduced computational cost.
MoE-LLaVA balances capacity and cost through sparse pathways in which only selected experts are activated by the router. This allows the total parameter count to grow and task-specific expertise to be refined without a proportional increase in computation, ensuring efficiency and scalability.
MoE-LLaVA achieves sparsity through multiple sparse paths in its MoE layers. Each MoE layer consists of several experts, of which only the top two are activated for a given token according to the router. The parameter count covers the embedding, the dense layers, and, for each MoE layer, the additional expert FFNs and the router; setting the number of experts to the number activated gives the activated parameter count.