
Efficient Architectures for LLMs Survey

The document surveys efficient architectures for Large Language Models (LLMs), focusing on overcoming the limitations of traditional transformer models that require substantial computations. It discusses various innovative approaches, including linear and sparse sequence modeling, efficient full attention variants, and hybrid architectures, while also exploring applications in multimodality and reasoning. The survey aims to provide a comprehensive overview and motivate future research towards more efficient AI systems.


Speed Always Wins: A Survey on Efficient Architectures for Large Language Models


Weigao Sun*1 Jiaxi Hu*2 Yucheng Zhou*3 Jusen Du*1 Disen Lan*1 Kexin Wang*4
Tong Zhu*5 Xiaoye Qu*1 Yu Zhang5 Xiaoyu Mo6 Daizong Liu7 Yuxuan Liang2
Wenliang Chen5 Guoqi Li4 Yu Cheng8 B
1 Shanghai AI Laboratory 2 HKUST (GZ) 3 University of Macau
4 Institute of Automation, Chinese Academy of Sciences 5 Soochow University
6 KTH Royal Institute of Technology 7 Peking University 8 The Chinese University of Hong Kong
arXiv:2508.09834v1 [[Link]] 13 Aug 2025

Large Language Models (LLMs) have delivered impressive results in language understanding, generation,
and reasoning, and have pushed the capability boundary of multimodal models. Transformer models, as the foundation of
modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional Transformer
architecture requires substantial computation and poses significant obstacles to large-scale training and
practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that
address the inherent limitations of Transformers and improve efficiency. Starting from language modeling,
this survey covers the background and technical details of linear and sparse sequence modeling methods,
efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above
techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other
modalities and consider their wider implications for developing scalable, resource-aware foundation models.
By grouping recent studies into the above categories, this survey presents a blueprint of modern efficient LLM
architectures, and we hope it helps motivate future research toward more efficient, versatile AI systems.
GitHub: [Link]

Figure 1: Overview of Efficient Architectures for Large Language Models.
[Diagram: between input tokens and output tokens, the surveyed methods are grouped as follows. Efficient Sequence Modeling: Linear Sequence Modeling (§2): Linear Attention, Linear RNN, State Space Model, Test-Time-Training RNN, Unified Linear Sequence Modeling; Sparse Sequence Modeling (§3): Static Sparse Attention, Dynamic Sparse Attention, Training-free Sparse Attention; Efficient Full Attention (§4): IO-Aware Attention, Grouped Attention, Mixture of Attention, Quantized Attention. Sparse Mixture-of-Experts (MoE) (§5): Routing Mechanisms, Expert Architectures, MoE Conversion. Hybrid Architecture (§6): Inter-layer Hybrid, Intra-layer Hybrid. Diffusion LLM (DLLM) (§7): Non-Autoregressive DLLM, Bridging DLLM and Autoregressive, Extending DLLM to Multimodality. Applications to Other Modalities (§8): Vision, Audio, Multimodality.]

*Equal Contribution B Corresponding Author (chengyu@[Link])



Table of Contents

1 Introduction
  1.1 Background
  1.2 Position and Contributions

2 Linear Sequence Modeling
  2.1 Linear Attention
  2.2 Linear RNN
  2.3 State Space Model
  2.4 Test-Time-Training RNN
  2.5 Unified Linear Sequence Modeling
  2.6 Linearization
  2.7 Hardware-efficient Implementation

3 Sparse Sequence Modeling
  3.1 Static Sparse Attention
  3.2 Dynamic Sparse Attention
  3.3 Training-free Sparse Attention
  3.4 Hardware-efficient Implementation

4 Efficient Full Attention
  4.1 IO-Aware Attention
  4.2 Grouped Attention
  4.3 Mixture of Attention
  4.4 Quantized Attention

5 Sparse Mixture-of-Experts
  5.1 Routing Mechanisms
  5.2 Expert Architectures
  5.3 MoE Conversion

6 Hybrid Architectures
  6.1 Inter-layer Hybrid
  6.2 Intra-layer Hybrid

7 Diffusion Large Language Models
  7.1 Non-Autoregressive Diffusion LLM
  7.2 Bridging Diffusion LLM and Autoregressive
  7.3 Extending Diffusion LLM to Multimodality

8 Applications to Other Modalities
  8.1 Vision
  8.2 Audio
  8.3 Multimodality

9 Conclusion and Future Directions


1. Introduction
1.1. Background

In recent years, Large Language Models (LLMs) have demonstrated extraordinary capabilities in understanding
and generating natural language, driving substantial progress across a wide range of tasks, including text
generation [1, 2, 3], code generation [4, 5, 6], question answering [7, 8], and machine translation [3, 9].
Prominent LLM families such as ChatGPT [2, 10, 11, 12, 13, 14, 15, 16, 17], Claude [18, 19, 20, 21, 22],
Gemini [23, 24, 25], DeepSeek [26, 27, 28, 29], Qwen [30, 31, 32, 33], LLaMA [34, 35, 36, 37], GLM [38],
Minimax-Text [39], InternLM [40, 41], and Hunyuan [42, 43] have continuously pushed the boundaries of
performance, while also reshaping how people interact with machines in daily life. Beyond their initial
role in language tasks, LLMs are increasingly being applied in two demanding areas: multimodality and
complex reasoning. In multimodal applications, LLMs now play a key role in systems that integrate and
generate information across multiple data types. Recent advances in Vision-Language Models (VLMs), such
as Qwen-VL [44, 45, 46], InternVL [47, 48, 49, 50], Seed-VL [51], Kimi-VL [52], and Minimax-VL [39], illustrate
this shift, showcasing enhanced cross-modal understanding and generation by combining
language skills with visual processing. At the same time, a growing line of work focuses on strengthening
the reasoning capabilities of LLMs, yielding what are often referred to as Large Reasoning Models (LRMs). Representative
systems like OpenAI o1/o3 [14, 15], DeepSeek-R1 [29], Seed1.5-Thinking [53], Minimax-M1 [54], and
Kimi-k1.5/K2 [55, 56] incorporate strategies such as long Chain-of-Thought (CoT) prompting [57] and
Reinforcement Learning (RL) [58] to support multi-step reasoning and more deliberate cognitive behavior.
Although LLMs, VLMs, and LRMs have brought major advances in language understanding, multimodal
processing, and complex reasoning, they also introduce considerable computational demands [59, 60, 61].
These increased requirements result in significantly higher development and deployment costs, which present
practical barriers to widespread adoption. This challenge is shared across LLMs, VLMs, and LRMs, highlighting
a common trade-off between model capability and efficiency. While such models offer a promising path
toward intelligence, their high resource consumption raises important questions about the sustainability and
practicality of pursuing even more powerful systems under current computational constraints. This prompts
the question: Have we paused to consider the immense hidden costs behind such unrivaled capabilities,
and what is the true price of this "intelligence"?
The core architecture behind many of the latest breakthroughs is the Transformer [62], introduced in
2017. Its self-attention mechanism allows models to capture long-range dependencies more effectively
than traditional Recurrent Neural Networks (RNNs) [63], enabling the scaling of LLMs to hundreds of
billions or even trillions of parameters [2]. However, one major limitation of the Transformer lies in the
quadratic complexity of its self-attention mechanism, which scales as $O(N^2)$ with the input sequence length
$N$ [64]. This computational inefficiency leads to extremely high training and inference costs, particularly in
tasks that involve long-context inputs [65]. With the continued advancement of artificial intelligence (AI),
long-sequence scenarios are becoming increasingly common.
As shown in Figure 2, tasks such as Retrieval-Augmented Generation (RAG) [7] often require LLMs
to process entire documents. In the emerging era of AI agents [66], long sequences frequently arise from
repeated generations and multiple tool invocations. When models are equipped with enhanced reasoning
abilities [58], forming LRMs, they must handle lengthy chains of thought, which also result in long sequences.
Similarly, in multimodal applications [67], high-resolution images, videos, and audio introduce additional
long-sequence challenges. Another key component of the Transformer architecture, the Feed-Forward
Network (FFN) [68], also faces challenges as model size increases. When the number of parameters grows
beyond a certain scale, the training cost and inference efficiency of the FFN layer become increasingly difficult
to manage. This raises another question: How can we break through the Transformer’s
efficiency ceiling? Is costly "intelligence" our only path forward?

Figure 2: Long Context Patterns. We provide representative examples of long-context usage patterns across various
scenarios, including retrieval-augmented generation (RAG), agentic, reasoning, and multimodal applications.
[Diagram: four token-sequence layouts. RAG Pattern: system, doc, user. Agentic Pattern: system, user, gen, tool, user, gen, tool. Reasoning Pattern: system, user, gen, think, gen, think, gen. Multimodal Pattern: system, text, image, video, audio, user.]

To address these pressing challenges and unlock the full potential of LLMs, the research community has
been actively exploring a spectrum of innovative architectural designs and optimization strategies. This
survey delves into these innovative approaches, systematically categorizing them to provide a comprehensive
overview. The specific methods encompassed within each category can be found in Figure 3. We
summarize each category below:

• Linear Sequence Modeling: These methods aim to reduce the quadratic complexity of self-attention
to linear complexity ($O(N)$) by reformulating the attention mechanism, often drawing inspiration from
conventional attention, RNNs, or state-space models (SSMs). These methods also eliminate the need to
store a Key-Value (KV) cache during inference, thereby lowering the deployment cost.
• Sparse Sequence Modeling: Instead of computing attention over all token pairs, these methods selectively
focus on a subset of interactions (i.e., the attention map), thereby reducing computational and memory
requirements while striving to preserve performance.
• Efficient Full Attention: These methods enhance the standard softmax attention’s efficiency while
retaining its theoretical quadratic complexity, such as improving memory access efficiency through IO-
aware attention mechanisms, and reducing the KV cache size through grouped query mechanisms.
• Sparse Mixture of Experts: This paradigm introduces a conditional computation approach where only a
subset of a model’s parameters (called experts) are activated for each input token, allowing for a massive
increase in model capacity without a proportional increase in computational cost.
• Hybrid Architectures: These designs strategically combine linear sequence modeling components with
traditional full attention layers. This can be achieved through intra-layer hybrid, where both types of
operations co-exist within the same layer, or inter-layer hybrid, where different layers utilize distinct
attention types, leveraging the strengths of each to balance efficiency and model capacity.
• Diffusion LLMs: An emerging area that explores non-autoregressive diffusion models for language
generation, potentially offering new avenues for efficient and high-quality text synthesis.
• Applications to Other Modalities: Importantly, the architectural principles driving efficiency in LLMs are
not confined to language; their adaptability is increasingly evident in other domains such as vision, audio,
and multi-modality, a trend we will also explore.
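
To make the memory argument in the first and third bullets concrete, the following is a rough back-of-the-envelope sketch rather than a measurement. It compares the per-sequence KV cache of standard multi-head attention, the smaller cache obtained when several query heads share one key-value head (the grouped query idea), and the fixed-size recurrent state of a linear sequence model. The layer count, head counts, head dimension, and 2-byte element size are hypothetical values chosen only for illustration, and the function names are not taken from any library.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Memory for the cached keys and values of one sequence (factor 2 = K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def linear_state_bytes(n_layers, n_heads, head_dim, bytes_per_elem=2):
    """Memory for one (head_dim x head_dim) recurrent state per head and layer."""
    return n_layers * n_heads * head_dim * head_dim * bytes_per_elem


# Hypothetical model configuration, used only for illustration.
cfg = dict(n_layers=32, head_dim=128)

for n in (4_096, 131_072):
    mha = kv_cache_bytes(n, n_kv_heads=32, **cfg)   # one KV head per query head
    gqa = kv_cache_bytes(n, n_kv_heads=8, **cfg)    # 4 query heads share a KV head
    lin = linear_state_bytes(n_heads=32, **cfg)     # independent of sequence length
    print(f"N={n:>7}: MHA {mha / 2**20:8.0f} MiB | "
          f"GQA {gqa / 2**20:8.0f} MiB | linear state {lin / 2**20:4.0f} MiB")
```

Under these assumed numbers the two cache sizes grow linearly with the sequence length, while the linear-model state stays constant, which is the property the bullets above refer to.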


Efficient Architectures

Linear Sequence Modeling (§2)
• Linear Attention (§2.1): e.g., Linear Transformer [64], ABC [69], Lightning Attention [70], GLA [71], GSA [72], LightNet [73], Based [74], Rebased [75], DeltaNet [76], Gated DeltaNet [77], MoM [78], etc.
• Linear RNN (§2.2): e.g., HGRN [79], HGRN2 [80], RWKV4 [81], RWKV6 [82], RWKV7 [83], LRU [84], xLSTM [85], GateLoop [86], etc.
• State Space Model (§2.3): e.g., LegRNN [87], Hippo [88], LSSL [89], S4 [90], HTTYH [91], DSS [92], S4D [93], H3 [90], S5 [94], SpaceTime [95], Time-SSM [96], Stable-SSM [97], Hippo-PTD [98], Liquid-S4 [99], Longhorn [100], Mamba [101], Mamba2 [102], Comba [103], etc.
• Test-Time-Training RNN (§2.4): e.g., TTT [104], Titans [105], Lattice [106], Miras [107], Atlas [108], MesaNet [109], etc.
• Unified Linear Sequence Modeling (§2.5): e.g., LCSM [110], Linear-MoE [65], Mamba2 [102], Comba [103], TTT [104], Titans [105], etc.
• Linearization (§2.6): e.g., T2R [111], MambaInLlama [112], SUPRA [113], LoLCATs [114], Liger [115], etc.
• Hardware-efficient Implementation (§2.7): e.g., Lightning Attention [70], GLA [71], S4 [90], Mamba [101], Mamba2 [102], etc.

Sparse Sequence Modeling (§3)
• Static Sparse Attention (§3.1): e.g., Sparse Transformer [116], Star-Transformer [117], BlockBERT [118], Longformer [119], ETC [120], BigBird [121], LongT5 [122], LongNet [123], Axial Attention [124], etc.
• Dynamic Sparse Attention (§3.2): e.g., Reformer [125], Routing Transformer [126], Sparse Sinkhorn Attention [127], Memorizing Transformers [128], Unlimiformer [129], NSA [130], MoSA [131], etc.
• Training-free Sparse Attention (§3.3): e.g., SpAtten [132], MInference [133], SeerAttention [134], StreamingLLM [135], H2O [136], FastGen [137], Quest [138], LongHeads [139], LServe [140], XAttention [141], etc.
• Hardware-efficient Implementation (§3.4): e.g., Longformer [119], FlashAttention-1 [142], FlashAttention-2 [143], SeerAttention [134], NSA [130], MoBA [144], etc.

Efficient Full Attention (§4)
• IO-Aware Attention (§4.1): e.g., FlashAttention-1 [142], FlashAttention-2 [143], FlashAttention-3 [145], etc.
• Grouped Attention (§4.2): e.g., MQA [146], GQA [147], MLA [27, 28], GTA [148], GLA [148], etc.
• Mixture of Attention (§4.3): e.g., MoA [149], Llama-MoE-v2 [150], MoH [151], MoBA [144], MoM [78], MoSA [131], etc.
• Quantized Attention (§4.4): e.g., SageAttention-V1 [152], SageAttention-V2 [153], SageAttention-V3 [154], Q-Bert [155], I-BERT [156], INT-FlashAttention [157], Q8BERT [158], FullyQT [159], TurboAttention [160], HACK [161], BitDistiller [162], etc.

Sparse Mixture-of-Experts (§5)
• Routing Mechanisms (§5.1): e.g., Expert-Choice [163], BASE Layer [164], Hash Layer [165], MoE-Dynamic [166], DynMoE [167], AdaMoE [168], Ada-K [169], AuxLossFree [170], Global Batch [171], etc.
• Expert Architectures (§5.2): e.g., DeepSeekMoE [172], DeepSpeed-MoE [173], Qwen3 [33], OLMoE [174], MoD [175], etc.
• MoE Conversion (§5.3): e.g., MoEBERT [176], MoEfication [177], LLaMA-MoE [178], LLaMA-MoE-v2 [150], Sparse Upcycling [179], BTM [180], BTX [181], etc.

Hybrid Architectures (§6)
• Inter-layer Hybrid (§6.1): e.g., Zamba [182], Zamba2 [183], Samba [184], Jamba [185], RWKV-X [186], Minimax-01 [39], Mamba-in-Llama [112], HunYuan-Turbos [43], Zebra-Llama [187], YOCO [188], RecurrentGemma [189], LaCT [190], etc.
• Intra-layer Hybrid (§6.2): e.g., Hymba [191], TransMamba [192], Liger [115], LoLCATs [114], LoLA [193], etc.

Diffusion LLM (§7)
• Non-Autoregressive Diffusion LLM (§7.1): e.g., LLaDA [194], Diffusion-LM [195], DiffuSeq [196], SEDD [197], Plaid [198], etc.
• Bridging Diffusion LLM and Autoregressive (§7.2): e.g., BD3-LMs [199], Scaling diffusion [200], etc.
• Extending Diffusion LLM to Multimodality (§7.3): e.g., LLaDA-V [201], UniDisc [202], LaViDa [203], MMaDA [204], etc.

Applications to Other Modalities (§8)
• Vision (§8.1): e.g., Vig [205], Vision-rwkv [206], Tutel [207], InsectMamba [208], Voxel mamba [209], DiM [210], U-mamba [211], Vm-unet [212], Rwkv-unet [213], Mambabev [214], etc.
• Audio (§8.2): e.g., Audio mamba [215], Mamca [216], Rawbmamba [217], SaShiMi [218], Music-Diff [219], BiMamba [220], Dual-path mamba [221], Spmamba [222], VAD [223], etc.
• Multimodality (§8.3): e.g., MaTAV [224], Avs-mamba [225], VisualRWKV-UHD [226], Llada-v [201], Mmada [204], Fragkiadaki [202], VL-MoE [227], Moe-llava [228], MoCLE [229], Llava-mole [230], etc.

Figure 3: A Comprehensive Taxonomy of Efficient Architectures for Large Language Models.


1.2. Position and Contributions

The pursuit of more efficient model architectures has attracted substantial attention, resulting in a number
of survey articles that chart the evolution of this research area. For example, Tay et al. [231] provide a
detailed examination of Efficient Transformers, discussing a wide range of strategies designed to improve the
self-attention mechanism, including early efforts in linear attention. Patro et al. [232] offer
an extensive survey of SSMs, positioning these architectures as promising alternatives to Transformer-based
approaches for handling very long input sequences; their work systematically classifies different SSM designs
and highlights applications that span language, vision, and other domains. More recently, Tiezzi et al. [233]
review the renewed interest in recurrent processing, presenting models that blend core ideas from both
Transformers and traditional recurrent networks, as well as the latest advances in state-space formulations.
Sun et al. [234] summarize recent advances in linear attention and sparse attention, as well as pretrained hybrid
LLMs built with these techniques. Although these surveys cover some efficient architectures such as linear
models, each focuses on a limited scope of methods for addressing the Transformer’s high computational cost.
Compared with the above reviews, this survey starts from key components of the Transformer model
and provides a more comprehensive and organized overview of recent advances in efficient architectures of
LLMs. We focus on summarizing their key design principles, performance benefits, known limitations, and
possible future developments. Through this synthesis of current progress and emerging trends, we aim to
support researchers and practitioners in developing more efficient and scalable LLMs and beyond.
The key contributions of this survey can be summarized as follows:

• Comprehensive Survey of Efficient Sequence Modeling: We present an in-depth review of recent
progress in efficient sequence modeling. This includes methods that reduce the quadratic cost of attention
through linear sequence models, strategies that selectively sparsify attention scores, and approaches
that optimize the full softmax attention operation. By examining their design principles, performance
trade-offs, and implementation details, we identify common themes and trace each technique back to its
original motivation.
• Broader Transformer Component Optimizations: In addition to sequence modeling, we extend our
analysis to other critical Transformer sub-modules. We cover advances in sparse MoE layers that enable
conditional computation, hybrid architectures that blend linear and standard attention within or across
layers, and the emerging area of diffusion-based language generation architectures. This wider perspective
highlights how different parts of the Transformer can be re-imagined for efficiency.
• Multi-Modal and Multi-Domain Applications: Recognizing that efficient architectures are not limited to
text processing, we explore their adaptation to other data modalities. We discuss applications in vision,
audio, and multimodal settings, demonstrating how efficiency gains in token and channel mixing can
benefit tasks ranging from high-resolution image understanding to multimodal sequence modeling.

2. Linear Sequence Modeling


In general, linear sequence modeling approaches can be broadly categorized into four groups: Linear
Attention, Linear Recurrent Neural Network (Linear RNN), State Space Model (SSM), and Test-Time-Training
(TTT) RNN, each originating from distinct motivations and mathematical formulations [65]. Below, we
first present these methods individually, and then describe how they can be unified under a common linear
sequence modeling framework. We further explore a new direction called Linearization, which aims to
convert standard Transformer models with softmax attention into linear sequence modeling architectures,
allowing them to benefit from the efficiency of linear methods at a low cost [115]. In addition, this section
covers implementation strategies that enhance the hardware efficiency of these linear sequence modeling
approaches.

Figure 4: Linear Sequence Modeling Methods and Their Connections. The formulations of linear attention, linear RNNs,
state space models, and test-time training RNNs have gradually converged toward a unified representation. Moreover,
softmax attention can also be transformed into a linear sequence modeling form through the linearization techniques.
[Diagram: panels for Linear Attention, Linear RNN, State Space Model, and Test-Time-Training RNN (with the update $S_t = S_{t-1} - \eta_t \nabla \ell(S_{t-1}; k_t, v_t)$) converge to Unified Linear Sequence Modeling; Softmax Attention linearizes to the same unified form.]

2.1. Linear Attention

The standard Transformer [62] employs a softmax attention mechanism, which takes the input token $x_t$ and
computes the output $o_t$ through:

$$
q_t, k_t, v_t = x_t W_Q,\; x_t W_K,\; x_t W_V, \qquad
o_t = \frac{\sum_{i=1}^{t} \exp(q_t k_i^\top)\, v_i}{\sum_{i=1}^{t} \exp(q_t k_i^\top)} \tag{1}
$$

Linear Attention was initially proposed in the Linear Transformer [64], which replaces the standard
softmax attention with a linear approximation based on feature maps or kernel functions. This modification
addresses the computational inefficiencies of conventional attention mechanisms and enables linear-time
complexity. The generalized form of attention using an arbitrary similarity function can be expressed as:

$$
o_t = \frac{\sum_{i=1}^{t} \mathrm{sim}(q_t, k_i)\, v_i}{\sum_{i=1}^{t} \mathrm{sim}(q_t, k_i)} \tag{2}
$$
Note that Eq. (2) is equivalent to Eq. (1) if we instantiate the similarity function as
$\mathrm{sim}(q, k) = \exp(q k^\top)$. Linear attention instead introduces a kernel-based function that represents $\mathrm{sim}(q, k)$ as
$\phi(q)\phi(k)^\top$ with a feature mapping $\phi(\cdot)$. Then Eq. (2) can be rewritten as:


$$
o_t = \frac{\sum_{i=1}^{t} \phi(q_t)\phi(k_i)^\top v_i}{\sum_{i=1}^{t} \phi(q_t)\phi(k_i)^\top} \tag{3}
$$
Eq. (3) can be further simplified by using the associative property of matrix multiplication as:

$$
o_t = \frac{\phi(q_t) \sum_{i=1}^{t} \phi(k_i)^\top v_i}{\phi(q_t) \sum_{i=1}^{t} \phi(k_i)^\top} \tag{4}
$$
Letting $S_t = \sum_{i=1}^{t} \phi(k_i)^\top v_i$ and $z_t = \sum_{i=1}^{t} \phi(k_i)^\top$, the formulation in Eq. (4) can be written in a
recurrent form [64] as:

$$
\begin{cases}
S_t = S_{t-1} + \phi(k_t)^\top v_t, \\
z_t = z_{t-1} + \phi(k_t)^\top,
\end{cases}
\qquad
o_t = \frac{\phi(q_t)\, S_t}{\phi(q_t)\, z_t} \tag{5}
$$

Reordering the computation from $(QK^\top)V$ to $Q(K^\top V)$ via the associative property of matrix
multiplication reduces both computational and memory complexity from quadratic, $O(N^2 d)$,
to linear, $O(N d^2)$, in the sequence length $N$.
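
The equivalence between the quadratic form and the recurrent form can be checked numerically. The NumPy sketch below is illustrative only: it uses the feature map elu(x) + 1 as an assumed choice of the feature mapping, treats queries, keys, and values as rows of small random matrices, and compares a causally masked quadratic computation with the token-by-token recurrence of Eq. (5). The function names are hypothetical and not taken from any library.

```python
import numpy as np


def phi(x):
    """Feature map elu(x) + 1, which keeps every feature strictly positive."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def linear_attention_parallel(Q, K, V):
    """Quadratic reference: causally masked (phi(Q) phi(K)^T) V with row normalization."""
    scores = np.tril(phi(Q) @ phi(K).T)          # (N, N); entries with i > t are zeroed
    return (scores @ V) / scores.sum(axis=1, keepdims=True)


def linear_attention_recurrent(Q, K, V):
    """Recurrent form: a constant-size state (S, z) is updated once per token."""
    N, d = Q.shape
    S = np.zeros((d, V.shape[1]))                # running sum of phi(k_i)^T v_i
    z = np.zeros(d)                              # running sum of phi(k_i)
    out = np.empty_like(V)
    for t in range(N):
        fk, fq = phi(K[t]), phi(Q[t])
        S += np.outer(fk, V[t])
        z += fk
        out[t] = (fq @ S) / (fq @ z)
    return out


rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 4)) for _ in range(3))
assert np.allclose(linear_attention_parallel(Q, K, V),
                   linear_attention_recurrent(Q, K, V))
```

The recurrent version never materializes the N x N score matrix; its working memory is the d x d state plus a length-d normalizer, regardless of how many tokens have been processed.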
Softmax Approximation via Feature Mapping. The pursuit of linear computational complexity in linear
attention mechanisms fundamentally relies on decoupling attention weight computation from the sequential
dependencies inherent in standard softmax attention. This is achieved by reformulating $\mathrm{softmax}(QK^\top)V$
into $\phi(Q)(\phi(K)^\top V)$ through the associative property of matrix multiplication. Core to this approach is the
approximation of the exponential similarity $\exp(q k^\top)$ via feature mapping functions, $\mathrm{sim}(q, k) = \phi(q)\phi(k)^\top$.
For example, Linear Transformer [64] employs the feature map $\phi(x) = \mathrm{elu}(x) + 1$, which ensures non-negative
similarity while mitigating vanishing gradients for negative inputs compared to $\mathrm{relu}(\cdot)$. Random feature
attention (RFA) [235] uses random feature projection [236] as the feature mapping function
$\phi(\cdot)$ to approximate the softmax function efficiently, outperforming the $\phi_{\mathrm{elu}}(\cdot)$ baseline. Nystromformer [237]
and Skyformer [238] adopt a similar idea, using low-rank kernel-based methods for softmax attention
approximation. Subsequent work [239] points out that the distribution of attention weights in linear
transformers is relatively smooth, which limits the focus on informative features. It therefore proposes a focused function
$\phi_p$ as the query-key feature mapping to improve feature diversity in linear attention weights, together with a simple
rank restoration module that applies depthwise convolution (DWC) to the attention matrix, enhancing the
expressiveness of linear attention while maintaining low computational complexity.
While linear attention models offer considerable gains in sequence modeling efficiency, the fidelity
of their approximation to standard attention remains limited, especially on
long-context recall tasks. Based [74] introduces a simple hybrid architecture combining linear attention
and sliding window attention, achieving a better throughput-recall tradeoff. ReBased [75] improves
long-context modeling performance by extending the Based architecture with polynomial kernels that have
learnable parameters and normalization operations. Concurrently, Zhang et al. [240] propose the Hedgehog
feature mapping, which mimics the low-entropy ("spiky") weights and dot-product monotonicity of
softmax attention, achieving better recall performance. However, Qin et al. [241] observe that kernel-based
linear attention used to approximate softmax attention suffers from unbounded gradients, which can cause
unstable convergence during training, and propose output normalization for gradient stability. ReGLA
[242] addresses this problem by using a normalized exponential feature mapping function, and introduces
a variance reduction factor for training stability.


Gating Mechanism in Linear Attention. Linear attention mechanisms often exhibit weaker sequence
modeling capability than softmax attention, with empirical performance gaps that are more than marginal.
This limitation is partly attributed to vanilla linear attention’s reliance
on cumulative memory updating, which induces memory conflicts and hinders effective sequence modeling
[71]. Furthermore, this cumulative process intrinsically lacks mechanisms for dynamically regulating
information flow over sequences. Consequently, a gating mechanism becomes essential
to endow the model with adaptive forgetting and controlled updating: it allows the model
to selectively attenuate or discard obsolete or irrelevant accumulated states while integrating
new inputs, mitigating memory interference and enabling more precise, contextually relevant representation
learning for complex sequence tasks. Early work such as TransNormerLLM [243] utilizes time-invariant
memory decay and introduces Lightning Attention [70] for I/O-aware, hardware-efficient optimization;
Lightning Attention-2 [244] further refines the block-wise attention computation, which
significantly improves computational efficiency. RetNet [245] adopts a retention mechanism by introducing a
data-independent decay term into linear attention and gains improvements in sequence modeling.
While prior approaches utilized data-independent gating mechanisms for memory management, the ideal
scenario involves models dynamically removing less important key-value associations based on the interaction
between new inputs and existing memory content to accommodate new information. Recent advancements
in linear attention variants demonstrate significant improvements by incorporating gating or decay factors
for forgetting and updating. However, a more effective strategy is the adoption of data-dependent gating
mechanisms, which dynamically control memory retention and updates through input-driven selection.
For instance, Gated Linear Attention (GLA) [71] employs a data-dependent gating scheme to enhance the
performance of sequence modeling and hardware efficiency, while Gated Slot Attention (GSA) [72] leverages
context-aware gated linear attention with bounded-memory control (ABC) [69] to improve sequence modeling
and long-context recall. To address the inefficiency of linear attention in multi-dimensional sequence modeling,
where multiplicative decay requires multiple scans over the input, LightNet [73] introduces an additive
linear recurrence that enables efficient single-pass processing of multi-dimensional data. Theoretical work in
MetaLA [246] establishes the necessity of time-varying gating for optimal softmax attention approximation
within linear attention frameworks. Further refinement is achieved by ReGLA [242], which introduces an
additional refining gate that modifies the forget gate when activations are near saturation, thereby enhancing overall
model performance and stability.
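The difference between data-independent decay and data-dependent gating can be illustrated with a minimal NumPy sketch. It is not the implementation of any cited model: the function names, the sigmoid gate projection `W_alpha`, and the tensor shapes are assumptions made for illustration, and the update forms follow the RetNet-style and GLA-style rows of Table 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decay_linear_attention(q, k, v, alpha=0.9):
    """Data-independent decay: S_t = alpha * S_{t-1} + v_t k_t^T, o_t = S_t q_t."""
    T, d_k = k.shape
    S = np.zeros((v.shape[1], d_k))
    out = np.zeros((T, v.shape[1]))
    for t in range(T):
        S = alpha * S + np.outer(v[t], k[t])   # same scalar decay at every step
        out[t] = S @ q[t]
    return out

def gated_linear_attention(q, k, v, gates):
    """Data-dependent gating: S_t = S_{t-1} diag(alpha_t) + v_t k_t^T, o_t = S_t q_t."""
    T, d_k = k.shape
    S = np.zeros((v.shape[1], d_k))
    out = np.zeros((T, v.shape[1]))
    for t in range(T):
        # Right-multiplying by diag(alpha_t) rescales each key channel (column).
        S = S * gates[t][None, :] + np.outer(v[t], k[t])
        out[t] = S @ q[t]
    return out

rng = np.random.default_rng(0)
T, d = 6, 4
x = rng.normal(size=(T, d))            # token features
q, k, v = rng.normal(size=(3, T, d))
W_alpha = rng.normal(size=(d, d))      # hypothetical gate projection (assumption)
gates = sigmoid(x @ W_alpha)           # alpha_t computed from the input
o_fixed = decay_linear_attention(q, k, v, alpha=0.9)
o_gated = gated_linear_attention(q, k, v, gates)
```

Setting every gate to the same constant recovers the data-independent case, which is one way to see that input-driven gating strictly generalizes fixed decay.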
Delta Learning Rule in Linear Attention. An effective sequence model should be able to remove less
relevant key-value associations to accommodate new information, and this removal should depend on the
interaction between incoming inputs and the memory state. From the perspective of fast weight programming,
the recurrent hidden state or memory in linear attention mechanisms, such as in GLA, functions as a fast
weight matrix that maps the input query qt to the output ot . This mapping is updated using a Hebbian-like
rule [247], which imposes limitations on memory capacity. To improve the expressiveness and adaptability of
memory, recent work has explored the delta (Widrow-Hoff) learning rule [248, 249], enabling meta-learning
or online adaptation during inference, which updates memory as follows:

$$ S_t = S_{t-1} - \beta_t \left(S_{t-1} k_t - v_t\right) k_t^{\top} $$

where the update is driven by the difference between the predicted output St−1 kt and the target value vt . This
mechanism enhances memory capacity and gives rise to several delta-rule-based models for linear sequence
learning. DeltaNet [76] is closely related to the Test-Time Training (TTT) framework [104], where memory
is treated as a trainable component and updated through gradient descent steps. To better highlight the
distinctions, we introduce the TTT family separately in Section 2.4. To overcome the limitation of DeltaNet,
which updates only a single key-value pair at each step, Gated DeltaNet [77] introduces a gating mechanism
that enables more flexible memory control and allows rapid removal of outdated or irrelevant information.
MesaNet [109] further advances this approach by incorporating the Mesa layer [250], which can dynamically
adjust computational cost at inference time. It employs a recursive least squares loss for fast weight updates
and can be viewed as a second-order online learner. MesaNet reduces the computational overhead of matrix
inversion using the conjugate gradient method and supports hardware-efficient chunkwise parallelization. Additionally,
models such as Comba [103] and RWKV7 [83] also fall under this category. However, we present them
separately in the sections on state-space models and linear RNNs, respectively, to highlight their unique
structural characteristics and connections to these model families.
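A minimal NumPy sketch of the delta-rule write described above is given below. The function names and shapes are illustrative assumptions rather than any model's actual code. It shows that the error-correcting form is algebraically identical to the transition-matrix form listed for DeltaNet in Table 1, and that with a unit-norm key and β = 1 the written value is retrieved exactly.

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """Error-correcting write: S_t = S_{t-1} - beta * (S_{t-1} k_t - v_t) k_t^T."""
    prediction = S @ k
    return S - beta * np.outer(prediction - v, k)

def delta_rule_step_factored(S, k, v, beta):
    """Same update written as S_{t-1} (I - beta k k^T) + beta v k^T."""
    d_k = k.shape[0]
    return S @ (np.eye(d_k) - beta * np.outer(k, k)) + beta * np.outer(v, k)

rng = np.random.default_rng(0)
S_prev = rng.normal(size=(3, 4))       # memory mapping 4-dim keys to 3-dim values
k = rng.normal(size=4)
k = k / np.linalg.norm(k)              # unit-norm key (assumption for the exact-recall check)
v = rng.normal(size=3)

S_a = delta_rule_step(S_prev, k, v, beta=0.5)
S_b = delta_rule_step_factored(S_prev, k, v, beta=0.5)
S_full = delta_rule_step(S_prev, k, v, beta=1.0)   # beta = 1 fully overwrites the old association
```

With β between 0 and 1 the old association for the key is only partially replaced, which is the sense in which β acts as a write strength.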
Log-Linear Memory in Linear Attention. Although linear attention (or more generally, linear sequence
modeling) has achieved performance comparable to or even surpassing that of Transformers in various
downstream tasks, it still suffers from an inherent limitation: the existence of only a single, fixed-size
state, which constrains its memory capacity. As a result, its performance degrades in long-context retrieval
scenarios. A possible compromise is to design a model whose memory grows logarithmically with
the sequence length (another potential solution is to maintain multiple memories, as done in MoM
[78]). Log-Linear Attention [251] introduces a general framework that replaces
the fixed-size hidden state of linear attention with a logarithmically growing set of hidden states, achieving
O( N log N ) training and O(log N ) inference complexity and thus balancing the efficiency of linear
attention against the expressiveness of softmax attention. PSM [252] and Attraos [253] adopt a similar idea and
extend linear recursion to a logarithmic scale through Blelloch’s parallel scan [254].
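To make the idea of a logarithmically growing set of states concrete, the sketch below uses a binary-counter merging scheme: each bucket summarizes a power-of-two span of tokens, and equal-sized neighbours are merged. This is an illustrative toy under stated assumptions, not the published Log-Linear Attention algorithm; the names `push`, `read`, and `level_weights` are hypothetical.

```python
import numpy as np

def push(buckets, k, v):
    """Append one token, then merge equal-sized neighbours (binary-counter style)."""
    buckets.append((1, np.outer(v, k)))
    while len(buckets) >= 2 and buckets[-1][0] == buckets[-2][0]:
        size_b, S_b = buckets.pop()
        size_a, S_a = buckets.pop()
        buckets.append((size_a + size_b, S_a + S_b))
    return buckets

def read(buckets, q, level_weights=None):
    """o = sum_l w_l * S_l q; uniform weights collapse to ordinary linear attention."""
    if level_weights is None:
        level_weights = [1.0] * len(buckets)
    return sum(w * (S @ q) for w, (_, S) in zip(level_weights, buckets))

rng = np.random.default_rng(0)
buckets, total, counts = [], np.zeros((3, 4)), []
for t in range(1, 33):
    k, v = rng.normal(size=4), rng.normal(size=3)
    push(buckets, k, v)
    total += np.outer(v, k)            # single fixed-size state, for comparison
    counts.append(len(buckets))        # number of states kept after t tokens
q = rng.normal(size=4)
o = read(buckets, q)
```

After t tokens the number of buckets equals the number of set bits in t, so it never exceeds floor(log2 t) + 1, and recent tokens are kept at finer granularity than distant ones.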

2.2. Linear RNN

RNNs are one of the most common methods for processing and modeling sequence data. At any given time
step t, an RNN processes an input xt and updates its hidden recurrent state ht . This updated hidden state is
subsequently used to generate an output yt :

ht = σ(Whh ht−1 + Whx xt + bh ), yt = f (ht ) (6)


where ht−1 represents the previous hidden state, f (·) is the projection function, and σ(·) is the activation
function. Regardless of the number of time steps, an RNN maintains a fixed-size hidden state ht that summarizes
the sequence history. Consequently, the state size (and hence the memory cost) of an RNN remains constant as the
number of time steps grows, enabling linear-time sequence modeling in recurrent form.
Traditional RNNs cannot be trained in parallel and are inefficient because of their
recurrent nature, which limits their ability to model long-term dependencies and makes them difficult to
scale up. The main reason is that the hidden-state update involves matrix multiplication and nonlinear
activation functions, which not only leads to gradient issues but also prevents parallel training. Linear RNNs
address these issues by removing the nonlinearity. In some variants, the recurrent transformation
is restricted to element-wise operations for efficiency, while others maintain structured or diagonalized
matrices. LRU [84] leverages complex-valued diagonalization for efficient parallel computation and a stable
exponential parameterization of its recurrent dynamics to effectively capture long-range dependencies.
Typical linear RNNs, such as the Gated Linear Recurrent Unit (GLRU) [79, 80, 255], are formulated as follows:

$$
\begin{aligned}
g_t &= \sigma(W_g x_t + b_g), \quad i_t = \tau(W_i x_t + b_i), \quad o_t = \sigma(W_o x_t + b_o), \\
h_t &= g_t \odot h_{t-1} + (1 - g_t) \odot i_t, \quad y_t = h_t \odot o_t
\end{aligned}
\tag{7}
$$


where ⊙ denotes element-wise product. Since the linear RNN in Eq. (7) removes the nonlinearity from the
recurrence, it enables efficient parallelized training and achieves linear-time recurrent inference for sequence modeling.
Recent work explores more expressive linear RNN architectures with efficient
recurrence or gating mechanisms. HGRN [79] maintains a similar RNN form as in Eq. (7), using element-wise products
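Because the recurrence in Eq. (7) is linear in the hidden state, it can be evaluated with an associative scan instead of a step-by-step loop. The sketch below compares the two in NumPy. It is a minimal illustration under stated assumptions: the gates and candidate inputs are sampled directly rather than computed from x_t, tanh stands in for the unspecified activation τ, and the Hillis-Steele scan is one simple choice of parallel scan.

```python
import numpy as np

def scan_sequential(a, b):
    """h_t = a_t * h_{t-1} + b_t with h_0 = 0, evaluated one step at a time."""
    h = np.zeros_like(b[0])
    out = np.empty_like(b)
    for t in range(len(a)):
        h = a[t] * h + b[t]
        out[t] = h
    return out

def scan_parallel(a, b):
    """Same recurrence via a Hillis-Steele scan: O(log T) vectorized passes."""
    a, b = a.copy(), b.copy()
    shift = 1
    while shift < len(a):
        # Compose each element with the partial result `shift` positions earlier:
        # (a1, b1) followed by (a2, b2) gives (a2 * a1, a2 * b1 + b2).
        b[shift:] = a[shift:] * b[:-shift] + b[shift:]
        a[shift:] = a[shift:] * a[:-shift]
        shift *= 2
    return b

rng = np.random.default_rng(0)
T, d = 37, 8
g = 1.0 / (1.0 + np.exp(-rng.normal(size=(T, d))))   # forget gate g_t in (0, 1)
i = np.tanh(rng.normal(size=(T, d)))                 # candidate input i_t (tanh assumed for tau)
h_seq = scan_sequential(g, (1.0 - g) * i)
h_par = scan_parallel(g, (1.0 - g) * i)
```

The composition rule is associative only because no nonlinearity is applied to h_{t-1}, which is why a tanh or sigmoid inside the recurrence would rule out this kind of parallel training.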
and accumulation to update memory. Additionally, it employs a novel forget gate mechanism featuring a
learnable, layer-wise monotonically increasing lower bound across network layers. This allows lower layers
to prioritize local, short-term information by forgetting more rapidly, while upper layers, constrained by a
higher forget gate lower bound, effectively retain historical context for long-term dependency modeling. To
enhance its expressive power within the linear recurrence framework, HGRN incorporates complex-valued
hidden states and recurrence, where the gate’s magnitude governs memory retention and its phase can
encode relative positional information. RWKV4 [81] adopts a similar RNN structure but uses a channel-wise
exponential decay mechanism inspired by relative position modeling in AFT [256]. Moreover, it employs
token shifts to integrate the input tokens. These linear RNN models use d-dimensional vectors as memory,
which limits their capacity. Research has increasingly shown that expanding memory capacity is crucial for
enhancing the performance of linear RNNs. Consequently, subsequent approaches have started using the
outer product of two d-dimensional vectors to form a d × d matrix as the memory representation.
HGRN2 [80] modifies the hidden state update process to Eq. (8). By using the outer product with
the forget gate, it expands the memory to a d × d matrix, thereby increasing memory capacity. GateLoop
[86] replaces their static state transition matrices with diagonal state transitions that dynamically adapt
to the input. This approach allows for a matrix form of memory update within its recurrent mechanism,
enabling more flexible state evolution. RWKV6 [82] changes the original element-wise vector product to
a k⊤v formulation to achieve a multi-headed matrix-valued memory, thus expanding its capacity. RWKV6
also introduces dynamic recurrence: the channel-wise decay rates within
the linear state update are no longer static but data-dependent and vary at each timestep, achieved
efficiently through Low-Rank Adaptation (LoRA) to augment learned base decay vectors. Additionally, it
applies similar data-dependency to its token-shift mechanism, allowing for more flexible and context-aware
integration of past and current token information. Similarly, xLSTM [85] extends traditional LSTMs by
introducing variants like mLSTM, which replaces the standard vector cell state with a matrix-valued memory
to significantly expand storage capacity.

$$ h_t = h_{t-1} \cdot \mathrm{Diag}\{f_t\} + i_t \otimes (1 - f_t) \in \mathbb{R}^{d \times d} \tag{8} $$

At this stage, the memory structure of both linear RNNs and linear attention mechanisms has converged,
utilizing matrix-based memory derived from the outer product of vectors. This allows for the application of
Test-Time-Training on the memory as will be discussed in §2.4. RWKV7 [83] incorporates test-time gradient
descent by introducing a "dynamic state evolution" powered by a generalized delta rule for its recurrent state
updates. This mechanism allows the model to dynamically adapt its multi-headed matrix-valued state at
each time step, effectively performing a form of test-time learning or in-context adaptation. This update
rule, while more complex than in prior RWKV iterations, gives RWKV7 greater expressivity and further
improves performance.
The motivation behind linear RNNs is to optimize the hidden state update for parallel training. Their
recurrent formulations now closely resemble those of linear attention mechanisms. Although their starting
points differ, they have evolved towards structurally similar designs, differing mainly in notation and specific
architectural choices.


2.3. State Space Model

The State Space Model (SSM) [257] is a classical mathematical framework from control theory for describing
dynamical systems that evolve over time. In this section, we systematically review its development and
applications in efficient sequence modeling.
From HiPPO Theory to Continuous-Time SSM. Unlike the development path of traditional control
theory [258], which progresses from state space models to spectral projections, deep state-space models
follow a reverse trajectory. According to Legendre-RNN [87] and HiPPO [88] theory, suppose we are given an
input function u(s), a set of orthogonal polynomial bases ϕn (t, s) satisfying
$\int_{-\infty}^{t} \phi_m(t,s)\,\phi_n(t,s)\,ds = \delta_{m,n}$, and an inner-product probability measure µ(t, s)
with density ω(t, s). We can then project u(s) onto the polynomial basis along the time dimension:

$$ \langle u, \phi_n \rangle_{\mu} = \int_{-\infty}^{t} u(s)\,\phi_n(t,s)\,\omega(t,s)\,ds \tag{9} $$

This process can also be viewed as a spectral transformation $\mathcal{K}u(s) = \int_{D} K(t,s)\,u(s)\,ds$ with kernel
$K_n = \phi_n(t,s)\,\omega(t,s)$ [91, 96]. According to Time-SSM [96], by adjusting the polynomial basis ϕn (t, s) and
inner product probability measure µ(t, s), various integral transforms, e.g., Gabor and Laplace transforms, can
be realized. Moreover, when ϕ is an orthogonal basis in closed recursive form (e.g., Legendre, Chebyshev,
Laguerre), collecting all coefficients of order n yields a continuous ODE system. Together
with the corresponding polynomial reconstruction process [96], this leads to the form of a time-invariant
continuous-time SSM formulated in LSSL [89] and S4 [90]:

x′ (t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t) (10)


This can also be regarded as parameterized maps that transform the input u(t) into an N-dimensional latent
space and project it onto the output y(t). The Du(t) term is typically regarded as a residual connection [259]
and is often omitted.
Discretization. Since real-world data is typically discrete, the continuous-time SSM parameters
(A, B) must be converted into their discrete counterparts (Ā, B̄) with step size ∆. Early approaches like
S4 [89, 90] employ Zero-Order Hold (ZOH) discretization (using the general solution of Eq. (10)), while later
studies like Mamba [101] found that a hybrid discretization (using the forward Euler method [88]
for B) leads to more compact and efficient representations.

$$ \text{(ZOH)}: \quad \bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right) \cdot \Delta B \tag{11} $$

$$ \text{(In Practice)}: \quad \bar{A} = \exp(\Delta A), \quad \bar{B} = \Delta B \tag{12} $$
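The two discretization rules in Eqs. (11) and (12) can be compared numerically. The sketch below assumes a diagonal state matrix (stored as a vector) and a single input channel so that all operations are element-wise; the function names and test values are illustrative assumptions.

```python
import numpy as np

def discretize_zoh(A_diag, B, delta):
    """ZOH for diagonal A: A_bar = exp(dA), B_bar = (dA)^{-1} (exp(dA) - 1) * dB."""
    dA = delta * A_diag
    A_bar = np.exp(dA)
    B_bar = (A_bar - 1.0) / dA * (delta * B)
    return A_bar, B_bar

def discretize_simplified(A_diag, B, delta):
    """Simplified rule: ZOH for A, forward Euler for B (B_bar = delta * B)."""
    return np.exp(delta * A_diag), delta * B

A_diag = -np.array([1.0, 2.0, 3.0, 4.0])   # negative diagonal keeps the system stable
B = np.array([1.0, 0.5, -0.5, 2.0])
x0 = np.array([1.0, -1.0, 0.5, 0.0])
u = 0.7                                     # input held constant over one step

# ZOH is exact for piecewise-constant input: compare with fine-grained Euler integration.
delta = 0.5
A_bar, B_bar = discretize_zoh(A_diag, B, delta)
x_zoh = A_bar * x0 + B_bar * u
x_ref, n_sub = x0.copy(), 20000
for _ in range(n_sub):
    x_ref = x_ref + (delta / n_sub) * (A_diag * x_ref + B * u)

# For small steps the two rules give nearly the same B_bar.
_, B_zoh_small = discretize_zoh(A_diag, B, 1e-3)
_, B_simple_small = discretize_simplified(A_diag, B, 1e-3)
```

The agreement for small ∆ follows from the expansion (exp(∆A) − I)(∆A)^{-1} = I + ∆A/2 + ..., so the simplified rule drops only higher-order terms in ∆.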

Diagonal SSMs. Early SSMs such as S4 and LSSL leveraged HiPPO theory for initialization, but
in practice the naive recursive calculation of the SSM kernel can be quite computationally intensive. Ideally,
when the matrix A is diagonal, the computation simplifies to exponentiating the diagonal elements only. S4 [90]
initializes the state matrix with a diagonal plus low-rank structure. Later, DSS [92] expressed it as a negative
diagonal matrix (since any real skew-symmetric matrix derived from HiPPO can be conjugate-diagonalized
into a complex matrix, and the process A = V^{-1} ΛV is equivalent to a coordinate transform [96]). In S4D [93],
this is empirically simplified to a real diagonal matrix, which can be regarded as a rough approximation of
the HiPPO-LegS matrix; S4D also introduces a method known as left half-plane control to ensure that
the diagonal elements of the matrix A^(D) remain negative. Subsequently, HTTYH [91] provided a theoretical
extension to the HiPPO framework to encompass the initialization scheme used in S4D. S5 [94] expands the
single-input/output (SISO [90]) SSM to multi-input/output (MIMO) systems and introduces the Blelloch parallel
scan [254] to accelerate recurrent computation. SpaceTime [95] enhances model expressiveness by augmenting
a diagonal state matrix with full-rank column vectors, effectively transforming it into a companion matrix.
In contrast, H3 [260] adopts a two-layer SSM design: it initializes the state matrix of the first SSM layer
as a shift matrix, and computes the diagonal parameters of the second SSM via key-value dot products.
This design realizes token shift operations and thereby strengthens the model’s ability to recall relevant
information. StableSSM [97] and Hipo-PTD [98] further discuss how to carry out more stable initialization.
Formally, these models can all be expressed as convolution operations during the recursive computation
process:

$$ \bar{K} = \left(C\bar{B},\ C\bar{A}\bar{B},\ \ldots,\ C\bar{A}^{L}\bar{B}\right), \qquad y = x * \bar{K} \tag{13} $$
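The equivalence between the recurrent and convolutional views in Eq. (13) can be checked with a short NumPy sketch. It assumes a diagonal discrete state matrix, a single input and output channel, and a kernel truncated to the input length; the input sequence is called `u` here to match Eq. (10), and all names are illustrative.

```python
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, u):
    """x_t = A_bar * x_{t-1} + B_bar * u_t, y_t = C . x_t (diagonal A_bar)."""
    x = np.zeros_like(A_bar)
    y = np.empty(len(u))
    for t, u_t in enumerate(u):
        x = A_bar * x + B_bar * u_t
        y[t] = C @ x
    return y

def ssm_kernel(A_bar, B_bar, C, length):
    """K_j = C A_bar^j B_bar for j = 0, ..., length - 1."""
    powers = A_bar[None, :] ** np.arange(length)[:, None]   # shape (length, N)
    return powers @ (C * B_bar)

def ssm_convolutional(A_bar, B_bar, C, u):
    """Causal convolution y_t = sum_j K_j u_{t-j}."""
    K = ssm_kernel(A_bar, B_bar, C, len(u))
    return np.convolve(u, K)[: len(u)]

rng = np.random.default_rng(0)
N, T = 4, 32
A_bar = np.exp(-rng.uniform(0.01, 1.0, size=N))   # entries in (0, 1): stable decay
B_bar = rng.normal(size=N)
C = rng.normal(size=N)
u = rng.normal(size=T)
y_rec = ssm_recurrent(A_bar, B_bar, C, u)
y_conv = ssm_convolutional(A_bar, B_bar, C, u)
```

The convolutional form is what allows training over the whole sequence at once (in practice via FFT), while the recurrent form gives constant-memory inference.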

Time-variant (Selective) SSM. Although initialization based on the HiPPO framework is theoretically
well-grounded, it results in system dynamics that remain the same at every time step. This uniformity
prevents the model from selectively attending to important information, as attention mechanisms do, thereby
limiting its expressive capacity. Based on the original formulation of S4 (with low-rank correction), Liquid-S4
[99] combines with a liquid time-constant network to obtain a new linear time-varying system. Mamba
[101], on the other hand, abandons HiPPO-based initialization and instead directly derives data-dependent
SSM parameter matrices through a linear projection layer. Moreover, it simplifies the computation into
element-wise multiplications to accelerate inference. Building upon this, Attraos [253] initializes the diagonal
matrix A with all-ones values to approximate a Lebesgue measure, and further proposes a multi-resolution
SSM based on piecewise orthogonal polynomial projection. Mamba-2 [102] further simplifies the state
matrix to a scalar and introduces a hardware-efficient algorithm for blockwise parallel computation.
Recently, inspired by closed-loop control theory, Comba [103] introduced two key improvements to
enhance the effectiveness of recurrent sequence modeling. First, it replaces the scalar state transition in
Mamba2 with a scalar-plus-low-rank (SPLR) matrix, updated via a delta learning mechanism [248]. This
matrix-based formulation improves the expressiveness of the transition dynamics and, interpreted as a form
of Householder transformation [261, 262], introduces supervised memory management. This mitigates
memory conflicts typically found in recurrent models that rely on fixed-size hidden states. Furthermore, the
SPLR structure naturally supports negative eigenvalues, which further increases representational capacity. A
similar formulation also arises in Longhorn [100], which extends Mamba to L2-based objectives. Second,
Comba incorporates an output correction mechanism that introduces a learnable scalar to connect queries
and keys. From the perspective of neural memory, this design ensures that the value is stored with high
fidelity and can be accurately retrieved by the corresponding query. The output correction mechanism plays a
central role in enabling this functionality. Moreover, Comba improves computational efficiency by eliminating
unnecessary matrix inversions, which allows for the use of a highly optimized Triton kernel, resulting in
significantly faster execution.

2.4. Test-Time-Training RNN

Although models like Comba [103], (Gated)-DeltaNet [76, 77], and RWKV7 [83] can also be formulated
in an SGD-like manner for test-time training, a range of models such as TTT [104] offer a more explicit
formulation. These models treat the model’s state matrices directly as fast-adapting weights, which are
updated through a learnable optimizer. From this perspective, the model is no longer restricted to a fixed
linear or bilinear kernel but can instead leverage more advanced optimization algorithms to gain stronger
expressive power.
In terms of formulation, this class of models abandons closed-form recurrent representations and instead
expresses the model directly in an online learning paradigm, similar to early meta-learning [263, 264] and
fast weight programming [265, 266]. For the general form:

St = αt St−1 − ηt ∇S ℓ (St−1 ; kt , vt ) (14)

Initially, TTT [104] still employed an SGD-based optimizer for updates and introduced a deep state
based on a two-layer MLP, which has been widely adopted by subsequent models. It is worth noting that
this type of model typically applies LayerNorm to the state or its output to stabilize gradients and the
distribution of the memory stored in the state. Titans [105] introduces a first-order momentum term to
represent long-term memory and momentary surprise. Lattice [106] introduced a mechanism designed
for linear models that can compress information into a limited number of memory slots efficiently and can
update exclusively with information that is orthogonal to its current state, ensuring that only new and
non-redundant data is written into memory and reducing the memory interference. Miras [107] presents a
general framework for explaining the role of standard model architectural choices, including: (1) associative
memory architecture, (2) attentional bias objective, (3) retention gate, and (4) memory learning algorithm,
and introduces three novel sequence model variants based on the guidance of this framework, outperforming
traditional Transformers and other modern linear recurrent models on language modeling and recall intensive
tasks. Atlas [108] adopts higher-order feature mappings to enhance memory capacity, and proposes
the Omega rule together with the Muon optimizer [267] for memory updating, which prove effective on
language modeling and downstream tasks. LaCT [190] further replaces the loss function with a dot product
operation and employs large chunks of gradient descent to improve computational efficiency.
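The general update in Eq. (14) can be sketched for the simplest case of a linear memory with a squared reconstruction loss. This is an assumption made for brevity: the models above typically use a deeper state (for example a two-layer MLP) and richer optimizers, and the function names here are hypothetical.

```python
import numpy as np

def loss(S, k, v):
    """Reconstruction loss l(S; k, v) = 0.5 * ||S k - v||^2."""
    r = S @ k - v
    return 0.5 * float(r @ r)

def grad_loss(S, k, v):
    """Gradient of the loss with respect to S: (S k - v) k^T."""
    return np.outer(S @ k - v, k)

def ttt_step(S, k, v, alpha=1.0, eta=0.1):
    """One online step of Eq. (14): S_t = alpha * S_{t-1} - eta * grad l(S_{t-1}; k_t, v_t)."""
    return alpha * S - eta * grad_loss(S, k, v)

rng = np.random.default_rng(0)
S0 = rng.normal(size=(3, 4))
k = rng.normal(size=4)
k = k / np.linalg.norm(k)              # unit-norm key keeps eta = 0.1 a safe step size
v = rng.normal(size=3)

S1 = ttt_step(S0, k, v, alpha=1.0, eta=0.1)
loss_before, loss_after = loss(S0, k, v), loss(S1, k, v)

# Finite-difference check of one gradient entry.
eps = 1e-6
E = np.zeros_like(S0)
E[0, 1] = eps
numeric = (loss(S0 + E, k, v) - loss(S0 - E, k, v)) / (2 * eps)
analytic = grad_loss(S0, k, v)[0, 1]
```

With α = 1 this step coincides with the delta rule, which is why DeltaNet-style models can be read as one step of SGD on this loss at test time.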

2.5. Unified Linear Sequence Modeling

While linear attention, linear RNNs, state space models, and test-time-training RNNs have traditionally
followed separate lines of development, recent studies [65, 110] have begun to integrate these methods
within a unified theoretical framework. In this paper, we present a consolidated view of these approaches,
focusing on their memory update rules and optimization strategies.

2.5.1. A Memory Perspective

In this part, we adopt the conceptual framework proposed in Comba [103], wherein "linear" specifically
refers to models whose dynamics with respect to key-value associative memory are linear.
Linear Update Rule. In contrast to autoregressive Transformers [62], which preserve all contextual
information explicitly in a key-value (KV) cache, linear RNNs distill higher-level representations into a
fixed-size hidden state to enhance generalization. From a metaphysical perspective, this transition can be
viewed as a shift from transductive to inductive attention [103, 270], reflecting the notion that compression
is intelligence [271]. Structurally, they bear similarities to energy-based models [272] such as Hopfield
networks [273, 274] and neural systems employing Hebbian learning principles [247]. Early instances,
including Linformer [275], S4 [90], and RetNet [245], did not incorporate sufficiently adaptive, data-driven
memory management, which limited their performance relative to models utilizing softmax attention. More
recent approaches, exemplified by Mamba [101] and GLA [71], have mitigated this limitation by adopting
dynamic, projection-based gating mechanisms, thereby achieving notable gains. From a formal perspective,
these architectures can be regarded as linear register systems equipped with key-value associative memory:


Table 1: A comparative overview of various linear sequence modeling approaches in terms of their memory update rules
and optimization objectives. For reference, we also include the principles of softmax attention and sliding-window attention
mechanisms. Broadly speaking, linear attention, linear RNNs, and state-space models follow a convergent developmental
trajectory, evolving through mutual influence. This trajectory reflects a shift from data-independent to data-dependent
gating mechanisms, and from L1-based to L2-based optimization objectives. In contrast, the development of TTT models
has been primarily guided by advances in modern optimization algorithms.

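| Type | Model | Memorizing with Gate | Optimization Objective $L_{\min}$ |
|---|---|---|---|
| Softmax Attn | SA [62] | $S_t = S_{t-1}.\mathrm{append}(k_t, v_t)$ | $\sum_{i=1}^{t} \exp(q_t^{\top} k_i)\,\lVert v - v_i \rVert^2$ [268] |
| | SWA [119] | $S_t = S_{t-1}.\mathrm{append}(k_t, v_t).\mathrm{drop}(k_{t-M}, v_{t-M})$ | $\sum_{i=t-M}^{t} \exp(q_t^{\top} k_i)\,\lVert v - v_i \rVert^2$ |
| Linear Attn | LA [64] | $S_t = S_{t-1} + v_t k_t^{\top}$ | $\langle S_t k_t, v_t \rangle$ |
| | Lightning Attn [70] | $S_t = \alpha S_{t-1} + v_t k_t^{\top}$ | $\langle S_t k_t, v_t \rangle + \frac{\alpha}{2} \lVert S_t \rVert_2^2$ |
| | RetNet [245] | $S_t = \alpha S_{t-1} + v_t k_t^{\top}$ | $\langle S_t k_t, v_t \rangle + \frac{\alpha}{2} \lVert S_t \rVert_2^2$ |
| | GLA [71] | $S_t = S_{t-1}\,\mathrm{diag}(\alpha_t) + v_t k_t^{\top}$ | $\langle S_t k_t, v_t \rangle + \frac{\alpha_t}{2} \lVert S_t \rVert_2^2$ |
| | MetaLA [246] | $S_t = S_{t-1}\,\mathrm{diag}(\alpha_t) + v_t (1 - \alpha_t)^{\top}$ | $\langle S_t (1 - \alpha_t), v_t \rangle + \frac{\mathrm{diag}(\alpha_t)}{2} \lVert S_t \rVert_2^2$ |
| | GSA [72] | $K_t = K_{t-1}\,\mathrm{diag}(\alpha_t) + k_t (1 - \alpha_t)^{\top}$ (same as $v$) | $\langle K_t (1 - \alpha_t), k_t \rangle + \frac{\mathrm{diag}(\alpha_t)}{2} \lVert K_t \rVert_2^2$ |
| | DeltaNet [76] | $S_t = S_{t-1}\left(I - \beta_t k_t k_t^{\top}\right) + \beta_t v_t k_t^{\top}$ | $\beta_t \lVert v_t - S_t k_t \rVert^2$ |
| | Gated-DeltaNet [77] | $S_t = S_{t-1}\,\alpha_t^{\sim 1}\left(I - \beta_t k_t k_t^{\top}\right) + \beta_t v_t k_t^{\top}$ | $\beta_t \left\lVert \frac{1}{\alpha_t} v_t - \alpha_t S_t k_t \right\rVert^2 + \frac{\alpha_t}{2} \lVert S_t \rVert_2^2$ |
| | Delta-Product [269] | $S_t = S_{t-1} \prod_{i=1}^{n}\left(I - \beta_{t,i} k_{t,i} k_{t,i}^{\top}\right) + \sum_{i=1}^{n} \prod_{i=1}^{n}\left(I - \beta_{t,i} k_{t,i} k_{t,i}^{\top}\right) \beta_{t,i} v_{t,i} k_{t,i}^{\top}$ | $\sum_{i=1}^{n} \beta_{t,i} \lVert v_{t,i} - S_{t,i} k_{t,i} \rVert^2$ |
| | MesaNet [109] | $S_t = \alpha_t S_{t-1} + \beta_t v_t k_t^{\top}$, $H_t = \alpha_t H_{t-1} + \beta_t k_t k_t^{\top}$ | $\sum_{i=1}^{t} \beta_t \lVert v_t - S_t k_t \rVert^2$ |
| Linear RNN | HGRN2 [80] | $S_t = S_{t-1}\,\mathrm{diag}(\alpha_t) + v_t (1 - \alpha_t)^{\top}$ | $\langle S_t (1 - \alpha_t), v_t \rangle + \frac{\alpha_t}{2} \lVert S_t \rVert_2^2$ |
| | RWKV6 [82] | $S_t = S_{t-1}\,\mathrm{diag}(\alpha_t) + v_t k_t^{\top}$ | $\langle S_t k_t, v_t \rangle + \frac{\alpha_t}{2} \lVert S_t \rVert_2^2$ |
| | RWKV7 [83] | $S_t = S_{t-1}\left(\mathrm{diag}(\alpha_t) - \beta_t \hat{k}_t \hat{k}_t^{\top}\right) + v_t \tilde{k}_t^{\top}$ | $\beta_t \left\lVert \frac{1}{\beta_t} v_t - S_t k_t \right\rVert^2 + \frac{\alpha_t}{2} \lVert S_t \rVert_2^2$ |
| SSM | S4 [90] | $S_t = S_{t-1}\,\mathrm{diag}(\alpha) + v_t b^{\top} \in \mathbb{C}$ | $\langle S_t b, v_t \rangle + \frac{\alpha}{2} \lVert S_t \rVert_2^2$ |
| | Mamba [101] | $S_t = S_{t-1}\,\mathrm{diag}(\alpha) + \beta_t^{\sim 0} v_t k_t^{\top}$ | $\beta_t \langle S_t k_t, v_t \rangle + \frac{\alpha_t}{2} \lVert S_t \rVert_2^2$ |
| | Mamba2 [102] | $S_t = \alpha_t^{\sim 1} S_{t-1} + \beta_t^{\sim 0} v_t k_t^{\top}$ | $\beta_t \langle S_t k_t, v_t \rangle + \frac{\alpha_t}{2} \lVert S_t \rVert_2^2$ |
| | Longhorn [100] | $S_t = S_{t-1}\left(I - \frac{\beta_t}{1 + \beta_t k_t^{\top} k_t} k_t k_t^{\top}\right) + \beta_t v_t k_t^{\top}$ | $\beta_t \lVert v_t - S_t k_t \rVert^2$ |
| | Comba [103] | $S_t = S_{t-1}\left(\alpha_t^{\sim 1} - \beta_t^{\downarrow} k_t k_t^{\top}\right) + \beta_t^{\uparrow} v_t k_t^{\top}$ | $\beta_t \lVert v_t - S_t k_t \rVert^2 + \frac{\alpha_t}{2} \lVert S_t \rVert_2^2 + \langle q_t, d k_t \rangle$ |
| TTT | TTT-MLP [104] | $S_t(\cdot) = S_{t-B}(\cdot) - \sum_{i=1}^{B} \beta_i \nabla_S L(S_{t-1}, k_t, v_t)$ | $\beta_i \lVert v_i - \psi(S_j(k_i)) \rVert^2$ |
| | MIRAS [107] | $S_t = \alpha_t S_{t-1} - \beta_t \nabla_S L(g, M_{t-1}, k_t, v_t)$ | $\beta_t \lVert v_t - g(\psi(S_t), k_t) \rVert_p^p + \frac{\alpha_t}{2} \lVert S_t \rVert_2^2$ |
| | Titans [105] | $M_t = \gamma_t M_{t-1} + S_t$, $S_t = \alpha_t S_{t-1} - \beta_t \nabla_M L(M_{t-1}, k_t, v_t)$ | $\beta_t \lVert v_t - \psi(S_t), k_t \rVert_2^2 + \frac{\alpha_t}{2} \lVert S_t \rVert_2^2$ |
| | Atlas [108] | $M_t = \gamma_t M_{t-1} - \eta_t\,\mathrm{Muon}(S_t)$, $S_t = \alpha_t S_{t-1} - \beta_t \nabla_M L(M_{t-1}, k_t, v_t)$ | $\beta_t \lVert v_t - \psi(S_t), k_t \rVert_2^2 + \frac{\alpha_t}{2} \lVert S_t \rVert_2^2$ |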


information is written through learnable forgetting and input gates (denoted as α and β) and subsequently
accessed by query-based retrieval:

St = (αt , β t )@(St−1 , k⊺t vt ) (Write), ot = St qt (Read) (15)
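The write and read operations of Eq. (15) can be sketched in NumPy together with the equivalent masked matrix form used for parallel training. The sketch assumes scalar gates α_t and β_t and stores the association as v_t k_t^⊤ (the convention of Table 1), so that the read o_t = S_t q_t has the dimension of the values; the function names are illustrative.

```python
import numpy as np

def write_read_recurrent(q, k, v, alpha, beta):
    """S_t = alpha_t S_{t-1} + beta_t v_t k_t^T (write); o_t = S_t q_t (read)."""
    T, d_k = k.shape
    S = np.zeros((v.shape[1], d_k))
    out = np.zeros((T, v.shape[1]))
    for t in range(T):
        S = alpha[t] * S + beta[t] * np.outer(v[t], k[t])
        out[t] = S @ q[t]
    return out

def write_read_parallel(q, k, v, alpha, beta):
    """Equivalent masked form: o = ((Q K^T) * D * beta) V with a causal decay mask D."""
    log_a = np.cumsum(np.log(alpha))
    # D[t, s] = alpha_{s+1} * ... * alpha_t for s <= t, and 0 for s > t.
    D = np.tril(np.exp(log_a[:, None] - log_a[None, :]))
    return ((q @ k.T) * D * beta[None, :]) @ v

rng = np.random.default_rng(0)
T, d = 12, 4
q, k, v = rng.normal(size=(3, T, d))
alpha = rng.uniform(0.5, 1.0, size=T)   # forgetting gates
beta = rng.uniform(0.0, 1.0, size=T)    # input gates
o_rec = write_read_recurrent(q, k, v, alpha, beta)
o_par = write_read_parallel(q, k, v, alpha, beta)
```

The recurrent form costs constant memory per step, whereas the parallel form materializes a T × T matrix; chunk-wise algorithms interpolate between the two.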

Bilinear Update Rule. From the perspective of neural memory systems [276], achieving effective memory
regulation remains a fundamental challenge. In contrast to the Hebbian learning principle [247], which
depends on reinforcement-driven updates, the Delta learning rule [248] emphasizes supervised control of
memory traces. These systems are linear with respect to state and input individually, but nonlinear overall
due to the product term (e.g., Sk). They are regarded as a special class of nonlinear systems that preserve
controllability [277, 278, 279, 280]. For a general formulation:

St = (αt , β t )@(St−1 , k⊺t vt ) + γt St−1 kt (Write), ot = St qt (Read) (16)

where Sk is the bilinear term in models such as Comba [103], (Gated)-DeltaNet [76, 77], and RWKV7 [83].
Alternatively, under the constraint that the spectral radius of the state transition matrix remains less than 1,
it can also adopt an interaction between S and v, similar to that in Lattice [106]. Under the bilinear update
rule, these models can be computed efficiently on GPUs through chunk-wise parallelism.
Nonlinear Update Rule. Apart from early models such as LSTM [281] and GRU [282], the series of
models discussed in §2.4 (e.g., TTT) can be regarded as modern versions of nonlinear RNNs. These models
employ nonlinear operations for state memory (such as nonlinear activations in a two-layer MLP memory),
resulting in inherently nonlinear dynamics. While this provides theoretically stronger expressiveness, these
models are constrained by block-wise parallel computation and can only be updated via minibatch gradient
descent, which leads to very low hardware utilization (less than 5%). A potential solution is to follow LaCT
[190] by adopting large-batch gradient descent, combined with sliding-window attention for intra-batch
information learning as a hybrid architecture.

2.5.2. An Optimizer Perspective

Another perspective for unified linear sequence modeling is from an optimizer angle, which was initially
proposed in TTT [104], Titans [105] and Test-time Regression [268]. We further classify the methods
according to different forms of objective loss as below.
Local L1 Loss. Early models like linear attention [275] used an L1 loss to optimize the model in a single
step, but this update scheme is unbounded, making it easy for S to suffer from memory conflicts. Subsequently,
models such as RetNet [245], Mamba [101], and GLA [71] introduced memory gating mechanisms, which
can be seen as a form of L2 regularization.
$$ \hat{S}_t = \arg\min_{S} \frac{1}{2}\left[ \beta_t \left| v_t - S_t k_t \right| + \mathrm{Tr}\left(S_t^{\top} \Lambda S_t\right) \right] \tag{17} $$

Local L2 Loss. However, an L1 loss typically cannot directly enforce v = Sk, and thus cannot achieve
precise key-value associative memory retrieval. A better choice is the squared loss, which provides stronger
memory management capabilities. Such models also include an L2 regularization term on the memory states.

$$ \hat{S}_t = \arg\min_{S} \frac{1}{2}\left[ \beta_t \lVert v_t - S_t k_t \rVert^2 + \mathrm{Tr}\left(S_t^{\top} \Lambda S_t\right) \right] \tag{18} $$


Multi-step L2 Loss. Models such as Delta-Product [269] apply a sequence of Householder transformations
within a single timestep, effectively treating the latent update as a composition of structured orthogonal
operators. This allows the training objective to be interpreted as a form of multi-step L2 loss, where the
model is supervised not just on immediate transitions but on future latent states over multiple horizons.
In this formulation, a single-step L2 loss encourages the model to approximate simple reflections, which
are the most basic form of orthogonal transformation [103]. In contrast, a multi-step L2 loss allows the model
to approximate general orthogonal matrices through compositions of reflections, thereby capturing richer
linear dynamics such as rotations, permutations, or shears. This property significantly improves performance
on state-tracking tasks like S5 [283]. Specifically, Delta-Product formalizes the following objective:
$$ \hat{S}_t = \arg\min_{S} \frac{1}{2} \sum_{i=1}^{n} \left[ \beta_{t,i} \lVert v_{t,i} - S_{t,i} k_{t,i} \rVert^2 \right] \quad (\text{Step by Step}) \tag{19} $$
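The claim that compositions of reflections reach transformations a single reflection cannot is easy to verify numerically. The sketch below builds the delta-rule transition matrix I − β k k^⊤ with a unit-norm key and β = 2 (a Householder reflection) and composes two of them in two dimensions; the specific vectors and the angle are arbitrary choices for illustration.

```python
import numpy as np

def householder(k, beta=2.0):
    """Transition matrix I - beta * k k^T for a unit-normalized key k."""
    k = k / np.linalg.norm(k)
    return np.eye(len(k)) - beta * np.outer(k, k)

phi = np.pi / 6
k1 = np.array([1.0, 0.0])
k2 = np.array([np.cos(phi), np.sin(phi)])

H1, H2 = householder(k1), householder(k2)
R = H2 @ H1                                  # two reflections applied in sequence

det_single = np.linalg.det(H1)               # a reflection flips orientation
det_double = np.linalg.det(R)                # the composition preserves it
rotation = np.array([[np.cos(2 * phi), -np.sin(2 * phi)],
                     [np.sin(2 * phi),  np.cos(2 * phi)]])
```

Here the product of two reflections whose normals differ by an angle φ is a rotation by 2φ, which no single matrix of the form I − β k k^⊤ with β = 2 can represent.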

Global L2 Loss. Inspired by Recursive Least Squares (RLS), MesaNet [109] defines a global L2 optimization
for the memory state, enabling the model to take the global optimum into account at each update
step.

$$ \hat{S}_t = \arg\min_{S} \frac{1}{2}\left[ \sum_{i=1}^{t} \left( \gamma_i \lVert v_i - S k_i \rVert^2 \right) + \mathrm{Tr}\left(S^{\top} \Lambda S\right) \right] \tag{20} $$

By deriving the analytical solution of this formula, two linear recursions can be used for fast computation.
To further improve computational efficiency, the conjugate gradient method is applied for correction at the model
output instead of directly computing the matrix inverse.
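A minimal NumPy sketch of this idea follows. It assumes an isotropic regularizer Λ = λI and no decay, so that the minimizer of Eq. (20) is S = G (H + λI)^{-1} with G and H accumulated by two linear recursions; the read-out then solves a linear system with conjugate gradient rather than inverting the matrix. The names `mesa_readout` and `conjugate_gradient` are hypothetical, and this is not MesaNet's actual kernel.

```python
import numpy as np

def conjugate_gradient(M, b, iters=50, tol=1e-20):
    """Solve M x = b for symmetric positive-definite M without forming M^{-1}."""
    x = np.zeros_like(b)
    r = b.copy()                      # residual b - M x at x = 0
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        if rs < tol:
            break
        Mp = M @ p
        step = rs / (p @ Mp)
        x = x + step * p
        r = r - step * Mp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def mesa_readout(keys, values, gammas, q, lam=1.0):
    """Accumulate G = sum g v k^T and H = sum g k k^T, then return G (H + lam I)^{-1} q."""
    d_k = keys.shape[1]
    G = np.zeros((values.shape[1], d_k))
    H = np.zeros((d_k, d_k))
    for k_i, v_i, g_i in zip(keys, values, gammas):
        G += g_i * np.outer(v_i, k_i)      # first linear recursion
        H += g_i * np.outer(k_i, k_i)      # second linear recursion
    x = conjugate_gradient(H + lam * np.eye(d_k), q)
    return G @ x, G, H

rng = np.random.default_rng(0)
keys = rng.normal(size=(10, 4))
values = rng.normal(size=(10, 3))
gammas = rng.uniform(0.5, 1.0, size=10)
q = rng.normal(size=4)
o, G, H = mesa_readout(keys, values, gammas, q, lam=1.0)

# Reference: closed-form minimizer and the gradient of the objective at that point.
S_star = G @ np.linalg.inv(H + np.eye(4))
grad_at_S_star = S_star @ H - G + S_star
```

Stopping the conjugate gradient loop after fewer iterations trades accuracy for compute, which is one way to adjust the cost of the read-out at inference time.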

2.6. Linearization

The Transformer-based architecture used in modern LLMs suffers from quadratic computational complexity
and growing memory usage. In contrast, linear recurrent architectures offer linear
computational complexity and constant memory usage, addressing the inefficiency of the Transformer
architecture. Nevertheless, pretraining LLMs from scratch requires substantial
computational and financial resources, which remains a significant barrier to their widespread adoption and
real-world deployment. In this case, linearization is a more practical choice: it converts
a pretrained Transformer-based LLM into a linear recurrent structure at a small training and
fine-tuning cost, while restoring the capabilities of the original model in natural language understanding and
generation.
Finetuning-based Linearization. Finetuning-based linearization directly replaces standard softmax
attention with linear sequence modeling in a pre-trained transformer, then finetunes the modified model for
architecture adaptation without relying on knowledge from external models. Transformer-to-RNN (T2R) [111]
introduces a method to convert a pretrained transformer to an RNN with linear-time computation and
constant memory complexity. T2R follows a swap-then-finetune procedure to align the attention computation
of a pretrained transformer, then finetunes the converted RNN for architecture adaptation. Mao [284]
investigates various update rule configurations to finetune pretrained autoregressive Transformers into RNNs
and proposes a simple element-wise decay fast weights update rule with no feature map for linearizing
transformers. DiJiang [285] leverages linear attention with the weighted Quasi-Monte Carlo method for
efficient sampling. Based on the Discrete Cosine Transform (DCT) kernelization process, DiJiang significantly
reduces the training cost of transforming a pre-trained vanilla Transformer into a linear-complexity
model. SUPRA [113] introduces a linearization technique for converting large-scale pretrained softmax-
attention transformers into RNNs by replacing the softmax operation with GroupNorm and using a small MLP
for the feature mapping of queries and keys. Liger [115] proposes a novel linearization technique for converting
Transformer-based LLMs to gated recurrent structures by fully reusing the original pretrained model weights without
introducing any extra learnable parameters, thus simplifying the linearization process to a single-stage
end-to-end training without relying on distillation and significantly reducing the linearization cost.

Figure 5: Mechanism Comparison of Finetuning-based and Distillation-based Linearization Procedures.
(a) Finetuning-based Linearization, trained with a next-token prediction loss; (b) Distillation-based
Linearization, trained with a distillation loss between teacher outputs and student outputs.

Distillation-based Linearization. Distillation-based linearization utilizes knowledge distillation [286] to
transfer the capabilities of a pre-trained transformer with standard softmax attention, acting as a teacher
model, to a student model with linear sequence modeling. LoLCATs [114] proposes a simple two-step
linearization process, Low-rank Linear Conversion via Attention Transfer, which significantly improves
linearization quality, training efficiency, and scalability with low-rank adaptation (LoRA). Llamba [287]
demonstrates the effectiveness and scalability of cross-architecture distillation on the Llama-3.x series
using MOHAWK [288]. LightTransfer [289] transforms standard transformers into a hybrid model architecture
by identifying lazy layers and replacing their full attention with efficient streaming attention,
demonstrating throughput efficiency and effectiveness on long-context tasks with minimal performance loss.
MOHAWK [288] proposes a distillation procedure for converting Transformers to Mamba. Through three-stage
progressive training, namely 1) matrix orientation, 2) hidden state alignment, and 3) weight transfer and
knowledge distillation, MOHAWK distills the Transformer architecture by matching different degrees of
granularity in the SSM. MambaInLlama [112] adopts a similar multi-stage training procedure to obtain a Mamba
model initialized from a large Transformer model, combining cross-architecture distillation with a
hardware-aware speculative decoding algorithm that further accelerates the inference of Mamba and its hybrid
models. Lizard [290], similar to Liger [115], introduces a gating mechanism into the linearization process,
combined with sliding window attention with meta memory, and designs a hardware-efficient factorization
algorithm that incorporates log-gating values into the query and key projections for training stability.
LoLA [193] integrates a sparse caching technique into linear attention, further improving language modeling
and long-context ability after model architecture linearization. Yeonju et al. [291] propose dual-state
linear attention (DSLA), which maintains long-term historical information while tracking short-term recency,
and introduce an online adaptive distillation system for linearizing quadratic-complexity self-attention
layers into linear-cost DSLA layers.

Figure 6: Hardware-efficient Implementation Algorithms for Linear Sequence Modeling. Panels: (a) Blelloch
Scan Algorithm; (b) Chunkwise Parallel for Training; (c) Recurrent for Inference. (a) In the Blelloch parallel scan,
the upward sweep computes and stores the outputs at even indices; the downward sweep then reuses these stored values to
compute the outputs at odd indices. (b) To better exploit tensor-core matmul acceleration during training and prefilling,
linear-recurrence models can adopt intra-block parallel computation combined with inter-block recursion. (c) During
inference, the recurrent formulation supports decoding with constant memory and O(1) per-step computational complexity.

Linearization in the Era of RL Scaling. As R1-like large reasoning models [29, 58,
292, 293] have demonstrated clear advantages on reasoning tasks, a natural question
arises: Can we boost linear recurrent models’ reasoning performance by scaling test-time computation
through long CoT?
A promising direction involves leveraging existing transformer-based reasoning models to create lin-
earized LLMs—models that retain strong reasoning capabilities while benefiting from the efficiency of linear
sequence modeling. This transformation is typically achieved through architectural linearization followed by
SFT and RL to enhance test-time reasoning performance. Daniele et al. [294] explore this by distilling both
pure and hybrid Mamba models from large-scale LLaMA models, investigating their reasoning scalability at
test time. Similarly, M1 [295] introduces a hybrid large reasoning model built on the Mamba architecture,
achieved by distilling and fine-tuning a linearized transformer. Their results highlight the potential of linear
recurrent architectures in supporting efficient and scalable reasoning.

2.7. Hardware-efficient Implementation

One of the key reasons for the rapid development and adoption of linear sequence modeling methods is the
availability of hardware-efficient implementations. Many approaches provide optimized code using tools
such as Triton or CUDA to achieve efficient computation on modern GPUs. In the following, we review and
summarize these hardware-efficient implementation methods.

Figure 7: Example Patterns of Static Sparse Attention. Panels: Full Attention, Global Attention, Window
Attention, Dilated Attention, Random Attention, ETC, and BigBird. ETC and BigBird are representative examples
of mixed static sparse attention mechanisms, which combine multiple fixed sparsity patterns within a single
attention framework.

Fast Recurrent. Early linear sequence modeling methods often leverage the properties of structured
matrices for acceleration. For example, state-space models like S4 [90] can be written in the form of
convolutions to enable fast computation. Models like Mamba [101] use the Blelloch scan algorithm [94, 254]
to achieve fast recursion. Specifically, the Blelloch scan operates as a tree-like structure. During the
computation, the even positions in the sequence are first processed via an upward scan, which serves as
an intermediate step. Then, during the downward scan, the odd positions of the sequence are computed.
This computational approach has recently inspired works such as multi-scale state-space models [253] and
log-linear attention mechanisms [251, 252].
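
The tree-structured scan described above can be illustrated on a scalar linear recurrence x_t = a_t x_{t-1} + b_t. The following is a minimal sketch, not the kernel used by Mamba: the function names are hypothetical, the recursion runs sequentially on the CPU, and a real implementation would execute each tree level in parallel. With zero-based indexing, the up-sweep produces the prefixes at odd indices (the even positions when counting from one), and the down-sweep fills in the rest.

```python
def combine(first, second):
    """Compose two affine maps x -> a*x + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def tree_scan(elems):
    """Inclusive prefix composition of affine maps via a recursive up/down sweep."""
    n = len(elems)
    if n <= 1:
        return list(elems)
    # Up-sweep: merge neighbouring pairs (0,1), (2,3), ... into half as many elements.
    pairs = [combine(elems[2 * i], elems[2 * i + 1]) for i in range(n // 2)]
    scanned = tree_scan(pairs)  # scanned[i] is the prefix ending at index 2*i + 1
    # Down-sweep: reuse the stored prefixes to fill in the remaining positions.
    out = [None] * n
    out[0] = elems[0]
    for i in range(n // 2):
        out[2 * i + 1] = scanned[i]
    for i in range(1, (n + 1) // 2):
        out[2 * i] = combine(scanned[i - 1], elems[2 * i])
    return out


def linear_recurrence_scan(a, b):
    """Solve x_t = a_t * x_{t-1} + b_t with x_{-1} = 0 using the scan."""
    return [offset for _, offset in tree_scan(list(zip(a, b)))]


def linear_recurrence_loop(a, b):
    """Sequential reference implementation of the same recurrence."""
    x, out = 0, []
    for a_t, b_t in zip(a, b):
        x = a_t * x + b_t
        out.append(x)
    return out
```

The scan works because composing affine maps is associative, so prefixes can be grouped in any order; the same idea carries over when a_t and b_t are matrices or vectors.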
Chunk-wise Parallel. Although linear RNNs achieve a favorable pretraining time complexity of $O(Nd^2)$,
they are often slower than softmax attention, which has a complexity of $O(N^2 d)$, on shorter sequences. This
is primarily due to the fact that current hardware is highly optimized for matmul operations, which limits the
efficiency of linear recurrence and necessitates additional training optimizations. Recent methods inspired by
FlashAttention [142], including Lightning Attention [39, 70, 244], GLA [71], and Mamba2 [102], introduce
inter-chunk recurrence combined with intra-chunk parallelism to fully leverage matrix compute throughput.
These techniques optimize computation graphs and algorithms, ensuring efficient parallel execution across
data chunks, which significantly enhances model performance on modern hardware. A basic formulation
using chunk size C can be expressed as:

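$$S_{[t+1]} = S_{[t]} + V_{[t]}^{\top} K_{[t]} \in \mathbb{R}^{D \times D}, \qquad O_{[t]} = Q_{[t]} S_{[t]}^{\top} + \left( Q_{[t]} K_{[t]}^{\top} \odot \mathrm{Mask}_{[t]} \right) V_{[t]} \in \mathbb{R}^{C \times D} \tag{21}$$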
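
As a sanity check on Eq. (21), the sketch below compares a token-by-token recurrence with the chunk-wise form on small integer matrices. It is a plain-Python illustration under simplifying assumptions (no decay or gating, no normalization, no feature map); the helper names are hypothetical and nothing here reflects how optimized Triton kernels are written.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def recurrent_linear_attention(Q, K, V):
    """Token-by-token form: S += v^T k, then o = q S^T."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0] * d_k for _ in range(d_v)]
    out = []
    for q, k, v in zip(Q, K, V):
        S = add(S, matmul(transpose([v]), [k]))
        out.append(matmul([q], transpose(S))[0])
    return out


def chunkwise_linear_attention(Q, K, V, C):
    """Chunk-wise form of Eq. (21) with chunk size C."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0] * d_k for _ in range(d_v)]  # state carried across chunks
    out = []
    for start in range(0, len(Q), C):
        Qc, Kc, Vc = Q[start:start + C], K[start:start + C], V[start:start + C]
        # Inter-chunk term: contribution of all previous chunks through S.
        inter = matmul(Qc, transpose(S))
        # Intra-chunk term: causal (lower-triangular) attention inside the chunk.
        scores = matmul(Qc, transpose(Kc))
        masked = [[s if j <= i else 0 for j, s in enumerate(row)]
                  for i, row in enumerate(scores)]
        intra = matmul(masked, Vc)
        out.extend(add(inter, intra))
        # State update: S_{[t+1]} = S_{[t]} + V^T K.
        S = add(S, matmul(transpose(Vc), Kc))
    return out
```

Both functions return the same outputs for any chunk size; the chunk-wise version simply replaces C sequential rank-one updates with a few dense matrix products, which is what makes it matmul-friendly.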

The open-source framework Flash Linear Attention [296] offers a collection of chunk-wise parallel Triton
kernels specifically designed for linear sequence modeling, with a strong emphasis on hardware efficiency.
These implementations not only accelerate the computation of linear attention mechanisms but also enable
easy integration into existing model training and inference pipelines. By making such optimized kernels
publicly available, this framework lowers the barrier for researchers and practitioners to experiment with
and deploy efficient linear attention variants, promoting further innovation within the community.


3. Sparse Sequence Modeling


Sparse sequence modeling is a paradigm designed to enhance the efficiency of processing sequential data
by limiting interactions between elements to a strategically chosen subset. A typical example is the sparse
attention mechanism in transformer models, which addresses the computational bottlenecks of traditional
full attention methods while aiming to preserve modeling performance.

3.1. Static Sparse Attention

Unlike full self-attention, which scales quadratically with sequence length, static sparse attention reduces
computational complexity by restricting each token to attend to a predefined, fixed subset of other tokens.
These static patterns are designed before training and remain unchanged during inference, making them
highly efficient and easy to deploy. Common structural patterns include global, window, strided, dilated,
random and blockwise sparsity. Such approaches have been widely adopted in natural language, vision, and
multimodal models due to their strong inductive biases and scalability [297].
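
To make these fixed patterns concrete, the following sketch builds a boolean attention mask that combines a local window, a set of global tokens, and an optional dilated pattern. The function name and its parameters are hypothetical and chosen only for illustration; the mask is bidirectional (encoder-style) and is materialized densely, which a practical implementation would avoid.

```python
def static_sparse_mask(n, window=1, global_tokens=(), dilation=None):
    """Return an n x n boolean mask; mask[i][j] is True if token i may attend to token j."""
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Window pattern: attend to neighbours within `window` positions.
            local = abs(i - j) <= window
            # Global pattern: global tokens attend everywhere and are attended by all.
            is_global = i in global_tokens or j in global_tokens
            # Dilated pattern: attend to every `dilation`-th position relative to i.
            dilated = dilation is not None and (i - j) % dilation == 0
            mask[i][j] = local or is_global or dilated
    return mask


def density(mask):
    """Fraction of query-key pairs that are kept."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total
```

For example, `static_sparse_mask(6, window=1, global_tokens=(0,))` keeps 24 of the 36 pairs; for a fixed window and a fixed number of global tokens, the number of kept pairs grows linearly with the sequence length.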
Early work on Sparse Transformer [116] pioneered this idea by introducing fixed strided and dilated
attention patterns that ensure each token connects to both nearby and distant tokens in a deterministic
manner. This structure reduces computation from quadratic to roughly O(N√N) while preserving the model’s ability
to capture long-range dependencies, establishing a foundation for later sparse models. Building on this
foundation, Star-Transformer [117] proposed a radial topology with a central relay node that connects to
all tokens, while peripheral tokens form a ring with local connections. This architecture maintains global
communication with linear-time attention and performs well on tasks requiring global structure, making
it both simple and efficient. BlockBERT [118] adapts BERT for longer sequences using blockwise sparse
attention. It divides input into fixed-size blocks, allowing dense intra-block attention and sparse inter-block
communication through selected key tokens. This enables effective long-document modeling while lowering
GPU memory requirements, proving useful for document classification and QA tasks.
To enhance static sparse models with richer contextual representations, Longformer [119] extends this
idea with a combination of sliding window attention and a small set of global tokens. The window handles
local context, while global tokens enable information flow across distant positions. This hybrid setup allows
linear complexity and proves effective in tasks like summarization and long-context QA. GMAT [298] injects
global memory tokens that act as information hubs, enabling interactions across segments of long sequences.
Compatible with other sparse structures like Longformer and Sparse Transformer, GMAT demonstrates
improved performance in long-context understanding tasks. A similar dual-attention structure is introduced
in ETC [120], which separates tokens into local and global streams. Local tokens attend within sliding
windows, while global tokens attend to the full sequence, allowing hierarchical representations. ETC also
incorporates segment-aware attention and relative position encodings, boosting performance in document-
level comprehension. BigBird [121] generalizes static sparse patterns by integrating local, global, and random
connections, forming a sparse attention graph with small-world properties. This design ensures all tokens
remain connected through short paths, and provides theoretical guarantees of universal approximation.
BigBird scales effectively to long sequences in both encoder-only and encoder-decoder setups, supporting
tasks across NLP and bioinformatics. LongT5 [122] builds upon T5 for long-text generation using transient
global attention. It summarizes local contexts into temporary global tokens, which adaptively aggregate
information layer by layer. This allows encoder-decoder models to scale to longer inputs without relying on
dense attention, improving performance in summarization and long-form QA. Pushing the boundary further,
LongNet [123] introduces exponentially dilated attention, where each layer increases the attention span in
powers of two. This hierarchical scheme enables efficient modeling of sequences up to 1 billion tokens with
linear computational complexity and a logarithmic dependency between any two tokens.
Beyond text, static sparse attention has also proven effective in vision. Axial Attention [124] factorizes
2D attention into independent operations along height and width axes, dramatically reducing complexity
while maintaining expressiveness. This technique underpins many high-resolution vision transformers
and has seen wide adoption in image and video models. In multimodal settings, sparse spatiotemporal
attention mechanisms, such as those used in Open-Sora [299], decouple attention along spatial and temporal
dimensions for efficient video modeling. By applying windowed attention in space and strided or pooled
attention over time, these methods significantly reduce compute cost while preserving temporal coherence,
enabling scalable video understanding and generation in large multimodal models.

3.2. Dynamic Sparse Attention

Unlike static sparse attention, dynamic sparse mechanisms determine attention patterns adaptively based on
the input content. These models aim to approximate the expressiveness of full attention by focusing compu-
tation on a dynamically selected subset of token interactions, thereby retaining task-relevant information
over long contexts while minimizing computational overhead. The evolution of these methods can be seen
as a progression from heuristic grouping techniques to more sophisticated, learnable retrieval systems.
Early approaches focused on grouping or clustering semantically similar tokens to restrict the scope of
attention. Transformer-XL [300] creates a recurrent memory by reusing hidden states from
previous segments, breaking the fixed-length barrier. Compressive Transformer [301] refines this memory
by actively compressing older states into a more efficient long-term store. Adaptive Attention Span [302]
was proposed to tackle the problem from an efficiency angle, allowing each attention head to learn its
optimal context size, which dramatically extended the manageable sequence length without a proportional
increase in computational cost. Furthermore, Reformer [125] pioneered using locality-sensitive hashing
(LSH) to bucket tokens, allowing each query to attend only to keys within the same hash bucket. This
created content-dependent sparsity with near-linear complexity. Similarly, the Routing Transformer [126]
applies online k-means clustering to partition tokens, with attention confined to these dynamically formed
clusters. A related idea, Sparse Sinkhorn Attention [127], learns a differentiable permutation to reorder
tokens, enabling efficient block-wise attention on soft-sorted, semantically coherent segments.
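
The bucketing idea behind hashing-based sparsity can be sketched as follows. This is a simplified sign-of-projection hash, not Reformer's exact hashing scheme; it assumes one shared vector per token for queries and keys, and the hyperplanes are passed in explicitly (in practice they would be sampled at random). All names are hypothetical.

```python
from collections import defaultdict


def lsh_bucket(vec, hyperplanes):
    """Hash a vector to an integer bucket id from the signs of its projections."""
    bits = 0
    for plane in hyperplanes:
        side = sum(v * w for v, w in zip(vec, plane)) >= 0
        bits = (bits << 1) | int(side)
    return bits


def bucketed_attention_pairs(vectors, hyperplanes):
    """Map each token index to the indices it may attend to (its own bucket)."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[lsh_bucket(vec, hyperplanes)].append(idx)
    return {idx: members for members in buckets.values() for idx in members}
```

Vectors pointing in similar directions tend to fall on the same side of each hyperplane and therefore share a bucket, so attention restricted to a bucket is content-dependent rather than position-dependent.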
These heuristic-based grouping methods were later generalized under the unified framework of Attention
with Bounded-Memory Control (ABC) [69]. This work frames the context as a memory of a fixed, constant size
and demonstrates that many prior methods are specific instances of managing this memory. Its key innovation,
ABC-MLP, moves beyond fixed heuristics by introducing a learned, contextualized controller to manage
the memory, achieving a better trade-off between efficiency and accuracy. An alternative to compressing
the immediate context is to augment the model with an external memory and use retrieval mechanisms
for sparsity. Memorizing Transformers [128] implements this by using a k-nearest-neighbor (kNN) index
to retrieve relevant past token representations, enabling sparse access to long-term memory. CoLT5 [303]
introduced a conditional routing approach where lightweight routers select a small subset of "important"
tokens for global attention, while most tokens undergo cheaper local attention. Unlimiformer [129] refined
this concept for cross-attention, allowing pre-trained models to attend over vastly longer inputs by retrieving
relevant keys from an index without any weight modification. The focus then shifted towards learning
explicit routing and gating mechanisms to prune the attention matrix directly.
The latest advancements aim not only for theoretical efficiency but also for practical, hardware-aligned
speedups and architectural specialization. Native Sparse Attention (NSA) [130] is designed for this purpose,
employing a hardware-aligned hierarchical strategy that combines coarse-grained token compression for
global context with fine-grained selection for local precision, delivering significant real-world speedups.
Concurrently, Mixture-of-Sparse-Attention (MoSA) [131] adapts the popular Mixture-of-Experts (MoE)
paradigm to dynamic sparse attention. Each head acts as an "expert" that dynamically selects a small subset
of tokens, and the computational savings are reinvested to train a larger number of specialized heads,
achieving superior performance within the same FLOP budget.

3.3. Training-free Sparse Attention

While many sparse attention methods are designed for training, a significant body of work focuses on
accelerating inference, which consists of two distinct phases: the compute-intensive prefill stage, which
processes the initial prompt, and the memory-bandwidth-bound decoding stage, which generates subsequent
tokens. Different techniques have emerged to tackle the specific bottlenecks in each phase.
Accelerating Prefill Stage. Optimizing the prefill stage involves reducing the massive computation of
the initial self-attention pass over a long prompt. Early approaches impose structured sparsity based on
offline analysis. For instance, LongLoRA [304] fine-tunes a model for longer contexts by using shifted sparse
attention (S²-Attn). Instead of full attention, S²-Attn performs attention within local token groups and
then shifts the groups to allow information to flow across the entire context, approximating global attention
with significantly less computational cost. MInference [133] observes that attention maps often conform
to a few prototype shapes (e.g., diagonal, vertical stripes) and leverages specialized GPU kernels for these
fixed patterns. Similarly, MoA (Mixture of Attention) [149] uses gradient-based profiling to assign a static,
heterogeneous sliding-window size to each attention head, effectively creating a fixed sparse mask that
reduces computation without retraining.
A more adaptive approach involves learning the sparsity pattern dynamically based on the input content.
SeerAttention [134] exemplifies this by augmenting each attention layer with a lightweight gating module.
After a brief self-distillation phase, this gate learns to predict which blocks of the attention matrix are most
important for a given input, generating a dynamic mask that prunes irrelevant computations on-the-fly. This
allows the model to achieve significant prefill speedups while maintaining high accuracy by adapting the
sparsity pattern to the specific context. The follow-up work SeerAttention-R [305] builds on self-distilled
attention sparsity and introduces key changes for efficient auto-regressive decoding. It removes sequence-level
query pooling and adopts a GQA-style shared sparsity pattern for better hardware efficiency. SeerAttention-R
can be applied to any pretrained Transformer by inserting a learnable gate into the attention layer, without
modifying the original model weights.
Accelerating Decoding Stage. In the auto-regressive decoding stage, the primary bottleneck is not
computation but the memory bandwidth required to load the ever-growing Key-Value (KV) cache at each step.
Research in this area focuses on intelligently pruning this cache to retain only the most critical information.
SpAtten [132] proposes a sparse attention architecture that improves efficiency through cascade token
and head pruning, removing unimportant elements layer by layer using a Top-K engine. It also introduces
progressive quantization, loading low-precision bits only when needed. Together, these methods reduce
memory and computation without harming accuracy, enabling faster long-context inference. Another simple
yet powerful heuristic is to exploit the "attention sink" phenomenon, as demonstrated by StreamingLLM [135].
This work found that initial tokens consistently attract high attention throughout generation and are crucial
for maintaining stability. By caching only these sink tokens alongside a sliding window of recent tokens,
models can handle infinitely long streams with a small, fixed-size cache. Other methods adopt dynamic
eviction policies. TOVA [306] continuously evicts the token with the lowest attention score to make room
for new ones, while H2O [136] formulates eviction as a submodular optimization problem to retain a set
of "heavy hitter" tokens that have the greatest influence on the attention output. RetrievalAttention [307]
accelerates inference by building an approximate nearest neighbor index of the KV cache in CPU memory.
During generation, it uses an attention-aware vector search to retrieve only a small subset of the most relevant
KV pairs for the current query, thus avoiding the computational and memory costs of the full attention
mechanism.
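
The sink-plus-sliding-window cache policy described above for StreamingLLM can be sketched in a few lines. This is an illustrative data structure only: the class name, its fields, and the string placeholders standing in for key/value tensors are assumptions, and positional re-indexing inside the cache is omitted.

```python
class SinkWindowCache:
    """Fixed-size KV cache keeping the first `num_sink` tokens plus a recent window."""

    def __init__(self, num_sink, window):
        self.num_sink = num_sink
        self.window = window
        self.entries = []  # list of (position, kv) pairs, oldest first

    def append(self, position, kv):
        self.entries.append((position, kv))
        if len(self.entries) > self.num_sink + self.window:
            # Evict the oldest entry that is not a sink token.
            del self.entries[self.num_sink]

    def positions(self):
        return [pos for pos, _ in self.entries]
```

With `num_sink=2` and `window=3`, appending positions 0 through 9 leaves positions 0, 1, 7, 8, 9 in the cache, so the memory footprint stays constant regardless of how long generation runs.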
More structured approaches introduce sparsity at a coarser granularity. FastGen [137] profiles each
attention head to classify its behavior (e.g., local vs. global) and applies a tailored eviction policy to each,
compressing the KV cache more aggressively for heads with localized attention patterns. Taking this further,
query-aware methods select relevant context blocks on-the-fly. Quest [138] divides the KV cache into pages
and uses a lightweight scoring function to retrieve only the most relevant pages for the current query.
Similarly, LongHeads [139] re-purposes multi-head attention by allowing each head to independently select
and attend to a small number of context chunks, effectively parallelizing the search for relevant information
across the sequence. DuoAttention [308] separates attention heads into retrieval heads for long-term context
and streaming heads for local context. Only retrieval heads maintain full KV caches, while streaming heads
use sliding windows, reducing memory and latency. A data-driven method is used to identify retrieval heads,
enabling efficient long-context decoding with minimal performance loss. ShadowKV [309] offloads the KV
cache of a large target model to the CPU, while a smaller draft model maintains a compact "shadow" KV cache
on the GPU. The draft model generates tokens using speculative decoding, and the target model only retrieves
the necessary KV data from the CPU to verify the draft, minimizing slow memory access. LServe [140]
introduces a unified sparse attention design for efficient long-sequence serving. It combines block-level token
skipping with query-driven KV pruning, retaining only relevant memory pages. By converting half of the
heads into lightweight streaming heads, LServe accelerates both prefill and decoding stages while preserving
accuracy. PQCache [310] compresses the KV cache using Product Quantization (PQ) to group and quantize
vector dimensions into low-bit representations. To maintain accuracy, it introduces a lightweight, learnable
"dequantization helper" module that is trained to reconstruct the high-fidelity vectors from their compressed
form during inference. XAttention [141] proposes an efficient block-sparse attention framework using the
sum of antidiagonals as a lightweight block importance metric. This enables accurate pruning with minimal
overhead, achieving up to 13.5× speedup while maintaining accuracy across long-context benchmarks.
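
Query-aware page selection of the kind discussed above for Quest can be sketched as follows. The scoring rule used here, an upper bound on the query-key dot product computed from per-page channel-wise minima and maxima of the keys, is one possible lightweight score and is an assumption of this sketch; the function names and the page layout are hypothetical.

```python
def page_upper_bound(query, page_keys):
    """Upper bound on max(q . k) over the keys of one page, from per-channel min/max."""
    bound = 0
    for i, q_i in enumerate(query):
        lo = min(k[i] for k in page_keys)
        hi = max(k[i] for k in page_keys)
        bound += max(q_i * lo, q_i * hi)
    return bound


def select_pages(query, keys, page_size, top_k):
    """Return the indices of the `top_k` pages with the highest score, in page order."""
    pages = [keys[s:s + page_size] for s in range(0, len(keys), page_size)]
    scores = [page_upper_bound(query, page) for page in pages]
    ranked = sorted(range(len(pages)), key=lambda p: scores[p], reverse=True)
    return sorted(ranked[:top_k])
```

Only the keys and values of the selected pages would then be loaded for exact attention, so the memory traffic per decoding step depends on `top_k` and `page_size` rather than on the full context length.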

3.4. Hardware-efficient Implementation

Efficient hardware implementation plays a key role in scaling sparse attention to support longer sequences
without compromising runtime or memory efficiency. This section reviews recent approaches aimed at
enhancing the hardware efficiency of sparse attention computation.
Beyond its exact attention optimization, FlashAttention-1 [142] also provides block-sparse FlashAttention,
which introduces structured sparsity to further reduce memory access overhead and accelerate computation for
long sequences. By applying a predefined block-wise sparsity mask, it skips unnecessary attention blocks
during computation:

$$S = QK^{\top} \in \mathbb{R}^{N \times N}, \qquad P = \mathrm{softmax}(S \odot \tilde{M}) \in \mathbb{R}^{N \times N}, \qquad O = PV \in \mathbb{R}^{N \times d} \tag{22}$$

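where $\tilde{M}$ denotes the block mask, which aligns with the block structure of Q and KV in FlashAttention and is
specifically expressed as follows:
$$(S \odot \tilde{M})_{kl} = \begin{cases} S_{kl}, & \text{if } \tilde{M}_{kl} = M_{ij} = 1 \\ -\infty, & \text{if } \tilde{M}_{kl} = M_{ij} = 0 \end{cases}, \qquad i = \lfloor k/B_r \rfloor,\ j = \lfloor l/B_c \rfloor,\ M \in \{0, 1\}^{N/B_r \times N/B_c} \tag{23}$$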


where $B_r$ and $B_c$ denote the block sizes of Q and KV, respectively. Empirically, block-sparse FlashAttention
achieves 2× to 4× speedups over dense FlashAttention and scales transformers to sequence lengths up to
64K. FlashAttention-2 [143] extends FlashAttention-1’s block-sparse attention to support structured sparsity
patterns such as local, dilated, and general block-sparse attention.
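
The block mask of Eq. (23) can be illustrated with a small sketch that expands a block-level mask M to element level and lists the blocks a kernel would actually visit. This is a dense, plain-Python illustration with hypothetical function names; a real block-sparse kernel never materializes the masked score matrix and simply skips the inactive blocks.

```python
import math


def apply_block_mask(S, M, Br, Bc):
    """Keep S[k][l] where the covering block M[k // Br][l // Bc] is 1, else -inf."""
    return [[S[k][l] if M[k // Br][l // Bc] == 1 else -math.inf
             for l in range(len(S[0]))]
            for k in range(len(S))]


def active_blocks(M):
    """Block coordinates (i, j) that must be computed; all others are skipped."""
    return [(i, j) for i, row in enumerate(M) for j, keep in enumerate(row) if keep == 1]
```

Because masked entries are set to negative infinity before the softmax, they receive zero probability, so skipping a whole block changes neither the normalizer nor the output.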
Recent work NSA [130] integrates three different sparse attention mechanisms and improves hardware
efficiency through three strategies. First, Group-Centric Data Loading ensures that all query vectors $Q \in \mathbb{R}^{[h, d_k]}$
in a GQA group and their corresponding sparse key/value indices $I_t$ are loaded together in each inner loop.
Second, Shared KV Fetching minimizes memory access by sequentially loading continuous key/value blocks
$K \in \mathbb{R}^{[B_k, d_k]}$, $V \in \mathbb{R}^{[B_k, d_v]}$ into SRAM, where $B_k$ is a kernel-aligned block size. Finally, by leveraging the
uniform inner-loop length across blocks, the Outer Loop on Grid utilizes Triton’s grid scheduler for efficient
parallelization of query and output computation. This design gives 9.0×/6.0× speedups in forward/backward
passes for 64k sequences compared to FlashAttention-2, showing strong performance for long-context tasks
in GQA/MQA. Furthermore, MoBA [144] implements dynamic sparse attention efficiently through blockwise
variable-length computation. By partitioning sequences into fixed-size chunks and encoding block selections
via segment length tensors, it transforms irregular sparsity into hardware-friendly memory access patterns.
This varlen formulation maintains FlashAttention’s optimization benefits while enabling adaptive routing,
achieving 16× speedup for 10M-token sequences at 95% sparsity.
Besides block sparse attention, other sparse patterns also benefit from efficient kernels. Longformer’s [119]
CUDA kernel implements optimized sparse attention with three key patterns: sliding window attention
for local context, dilated window attention for wider receptive fields, and task-specific global attention for
special tokens. SeerAttention [134] introduces a CUDA kernel supporting dynamic sparsity shapes like
A-shape, vertical-slash, and diagonal blocks. It integrates FlashAttention’s tiling scheme to skip inactive
blocks, reducing computation and memory overhead.

4. Efficient Full Attention


4.1. IO-Aware Attention

Transformer self-attention scales quadratically in both compute and memory with the sequence length N,
which quickly becomes the dominant bottleneck in large-context language models. A series of FlashAttention
kernels eliminates most of the memory traffic while preserving exact softmax attention, delivering large
wall-clock speed-ups on modern GPUs.
Given matrices $Q, K, V \in \mathbb{R}^{N \times d}$, the standard softmax attention computation on GPU can be written as:

1. Load Q and K by blocks from HBM, compute $S = QK^{\top}$, write S to HBM.
2. Read S from HBM, compute P = softmax(S), write P to HBM.
3. Load P and V by blocks from HBM, compute O = PV, write O to HBM.

It suffers from two issues: 1) High memory usage. When the hidden dimension is large, it becomes
difficult to load both Q and K into SRAM simultaneously for fast computation. 2) Excessive HBM and
SRAM access. This leads to a major performance bottleneck during training. Specifically, standard softmax
attention computation:

1. Reads Q and K from HBM (two reads) and writes the intermediate result S once. Total: 3 memory
operations.
2. Reads S once and writes P once. Total: 2 memory operations.
3. Reads P and V (two reads) and writes O once. Total: 3 memory operations.
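
To see why the N × N intermediates dominate, the three steps above can be turned into a rough count of matrix elements moved to and from HBM. This is a back-of-the-envelope sketch under the assumption that each matrix is transferred exactly once per read or write; the function name is hypothetical and the count ignores batch size, heads, and data type.

```python
def standard_attention_hbm_elements(N, d):
    """Count elements moved between HBM and the compute units by the three steps."""
    step1 = 2 * N * d + N * N        # read Q and K, write S
    step2 = N * N + N * N            # read S, write P
    step3 = N * N + N * d + N * d    # read P and V, write O
    return step1 + step2 + step3


def quadratic_share(N, d):
    """Fraction of the traffic caused by the N x N matrices S and P."""
    return 4 * N * N / standard_attention_hbm_elements(N, d)
```

For N = 4096 and d = 64 the total is 68,157,440 elements, of which more than 98% come from reading and writing S and P. This is the traffic that a fused kernel avoids by never materializing these matrices in HBM.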

4.1.1. FlashAttention-1

Existing algorithms aim to reduce attention computation complexity, but often do not result in significant
speed-ups because GPU runtimes are mainly limited by global memory traffic, not computation intensity.
FlashAttention-1 [142] overcomes this by optimizing memory transfers between global and on-chip memory.
This optimization greatly reduces memory traffic and memory usage, resulting in faster computations and
enabling longer sequence lengths without needing approximations. FlashAttention-1 improves standard
attention computation by dividing the sequence into smaller query and key-value tiles. The sequence is
processed in blocks to minimize memory access. The query and key-value pairs are loaded into shared memory
in one operation, and the score block is computed efficiently using matrix multiplication on specialized
hardware. A partial softmax operation is applied across the tiles, maintaining running maximums and sums
for numerical stability. The final step accumulates partial outputs with the computed probabilities.
The key contributions of FlashAttention-1 can be summarized as:
Online Softmax. FlashAttention-1 reduces memory usage by computing softmax normalization incremen-
tally, keeping track of the running maximum and cumulative weights. This avoids storing all intermediate
scores and ensures efficient memory usage.
Fused Attention Computation. Instead of performing separate kernel invocations for computing attention
scores, softmax, and weighted sums, FlashAttention-1 combines these steps into a single kernel. This reduces
redundant memory transfers and takes full advantage of GPU parallelism, improving computation efficiency.
Recomputation in Backward Pass. During backpropagation, FlashAttention-1 saves memory by recom-
puting attention weights instead of storing all intermediate results. It uses previously saved normalization
statistics to recompute these values, allowing the method to handle longer sequences and larger batch sizes
without exceeding memory limits.
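
The online softmax idea can be sketched for a single query vector as follows. This is a minimal plain-Python illustration, not the FlashAttention kernel: it processes keys and values block by block while keeping only a running maximum, a running normalizer, and an unnormalized output accumulator. The function names are hypothetical and the 1/sqrt(d) scaling is omitted for brevity.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def reference_attention(q, keys, values):
    """Standard softmax attention for one query, materializing all scores."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def online_softmax_attention(q, keys, values, block_size):
    """Same result, computed one key/value block at a time."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])  # unnormalized output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale previous statistics to the new maximum (0.0 on the first block).
        rescale = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(probs)
        acc = [a * rescale + sum(p * v[i] for p, v in zip(probs, v_blk))
               for i, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The final running maximum and normalizer are exactly the per-row statistics that can be saved for the backward pass, which is what allows the attention weights to be recomputed instead of stored.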

4.1.2. FlashAttention-2

FlashAttention-1 significantly outperforms standard attention methods, but its forward pass still achieves
only a fraction of the device’s theoretical peak throughput. The backward pass performs even worse,
reaching just a small portion of the peak throughput. The bottleneck arises due to suboptimal work
partitioning across thread blocks and warps, as well as numerous non-matmul operations, which result in
low occupancy and unnecessary memory accesses.

Figure 8: Attention Map Computation in FlashAttention-2. Labels: Query Blocks, Key Blocks, First Loop, Second Loop.

FlashAttention-2 [143] addresses these issues and optimizes performance from the following aspects:
More Matmul Operations. FlashAttention-2 reduces softmax bookkeeping by storing the intermediate terms and
fusing epilogue operations directly within the Tensor-Core pipeline.
Query-Outer, Key-Value-Inner Loop Structure. In FlashAttention-2, the loop order is arranged so that
query blocks are iterated in the outer loop and key-value blocks in the inner loop, improving parallelism
and reducing overhead.


Row-wise Computation. For each query block, attention is computed with every key and value block in
the same row and the results are summed to produce the output; with causal attention, this is usually done
in two passes: first, the outputs for all blocks strictly below the main diagonal are computed without any
masking, and then a mask is applied to compute the lower-triangular part of each diagonal block:

$$
A_{[i],[j]} \propto
\begin{cases}
\exp\left(Q_{[i]}^{\top} K_{[j]}\right), & \text{if } j < i \\
\exp\left(\mathrm{lower}\left(Q_{[i]}^{\top} K_{[i]}\right)\right), & \text{if } i = j
\end{cases}
\tag{24}
$$

where $i$ indexes query blocks and $j$ indexes key blocks.
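The two-pass schedule for causal attention can be sketched at the index level. The snippet below is only an illustration with hypothetical function names: it enumerates which (query, key) token pairs are touched when blocks strictly below the diagonal are processed without a mask and each diagonal block is processed with a lower-triangular mask.

```python
def causal_block_schedule(num_blocks):
    """For each query block i: (key blocks needing no mask, masked diagonal block)."""
    return [(list(range(i)), i) for i in range(num_blocks)]

def visited_pairs(seq_len, block_size):
    """Token-level (query, key) pairs touched by the two-pass schedule."""
    num_blocks = (seq_len + block_size - 1) // block_size
    pairs = set()
    for i, (below, diag) in enumerate(causal_block_schedule(num_blocks)):
        q_idx = range(i * block_size, min((i + 1) * block_size, seq_len))
        # Pass 1: blocks strictly below the diagonal are computed without masking.
        for j in below:
            k_idx = range(j * block_size, min((j + 1) * block_size, seq_len))
            pairs.update((q, k) for q in q_idx for k in k_idx)
        # Pass 2: the diagonal block keeps only its lower-triangular part.
        k_idx = range(diag * block_size, min((diag + 1) * block_size, seq_len))
        pairs.update((q, k) for q in q_idx for k in k_idx if k <= q)
    return pairs
```

The union of both passes is exactly the causal pattern (every key position at or before the query position), so masking work is confined to the diagonal blocks.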

4.1.3. FlashAttention-3

FlashAttention-3 [145] brings substantial improvements tailored for Hopper-class GPUs, integrating two
pivotal hardware features: the Tensor Memory Accelerator (TMA) and the WGMMA Tensor-Core instructions.
These innovations optimize performance by enabling warp-asynchronous memory operations and enhancing
the efficiency of matrix multiplications, which are critical for accelerating attention mechanisms.
The key contributions of FlashAttention-3 can be summarized as:
Producer-Consumer Asynchrony. This technique restructures the processing pipeline by assigning
warp groups to distinct roles—producers and consumers. The producers are tasked with prefetching data
from memory using TMA, while the consumers focus on the computationally intensive tasks of matrix
multiplication and softmax operations. This separation of responsibilities allows for a two-stage pipeline that
concurrently handles memory transfers and computation. By overlapping these operations, FlashAttention-3
minimizes idle time, improving resource utilization and increasing throughput. The result is more efficient
hardware utilization, especially in high-latency stages of computation such as memory access.
Interleaved Matmul+Softmax. In this method, blocks are processed in a double-buffered fashion,
where the softmax operation for a block is computed while the next block’s scores are calculated using
WGMMA. This interleaving of the matrix multiplication and softmax operations ensures that the GPU’s
compute resources are kept fully occupied, reducing the time spent waiting between these two stages. This
technique exploits the parallelism of modern GPUs to maintain high throughput and reduce latency, thus
significantly accelerating the attention computation pipeline.
Block-Wise FP8 Quantization with Incoherent Processing. In FlashAttention-3, all numerical opera-
tions, including GEMM and softmax, are performed using FP8 precision, which reduces memory footprint
and improves computational efficiency. However, to mitigate the precision loss typically associated with
FP8, each block independently selects a per-tile scaling factor, allowing it to adjust for cumulative errors
that would otherwise degrade performance. This approach significantly reduces numerical errors compared
to traditional FP8 kernels, ensuring that the results remain robust while maintaining the memory and
computation benefits of reduced precision.

4.2. Grouped Attention

Grouped attention techniques, including Multi-Query Attention (MQA), Grouped-Query Attention (GQA),
and Multi-Head Latent Attention (MLA), have been widely adopted in the training of large language models
(LLMs). These methods have been extensively validated for their ability to reduce key-value (KV) cache sizes
during inference, leading to improved memory efficiency without compromising model performance.
Specifically, MQA [146] was designed to address the significant memory bandwidth bottleneck encountered
during the incremental decoding process in autoregressive models. By allowing multiple query heads to
share a single key and value head, MQA dramatically reduces the size of the KV cache, leading to substantial
improvements in inference speed. However, this efficiency sometimes comes at the cost of model quality.

Figure 9: Mechanism Comparison of Primary Grouped Attention Methods. Panels: (a) Multi-Head Attention (MHA); (b) Grouped-Query Attention (GQA); (c) Multi-Query Attention (MQA); (d) Multi-Head Latent Attention (MLA); (e) Group-Tiled Latent Attention (GTA); (f) Grouped-Head Attention (GHA); (g) Grouped-Value Attention (GVA); (h) Mixture-of-Head Attention (MoH).

To mitigate the performance degradation observed in MQA while retaining its speed advantages, Grouped-
Query Attention (GQA) was proposed [147]. GQA serves as a middle ground between traditional Multi-Head
Attention (MHA) and MQA. It works by dividing query heads into groups, with each group sharing a single
key and value head. This approach offers a favorable trade-off, achieving quality comparable to MHA while
maintaining inference speeds close to those of MQA. A key innovation in the GQA paper is the concept of
"uptraining," a cost-effective method for converting existing MHA models into GQA or MQA models with
only a small fraction of the original pre-training compute [147].
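The effect of sharing key and value heads on the KV cache can be made concrete with a small calculation. This is a sketch under stated assumptions: the function names are hypothetical, the configuration (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2 bytes per element) is illustrative rather than taken from any particular model, and query heads are assumed to be grouped contiguiously by index.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the shared KV head used by a query head (contiguous groups assumed)."""
    assert n_q_heads % n_kv_heads == 0
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

# Hypothetical configuration, chosen only for illustration.
cfg = dict(n_layers=32, head_dim=128, seq_len=4096)
mha = kv_cache_bytes(n_kv_heads=32, **cfg)  # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)   # four query heads share each KV head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)   # all query heads share one KV head
```

With these numbers the cache shrinks by 4× for GQA and by 32× for MQA relative to MHA, which is the memory-versus-quality trade-off the paragraph describes.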
Building on these efforts to optimize attention mechanisms, Multi-Head Latent Attention (MLA) was
introduced as part of the DeepSeek-V2 [27] and V3 models [28]. MLA further enhances inference efficiency
by compressing the KV cache into a low-rank latent vector. This technique significantly reduces the KV
cache size relative to MHA while maintaining strong performance [27]. MLA’s design decouples
the positional information from the compressed keys and values, which is a key difference from standard
attention mechanisms and allows for its aggressive compression.


Group Tied Attention (GTA) and Group Latent Attention (GLA) [148] are proposed as hardware-friendly
and memory-efficient attention mechanisms aimed at addressing key inefficiencies in LLM inference, partic-
ularly those related to the KV cache. GTA builds upon the foundation of GQA by tying the Key and Value
representations into a single shared projection, which is then reused across small groups of query heads. This
design not only reduces the KV cache size by approximately 2× but also improves arithmetic intensity by
minimizing memory movement relative to compute. A critical innovation in GTA lies in its careful handling
of Rotary Position Embeddings (RoPE): only half of the tied KV state is used for the non-RoPE part of the
key, while a separate smaller head computes the RoPE-associated component, which is broadcast across the
group. Experimental results show that GTA preserves or even improves perplexity and task performance
compared to GQA, offering a drop-in replacement with both memory and computational efficiency gains.
GLA extends the idea of latent compression in MLA by improving its compatibility with tensor parallelism.
While MLA compresses the KV cache into a single latent representation, this compression limits its ability to
parallelize efficiently, often requiring cache duplication across devices. GLA resolves this by sharding the
latent KV cache into multiple segments, such as two halves, and assigning separate groups of query heads to
each shard. Local attention is independently computed within each group and later merged, eliminating the
need for full cache replication and enabling better performance in distributed and imbalanced workloads.
In addition to maintaining or improving model quality over MLA, GLA demonstrates substantial speedups
during inference. Specifically, its custom kernel achieves up to 2× speedup over FlashMLA in speculative
decoding. Together, GTA and GLA provide highly efficient, scalable alternatives to existing attention variants,
offering strong trade-offs between memory usage, compute efficiency, and parallelization capability.

4.3. Mixture of Attention

The idea of combining multiple attention strategies into a unified attention layer began with MoA (Mixture-
of-Attention) [149], which proposed assigning different sparse attention patterns to each head and layer
within a Transformer. By automatically selecting head-wise sparsity from a pool of patterns, MoA significantly
extends the effective context length while improving throughput and reducing memory consumption. This
early work emphasized the benefits of heterogeneous attention structures, highlighting that not all tokens or
heads require uniform computation. Expanding on this notion, SwitchHead [311] treated attention heads as
experts and introduced routing mechanisms to activate only a subset of them per token, reducing attention
computation by up to 8× without sacrificing performance. In parallel, MoH (Mixture-of-Heads) [151]
adopted a similar perspective, but instead of discrete routing, it used soft selection: each token computes a
weighted combination of attention heads, enabling partial pruning while improving or maintaining accuracy.
These works collectively mark a shift toward dynamic head-level specialization, bringing MoE-style routing
into the self-attention mechanism.
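The soft head selection described above for MoH can be sketched for a single token as a router-weighted sum of per-head outputs. The function name, the softmax router, and the input layout are assumptions for illustration; the published method may differ in how heads are shared or pruned.

```python
import math

def mix_heads(head_outputs, router_logits):
    """Weighted combination of per-head outputs for a single token."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    z = sum(exps)
    weights = [e / z for e in exps]          # one routing weight per head
    dim = len(head_outputs[0])
    mixed = [sum(w * h[d] for w, h in zip(weights, head_outputs)) for d in range(dim)]
    return mixed, weights
```

Heads whose routing weight is close to zero contribute almost nothing to the output, which is what makes partial pruning of heads possible in this formulation.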
Building on this head-wise mixture foundation, LLaMA-MoE v2 [150] scaled the concept to full LLMs by
sparsifying both the attention and feedforward layers through structured expert partitioning. The model
achieves performance on par with dense LLaMA variants across multiple domains, demonstrating that
MoE-based sparsity is viable for large-scale, instruction-tuned transformers. Further pushing flexibility, MoBA
(Mixture of Block Attention) [144] introduced block-level routing where each chunk of tokens can choose
between full or sparse attention dynamically. Rather than relying on predefined patterns, MoBA allows the
model to adaptively allocate computation, improving long-context performance and aligning efficiency with
task demands. These advancements show the trend of increasing granularity and adaptivity in MoE routing,
from heads to full blocks, making attention computation more efficient and task-aware.


Extending MoE principles beyond traditional attention layers, MoM (Mixture-of-Memories) [78] applies
expert routing to memories with linear sequence modeling. Instead of maintaining a single shared state,
MoM uses multiple sparsely activated memory slots, each acting as an expert updated via token-wise routing,
enhancing long-term recall and reducing interference. Finally, MoSA (Mixture of Sparse Attention) [131]
brings MoE routing to a fine-grained level by allowing each attention head to dynamically select the top-k
relevant tokens based on content. This results in a token-wise sparse structure that outperforms dense
attention under equal compute budgets while significantly reducing memory usage. Across these works,
the MoA paradigm evolves from fixed sparsity patterns to fully dynamic, learnable routing strategies at the
head, block, memory, and token levels, offering a unified path toward more scalable, efficient, and adaptable
attention mechanisms in modern large models.

4.4. Quantized Attention

To address the trade-off between computational efficiency and task accuracy in low-precision quantization,
numerous hardware acceleration platforms and quantization techniques have been developed in recent years.
Quantized Attention mechanisms have been proposed and studied to improve computational efficiency and
reduce memory requirements, while maintaining as much of the original model accuracy as possible.
Post-training Quantized Attention. Post-training quantization (PTQ) methods convert the Transformer
attention operators to low-bit arithmetic without any retraining. For example, SageAttention [152]
quantizes the $QK^\top$ product to INT8 format (with per-channel smoothing of outliers) while keeping the
$\mathrm{softmax}(QK^\top)V$ matmul in FP16. This mixed-precision approach (INT8 for the first matmul, FP16 for
the second matmul) doubles the matmul computation speed on GPUs with negligible accuracy loss.
SageAttention’s Triton kernel even fuses RoPE [312] and quantization, using NVIDIA Tensor Cores to reach high
efficiency. Similarly, INT-FlashAttention [157] builds a FlashAttention-compatible INT8 kernel: it quantizes
Q, K, V and the softmax input into 8-bit and performs the entire attention in INT8. INT-FlashAttention
achieves faster inference than FlashAttention with FP16 and smaller quantization error than FlashAttention
with FP8, with only minor quality degradation. Q-BERT [155] uses Hessian-based analysis to quantize attention
to very low bits; e.g., uniform 4-bit Q-BERT shows only 0.5% accuracy drop on SQuAD versus over 11% for naive
quantization. In summary, PTQ attention methods quantize the main matmuls (often $QK^\top$ or
$\mathrm{softmax}(QK^\top)V$) and introduce scaling or smoothing to preserve fidelity, enabling plug-and-play
INT8 inference.
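The core of INT8 quantization of the first matmul can be sketched in a few lines. This is a simplified illustration, not the SageAttention or INT-FlashAttention kernel: it assumes symmetric quantization with one scale per query or key vector, omits outlier smoothing, and uses hypothetical function names.

```python
def quantize_int8(vec):
    """Symmetric per-vector INT8 quantization: returns (integers, scale)."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    ints = [max(-127, min(127, round(x / scale))) for x in vec]
    return ints, scale

def int8_scores(queries, keys):
    """Approximate Q K^T with integer dot products, then rescale to floats."""
    q_q = [quantize_int8(q) for q in queries]
    k_q = [quantize_int8(k) for k in keys]
    return [[sum(a * b for a, b in zip(qi, ki)) * qs * ks for ki, ks in k_q]
            for qi, qs in q_q]
```

The integer dot product is what would run on low-precision hardware; the two floating-point scales are applied once per score, so the dequantization overhead is small compared with the matmul itself.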
Quantization-Aware Training for Attention. Quantization-aware training (QAT) embeds low precision
into training so the model learns to cope with reduced bit-width. For example, Q8BERT [158] finetunes BERT
under 8-bit constraints (8-bit weights and activations) and can compress BERT by 4× with minimal loss.
I-BERT [156] goes further by training the model so that every operation (including GELU and softmax) can
be performed in integer arithmetic: it learns simple integer approximations for the non-linearities, allowing a
RoBERTa model to run end-to-end in INT8 with near-identical accuracy. FullyQT [159] quantizes all matrix
multiplies, including softmax inputs/outputs, during training, and even approximates the softmax exponent
via bit-shifts. It shows that an 8-bit Transformer can match FP32 BLEU scores when properly trained.
Mixed-Precision Attention. These schemes mix precisions within the attention computation to balance
speed and accuracy. The SageAttention example again illustrates this: it uses INT8 for the $QK^\top$
multiplication but keeps the V-matmul in FP16. More generally, one might quantize Q and K aggressively
(e.g., 8-bit) but perform the softmax/V multiplication in higher precision so that the final output remains
accurate. TurboAttention [160]’s FlashQ is also a form of hybrid attention: it applies separate quantization
to each head (one scale per head) so that most matmuls run in low precision, while still using sufficient
precision to compress the KV cache. Other schemes similarly choose mixed precision by layer or token: for
example, some layers might be run in INT8 while others (or certain operations like softmax normalization)
remain in FP16. These mixed-precision designs leverage the fact that modern GPUs provide Tensor Core support
for both INT8 and FP16 operations; by assigning the most sensitive sub-operations to a higher precision, they
recover accuracy without sacrificing the overall throughput boost of quantization.
INT8 Fused Attention Kernels. Beyond algorithmic tricks, specialized INT8 attention kernels have been
developed. INT-FlashAttention [157] is one example: it provides a fused CUDA kernel that takes INT8
Q, K, V and softmax inputs, performs all matmuls and softmax in 8-bit, and outputs FP16 or FP32 results.
Similarly, SageAttention’s implementation fuses quantization and FlashAttention-style tiling in a Triton
kernel, exploiting INT8 TensorCore MMA operations for both matmuls. HACK [161]’s attention prefill is
another fused kernel: it quantizes Q, K, V on-the-fly and feeds them into the attention computation without
separate quantize/dequantize steps.
Ultra-Low-Bit (Sub-4-Bit) Attention. The most aggressive quantizers push below 4 bits. SageAttention2
[153] does this by using warp-wise INT4 for the $QK^\top$ matmul and FP8 for the V-matmul, combined with
outlier smoothing on Q and V. SageAttention3 [154] goes further: it uses FP4 “microscaling” for both matmuls.
Specifically, it quantizes blocks of the matrices to FP4 with per-block scale factors and per-token
normalization, which limits quantization error and enables a 5× speedup. On the training side, BitDistiller [162] is a QAT
framework for sub-4-bit LLMs: it employs asymmetric quantization and a self-distillation KL objective so that
3-bit and 2-bit models can approach full-precision performance. Even earlier, Q-BERT showed that carefully
assigning 2-3 bits per layer (with some layers at higher precision) can yield reasonable accuracy; for example,
uniform 2-bit Q-BERT retained 70% accuracy on SQuAD versus 5% for naive 2-bit [155].

5. Sparse Mixture-of-Experts
Sparsely-activated Mixture-of-Experts (MoE) is a cost-effective approach to enlarging model capacity while
keeping the computational cost nearly unchanged [313, 314, 315]. Unlike conventional model ensembling, an MoE
layer consists of a gate and many specialized experts. The gate is a router that accepts input tokens and
adaptively selects the most relevant experts, which brings better robustness [316]. In addition, the gate
induces specialized experts, and the routing preferences may be used as dataset-level representations [317, 318].
Through such a sparse activation mechanism, MoE can be scaled to much greater model capacity,
leading to better task performance [319, 320, 321, 322, 323]. Based on these advantages, the MoE paradigm
has been widely applied in existing LLMs [33, 42, 172, 174, 324, 325, 326, 327, 328].
This section introduces the gate and the experts, the two core components of the MoE paradigm. We also
introduce modern strategies for building MoE models from dense models, which avoid the significant cost of
training MoE models from scratch.

5.1. Routing Mechanisms

The gate is the component that introduces sparsity into MoE models. For a batch of input token representations
$X \in \mathbb{R}^{T \times D}$, the gate function $G$ determines the probabilities of dispatching token $x_i$ to each expert $e_j$:

$$P = G(X), \quad P \in \mathbb{R}^{T \times N}, \quad G(X) = \mathrm{Softmax}(X W_g + b_g) \tag{25}$$

where $N$ is the number of experts, and $W_g \in \mathbb{R}^{D \times N}$ and $b_g \in \mathbb{R}^{N}$ are trainable parameters in each MoE layer.
As shown in Eq. (25), the most common gate function G(·) is a softmax applied after a linear
projection, and the calculation needs to be cast into FP32 for numerical stability, especially when the model
parameters are stored in lower-precision formats. Some variants of G replace the linear projection with
a feed-forward network (FFN) with nonlinear activations [178] for better performance, or substitute
softmax with a sigmoid [28, 329] or a cosine similarity function [330] to avoid representation collapse and
expert competition problems [331].

Figure 10: MoE Routing Strategies. Panels: Token-choice and Expert-choice (illustrated with top-2 routing).

After obtaining the routing probabilities, MoE introduces sparsity by selecting only the top-k experts. It
is also common practice to normalize the top-k routing probabilities so that they sum to 1.0 for
better convergence [327].

$$Y = \sum_{i}^{k} \text{Top-}k\big(G(X)_i\big) \cdot E_i(X) \tag{26}$$

where $E_i(X)$ denotes the output of the $i$-th selected expert.
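Equations (25) and (26) can be sketched for a single token as follows. The router logits stand in for the linear projection of the token (they are passed in directly here), experts are plain Python callables, and all function names are hypothetical; the renormalization of the selected probabilities follows the common practice mentioned above.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def top_k_gate(logits, k, renormalize=True):
    """Return {expert index: weight} for the k highest-probability experts."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in chosen) if renormalize else 1.0
    return {i: probs[i] / total for i in chosen}

def moe_output(x, experts, logits, k):
    """Weighted sum of the selected experts' outputs for one token."""
    weights = top_k_gate(logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())
```

Only the k selected experts are evaluated, so the per-token compute is independent of the total number of experts, while the renormalized weights keep the output scale comparable to a dense layer.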

Routing Strategies. Routing mechanisms in modern LLMs are usually token-choice-based: each token
selects its k experts to compute its new representation, as illustrated in Figure 10. A
significant pitfall of this token-choice strategy is its tendency to produce poor load balancing across experts.
Some experts are over-utilized while others are under-utilized, leading to wasted capacity and
potentially unprocessed tokens due to expert capacity limits (i.e., token dropping).
Besides token-choice, each expert can instead select its top-k tokens in an expert-choice manner.
Expert-choice routing [163] simply changes the softmax operation in Eq. (25) from the N dimension to the T dimension.
The top-k operation in Eq. (26) then selects k tokens for each expert, leading to perfect load
balancing. Expert-choice is effective in encoder-based models, such as the T5 encoder [3] and ViT [332, 333].
However, in autoregressive language modeling, experts cannot see the whole sequence at each step,
and an expert may accept so many early tokens that its capacity is filled, leaving limited room for future tokens.
Therefore, modern LLMs that use expert-choice routing have to add an estimator or a loss constraint to
reserve capacity for future tokens [174].
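Expert-choice routing can be sketched as each expert ranking all tokens and taking a fixed number of them. The function name and score layout are hypothetical, and the softmax over the token dimension is omitted because it does not change the per-expert ranking.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the affinity of token t for expert e.

    Each expert independently keeps its `capacity` highest-scoring tokens.
    """
    num_tokens, num_experts = len(scores), len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = ranked[:capacity]
    return assignment
```

Every expert receives exactly `capacity` tokens, which gives perfect load balance by construction, but a token may be chosen by several experts or by none, and the ranking requires seeing all tokens at once, which is the difficulty for autoregressive decoding noted above.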
BASE (Balanced Assignment of Experts) Layer [164] discards the auxiliary load balancing loss used in
token-choice routing by formulating token routing as a linear assignment problem, so that each expert
receives an equal number of tokens during training. To avoid a future-token leakage problem similar to that of
expert-choice routing in autoregressive language modeling, a simpler greedy assignment is used at inference time.
Instead of training a gate function with learnable parameters, Hash Layer [165] proposes to use a fixed,
predefined hash function (a lookup table) to route tokens to specific experts. To build a balanced hash
function, the authors first analyze the training corpus and then repeatedly assign the most frequent remaining
token in the vocabulary to the emptiest bucket until all tokens are assigned. Based on this setup, Hash Layer
achieves balanced expert loads and surpasses BASE Layer. However, because word frequencies follow Zipf’s law,
perfect balance is nearly impossible.
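The greedy construction of a balanced lookup table can be sketched as follows. The function name and the token counts are hypothetical and only illustrate the procedure described above: tokens are visited from most to least frequent, and each is assigned to the currently least-loaded expert.

```python
def balanced_hash(token_counts, num_experts):
    """Greedily map each token to the currently least-loaded expert."""
    load = [0] * num_experts
    table = {}
    by_frequency = sorted(token_counts.items(), key=lambda kv: kv[1], reverse=True)
    for token, count in by_frequency:
        expert = min(range(num_experts), key=lambda e: load[e])
        table[token] = expert
        load[expert] += count
    return table, load

# Hypothetical, roughly Zipf-shaped counts for illustration only.
counts = {"the": 1000, "of": 500, "and": 333, "to": 250, "a": 200, "in": 166}
table2, load2 = balanced_hash(counts, 2)
table3, load3 = balanced_hash(counts, 3)
```

With two experts the loads come out close to each other, but with three experts the single most frequent token already exceeds an equal share, so the loads cannot be equalized. This is a small-scale picture of why Zipfian frequencies prevent perfect balance.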


Adaptive Top-k Routing. Besides top-k routing with a fixed k, recent studies also explore dynamic
routing with an adaptive number of experts, where the number of selected experts is determined by task
difficulty or computational requirements. With this mechanism, models can be both efficient and effective on
complex tasks. There are three types of adaptive routing: (1) differentiable activation; (2) expert activation
estimation; and (3) zero computational experts.
Differentiable activation replaces the traditional top-k softmax router with differentiable activation
functions, such as ReLU and sigmoid. ReMoE [334] uses ReLU as the router’s activation function, and
experts with zero activation scores are skipped for the input token. Since the router is fully
differentiable, it can be optimized adaptively based on task complexity. However, training from
scratch may cause all experts to be activated. Therefore, ReMoE applies a regularization term to ensure that
the MoE reaches a target sparsity level. BlockFFN [335] also uses ReLU for adaptive routing. Instead of
using regularization to obtain a target sparsity, it introduces chunk-level sparsity training objectives that
encourage the model to be sparse at both the token level and the chunk level. DynMoE [167] regards the expert
selection process as a multi-label classification problem where each expert is an individual class. For an
input token, DynMoE calculates the cosine similarities with all expert representations and selects the experts
whose similarity exceeds a threshold.
Expert activation estimation makes top-k routing directly adaptive by estimating k dynamically.
MoE-Dynamic [166] selects experts according to a predefined confidence threshold p: the gate adds
experts in descending order of their selection probabilities until the cumulative probability of the chosen
experts exceeds p. MoE-Dynamic thus converts fixed-size top-k routing into dynamic top-p routing,
allocating more computational resources to more complex inputs while conserving them for
simpler ones. Ada-K [169] dynamically adjusts the number of activated experts for each token in MoE-based
LLMs through a learnable allocator. This allocator module is similar to the gate: it takes the input token
and outputs a probability distribution over the number of experts, from which the expert count is sampled.
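The top-p selection rule can be sketched in a few lines. The function name is hypothetical, and the routing probabilities are assumed to be already normalized.

```python
def top_p_experts(probs, p):
    """Add experts in descending probability until their cumulative mass exceeds p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += probs[i]
        if total > p:
            break
    return chosen
```

A token with a confident router distribution, e.g. `top_p_experts([0.7, 0.2, 0.05, 0.05], 0.5)`, activates a single expert, whereas a flatter distribution such as `[0.3, 0.3, 0.2, 0.2]` with the same threshold activates two, so the compute spent per token follows the router's uncertainty.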
Zero computational experts are heterogeneous experts with zero or negligible computational cost that
act as placeholders. MoE++ [336] keeps the number of activated experts k unchanged and introduces zero
computational experts. With heterogeneous load balancing across normal experts and these zero
computational experts, some tokens skip part of the computation, which improves overall
efficiency. AdaMoE [168] achieves token-adaptive routing by introducing null experts alongside the standard
experts in MoE, where the null experts perform no computation. Similar to MoE++, the router still
selects a fixed number of k experts, but this k is larger than the k in a comparable vanilla MoE. This allows
different tokens to use different numbers of true experts based on their needs. For instance,
semantically rich tokens might engage more true experts, while less significant tokens might be routed to
more null experts, thus achieving adaptive routing.
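The null-expert mechanism can be sketched as ordinary top-k selection over an enlarged expert list in which the trailing entries do no work. The function name and the index convention (true experts first, null experts last) are assumptions for illustration.

```python
def route_with_null_experts(logits, num_true, k):
    """Top-k over true + null experts; return only the true experts that were picked.

    Indices 0..num_true-1 are true experts; the remaining indices are null experts
    that perform no computation.
    """
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [i for i in chosen if i < num_true]
```

The router always selects k slots, but the number of true experts actually executed varies from token to token depending on how many of those slots fall on null experts.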
Load Balancing. Gate load is the number of tokens routed to each expert. As discussed in Routing
Strategies, token-choice routing is prone to imbalance due to biased gates: one expert may be
selected by many tokens while other experts process far fewer. This can cause the MoE to
collapse, leaving some experts insufficiently trained. Furthermore, unbalanced routing slows down
training because lightly loaded experts must wait for heavily loaded ones to finish processing. Given
that modern language models prefer token-choice routing for effective autoregressive token prediction, it is
crucial for MoE models to achieve balanced routing.
A common practice for addressing unbalanced routing is to add an auxiliary load balancing loss $\mathcal{L}_{aux}$ as a
soft balance constraint, so that the final optimization objective of LLM training is the sum of a
cross-entropy loss $\mathcal{L}_{ce}$ for language modeling and the auxiliary loss $\mathcal{L}_{aux}$ for load balancing.
Shazeer et al. [313] propose to use the coefficient of variation (CV) of the gate importance scores (i.e.,
routing probabilities) as an auxiliary loss, where more balanced routing leads to lower CV scores. For a
stricter balance constraint, the gate load $D(X)$ is also estimated in a way that is compatible with
back-propagation:

$$\mathcal{L}_{aux} = \alpha \, \mathrm{CV}\Big(\sum G(X)\Big)^2 + \beta \, \mathrm{CV}\Big(\sum D(X)\Big)^2, \tag{27}$$

where the sums aggregate over the tokens in the batch, and α and β are hyper-parameters.
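The importance term of Eq. (27) can be sketched directly from its definition. The snippet covers only the first term (the load term additionally needs a differentiable load estimator, which is not shown); the function names, the default value of `alpha`, and the use of the population variance are assumptions.

```python
def cv_squared(values):
    """Squared coefficient of variation: variance divided by squared mean."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return var / (mean ** 2)

def importance_loss(gate_probs, alpha=0.01):
    """alpha * CV^2 of per-expert importance.

    gate_probs[t][e] is the routing probability of token t for expert e;
    importance of an expert is that probability summed over tokens.
    """
    num_experts = len(gate_probs[0])
    importance = [sum(row[e] for row in gate_probs) for e in range(num_experts)]
    return alpha * cv_squared(importance)
```

A perfectly uniform router gives zero loss, while a router that concentrates probability on one expert is penalized in proportion to the spread of the importance values.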
GShard [314] provides a simplified version of auxiliary loss that considers both the gate load and the
gate importance scores:
$$\mathcal{L}_{aux} = \frac{1}{N} \sum_{i=1}^{N} D_i(X) \cdot G_i(X) \tag{28}$$

This design is prevalent in MoE and has become a default setting in many popular LLMs [172, 327].
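Equation (28) can be sketched for top-1 routing as follows. The normalizations are assumptions made for illustration: here the load term is taken as the fraction of tokens dispatched to each expert and the importance term as the mean routing probability of that expert; actual implementations may scale these differently.

```python
def gshard_aux_loss(gate_probs, top1_choice):
    """(1/N) * sum_i D_i * G_i for N experts.

    gate_probs[t][e]: routing probability of token t for expert e.
    top1_choice[t]:   index of the expert that token t was dispatched to.
    """
    num_tokens, num_experts = len(gate_probs), len(gate_probs[0])
    total = 0.0
    for e in range(num_experts):
        d_e = sum(1 for c in top1_choice if c == e) / num_tokens  # load fraction
        g_e = sum(row[e] for row in gate_probs) / num_tokens      # mean gate probability
        total += d_e * g_e
    return total / num_experts
```

The load fraction is not differentiable, so gradients flow only through the mean gate probability; the product is smallest when both quantities are spread evenly across experts.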
Although such auxiliary losses are effective for load balancing, they may distort the optimization
direction because their gradients interfere with those of the language modeling loss. To push the upper bound
of language modeling performance, Wang et al. [170] introduce a method that discards the auxiliary loss while
maintaining load balance. The implementation applies an expert-wise bias to the routing scores of each expert
before the top-k routing decision is made. This bias is not static: it is dynamically updated for each expert
based on its recent load. This iterative process of token routing and bias updating allows the model to maintain
a balanced distribution of expert load consistently throughout training by directly adjusting gating scores
rather than introducing an auxiliary loss term, thereby avoiding interference with the primary language
modeling objective.
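The bias-based balancing loop can be sketched as two small functions. The fixed step size, the comparison against the mean load, and the function names are assumptions for illustration and may differ from the update rule in the cited work.

```python
def select_top_k(scores, bias, k):
    """Pick experts by biased score; the bias influences selection only."""
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]

def update_bias(bias, expert_load, step=0.01):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(expert_load) / len(expert_load)
    new_bias = []
    for b, load in zip(bias, expert_load):
        if load > mean_load:
            new_bias.append(b - step)
        elif load < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias
```

Because the bias is adjusted outside of back-propagation, no balancing gradient is added to the language modeling loss; in this sketch the original, unbiased scores would still be used as the mixing weights for the selected experts.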
Qiu et al. [171] address the issue of inhibited expert specialization in MoE models when the traditional
load balancing loss (LBL) is calculated at the micro-batch level. The authors argue that a micro-batch load
balancing loss forces tokens, even domain-specific ones, to be uniformly routed to all experts within each
small sequence, hindering the ability of experts to specialize. To solve this, they propose a global-batch load
balancing loss by introducing an extra communication step to synchronize the expert load $D(X)$ across all
parallel micro-batches that form a global batch. The load balancing loss is then calculated using this globally
aggregated gate load. This approach aims to loosen the strict constraint of micro-batch LBL, encouraging
load balance at the broader corpus level, thereby fostering better domain specialization among experts and
improving overall model performance in both pretraining perplexity and downstream tasks.

5.2. Expert Architectures

Traditional MoE introduces sparsity in FFNs, and the experts are composed of multiple FFN blocks [315].
Although FFN experts are effective in most cases, there is still room to improve efficiency, robustness,
and training stability [337]. Besides, experts are specialized in a pretrained MoE model, where each expert
may process tokens from a certain field or with similar patterns. This has also inspired new Parameter-Efficient
Fine-Tuning (PEFT) paradigms that finetune only task-specific experts for lower computational cost [338].
As shown in Figure 11, there are multiple expert architectures in MoE layers, including fine-grained
experts, shared experts, MoD experts, and other heterogeneous experts (e.g., zero experts). In this section,
we introduce different types of expert architectures in FFN blocks. For attention experts, please refer to § 4.3.

Figure 11: MoE Expert Architectures. Components shown: Router, Top-4 Routing, Fine-grained Experts, Shared Expert, Zero Expert, MoD Layer, MoE Layer.

Fine-grained Experts. For an MoE layer with N experts, each with an intermediate size of M,
there are NM neurons in total. While keeping the total number of parameters unchanged, a fine-grained
MoE uses smaller M values and therefore a larger N (i.e., more small experts). Fine-grained experts bring more
combinations and better performance with a clear scaling property [174]. For instance, a 16-top-2 setting
only has $\binom{16}{2} = 120$ expert selection combinations, while 64-top-8 has $\binom{64}{8} \approx 4.4 \times 10^{9}$ combinations. Such
a fine-grained setting has been verified to be effective in modern LLMs like DeepSeekMoE [172], Qwen-
MoE [33], and OLMoE [174]. However, more experts bring additional routing overhead and
new challenges in designing efficient training/inference frameworks [339].
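The combination counts quoted for the 16-top-2 and 64-top-8 settings can be checked with the standard library; the helper name is hypothetical.

```python
import math

def routing_combinations(num_experts, k):
    """Number of distinct expert subsets a token can be routed to under top-k."""
    return math.comb(num_experts, k)

coarse = routing_combinations(16, 2)   # 16 experts, top-2
fine = routing_combinations(64, 8)     # 64 experts, top-8
```

Here `coarse` evaluates to 120 and `fine` to 4426165368, i.e. roughly 4.4 × 10^9, so splitting each expert into four smaller ones and activating four times as many increases the number of possible expert subsets by more than seven orders of magnitude.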
Shared Experts. Shared experts (or residual experts) are fixed experts to which every token is always
routed. DeepSpeed-MoE [173] utilizes another linear projection and softmax to compute a weighted sum of the
shared experts and the vanilla experts. Qwen2-MoE [32, 340] applies a similar linear projection with sigmoid
to calculate the contribution weight of the shared experts and sums the outputs of the two types of experts as
the final output. DeepSeekMoE [172] directly adds the shared experts’ and the vanilla experts’ outputs without
complex re-weighting. These implementations all show improvements in language modeling
perplexity or downstream task performance.
Mixture-of-Depths. Besides the above horizontally distributed experts, experts could also be vertically
distributed from the layer perspective. Mixture-of-Depths (MoD) [175] regards transformer layers as
experts and selects the top-k tokens for each layer with a corresponding router. It thus significantly reduces the
computational cost by processing fewer tokens in each layer. Different from LayerSkip [341], tokens dropped
by earlier layers may still be processed by later layers. MoD is orthogonal to MoE and could be combined
with MoE as Mixture-of-Depths-and-Experts (MoDE).
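A minimal sketch of the per-layer top-k token routing described above is given below. It assumes a linear router and scales the block output by the router score before adding it to the residual stream; the names `mod_layer`, `router_w`, `block_fn`, and `capacity` are hypothetical, and the real MoD implementation may differ in its details.

```python
import numpy as np


def mod_layer(x, router_w, block_fn, capacity):
    """Route only the `capacity` highest-scoring tokens through `block_fn`.

    x: (tokens, dim) hidden states; router_w: (dim,) router weights.
    Tokens that are not selected skip the block and keep their residual value.
    """
    scores = x @ router_w                      # one scalar score per token
    selected = np.argsort(scores)[-capacity:]  # indices of the top-k tokens
    out = x.copy()                             # skipped tokens pass through unchanged
    out[selected] = x[selected] + scores[selected, None] * block_fn(x[selected])
    return out, selected


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
router_w = rng.normal(size=4)
out, selected = mod_layer(x, router_w, block_fn=np.tanh, capacity=2)
skipped = np.setdiff1d(np.arange(8), selected)
```

Because a fresh router runs in every layer, a token skipped here can still be selected by a later layer, which is the difference from LayerSkip noted above.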
Other Special Experts. Besides the conventional modules that could be constructed as experts, there are
also special variants that integrate the idea of sparse MoE. SoftMoE [342] mixes tokens into soft slots, applies
experts to process such slots, and utilizes an output projection to recover token representations. MoE++ [336]
introduces zero, copy, and constant experts in MoE to adaptively control the top-k routing and the overall
computational cost. ModuleFormer [343] provides a scheduled expert expansion method to add new
specialized experts while maintaining the original knowledge. Inspired by the parameter-efficient fine-tuning
methods, experts could also be LoRA weights for lightweight design [344, 345, 346, 347], and the number
of experts could be scaled to 1 million with a specially designed expert retrieval strategy [348].

[Figure 12 panels: Split (FFN → Experts); Copy (FFN → Experts); Merge (FFN1, FFN2 → Experts)]

Figure 12: MoE Conversion Strategies.

5.3. MoE Conversion

Training MoE models from scratch requires massive computational resources. Therefore, it is worth constructing
MoE models from existing dense models to efficiently scale the model capacity or reduce the inference
cost. As shown in Figure 12, there are three main strategies: (1) splitting an existing FFN to construct experts
(MoEfication [177]); (2) copying FFNs and aggregating the copies as experts (Sparse Upcycling [179, 349]); (3)
merging existing dense models into an MoE model (BTX [181]). Here we introduce these strategies from
two aspects: “dense to MoE”, which converts a single dense model, and “sparse model routing”, which
aggregates dense models into an MoE model.
Dense to MoE. MoEBERT [176] builds MoE by splitting encoder-based BERT-family models and proposes
to use importance scores [350] for guiding the model adaptation. MoEfication [177] constructs MoE by
splitting T5 models according to a co-activation graph, and the results are very close to those of the original
T5 model. LLaMA-MoE [150, 178] partitions FFNs in LLaMA [351, 352] to build MoE models with continual
pretraining and supervised fine-tuning. Besides splitting dense FFNs into sparse experts, Komatsuzaki et al. [179]
propose to scale the model capacity by copying existing FFNs and initializing new gates for each MoE layer,
namely sparse upcycling. For vertical MoE conversion, DLO [353] expands layers and builds MoD experts.
MoDification [354] converts layers in an existing dense model to MoD-style experts with a threshold-p gate,
where only tokens whose gate scores exceed a predefined threshold are processed.
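To make the split and copy strategies concrete, the sketch below builds experts from a toy ReLU FFN. It is an illustration under simplifying assumptions: neurons are partitioned at random (the cited methods use co-activation graphs, importance scores, and similar criteria), and all names are ours. The useful property it checks is that both constructions start out functionally equivalent to the dense FFN when every expert is active.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, num_experts = 8, 32, 4
w_in = rng.normal(size=(d_model, d_ff))
w_out = rng.normal(size=(d_ff, d_model))


def ffn(x, w_in, w_out):
    """Toy two-layer FFN with a ReLU activation."""
    return np.maximum(x @ w_in, 0.0) @ w_out


x = rng.normal(size=(5, d_model))
dense_out = ffn(x, w_in, w_out)

# Split: partition the hidden neurons into disjoint groups, one group per expert.
groups = np.array_split(rng.permutation(d_ff), num_experts)
split_experts = [(w_in[:, idx], w_out[idx, :]) for idx in groups]
split_sum = sum(ffn(x, wi, wo) for wi, wo in split_experts)

# Copy (sparse upcycling): every expert starts as a full copy of the dense FFN,
# and a freshly initialized gate produces normalized mixing weights.
copy_experts = [(w_in.copy(), w_out.copy()) for _ in range(num_experts)]
gate_logits = rng.normal(size=num_experts)
gate = np.exp(gate_logits) / np.exp(gate_logits).sum()
copy_mix = sum(g * ffn(x, wi, wo) for g, (wi, wo) in zip(gate, copy_experts))
```

With sparse top-k routing only a subset of split experts fires, so the output deviates from the dense model; this is one reason the cited works follow conversion with continued training.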
Sparse Model Routing. Besides module-based MoE, models could be merged or aggregated to build
MoE-like models that better comprehend user instructions and generate responses with specialized
experts. Branch-Train-Merge (BTM) [180] proposes to train several expert language models (LMs) on
domain-specific datasets. During inference, these expert models are averaged with weights determined by the
input query’s domain category, and the resulting model generates the response. Instead of model merging,
Branch-Train-Mix (BTX) [181] directly aggregates expert LMs’ FFNs into MoE layers and trains a new gate for
each layer to dynamically select specialized experts.

6. Hybrid Architectures
As the core component of the Transformer, softmax attention has demonstrated its superior performance
across a wide range of tasks. However, its quadratic computational complexity, coupled with the growing
KV-cache overhead, has become a critical barrier to efficiency in long-sequence processing. In contrast, linear
models have achieved significant breakthroughs in computational efficiency. Nevertheless, extensive studies
have shown that these linear models often fall short of standard softmax attention in tasks requiring precise
recall, sparse information retrieval, and long sequence modeling. In response to these challenges, researchers
have been developing hybrid models to achieve a better balance between efficiency and performance.

[Figure 13 panels: (a) Inter-layer Hybrid, stacking Linear Attention, Softmax Attention, and MLP/MoE blocks (repeat counts ×N, ×M, ×L); (b) Intra-layer Hybrid, with a head split (hidden-dimension-wise) and a full/linear attention split (sequence-wise), followed by MLP/MoE]

Figure 13: Hybrid Model Architectures. (a) illustrates the classical paradigm of the inter-layer hybrid approach. (b)
demonstrates the classical paradigm of the intra-layer hybrid approach. The left side of (b) represents a pattern similar to
Hymba [191], which employs head-wise partitioning with either softmax attention or linear attention. The right side
depicts a pattern analogous to LoLCATs [114], featuring sequence-wise partitioning where local regions utilize softmax
attention while distant regions employ linear attention.

The core challenge of hybrid models lies in combining the efficiency of linear models with the expressive
power of standard softmax attention. As shown in Figure 13, current hybrid models fall into two main
categories based on the integration approach:

• Inter-layer Hybrid employs a hierarchical alternating approach, where softmax attention layers are
inserted at specific intervals between consecutive linear sequence modeling layers. This approach ensures
that the majority of layers retain linear computational complexity while still benefiting from the high
representational capability of softmax attention layers.
• Intra-layer Hybrid achieves fine-grained fusion within individual layers by blending linear
sequence modeling with softmax attention inside the same layer.

6.1. Inter-layer Hybrid

The inter-layer hybrid model typically combines softmax attention layers and linear sequence modeling
layers in a predefined proportion. Due to its practical effectiveness and straightforward implementation, it
has become a dominant approach in the development of hybrid architectures.


Many notable hybrid models have been built on the Mamba [101] architecture. Zamba [182] is a
7B-parameter hybrid language model built primarily on Mamba with periodic insertions of a single global
shared self-attention block. Zamba demonstrates competitive accuracy on general language understanding
benchmarks, approaching leading models like Mistral [355] despite using significantly fewer training tokens.
Compared to Zamba, Zamba2 [183] introduces key improvements, including a switch to Mamba2 [102],
two alternating shared attention blocks, and non-shared LoRAs. These changes lead to better performance
and efficiency. Jamba [185] introduces a more expressive and scalable 52B hybrid architecture that
combines Mamba, standard attention, and MoE, interleaved at a 7:1 ratio. This
design gives Jamba strong in-context learning and efficient inference, achieving 3× higher throughput
than Mixtral and supporting 256K context with only 4GB of KV cache. Samba [184] combines Mamba and
Sliding Window Attention. Samba avoids full attention and adopts linear-time mechanisms, enabling it
to extrapolate up to 1 million tokens in zero-shot and recall up to 256K tokens with minimal fine-tuning.
Mamba-in-Llama [112] distills pretrained Transformers into Mamba and proposes a hybrid version of
standard attention and Mamba. The method reuses weights from Transformer attention layers
to initialize Mamba blocks, followed by a multi-stage distillation pipeline to preserve the capabilities of
the standard attention layers. Hunyuan-TurboS [43] is a 56B activated (560B total) hybrid model that combines
Mamba2, standard attention, and MoE layers to balance long-sequence efficiency and contextual reasoning.
It uses an interleaved Attention-Mamba2-FFN/Mamba2-FFN block structure and supports a 256K context length.
Zebra-Llama [187] proposes an efficient hybrid architecture combining Mamba2 and MLA [27] to achieve
Transformer-level performance with minimal overhead. The method employs SVD-based initialization
and layer distillation to transfer knowledge from pre-trained models, along with a sensitivity-aware layer
replacement strategy for optimal efficiency.
In addition to Mamba, numerous works combine standard attention with linear attention, linear RNNs,
and sliding window attention. RWKV-X [186] strategically interleaves sparse attention MoBA [144] within
RWKV-7 [83] layers, with about 25% of layers using sparse attention, achieving strong performance across
both short- and long-context tasks. It demonstrates superior decoding stability and lower latency than full-
attention models at long sequence lengths. YOCO [188] combines sliding-window attention with standard
attention. It features shared KV cache across layers, where the self-decoder generates a single KV cache
that is reused by the cross-decoder layers. YOCO reduces memory usage and improves efficiency, especially
for long-context tasks, by caching only once for all layers. RecurrentGemma [189] combines Griffin [356]
with sliding window attention, avoiding global attention entirely. This design enables RecurrentGemma
to match the performance of similarly sized transformer models like Gemma, while offering significantly
better memory and inference efficiency. MiniMax-01 [39] is a 456B-parameter MoE hybrid model that
integrates Lightning Attention with standard softmax attention to handle ultra-long sequences of up to 4M
tokens efficiently. LaCT [190] combines large-chunk TTT with local window attention to efficiently model
long sequences while optimizing hardware utilization. This hybrid design enables linear-complexity global
context modeling with quadratic costs only locally, supporting tasks like novel view synthesis with 1M tokens
and autoregressive video diffusion.
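The "predefined proportion" that characterizes inter-layer hybrids can be expressed as a simple layer schedule. The sketch below is illustrative only: the function name, the layer-type labels, and the choice to place the softmax layer at the end of each period are assumptions, and the surveyed models position their attention layers in model-specific ways.

```python
from typing import List


def inter_layer_schedule(num_layers: int, linear_per_softmax: int) -> List[str]:
    """Return one layer-type label per layer.

    Every period consists of `linear_per_softmax` linear sequence-modeling
    layers followed by a single softmax attention layer.
    """
    period = linear_per_softmax + 1
    return [
        "softmax" if (i + 1) % period == 0 else "linear"
        for i in range(num_layers)
    ]


# A 7:1 ratio over 32 layers, in the spirit of the ratio quoted for Jamba above.
schedule = inter_layer_schedule(num_layers=32, linear_per_softmax=7)
print(schedule.count("linear"), schedule.count("softmax"))
```

Only the softmax layers keep a KV cache that grows with sequence length, so raising the ratio trades recall capacity for lower memory.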

6.2. Intra-layer Hybrid

Intra-layer hybrid architectures combine linear and standard attention within a single layer. Typical designs
include: (1) head-wise split, where different heads use either linear or standard attention, and (2) sequence-wise
split, where different attention types are applied to different input segments.

Head-wise split. Hymba [191] represents a classic head-wise hybrid architecture, where the attention
heads are partitioned into two subsets: one employing Mamba and the other utilizing standard softmax
attention. Additionally, it incorporates innovations such as learnable meta tokens for memory initialization,
sliding window attention with selective global attention, and cross-layer KV cache sharing. The 1.5B Hymba
model outperforms Llama-3.2-3B across reasoning and recall tasks while achieving 3.49× higher throughput
and a 19× smaller cache size. WuNeng [357] combines RWKV-7 state-driven mechanisms with Transformer attention,
balancing high-resolution recall and efficiency through cross-head interaction. WuNeng outperforms LLaMA
and Hymba, achieving better results in complex reasoning and long-context tasks.
Sequence-wise split. LoLCATs [114] is a sequence-level hybrid model that replaces softmax attention
with a combination of linear attention and SWA. It uses linear attention for all earlier tokens and local
softmax for the most recent tokens, capturing local dependencies. The model is trained in two steps: (1)
attention transfer to match softmax outputs, and (2) LoRA to fine-tune the replaced layers. With less
than 0.2% parameter updates and just 0.04B tokens used, LoLCATs successfully linearizes models up to
405B parameters, outperforming previous hybrid and subquadratic models in both efficiency and quality.
Liger [115] transforms pretrained Transformer-based LLMs into gated linear recurrent structures by reusing
key projection weights to build the gate matrix, which avoids added parameters and preserves efficiency. Liger
blends Linear Attention with Sliding Window Attention, enabling softmax-like expressivity with linear-time
inference and constant memory. Supporting models from 1B to 8B parameters, Liger recovers up to 93% of
the original model performance using only 0.02B fine-tuning tokens via LoRA. TransMamba [192] is a
sequence-level hybrid model that dynamically combines standard softmax attention and Mamba state space modeling
using shared parameters and a learned TransPoint to switch between them. Early tokens are processed with
standard softmax attention for precision, while later tokens use Mamba for efficient long-range modeling,
facilitated by a lossless Memory Converter. Supporting models up to 1.5B parameters, TransMamba achieves
up to 25% faster training than Transformers and excels in long-context tasks. LoLA [193] is developed
to overcome the limitations of conventional linear attention models in long-context tasks. It integrates
three complementary memory systems: low-rank linear attention for efficient global token storage, sliding
window attention for precise local context modeling, and a sparse global cache for retaining high-fidelity
representations of interference-prone key-value pairs.
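The sequence-wise split described above (exact softmax weights for recent tokens, kernelized linear-attention weights for earlier ones) can be sketched as follows. This is a reference-style illustration, not the implementation of any cited model: the `elu + 1` feature map, the shared normalizer, and the quadratic-time loop are assumptions chosen for readability. A practical version would keep the distant terms in a constant-size recurrent state.

```python
import numpy as np


def feature_map(u):
    """Assumed positive feature map for linear attention: elu(u) + 1."""
    return np.where(u > 0, u + 1.0, np.exp(u))


def hybrid_attention(q, k, v, window):
    """Causal attention with softmax terms inside the window, linear terms before it."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(v)
    for t in range(n):
        start = max(0, t - window + 1)
        # Exact exponential weights for the most recent `window` tokens.
        local_w = np.exp(q[t] @ k[start:t + 1].T * scale)
        # Kernelized weights for all earlier tokens (empty when start == 0).
        far_w = feature_map(q[t]) @ feature_map(k[:start]).T
        numerator = local_w @ v[start:t + 1] + far_w @ v[:start]
        out[t] = numerator / (local_w.sum() + far_w.sum())
    return out


def causal_softmax_attention(q, k, v):
    """Standard causal softmax attention, used here only as a sanity check."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


rng = np.random.default_rng(0)
n, d = 6, 4
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
full = hybrid_attention(q, k, v, window=n)   # window covers the whole sequence
mixed = hybrid_attention(q, k, v, window=2)  # softmax on 2 tokens, linear beyond
```

When the window covers the full sequence, the hybrid reduces to ordinary causal softmax attention, which is a convenient correctness check.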

7. Diffusion Large Language Models


In the preceding sections, we discussed various Transformer-based efficient architectures for LLMs, focusing
primarily on attention mechanisms and MoE modules. These approaches improve efficiency by optimizing
computation within the standard autoregressive (AR) framework. However, despite their advantages, they
remain constrained by the sequential nature of AR generation, where tokens are produced one at a time. This
token-wise dependency requires as many forward passes as the sequence length, representing a fundamental
bottleneck for inference latency.
In this section, we introduce recent advances in diffusion large language models (Diffusion LLMs) [358,
359]. In contrast to autoregressive models, Diffusion LLMs generate text by progressively denoising a
sequence from a noisy or masked state to a coherent output. This fundamental shift in generation enables
several unique advantages. Most notably, Diffusion LLMs support parallel decoding, allowing multiple tokens
to be produced at each refinement step, which significantly reduces inference latency by avoiding sequential
token generation. Moreover, the formulation of text generation within Diffusion LLMs as a denoising or
infilling process over a fixed-length canvas inherently provides superior controllability. This allows the
model to better adhere to specific output constraints, such as length, format, or structure, which are
difficult to satisfy with standard autoregressive methods. Finally, Diffusion LLMs utilize bidirectional
attention, enabling the model to access and revise context across the entire sequence at every step. This
global view helps mitigate issues such as the reversal curse, which arise from the unidirectional nature of
autoregressive decoding.

[Figure 14 panels: Autoregressive LLM (Input Text → Response); Diffusion LLM (Input Text → Response); legend: Mask, No Mask]

Figure 14: Mechanism Comparison of Autoregressive Models and Diffusion LLMs.

7.1. Non-Autoregressive Diffusion LLM

LLaDA [194] is a non-autoregressive Diffusion LLM with 8 billion parameters, trained from scratch. It models
the data distribution through a forward process that progressively masks tokens and a reverse denoising
process that simultaneously predicts all masked tokens per step. Empirically, LLaDA exhibits strong scalability
and achieves performance on par with leading AR models such as Llama3-8B across diverse benchmarks,
including in-context learning and instruction-following after supervised fine-tuning (SFT). Furthermore, it
overcomes known failure modes of AR models, such as the “reversal curse”, attributed to their unidirectional
processing.
In this section, we formalize LLaDA’s probabilistic framework for better understanding. The model
distribution $p_\theta(x_0)$ is characterized by a predefined forward corruption process and a learned reverse
generative process. The forward process systematically transforms a clean sequence $x_0$ into a corrupted
intermediate $x_t$ by masking each token independently with probability $t$, for a timestep $t \in [0, 1]$. The
reverse process is trained to recover the original data by iteratively denoising $x_t$, progressively predicting the
masked tokens as $t$ is annealed from 1 to 0.
Central to this framework is a parametric mask predictor, $p_\theta(\cdot \mid x_t)$, which operates on a corrupted sequence
$x_t$ to jointly infer the entire set of masked (M) tokens. Its parameters $\theta$ are optimized by minimizing the
following cross-entropy objective, computed exclusively over the masked positions:

$$\mathcal{L}(\theta) \triangleq -\mathbb{E}_{t, x_0, x_t}\left[\frac{1}{t}\sum_{i=1}^{L} \mathbf{1}[x_t^i = \mathrm{M}]\,\log p_\theta(x_0^i \mid x_t)\right], \tag{29}$$

where $x_0$ is a sample drawn from the data distribution, $t$ is a timestep uniformly sampled from the interval
$[0, 1]$, $x_t$ denotes the resulting corrupted sequence, $L$ is the sequence length, and the superscript $i$ indexes tokens.
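A single-sample Monte Carlo estimate of the masked cross-entropy objective above can be sketched in a few lines. This is a toy illustration, not LLaDA's training code: `MASK_ID`, the function names, and the uniform "model" used in the demo are all assumptions.

```python
import numpy as np

MASK_ID = 0  # hypothetical token id reserved for the mask token M


def forward_mask(x0, t, rng):
    """Forward process: mask each token independently with probability t."""
    is_masked = rng.random(x0.shape) < t
    return np.where(is_masked, MASK_ID, x0), is_masked


def masked_diffusion_loss(log_probs, x0, is_masked, t):
    """One-sample estimate of the objective: masked-position cross-entropy times 1/t.

    log_probs: (L, vocab) array holding log p_theta(. | x_t) for every position.
    """
    token_ll = log_probs[np.arange(len(x0)), x0]  # log-likelihood of each clean token
    return -(token_ll * is_masked).sum() / t


rng = np.random.default_rng(0)
vocab_size, seq_len, t = 10, 6, 0.5
x0 = rng.integers(1, vocab_size, size=seq_len)  # clean tokens never use MASK_ID
xt, is_masked = forward_mask(x0, t, rng)

# Stand-in for the mask predictor: a uniform distribution over the vocabulary.
log_probs = np.full((seq_len, vocab_size), -np.log(vocab_size))
loss = masked_diffusion_loss(log_probs, x0, is_masked, t)
```

Unmasked positions contribute nothing to the loss, and the `1/t` factor up-weights samples in which few tokens are masked.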
After training, sampling from $p_\theta(x_0)$ is performed by simulating the reverse process. Crucially, the
training objective $\mathcal{L}(\theta)$ is a variational upper bound on the model’s negative log-likelihood:

$$-\mathbb{E}_{p_{\text{data}}(x_0)}\left[\log p_\theta(x_0)\right] \le \mathcal{L}(\theta), \tag{30}$$

making it a principled objective for density estimation.
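The reverse process mentioned above can be simulated with a short loop. The sketch below assumes a random remasking rule: when moving from time `t` to `s < t`, each still-masked token stays masked with probability `s / t`, which mirrors the independent forward masking. The names `reverse_sample` and `predict_fn` are hypothetical, and the toy predictor stands in for a trained mask predictor.

```python
import numpy as np

MASK_ID = 0  # hypothetical token id reserved for the mask token


def reverse_sample(predict_fn, length, num_steps, rng):
    """Start from a fully masked sequence and unmask it over `num_steps` passes."""
    x = np.full(length, MASK_ID)
    times = np.linspace(1.0, 0.0, num_steps + 1)
    for t, s in zip(times[:-1], times[1:]):
        masked = x == MASK_ID
        proposal = predict_fn(x)  # one forward pass predicts every position at once
        # A masked token remains masked with probability s / t (assumed rule).
        keep_masked = rng.random(length) < (s / t)
        x = np.where(masked & ~keep_masked, proposal, x)
    return x


rng = np.random.default_rng(0)
length, num_steps = 16, 4
target = np.arange(1, length + 1)  # toy "model" that always predicts this sequence
calls = []


def toy_predict(x):
    calls.append(1)  # count forward passes
    return target


sample = reverse_sample(toy_predict, length, num_steps, rng)
```

Here 16 tokens are produced with 4 forward passes, whereas token-by-token autoregressive decoding would need 16.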
Unlike masked language models such as BERT [360] that employ a fixed masking ratio, LLaDA’s use
of a random ratio is fundamental to its design. This distinction, theoretically grounded by the negative
log-likelihood bound in Eq. (30), establishes LLaDA as a true generative model. This principled foundation
enables emergent capabilities like in-context learning, positioning LLaDA as a viable, non-autoregressive
alternative to mainstream LLMs.
Whether Diffusion LLMs can match the reasoning prowess of Reinforcement Learning (RL)-enhanced AR
models has been a significant open problem. The d1 framework [361] provides a compelling affirmative
answer. It adapts pre-trained masked Diffusion LLMs for complex reasoning via a two-stage process: first,
SFT on reasoning traces, followed by a novel RL algorithm, diffu-GRPO. This policy-gradient algorithm is
tailored for the non-autoregressive nature of Diffusion LLMs, featuring an efficient one-step log-probability
estimation regularized via random prompt masking. Applied to the LLaDA-8B-Instruct model, d1 achieves
substantial gains on mathematical and logical reasoning benchmarks, surpassing not only the base model
but also rigorous SFT- and RL-only ablations. This work provides strong evidence that Diffusion LLMs can be
potent reasoners when augmented with RL, establishing them as a competitive architecture in a domain
previously dominated by AR models.

7.2. Bridging Diffusion LLM and Autoregressive

Diffusion LLMs offer significant advantages over their AR counterparts, namely parallelized generation
and greater controllability. Nevertheless, they face challenges in likelihood modeling and are inherently
constrained to fixed-length outputs. A significant thrust in recent Diffusion LLM research is to combine
the strengths of both AR and diffusion methods. BD3-LMs [199] interpolate between discrete denoising
diffusion and AR models by defining an autoregressive distribution over blocks of tokens, while performing
diffusion within each block. By synergizing diffusion and autoregressive paradigms, this hybrid approach
remedies the fixed-length constraint of the former and the high inference latency of the latter. Specifically,
BD3-LMs integrate KV caching from AR models with parallel in-block token sampling, thereby enabling
flexible-length generation and substantially accelerating inference. They further propose an efficient training
algorithm to tackle the key challenge of gradient variance estimation in Diffusion LLMs, complemented by
data-driven noise schedules engineered to minimize this variance.
Formally, BD3-LMs operate on a sequence $x = (x_1, \ldots, x_L)$ by partitioning it into $B$ non-overlapping
blocks of length $L'$, denoted $x = (x^{(1)}, \ldots, x^{(B)})$. The model defines the joint log-likelihood through an
autoregressive factorization over these blocks:

$$\log p_\theta(x) = \sum_{b=1}^{B} \log p_\theta(x^{(b)} \mid x^{(<b)}), \tag{31}$$

where each conditional distribution $p_\theta(x^{(b)} \mid x^{(<b)})$ is modeled by a discrete diffusion process. This process
learns a denoising model $p_\theta(x^{(b)} \mid x_t^{(b)}, x^{(<b)})$ to reverse a fixed forward noising process $q(\cdot \mid \cdot)$, according to
the reverse-time Markov chain:

$$p_\theta(x_s^{(b)} \mid x_t^{(b)}, x^{(<b)}) = \sum_{x^{(b)}} q(x_s^{(b)} \mid x_t^{(b)}, x^{(b)})\, p_\theta(x^{(b)} \mid x_t^{(b)}, x^{(<b)}). \tag{32}$$

The denoising function is parameterized by a single transformer, $f_\theta$, which employs a block-causal attention
mask to enforce the autoregressive dependency between blocks. For a given corrupted block $x_t^{(b)}$ and its
causal context $x^{(<b)}$, the model predicts the original clean block $\hat{x}_0^{(b)}$:

$$f_\theta\left(x_t^{(b)}, x^{(<b)}\right) \rightarrow \hat{x}_0^{(b)}. \tag{33}$$

Inference proceeds autoregressively across blocks while leveraging parallel decoding within each block.
This structure naturally accommodates KV caching for all preceding blocks x (<b) , significantly improving
efficiency.
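The block-causal attention pattern used here can be written as a boolean mask in which a position attends to every position in its own block and in all preceding blocks. The sketch below is a simplified illustration; the function name is ours, and the mask actually used when training BD3-LMs may be more elaborate.

```python
import numpy as np


def block_causal_mask(seq_len: int, block_len: int) -> np.ndarray:
    """Boolean (seq_len, seq_len) mask; True means the query row may attend to the key column."""
    block_id = np.arange(seq_len) // block_len
    # Allowed when the key's block index does not exceed the query's block index.
    return block_id[:, None] >= block_id[None, :]


mask = block_causal_mask(seq_len=6, block_len=2)
print(mask.astype(int))
```

Within a block the mask is fully bidirectional, which is what permits parallel denoising there, while earlier blocks never see later ones, so their keys and values can be cached.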
The training objective is a variational upper bound on the negative log-likelihood, constructed by
aggregating the diffusion loss from each individual block:

$$-\log p_\theta(x) \le \mathcal{L}_{\mathrm{BD}}(x; \theta) := \sum_{b=1}^{B} \mathcal{L}(x^{(b)}, x^{(<b)}; \theta), \tag{34}$$

where each term $\mathcal{L}(x^{(b)}, x^{(<b)}; \theta)$ is a standard objective for discrete diffusion, adaptable to various
formulations such as continuous-time or masking-based processes.
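The block-level bound follows in one step from the autoregressive factorization over blocks. The derivation below assumes, as is standard for diffusion objectives, that each per-block loss upper-bounds the negative conditional log-likelihood of its block:

```latex
\begin{aligned}
-\log p_\theta(x)
  &= -\sum_{b=1}^{B} \log p_\theta\big(x^{(b)} \mid x^{(<b)}\big)
     && \text{(factorization over blocks, Eq. (31))} \\
  &\le \sum_{b=1}^{B} \mathcal{L}\big(x^{(b)}, x^{(<b)}; \theta\big)
     && \text{(per-block variational bound, assumed)} \\
  &= \mathcal{L}_{\mathrm{BD}}(x; \theta).
\end{aligned}
```

With a single block ($B = 1$) the objective reduces to a full-sequence diffusion loss, and with blocks of length one the factorization becomes the usual token-level autoregressive one.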
From another perspective, DiffuLLaMA [200] pioneers an alternative approach that leverages the abundance
of pre-trained AR models. Their method adapts foundational AR models, such as GPT-2 and LLaMA
(spanning 127M to 7B parameters), into text diffusion architectures named DiffuGPT and DiffuLLaMA.
Grounded in a demonstrated connection between AR and diffusion objectives, this conversion is achieved via
a simple continual pre-training strategy. The resulting models outperform prior Diffusion LLMs and achieve
performance competitive with their AR counterparts, despite requiring significantly less training data than
training from scratch. This adaptation also enables the Diffusion LLMs to inherit capabilities like in-context
learning and instruction following, and they naturally excel at non-sequential tasks such as "fill-in-the-middle"
without prompt reordering, leveraging the inherent nature of the diffusion process.

7.3. Extending Diffusion LLM to Multimodality

Recent advancements have extended Diffusion LLMs into the more complex multimodal domain [45, 362,
363, 364, 365]. These multimodal Diffusion LLMs are engineered to operate on joint textual and visual
data by incorporating a vision encoder and a projection layer, which maps extracted visual features into the
language model’s latent embedding space.
Several distinct approaches have emerged in this area. LLaDA-V [201] introduces a purely
diffusion-based Multimodal LLM architecture, utilizing a vision encoder and an MLP connector that maps
the resulting visual representations into the language model’s latent space. This strategy enables effective
multimodal alignment, primarily achieved through visual instruction tuning. The model demonstrates
notable multimodal performance and substantial data scalability. Similarly, UniDisc [202] advocates for a
unified generative paradigm across both textual and visual domains, using purely discrete diffusion. It builds
upon a shared architecture that jointly tokenizes text and images into a common vocabulary and uses full
self-attention. UniDisc demonstrates strong joint multimodal inpainting and zero-shot editing capabilities
without explicit optimization for these tasks. In a similar vein, LaViDa [203] equips discrete diffusion models
with a vision encoder but specifically focuses on overcoming practical challenges in adapting diffusion models
for vision-language tasks. It introduces novel techniques such as “complementary masking” to improve
training data efficiency by ensuring all output tokens contribute to learning, “Prefix-Diffusion LLM decoding”
to enable KV caching for efficient inference with long multimodal prompts, and “timestep shifting” to enhance
sample quality, particularly at reduced generation steps, thereby preserving the distinctive advantages of diffusion
models, such as the speed-quality tradeoff and controllability.
To address the inherent instabilities and suboptimal performance associated with purely discrete diffusion
training in multimodal settings, Dimple [366] proposes an innovative hybrid “Autoregressive-then-Diffusion”
training paradigm. This approach first utilizes an autoregressive phase for robust vision-language alignment
and instruction following, then transitions to a diffusion-based masked language modeling
stage that restores parallel decoding capabilities. For inference, Dimple introduces “Confident
Decoding” for dynamic modulation of the number of tokens generated per step, and “Structure Priors”, enabling
fine-grained control over response length and format. Consequently, it achieves performance on par with prominent
autoregressive baselines like LLaVA-NEXT [367]. Further pushing the boundaries of unified modeling,
MMaDA [204] pioneers a novel category of multimodal diffusion foundation models, distinguished by a
common probabilistic framework and a modality-agnostic design. This architecture inherently eliminates
the requirement for discrete modality-specific components and is synergistically enhanced by a “mixed long
chain-of-thought (CoT)” fine-tuning methodology. This methodology establishes a consistent CoT structure
across diverse modalities, thereby enabling cold-start RL. Moreover, MMaDA introduces UniGRPO, a novel
unified RL algorithm optimized for diffusion models, which exhibits remarkable generalization across tasks,
including textual reasoning, multimodal comprehension, and even text-to-image synthesis.
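To illustrate how the number of tokens generated per step can be modulated dynamically, as in the “Confident Decoding” idea attributed to Dimple above, the sketch below reveals every masked position whose top probability clears a threshold. This is one plausible reading rather than Dimple's actual algorithm: the threshold rule, the fallback to the single most confident position, and all names are assumptions.

```python
import numpy as np


def confident_decode_step(probs, is_masked, threshold):
    """Pick which masked positions to reveal in one decoding step.

    probs: (L, vocab) predicted distributions; is_masked: (L,) boolean array.
    Returns a boolean reveal mask and the greedy token for every position.
    """
    confidence = probs.max(axis=-1)
    tokens = probs.argmax(axis=-1)
    reveal = is_masked & (confidence >= threshold)
    if is_masked.any() and not reveal.any():
        # Assumed fallback: reveal the most confident masked position so that
        # decoding always makes progress.
        best = np.argmax(np.where(is_masked, confidence, -np.inf))
        reveal[best] = True
    return reveal, tokens


probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
is_masked = np.array([True, True, True])
reveal_easy, tokens = confident_decode_step(probs, is_masked, threshold=0.75)
reveal_hard, _ = confident_decode_step(probs, is_masked, threshold=0.95)
```

A low threshold reveals many tokens per step (fewer passes, more risk), while a high threshold degrades gracefully toward one token per step.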

8. Applications to Other Modalities


Originally developed and popularized for their success in the text domain, efficient architectures are now being
broadly adapted to other domains. This section surveys the transfer of these powerful paradigms, namely
linear-time sequence models like SSMs and RWKV-like architectures, and sparse computation strategies
like MoE, to non-textual data. We explore their impact across several key modalities, beginning with their
extensive applications in computer vision (§8.1). We then examine their growing influence in the audio
domain (§8.2). Finally, we discuss their crucial role in multimodal learning (§8.3), where efficiency is
paramount for integrating diverse data streams. This expansion beyond text demonstrates the versatility of
these models, unlocking new capabilities and enabling scaling to previously intractable problem sizes across
the broader landscape of artificial intelligence.

8.1. Vision

While ViTs excel in many tasks, their quadratic computational complexity has spurred the development
of more efficient architectures. This section reviews the burgeoning application of linear-time sequence
models, such as SSMs, and sparse computation strategies like MoE. We explore their impact across the
vision landscape, beginning with foundational tasks including classification, detection, and segmentation
(§8.1.1). We then cover their role in image enhancement, restoration, and generation (§8.1.2). Finally, we
highlight their growing importance in domain-specific applications such as medicine, autonomous driving,
and remote sensing (§8.1.3). Collectively, these advancements mark a shift towards models that achieve
high performance while remaining computationally tractable, enabling new capabilities at scale.

8.1.1. Image Classification, Detection, and Segmentation

Table 2: Overview of Applications of Efficient Architectures Across Modalities and Categories.

| Modality | Category | Approach |
|---|---|---|
| Vision | Classification | InsectMamba [208], V-MoE [332], MoE-CNN [368], pMoE [369], Res-vmamba [370], Mammil [371], Spectralmamba [372], Memorymamba [373] |
| Vision | Detection | Vig [205], Vision-rwkv [206], Tutel [207], Mamba yolo [374], Voxel mamba [209], Mim-istd [375], Htd-mamba [376], Soar [377] |
| Vision | Segmentation | RWKV-SAM [378], Segman [379], VM-Unet [380], Pyramidmamba [381], Vig [205], Vision-rwkv [206] |
| Vision | Enhancement & Restoration | Restore-rwkv [382], Dvmsr [383], Q-mambar [384], RWKV-IR [385], Pixmamba [386], Watermamba [387], Retinexmamba [388], Llemamba [389], Fouriermamba [390], Vmambair [391], Serpent [392], Matir [393], Cu-mamba [394], Lfmamba [395] |
| Vision | Generation | DiS [396], Dim [397], DiM [210], Zigma [398], AiM [399], Maskmamba [400], Dimba [401], DiM-3D [402], Gamba [403], Diffusion-rwkv [404], Sdit [405] |
| Vision | Medicine | U-mamba [211], Vm-unet [212], Rwkv-unet [213], Segmamba [406], Zig-rir [407], Bsbp-rwkv [408], Mambamil [409], Mmr-mamba [410], Mambamir [411], Vmambamorph [412], I2i-mamba [413], Delta-wkv [414] |
| Vision | Autonomous Driving | Mambabev [214], Occrwkv [415], H-mba [416], Trajectory mamba [417], Drama [418], Salm2 [419] |
| Vision | Remote Sensing | RS-Mamba [420], Samba [421], Changemamba [422], Rs3mamba [423], Hsimamba [424], Pan-mamba [425], LE-Mamba [426], Rsdehamba [427], FMSR [428], Rscama [429] |
| Audio | Understanding | Audio mamba [215, 430, 431], Mamca [216], Rawbmamba [217], BiMamba [220], Ssamba [432], Dual-path mamba [221], Spmamba [222], VAD [223] |
| Audio | Enhancement & Generation | SEMamba [433], Tramba [434], SaShiMi [218], Music-Diff [219], oSpatialNet-Mamba [435] |
| Multimodality | Understanding | MaTAV [224], Avs-mamba [225], Av-mamba [436], VisualRWKV-UHD [226], Rwkv-clip [437], Lavida [203], LIMoE [438], Uni-MoE [439], VL-MoE [227], Moe-llava [228], MoCLE [229], Llava-mole [230], PaCE [440], Llada-v [201], Dimple [366] |
| Multimodality | Unified | UniDisc [202], Mmada [204] |

In image classification, recent efforts have focused on adapting linear-time sequence models, particularly
Mamba-based SSMs, to various vision tasks. A key trend is the creation of hybrid architectures that integrate
SSMs with established modules, such as CNNs or residual connections, to effectively capture both local
features and long-range dependencies [208, 370, 441]. To overcome the inherent 1D nature of SSMs for
2D image processing, researchers have proposed innovative data scanning strategies, such as multi-path
scanning for remote sensing images [442] and topology-aware scanning for graph-structured whole-slide
images in medical pathology [371]. These models are also being tailored for highly specialized domains by
addressing unique data challenges, including the high dimensionality of hyperspectral images [372] and
data scarcity in industrial defect recognition [373]. In a parallel pursuit of efficiency, sparse MoE models
have been successfully applied to vision. V-MoE demonstrated that sparsely activating “expert” sub-networks
can scale Vision Transformers to billions of parameters while reducing computational costs [332].
Subsequent work has further matured this paradigm by enhancing its adversarial robustness [368] and
providing theoretical guarantees for its sample efficiency [369].
Building on these foundational applications, the same principles are being extended to object detection,
where linear-time models replace or augment traditional backbones to enhance efficiency and capability. A
prominent trend is the direct integration of SSMs into established detector frameworks, such as in Mamba-
YOLO, which leverages an SSM-based backbone to achieve real-time performance without pre-training [374].
The adaptability of SSMs is particularly evident in specialized and challenging domains. For instance,
Voxel-Mamba introduces a novel group-free paradigm for 3D object detection from point clouds by processing
entire voxel spaces as a single sequence [209]. In multimodal and niche applications, models like Fusion-
Mamba pioneer cross-modal fusion in the hidden state space for RGB-Infrared detection [443], while others
employ hierarchical structures to efficiently detect small or specialized targets [375, 376, 377]. Beyond
SSMs, other general-purpose linear-complexity backbones, such as Gated Linear Attention (ViG) and RWKV-
based models (Vision-RWKV), are also emerging, demonstrating competitive performance with significantly
reduced computational overhead [205, 206]. Complementing these architectural innovations, systems-level
optimizations like Tutel are proving critical for enabling the efficient, large-scale training of sparse MoE
models, paving the way for future scalable detection systems [207].
The demand for efficiency is further amplified in semantic segmentation, a dense prediction task that
requires processing high-resolution imagery while maintaining a global receptive field. Here again, linear-
time sequence models like Mamba and RWKV have emerged as strong backbones. A key strategy involves
creating hybrid encoder-decoder architectures that pair the global context-capturing ability of these models
with local mechanisms, such as convolutional blocks or local attention, to preserve fine-grained details for
accurate boundary delineation [378, 379]. This approach is also adapted for specialized domains, from
lightweight crack segmentation [380] to remote sensing, where innovations in the decoder use SSMs for
efficient pyramid feature fusion [381]. Confirming their versatility, general-purpose backbones like ViG
and Vision-RWKV also prove highly effective for dense prediction, offering significant speed and memory
advantages over traditional Transformers without sacrificing performance [205, 206].

8.1.2. Image Enhancement, Restoration, and Generation

Efficient architectures are transforming image enhancement and restoration by modeling long-range depen-
dencies with linear complexity. Mamba and RWKV-based models, often embedded in U-Net frameworks,
are being tailored for specific tasks. In low-light and underwater enhancement, for example, they are
often combined with physical principles like Retinex theory [386, 387, 388, 389]. For broader restoration
challenges such as dehazing and super-resolution, key innovations focus on adapting these models to 2D data.
These include sophisticated scanning strategies, hybrid Mamba-Transformer designs, and even modeling
dependencies across feature channels [390, 391, 392, 393, 394]. The efficiency of these models is particularly
advantageous for high-dimensional data like 4D light fields and medical images [382, 395, 444]. To ensure
practical deployment, significant research also focuses on techniques like knowledge distillation, low-bit
quantization, and the creation of balanced benchmark datasets [383, 384, 385].
Building on their success in restoration, these architectures are now being applied to the even more
computationally demanding field of generative modeling. Here, a dominant trend is replacing expensive
Transformer backbones in diffusion models with scalable architectures like Mamba [210, 396, 397]. A central
challenge is adapting the 1D nature of SSMs for 2D image processing. This has spurred innovations in scanning
patterns, such as zigzag scans and bidirectional processing, to better capture spatial context [398]. The
architectural shift extends beyond diffusion, with Mamba also being adapted for autoregressive generation
to achieve substantial inference speed-ups [399]. To leverage complementary strengths, hybrid models
combining Mamba and Transformers are also being explored for tasks like masked image modeling and
text-to-image synthesis [400, 401]. This new class of efficient backbones is also enabling advances in high-
dimensional outputs like 3D shape and scene generation [402, 403]. Finally, other linear-time models like
RWKV are emerging as a viable alternative, further diversifying the architectural landscape of generative
AI [404, 405].
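
To make the scanning idea concrete, the sketch below flattens an H × W grid of patch tokens into a 1D sequence in zigzag (snake) order, builds the reversed sequence used for a bidirectional pass, and restores the grid afterwards. It is a minimal NumPy illustration under our own assumptions, not the scan of any cited model; the helper names `zigzag_order`, `scan`, and `unscan` are hypothetical.

```python
import numpy as np

def zigzag_order(h, w):
    """Flat patch indices in snake order: even rows left-to-right, odd rows right-to-left."""
    idx = np.arange(h * w).reshape(h, w)
    idx[1::2] = idx[1::2, ::-1].copy()  # reverse every second row
    return idx.reshape(-1)

def scan(feats, order):
    """Flatten an (H, W, C) feature map into an (H*W, C) token sequence following `order`."""
    h, w, c = feats.shape
    return feats.reshape(h * w, c)[order]

def unscan(tokens, order, h, w):
    """Inverse of `scan`: scatter tokens back to their original grid positions."""
    out = np.empty_like(tokens)
    out[order] = tokens
    return out.reshape(h, w, -1)

# Toy 3 x 4 grid with 2 channels (illustrative sizes only).
order = zigzag_order(3, 4)
feats = np.arange(3 * 4 * 2, dtype=float).reshape(3, 4, 2)
forward_tokens = scan(feats, order)      # input to a left-to-right sequence model
backward_tokens = forward_tokens[::-1]   # reversed copy for a bidirectional pass
restored = unscan(forward_tokens, order, 3, 4)
```

Unlike a plain row-major scan, consecutive tokens in this order are always spatial neighbors, which is one motivation for such scan patterns.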

8.1.3. Domain-specific Applications

Medicine. The medical imaging field has rapidly adopted efficient sequence models like Mamba and RWKV,
leveraging their ability to model long-range dependencies in high-resolution data. In semantic segmentation,
a dominant trend is the integration of these models into U-Net-like architectures to combine global context
awareness with local feature extraction. This has been successfully applied to a wide range of tasks, from 2D
biomedical and skin lesion segmentation [211, 212, 213] to challenging 3D volumetric data [406], with
some approaches using innovative nested or boundary-preserving RWKV structures for enhanced efficiency
and precision [407, 408]. This architectural flexibility also extends to classification, where these models
excel in the data-limited settings common to medicine [445] and are particularly powerful in computational
pathology for modeling interactions across vast sequences of patches in whole-slide images [409, 446].
Furthermore, these models are proving indispensable for complex image reconstruction, restoration, and
synthesis tasks. They are being used for multi-contrast MRI reconstruction [410], joint reconstruction and
uncertainty estimation [411], 3D deformable registration [412], and multi-modal image synthesis [413],
often outperforming established methods. The RWKV architecture and its variants have shown similar
promise in general medical image restoration and super-resolution [382, 414]. The versatility of these
models is even enabling novel applications beyond static images, such as motion-guided tracking of endoscopic
instruments in dynamic environments [447].

Autonomous Driving. In autonomous driving, efficient sequence modeling methods are being applied
across the entire pipeline, from perception to prediction and planning. In the critical Bird’s-Eye View (BEV)
space, these models serve as powerful backbones for tasks like 3D object detection [214] and semantic
occupancy prediction [415], where they efficiently process temporal and spatial information with linear
complexity. Their capabilities extend to other fundamental perception tasks, including multi-modal video
understanding for risk detection [416] and self-supervised depth estimation [448]. Beyond perception, these
models are enhancing downstream modules by replacing computationally intensive attention mechanisms.
For instance, they are used to build highly efficient trajectory forecasting models [417] and to enable
lightweight, end-to-end motion planners that fuse multi-modal sensor data [418]. The extreme efficiency of
these architectures also enables novel in-cabin applications, such as ultra-lightweight models for real-time
driver attention monitoring [419].

Remote Sensing. The field of remote sensing, characterized by very high-resolution (VHR) and hyper-
spectral imagery, has become a fertile ground for efficient sequence modeling methods due to their linear
complexity. For dense prediction tasks like semantic segmentation and change detection, a key innovation
has been the development of specialized scanning mechanisms, such as omnidirectional scans, to effectively
capture the complex spatial layouts of ground features from a global perspective [420, 421, 422, 423]. These
models are also being tailored for specific data modalities, most notably for hyperspectral image classification
and fusion (pansharpening), where they excel at modeling intricate spectral dependencies [424, 425, 426].
Similarly, in restoration tasks like dehazing and super-resolution, hybrid approaches that combine SSMs with
convolutional or frequency-domain modules are proving effective for enhancing both spatial and spectral
quality [427, 428]. The application of these architectures extends beyond pixel-level tasks to higher-level
semantic understanding, such as generating descriptive captions for detected changes between images,
demonstrating their versatility across the remote sensing pipeline [429].

8.2. Audio

The audio domain has widely adopted efficient sequence models as a powerful alternative to self-attention,
leveraging their linear complexity for processing long audio signals. These architectures, particularly Mamba
and its variants, are rapidly establishing new baselines across a range of fundamental audio tasks. They have
been successfully applied to audio tagging [215], classification [430], and specialized tasks like automatic
modulation classification [216]. A key innovation for audio has been the development of bidirectional Mamba
models, which are crucial for capturing the non-causal context inherent in many audio signals and have
proven effective for tasks like deepfake detection and general speech processing [217, 220]. Furthermore,
these models serve as a powerful backbone for self-supervised pre-training, learning robust general-purpose
audio representations that outperform prior Transformer-based methods on a wide array of downstream
tasks [431, 432].
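
The role of bidirectional processing can be illustrated with a toy recurrence. The sketch below is a deliberately simplified stand-in for an SSM layer: it uses a fixed scalar decay rather than Mamba's input-dependent parameters, which is our assumption, and the names `causal_scan` and `bidirectional_scan` are hypothetical. A causal scan only propagates information forward in time, whereas summing a forward and a backward scan lets every position see both past and future samples.

```python
import numpy as np

def causal_scan(x, decay):
    """Causal linear recurrence h_t = decay * h_{t-1} + x_t along the first axis."""
    x = np.asarray(x, dtype=float)
    h = np.zeros(x.shape[1:])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out[t] = h
    return out

def bidirectional_scan(x, decay):
    """Sum of a forward scan and a backward scan (run on the time-reversed input)."""
    x = np.asarray(x, dtype=float)
    forward = causal_scan(x, decay)
    backward = causal_scan(x[::-1], decay)[::-1]
    return forward + backward

# A single impulse in the middle of a short signal.
impulse = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
causal_out = causal_scan(impulse, decay=0.5)         # zero before the impulse
bidir_out = bidirectional_scan(impulse, decay=0.5)   # nonzero on both sides
```

In this simple additive combination the current sample is counted by both directions; practical models typically merge the two passes with learned projections.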
Beyond general audio understanding, these models are setting new performance records in complex
signal processing domains. In speech separation, a prominent trend is to replace the recurrent or attention
modules in state-of-the-art frameworks like DPRNN and TF-GridNet with Mamba blocks. This strategy
has led to new state-of-the-art results on benchmark datasets while significantly reducing computational
complexity [221, 222]. Similarly, in speech enhancement, Mamba-based systems have achieved top performance
on standard benchmarks [433]. Hybrid architectures are also emerging, such as combining Mamba with
Transformers to create highly efficient enhancement models tailored for resource-constrained mobile and
wearable platforms [434].
The inherent recurrent nature of these models makes them exceptionally well-suited for generation
and streaming applications. State Space Models were early pioneers in generating raw audio waveforms
directly, outperforming classic autoregressive models in both quality and speed [218], and they continue to
be relevant in advanced tasks like symbolic music generation [219]. For real-time processing, both Mamba
and RWKV have demonstrated strong capabilities. They are being used to build low-latency streaming
systems for multi-channel speech enhancement [435], voice activity detection (VAD) [223], and automatic
speech recognition (ASR), where RWKV-based transducers match or exceed the performance of traditional
chunk-based models with significantly less memory usage [449].
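
The suitability of recurrent models for streaming follows from the fact that a fixed-size state summarizes the entire past. The sketch below illustrates this with a toy scalar recurrence (our simplification, not the update rule of Mamba or RWKV; `scan_chunk` and `stream` are hypothetical names): processing a signal chunk by chunk while carrying the state forward yields the same output as processing it in one pass, and memory use does not grow with the stream length.

```python
import numpy as np

def scan_chunk(chunk, state, decay):
    """Process one chunk with h_t = decay * h_{t-1} + x_t; return (outputs, final state)."""
    out = np.empty(len(chunk), dtype=float)
    for t, x_t in enumerate(chunk):
        state = decay * state + x_t
        out[t] = state
    return out, state

def stream(signal, chunk_size, decay):
    """Consume `signal` in fixed-size chunks, carrying only a single scalar state across chunks."""
    state = 0.0
    pieces = []
    for start in range(0, len(signal), chunk_size):
        out, state = scan_chunk(signal[start:start + chunk_size], state, decay)
        pieces.append(out)
    return np.concatenate(pieces)

signal = np.linspace(-1.0, 1.0, 10)
offline = stream(signal, chunk_size=len(signal), decay=0.9)  # whole signal at once
online = stream(signal, chunk_size=3, decay=0.9)             # low-latency chunks
```

By contrast, a chunked attention model must either keep a growing cache of past keys and values or truncate its context at chunk boundaries.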

8.3. Multimodality

In multimodality, efficient architectures are critical for aligning diverse data streams and scaling large models.
Linear-time sequence models like Mamba and RWKV have proven effective for this purpose. Mamba, for
instance, is used to build sophisticated alignment and fusion modules. In conversational emotion recognition,
a Mamba-aligner synchronizes text, audio, and video features [224]. For audio-visual segmentation and
question answering, specialized modules like a Temporal Mamba Block and a Cross-Modality Mamba enable
efficient temporal modeling and selective cross-modal attention, respectively [225, 436]. Similarly, the
RWKV architecture is being adapted for robust vision-language learning. Models like VisualRWKV-HD use
techniques such as lossless downsampling to process high-resolution images without increasing sequence
length, while RWKV-CLIP integrates the efficient RWKV into a contrastive learning framework, achieving
strong performance in zero-shot tasks [226, 437].
In the generative space, an emerging trend is the use of diffusion models as an alternative to autoregressive
systems for multimodal tasks. Researchers have developed large-scale diffusion language models like LaViDa
and MMaDA, which are combined with visual encoders to achieve strong performance in multimodal
understanding and generation [203, 204]. A key focus is on visual instruction tuning, where models like
LLaDA-V demonstrate that a pure diffusion framework can be competitive with top-tier autoregressive
models [201]. A notable sub-trend is the development of discrete diffusion models, such as Dimple and
UniDisc. These models map all modalities to discrete token sequences, enabling parallel decoding and novel
capabilities like unified cross-modal editing [202, 366].
To scale these large multimodal models efficiently, sparse MoE has become a dominant paradigm. MoE
architectures are used to build foundational models from scratch, such as LIMoE for contrastive learning
and Uni-MoE for unified systems that handle text, image, audio, and video within a single framework [227,
438, 439]. Sparsity is also a key strategy for efficient fine-tuning. For example, MoE-LLaVA introduces a
“MoE-Tuning” strategy to increase model capacity with constant computational cost [228]. Furthermore,
a mixture of specialized LoRA experts is being used to mitigate task and data conflicts during instruction
tuning, where different experts can focus on specific instruction clusters or data domains [229, 230]. Finally,
the concept of compositional experts, as seen in PaCE, breaks down complex tasks like multimodal dialogue
into sub-skills handled by different expert modules, which are trained progressively [440].
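
The claim that sparse MoE increases capacity at roughly constant per-token compute can be seen in a minimal top-k routing sketch. The code below is a generic NumPy illustration, not the routing scheme of any cited model; the function `moe_forward`, the linear experts, and all shapes are assumptions made for the example. Each token is sent to only k experts, so adding experts grows the parameter count while the per-token expert computation depends on k.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, router_w, expert_ws, k=2):
    """Top-k sparse MoE layer with linear experts.

    x: (tokens, d), router_w: (d, n_experts), expert_ws: (n_experts, d, d).
    Returns the mixed outputs and the chosen expert indices per token.
    """
    logits = x @ router_w                                        # (tokens, n_experts)
    topk = np.argsort(-logits, axis=-1)[:, :k]                   # k highest-scoring experts
    gates = softmax(np.take_along_axis(logits, topk, axis=-1))   # renormalize over the selected k
    out = np.zeros_like(x)
    for e in range(expert_ws.shape[0]):
        token_idx, slot = np.nonzero(topk == e)                  # tokens routed to expert e
        if token_idx.size == 0:
            continue                                             # unused experts cost nothing
        out[token_idx] += gates[token_idx, slot, None] * (x[token_idx] @ expert_ws[e])
    return out, topk

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
router_w = rng.normal(size=(4, 8))
expert_ws = rng.normal(size=(8, 4, 4))
y, chosen = moe_forward(x, router_w, expert_ws, k=2)
```

Production systems add load-balancing losses, capacity limits, and expert-parallel communication, which this sketch omits.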

9. Conclusion and Future Directions


In this survey, we have reviewed the key architectural innovations and optimization strategies developed to
overcome the efficiency bottlenecks of Transformer-based models. We highlighted how the quadratic cost of
self-attention and the growth of FFN layers drive up both computation and memory demands, especially in
long-sequence, multimodal, and multi-step reasoning scenarios. We categorized recent solutions into seven
main areas: linear sequence modeling, sparse sequence modeling, efficient full attention, sparse mixture of
experts, hybrid architectures, diffusion LLMs, and cross-modal applications. For each category, we examined
the core ideas and underlying technical details, summarized representative works, and analyzed their strengths
and limitations. By organizing these approaches systematically, we aim to provide a clear picture of
the current landscape and the common challenges they address.
Looking forward, we identify several promising directions for future exploration:

Efficient Architecture Design. As models continue to grow in scale and are expected to operate across a
wide range of environments, from cloud to edge, there is a pressing need to rethink core design principles.
The following research directions highlight key architectural innovations driving these goals forward:

• Algorithm-System-Hardware Co-Design: Co-designing algorithms, systems, and hardware can
improve efficiency for linear, sparse, or full attention, especially on edge devices and specialized chips.
• Adaptive Attention Mechanisms: Attention modules that dynamically adjust sparsity or computation
based on input or hardware conditions can better balance efficiency and flexibility.
• Enhanced MoE Routing: Smarter MoE routing can improve expert utilization, reduce communication
overhead, and lower latency during inference.
• Efficient Large Models with Far More Parameters: Scaling models to even larger sizes requires innova-
tions in memory layout, sparse activation, and communication-efficient designs.
• Hierarchical Memory Architectures: Multi-tiered memory modules (local, short-term, long-term) integrated
into the model can efficiently store and retrieve past computation results and world knowledge.
• Efficient Small Models on Edge Devices: Designing efficient small-scale LLMs or VLMs for edge deploy-
ment calls for quantization, pruning, and compact architecture design.
• Non-Autoregressive Diffusion LLMs: Diffusion-based LLMs offer parallel generation and faster inference,
with potential to match autoregressive quality in tasks like dialogue and summarization.

Applications of Efficient Architectures. Beyond improving core architectural efficiency, an equally im-
portant frontier lies in applying these advancements to broaden the functional capabilities of language and
multimodal models. As models are increasingly expected to operate in real-time, dynamic, and multimodal
environments, new design priorities emerge, ranging from infinite context handling and agentic behavior
to lifelong learning and multimodal reasoning. The following directions outline practical applications of
efficient design:

• Infinitely Long Context: Efficient models facilitate handling of extremely long or even unbounded contexts,
enhancing RAG, agents, reasoning, and multimodal tasks over extended inputs.
• Efficient Agentic LLMs: Models optimized for efficiency enable real-time tool usage, planning, and
multimodal reasoning, supporting agile agent behaviors with minimal latency in interactive applications.
• Efficient Large Reasoning Models: Efficient reasoning models reduce redundant computation and
leverage lightweight logic or memory components, improving scalability in tasks.
• Efficient Vision-Language-Action (VLA) Models: Efficient multi-modal fusion and rapid visual reasoning
empower VLA models to perform real-time control in robotics and interactive systems.
• Efficient Omni-modal Models: Unified efficient models seamlessly process diverse modalities, including
text, vision, audio, and 3D data.
• Efficient Unified Multimodal Models for Understanding and Generation: Combining multimodal
perception with generation supports more coherent and context-aware outputs in applications.
• Continual Adaptation and Lifelong Learning: Architectures that support on-the-fly adaptation to new
data streams without catastrophic forgetting enable LLMs to evolve continually in changing environments
over the long term.

References
[1] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min,
Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint
arXiv:2303.18223, 1(2), 2023.

[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot
learners. Advances in neural information processing systems, 33:1877–1901, 2020.

[3] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. Journal of machine learning research, 21(140):1–67, 2020.

[4] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin,
Ting Liu, Daxin Jiang, et al. Codebert: A pre-trained model for programming and natural languages.
arXiv preprint arXiv:2002.08155, 2020.

[5] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom
Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with
alphacode. Science, 378(6624):1092–1097, 2022.

[6] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau
Yih, Luke Zettlemoyer, and Mike Lewis. Incoder: A generative model for code infilling and synthesis.
arXiv preprint arXiv:2204.05999, 2022.

[7] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal,
Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation
for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474,
2020.

[8] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented
language model pre-training. In International conference on machine learning, pages 3929–3938.
PMLR, 2020.

[9] Surangika Ranathunga, En-Shiun Annie Lee, Marjana Prifti Skenduli, Ravi Shekhar, Mehreen Alam,
and Rishemjit Kaur. Neural machine translation for low-resource languages: A survey. ACM Computing
Surveys, 55(11):1–37, 2023.

[10] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language under-
standing by generative pre-training, 2018.

[11] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language
models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[12] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman,
Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.
arXiv preprint arXiv:2303.08774, 2023.

[13] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow,
Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276,
2024.

[14] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar,
Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint
arXiv:2412.16720, 2024.

[15] OpenAI. Openai o3 and o4-mini system card. Technical report, OpenAI, April 2025.

[16] OpenAI. gpt-oss-120b & gpt-oss-20b model card. Technical report, OpenAI, August 2025.

[17] OpenAI. Introducing gpt-5, August 2025. Blog post, August 7, 2025.

[18] Anthropic. Introducing claude, March 2023. Blog post, March 14, 2023.

[19] Anthropic. Introducing claude 2.1, November 2023. Blog post, November 21, 2023.

[20] Anthropic. Claude 3.7 sonnet and claude code, February 2025. Blog post, February 25, 2025.

[21] Anthropic. Introducing claude 4, May 2025. Blog post, May 23, 2025.

[22] Anthropic. Claude opus 4.1, August 2025. Blog post, August 6, 2025.

[23] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan
Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable
multimodal models. arXiv preprint arXiv:2312.11805, 2023.

[24] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer,
Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding
across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

[25] Gemini Team, Google. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality,
long context, and next generation agentic capabilities, 2025.

[26] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai
Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism.
arXiv preprint arXiv:2401.02954, 2024.

[27] Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan,
Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts
language model. arXiv preprint arXiv:2405.04434, 2024.

[28] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao,
Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint
arXiv:2412.19437, 2024.

[29] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong
Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement
learning. arXiv preprint arXiv:2501.12948, 2025.

[30] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han,
Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

[31] Qwen Team. Qwen2 technical report. arXiv preprint arXiv:2412.15115, 2024.

[32] Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li,
Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei
Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin
Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia,
Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yi-Chao Zhang, Yunyang Wan, Yuqi Liu, Zeyu
Cui, Zhenru Zhang, Zihan Qiu, Shanghaoran Quan, and Zekun Wang. Qwen2.5 technical report.
ArXiv, abs/2412.15115, 2024.

[33] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao,
Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[34] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée
Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient
foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[35] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and
fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[36] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad
Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of
models. arXiv preprint arXiv:2407.21783, 2024.

[37] AI Meta. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation. [Link]
meta.com/blog/llama-4-multimodal-intelligence/, checked on, 4(7):2025, 2025.

[38] Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas,
Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to
glm-4 all tools. arXiv preprint arXiv:2406.12793, 2024.

[39] Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao
Guo, Da Chen, Dong Li, et al. Minimax-01: Scaling foundation models with lightning attention. arXiv
preprint arXiv:2501.08313, 2025.

[40] InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities,
2023.

[41] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen,
Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024.

[42] Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen
Yang, Jonny Han, Xiaobo Shu, et al. Hunyuan-large: An open-source moe model with 52 billion
activated parameters by tencent. arXiv preprint arXiv:2411.02265, 2024.

[43] Ao Liu, Botong Zhou, Can Xu, Chayse Zhou, ChenChen Zhang, Chengcheng Xu, Chenhao Wang,
Decheng Wu, Dengpeng Wu, Dian Jiao, et al. Hunyuan-turbos: Advancing large language models
through mamba-transformer synergy and adaptive chain-of-thought. arXiv preprint arXiv:2505.15431,
2025.

[44] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin
Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at
any resolution. arXiv preprint arXiv:2409.12191, 2024.

[45] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie
Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

[46] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang
Fan, Kai Dang, et al. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215, 2025.

[47] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong
Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for
generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 24185–24198, 2024.

[48] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi
Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal
models with open-source suites. Science China Information Sciences, 67(12):220101, 2024.

[49] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong
Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal
models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.

[50] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen
Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for
open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.

[51] Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang,
Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025.

[52] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin
Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491,
2025.
[53] ByteDance Seed, Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi
Wang, Xiangpeng Wei, Wenyuan Xu, et al. Seed1.5-thinking: Advancing superb reasoning models
with reinforcement learning. arXiv preprint arXiv:2504.13914, 2025.
[54] Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao
Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention.
arXiv preprint arXiv:2506.13585, 2025.
[55] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao,
Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv
preprint arXiv:2501.12599, 2025.
[56] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru
Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du,
Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao,
Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Hao Hu, Xiaoru Hao,
Tianhong He, Weiran He, Wenyang He, Chao Hong, Yangyang Hu, Zhenxing Hu, Weixiao Huang,
Zhiqi Huang, Zihao Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yongsheng Kang, Guokun Lai, Cheng
Li, Fang Li, Haoyang Li, Ming Li, Wentao Li, Yanhao Li, Yiwei Li, Zhaowei Li, Zheming Li, Hongzhan
Lin, Xiaohan Lin, Zongyu Lin, Chengyin Liu, Chenyu Liu, Hongzhang Liu, Jingyuan Liu, Junqi Liu,
Liang Liu, Shaowei Liu, T. Y. Liu, Tianwei Liu, Weizhou Liu, Yangyang Liu, Yibo Liu, Yiping Liu, Yue
Liu, Zhengying Liu, Enzhe Lu, Lijun Lu, Shengling Ma, Xinyu Ma, Yingwei Ma, Shaoguang Mao,
Jie Mei, Xin Men, Yibo Miao, Siyuan Pan, Yebo Peng, Ruoyu Qin, Bowen Qu, Zeyu Shang, Lidong
Shi, Shengyuan Shi, Feifan Song, Jianlin Su, Zhengyuan Su, Xinjie Sun, Flood Sung, Heyi Tang,
Jiawen Tao, Qifeng Teng, Chensi Wang, Dinglu Wang, Feng Wang, Haiming Wang, Jianzhou Wang,
Jiaxing Wang, Jinhong Wang, Shengjie Wang, Shuyi Wang, Yao Wang, Yejie Wang, Yiqin Wang, Yuxin
Wang, Yuzhi Wang, Zhaoji Wang, Zhengtao Wang, Zhexu Wang, Chu Wei, Qianqian Wei, Wenhao Wu,
Xingzhe Wu, Yuxin Wu, Chenjun Xiao, Xiaotong Xie, Weimin Xiong, Boyu Xu, Jing Xu, Jinjing Xu, L. H.
Xu, Lin Xu, Suting Xu, Weixin Xu, Xinran Xu, Yangchuan Xu, Ziyao Xu, Junjie Yan, Yuzi Yan, Xiaofei
Yang, Ying Yang, Zhen Yang, Zhilin Yang, Zonghan Yang, Haotian Yao, Xingcheng Yao, Wenjie Ye,
Zhuorui Ye, Bohong Yin, Longhui Yu, Enming Yuan, Hongbang Yuan, Mengjie Yuan, Haobing Zhan,
Dehao Zhang, Hao Zhang, Wanlu Zhang, Xiaobin Zhang, Yangkun Zhang, Yizhi Zhang, Yongting
Zhang, Yu Zhang, Yutao Zhang, Yutong Zhang, Zheng Zhang, Haotian Zhao, Yikai Zhao, Huabin
Zheng, Shaojie Zheng, Jianren Zhou, Xinyu Zhou, Zaida Zhou, Zhen Zhu, Weiyu Zhuang, and Xinxing
Zu. Kimi k2: Open agentic intelligence, 2025.
[57] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou,
et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural
information processing systems, 35:24824–24837, 2022.
[58] Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu,
Shuxian Liang, Junxian He, et al. A survey of efficient reasoning for large reasoning models: Language,
multimodality, and beyond. arXiv preprint arXiv:2503.21614, 2025.
[59] Ben Cottier, Robi Rahman, Loredana Fattorini, Nestor Maslej, Tamay Besiroglu, and David Owen. The
rising costs of training frontier ai models. arXiv preprint arXiv:2405.21015, 2024.

[60] Christian Bogmans, Patricia Gomez-Gonzalez, Giovanni Melina, Jorge Miranda-Pinto, Andrea Pescatori,
and Sneha Thube. Power hungry: How ai will drive energy demand. Technical report, International
Monetary Fund, 2025.
[61] Ahmad Faiz, Sotaro Kaneda, Ruhan Wang, Rita Osi, Prateek Sharma, Fan Chen, and Lei Jiang.
Llmcarbon: Modeling the end-to-end carbon footprint of large language models. arXiv preprint
arXiv:2309.14393, 2023.
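[62] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing
Systems, 30, 2017.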
[63] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. Lstm: A
search space odyssey. IEEE transactions on neural networks and learning systems, 28(10):2222–2232,
2016.
[64] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns:
Fast autoregressive transformers with linear attention. In International conference on machine learning,
pages 5156–5165. PMLR, 2020.
[65] Weigao Sun, Disen Lan, Tong Zhu, Xiaoye Qu, and Yu Cheng. Linear-moe: Linear sequence modeling
meets mixture-of-experts. arXiv preprint arXiv:2503.05447, 2025.
[66] Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao
Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.
arXiv preprint arXiv:2501.12326, 2025.
[67] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on
multimodal large language models. National Science Review, 11(12):nwae403, 2024.
[68] Saeed Masoudnia and Reza Ebrahimpour. Mixture of experts: a literature survey. Artificial Intelligence
Review, 42(2):275–293, 2014.
[69] Hao Peng, Jungo Kasai, Nikolaos Pappas, Dani Yogatama, Zhaofeng Wu, Lingpeng Kong, Roy Schwartz,
and Noah A Smith. Abc: Attention with bounded-memory control. arXiv preprint arXiv:2110.02488,
2021.
[70] Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Various lengths,
constant speed: Efficient language modeling with lightning attention. arXiv preprint arXiv:2405.17381,
2024.
[71] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention
transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023.
[72] Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi,
Bailin Wang, Wei Bi, et al. Gated slot attention for efficient linear-time sequence modeling. arXiv
preprint arXiv:2409.07146, 2024.
[73] Zhen Qin, Yuxin Mao, Xuyang Shen, Dong Li, Jing Zhang, Yuchao Dai, and Yiran Zhong. You only scan
once: Efficient multi-dimension sequential modeling with lightnet. arXiv preprint arXiv:2405.21022,
2024.
[74] Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James
Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the
recall-throughput tradeoff. arXiv preprint arXiv:2402.18668, 2024.

[75] Yaroslav Aksenov, Nikita Balagansky, Sofia Maria Lo Cicero Vaina, Boris Shaposhnikov, Alexey Gorbatovski,
and Daniil Gavrilov. Linear transformers with learnable kernel functions are better in-context
models. arXiv preprint arXiv:2402.10644, 2024.
[76] Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers
with the delta rule over sequence length. arXiv preprint arXiv:2406.06484, 2024.
[77] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta
rule. arXiv preprint arXiv:2412.06464, 2024.
[78] Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, and Yu Cheng. Mom: Linear sequence modeling with
mixture-of-memories. arXiv preprint arXiv:2502.13685, 2025.
[79] Zhen Qin, Songlin Yang, and Yiran Zhong. Hierarchically gated recurrent neural network for sequence
modeling. Advances in Neural Information Processing Systems, 36, 2024.
[80] Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. Hgrn2:
Gated linear rnns with state expansion. arXiv preprint arXiv:2404.07904, 2024.
[81] Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi
Cao, Xin Cheng, Michael Chung, Matteo Grella, et al. Rwkv: Reinventing rnns for the transformer era.
arXiv preprint arXiv:2305.13048, 2023.
[82] Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene
Cheah, Xingjian Du, Teddy Ferdinan, Haowen Hou, et al. Eagle and finch: Rwkv with matrix-valued
states and dynamic recurrence. arXiv preprint arXiv:2404.05892, 2024.
[83] Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Haowen Hou, Janna Lu, William Merrill,
Guangyu Song, Kaifeng Tan, Saiteja Utpala, et al. Rwkv-7 "goose" with expressive dynamic state
evolution. arXiv preprint arXiv:2503.14456, 2025.
[84] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu,
and Soham De. Resurrecting recurrent neural networks for long sequences. In International Conference
on Machine Learning, pages 26670–26698. PMLR, 2023.
[85] Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael
Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long
short-term memory. arXiv preprint arXiv:2405.04517, 2024.
[86] Tobias Katsch. Gateloop: Fully data-controlled linear recurrence for sequence modeling. arXiv preprint
arXiv:2311.01927, 2023.
[87] Aaron Voelker, Ivana Kajić, and Chris Eliasmith. Legendre memory units: Continuous-time
representation in recurrent neural networks. Advances in neural information processing systems, 32,
2019.
[88] Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with
optimal polynomial projections. Advances in neural information processing systems, 33:1474–1487,
2020.
[89] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining
recurrent, convolutional, and continuous-time models with linear state space layers. Advances in
neural information processing systems, 34:572–585, 2021.

[90] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state
spaces. arXiv preprint arXiv:2111.00396, 2021.

[91] Albert Gu, Isys Johnson, Aman Timalsina, Atri Rudra, and Christopher Ré. How to train your hippo:
State space models with generalized orthogonal basis projections. arXiv preprint arXiv:2206.12037,
2022.

[92] Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured
state spaces. Advances in Neural Information Processing Systems, 35:22982–22994, 2022.

[93] Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization
of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983,
2022.

[94] Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for
sequence modeling. arXiv preprint arXiv:2208.04933, 2022.

[95] Michael Zhang, Khaled K Saab, Michael Poli, Tri Dao, Karan Goel, and Christopher Ré. Effectively
modeling time series with simple discrete state spaces. arXiv preprint arXiv:2303.09489, 2023.

[96] Jiaxi Hu, Disen Lan, Ziyu Zhou, Qingsong Wen, and Yuxuan Liang. Time-ssm: Simplifying and
unifying state space models for time series forecasting. arXiv preprint arXiv:2405.16312, 2024.

[97] Shida Wang and Qianxiao Li. Stablessm: Alleviating the curse of memory in state-space models
through stable reparameterization. arXiv preprint arXiv:2311.14495, 2023.

[98] Annan Yu, Arnur Nigmetov, Dmitriy Morozov, Michael W Mahoney, and N Benjamin Erichson.
Robustifying state-space models for long sequences via approximate diagonalization. arXiv preprint
arXiv:2310.01698, 2023.

[99] Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela
Rus. Liquid structural state-space models. arXiv preprint arXiv:2209.12951, 2022.

[100] Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, and Qiang Liu. Longhorn: State space
models are amortized online learners. arXiv preprint arXiv:2407.14207, 2024.

[101] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv
preprint arXiv:2312.00752, 2023.

[102] Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through
structured state space duality. arXiv preprint arXiv:2405.21060, 2024.

[103] Jiaxi Hu, Yongqi Pan, Jusen Du, Disen Lan, Xiaqiang Tang, Qingsong Wen, Yuxuan Liang, and Weigao
Sun. Comba: Improving nonlinear rnns with closed-loop control. arXiv preprint arXiv:2506.02475,
2025.

[104] Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen,
Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden
states. arXiv preprint arXiv:2407.04620, 2024.

[105] Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv
preprint arXiv:2501.00663, 2024.

[106] Mahdi Karami and Vahab Mirrokni. Lattice: Learning to efficiently compress the memory. arXiv
preprint arXiv:2504.05646, 2025.

[107] Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. It’s all connected: A journey
through test-time memorization, attentional bias, retention, and online optimization. arXiv preprint
arXiv:2504.13173, 2025.

[108] Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn,
and Vahab Mirrokni. Atlas: Learning to optimally memorize the context at test time. arXiv preprint
arXiv:2505.23735, 2025.

[109] Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian
Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, et al. Mesanet:
Sequence modeling by locally optimal test-time training. arXiv preprint arXiv:2506.05233, 2025.

[110] Zhen Qin, Xuyang Shen, Dong Li, Weigao Sun, Stan Birchfield, Richard Hartley, and Yiran Zhong.
Unlocking the secrets of linear complexity sequence model from a unified perspective. arXiv preprint
arXiv:2405.17383, 2024.

[111] Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao,
Weizhu Chen, and Noah A Smith. Finetuning pretrained transformers into rnns. arXiv preprint
arXiv:2103.13076, 2021.

[112] Junxiong Wang, Daniele Paliotta, Avner May, Alexander Rush, and Tri Dao. The mamba in the
llama: Distilling and accelerating hybrid models. Advances in Neural Information Processing Systems,
37:62432–62457, 2024.

[113] Jean Mercat, Igor Vasiljevic, Sedrick Keh, Kushal Arora, Achal Dave, Adrien Gaidon, and Thomas
Kollar. Linearizing large language models. arXiv preprint arXiv:2405.06640, 2024.

[114] Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik
Ramesh, and Christopher Ré. Lolcats: On low-rank linearizing of large language models. arXiv
preprint arXiv:2410.10254, 2024.

[115] Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, and Yu Cheng. Liger: Linearizing large language models
to gated recurrent structures. arXiv preprint arXiv:2503.01496, 2025.

[116] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse
transformers. arXiv preprint arXiv:1904.10509, 2019.

[117] Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang.
Star-transformer. arXiv preprint arXiv:1902.09113, 2019.

[118] Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, and Jie Tang. Blockwise
self-attention for long document understanding. arXiv preprint arXiv:1911.02972, 2019.

[119] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv
preprint arXiv:2004.05150, 2020.

[120] Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh
Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. Etc: Encoding long and structured inputs in
transformers. arXiv preprint arXiv:2004.08483, 2020.

[121] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago
Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer
sequences. Advances in neural information processing systems, 33:17283–17297, 2020.

[122] Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, and Yinfei
Yang. Longt5: Efficient text-to-text transformer for long sequences. arXiv preprint arXiv:2112.07916,
2021.

[123] Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning
Zheng, and Furu Wei. Longnet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint
arXiv:2307.02486, 2023.

[124] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in
multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019.

[125] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint
arXiv:2001.04451, 2020.

[126] Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse
attention with routing transformers. Transactions of the Association for Computational Linguistics,
9:53–68, 2021.

[127] Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse sinkhorn attention. In
International Conference on Machine Learning, pages 9438–9447. PMLR, 2020.

[128] Yuhuai Wu, Markus N Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers.
arXiv preprint arXiv:2203.08913, 2022.

[129] Amanda Bertsch, Uri Alon, Graham Neubig, and Matthew Gormley. Unlimiformer: Long-range
transformers with unlimited length input. Advances in Neural Information Processing Systems, 36,
2024.

[130] Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie,
YX Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively
trainable sparse attention. arXiv preprint arXiv:2502.11089, 2025.

[131] Piotr Piękos, Róbert Csordás, and Jürgen Schmidhuber. Mixture of sparse attention: Content-based
learnable sparse attention via expert-choice routing. arXiv preprint arXiv:2505.00315, 2025.

[132] Hanrui Wang, Zhekai Zhang, and Song Han. Spatten: Efficient sparse attention architecture with
cascade token and head pruning. In 2021 IEEE International Symposium on High-Performance Computer
Architecture (HPCA), pages 97–110. IEEE, 2021.

[133] Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua
Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Accelerating pre-filling for
long-context llms via dynamic sparse attention. arXiv preprint arXiv:2407.02490, 2024.

[134] Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, and Mao
Yang. Seerattention: Learning intrinsic sparse attention in your llms. arXiv preprint arXiv:2410.13276,
2024.

[135] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language
models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.

[136] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song,
Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative
inference of large language models. Advances in Neural Information Processing Systems, 36, 2024.

[137] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you
what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023.

[138] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest:
Query-aware sparsity for efficient long-context llm inference. arXiv preprint arXiv:2406.10774, 2024.

[139] Yi Lu, Xin Zhou, Wei He, Jun Zhao, Tao Ji, Tao Gui, Qi Zhang, and Xuanjing Huang. Longheads:
Multi-head attention is secretly a long context processor. arXiv preprint arXiv:2402.10685, 2024.

[140] Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin,
Zhijian Liu, Yao Lu, and Song Han. Lserve: Efficient long-sequence llm serving with unified sparse
attention. arXiv preprint arXiv:2502.14866, 2025.

[141] Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. Xattention: Block sparse
attention with antidiagonal scoring. arXiv preprint arXiv:2503.16428, 2025.

[142] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and
memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems,
35:16344–16359, 2022.

[143] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv
preprint arXiv:2307.08691, 2023.

[144] Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He,
Enming Yuan, Yuzhi Wang, et al. Moba: Mixture of block attention for long-context llms. arXiv
preprint arXiv:2502.13189, 2025.

[145] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao.
Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information
Processing Systems, 37:68658–68685, 2024.

[146] Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint
arXiv:1911.02150, 2019.

[147] Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit
Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints.
arXiv preprint arXiv:2305.13245, 2023.

[148] Ted Zadouri, Hubert Strauss, and Tri Dao. Hardware-efficient attention for fast decoding. arXiv
preprint arXiv:2505.21487, 2025.

[149] Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang,
Zixiao Huang, Shiyao Li, Shengen Yan, et al. Moa: Mixture of sparse attention for automatic large
language model compression. arXiv preprint arXiv:2406.14909, 2024.

[150] Xiaoye Qu, Daize Dong, Xuyang Hu, Tong Zhu, Weigao Sun, and Yu Cheng. Llama-moe v2: Exploring
sparsity of llama from perspective of mixture-of-experts with post-training. ArXiv, abs/2411.15708,
2024.

[151] Peng Jin, Bo Zhu, Li Yuan, and Shuicheng Yan. Moh: Multi-head attention as mixture-of-head
attention. arXiv preprint arXiv:2410.11842, 2024.

[152] Jintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, and Jianfei Chen. Sageattention:
Accurate 8-bit attention for plug-and-play inference acceleration. arXiv preprint arXiv:2410.02367,
2024.

[153] Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. Sageattention2
technical report: Accurate 4 bit attention for plug-and-play inference acceleration. arXiv preprint
arXiv:2411.10958, 2024.

[154] Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun
Zhu, and Jianfei Chen. Sageattention3: Microscaling fp4 attention for inference and an exploration
of 8-bit training. arXiv preprint arXiv:2505.11594, 2025.

[155] Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and
Kurt Keutzer. Q-bert: Hessian based ultra low precision quantization of bert. In Proceedings of the
AAAI Conference on Artificial Intelligence, volume 34, pages 8815–8821, 2020.

[156] Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. I-bert: Integer-only
bert quantization. In International conference on machine learning, pages 5506–5518. PMLR, 2021.

[157] Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Yuhan Wu, Lei Su,
and Tong Yang. Int-flashattention: Enabling flash attention for int8 quantization. arXiv preprint
arXiv:2409.16997, 2024.

[158] Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. Q8bert: Quantized 8bit bert. In
2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition
(EMC2-NIPS), pages 36–39. IEEE, 2019.

[159] Gabriele Prato, Ella Charlaix, and Mehdi Rezagholizadeh. Fully quantized transformer for machine
translation. arXiv preprint arXiv:1910.10485, 2019.

[160] Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Ruhle, and Saravan Rajmohan.
Turboattention: Efficient attention approximation for high throughputs llms. arXiv preprint
arXiv:2412.08585, 2024.

[161] Zeyu Zhang, Haiying Shen, Shay Vargaftik, Ran Ben Basat, Michael Mitzenmacher, and Minlan
Yu. Hack: Homomorphic acceleration via compression of the key-value cache for disaggregated llm
inference. arXiv preprint arXiv:2502.03589, 2025.

[162] Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, and Ningyi Xu. Bitdistiller:
Unleashing the potential of sub-4-bit llms via self-distillation. arXiv preprint arXiv:2402.10631, 2024.

[163] Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le,
James Laudon, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information
Processing Systems, 35:7103–7114, 2022.

[164] Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. Base layers:
Simplifying training of large, sparse models. ArXiv, abs/2103.16716, 2021.

[165] Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason Weston. Hash layers for large sparse
models. In Neural Information Processing Systems, 2021.

[166] Quzhe Huang, Zhenwei An, Zhuang Nan, Mingxu Tao, Chen Zhang, Yang Jin, Kun Xu, Liwei Chen,
Songfang Huang, and Yansong Feng. Harder tasks need more experts: Dynamic routing in moe
models. ArXiv, abs/2403.07652, 2024.

[167] Yongxin Guo, Zhenglin Cheng, Xiaoying Tang, and Tao Lin. Dynamic mixture of experts: An
auto-tuning approach for efficient transformer models. ArXiv, abs/2405.14297, 2024.

[168] Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, and Zhijie Deng. Adamoe: Token-adaptive
routing with null experts for mixture-of-experts language models. ArXiv, abs/2406.13233, 2024.

[169] Tongtian Yue, Longteng Guo, Jie Cheng, Xuange Gao, and Jing Liu. Ada-k routing: Boosting the
efficiency of moe-based llms. ArXiv, abs/2410.10456, 2024.

[170] Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing
strategy for mixture-of-experts. ArXiv, abs/2408.15664, 2024.

[171] Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu,
Jingren Zhou, and Junyang Lin. Demons in the detail: On implementing load balancing loss for
training specialized mixture-of-expert models. ArXiv, abs/2501.11873, 2025.

[172] Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng,
Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts
language models. arXiv preprint arXiv:2401.06066, 2024.

[173] Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad
Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference
and training to power next-generation ai scale. ArXiv, abs/2201.05596, 2022.

[174] Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia
Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, et al. Olmoe: Open mixture-of-experts language
models. arXiv preprint arXiv:2409.02060, 2024.

[175] David Raposo, Sam Ritter, Blake Richards, Timothy P. Lillicrap, Peter Humphreys, and Adam Santoro.
Mixture-of-depths: Dynamically allocating compute in transformer-based language models. ArXiv,
abs/2404.02258, 2024.

[176] Simiao Zuo, Qingru Zhang, Chen Liang, Pengcheng He, Tuo Zhao, and Weizhu Chen. Moebert:
from bert to mixture-of-experts via importance-guided adaptation. In North American Chapter of the
Association for Computational Linguistics, 2022.

[177] Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. Moefication:
Transformer feed-forward layers are mixtures of experts. In Findings, 2021.

[178] Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, and Yu Cheng.
LLaMA-MoE: Building mixture-of-experts from LLaMA with continual pre-training. In Yaser Al-Onaizan,
Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in
Natural Language Processing, pages 15913–15923, Miami, Florida, USA, November 2024. Association
for Computational Linguistics.

[179] Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua
Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts
from dense checkpoints. ArXiv, abs/2212.05055, 2022.

[180] Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, and Luke
Zettlemoyer. Branch-train-merge: Embarrassingly parallel training of expert language models. ArXiv,
abs/2208.03306, 2022.

[181] Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob
Kahn, Shang-Wen Li, Wen-tau Yih, Jason E Weston, and Xian Li. Branch-train-mix: Mixing expert llms
into a mixture-of-experts llm. ArXiv, abs/2403.07816, 2024.

[182] Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim,
and Beren Millidge. Zamba: A compact 7b ssm hybrid model. arXiv preprint arXiv:2405.16712, 2024.

[183] Paolo Glorioso, Quentin Anthony, Yury Tokpanov, Anna Golubeva, Vasudev Shyam, James Whittington,
Jonathan Pilault, and Beren Millidge. The zamba2 suite: Technical report. arXiv preprint
arXiv:2411.15242, 2024.

[184] Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid
state space models for efficient unlimited context language modeling. arXiv preprint arXiv:2406.07522,
2024.

[185] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi,
Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba
language model. arXiv preprint arXiv:2403.19887, 2024.

[186] Haowen Hou, Zhiyi Huang, Kaifeng Tan, Rongchang Lu, and Fei Richard Yu. RWKV-X: A Linear
Complexity Hybrid Language Model, May 2025.

[187] Mingyu Yang, Mehdi Rezagholizadeh, Guihong Li, Vikram Appia, and Emad Barsoum. Zebra-llama:
Towards extremely efficient hybrid models. arXiv preprint arXiv:2505.17272, 2025.

[188] Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong
Wang, and Furu Wei. You only cache once: Decoder-decoder architectures for language models.
Advances in Neural Information Processing Systems, 37:7339–7361, 2024.

[189] Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba
Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, et al.
Recurrentgemma: Moving past transformers for efficient open language models. arXiv preprint
arXiv:2404.07839, 2024.

[190] Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli,
William T Freeman, and Hao Tan. Test-time training done right. arXiv preprint arXiv:2505.23884,
2025.

[191] Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar,
Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, et al. Hymba: A hybrid-head
architecture for small language models. arXiv preprint arXiv:2411.13676, 2024.

[192] Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng,
Chengzhong Xu, Di Wang, et al. Transmamba: Flexibly switching between transformer and mamba.
arXiv preprint arXiv:2503.24067, 2025.
[193] Luke McDermott, Robert W Heath Jr, and Rahul Parhi. Lola: Low-rank linear attention with sparse
caching. arXiv preprint arXiv:2505.23666, 2025.
[194] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong
Wen, and Chongxuan Li. Large language diffusion models, 2025.
[195] Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B. Hashimoto. Diffusion-lm
improves controllable text generation, 2022.
[196] Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and Lingpeng Kong. Diffuseq: Sequence to
sequence text generation with diffusion models, 2023.
[197] Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios
of the data distribution, 2024.
[198] Ishaan Gulrajani and Tatsunori B. Hashimoto. Likelihood-based diffusion language models, 2023.
[199] Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar
Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion
language models, 2025.
[200] Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin
Zhao, Wei Bi, Jiawei Han, Hao Peng, and Lingpeng Kong. Scaling diffusion language models via
adaptation from autoregressive models, 2025.
[201] Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li.
Llada-v: Large language diffusion models with visual instruction tuning, 2025.
[202] Alexander Swerdlow, Mihir Prabhudesai, Siddharth Gandhi, Deepak Pathak, and Katerina Fragkiadaki.
Unified multimodal discrete diffusion. arXiv preprint arXiv:2503.20853, 2025.
[203] Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason
Kuen, Zhe Lin, Kai-Wei Chang, and Aditya Grover. Lavida: A large diffusion language model for
multimodal understanding. arXiv preprint arXiv:2505.16839, 2025.
[204] Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada:
Multimodal large diffusion language models, 2025.
[205] Bencheng Liao, Xinggang Wang, Lianghui Zhu, Qian Zhang, and Chang Huang. Vig: Linear-complexity
visual sequence learning with gated linear attention. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 39, pages 5182–5190, 2025.
[206] Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li,
Jifeng Dai, and Wenhai Wang. Vision-rwkv: Efficient and scalable visual perception with rwkv-like
architectures. arXiv preprint arXiv:2403.02308, 2024.
[207] Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin
Jose, Prabhat Ram, et al. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning
and Systems, 5:269–287, 2023.

[208] Qianning Wang, Chenglin Wang, Zhixin Lai, and Yucheng Zhou. Insectmamba: State space model
with adaptive composite features for insect recognition. In ICASSP 2025-2025 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.

[209] Guowen Zhang, Lue Fan, Chenhang He, Zhen Lei, Zhaoxiang Zhang, and Lei Zhang. Voxel
mamba: Group-free state space models for point cloud based 3d object detection. Advances in Neural
Information Processing Systems, 37:81489–81509, 2024.

[210] Shentong Mo and Yapeng Tian. Scaling diffusion mamba with bidirectional ssms for efficient image
and video generation. arXiv preprint arXiv:2405.15881, 2024.

[211] Jun Ma, Feifei Li, and Bo Wang. U-mamba: Enhancing long-range dependency for biomedical image
segmentation. arXiv preprint arXiv:2401.04722, 2024.

[212] Jiacheng Ruan, Jincheng Li, and Suncheng Xiang. Vm-unet: Vision mamba unet for medical image
segmentation. arXiv preprint arXiv:2402.02491, 2024.

[213] Juntao Jiang, Jiangning Zhang, Weixuan Liu, Muxuan Gao, Xiaobin Hu, Xiaoxiao Yan, Feiyue Huang,
and Yong Liu. Rwkv-unet: Improving unet with long-range cooperation for effective medical image
segmentation. arXiv preprint arXiv:2501.08458, 2025.

[214] Zihan You, Ni Wang, Hao Wang, Qichao Zhao, and Jinxiang Wang. Mambabev: An efficient 3d
detection model with mamba2. arXiv preprint arXiv:2410.12673, 2024.

[215] Jiaju Lin and Haoxuan Hu. Audio mamba: Pretrained audio state space model for audio tagging.
arXiv preprint arXiv:2405.13636, 2024.

[216] Yezhuo Zhang, Zinan Zhou, Yichao Cao, Guangyu Li, and Xuanpeng Li. Mamca – optimal on accuracy
and efficiency for automatic modulation classification with extended signal length. arXiv preprint
arXiv:2405.11263, 2024. Also accepted in IEEE Communications Letters (Early Access).

[217] Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng,
Jianhua Tao, Lv Zhao, and Cunhang Fan. Rawbmamba: End-to-end bidirectional state space model
for audio deepfake detection. arXiv preprint arXiv:2406.06086, 2024.

[218] Karan Goel, Albert Gu, Chris Donahue, and Christopher Ré. It’s raw! audio generation with state-space
models. In Proceedings of the 39th International Conference on Machine Learning (ICML 2022), volume
162 of Proceedings of Machine Learning Research, pages 7616–7633, 2022.

[219] Shipei Liu, Xiaoya Fan, and Guowei Wu. Why perturbing symbolic music is necessary: Fitting
the distribution of never-used notes through a joint probabilistic diffusion model. arXiv preprint
arXiv:2408.01950, 2024.

[220] Xiangyu Zhang, Qiquan Zhang, Hexin Liu, Tianyi Xiao, Xinyuan Qian, Beena Ahmed, Eliathamby
Ambikairajah, Haizhou Li, and Julien Epps. Mamba in speech: Towards an alternative to self-attention.
arXiv preprint arXiv:2405.12609, 2024.

[221] Xilin Jiang, Cong Han, and Nima Mesgarani. Dual-path mamba: Short and long-term bidirectional
selective structured state space models for speech separation. arXiv preprint arXiv:2403.18257, 2024.

[222] Kai Li, Chen Guo, and Hu Xiaolin. Spmamba: State-space model is all you need in speech separation.
arXiv preprint arXiv:2404.02063, 2024.

[223] Lingyun Zuo, Keyu An, Shiliang Zhang, and Zhijie Yan. Advancing vad systems based on multi-task
learning with improved model structures. arXiv preprint arXiv:2312.14860, 2023.
[224] Xinran Li, Xiaomao Fan, Qingyang Wu, Xiaojiang Peng, and Ye Li. Mamba-enhanced text-audio-video
alignment network for emotion recognition in conversations. arXiv preprint arXiv:2409.05243, 2024.
[225] Sitong Gong, Yunzhi Zhuge, Lu Zhang, Yifan Wang, Pingping Zhang, Lijun Wang, and Huchuan
Lu. Avs-mamba: Exploring temporal and multi-modal mamba for audio-visual segmentation. arXiv
preprint arXiv:2501.07810, 2025.
[226] Zihang Li and Haowen Hou. Visualrwkv-hd and uhd: Advancing high-resolution processing for visual
language models. arXiv preprint arXiv:2410.11665, 2024.
[227] Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, and Yuxiong He. Scaling
vision-language models with sparse mixture of experts. arXiv preprint arXiv:2303.07226, 2023.
[228] Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Yatian Pang,
Munan Ning, et al. Moe-llava: Mixture of experts for large vision-language models. arXiv preprint
arXiv:2401.15947, 2024.
[229] Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T Kwok,
and Yu Zhang. Mixture of cluster-conditional lora experts for vision-language instruction tuning.
arXiv preprint arXiv:2312.12379, 2023.
[230] Shaoxiang Chen, Zequn Jie, and Lin Ma. Llava-mole: Sparse mixture of lora experts for mitigating
data conflicts in instruction finetuning mllms. arXiv preprint arXiv:2401.16160, 2024.
[231] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM
Computing Surveys, 55(6):1–28, 2022.
[232] Badri Narayana Patro and Vijay Srinivas Agneeswaran. Mamba-360: Survey of state space models as
transformer alternative for long sequence modelling: Methods, applications, and challenges. arXiv
preprint arXiv:2404.16112, 2024.
[233] Matteo Tiezzi, Michele Casoni, Alessandro Betti, Tommaso Guidi, Marco Gori, and Stefano Melacci.
Back to recurrent processing at the crossroad of transformers and state-space models. Nature Machine
Intelligence, pages 1–11, 2025.
[234] Yutao Sun, Zhenyu Li, Yike Zhang, Tengyu Pan, Bowen Dong, Yuyi Guo, and Jianyong Wang. Efficient
attention mechanisms for large language models: A survey. arXiv preprint arXiv:2507.19595, 2025.
[235] Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A Smith, and Lingpeng Kong.
Random feature attention. arXiv preprint arXiv:2103.02143, 2021.
[236] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in neural
information processing systems, 20, 2007.
[237] Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas
Singh. Nyströmformer: A nyström-based algorithm for approximating self-attention. In Proceedings of
the AAAI conference on artificial intelligence, volume 35, pages 14138–14148, 2021.
[238] Yifan Chen, Qi Zeng, Heng Ji, and Yun Yang. Skyformer: Remodel self-attention with gaussian kernel
and nyström method. Advances in Neural Information Processing Systems, 34:2122–2135, 2021.

[239] Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, and Gao Huang. Flatten transformer: Vision
transformer using focused linear attention. In Proceedings of the IEEE/CVF international conference on
computer vision, pages 5961–5971, 2023.

[240] Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré. The hedgehog & the porcupine:
Expressive linear attentions with softmax mimicry. arXiv preprint arXiv:2402.04347, 2024.

[241] Zhen Qin, Xiaodong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, and Yiran Zhong.
The devil in linear transformer. arXiv preprint arXiv:2210.10340, 2022.

[242] Peng Lu, Ivan Kobyzev, Mehdi Rezagholizadeh, Boxing Chen, and Philippe Langlais. Regla: Refining
gated linear attention. arXiv preprint arXiv:2502.01578, 2025.

[243] Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong
Lv, Xiao Luo, Yu Qiao, et al. Transnormerllm: A faster and better large language model with improved
transnormer. arXiv preprint arXiv:2307.14995, 2023.

[244] Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Lightning attention-
2: A free lunch for handling unlimited sequence lengths in large language models. arXiv preprint
arXiv:2401.04658, 2024.

[245] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and
Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint
arXiv:2307.08621, 2023.

[246] Yuhong Chou, Man Yao, Kexin Wang, Yuqi Pan, Rui-Jie Zhu, Jibin Wu, Yiran Zhong, Yu Qiao, Bo Xu,
and Guoqi Li. Metala: Unified optimal linear approximation to softmax attention map. Advances in
Neural Information Processing Systems, 37:71034–71067, 2024.

[247] Snehashish Chakraverty, Deepti Moyi Sahoo, and Nisha Rani Mahato. Hebbian learning rule.
Concepts of Soft Computing: Fuzzy and ANN with Programming, pages 175–182, 2019.

[248] DL Prados and SC Kak. Neural network capacity using delta rule. Electronics Letters, 25(3):197–199,
1989.

[249] Bernard Widrow and Marcian E Hoff. Adaptive switching circuits, 1988.

[250] Johannes Von Oswald, Maximilian Schlegel, Alexander Meulemans, Seijin Kobayashi, Eyvind Niklasson,
Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Max Vladymyrov, et al. Uncovering mesa-
optimization algorithms in transformers. arXiv preprint arXiv:2309.05858, 2023.

[251] Han Guo, Songlin Yang, Tarushii Goel, Eric P Xing, Tri Dao, and Yoon Kim. Log-linear attention. arXiv
preprint arXiv:2506.04761, 2025.

[252] Morris Yau, Sharut Gupta, Valerie Engelmayer, Kazuki Irie, Stefanie Jegelka, and Jacob Andreas.
Sequential-parallel duality in prefix scannable models. arXiv preprint arXiv:2506.10918, 2025.

[253] Jiaxi Hu, Yuehong Hu, Wei Chen, Ming Jin, Shirui Pan, Qingsong Wen, and Yuxuan Liang. Attractor
memory for long-term time series forecasting: A chaos perspective. arXiv preprint arXiv:2402.11463,
2024.

[254] Guy E Blelloch. Prefix sums and their applications, 1990.

[255] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of
gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[256] Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, and
Josh Susskind. An attention free transformer. arXiv preprint arXiv:2105.14103, 2021.

[257] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems, 1960.

[258] William Glasser. Control theory. Harper and Row New York, 1985.

[259] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[260] Daniel Y Fu, Tri Dao, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. Hungry
hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052,
2022.

[261] Thierry Joffrain, Tze Meng Low, Enrique S Quintana-Ortí, Robert van de Geijn, and Field G Van Zee.
Accumulating householder transformations, revisited. ACM Transactions on Mathematical Software
(TOMS), 32(2):169–179, 2006.

[262] Christian Bischof and Charles Van Loan. The wy representation for products of householder matrices.
SIAM Journal on Scientific and Statistical Computing, 8(1):s2–s13, 1987.

[263] Tsendsuren Munkhdalai and Hong Yu. Neural semantic encoders. In Proceedings of the conference.
Association for Computational Linguistics. Meeting, volume 1, page 397, 2017.

[264] Tsendsuren Munkhdalai, Alessandro Sordoni, Tong Wang, and Adam Trischler. Metalearned neural
memory. Advances in Neural Information Processing Systems, 32, 2019.

[265] Kazuki Irie, Imanol Schlag, Róbert Csordás, and Jürgen Schmidhuber. Going beyond linear
transformers with recurrent fast weight programmers. Advances in neural information processing systems,
34:7703–7717, 2021.

[266] Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight
programmers. In International Conference on Machine Learning, pages 9355–9366. PMLR, 2021.

[267] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy
Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024.

[268] Ke Alexander Wang, Jiaxin Shi, and Emily B Fox. Test-time regression: a unifying framework for
designing sequence models with associative memory. arXiv preprint arXiv:2501.12352, 2025.

[269] Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi.
Deltaproduct: Improving state-tracking in linear rnns via householder products. arXiv preprint
arXiv:2502.10297, 2025.

[270] Luca Zancato, Arjun Seshadri, Yonatan Dukler, Aditya Sharad Golatkar, Yantao Shen, Benjamin
Bowman, Matthew Trager, Alessandro Achille, and Stefano Soatto. B’mojo: Hybrid state space
realizations of foundation models with eidetic and fading memory. Advances in Neural Information
Processing Systems, 37:130433–130462, 2024.

[271] Yuzhen Huang, Jinghan Zhang, Zifei Shan, and Junxian He. Compression represents intelligence
linearly. arXiv preprint arXiv:2404.09937, 2024.

[272] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, Fujie Huang, et al. A tutorial on energy-based
learning. Predicting structured data, 1(0), 2006.

[273] Robert J McEliece, Edward C Posner, Eugene R Rodemich, and Santosh S Venkatesh. The capacity
of the hopfield associative memory. IEEE transactions on Information Theory, 33(4):461–482, 1987.

[274] Nabil H Farhat, Demetri Psaltis, Aluizio Prata, and Eung Paek. Optical implementation of the hopfield
model. Applied optics, 24(10):1469–1475, 1985.

[275] Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with
linear complexity. arXiv preprint arXiv:2006.04768, 2020.

[276] Samuel J Gershman, Ila Fiete, and Kazuki Irie. Key-value memory in the brain. arXiv preprint
arXiv:2501.02950, 2025.

[277] Carlo Bruni, Gianni DiPillo, and Giorgio Koch. Bilinear systems: An appealing class of "nearly linear"
systems in theory and applications. IEEE Transactions on automatic control, 19(4):334–348, 1974.

[278] Yingbo Zhao and Jorge Cortés. Gramian-based reachability metrics for bilinear networks. IEEE
Transactions on Control of Network Systems, 4(3):620–631, 2016.

[279] Xinyue Wang, Junxia Ma, and Weili Xiong. Expectation-maximization algorithm for bilinear state-
space models with time-varying delays under non-gaussian noise. International Journal of Adaptive
Control and Signal Processing, 37(10):2706–2724, 2023.

[280] Panos M Pardalos and Vitaliy A Yatsenko. Optimization and control of bilinear systems: theory,
algorithms, and applications, volume 11. Springer Science & Business Media, 2010.

[281] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–
1780, 1997.

[282] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for
statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[283] William Merrill, Jackson Petty, and Ashish Sabharwal. The illusion of state in state-space models.
arXiv preprint arXiv:2404.08819, 2024.

[284] Huanru Henry Mao. Fine-tuning pre-trained transformers into decaying fast weights. arXiv preprint
arXiv:2210.04243, 2022.

[285] Hanting Chen, Zhicheng Liu, Xutao Wang, Yuchuan Tian, and Yunhe Wang. Dijiang: Efficient large
language models through compact kernelization. arXiv preprint arXiv:2403.19928, 2024.

[286] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv
preprint arXiv:1503.02531, 2015.

[287] Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, and Albert Gu. Llamba: Scaling distilled recurrent
models for efficient language processing. arXiv preprint arXiv:2502.14458, 2025.

[288] Aviv Bick, Kevin Li, Eric Xing, J Zico Kolter, and Albert Gu. Transformers to ssms: Distilling quadratic
knowledge to subquadratic models. Advances in Neural Information Processing Systems, 37:31788–
31812, 2024.

[289] Xuan Zhang, Fengzhuo Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, and Min Lin.
Lighttransfer: Your long-context llm is secretly a hybrid model with effortless adaptation. In Workshop on
Reasoning and Planning for Large Language Models, 2025.

[290] Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang,
Jayakumar Subramanian, Ryan A Rossi, Trung Bui, Nikos Vlassis, et al. Lizard: An efficient linearization
framework for large language models. arXiv preprint arXiv:2507.09025, 2025.

[291] Yeonju Ro, Zhenyu Zhang, Souvik Kundu, Zhangyang Wang, and Aditya Akella. On-the-fly adaptive
distillation of transformer to dual-state linear attention, 2025.

[292] Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian
Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning
large language models. arXiv preprint arXiv:2502.17419, 2025.

[293] Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu,
Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought
for reasoning large language models. arXiv preprint arXiv:2503.09567, 2025.

[294] Daniele Paliotta, Junxiong Wang, Matteo Pagliardini, Kevin Y Li, Aviv Bick, J Zico Kolter, Albert Gu,
François Fleuret, and Tri Dao. Thinking slow, fast: Scaling inference compute with distilled reasoners.
arXiv preprint arXiv:2502.20339, 2025.

[295] Junxiong Wang, Wen-Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M Rush, and Tri Dao. M1:
Towards scalable test-time compute with mamba reasoning models. arXiv preprint arXiv:2504.10449,
2025.

[296] Songlin Yang and Yu Zhang. Fla: A triton-based library for hardware-efficient implementations of
linear attention mechanism, 2024.

[297] Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, and Edoardo M Ponti. The
sparse frontier: Sparse attention trade-offs in transformer llms. arXiv preprint arXiv:2504.17768,
2025.

[298] Ankit Gupta and Jonathan Berant. Gmat: Global memory augmentation for transformers. arXiv
preprint arXiv:2006.03274, 2020.

[299] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou,
Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all, March 2024.

[300] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov.
Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint
arXiv:1901.02860, 2019.

[301] Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive
transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.

[302] Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. Adaptive attention span
in transformers. arXiv preprint arXiv:1905.07799, 2019.

[303] Joshua Ainslie, Tao Lei, Michiel de Jong, Santiago Ontañón, Siddhartha Brahma, Yury Zemlyanskiy,
David Uthus, Mandy Guo, James Lee-Thorp, Yi Tay, et al. Colt5: Faster long-range transformers with
conditional computation. arXiv preprint arXiv:2303.09752, 2023.

[304] Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Longlora:
Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307, 2023.

[305] Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun,
Tianzhu Ye, Li Dong, et al. Seerattention-r: Sparse attention adaptation for long reasoning. arXiv
preprint arXiv:2506.08889, 2025.

[306] Matanel Oren, Michael Hassid, Yossi Adi, and Roy Schwartz. Transformers are multi-state rnns. arXiv
preprint arXiv:2401.06104, 2024.

[307] Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong
Zhang, Bailu Ding, Kai Zhang, et al. Retrievalattention: Accelerating long-context llm inference via
vector retrieval. arXiv preprint arXiv:2409.10516, 2024.

[308] Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and
Song Han. Duoattention: Efficient long-context llm inference with retrieval and streaming heads.
arXiv preprint arXiv:2410.10819, 2024.

[309] Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi,
and Beidi Chen. Shadowkv: Kv cache in shadows for high-throughput long-context llm inference.
arXiv preprint arXiv:2410.21465, 2024.

[310] Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, and
Bin Cui. Pqcache: Product quantization-based kvcache for long context llm inference. Proceedings of
the ACM on Management of Data, 3(3):1–30, 2025.

[311] Róbert Csordás, Piotr Piękos, Kazuki Irie, and Jürgen Schmidhuber. Switchhead: Accelerating
transformers with mixture-of-experts attention. Advances in Neural Information Processing Systems,
37:74411–74438, 2024.

[312] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced
transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

[313] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and
Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv
preprint arXiv:1701.06538, 2017.

[314] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim
Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation
and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.

[315] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter
models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.

[316] Guanjie Chen, Xinyu Zhao, Tianlong Chen, and Yu Cheng. Moe-rbench: Towards building reliable
language models with sparse mixture-of-experts. ArXiv, abs/2406.11353, 2024.

[317] Ziyue Li and Tianyi Zhou. Your mixture-of-experts llm is secretly an embedding model for free. arXiv
preprint arXiv:2410.10814, 2024.

[318] Tong Zhu, Daize Dong, Xiaoye Qu, Jiacheng Ruan, Wenliang Chen, and Yu Cheng. Dynamic data
mixing maximizes instruction tuning for mixture-of-experts. arXiv preprint arXiv:2406.11256, 2024.

[319] William Fedus, Jeff Dean, and Barret Zoph. A review of sparse expert models in deep learning. arXiv
preprint arXiv:2209.01667, 2022.

[320] Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun,
Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou,
Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern,
Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, and Claire Cui. GLaM:
Efficient scaling of language models with mixture-of-experts. In Kamalika Chaudhuri, Stefanie Jegelka,
Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International
Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages
5547–5569. PMLR, 17–23 Jul 2022.

[321] Aidan Clark, Diego De Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann,
Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, George Bm Van Den Driessche,
Eliza Rutherford, Tom Hennigan, Matthew J Johnson, Albin Cassirer, Chris Jones, Elena Buchatskaya,
David Budden, Laurent Sifre, Simon Osindero, Oriol Vinyals, Marc’Aurelio Ranzato, Jack Rae, Erich
Elsen, Koray Kavukcuoglu, and Karen Simonyan. Unified scaling laws for routed language models. In
Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors,
Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of
Machine Learning Research, pages 4057–4086. PMLR, 17–23 Jul 2022.

[322] Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. A survey on mixture
of experts in large language models. IEEE Transactions on Knowledge and Data Engineering, 2025.

[323] Siyuan Mu and Sen Lin. A comprehensive survey of mixture-of-experts: Algorithms, theory, and
applications. arXiv preprint arXiv:2503.07137, 2025.

[324] Nikhil Gupta and Jason Yip. Dbrx: Creating an llm from scratch using databricks. In Databricks Data
Intelligence Platform: Unlocking the GenAI Revolution, pages 311–330. Springer, 2024.

[325] Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, and Yang You.
Openmoe: An early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739,
2024.

[326] Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang,
Xiaoyu Zhang, Liang Zeng, et al. Skywork-moe: A deep dive into training techniques for mixture-of-
experts language models. arXiv preprint arXiv:2406.06563, 2024.

[327] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford,
Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of
experts. arXiv preprint arXiv:2401.04088, 2024.

[328] Zhenyi Lu, Chenghao Fan, Wei Wei, Xiaoye Qu, Dangyang Chen, and Yu Cheng. Twin-merging:
Dynamic integration of modular expertise in model merging. Advances in Neural Information Processing
Systems, 37:78905–78935, 2024.

[329] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu,
Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982, 2025.

[330] Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal
Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, and Furu Wei. On the representation collapse of
sparse mixture of experts. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh,
editors, Advances in Neural Information Processing Systems, volume 35, pages 34600–34613. Curran
Associates, Inc., 2022.

[331] Huy Nguyen, Nhat Ho, and Alessandro Rinaldo. Sigmoid gating is more sample efficient than softmax
gating in mixture of experts. arXiv preprint arXiv:2405.13997, 2024.

[332] Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André
Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. In
M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in
Neural Information Processing Systems, volume 34, pages 8583–8595. Curran Associates, Inc., 2021.

[333] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit,
and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv,
abs/2010.11929, 2020.

[334] Ziteng Wang, Jun Zhu, and Jianfei Chen. Remoe: Fully differentiable mixture-of-experts with relu
routing. arXiv preprint arXiv:2412.14711, 2024.

[335] Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Yuxuan Li, Zhiyuan Liu, and
Maosong Sun. Blockffn: Towards end-side acceleration-friendly mixture-of-experts with chunk-level
activation sparsity. arXiv preprint arXiv:2507.08771, 2025.

[336] Peng Jin, Bo Zhu, Li Yuan, and Shuicheng Yan. Moe++: Accelerating mixture-of-experts methods
with zero-computation experts. ArXiv, abs/2410.07348, 2024.

[337] Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam M. Shazeer, and
William Fedus. St-moe: Designing stable and transferable sparse expert models, 2022.

[338] Zihan Wang, Deli Chen, Damai Dai, Runxin Xu, Zhuoshu Li, and Yu Wu. Let the expert stick to his
last: Expert-specialized fine-tuning for sparse architectural large language models. arXiv preprint
arXiv:2407.01906, 2024.

[339] Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michal Krutul, Szymon
Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, and
Sebastian Jaszczur. Scaling laws for fine-grained mixture of experts. ArXiv, abs/2402.07871, 2024.

[340] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan
Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang,
Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He,
Junyang Lin, Kai Dang, Keming Lu, Ke-Yang Chen, Kexin Yang, Mei Li, Min Xue, Na Ni, Pei Zhang,
Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang
Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu
Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yunyang Wan, Yunfei Chu,
Zeyu Cui, Zhenru Zhang, and Zhi-Wei Fan. Qwen2 technical report. ArXiv, abs/2407.10671, 2024.

[341] Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas
Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed Aly, Beidi Chen, and Carole-Jean
Wu. Layerskip: Enabling early exit inference and self-speculative decoding. ArXiv, abs/2404.16710,
2024.

[342] Joan Puigcerver, Carlos Riquelme, Basil Mustafa, and Neil Houlsby. From sparse to soft mixtures of
experts. ArXiv, abs/2308.00951, 2023.

[343] Yikang Shen, Zheyu Zhang, Tianyou Cao, Shawn Tan, Zhenfang Chen, and Chuang Gan.
Moduleformer: Learning modular large language models from uncurated data. ArXiv, abs/2306.04640,
2023.

[344] Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Wei Shen, Limao Xiong, Yuhao Zhou, Xiao Wang,
Zhiheng Xi, Xiaoran Fan, Shiliang Pu, Jiang Zhu, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang.
Loramoe: Alleviating world knowledge forgetting in large language models via moe-style plugin. In
Annual Meeting of the Association for Computational Linguistics, 2024.

[345] Tongxu Luo, Jiahe Lei, Fangyu Lei, Weihao Liu, Shizhu He, Jun Zhao, and Kang Liu. Moelora:
Contrastive learning guided mixture of experts on parameter-efficient fine-tuning for large language
models. ArXiv, abs/2402.12851, 2024.

[346] Xun Wu, Shaohan Huang, and Furu Wei. Mixture of lora experts. ArXiv, abs/2404.13628, 2024.

[347] Chenghao Fan, Zhenyi Lu, Sichen Liu, Chengfeng Gu, Xiaoye Qu, Wei Wei, and Yu Cheng. Make
lora great again: Boosting lora with adaptive singular values and mixture-of-experts optimization
alignment. arXiv preprint arXiv:2502.16894, 2025.

[348] Xu Owen He. Mixture of a million experts. ArXiv, abs/2407.04153, 2024.

[349] Jihai Zhang, Xiaoye Qu, Tong Zhu, and Yu Cheng. Clip-moe: Towards building mixture of experts for
clip with diversified multiplet upcycling. arXiv preprint arXiv:2409.19291, 2024.

[350] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation
for neural network pruning. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pages 11256–11264, 2019.

[351] Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian
Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu,
Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini,
Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, Artem
Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai
Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew
Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael
Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin
Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melissa Hall Melanie Kambadur, Sharan
Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open
foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023.

[352] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha
Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony S. Hartshorn,
Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang,
Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany Biron, Binh Tang,
Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian
Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Cantón Ferrer, Cyrus Nikolaidis,
Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv
Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab A. AlBadawy,
Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriele Synnaeve,
Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Grégoire Mialon, Guanglong Pang, Guillem
Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra,
Isabel M. Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes,
Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya
Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe
Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Ju-Qing Jia, Kalyan Vasuden
Alwala, K. Upasani, Kate Plawiak, Keqian Li, Kenneth Heafield, Kevin R. Stone, Khalid El-Arini,
Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens van der
Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas
Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh,
Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melissa Hall Melanie
Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi,
Nikolay Bashlykov, Nikolay Bogoychev, Niladri S. Chatterji, Olivier Duchenne, Onur Çelebi, Patrick
Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasić, Peter Weng, Prajjwal Bhargava, Pratik Dubal,
Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj
Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar,
Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou,
Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim,
Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Chandra Raparthy, Sheng Shen, Shengye
Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten
Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek
Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal
Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet,
Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers,
Xavier Martinet, Xiaodong Wang, Xiaoqing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle
Goldschlag, Yashesh Gaur, Yasmine Babaei, Yiqian Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning
Mao, Zacharie Delpierre Coudert, Zhengxu Yan, Zhengxing Chen, Zoe Papakipos, Aaditya K. Singh,
Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adi Gangidi, Adolfo Victoria, Ahuva
Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein,
Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew
Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf,
Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau
James, Ben Maurer, Benjamin Leonhardi, Po-Yao (Bernie) Huang, Beth Loyd, Beto de Paola, Bhargavi
Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic,
Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu
Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer,
Damon Civin, Dana Beaty, Daniel Kreymer, Shang-Wen Li, Danny Wyatt, David Adkins, David Xu,
Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le,
Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn,
Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk,
Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank J. Kanayet, Frank Seide,
Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai,
Grant Herman, Grigory G. Sizov, Guangyi Zhang, Guna Lakshminarayanan, Hamid Shojanazeri,
Han Zou, Hannah Wang, Han Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren,
Hunter Goldman, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, James
Geboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny
Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon
Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kaixing(Kai)
Wu, U KamHou, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich,
Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle
Huang, Lailin Chen, Lakshya Garg, A Lavender, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo,
Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria
Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev,
Maxim Naumov, Maya Lathi, Meghan Keneally, Michael L. Seltzer, Michal Valko, Michelle Restrepo,
Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert
Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks,
Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikolay Pavlovich Laptev,
Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli,
Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr
Dollár, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel
Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan,
Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, Sai Jayesh Bondu, Samyak
Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji
Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha,
Shiva Shankar, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala,
Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit
Gupta, Sung-Bae Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez,
Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews,
Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan,
Vinay Satish Kumar, Vishal Mangla, Vlad Ionescu, Vlad Andrei Poenaru, Vlad T. Mihailescu, Vladimir
Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xia Tang, Xiaofang
Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi,
Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu Wang, Yuchen Hao, Yundi Qian, Yuzi
He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei Zhao. The
llama 3 herd of models. ArXiv, abs/2407.21783, 2024.

[353] Zhen Tan, Daize Dong, Xinyu Zhao, Jie Peng, Yu Cheng, and Tianlong Chen. Dlo: Dynamic layer
operation for efficient vertical scaling of llms. ArXiv, abs/2407.11030, 2024.

[354] Chen Zhang, Meizhi Zhong, Qimeng Wang, Xuantao Lu, Zheyu Ye, Chengqiang Lu, Yan Gao, Yao
Hu, Kehai Chen, Min Zhang, and Dawei Song. Modification: Mixture of depths made easy. ArXiv,
abs/2410.14268, 2024.

[355] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot,
Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral
7b. arXiv preprint arXiv:2310.06825, 2023.

[356] Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert
Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated
linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427,
2024.

[357] Liu Xiao, Li Zhiyuan, and Lin Yueyu. WuNeng: Hybrid State with Attention, April 2025.

[358] Runpeng Yu, Qi Li, and Xinchao Wang. Discrete diffusion in large language and multimodal models:
A survey. arXiv preprint arXiv:2506.13759, 2025.

[359] Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T
Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language
models, 2024.

[360] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the
North American chapter of the association for computational linguistics: human language technologies,
volume 1 (long and short papers), pages 4171–4186, 2019.

[361] Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. d1: Scaling reasoning in diffusion
large language models via reinforcement learning, 2025.

[362] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in
neural information processing systems, 36:34892–34916, 2023.

[363] Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen,
Jiawei Gu, Juntao Li, Xiaoye Qu, et al. Openthinkimg: Learning to think with images via visual tool
reinforcement learning. arXiv preprint arXiv:2505.08617, 2025.

[364] Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide
Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods,
and future frontiers. arXiv preprint arXiv:2506.23918, 2025.

[365] Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang,
Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From optimized cold start to staged
reinforcement learning. arXiv preprint arXiv:2506.04207, 2025.

[366] Runpeng Yu, Xinyin Ma, and Xinchao Wang. Dimple: Discrete diffusion multimodal large language
model with parallel decoding, 2025.

[367] Bo Li, Hao Zhang, Kaichen Zhang, Dong Guo, Yuanhan Zhang, Renrui Zhang, Feng Li, Ziwei Liu, and
Chunyuan Li. Llava-next: What else influences visual instruction tuning beyond data?, May 2024.

[368] Yihua Zhang, Ruisi Cai, Tianlong Chen, Guanhua Zhang, Huan Zhang, Pin-Yu Chen, Shiyu Chang,
Zhangyang Wang, and Sijia Liu. Robust mixture-of-expert training for convolutional neural networks.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 90–101, 2023.

[369] Mohammed Nowaz Rabbani Chowdhury, Shuai Zhang, Meng Wang, Sijia Liu, and Pin-Yu Chen. Patch-
level routing in mixture-of-experts is provably sample-efficient for convolutional neural networks. In
International Conference on Machine Learning, pages 6074–6114. PMLR, 2023.

[370] Chi-Sheng Chen, Guan-Ying Chen, Dong Zhou, Di Jiang, and Dai-Shi Chen. Res-vmamba: Fine-grained
food category visual classification using selective state space models with deep residual learning.
arXiv preprint arXiv:2402.15761, 2024.

[371] Zijie Fang, Yifeng Wang, Ye Zhang, Zhi Wang, Jian Zhang, Xiangyang Ji, and Yongbing Zhang. Mammil:
Multiple instance learning for whole slide images with state space models. In 2024 IEEE International
Conference on Bioinformatics and Biomedicine (BIBM), pages 3200–3205. IEEE, 2024.

[372] Jing Yao, Danfeng Hong, Chenyu Li, and Jocelyn Chanussot. Spectralmamba: Efficient mamba for
hyperspectral image classification. arXiv preprint arXiv:2404.08489, 2024.

[373] Qianning Wang, He Hu, and Yucheng Zhou. Memorymamba: Memory-augmented state space model
for defect recognition. arXiv preprint arXiv:2405.03673, 2024.

[374] Zeyu Wang, Chen Li, Huiying Xu, and Xinzhong Zhu. Mamba yolo: Ssms-based yolo for object
detection. arXiv preprint arXiv:2406.05835, 2024.

[375] Tianxiang Chen, Zi Ye, Zhentao Tan, Tao Gong, Yue Wu, Qi Chu, Bin Liu, Nenghai Yu, and Jieping
Ye. Mim-istd: Mamba-in-mamba for efficient infrared small target detection. IEEE Transactions on
Geoscience and Remote Sensing, 2024.

[376] Dunbin Shen, Xuanbing Zhu, Jiacheng Tian, Jianjun Liu, Zhenrong Du, Hongyu Wang, and Xiaorui
Ma. Htd-mamba: Efficient hyperspectral target detection with pyramid state space model. IEEE
Transactions on Geoscience and Remote Sensing, 2025.

[377] Tushar Verma, Jyotsna Singh, Yash Bhartari, Rishi Jarwal, Suraj Singh, and Shubhkarman Singh.
Soar: Advancements in small body object detection for aerial imagery using state space models and
programmable gradients. arXiv preprint arXiv:2405.01699, 2024.

[378] Haobo Yuan, Xiangtai Li, Lu Qi, Tao Zhang, Ming-Hsuan Yang, Shuicheng Yan, and Chen Change Loy.
Mamba or rwkv: Exploring high-quality and high-efficiency segment anything model. arXiv preprint
arXiv:2406.19369, 2024.

[379] Yunxiang Fu, Meng Lou, and Yizhou Yu. Segman: Omni-scale context modeling with state space
models and local attention for semantic segmentation. arXiv preprint arXiv:2412.11890, 2024.

[380] Zhaohui Chen, Elyas Asadi Shamsabadi, Sheng Jiang, Luming Shen, and Daniel Dias-da Costa. Vision
mamba-based autonomous crack segmentation on concrete, asphalt, and masonry surfaces. arXiv
preprint arXiv:2406.16518, 2024.

[381] Libo Wang, Dongxu Li, Sijun Dong, Xiaoliang Meng, Xiaokang Zhang, and Danfeng Hong. Pyramidmamba:
rethinking pyramid feature fusion with selective space state model for semantic segmentation
of remote sensing imagery. arXiv preprint arXiv:2406.10828, 2024.

[382] Zhiwen Yang, Jiayin Li, Hui Zhang, Dan Zhao, Bingzheng Wei, and Yan Xu. Restore-rwkv: Efficient
and effective medical image restoration with rwkv. arXiv preprint arXiv:2407.11087, 2024.

[383] Xiaoyan Lei, Wenlong Zhang, and Weifeng Cao. Dvmsr: Distillated vision mamba for efficient super-
resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 6536–6546, 2024.

[384] Yujie Chen, Haotong Qin, Zhang Zhang, Michele Magno, Luca Benini, and Yawei Li. Q-mambair:
Accurate quantized mamba for efficient image restoration. arXiv preprint arXiv:2503.21970, 2025.

[385] Yuzhen Du, Teng Hu, Jiangning Zhang, Ran Yi, Chengming Xu, Xiaobin Hu, Kai Wu, Donghao Luo,
Yabiao Wang, and Lizhuang Ma. Exploring real&synthetic dataset and linear attention in image
restoration. arXiv preprint arXiv:2412.03814, 2024.

[386] Wei-Tung Lin, Yong-Xiang Lin, Jyun-Wei Chen, and Kai-Lung Hua. Pixmamba: Leveraging state space
models in a dual-level architecture for underwater image enhancement. In Proceedings of the Asian
Conference on Computer Vision, pages 3622–3637, 2024.

[387] Meisheng Guan, Haiyong Xu, Gangyi Jiang, Mei Yu, Yeyao Chen, Ting Luo, and Yang Song.
Watermamba: Visual state space model for underwater image enhancement. arXiv preprint
arXiv:2405.08419, 2024.

[388] Jiesong Bai, Yuhao Yin, Qiyuan He, Yuanxian Li, and Xiaofeng Zhang. Retinexmamba: Retinex-based
mamba for low-light image enhancement. arXiv preprint arXiv:2405.03349, 2024.

[389] Xuanqi Zhang, Haijin Zeng, Jinwang Pan, Qiangqiang Shen, and Yongyong Chen. Llemamba:
Low-light enhancement via relighting-guided mamba with deep unfolding network. arXiv preprint
arXiv:2406.01028, 2024.

[390] Dong Li, Yidi Liu, Xueyang Fu, Senyan Xu, and Zheng-Jun Zha. Fouriermamba: Fourier learning
integration with state space models for image deraining. arXiv preprint arXiv:2405.19450, 2024.

[391] Yuan Shi, Bin Xia, Xiaoyu Jin, Xing Wang, Tianyu Zhao, Xin Xia, Xuefeng Xiao, and Wenming Yang.
Vmambair: Visual state space model for image restoration. IEEE Transactions on Circuits and Systems
for Video Technology, 2025.

[392] Mohammad Shahab Sepehri, Zalan Fabian, and Mahdi Soltanolkotabi. Serpent: Scalable and efficient
image restoration via multi-scale structured state space models. arXiv preprint arXiv:2403.17902,
2024.

[393] Juan Wen, Weiyan Hou, Luc Van Gool, and Radu Timofte. Matir: A hybrid mamba-transformer image
restoration model. arXiv preprint arXiv:2501.18401, 2025.

[394] Rui Deng and Tianpei Gu. Cu-mamba: Selective state space models with channel learning for image
restoration. In 2024 IEEE 7th International Conference on Multimedia Information Processing and
Retrieval (MIPR), pages 328–334. IEEE, 2024.

[395] Yao Lu, Shunzhou Wang, Ziqi Wang, Peiqi Xia, Tianfei Zhou, et al. Lfmamba: light field image
super-resolution with state space model. arXiv preprint arXiv:2406.12463, 2024.

[396] Zhengcong Fei, Mingyuan Fan, Changqian Yu, and Junshi Huang. Scalable diffusion models with
state space backbone. arXiv preprint arXiv:2402.05608, 2024.

[397] Yao Teng, Yue Wu, Han Shi, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, and Xihui Liu. Dim:
Diffusion mamba for efficient high-resolution image synthesis. arXiv preprint arXiv:2405.14224, 2024.

[398] Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes
Fischer, and Björn Ommer. Zigma: A dit-style zigzag mamba diffusion model. In European Conference
on Computer Vision, pages 148–166. Springer, 2024.

[399] Haopeng Li, Jinyue Yang, Kexin Wang, Xuerui Qiu, Yuhong Chou, Xin Li, and Guoqi Li. Scalable
autoregressive image generation with mamba. arXiv preprint arXiv:2408.12245, 2024.

[400] Wenchao Chen, Liqiang Niu, Ziyao Lu, Fandong Meng, and Jie Zhou. Maskmamba: A hybrid mamba-
transformer model for masked image generation. arXiv preprint arXiv:2409.19937, 2024.

[401] Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Youqiang Zhang, and Junshi Huang. Dimba:
Transformer-mamba diffusion models. arXiv preprint arXiv:2406.01159, 2024.

[402] Shentong Mo. Efficient 3d shape generation via diffusion mamba with bidirectional ssms. arXiv
preprint arXiv:2406.05038, 2024.

[403] Qiuhong Shen, Zike Wu, Xuanyu Yi, Pan Zhou, Hanwang Zhang, Shuicheng Yan, and Xinchao Wang.
Gamba: Marry gaussian splatting with mamba for single-view 3d reconstruction. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 2025.

[404] Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, and Junshi Huang. Diffusion-rwkv: Scaling
rwkv-like architectures for diffusion models. arXiv preprint arXiv:2404.04478, 2024.

[405] Shu Yang, Hanzhi Ma, Chengting Yu, Aili Wang, and Er-Ping Li. Sdit: Spiking diffusion model with
transformer. arXiv preprint arXiv:2402.11588, 2024.

[406] Zhaohu Xing, Tian Ye, Yijun Yang, Guang Liu, and Lei Zhu. Segmamba: Long-range sequential
modeling mamba for 3d medical image segmentation. In International Conference on Medical Image
Computing and Computer-Assisted Intervention, pages 578–588. Springer, 2024.

[407] Tianxiang Chen, Xudong Zhou, Zhentao Tan, Yue Wu, Ziyang Wang, Zi Ye, Tao Gong, Qi Chu,
Nenghai Yu, and Le Lu. Zig-rir: Zigzag rwkv-in-rwkv for efficient medical image segmentation. IEEE
Transactions on Medical Imaging, 2025.

[408] Xudong Zhou and Tianxiang Chen. Bsbp-rwkv: Background suppression with boundary preservation
for efficient medical image segmentation. In Proceedings of the 32nd ACM International Conference on
Multimedia, pages 4938–4946, 2024.

[409] Shu Yang, Yihui Wang, and Hao Chen. Mambamil: Enhancing long sequence modeling with sequence
reordering in computational pathology. In International Conference on Medical Image Computing and
Computer-Assisted Intervention, pages 296–306. Springer, 2024.

[410] Jing Zou, Lanqing Liu, Qi Chen, Shujun Wang, Xiaohan Xing, and Jing Qin. Mmr-mamba: Multi-
contrast mri reconstruction with mamba and spatial-frequency information fusion. arXiv e-prints,
pages arXiv–2406, 2024.

[411] Jiahao Huang, Liutao Yang, Fanwen Wang, Yang Nan, Angelica I Aviles-Rivero, Carola-Bibiane
Schönlieb, Daoqiang Zhang, and Guang Yang. Mambamir: An arbitrary-masked mamba for joint
medical image reconstruction and uncertainty estimation. arXiv preprint arXiv:2402.18451, 2024.

[412] Ziyang Wang, Jian-Qing Zheng, Chao Ma, and Tao Guo. Vmambamorph: a visual mamba-based
framework with cross-scan module for deformable 3d image registration. arXiv e-prints, pages
arXiv–2404, 2024.

[413] Omer F Atli, Bilal Kabas, Fuat Arslan, Arda C Demirtas, Mahmut Yurt, Onat Dalmaz, and Tolga Cukur.
I2i-mamba: Multi-modal medical image synthesis via selective state space modeling. arXiv preprint
arXiv:2405.14022, 2024.

[414] Rongchang Lu, Bingcheng Liao, Haowen Hou, Jiahang Lv, and Xin Hai. Delta-wkv: A novel meta-in-
context learner for mri super-resolution. arXiv preprint arXiv:2502.20852, 2025.

[415] Junming Wang, Wei Yin, Xiaoxiao Long, Xingyu Zhang, Zebin Xing, Xiaoyang Guo, and Qian Zhang.
Occrwkv: Rethinking efficient 3d semantic occupancy prediction with linear complexity. arXiv preprint
arXiv:2409.19987, 2024.

[416] Siran Chen, Yuxiao Luo, Yue Ma, Yu Qiao, and Yali Wang. H-mba: Hierarchical mamba adaptation
for multi-modal video understanding in autonomous driving. arXiv preprint arXiv:2501.04302, 2025.

[417] Yizhou Huang, Yihua Cheng, and Kezhi Wang. Trajectory mamba: Efficient attention-mamba
forecasting model based on selective ssm. In Proceedings of the Computer Vision and Pattern Recognition
Conference, pages 12058–12067, 2025.

[418] Chengran Yuan, Zhanqi Zhang, Jiawei Sun, Shuo Sun, Zefan Huang, Christina Dao Wen Lee, Dongen
Li, Yuhang Han, Anthony Wong, Keng Peng Tee, et al. Drama: An efficient end-to-end motion planner
for autonomous driving with mamba. arXiv preprint arXiv:2408.03601, 2024.

[419] Chunyu Zhao, Wentao Mu, Xian Zhou, Wenbo Liu, Fei Yan, and Tao Deng. Salm2: An extremely
lightweight saliency mamba model for real-time cognitive awareness of driver attention. In Proceedings
of the AAAI Conference on Artificial Intelligence, volume 39, pages 1647–1655, 2025.

[420] Sijie Zhao, Hao Chen, Xueliang Zhang, Pengfeng Xiao, Lei Bai, and Wanli Ouyang. Rs-mamba for
large remote sensing image dense prediction. IEEE Transactions on Geoscience and Remote Sensing,
2024.

[421] Qinfeng Zhu, Yuanzhi Cai, Yuan Fang, Yihan Yang, Cheng Chen, Lei Fan, and Anh Nguyen. Samba:
Semantic segmentation of remotely sensed images with state space model. Heliyon, 10(19), 2024.

[422] Hongruixuan Chen, Jian Song, Chengxi Han, Junshi Xia, and Naoto Yokoya. Changemamba: Remote
sensing change detection with spatio-temporal state space model. IEEE Transactions on Geoscience
and Remote Sensing, 2024.

[423] Xianping Ma, Xiaokang Zhang, and Man-On Pun. Rs3mamba: Visual state space model for remote
sensing image semantic segmentation. IEEE Geoscience and Remote Sensing Letters, 2024.

[424] Judy X Yang, Jun Zhou, Jing Wang, Hui Tian, and Alan Wee Chung Liew. Hsimamba: Hyperpsectral
imaging efficient feature learning with bidirectional state space for classification. arXiv preprint
arXiv:2404.00272, 2024.

[425] Xuanhua He, Ke Cao, Jie Zhang, Keyu Yan, Yingying Wang, Rui Li, Chengjun Xie, Danfeng Hong,
and Man Zhou. Pan-mamba: Effective pan-sharpening with state space model. Information Fusion,
115:102779, 2025.

[426] Zihan Cao, Xiao Wu, Liang-Jian Deng, and Yu Zhong. A novel state space model with local enhancement
and state sharing for image fusion. In Proceedings of the 32nd ACM International Conference on
Multimedia, pages 1235–1244, 2024.

[427] Huiling Zhou, Xianhao Wu, Hongming Chen, Xiang Chen, and Xin He. Rsdehamba: lightweight vision
mamba for remote sensing satellite image dehazing. arXiv preprint arXiv:2405.10030, 2024.

[428] Yi Xiao, Qiangqiang Yuan, Kui Jiang, Yuzeng Chen, Qiang Zhang, and Chia-Wen Lin. Frequency-
assisted mamba for remote sensing image super-resolution. IEEE Transactions on Multimedia, 2024.

[429] Chenyang Liu, Keyan Chen, Bowen Chen, Haotian Zhang, Zhengxia Zou, and Zhenwei Shi. Rscama:
Remote sensing image change captioning with state space model. IEEE Geoscience and Remote Sensing
Letters, 2024.

[430] Mehmet Hamza Erol, Arda Senocak, Jiu Feng, and Joon Son Chung. Audio mamba: Bidirectional
state space model for audio representation learning. IEEE Signal Processing Letters, 31:2975–2979,
2024.

[431] Sarthak Yadav and Zheng-Hua Tan. Audio mamba: Selective state spaces for self-supervised audio
representations. arXiv preprint arXiv:2406.02178, 2024.

[432] Siavash Shams, Sukru Samet Dindar, Xilin Jiang, and Nima Mesgarani. Ssamba: Self-supervised audio
representation learning with mamba state space model. arXiv preprint arXiv:2405.11831, 2024.

[433] Rong Chao, Wen-Huang Cheng, Moreno La Quatra, Sabato Marco Siniscalchi, Chao-Han Huck Yang,
Szu-Wei Fu, and Yu Tsao. An investigation of incorporating mamba for speech enhancement. arXiv
preprint arXiv:2405.06573, 2024.

[434] Yueyuan Sui, Minghui Zhao, Junxi Xia, Xiaofan Jiang, and Stephen Xia. Tramba: A hybrid
transformer and mamba architecture for practical audio and bone conduction speech super resolution
and enhancement on mobile and wearable platforms. Proceedings of the ACM on Interactive Mobile,
Wearable and Ubiquitous Technologies, 8(4):1–29, 2024.

[435] Changsheng Quan and Xiaofei Li. Multichannel long-term streaming neural speech enhancement for
static and moving speakers. arXiv preprint arXiv:2403.07675, 2024.

[436] Ziru Huang, Jia Li, Wenjie Zhao, Yunhui Guo, and Yapeng Tian. Av-mamba: Cross-modality selective
state space models for audio-visual question answering. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshop (CVPRW), pages 1–4, 2024.

[437] Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, and Jiankang
Deng. Rwkv-clip: a robust vision-language representation learner. arXiv preprint arXiv:2406.06973,
2024.

[438] Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby. Multimodal
contrastive learning with limoe: the language-image mixture of experts. Advances in Neural Information
Processing Systems, 35:9564–9576, 2022.

[439] Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, and
Min Zhang. Uni-moe: Scaling unified multimodal llms with mixture of experts. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 2025.

[440] Yunshui Li, Binyuan Hui, ZhiChao Yin, Min Yang, Fei Huang, and Yongbin Li. Pace: Unified multi-modal
dialogue pre-training with progressive and compositional experts. arXiv preprint arXiv:2305.14839,
2023.

[441] Yubiao Yue and Zhenzhang Li. Medmamba: Vision mamba for medical image classification. arXiv
preprint arXiv:2403.03849, 2024.

[442] Keyan Chen, Bowen Chen, Chenyang Liu, Wenyuan Li, Zhengxia Zou, and Zhenwei Shi. Rsmamba:
Remote sensing image classification with state space model. IEEE Geoscience and Remote Sensing
Letters, 2024.

[443] Wenhao Dong, Haodong Zhu, Shaohui Lin, Xiaoyan Luo, Yunhang Shen, Xuhui Liu, Juan Zhang,
Guodong Guo, and Baochang Zhang. Fusion-mamba for cross-modality object detection. arXiv preprint
arXiv:2404.09146, 2024.

[444] Ruisheng Gao, Zeyu Xiao, and Zhiwei Xiong. Mamba-based light field super-resolution with efficient
subspace scanning. In Proceedings of the Asian Conference on Computer Vision, pages 531–547, 2024.

[445] Ali Nasiri-Sarvi, Mahdi S Hosseini, and Hassan Rivaz. Vision mamba for classification of breast
ultrasound images. In Deep Breast Workshop on AI and Imaging for Diagnostic and Treatment Challenges
in Breast Care, pages 148–158. Springer, 2024.

[446] Gaoyuan Ji and Pei Liu. Rnn-based multiple instance learning for the classification of histopathology
whole slide images. In International Conference on Medical Imaging and Computer-Aided Diagnosis,
pages 329–339. Springer, 2023.

[447] Yuelin Zhang, Kim Yan, Chun Ping Lam, Chengyu Fang, Wenxuan Xie, Yufu Qiu, Raymond Shing-Yan
Tang, and Shing Shin Cheng. Motion-guided dual-camera tracker for endoscope tracking and motion
analysis in a mechanical gastric simulator. arXiv preprint arXiv:2403.05146, 2024.

[448] Zelin Meng and Takanori Fukao. Enhancing autonomous driving perception with mamba-based
dual-branch depth estimation. IEEE Transactions on Intelligent Transportation Systems, 2025.

[449] Keyu An and Shiliang Zhang. Exploring rwkv for memory efficient and low latency streaming asr.
arXiv preprint arXiv:2309.14758, 2023.

Common questions

Hardware-efficient implementations contribute significantly to the adoption of linear sequence modeling methods by providing computation optimized for modern GPUs, which improves the models' overall efficiency and scalability. Kernels written with tools such as Triton and CUDA let linear sequence models run quickly, which is crucial for large-scale data and tasks. This makes it possible to process sequences faster and with lower hardware demands than traditional architectures, easing deployment in more applications. Such implementations also lower the barrier for researchers by providing accessible and efficient training pipelines, fostering broader experimentation with large language models.

Adaptive top-k routing in large language models aims to balance efficiency and performance by choosing the number of active experts dynamically, based on how complex the input is. The main challenges are load balancing among experts and maintaining model sparsity: with unbalanced routing, some experts receive disproportionately many tokens, which makes training inefficient and leaves part of the model's capacity underused. Recent models address this with techniques such as load-balancing losses that encourage an even distribution of tokens across experts. For example, MoE-Dynamic uses confidence-based selection to allocate more computation to complex inputs while conserving it on simpler ones, and Ada-K adjusts the number of active experts for each token with a learnable allocator.
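As an illustration of the load-balancing losses mentioned above, the following is a minimal sketch of one common form, the auxiliary loss used with top-1 routing in Switch Transformer: alpha * N * sum_i f_i * P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. The text does not say which loss MoE-Dynamic or Ada-K use, so this choice is an assumption, and the function name `load_balance_loss` and its inputs are hypothetical.

```python
def load_balance_loss(router_probs, alpha=1.0):
    """Switch-Transformer-style auxiliary loss for top-1 routing (illustrative).

    router_probs: one list of per-expert probabilities for each token.
    Returns alpha * N * sum_i f_i * P_i, which is smallest (1.0 for alpha=1)
    when both dispatch and router probability are uniform across experts.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    dispatch = [0.0] * num_experts   # f_i: fraction of tokens sent to expert i
    mean_prob = [0.0] * num_experts  # P_i: mean router probability of expert i
    for probs in router_probs:
        top1 = max(range(num_experts), key=lambda e: probs[e])
        dispatch[top1] += 1.0 / num_tokens
        for e in range(num_experts):
            mean_prob[e] += probs[e] / num_tokens
    return alpha * num_experts * sum(f * p for f, p in zip(dispatch, mean_prob))


# Tokens split evenly between two experts versus all tokens on expert 0.
balanced = load_balance_loss([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]])
skewed = load_balance_loss([[0.9, 0.1]] * 4)
```

In this toy case the skewed routing yields a larger loss than the balanced one, so adding the term to the training objective pushes the router toward an even token distribution.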

The Mamba architecture offers several advantages over traditional Transformer models, particularly in efficiency and scalability. Some Mamba-based models are obtained by distilling LLaMA models into more efficient forms, with a focus on scaling reasoning at test time. These linearized models aim to improve reasoning capability while keeping inference efficient. Hybrid architectures that combine Mamba-style linear sequence modeling with Transformer attention reduce computational cost and enable scalable deployment. The efficiency gain comes from state-space models: earlier variants can be expressed as convolutions for fast computation, and scan-based variants use parallel scans such as the Blelloch scan algorithm to speed up the recurrence. The result is a model that needs fewer computational resources while still supporting complex reasoning tasks, and it is reported to compare favorably with traditional Transformers on some reasoning and recall tasks.
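The scan-based recurrence mentioned above can be sketched on a scalar toy problem. A linear recurrence h_t = a_t * h_{t-1} + b_t can be rewritten with an associative combine operator, which is what allows a parallel prefix scan to replace the step-by-step loop. The sketch below is an assumption-laden illustration, not the kernel of any particular model: it uses a simple recursive-doubling (Hillis-Steele) scan simulated sequentially rather than the work-efficient Blelloch scan, and the names `combine`, `sequential_recurrence`, and `scan_recurrence` are hypothetical.

```python
def combine(left, right):
    """Compose two steps of h -> a * h + b; 'left' is the earlier step."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_recurrence(a, b):
    """Reference loop: h_t = a_t * h_{t-1} + b_t with h_0 = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def scan_recurrence(a, b):
    """Inclusive prefix scan by recursive doubling (about log2(n) rounds).

    Every update within a round is independent of the others, so on
    parallel hardware each round could run as a single step.
    """
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        # The comprehension reads only the previous round's values.
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    return [h for _, h in elems]


a = [0.5, 0.5, 0.5, 0.5]
b = [1.0, 2.0, 3.0, 4.0]
loop_states = sequential_recurrence(a, b)
scan_states = scan_recurrence(a, b)
```

Both functions produce the same hidden states; the scan version only needs the combine operator to be associative, which is why input-dependent coefficients a_t and b_t do not prevent parallel evaluation.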

Sparse global memory augmentation techniques such as GMAT improve the efficiency of Transformer models by adding dedicated global tokens that share information across different parts of a sequence. These global memory tokens act as centralized hubs for capturing long-range dependencies, so the model can maintain context across the sequence without excessive computation. The model can therefore handle longer input contexts with greater precision while using less memory, because the sparse memory removes the need to maintain full attention matrices over all tokens. By concentrating computation on the relevant parts of the input, GMAT scales better and is better suited to tasks that involve extended contexts and complex dependencies.

Integrating a sparse global cache with low-rank linear attention in LoLA improves performance on long-context tasks by managing memory efficiently while keeping high-fidelity representations. Low-rank linear attention provides efficient storage for global tokens, which lets the model focus computing resources on the crucial parts of the input. The sparse global cache, in turn, retains important key-value pairs that are prone to interference, so the model preserves contextual knowledge across longer sequences. Together, the two components balance precise local context modeling with efficient long-range token storage, which improves performance on tasks that require comprehension of extensive contexts.

Mixture-of-experts (MoE) architectures in large language models are motivated by the need to balance computational efficiency with model expressiveness. An MoE model allocates resources dynamically by selecting specific experts to process different parts of the input, so that computation is matched to task complexity. Because not every part of the model is activated for every input, this yields significant computational savings. MoE architectures can also improve model performance: fine-grained expert selection enables specialized processing, and adaptive routing strategies tune expert activation to the nature and complexity of the input.

Dynamic expert selection in MoE models such as DynMoE and MoE-Dynamic allocates resources efficiently by adjusting the number of active experts to the complexity of the input. DynMoE treats expert selection as a multi-label classification problem and uses cosine similarities between input tokens and expert representations to determine each expert's relevance. The model then activates only the most relevant experts for each input token. MoE-Dynamic instead introduces top-p adaptive routing, in which experts are selected in descending order of selection probability until a cumulative confidence threshold is met. This distributes computation according to the demands of each input: more complex queries use additional experts, while simpler ones conserve resources.
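The top-p routing rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not MoE-Dynamic's implementation: the function names `softmax` and `top_p_route`, the threshold value `p=0.6`, and the example router logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_route(router_logits, p=0.6):
    """Pick experts in descending probability until their mass reaches p."""
    probs = softmax(router_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen


confident = top_p_route([4.0, 0.0, 0.0, 0.0])  # one expert dominates
uncertain = top_p_route([1.0, 1.0, 1.0, 1.0])  # flat router distribution
```

When the router is confident, a single expert already exceeds the threshold; when the distribution is flat, several experts are needed, so the number of active experts varies per token.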

Static sparse attention improves efficiency by substantially reducing computational complexity relative to full self-attention. Full self-attention scales quadratically with sequence length, which becomes a bottleneck when processing long sequences. Static sparse attention instead restricts each token's attention to a predefined subset of tokens, forming fixed sparsity patterns that reduce computation to near-linear complexity. The model keeps its ability to capture long-range dependencies while processing the sequence more efficiently. Fixed patterns such as global, window, and dilated sparsity keep the cost of attending over the sequence low while maintaining performance across tasks and contexts.
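To make the fixed sparsity patterns concrete, the sketch below builds a bidirectional attention mask that combines a local window with a few global tokens and counts how many query-key pairs remain. It is an illustrative assumption rather than the pattern of any specific model: the function names, the choice of the first `num_global` positions as global tokens, and the parameter values are hypothetical, and dilated patterns are omitted for brevity.

```python
def static_sparse_mask(seq_len, window=1, num_global=1):
    """mask[i][j] is True when token i may attend to token j."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            in_window = abs(i - j) <= window            # local band
            is_global = i < num_global or j < num_global  # global rows/columns
            mask[i][j] = in_window or is_global
    return mask


def allowed_pairs(mask):
    """Number of query-key pairs the mask keeps."""
    return sum(sum(1 for keep in row if keep) for row in mask)


pairs_8 = allowed_pairs(static_sparse_mask(8))
pairs_16 = allowed_pairs(static_sparse_mask(16))
```

With these settings the kept pairs number 5n - 6 for sequence length n (34 of 64 at n = 8, 74 of 256 at n = 16), so the cost grows linearly while full attention grows as n squared.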

Zero-computation experts improve the efficiency of MoE-based models by letting some tokens skip expert computation altogether. The approach introduces heterogeneous load balancing in which only the necessary computation is performed: insignificant tokens are routed to zero-expert slots that perform no operations. Reducing the active workload in this way lets hardware resources concentrate on the more critical computations. As a result, the model can process inputs faster and at lower computational cost in practical applications.

Diffusion large language models differ from traditional autoregressive models in that they generate text by progressively denoising a sequence from a noisy or masked state, rather than producing tokens one at a time. This removes the strict token-by-token dependency of autoregressive decoding, which is a major source of inference latency. Because denoising is iterative, diffusion LLMs are less constrained by sequence length and can be more efficient when generating or revising large bodies of text. Tokens can be refined in parallel rather than incrementally, which may improve scalability and the coherence of the generated text.
