flash-attention

Here are 145 public repositories matching this topic...

QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.

natural-language-processing chinese pretrained-models large-language-models llm flash-attention

Updated Mar 5, 2026
Python

xlite-dev / LeetCUDA

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

cuda cuda-kernels cuda-demo cuda-toolkit cuda-library cuda-kernel learn-cuda cuda-cpp hgemm flash-attention leet-cuda cuda-12

Updated May 29, 2026
Cuda

InternLM / InternLM

Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).

chatbot chinese gpt pretrained-models llm long-context rlhf large-language-model flash-attention fine-tuning-llm

Updated Oct 30, 2025
Python

ymcui / Chinese-LLaMA-Alpaca-2

Phase 2 of the Chinese LLaMA-2 & Alpaca-2 LLM project, plus 64K ultra-long-context models

nlp yarn llama alpaca 64k large-language-models llm rlhf flash-attention llama2 llama-2 alpaca-2 alpaca2

Updated Apr 19, 2026
Python

xlite-dev / Awesome-LLM-Inference

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

mla vllm llm-inference awesome-llm flash-attention tensorrt-llm paged-attention deepseek flash-attention-3 deepseek-v3 minimax-01 deepseek-r1 flash-mla qwen3

Updated Apr 20, 2026
Python

MoonshotAI / MoBA

MoBA: Mixture of Block Attention for Long-Context LLMs

pytorch transformer moe llm llm-serving llm-training flash-attention

Updated Apr 3, 2025
Python

NVIDIA / cudnn-frontend

cuDNN Frontend is NVIDIA's modern, open-source entry point to the cuDNN library and a growing collection of high-performance open-source kernels.

Updated Jun 10, 2026
Python

HKUSTDial / flash-sparse-attention

Trainable fast and memory-efficient sparse attention

kernel triton sparse-attention flash-attention flash-sparse-attention

Updated Jun 9, 2026
Python

NVlabs / rcm

rCM & Causal-rCM: Leading and Unified Algorithms/Infrastructures for Bidirectional/Autoregressive Video Diffusion Distillation at Scale

real-time streaming-video diffusion distillation video-generation world-models distribution-matching jacobian-vector-product flash-attention consistency-model wan-video meanflow few-step-generation autoregressive-video-generation

Updated Jun 5, 2026
Python

InternLM / InternEvo

InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.

pytorch multi-modal gemma pipeline-parallelism transformers-models tensor-parallelism llava llm-training internlm flash-attention zero3 llm-framework sequence-parallelism internlm2 ring-attention deepspeed-ulysses llama3 910b

Updated Aug 21, 2025
Python

xlite-dev / ffpa-attn

🤖FFPA: Extends FlashAttention-2 via Split-D for large headdims, 1.5x~3×↑🎉 vs SDPA, up to 430T🎉 on H200.

cuda tensor-cores flash-attention gemma-4 gemma4

Updated Jun 11, 2026
Python

DAMO-NLP-SG / Inf-CLIP

[CVPR 2025 Highlight] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A highly memory-efficient CLIP training scheme.

memory-efficient clip contrastive-learning flash-attention ring-attention infinite-batch-size

Updated Jan 16, 2025
Python

shreyansh26 / FlashAttention-PyTorch

Implementation of FlashAttention (FA1-FA4) in PyTorch for educational and algorithmic clarity

flash-attention flash-attention-2 flash-attention-3 flash-attention-4

Updated Apr 12, 2026
Python

ot-triton-lab / flash-sinkhorn

The official repository of FlashSinkhorn [ICML 2026 Oral]

machine-learning gpu cuda pytorch triton optimal-transport sinkhorn flash-attention entropic-optimal-transport flashsinkhorn

Updated Jun 3, 2026
Python

alexzhang13 / flashattention2-custom-mask

Triton implementation of FlashAttention2 that adds Custom Masks.

deep-learning triton attention cuda-kernels attention-mechanism triton-lang flash-attention flash-attention-2

Updated Aug 14, 2024
Python

psmarter / CUDA-Practice

CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.

parallel-computing cuda high-performance-computing cuda-kernels quantization cutlass gemm performance-optimization nccl gpu-programming roofline-model tensor-core llm-inference flash-attention nsight-compute

Updated May 11, 2026
Cuda

ai-bond / flash-attention-v100

Implementation of FlashAttention-2 for Nvidia Tesla V100

gpu-acceleration tensorcore flash-attention cuda-core

Updated Jun 5, 2026
C++

iacopPBK / llama.cpp-gfx906

llama.cpp-gfx906

kernel rocm amd-gpu llama-cpp flash-attention gfx906 mi50 vega20

Updated Mar 22, 2026
C++

patientx-cfz / comfyui-rocm

Windows-only version of ComfyUI that uses AMD's official ROCm and PyTorch libraries to get better performance on AMD GPUs. Includes auto-installation and popular performance-enhancing packages such as triton, sage-attention, flash-attention, and bitsandbytes.

windows triton rdna rocm miopen bitsandbytes flash-attention rdna3 rdna2 rdna4 sage-attention rdna1

Updated Jun 10, 2026
Python

zengxiao-he / tessera

From teacher to tiles — a from-scratch LLM distillation & serving engine: custom Triton/CUDA kernels, FSDP distillation, paged-KV continuous batching, speculative decoding, a Rust gateway, a JAX oracle, and interpretability tooling.

rust cuda pytorch triton quantization knowledge-distillation inference-engine jax kv-cache ml-systems llm mechanistic-interpretability fsdp flash-attention speculative-decoding paged-attention

Updated Jun 5, 2026
Python
