Top LinkedIn Content on GPU Programming Insights

The largest AI Community 14 Million Members | Advisor @ Fortune 500 | Keynote Speaker

1,737,187 followers 1y

🚀 DeepSeek Just Dropped 3 Powerful Open-Source Releases: Here’s Why They Matter

They’re rewriting the rulebook on efficient LLM training and deployment. Today, they open-sourced three incredibly small (yet powerful) repositories, each addressing a key bottleneck in large-scale AI infrastructure.👇

1️⃣ Profiling Data for AI Training Efficiency
On the surface, this might not seem groundbreaking, but this dataset is a goldmine. It provides a real-world breakdown of how DeepSeek keeps GPUs fully utilized during training and inference, so that as few compute cycles as possible are wasted.
✅ Optimized scheduling = faster, cheaper AI training
✅ Helps teams visualize GPU workload distribution (viewable in Chrome tracing tools)
✅ A rare, transparent look into state-of-the-art AI scaling techniques
I wish more open-source teams would release this kind of data, because training efficiency is the #1 challenge at massive scales.

2️⃣ Load Balancing for Mixture of Experts (MoE)
Mixture of Experts (MoE) is a major reason why AI models can scale efficiently, but there’s always been one major problem: some GPUs get overloaded while others sit idle. DeepSeek’s Expert Parallelism Load Balancer (EPLB) solves this by:
✅ Duplicating and redistributing heavily loaded experts across GPUs
✅ Minimizing inter-node traffic, reducing delays
✅ Ensuring balanced workloads, preventing bottlenecks
This is huge! MoE models are notoriously tricky to optimize, and this tool simplifies deployment for anyone working with expert-based architectures. If you’re serious about scaling efficient MoE models, this is an absolute must-try.

3️⃣ The Game-Changer: DualPipe, Zero-Bubble Parallelism 🔥
This is THE most exciting part of today’s release. Pipeline Parallelism (PP) is used to split LLM training across GPUs, but it comes with inefficiencies: idle time (bubbles) between forward and backward passes. DualPipe eliminates these bubbles, achieving a “zero-bubble regime” for the first time ever in large-scale AI training.

💡 Why is this huge?
- Full computation-communication overlap (no wasted cycles)
- Reduces training time and cost significantly
- First-of-its-kind implementation, never reported before in SOTA training
If you work with distributed AI training, this could dramatically improve efficiency and lower costs across the board.

Final Thoughts
DeepSeek is doing open-source right. Instead of just releasing models, they’re sharing the critical tools and techniques that power SOTA AI training.
- GPU efficiency matters: profiling data like this is rare and invaluable.
- Mixture of Experts isn’t magic: it needs proper balancing. EPLB makes it easy.
- Zero-bubble training is a reality. DualPipe might become the new standard!

How do you see AI training evolving? Links in the comments.
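To make the "duplicate and redistribute heavily loaded experts" idea concrete, here is a minimal sketch. It is not DeepSeek's EPLB code: the function names, the greedy heuristics and the sample loads are all assumptions chosen for illustration, and it ignores inter-node traffic and per-GPU slot limits.

```python
import heapq

def replicate_experts(loads, num_slots):
    """Start with one replica per expert, then give each spare slot to the
    expert whose per-replica load is currently the highest (hypothetical heuristic)."""
    replicas = [1] * len(loads)
    for _ in range(num_slots - len(loads)):
        heaviest = max(range(len(loads)), key=lambda e: loads[e] / replicas[e])
        replicas[heaviest] += 1
    return replicas

def place_replicas(loads, replicas, num_gpus):
    """Greedy placement: heaviest replica first, always onto the least-loaded GPU."""
    items = sorted(
        ((loads[e] / replicas[e], e) for e in range(len(loads)) for _ in range(replicas[e])),
        reverse=True,
    )
    heap = [(0.0, gpu) for gpu in range(num_gpus)]  # (current load, gpu id)
    placement = [[] for _ in range(num_gpus)]
    for load, expert in items:
        total, gpu = heapq.heappop(heap)
        placement[gpu].append(expert)
        heapq.heappush(heap, (total + load, gpu))
    gpu_loads = [0.0] * num_gpus
    for total, gpu in heap:
        gpu_loads[gpu] = total
    return placement, gpu_loads

# Made-up token counts: expert 0 is "hot", the others are lightly used.
loads = [90, 10, 10, 10]
replicas = replicate_experts(loads, num_slots=6)
placement, gpu_loads = place_replicas(loads, replicas, num_gpus=2)
print(replicas, placement, gpu_loads)
```

With these made-up loads, the hot expert ends up with three replicas and both GPUs carry the same load, whereas without duplication one GPU would be stuck with the full load of expert 0.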

56 Comments

Paolo Perrone

Shipping Production AI: Agents, Inference, GPU. Read by 1M+ AI engineers.

134,801 followers 9mo

"You're learning CUDA all wrong," the NVIDIA engineer said Then he showed me their internal training path "Wait, you DON'T start with code?" Here's the exact 90-day roadmap they use👇 Phase 1️⃣ Intuition (Week 1-2) Don't touch CUDA yet. Seriously Build your mental model of the hardware and the why first ▶︎ UC Berkeley CS 61C, Lecture 17 This is the physics layer. Understand why GPU differs from a CPU 🔗 https://lnkd.in/gVi6Bsut ▶︎ Coursera Parallel Computing Course (First 3 modules only) Learn parallel algorithms and thinking 🔗 https://lnkd.in/g4FtxbE5 ▶︎ Stanford CS231n Lecture 15 - Hardware/Software interface See how frameworks like PyTorch use hardware for AI 🔗 https://lnkd.in/gzaR7xrZ Phase 2️⃣ CUDA Basics (Week 3-4) Now we code ▶︎ NVIDIA's official CUDA C++ Programming Guide (Chapters 1-5 only) Learn threads, blocks, grids and kernel structure 🔗 https://lnkd.in/gsZsEqPp ▶︎ cuda-samples repo Reading isn't enough. Compile, run, and modify official NVIDIA examples 🔗 https://lnkd.in/gGRgvm7G ```cuda __global__ void vectorAdd(float *a, float *b, float *c) { int i = blockIdx.x * blockDim.x + threadIdx.x; c[i] = a[i] + b[i]; } ``` If this doesn't make sense yet, you skipped Phase 1 Phase 3️⃣ Memory Mastery (Week 5-8) Where 90% of developers fail, and where all performance hides ▶︎ Mark Harris's GTC Talk on Coalesced Memory Access Single most important CUDA performance concept Learn how threads must access global memory in aligned groups 🔗 https://lnkd.in/gz6Nbe5H ▶︎ GPU Gems 3, Chapter 39 - "Parallel Prefix Sum with CUDA" Masterclass in shared memory to avoid bank conflicts, a fundamental optimization 🔗 https://lnkd.in/gNhZRCHE ▶︎ CUDA C++ Best Practices Guide - "Memory Optimizations" Chapter Read to understand Global, Shared, Constant, Texture memory models 🔗 https://lnkd.in/grbhz7_V Phase 4️⃣ Real Kernels (Week 9-12) Stop playing with toy arrays. 
Build something that matters • Implement softmax (harder than you think) • Write a basic GEMM that doesn't suck • Port one PyTorch operation to CUDA Repos that ship: ▶︎ tiny-cuda-nn by NVIDIA Goldmine of highly optimized, real-world kernels for NN 🔗 https://lnkd.in/gGbFzVsb ▶︎ FlashAttention Reading this code teaches more on memory-aware kernel design than any book 🔗 https://lnkd.in/g6sMnBsC ▶︎ Triton Language Examples Modern, Pythonic way to write efficient GPU code, simplifying raw CUDA boilerplate 🔗 github.com/openai/triton ⚡ NVIDIA engineers 6-month shortcut Skip CUDA Learn Triton first (handles 80% of use cases better) Then return to CUDA when hitting limits The difference between you and everyone else? You have the map 90 days from now, you'll be shipping production kernels Not stuck debugging tutorials ♻️ Repost to give someone the shortcut you wish you had
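On why softmax is "harder than you think": the textbook formula overflows for large inputs, so practical implementations subtract the row maximum first. The sketch below shows that in plain Python (not CUDA) and is only an illustration. A GPU kernel has to perform the same max and sum as reductions across threads, which is where most of the difficulty comes from.

```python
import math

def softmax(xs):
    """Numerically stable softmax: shift by the max so exp() never overflows."""
    m = max(xs)                           # reduction 1: row maximum
    exps = [math.exp(x - m) for x in xs]  # every exponent is <= 0 here
    total = sum(exps)                     # reduction 2: normalizer
    return [e / total for e in exps]

# The naive version, math.exp(1000.0) / (...), raises OverflowError in Python.
print(softmax([1000.0, 1000.0]))
print(softmax([1.0, 2.0, 3.0]))
```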

143 Comments

Yangqing Jia

Co-founder & CEO of Lepton AI (now part of NVidia). Hiring top talents.

9,870 followers 2y

People often ask why prices like $2.8/m token for Llama 405B, while being super fast, are still profitable at Lepton AI. We've even been asked by a leading GPU provider! So, I figured we should share some technical analysis. This information could benefit the community. We've taken these statistics and analysis for granted, but they might not be obvious to everyone. 1. Big batches: Each request receives an output of ~30 tokens/second. Batching (grouping multiple requests simultaneously) significantly improves total throughput, often 10x or higher than a single request. GPUs are more efficient with larger batches. 2. Dynamic batching: This technique immediately adds a new request to an existing batch instead of making it wait, ensuring the GPU always works at high capacity. 3. Input tokens: The ~30 tokens/second refers to output tokens. Input tokens are processed much faster (known as "prefilling"). Typically, the input length is many times larger than the output (3x to 10x). This increases the total number of tokens processed, explaining why there is often separate billing for input and output. 4. Quantization: Using 8-bit integers or 8-bit floats instead of 16-bit floats reduces memory usage and speeds up processing because the GPU accesses less memory. Newer GPUs also have hardware instructions for lower bit numbers, increasing speed further. For example, the new Nvidia Blackwell GPU supports 4-bit floats (fp4). Quantization also saves memory, allowing even bigger batches from point 1, making it more economic. 5. Speculative decoding: This method uses a smaller model to predict the next token. For example, predicting "you" after "it is good to see" doesn't require a large model. Smaller models make such predictions faster. The Medusa algorithm by Tianle Cai is a specific example of this approach. 6. Prompt caching: LLMs often encounter repeated prefixes, such as "you are a smart AI agent" in system prompts. 
Caching these prefilled prompts avoids recalculating them, speeding up repeated requests. 7. Optimizing GPU setups: This involves using large GPUs for big models, small GPUs for small models, and matching GPUs to specific tasks—some are better for prefilling, others for decoding. There are many optimization opportunities here. This is not a complete list. We integrate these methods (and a growing number of more) in our runtime to ensure profitability with reasonable traffic. Lepton is created by experts who have developed key AI software over the past decade - Caffe, onnx, pytorch - alongside cloud experts like the creator of etcd and core contributors to Kubernetes. We provide not only LLM APIs, but also a full cloud-native experience to help you find, use, and optimize GPUs on our cloud platform. We love the open-source and open-access community. What AI technical explanation would you like to hear next?
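A back-of-the-envelope sketch of how points 1 and 3 combine. Only the ~30 tokens/second figure, the 3x to 10x input ratio and the $2.8 per million tokens price come from the post. The concurrency of 100, the ratio of 5 and the single price for input and output tokens are assumptions picked for illustration, not Lepton's numbers.

```python
def tokens_and_revenue_per_hour(output_tps_per_request, concurrent_requests,
                                input_to_output_ratio, usd_per_million_tokens):
    """Billable tokens and revenue for one hour of steady traffic on one deployment."""
    output_tokens = output_tps_per_request * concurrent_requests * 3600
    input_tokens = output_tokens * input_to_output_ratio  # prefill tokens
    total_tokens = output_tokens + input_tokens
    revenue = total_tokens / 1_000_000 * usd_per_million_tokens
    return total_tokens, revenue

# One request at a time, counting output tokens only.
single = tokens_and_revenue_per_hour(30, 1, 0, 2.8)
# Assumed: 100 requests batched together, inputs 5x longer than outputs.
batched = tokens_and_revenue_per_hour(30, 100, 5, 2.8)
print(single)   # roughly 108 thousand tokens, about $0.30 per hour
print(batched)  # roughly 64.8 million tokens, about $181 per hour
```

The gap between the two lines is the whole argument: the same hardware bills several hundred times more tokens once requests are batched and input tokens are counted.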

23 Comments

Hao Hoang

I share daily insights on AI agents, LLMs, Data Science, Machine Learning | I help AI engineers crack top-tier interviews | 68K+ community | LLM System Design, RAG, Agents

67,328 followers 1mo

I just trimmed 25% off my Qwen3-14B QLoRA run. Same GPU. Same code. One `pip install -U`. The Unsloth AI team shipped a collab with NVIDIA that fixes three things most training stacks were quietly bleeding time on. No new model. No accuracy hit. No hyperparameter tuning. Here's what each fix is actually doing under the hood: 1️⃣ 𝐂𝐚𝐜𝐡𝐞𝐝 𝐩𝐚𝐜𝐤𝐞𝐝-𝐬𝐞𝐪𝐮𝐞𝐧𝐜𝐞 𝐦𝐞𝐭𝐚𝐝𝐚𝐭𝐚 Every transformer layer was rebuilding the same boundary info (cu_seqlens, max_seqlen, mask structure) and forcing a GPU-CPU sync per layer. Now it's built once per batch, reused L times. +43.3% forward, +14.3% per batch on Qwen3-14B QLoRA SFT. 2️⃣ 𝐃𝐨𝐮𝐛𝐥𝐞-𝐛𝐮𝐟𝐟𝐞𝐫𝐞𝐝 𝐜𝐡𝐞𝐜𝐤𝐩𝐨𝐢𝐧𝐭 𝐫𝐞𝐥𝐨𝐚𝐝𝐬 Activation reloads from pinned CPU were serializing on a single buffer, copy, wait, compute, next copy. Two buffers run copy + compute in parallel. +8.4% on 8B, +6.7% on 14B, +4.6% on 32B. Memory overhead stays under 0.5 GB. 3️⃣ 𝐀𝐫𝐠𝐬𝐨𝐫𝐭 + 𝐛𝐢𝐧𝐜𝐨𝐮𝐧𝐭 𝐌𝐨𝐄 𝐫𝐨𝐮𝐭𝐢𝐧𝐠 The naive torch.where(router_indices == expert_idx) loop was triggering one CPU-GPU sync per expert. One stable sort, one bincount, reuse offsets everywhere. +23% forward on GPT-OSS routing path. The pattern across all three: the math kernels were already fast. The bottleneck was glue code, rebuilding metadata, serializing copies, querying the runtime once per expert. Group once. Cache once. Overlap the rest. Auto-enabled on RTX laptops, B200 data center GPUs, and DGX Spark. Apache 2.0. Zero accuracy loss. If you train models, this is one update away. Link in the comments 👇
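The third fix is easy to picture in miniature. Below is a pure-Python stand-in for the "one stable sort, one bincount, reuse offsets" pattern. It is a sketch of the idea, not Unsloth's code. In PyTorch the same steps would map to `torch.argsort(..., stable=True)` and `torch.bincount`, and the function name and sample routing are made up.

```python
def group_tokens_by_expert(router_indices, num_experts):
    """Group token positions by expert with one sort and one count,
    instead of scanning the routing table once per expert."""
    # Stable argsort: token positions ordered by their expert id.
    order = sorted(range(len(router_indices)), key=lambda t: router_indices[t])
    # Bincount: how many tokens each expert received.
    counts = [0] * num_experts
    for expert in router_indices:
        counts[expert] += 1
    # Prefix sum: expert e owns order[offsets[e]:offsets[e + 1]].
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)
    return order, counts, offsets

order, counts, offsets = group_tokens_by_expert([2, 0, 1, 0, 2, 2], num_experts=3)
for e in range(3):
    print(e, order[offsets[e]:offsets[e + 1]])
```

Every expert's token list is now a slice of one shared array, so nothing has to be queried once per expert.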

12 Comments

Daily Papers

Machine Learning Engineer at Hugging Face

13,286 followers 2mo

Training large language models typically means renting expensive A100s or H100s. But what if you could fine-tune a 32B parameter model on a single RTX 4090 instead? Researchers from Wuhan University and Peking University just released RoundPipe, a new training framework that makes this practical. Pipeline parallelism on consumer GPUs has always struggled with the "weight binding" problem, where uneven model stages create idle bubbles that waste precious VRAM and compute. RoundPipe breaks this constraint by treating GPUs as a pool of stateless workers, dynamically dispatching computation in a round-robin fashion to achieve near-zero pipeline bubbles. The results are striking. On an 8× RTX 4090 server, RoundPipe delivers 1.48–2.16× speedups over existing approaches. It enables full fine-tuning of 32B models—or LoRA fine-tuning of models up to 235B parameters—with sequence lengths exceeding 64K tokens on just 24GB of VRAM. Best of all, it feels like vanilla PyTorch. There is no complex parallel programming to learn, no training loop rewrites required for multi-GPU scaling, and it runs on NVIDIA, AMD, and Ascend hardware alike. Installation is as simple as pip install roundpipe. For researchers and developers working outside of hyperscaler budgets, this significantly lowers the barrier to training production-scale models. Paper: https://lnkd.in/ejc7-RNT Code: https://lnkd.in/eFTzZ4Tw Documentation: https://lnkd.in/eCgcHRWS

7 Comments

Elvis S.

Founder at DAIR.AI | Investor | Prev: Meta AI, Galactica LLM, Elastic, Ph.D. | Serving 7M+ learners around the world

88,427 followers 9mo

Banger paper from Meta and collaborators. This paper is one of the best deep dives yet on how reinforcement learning (RL) actually scales for LLMs. The team ran over 400,000 GPU hours of experiments to find a predictable scaling pattern and a stable recipe (ScaleRL) that consistently works as you scale up compute. Think of it as a practical guide for anyone trying to train reasoning or alignment models with RL. More on why this is a big deal: 1. The big insight: RL progress follows a predictable curve. When you plot model performance vs compute, the growth isn’t random; it follows a sigmoid (S-shaped) curve. The curve has three simple knobs: A = the best performance you’ll ever reach, B = how efficiently you reach it, C_mid = how much compute it takes to hit the halfway point. The amazing part: you can fit this curve using just small runs and accurately predict how a 100k-hour run will behave. So you no longer need to guess; you can forecast where your RL setup will top out before burning compute. 2. The ScaleRL recipe that just works. The authors tested dozens of RL variations and found one that scales cleanly to 100k GPU hours without blowing up: - Pipeline-RL (8 pipelines) with CISPO loss (a stabilized REINFORCE variant). - Prompt-level averaging and batch-level normalization to reduce variance. - FP32 logits for better stability and higher final accuracy. - No-Positive-Resampling curriculum to avoid reward hacking. - Forced interruptions (stopping long thoughts) instead of punishing long completions. - This combo, called ScaleRL, hit the best trade-off between stability, sample efficiency, and asymptotic performance. 3. What actually matters for better RL results. Not every trick helps equally: - Loss choice and precision matter most; CISPO + FP32 logits boosted final pass rates from ~52% to ~61%. - Normalization, aggregation, and curriculum mainly affect how fast you improve (efficiency), not how far you can go. 
- Fancy variants like GRPO, DAPO, or Magistral didn’t beat ScaleRL once scaled properly. 4. Scaling tips that really pay off. If you’re planning a long RL run: - Longer context budgets (up to 32k tokens) help final performance but make early training slower. - Bigger global batch sizes improve stability and final accuracy; small batches tend to stagnate. - Larger or MoE models get better reward ceilings with less total compute than dense ones. - More generations per prompt helps a little, but not as much as people think. 5. How to actually run it safely. - Use a 1k-prompt validation set and monitor your model’s pass rate curve. - Fit the sigmoid early; it’ll tell you if you’re wasting compute. - Watch truncation rates (too many interrupted outputs = unstable training). - Prefer interrupting long completions over penalizing them. - Choose your setup by optimizing for a higher ceiling (A) first, then tune efficiency (B).
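For readers who want to play with the three knobs, here is one sigmoid form that matches the description above (a ceiling A, an efficiency exponent B, and a midpoint C_mid). The starting pass rate `R0` and every number except the ~61% ceiling are assumptions for illustration. Check the paper for the exact parameterization before fitting real runs.

```python
def predicted_pass_rate(compute, A, B, C_mid, R0=0.0):
    """Saturating curve: starts near R0, reaches halfway to A at C_mid, approaches A."""
    return R0 + (A - R0) / (1.0 + (C_mid / compute) ** B)

# Hypothetical fit: ceiling 0.61, starting pass rate 0.20, midpoint at 2,000 GPU hours.
for gpu_hours in (500, 2_000, 8_000, 100_000):
    print(gpu_hours, round(predicted_pass_rate(gpu_hours, A=0.61, B=1.5, C_mid=2_000, R0=0.20), 3))
```

At `compute == C_mid` the curve sits exactly halfway between `R0` and `A`, which is why that parameter is described as the halfway point.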


6 Comments

Nouamane Tazi

ML Research Engineer at Hugging Face 🤗

8,896 followers 8mo

After training 𝐒𝐦𝐨𝐥𝐋𝐌𝟑 on 𝟑𝟖𝟒 𝐇𝟏𝟎𝟎𝐬 for nearly a month, I've come to realize something most people overlook: 𝐢𝐧𝐟𝐫𝐚𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞 𝐢𝐬 𝐭𝐡𝐞 𝐦𝐚𝐤𝐞-𝐨𝐫-𝐛𝐫𝐞𝐚𝐤 𝐟𝐚𝐜𝐭𝐨𝐫 𝐢𝐧 𝐋𝐋𝐌 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠. 🔥 Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious 𝐍𝐂𝐂𝐋 𝐞𝐫𝐫𝐨𝐫𝐬, or when your expensive GPU cluster is running at 𝟔𝟎% 𝐞𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐜𝐲, the problem isn't your model. It's most probably a 𝐦𝐢𝐬𝐮𝐬𝐞 𝐨𝐟 𝐭𝐡𝐞 𝐡𝐚𝐫𝐝𝐰𝐚𝐫𝐞. Questions that seemed simple but had no clear answers: Why is 𝐌𝐨𝐄 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐬𝐥𝐨𝐰𝐞𝐫 𝐭𝐡𝐚𝐧 𝐝𝐞𝐧𝐬𝐞 𝐦𝐨𝐝𝐞𝐥𝐬? Which 𝐍𝐂𝐂𝐋 𝐟𝐥𝐚𝐠𝐬 should we actually set? How often should we checkpoint without killing throughput? That's why we built 𝐓𝐡𝐞 𝐒𝐦𝐨𝐥 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐏𝐥𝐚𝐲𝐛𝐨𝐨𝐤 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the 𝐢𝐧𝐟𝐫𝐚𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞 𝐥𝐚𝐲𝐞𝐫 that most teams get wrong. Here's what surprised us most: 𝘪𝘯𝘵𝘦𝘳𝘤𝘰𝘯𝘯𝘦𝘤𝘵 𝘵𝘰𝘱𝘰𝘭𝘰𝘨𝘺 𝘪𝘴 𝘢𝘭𝘮𝘰𝘴𝘵 𝘢𝘭𝘸𝘢𝘺𝘴 𝘮𝘪𝘴𝘶𝘯𝘥𝘦𝘳𝘴𝘵𝘰𝘰𝘥, and wrong configurations can silently destroy your GPU-to-GPU bandwidth. We spent weeks validating every layer of our AWS p5 system, and the results were eye-opening. 👀 We validated real vs theoretical bandwidth across the entire stack: 𝐇𝐁𝐌𝟑 𝐡𝐢𝐭𝐭𝐢𝐧𝐠 𝟑 𝐓𝐁/𝐬, 𝐍𝐕𝐋𝐢𝐧𝐤 𝟒.𝟎 𝐫𝐞𝐚𝐜𝐡𝐢𝐧𝐠 𝟕𝟖𝟔 𝐆𝐁/𝐬, 𝐏𝐂𝐈𝐞 𝐆𝐞𝐧𝟒 𝐚𝐭 𝟏𝟒.𝟐 𝐆𝐁/𝐬. Then we ran collective operations across 𝟏𝟐𝟖 𝐆𝐏𝐔𝐬 (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from 𝟒𝟖𝟎 𝐆𝐁/𝐬 on a single node to 𝟑𝟐𝟎-𝟑𝟓𝟎 𝐆𝐁/𝐬 across 16 nodes. The good news? Once you understand what's happening, you can fix it. We documented everything: bandwidth measurements, annotated topology diagrams, troubleshooting workflows. And listed the tools you can use: 𝐧𝐯𝐛𝐚𝐧𝐝𝐰𝐢𝐝𝐭𝐡 for measuring communication paths, 𝐍𝐒𝐢𝐠𝐡𝐭 𝐂𝐨𝐦𝐩𝐮𝐭𝐞 for roofline analysis, step-by-step guides for debugging your specific setup. 
Infrastructure shouldn't be this invisible layer that only a handful of experts understand. When you can 𝐦𝐞𝐚𝐬𝐮𝐫𝐞, 𝐯𝐢𝐬𝐮𝐚𝐥𝐢𝐳𝐞, 𝐚𝐧𝐝 𝐝𝐞𝐛𝐮𝐠 𝐢𝐭 𝐩𝐫𝐨𝐩𝐞𝐫𝐥𝐲, suddenly those mysterious slowdowns become solvable problems. 🚀 If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging. 𝐓𝐡𝐞 𝐒𝐦𝐨𝐥 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐏𝐥𝐚𝐲𝐛𝐨𝐨𝐤: https://lnkd.in/e5MKXUHS Shared with ❤️ by the HuggingFace team
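A small sketch for putting the all-reduce numbers in context. The first helper only divides the figures quoted above. The second uses the bus-bandwidth convention from NVIDIA's nccl-tests, where all-reduce algorithm bandwidth is scaled by 2(n-1)/n. Whether the post's figures use that convention is not stated, so treat that part as an assumption, along with the sample message size and timing.

```python
def retained_fraction(multi_node_gbps, single_node_gbps):
    """Share of single-node all-reduce bandwidth that survives at scale."""
    return multi_node_gbps / single_node_gbps

def allreduce_bus_bandwidth(bytes_reduced, seconds, num_ranks):
    """nccl-tests style bus bandwidth for all-reduce, in bytes per second."""
    algorithm_bw = bytes_reduced / seconds
    return algorithm_bw * 2 * (num_ranks - 1) / num_ranks

print(retained_fraction(320, 480), retained_fraction(350, 480))
# Hypothetical measurement: 1 GB reduced in 10 ms across 8 ranks.
print(allreduce_bus_bandwidth(1e9, 0.010, 8) / 1e9, "GB/s")
```

By the quoted figures, going from 1 node to 16 keeps roughly two thirds to three quarters of the single-node all-reduce bandwidth.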

9 Comments

Ivan Nardini

Google Cloud AI/ML DevRel | Vertex AI dude | Research, Open Models, Ray & TPU | Startup Mentor | AI Champion Innovator

29,533 followers 4mo

Scaling LLM training isn't just about throwing more GPUs at the cluster. It's about squeezing every byte of VRAM out of your hardware. CrowdStrike trained specialized cybersecurity models on Vertex AI to counter threat actors using LLMs for social engineering and network operations. Here are the practical considerations that enabled and optimized training at scale: - Data Strategy: Augment datasets through synthetic generation to boost model robustness, especially for low-resource domain-specific languages. - Distributed Computing: Combine Data, Tensor, Pipeline, Context, and Expert parallelism (5D Parallelism) to fit massive models on constrained hardware. - Hardware Optimizations: Match algorithms to silicon. Swapping SDPA for Flash Attention 2 on newer GPUs took training performance from absolute slowest to absolute fastest. - Node Communication: Training on tokenized byte data requires massive context windows. DeepSpeed Ulysses sequence parallelism accelerated node communication by up to 6x. - Peak VRAM Spikes: Model training effectively doubles your VRAM footprint. Gradient checkpointing + DeepSpeed ZeRO 3 dropped peak VRAM requirements by 80% (31GB down to 6GB). For the full training story and architectural breakdown, check the blog linked in the Comments.
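A quick check of the last bullet, plus a rough sketch of why ZeRO stage 3 helps. The 16 bytes per parameter figure is the usual mixed-precision Adam accounting (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 weights and optimizer moments). The 1B-parameter model and the 8 GPUs are hypothetical, and activations, which gradient checkpointing targets, are not modeled.

```python
def reduction_fraction(before_gb, after_gb):
    """Fractional drop in peak memory."""
    return (before_gb - after_gb) / before_gb

def zero3_model_state_gb(num_params, num_gpus, bytes_per_param=16):
    """Per-GPU memory for weights, gradients and Adam state when all three
    are sharded evenly across GPUs (ZeRO stage 3), ignoring activations."""
    return num_params * bytes_per_param / num_gpus / 1e9

print(round(reduction_fraction(31, 6) * 100, 1), "% lower peak VRAM")
print(zero3_model_state_gb(1_000_000_000, 1), "GB on one GPU")
print(zero3_model_state_gb(1_000_000_000, 8), "GB per GPU across 8 GPUs")
```

31 GB down to 6 GB works out to a drop of just over 80%, consistent with the figure in the post.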

4 Comments

Shwetank Kumar

5,059 followers 2y

Supercharge Your Model Training: Essential Techniques and Tricks 🚀

Are you tired of long model training times and inefficient training processes? I have always struggled to understand which techniques can be chained together for cumulative improvement, and how large an improvement to expect from each. Here is an array of powerful techniques to accelerate training, with their effect sizes. The key in most cases is to know the memory architecture of the GPU 💾 and use it optimally by reducing data movement between on-chip registers, cache, and off-chip high-bandwidth memory. Frameworks like PyTorch make this pretty simple: most of these changes take a few lines of code at most.

- Switch to Mixed Precision: 🔢 Implementing bfloat16 can lead to a potential 3x speedup by reducing the amount of data transferred, thus enabling larger batch sizes. Although GPUs may promise up to an 8x improvement, actual gains could be lower due to memory constraints. Benchmarking is essential!
- PyTorch Compile: 🖥️ Experience about a 2.5x speed increase by minimizing unnecessary memory bus traffic. The compiler fuses operations so that intermediate results make fewer round trips to memory.
- Flash Attention: ⚡ Utilize a fused kernel specifically optimized for attention-heavy models, which can boost performance by up to 40% by making better use of the memory hierarchy.
- Optimized Data Formats: 📊 Padding your vocab size to a multiple of a power of 2 (such as 64 or 128) can provide a straightforward 10% speed boost by improving memory access efficiency.
- Hyperparameter Tuning: 🛠️ Gain an additional 5-10% speed by tweaking hyperparameters and employing fused kernels for optimizers like AdamW.
- Bespoke Fused Kernels: 🧩 Push the boundaries with custom kernels designed specifically for your model’s architecture to achieve optimal performance.
- Leverage Additional Optimizations: ➕ Employ vector operations (e.g., AVX-512) on CPUs or use sparse kernels for pruned models to further enhance memory efficiency.
- Scale Responsibly: 📈 Before moving to a multi-GPU setup, ensure you've maximized the potential of single-GPU optimizations to avoid inefficiencies. Once your setup is optimized, scaling across multiple GPUs can dramatically reduce training times by parallelizing the workload and minimizing data transfers. You can do this almost trivially with tools like Hugging Face Accelerate.

Remember, the effectiveness of these techniques can vary based on your specific model, hardware setup, and other variables. Extensive benchmarking is crucial to find the right balance between speed and accuracy. Optimization is a continuous journey. Stay proactive in exploring new methods to reduce training times and remain competitive in the fast-evolving field of machine learning.

For more insights, check out Karpathy’s latest video where he replicates GPT-2 on 8x A100s, astonishingly beating GPT-3 on HellaSwag. It’s incredible to see such advancements, allowing what once took months to be accomplished virtually overnight. 🌙✨
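On the vocabulary-size tip: in practice this usually means rounding the vocab up to a multiple of a power of two rather than to an exact power of two, which would waste far more rows. A minimal sketch follows, using GPT-2's 50,257-token vocabulary as the example. The choice of 128 as the multiple is an assumption for illustration, not something stated in the post.

```python
def pad_to_multiple(n, multiple):
    """Round n up to the nearest multiple (e.g. 64 or 128)."""
    return ((n + multiple - 1) // multiple) * multiple

def next_power_of_two(n):
    """Smallest power of two that is >= n."""
    return 1 << (n - 1).bit_length()

vocab = 50257
print(pad_to_multiple(vocab, 128))  # 50304: only 47 padding rows
print(next_power_of_two(vocab))     # 65536: about 30% more rows
```

The padded rows are never produced by the tokenizer, so they add a little memory without changing the model's behavior on real tokens.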

2 Comments
