𝗘𝘅𝗽𝗹𝗮𝗶𝗻 𝗧𝗵𝗶𝘀: 𝗟𝗹𝗮𝗺𝗮 𝟯 𝗡𝗲𝗲𝗱𝘀 𝟮.𝟰𝗧𝗕. 𝗬𝗼𝘂𝗿 𝗚𝗣𝗨 𝗛𝗮𝘀 𝟴𝟬𝗚𝗕. 𝗜𝘁 𝗦𝘁𝗶𝗹𝗹 𝗧𝗿𝗮𝗶𝗻𝘀.

Training Llama-3 405B needs ~2.4TB with BF16 + 8-bit Adam:
• Weights: 810GB
• Gradients: 810GB
• Optimizer: 810GB (vs 3.24TB with standard Adam!)
• Total: ~2.4TB
(Illustrative budget, config-dependent; FP32 masters, ZeRO stage, and offload change totals.)

Your H100? 80GB. You'd need 30+ GPUs just to hold everything.

𝗧𝗵𝗿𝗲𝗲 𝗧𝗿𝗶𝗰𝗸𝘀 𝗧𝗵𝗮𝘁 𝗠𝗮𝗸𝗲 𝗜𝘁 𝗪𝗼𝗿𝗸

𝟭. 𝗗𝗮𝘁𝗮 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹: Split the batch. Problem: each GPU still holds a full 2.4TB replica. Fix: ZeRO shards it across N GPUs.
𝟮. 𝗠𝗼𝗱𝗲𝗹 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹: Split layers across GPUs. Problem: sequential bottleneck. Fix: pipeline micro-batches through the stages.
𝟯. 𝗦𝗲𝗾𝘂𝗲𝗻𝗰𝗲 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹: Split tokens. This is the game changer. 8K tokens → 8 GPUs → 1K each. But attention needs every token to see all others.

𝗧𝗵𝗲 𝗠𝗮𝗴𝗶𝗰 𝗠𝗼𝗺𝗲𝗻𝘁: Instead of moving the 2.4TB model, GPUs only exchange attention keys/values (K,V). Each GPU:
• Computes K,V for its 1K tokens (32MB)
• Sends to others via all-to-all
• Receives 7×32MB = 224MB total
• Computes attention, deletes copies

𝟮𝟮𝟰𝗠𝗕 𝗺𝗼𝘃𝗲𝗱 𝗶𝗻𝘀𝘁𝗲𝗮𝗱 𝗼𝗳 𝟮.𝟰𝗧𝗕. That's roughly 10,000x less.

𝗧𝗵𝗲 𝗥𝗲𝘀𝘂𝗹𝘁: Combine them all (ZeRO + tensor + pipeline + sequence parallel). Each GPU holds ~75GB instead of 2.4TB. Choreography of this kind is widely used to train frontier-scale models, the class that includes ChatGPT and Claude. Without it? 10K token limits. With it? Entire books in one context.

Not magic. Just brilliant engineering making the impossible routine.
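To make the budget above easy to re-run with other settings, here is a minimal sketch of the arithmetic. It assumes 2 bytes per weight and per gradient (BF16) and 2 bytes of optimizer state per parameter (two 8-bit Adam moments); the function names are hypothetical and the numbers ignore activations, FP32 master weights, and fragmentation.

```python
import math


def training_memory_gb(params_billion, weight_bytes=2, grad_bytes=2, optim_bytes=2):
    """Return (weights, grads, optimizer, total) in GB, with 1 GB = 1e9 bytes.

    A billion parameters at N bytes each is N GB, so the math stays in integers.
    """
    weights = params_billion * weight_bytes
    grads = params_billion * grad_bytes
    optim = params_billion * optim_bytes
    return weights, grads, optim, weights + grads + optim


def min_gpus(total_gb, gpu_gb=80):
    """Smallest GPU count whose combined memory can hold total_gb."""
    return math.ceil(total_gb / gpu_gb)


weights, grads, optim, total = training_memory_gb(405)
print(weights, grads, optim, total)   # 810 810 810 2430
print(min_gpus(total))                # 31

# Standard Adam keeps two FP32 moments: 8 bytes per parameter.
print(training_memory_gb(405, optim_bytes=8)[2])  # 3240

# Sequence-parallel exchange from the post: 7 peers x 32 MB each.
received_mb = 7 * 32
print(received_mb, round(total * 1000 / received_mb))  # 224 10848
```

Swapping `optim_bytes` or `gpu_gb` shows how quickly the GPU count moves with the optimizer choice and the card size.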
GPU Programming Insights
-
Most engineers think model cost is about API tokens or inference time. In reality, it’s about how your requests compete for GPU scheduling and how effectively your data stays hot in cache. Here’s the untold truth 👇

1. 𝐄𝐯𝐞𝐫𝐲 𝐦𝐢𝐥𝐥𝐢𝐬𝐞𝐜𝐨𝐧𝐝 𝐨𝐧 𝐚 𝐆𝐏𝐔 𝐢𝐬 𝐚 𝐰𝐚𝐫 𝐟𝐨𝐫 𝐩𝐫𝐢𝐨𝐫𝐢𝐭𝐲.
Your model doesn’t just “run.” It waits its turn. Schedulers (like Kubernetes device plugins, Triton schedulers, or CUDA MPS) decide who gets compute time, and how often. If your jobs are fragmented or unbatched, you’re paying for idle silicon. That’s like renting a Ferrari to sit in traffic.

2. 𝐂𝐚𝐜𝐡𝐢𝐧𝐠 𝐥𝐚𝐲𝐞𝐫𝐬 𝐪𝐮𝐢𝐞𝐭𝐥𝐲 𝐝𝐞𝐜𝐢𝐝𝐞 𝐲𝐨𝐮𝐫 𝐛𝐮𝐫𝐧 𝐫𝐚𝐭𝐞.
Intermediate activations, embeddings, and KV caches live in high-bandwidth memory. If your model keeps reloading them between requests, you’re paying full price every time. That’s why serving infra (like vLLM, DeepSpeed, or FasterTransformer) focuses more on cache reuse than raw FLOPS. The real optimization isn’t in “faster models.” It’s in smarter scheduling and cache locality. Your cost per token can drop by as much as 50% with zero model changes, just better orchestration.

3. 𝐓𝐡𝐞 𝐡𝐢𝐝𝐝𝐞𝐧 𝐭𝐚𝐱: 𝐟𝐫𝐚𝐠𝐦𝐞𝐧𝐭𝐚𝐭𝐢𝐨𝐧 𝐚𝐧𝐝 𝐞𝐯𝐢𝐜𝐭𝐢𝐨𝐧.
When too many models share the same GPU cluster, the scheduler starts slicing compute and evicting caches. This leads to context thrashing, where memory swaps cost more than inference. At scale, this kills both performance and margins.

So if you’re wondering why your inference bill doubled while latency stayed the same, don’t blame the model. Blame the infrastructure design. The real bottleneck isn’t model size; it’s architectural awareness. Understanding schedulers, memory hierarchies, and caching strategies is what separates AI engineers from AI architects.

And that’s exactly what we go deep into inside the Advanced System Design Cohort, a 3-month, high-intensity program for Senior, Staff, and Principal Engineers who want to master the systems that power modern AI infra. You’ll learn to think beyond API calls: how compute, caching, and scheduling interact to define scale and cost.

If you’re ready to learn the architectures behind real AI systems, there’s a form in the comments. Apply, and we’ll check if you’re a great fit. We’re selective, because this is where future technical leaders are being built.
-
RAGDOLL: Redefining Efficient RAG Serving on a Single GPU

Excited to share insights from recent work on RAGDOLL, a resource-efficient, self-adaptive Retrieval-Augmented Generation (RAG) serving system designed for single-GPU, memory-constrained environments. Developed by a leading university research group, RAGDOLL addresses the core challenge of deploying high-quality RAG pipelines on consumer-grade hardware, where both large language models (LLMs) and expansive knowledge bases compete for limited memory resources.

How RAGDOLL Works Under the Hood
- Decoupled Pipelines: RAGDOLL separates the retrieval (CPU-bound) and generation (GPU-bound) stages into parallel pipelines. This design enables both stages to run concurrently, significantly reducing idle times and boosting device utilization compared to traditional serial RAG workflows.
- Joint Memory Placement: The system introduces a unified memory management strategy across GPU, CPU, and disk. By dynamically placing LLM tensors, KV caches, and database partitions where they fit best, RAGDOLL avoids memory thrashing and makes use of all available storage tiers.
- Dynamic Batch Scheduling: Unlike static batch schedulers, RAGDOLL adapts batch sizes and resource allocations in real time, based on incoming workload and device utilization. This backlog-aware scheduling minimizes both external (waiting) and internal (device idle) latency, especially under fluctuating request rates.
- Advanced Prefetching: RAGDOLL uses a thread-based, asynchronous prefetching mechanism for LLM inference. By continuously queuing up future layers and managing data transfers with multiple CUDA streams, it overlaps computation with communication, reducing bottlenecks from memory bandwidth and compute jitter.
- Adaptive Configuration via Profiling: Before deployment, RAGDOLL profiles the hardware and explores the configuration space to balance retrieval and generation latency. During operation, it dynamically tunes parameters like batch size and memory allocation, responding to real-time system feedback.

Technical Impact
- Achieves up to 3.6x speedup in average latency compared to leading serial RAG systems like vLLM, even when serving large models (8B-70B) with only 12-24GB GPU and 176-256GB RAM.
- Reduces waiting and generation times by up to 20x and 5x, respectively, through its multi-pipeline and memory placement innovations.
- Demonstrates robust adaptability across diverse workloads and hardware setups, making advanced RAG applications feasible on widely accessible consumer hardware.

RAGDOLL marks a significant step forward in democratizing advanced LLM-based applications, bringing enterprise-grade RAG capabilities to resource-limited environments. If you're working on LLM serving or retrieval-augmented systems, this architecture is worth a deep dive.
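The decoupled-pipeline idea can be sketched in a few lines. This is not RAGDOLL's code: `retrieve`, `generate`, and `serve` are hypothetical stand-ins, and the bounded queue is only a rough analogue of a backlog between the CPU-side retrieval stage and the GPU-side generation stage.

```python
import queue
import threading


def retrieve(query):
    """Hypothetical stand-in for the CPU-bound retrieval stage."""
    return f"docs for {query!r}"


def generate(query, docs):
    """Hypothetical stand-in for the GPU-bound generation stage."""
    return f"answer to {query!r} using {docs}"


def serve(queries, max_backlog=4):
    """Run retrieval and generation as two concurrent stages joined by a queue."""
    handoff = queue.Queue(maxsize=max_backlog)  # bounded: retrieval cannot run far ahead
    results = []

    def retrieval_stage():
        for q in queries:
            handoff.put((q, retrieve(q)))
        handoff.put(None)  # sentinel: no more work

    def generation_stage():
        while True:
            item = handoff.get()
            if item is None:
                break
            q, docs = item
            results.append(generate(q, docs))

    stages = [threading.Thread(target=retrieval_stage),
              threading.Thread(target=generation_stage)]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results


print(serve(["what is ZeRO?", "what is a KV cache?"]))
```

While generation works on query i, retrieval is already fetching documents for query i+1, which is the source of the reduced idle time in a serial-versus-pipelined comparison.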
-
AI Inference costs are killing your profit margins. Let me teach you how to reduce your Inference Overhead with Compiler & Graph Execution.

Running an LLM under PyTorch or TensorFlow looks simple, but the framework issues thousands of separate GPU kernel calls for every forward pass. Each kernel executes a small unit of work (like normalization or matrix multiplication) and writes the result to global GPU memory (HBM) before reading it back. While HBM bandwidth reaches 2–3 TB/s on an H100, that is 10–50x slower than the GPU’s on-chip registers. Every unnecessary trip to HBM is wasted potential. Worse, each kernel launch requires the CPU to coordinate with the GPU, adding tens of microseconds of overhead. Across thousands of kernel launches, this becomes milliseconds of latency. Three techniques (kernel fusion, CUDA graphs, and FlashAttention) target these bottlenecks.

Kernel Fusion: Combining Operations
Instead of launching separate kernels for LayerNorm and matrix multiplication, you fuse them into one. The compiler rewrites the computational graph to combine operations, so intermediate results stay in the GPU’s fast on-chip registers and shared memory instead of touching global HBM. This cuts memory traffic and eliminates redundant kernel launches. The tax: irregular shapes or dynamic padding can block fusion, leading to a mix of fused and unfused kernels.

CUDA Graphs: Bypassing the CPU
Inference involves repeating the same sequence of kernels for every generated token. Rather than the CPU re-issuing every command, CUDA graphs let you record the sequence once and replay it with a single launch. This removes most of the per-kernel CPU launch overhead. The tax: graphs are tied to specific tensor shapes, so effective systems capture "hot" shapes and fall back to standard execution for others.

FlashAttention: Avoiding the Quadratic Wall
Standard attention materializes an N x N score matrix between queries and keys, so memory traffic grows quadratically with sequence length and can reach gigabytes for long contexts. FlashAttention tiles this computation, loading small blocks of queries and keys into on-chip SRAM to compute partial attention scores incrementally. The result is mathematically identical, but the memory footprint is a fraction of the original. The tax: gains depend on sequence length, and for very short sequences, the overhead of tiling can outweigh benefits.

Summary: The Performance Compound
Kernel fusion ensures fewer writes and more work per cycle. CUDA graphs remove launch overhead, keeping the GPU in constant motion. FlashAttention prevents memory blowup, freeing bandwidth for compute.
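The claim that tiling gives a mathematically identical result rests on the online-softmax trick. Below is a minimal pure-Python sketch for a single query, assuming plain lists as vectors; it is an illustration of the incremental update, not the FlashAttention kernel, and the names `attention_reference` and `attention_tiled` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, keys, values):
    """Softmax attention for one query, materializing every score at once."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_tiled(q, keys, values, block=2):
    """Same result, but only one block of scores exists at any time."""
    dim = len(values[0])
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        scores = [dot(q, k) for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (factor is 0.0 on the first block).
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            denom += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [0.9, -0.3], [0.4, 0.4], [-0.7, 1.1], [0.0, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]
print(attention_reference(q, keys, values))
print(attention_tiled(q, keys, values, block=2))
```

The two outputs agree up to floating-point rounding; the tiled version simply never holds more than `block` scores at once, which is what lets the real kernel keep its working set in SRAM.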
-
𝗪𝗵𝘆 𝗱𝗼 𝗹𝗼𝗻𝗴-𝗰𝗼𝗻𝘁𝗲𝘅𝘁 𝗟𝗟𝗠 𝗱𝗲𝗽𝗹𝗼𝘆𝗺𝗲𝗻𝘁𝘀 𝗵𝗶𝘁 𝗮 𝘄𝗮𝗹𝗹 𝗲𝘃𝗲𝗻 𝘄𝗵𝗲𝗻 𝘆𝗼𝘂 𝗵𝗮𝘃𝗲 𝗽𝗹𝗲𝗻𝘁𝘆 𝗼𝗳 𝗚𝗣𝗨 𝗰𝗼𝗺𝗽𝘂𝘁𝗲?

The bottleneck is often the KV cache: it avoids recomputing keys and values for past tokens, but it grows with context length and quickly becomes a GPU memory and bandwidth problem.

A recent paper ( https://lnkd.in/gRqBt6dd ) maps the KV-cache optimization space into five practical levers:
• 𝗘𝘃𝗶𝗰𝘁: keep only the most useful history (windowing, top-K, learned eviction).
• 𝗖𝗼𝗺𝗽𝗿𝗲𝘀𝘀: quantize/compress KV tensors to cut footprint with controlled quality tradeoffs.
• 𝗛𝘆𝗯𝗿𝗶𝗱 𝗺𝗲𝗺𝗼𝗿𝘆: keep hot KV on GPU and page cold KV to CPU/NVMe with smart prefetch.
• 𝗥𝗲𝘁𝗵𝗶𝗻𝗸 𝗮𝘁𝘁𝗲𝗻𝘁𝗶𝗼𝗻: use mechanisms that reduce dependence on full-history KV.
• 𝗖𝗼𝗺𝗯𝗶𝗻𝗲: use adaptive pipelines that mix these based on workload and hardware limits.

𝗞𝗲𝘆 𝘁𝗮𝗸𝗲𝗮𝘄𝗮𝘆: there’s no single winner; the right KV strategy depends on your serving mode (multi-turn chat vs long-context analysis vs high-throughput serving), and the best results often come from hybrid, scenario-driven combinations.
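As a concrete picture of the simplest lever (windowed eviction), here is a minimal sketch. The class name `WindowedKVCache` is hypothetical, the keys and values are placeholder objects rather than tensors, and real systems usually also pin a few early tokens or score entries before evicting.

```python
from collections import deque


class WindowedKVCache:
    """Keep KV entries for only the most recent `window` token positions."""

    def __init__(self, window):
        # A deque with maxlen silently drops the oldest entry when full.
        self.entries = deque(maxlen=window)

    def append(self, position, key, value):
        self.entries.append((position, key, value))

    def positions(self):
        return [pos for pos, _, _ in self.entries]


cache = WindowedKVCache(window=3)
for pos in range(5):
    cache.append(pos, key=f"k{pos}", value=f"v{pos}")
print(cache.positions())  # [2, 3, 4]
```

Memory stays constant at `window` entries no matter how long the conversation runs, at the cost of losing attention to anything older.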
-
You're in a Senior ML Interview at NVIDIA. The interviewer sets a trap:

"Your 7B model fits comfortably on a 24GB GPU. Yet, 10 minutes into a conversation, the service crashes with an Out-Of-Memory (OOM) error. Do we upgrade to an A100?"

90% of candidates walk right into it: "Yes, we need more VRAM." They think: "The model is running out of space, so we need a bigger bucket."

This is the "Brute Force" approach. It solves the symptom for exactly one week until their users type longer prompts, and then they crash an 80GB card too. They just 4x'd the cloud bill without solving the physics of the problem.

The reality is that the problem isn't 𝐒𝐭𝐚𝐭𝐢𝐜 𝐌𝐞𝐦𝐨𝐫𝐲 (𝐖𝐞𝐢𝐠𝐡𝐭𝐬). They are dying from 𝐃𝐲𝐧𝐚𝐦𝐢𝐜 𝐒𝐭𝐚𝐭𝐞 (𝐂𝐨𝐧𝐭𝐞𝐱𝐭).

In a production environment, GPU memory is consumed by two things:
- 𝘔𝘰𝘥𝘦𝘭 𝘞𝘦𝘪𝘨𝘩𝘵𝘴: Fixed. (e.g., ~14GB for a 7B param model in FP16).
- 𝘒𝘝 𝘊𝘢𝘤𝘩𝘦: Variable. This grows linearly with every single token generated.

A 7B model with a batch size of 64 and a context length of 2048 tokens can generate over 30GB of KV cache. The "Ghost Memory" is larger than the model itself.

-----

𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧: The real problem isn't just the size of the cache, it's Memory Fragmentation.

Standard PyTorch allocates contiguous memory blocks. As requests grow and shrink, they leave "holes" in your VRAM that are too small to use but add up to gigabytes of wasted space. This is The Swiss Cheese Effect.

The fix isn't hardware. It's Architecture:
1️⃣ 𝘗𝘢𝘨𝘦𝘥𝘈𝘵𝘵𝘦𝘯𝘵𝘪𝘰𝘯 (𝘷𝘓𝘓𝘔): Treat GPU memory like an Operating System treats RAM. Break the KV cache into non-contiguous "pages" so you can fill nearly every byte of VRAM without needing a continuous block.
2️⃣ 𝘒𝘝 𝘊𝘢𝘤𝘩𝘦 𝘖𝘧𝘧𝘭𝘰𝘢𝘥𝘪𝘯𝘨: If a user pauses for 30 seconds, move their KV cache to CPU RAM (cheap) and swap it back to GPU (expensive) only when they type again.

𝐓𝐡𝐞 𝐀𝐧𝐬𝐰𝐞𝐫 𝐓𝐡𝐚𝐭 𝐆𝐞𝐭𝐬 𝐘𝐨𝐮 𝐇𝐢𝐫𝐞𝐝:
"Buying GPUs is a band-aid. The bottleneck is the KV Cache growing linearly with context. I would implement PagedAttention to eliminate memory fragmentation and KV Offloading to handle idle sessions. We only upgrade hardware if the active computation, not the idle state, saturates the compute units."

#MachineLearning #DeepLearning #GenerativeAI #LLM #AIEngineering #MLOps #NVIDIA
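The "over 30GB" figure is easy to sanity-check. The sketch below assumes a Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128, FP16); those dimensions and the 16-token page size are assumptions for illustration, not numbers from the post.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache: one K and one V vector per token, per layer, per KV head."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return batch * seq_len * per_token


def pages_needed(seq_len, page_tokens=16):
    """Fixed-size pages needed for one sequence (ceiling division)."""
    return -(-seq_len // page_tokens)


# Assumed 7B-class shape: 32 layers, 32 KV heads, head_dim 128, FP16 values.
total = kv_cache_bytes(batch=64, seq_len=2048, n_layers=32, n_kv_heads=32, head_dim=128)
print(total / 2**30)       # 64.0 GiB, well past the "over 30GB" in the post
print(pages_needed(2049))  # 129: only the last page is partly empty
```

Under these assumptions each token costs 0.5 MiB, so the cache for this batch dwarfs the ~14GB of weights. With paging, waste is limited to the unused tail of each sequence's last page instead of whole contiguous reservations.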
-
People often ask why prices like $2.8 per million tokens for Llama 405B, while being super fast, are still profitable at Lepton AI. We've even been asked by a leading GPU provider! So, I figured we should share some technical analysis. This information could benefit the community. We've taken these statistics and analysis for granted, but they might not be obvious to everyone.

1. Big batches: Each request receives an output of ~30 tokens/second. Batching (grouping multiple requests simultaneously) significantly improves total throughput, often 10x or higher than a single request. GPUs are more efficient with larger batches.

2. Dynamic batching: This technique immediately adds a new request to an existing batch instead of making it wait, ensuring the GPU always works at high capacity.

3. Input tokens: The ~30 tokens/second refers to output tokens. Input tokens are processed much faster (known as "prefilling"). Typically, the input length is many times larger than the output (3x to 10x). This increases the total number of tokens processed, explaining why there is often separate billing for input and output.

4. Quantization: Using 8-bit integers or 8-bit floats instead of 16-bit floats reduces memory usage and speeds up processing because the GPU accesses less memory. Newer GPUs also have hardware instructions for lower bit widths, increasing speed further. For example, the new Nvidia Blackwell GPU supports 4-bit floats (fp4). Quantization also saves memory, allowing even bigger batches from point 1, making it more economical.

5. Speculative decoding: This method uses a smaller model to predict the next token. For example, predicting "you" after "it is good to see" doesn't require a large model. Smaller models make such predictions faster. The Medusa algorithm by Tianle Cai is a specific example of this approach.

6. Prompt caching: LLMs often encounter repeated prefixes, such as "you are a smart AI agent" in system prompts. Caching these prefilled prompts avoids recalculating them, speeding up repeated requests.

7. Optimizing GPU setups: This involves using large GPUs for big models, small GPUs for small models, and matching GPUs to specific tasks: some are better for prefilling, others for decoding. There are many optimization opportunities here.

This is not a complete list. We integrate these methods (and a growing number of others) in our runtime to ensure profitability with reasonable traffic.

Lepton is created by experts who have developed key AI software over the past decade (Caffe, ONNX, PyTorch) alongside cloud experts like the creator of etcd and core contributors to Kubernetes. We provide not only LLM APIs, but also a full cloud-native experience to help you find, use, and optimize GPUs on our cloud platform.

We love the open-source and open-access community. What AI technical explanation would you like to hear next?
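A back-of-the-envelope sketch shows why point 1 dominates the economics. It uses the post's $2.8 per million tokens and ~30 output tokens/second per request; the batch sizes are hypothetical, and it optimistically assumes per-request speed does not drop as the batch grows and that input tokens are ignored.

```python
def output_revenue_per_hour(concurrent_requests, tokens_per_second=30.0,
                            usd_per_million_tokens=2.8):
    """Hourly revenue from output tokens alone for one serving replica."""
    tokens_per_hour = concurrent_requests * tokens_per_second * 3600
    return tokens_per_hour / 1_000_000 * usd_per_million_tokens


# Hypothetical batch sizes; real limits depend on memory and latency targets.
for batch in (1, 10, 100):
    print(batch, round(output_revenue_per_hour(batch), 2))
```

A single request stream earns about $0.30 per hour, far below the cost of the hardware serving a 405B model, while 100 concurrent streams earn about $30 per hour on the same replica. Input tokens, quantization, and caching then improve the picture further.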
-
GPU matrix multiplication may be the most expensive algorithm that exists. It is the main operation that OpenAI, Anthropic, Meta spend billions of $ of compute on. There are only 8 kernel optimizations you need to understand for 93.7% perf of NVIDIA’s state of the art cuBLAS library.

In this thread, we’ll go over kernels that get progressively more performant from an Anthropic engineer's blog post following the attached diagram.

Kernel 1: Simply multiplies two matrices. We’ll use CUDA’s grid, block and thread hierarchy to assign each thread a unique entry in the result matrix C. This works, but only gets us 309 GFLOPs/s (1.3% of cuBLAS on an A6000 GPU); we can do much better.

Kernel 2: Enables global memory coalescing by using “warps” (groups of threads). Threads that are part of the same warp can group their memory accesses into one. This dramatically improves memory throughput (110GB/s vs 15GB/s). Result: 1986 GFLOPs/s (8.5% of cuBLAS)

Kernel 3: Utilizes on-chip shared memory (SMEM). SMEM bandwidth is much higher than global memory (12,080GiB/s vs 750GiB/s). We load chunks from A and B into SMEM and then perform as much work as possible on them. Result: 2980 GFLOPs/s (12.8% of cuBLAS)

Kernel 4: Uses 1D blocktiling for calculating multiple results per thread. It works like the last one but adds an inner loop for multiple C entries per thread (does more in SMEM) with a 4KB SMEM cache per block. Result: 8474 GFLOPs/s, ~3x faster than the last (36.5% of cuBLAS)

Kernel 5: Increases arithmetic intensity via 2D blocktiling. We compute a grid of 8*8 results per thread, leveraging shared memory and local registers to reduce global memory accesses. It offers another ~2x performance boost. Result: 15971 GFLOPs/s (68.7% of cuBLAS)

Kernel 6: Vectorizing memory accesses. The key is to transpose loads from A, enabling the use of 128-bit load instructions (LDS.128) instead of 32-bit loads. This enables more efficient data movement. Result: 18237 GFLOPs/s (78.4% of cuBLAS)

Kernel 7: Tunes params for how much data we cache in SMEM and registers, which improves performance. We use a bash script to search all valid combinations to find the optimal settings. Result: 19721 GFLOPs/s (84.8% of cuBLAS)

Kernel 8: Adds "warptiling". This is yet another form of tiling (on top of blocktiling and threadtiling). Warptiling allows different warps to execute in parallel on different warp schedulers. Leverages hardware for even more parallelism. Result: 21779 GFLOPs/s (93.7% of cuBLAS)

From reading the original post, I learned that optimizing GPU kernels requires a deep understanding of the hardware and memory access patterns. The basics are simple and get you most of the way there (author got ~80% of the perf in 2 weekends). It took another 4 weekends to get the last 14% (classic power law).

For much more in-depth explanations with helpful diagrams and code snippets, check out the original post here, it's really interesting: https://lnkd.in/gi-y4NFB
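The loop structure behind block tiling (Kernels 3 to 5) can be shown on the CPU. This is only a pure-Python illustration of the idea of copying a small tile of A and B into a local buffer and reusing it for many multiply-adds; it is not the CUDA kernels from the blog post, and `matmul_tiled` and its `tile` parameter are hypothetical names.

```python
def matmul_naive(a, b):
    """One output entry at a time, re-reading a full row and column for each."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(a, b, tile=2):
    """Same product, computed tile by tile so each loaded tile is reused."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Local copies play the role of the shared-memory tiles.
                a_tile = [row[p0:p0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[p0:p0 + tile]]
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        c[i0 + i][j0 + j] += sum(
                            a_row[p] * b_tile[p][j] for p in range(len(b_tile)))
    return c


a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
b = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 2, 0, 1], [1, 3, 1, 0], [4, 0, 1, 2]]
print(matmul_tiled(a, b) == matmul_naive(a, b))  # True
```

On a GPU the payoff is that each value loaded from slow global memory into a tile is used `tile` times before being discarded, which is the rise in arithmetic intensity the thread describes.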
-
Everyone in AI knows quantization. FP16 to INT8 to FP4. Sacrifice precision, gain speed. Standard tradeoff.

Emulation is the opposite. The idea: use low-precision tensor cores to produce a high-precision result. Not an approximation: the goal is full native-precision accuracy.

Two algorithms in cuBLAS right now as examples.

𝗕𝗙𝟭𝟲𝘅𝟵 for FP32. Each FP32 value is split into three BF16 components. The matrix multiply expands into 9 BF16 GEMMs on tensor cores. Recombine. Full FP32 precision recovered. Why it works: BF16 and FP32 share the same 8-bit exponent. FP32's 24-bit significand (23 stored bits plus the implicit leading one) is redistributed across three BF16 values, each carrying an 8-bit significand (7 stored bits plus the implicit one). Nothing is lost.

𝗢𝘇𝗮𝗸𝗶 𝗦𝗰𝗵𝗲𝗺𝗲 for FP64. Each FP64 value is scaled by a shared power-of-two factor, then sliced into INT8 chunks. Multiple INT8 GEMMs on tensor cores. Recombination uses error-free transformations to guarantee zero accumulated rounding error. The number of slices depends on the input data range. cuBLAS picks automatically via ADP (Automatic Dynamic Precision).

The results on Blackwell: up to 13x faster than native FP64 on RTX PRO 6000. Same accuracy or better.

Quantization: you trade precision for speed. Emulation: you keep precision AND gain speed. The cost is more GEMMs. But when tensor core throughput is 10x+ higher than native arithmetic, the math works out.

This matters for HPC. Weather simulation, quantum chemistry, materials science. People who need FP64 but want GPU speed. cuBLAS now gives them both.

CUBLAS_COMPUTE_32F_EMULATED_16BFX9
CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT

Two enum values. That's all it takes to switch.
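The three-way split can be checked on the CPU with nothing but bit masks. This sketch assumes a truncation-based split (keep the top 16 bits of the FP32 encoding, then repeat on the remainder), which is one way to do it and not necessarily what cuBLAS does internally; the helper names are hypothetical. Exact rationals are used so the check does not depend on floating-point rounding, whereas real tensor-core accumulation does round.

```python
import struct
from fractions import Fraction


def to_f32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def truncate_to_bf16(x):
    """Keep sign, 8 exponent bits and top 7 mantissa bits of the FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def split_bf16x3(x):
    """Split an FP32 value into three BF16-representable parts that sum to it."""
    x = to_f32(x)
    hi = truncate_to_bf16(x)
    mid = truncate_to_bf16(x - hi)
    lo = truncate_to_bf16(x - hi - mid)
    return hi, mid, lo


def product_from_nine_terms(a, b):
    """Sum the 3 x 3 cross products of the parts, using exact rationals."""
    return sum(Fraction(p) * Fraction(q)
               for p in split_bf16x3(a) for q in split_bf16x3(b))


a, b = 3.14159, -1234.5678
print(split_bf16x3(a))
print(product_from_nine_terms(a, b) == Fraction(to_f32(a)) * Fraction(to_f32(b)))
```

For values in the normal FP32 range the three parts reconstruct the input exactly, so the nine cross products carry the same information as one FP32 multiply. In a GEMM, each of those nine terms becomes its own BF16 matrix multiply.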
-
I made our LLM inference about 3.3x faster without buying a single new GPU.

We didn't change the model (Llama-3-70B). We didn't change the hardware (A100s). We just changed how we asked for the tokens.

If you are running raw inference, you can be wasting 90% of your GPU's potential. Token-by-token LLM decoding is Memory Bound, not Compute Bound. The GPU spends most of its time waiting to fetch weights from VRAM, not actually calculating.

The Lesson: We implemented Speculative Decoding. Here is the "Big vs. Small" trick:
• We run a tiny, cheap model (like Llama-8B) to "guess" the next 5 tokens. It does this quickly.
• We ask the giant model (70B) to verify those 5 tokens in a single parallel batch.

It is much faster to say: "Check these 5 words" than to say: "Generate word 1... wait... Generate word 2... wait..."

The Math:
• If the tiny model guesses right (which is easy for common phrases), you get 5 tokens for the latency cost of 1.
• If it guesses wrong, you keep the tokens up to the first mismatch, take the big model's token at that position, and discard the rest. No accuracy loss.

You are effectively trading "Compute" (which you have in surplus) for "Memory Bandwidth" (which is your bottleneck).

The Result:
• Latency: Dropped from 40ms/token to 12ms/token.
• User Experience: Real-time.
• Accuracy: Identical (Verified).

You don't need to write this from scratch. The vLLM library supports this out of the box. (𝘓𝘪𝘯𝘬 𝘪𝘯 𝘤𝘰𝘮𝘮𝘦𝘯𝘵𝘴)

#MachineLearning #LLM #SystemDesign
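Here is a toy sketch of the draft-and-verify loop with greedy decoding. The two "models" are hypothetical one-line functions over integer tokens, and verification is written as a Python loop where a real system would score all drafted positions in one batched forward pass of the big model. It is not vLLM's implementation.

```python
def target_next(seq):
    """Hypothetical big model: next token is previous + 1, modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Hypothetical small model: agrees with the target except right after a 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


def greedy_decode(prompt, n_tokens):
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative_decode(prompt, n_tokens, k=5):
    """Return (tokens, verification_passes); tokens match greedy_decode exactly."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < n_tokens:
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        passes += 1  # one verification pass of the big model per round
        accepted = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n_tokens], passes


tokens, passes = speculative_decode([0], n_tokens=12)
print(tokens == greedy_decode([0], 12), passes)  # True 3
```

Twelve tokens come out of three verification rounds instead of twelve sequential big-model steps, and the output is identical to plain greedy decoding because every emitted token is one the big model would have chosen.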