Top LinkedIn Content on Machine Learning Model Tuning

PhD in AI, author of 📖 The Hundred-Page Language Models Book and 📖 The Hundred-Page Machine Learning Book

488,346 followers 3w

An absolute must read. LLMs cost a lot to run, so a common move is to train a small model to imitate a big one — feeding the small "student" the same inputs and having it match, word by word, the probabilities the large "teacher" assigns to each possible next word, a procedure called knowledge distillation. That matching is done on a fixed collection of example sentences, but a model writing text builds each sentence out of its own earlier words, so once the student makes an early choice that none of the training examples contained, it ends up in situations it was never shown, and small mistakes feed into later ones until the text degrades. In this ICLR 2024 paper from Google, Mila, and UoT, the authors instead have the student write sentences itself and use those sentences to choose the situations it gets tested on: at each point in a student-written sentence they take the words so far, ask the teacher what the distribution over the next word should be there, and push the student toward the teacher's answer — so the teacher supplies every target while the student's own writing decides where those targets get applied, which is exactly the off-track spots its writing tends to wander into. Tested on summarization, English-to-German translation, and grade-school math problems where the model writes out its reasoning before answering, this self-generated-data approach beats standard distillation recipes across a range of student sizes, and it slots into reinforcement-learning fine-tuning cleanly because both only need samples drawn from the student rather than gradients passed back through the sampling step. Read with an AI tutor and quizzes for better retention: https://lnkd.in/efguF7mr PDF: https://lnkd.in/edfTWfgt
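The on-policy idea described above can be sketched in a few lines of plain Python. This is a toy illustration, not the paper's implementation: `teacher_probs` and `student_probs` are hypothetical hand-written "models" over a three-token vocabulary, and forward KL is used as one possible divergence (an assumption made here for simplicity). The point it shows is that the prefixes being graded come from the student's own samples, while every target distribution comes from the teacher.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]  # hypothetical toy vocabulary

def teacher_probs(prefix):
    # Hypothetical teacher: alternates "a" and "b", then stops after 3 tokens.
    if len(prefix) >= 3:
        return [0.05, 0.05, 0.90]
    if prefix and prefix[-1] == "a":
        return [0.10, 0.80, 0.10]
    return [0.80, 0.10, 0.10]

def student_probs(prefix):
    # Hypothetical student: a cruder policy that ignores what the prefix says.
    if len(prefix) >= 3:
        return [0.20, 0.20, 0.60]
    return [0.45, 0.45, 0.10]

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sample_from_student(rng, student, max_len=6):
    """Let the student write a sequence token by token from its own distribution."""
    tokens = []
    while len(tokens) < max_len:
        token = rng.choices(VOCAB, weights=student(tokens))[0]
        if token == "<eos>":
            break
        tokens.append(token)
    return tokens

def on_policy_distill_loss(rng, student, teacher, n_samples=200):
    """Mean KL(teacher || student) over prefixes that the student itself reached."""
    total, count = 0.0, 0
    for _ in range(n_samples):
        tokens = sample_from_student(rng, student)
        # Each prefix of the student's own output is a situation to be graded;
        # the teacher supplies the target distribution at that prefix.
        for t in range(len(tokens) + 1):
            prefix = tokens[:t]
            total += kl(teacher(prefix), student(prefix))
            count += 1
    return total / count

loss = on_policy_distill_loss(random.Random(0), student_probs, teacher_probs)
perfect = on_policy_distill_loss(random.Random(0), teacher_probs, teacher_probs)
print(f"student vs teacher: {loss:.4f}, teacher vs itself: {perfect:.4f}")
```

In a real trainer the student would be a neural network and this loss would be minimized by gradient descent on the student's parameters, with the sampled tokens treated as fixed data so no gradient flows through the sampling step.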

Zain Hasan

I build and teach AI | AI/ML @ Together AI | EngSci ℕΨ/PhD @ UofT | Previously: Vector DBs, Data Scientist, Lecturer & Health Tech Founder | 🇺🇸🇨🇦🇵🇰

20,042 followers 1y

An explanation of language model distillation, how it works, why it’s useful, and examples of how you can perform distillation.
What is distillation? Distillation is a model compression technique where a smaller "student" model is trained to mimic the behavior of a larger "teacher" model. This is achieved by transferring knowledge from the teacher to the student, usually through methods like logit-based or hidden states-based distillation. These methods are designed to help the student model replicate the teacher's output distribution or internal representations, often leading to a more efficient model with comparable performance.
When would we use this? Distillation is commonly used when deploying large models is impractical due to resource constraints, such as in real-time applications or edge devices. For instance, a smaller student model can be distilled from a powerful teacher model like Llama3.1 405B, retaining much of the original model’s capability but with significantly lower computational demands. Distillation is also useful when adapting models to specific tasks or domains, as seen in domain-specific distillation cases like "function calling," where specialized knowledge from a teacher model is transferred to a smaller model for specific use cases.
What’s the benefit? Distillation offers a significant reduction in model size and computational requirements while maintaining a high level of performance. This is especially valuable in scenarios where memory and processing power are limited. Moreover, distillation allows for flexibility in model architecture choices; for example, distilling knowledge from a Llama-3.1-70B model into a much smaller StableLM-2-1.6B model. Distillation methods like those provided in Arcee-AI's DistillKit, including logit-based and hidden states-based distillation, can lead to substantial performance gains over traditional training routines without requiring additional data.
Examples of Distillation Techniques:
(1) Logit-based Distillation: This method involves transferring knowledge by using both the hard targets (actual labels) and soft targets (teacher logits) to guide the student model. The student is trained to minimize the difference between its output distribution and the teacher’s output, typically using Kullback-Leibler (KL) divergence. This method is particularly effective for maintaining performance close to the teacher model while improving the student’s generalization abilities.
(2) Hidden States-based Distillation: Here, the focus is on aligning the intermediate layer representations of the student with those of the teacher. This layer-wise guidance helps the student model capture similar features and improves its performance and generalization. This method also allows for cross-architecture distillation, enabling knowledge transfer between different model architectures, such as distilling from a Llama-3.1-70B model into a StableLM-2-1.6B model.
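The logit-based recipe in (1) can be written down as a single loss for one example. The sketch below is a minimal pure-Python version, not DistillKit's code: the `temperature` and `alpha` hyperparameters and the example logits are assumptions chosen for illustration, and the `temperature ** 2` scaling follows the common convention from the original knowledge distillation formulation.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * hard cross-entropy + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-target term: ordinary cross-entropy against the true label.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-target term: KL divergence between temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return alpha * hard + (1 - alpha) * (temperature ** 2) * soft

teacher = [2.0, 0.5, -1.0]    # hypothetical teacher logits for 3 classes
untrained = [0.0, 0.0, 0.0]   # hypothetical student that knows nothing yet
matched_loss = distillation_loss(teacher, teacher, label=0)
untrained_loss = distillation_loss(untrained, teacher, label=0)
print(f"matched student: {matched_loss:.4f}, untrained student: {untrained_loss:.4f}")
```

When the student's logits equal the teacher's, the KL term vanishes and only the weighted cross-entropy remains, which is why the matched loss is the smaller of the two.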

Stefano Puntoni

Wharton Professor - AI & Behavioral Science

54,363 followers 1y

One of the most important findings coming out of the new behavioral science of AI is that LLMs are acquiring superhuman powers of persuasion. Adding to a string of recent papers, a new article found that LLMs were more effective than humans in changing people’s minds about sociopolitical issues, but only when they were provided basic demographic information about the person to be persuaded. Without personalization the LLM was not significantly more persuasive than people (although the direction of the difference points to a potentially significant effect with a larger sample). Interestingly, personal information did not help people be more persuasive in this study. The results show the effectiveness of microtargeting at scale via LLMs and have broad implications for business (especially marketing) and democracy. I’ll make a note to develop a fuller segment on this topic in my “AI in Our Lives” course at The Wharton School. Link to article in comment. Wharton AI & Analytics Initiative

Ranjani Mani

75,359 followers 1y

🚀 Exciting News - Introduction of Direct Preference Optimization (DPO) 🚀 aka how do you get LLMs to align closer to human preferences
𝗖𝗼𝗻𝘁𝗲𝘅𝘁 - Here is a really succinct summary of RLHF vs DPO from Andrew Ng:
"RLHF has been a key technique for training LLMs. In brief, RLHF (i) Gets humans to specify their preferences by ranking LLM outputs, (ii) Trains a reward model (used to score LLM outputs) -- typically represented using a transformer network -- to be consistent with the human preferences, (iii) Uses reinforcement learning to tune an LLM, also represented as a transformer, to maximize rewards. This requires two transformer networks, and RLHF is also finicky to the choice of hyperparameters. DPO simplifies the whole thing. Via clever mathematical insight, the authors show that given an LLM, there is a specific reward function for which that LLM is optimal. DPO then trains the LLM directly to make the reward function (that’s now implicitly defined by the LLM) consistent with the human rankings. So, you no longer need to deal with a separately represented reward function -- you just need the LLM transformer -- and you can train the LLM directly and more efficiently to optimize the same objective as RLHF."
"DPO is particularly beneficial in scenarios where there is no clear-cut correct answer, and subjective elements like tone, style, or specific content preferences are important. This approach allows the model to learn from both positive examples (what's considered correct or ideal) and negative examples (what's less desired or incorrect)."
Azure OpenAI Service has launched the public preview of Direct Preference Optimization (DPO).
𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆 𝗮𝗻𝗱 𝗦𝗶𝗺𝗽𝗹𝗶𝗰𝗶𝘁𝘆: Unlike traditional methods like Reinforcement Learning from Human Feedback (RLHF), DPO does not require a separate reward model, making it computationally lighter and faster while maintaining effectiveness.
𝗦𝘁𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗮𝗻𝗱 𝗖𝗼𝗻𝘀𝗶𝘀𝘁𝗲𝗻𝗰𝘆: DPO directly optimizes the policy based on human preferences, avoiding the instability often associated with training multiple models, leading to more consistent and reliable outcomes.
𝗕𝗶𝗮𝘀 𝗠𝗶𝘁𝗶𝗴𝗮𝘁𝗶𝗼𝗻: By incorporating human preferences directly into the optimization process, DPO helps reduce unintended biases in the model's behaviour, ensuring more desirable and ethical outputs.
𝗩𝗲𝗿𝘀𝗮𝘁𝗶𝗹𝗲 𝗔𝗽𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻𝘀: DPO is particularly beneficial in scenarios where subjective elements like tone, style, or specific content preferences are important, making it ideal for tasks such as customer service, recommendation systems, and creative content generation.
Check out the link in comments to explore DPO on Azure OpenAI.
Ranjani Mani #reviewswithranjani #Technology | #Books | #Beingbetter
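To make the "implicit reward" idea concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. It assumes the standard DPO formulation, which compares the model being trained against a frozen reference copy of itself; the log-probability values and the `beta` setting below are made-up numbers for illustration, and this is not Azure OpenAI's implementation.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(margin)), where margin is the gap between implicit rewards."""
    # Implicit reward of a response: beta * (policy log-prob - reference log-prob).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical sequence log-probabilities for one (chosen, rejected) pair.
no_preference = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy identical to reference
agrees = dpo_loss(-4.0, -6.0, -5.0, -5.0)          # policy favors the chosen answer
disagrees = dpo_loss(-6.0, -4.0, -5.0, -5.0)       # policy favors the rejected answer
print(f"{no_preference:.4f} {agrees:.4f} {disagrees:.4f}")
```

The loss starts at log 2 when the policy matches the reference, falls as the policy shifts probability toward the preferred response, and rises when it shifts toward the rejected one. No separate reward network appears anywhere in the computation.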

Anthony Leiserowitz

Professor at the Yale School of the Environment

223,344 followers 1y

Large language models (LLMs) such as ChatGPT are becoming more popular among researchers, for tasks ranging from idea generation to writing code. Can they accurately estimate public opinion about global warming? Our new study finds:
1. Large language models (LLMs) such as ChatGPT can estimate public opinion about global warming, with ChatGPT-4 outperforming ChatGPT-3.5.
2. Prompting LLMs with information about public engagement with global warming (e.g., issue involvement) yields better estimates than using demographics alone.
3. LLMs underestimate the percentage of Black Americans who believe global warming is happening.
Learn more: https://ow.ly/8jre50STp5I

Victoria Slocum

Machine Learning Engineer @ Weaviate

48,102 followers 6mo

Google just proved that bigger isn't always better. Their 308M parameter model is outperforming models 2x its size. Google just released 𝗘𝗺𝗯𝗲𝗱𝗱𝗶𝗻𝗴𝗚𝗲𝗺𝗺𝗮, and it's proving that lightweight embedding models can punch way above their weight class. At just 308M parameters (578MB), it's the new state-of-the-art for models under 500M parameters across MTEB multilingual, English, and code benchmarks. But the really impressive part is that it ranks 8th overall on MTEB(Multilingual, v2) - that's 𝟭𝟳 𝗽𝗹𝗮𝗰𝗲𝘀 above the second-best sub-500M model, and it's delivering performance 𝗰𝗼𝗺𝗽𝗮𝗿𝗮𝗯𝗹𝗲 𝘁𝗼 𝗺𝗼𝗱𝗲𝗹𝘀 𝗻𝗲𝗮𝗿𝗹𝘆 𝗱𝗼𝘂𝗯𝗹𝗲 𝗶𝘁𝘀 𝘀𝗶𝘇𝗲. There are three key parts of their training recipe that set it apart:
𝟭. 𝗘𝗻𝗰𝗼𝗱𝗲𝗿-𝗗𝗲𝗰𝗼𝗱𝗲𝗿 𝗜𝗻𝗶𝘁𝗶𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 Instead of using a decoder-only Gemma 3 model directly, they first adapted it to an encoder-decoder, then kept just the encoder. Basing EmbeddingGemma on an LLM that already has world and language understanding gives it a stronger starting point.
𝟮. 𝗧𝗵𝗿𝗲𝗲-𝗟𝗼𝘀𝘀 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴 They combine three different loss functions, instead of just having one:
• Contrastive loss (NCE) with in-batch negatives and hardness weighting
• Spread-out regularization to ensure embeddings utilize the full space (for quantization and ANN retrieval)
• Embedding matching distillation from Gemini Embedding - not just learning from relevance scores, but directly aligning the embedding space with the teacher model
𝟯. 𝗠𝗼𝗱𝗲𝗹 𝗦𝗼𝘂𝗽𝗶𝗻𝗴 Rather than just averaging checkpoints from the same training run, they use optimization techniques to find multiple specialized training mixtures. Each mixture creates an "expert" model in a different domain, and averaging all their parameters creates a final model that's actually better than the individual models.
Extras:
• Matryoshka embeddings supporting 768, 512, 256, and 128 dimensions
• Quantization-aware training - maintains quality even at int4 precision
• 100+ languages from Gemma 3 pretraining
• Exceptional performance on low-resource languages (check their XTREME-UP results)
Is it the absolute best embedding model? No - Gemini Embedding still leads overall. But that's not really the point. EmbeddingGemma proves you can achieve state-of-the-art performance in a small package that's actually deployable on-device, in low-latency applications, and in resource-constrained environments. This makes good embeddings accessible for use cases that I'm seeing more and more: offline applications, privacy-sensitive deployments, and high-throughput scenarios where inference cost actually matters.
Full paper: https://lnkd.in/eCiu-NDc
Shoutout to the EmbeddingGemma team at Google DeepMind for this awesome open source work 💙 and to Danny Williams for helping me with this video! 🫶
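The model souping step (averaging the parameters of several "expert" models) is simple enough to sketch directly. The version below is a toy in plain Python, assuming each checkpoint is a dict that maps a parameter name to a flat list of floats; the checkpoint contents and the `average_checkpoints` helper are hypothetical and are not taken from the EmbeddingGemma codebase.

```python
def average_checkpoints(checkpoints, weights=None):
    """Return the (optionally weighted) element-wise mean of several checkpoints."""
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    souped = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        souped[name] = [sum(w * v for w, v in zip(weights, col))
                        for col in columns]
    return souped

# Two hypothetical "expert" checkpoints with identical parameter shapes.
expert_a = {"layer.weight": [1.0, 3.0], "layer.bias": [0.0]}
expert_b = {"layer.weight": [3.0, 5.0], "layer.bias": [2.0]}
soup = average_checkpoints([expert_a, expert_b])
print(soup)
```

Averaging only makes sense when the checkpoints share an architecture and a common starting point, which is the case when they are fine-tuned from the same base model on different training mixtures.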

Philipp Schmid

Agents & Gemini API, MTS at Google DeepMind 🔵 prev: Tech Lead at Hugging Face, AWS ML Hero 🤗 Sharing my own views and AI News

165,695 followers 1y

How can we create smaller, more efficient LLMs from larger ones? 🤔 NVIDIA combined structured weight pruning with knowledge distillation to reduce Meta Llama 3.1 8B to a 4B model. 🦙 Pruning reduces the model's depth and width.
TL;DR:
🔪 Pruning: Removed 16 layers (50%) and reduced embedding and MLP dimensions
🧠 Distillation: Used classical knowledge distillation with the original 8B model as a teacher
🔬 Tuned on 94B tokens with specific learning rate and batch size parameters
📊 Llama-3.1-Minitron 4B (width-pruned) achieved 92.7% of Llama 3.1 8B's MMLU score
🏆 Outperformed other similarly-sized models trained from scratch
🏎️ ~1.8x throughput increase compared to Llama 3.1 8B
Models: https://lnkd.in/eibcSCMK
Paper: https://lnkd.in/eWYG45kz
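As a rough picture of what structured width pruning does to one MLP block, the sketch below ranks hidden neurons by their mean absolute activation on a few calibration inputs and keeps only the top ones. This is a toy in plain Python under stated assumptions: the importance score, the tiny weight matrices, and the `prune_mlp_width` helper are illustrative choices, not NVIDIA's actual procedure.

```python
def hidden_activations(w_in, x):
    """ReLU(w_in @ x) for one input vector; w_in has one row per hidden neuron."""
    return [max(sum(w * xi for w, xi in zip(row, x)), 0.0) for row in w_in]

def neuron_importance(w_in, calibration_inputs):
    """Mean absolute activation of each hidden neuron over the calibration set."""
    totals = [0.0] * len(w_in)
    for x in calibration_inputs:
        for j, a in enumerate(hidden_activations(w_in, x)):
            totals[j] += abs(a)
    return [t / len(calibration_inputs) for t in totals]

def prune_mlp_width(w_in, w_out, calibration_inputs, keep):
    """Keep the `keep` most important hidden neurons and slice both matrices."""
    scores = neuron_importance(w_in, calibration_inputs)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])                      # preserve original order
    pruned_in = [w_in[j] for j in kept]               # drop rows of w_in
    pruned_out = [[row[j] for j in kept] for row in w_out]  # drop columns of w_out
    return pruned_in, pruned_out, kept

# Hypothetical MLP: 2 inputs -> 4 hidden neurons -> 2 outputs.
w_in = [[1.0, 0.0], [0.0, 1.0], [0.01, 0.0], [-1.0, -1.0]]
w_out = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
calibration = [[1.0, 1.0], [2.0, 0.0]]
pruned_in, pruned_out, kept = prune_mlp_width(w_in, w_out, calibration, keep=2)
print(kept, pruned_in, pruned_out)
```

Pruning like this usually costs some accuracy, which is where the distillation step comes in: the smaller network is then trained to match the original model's outputs.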

Valerio Capraro

Associate Professor at the University of Milan Bicocca

13,839 followers 5mo

Major preprint just out! We compare how humans and LLMs form judgments across seven epistemological stages. We highlight seven fault lines, points at which humans and LLMs fundamentally diverge:
The Grounding fault: Humans anchor judgment in perceptual, embodied, and social experience, whereas LLMs begin from text alone, reconstructing meaning indirectly from symbols.
The Parsing fault: Humans parse situations through integrated perceptual and conceptual processes; LLMs perform mechanical tokenization that yields a structurally convenient but semantically thin representation.
The Experience fault: Humans rely on episodic memory, intuitive physics and psychology, and learned concepts; LLMs rely solely on statistical associations encoded in embeddings.
The Motivation fault: Human judgment is guided by emotions, goals, values, and evolutionarily shaped motivations; LLMs have no intrinsic preferences, aims, or affective significance.
The Causality fault: Humans reason using causal models, counterfactuals, and principled evaluation; LLMs integrate textual context without constructing causal explanations, depending instead on surface correlations.
The Metacognitive fault: Humans monitor uncertainty, detect errors, and can suspend judgment; LLMs lack metacognition and must always produce an output, making hallucinations structurally unavoidable.
The Value fault: Human judgments reflect identity, morality, and real-world stakes; LLM "judgments" are probabilistic next-token predictions without intrinsic valuation or accountability.
Despite these fault lines, humans systematically over-believe LLM outputs, because fluent and confident language produces a credibility bias. We argue that this creates a structural condition, Epistemia: linguistic plausibility substitutes for epistemic evaluation, producing the feeling of knowing without actually knowing.
To address Epistemia, we propose three complementary strategies: epistemic evaluation, epistemic governance, and epistemic literacy. Full paper in the first comment. Joint with Walter Quattrociocchi and Matjaz Perc.

Tanner Norton

2,115 followers 1y

How do you make a small model perform like a large one? Let's talk about the idea of knowledge distillation. The animation below shows a small "student" model learning to mimic a larger, more accurate "teacher" model.
Left: Teacher’s decision boundary
Middle: Student learning over time
Right: Error between student and teacher
Instead of learning from just the correct label, the student learns from the teacher’s soft probabilities: confidence values that tell it how sure the teacher is. These soft targets provide richer signal than hard 0/1 labels. This allows a smaller model to:
Learn faster
Generalize better
Preserve most of the teacher’s behavior
Distillation is especially valuable when deploying models to edge devices, real-time systems, or anywhere size and speed matter. If you're building production ML systems, this is a technique worth knowing. Let me know if you'd like the Python code for this demo.

Ravid Shwartz Ziv

AI Researcher| Meta | NYU | Consultant | LLMs - Memory, World Models, Compression, & Tabular Data

19,661 followers 1y

You know all those arguments that LLMs think like humans? Turns out it's not true 😱 In our new paper we put this to the test by checking if LLMs form concepts the same way humans do. Do LLMs truly grasp concepts and meaning analogously to humans, or is their success primarily rooted in sophisticated statistical pattern matching over vast datasets? We used classic cognitive experiments as benchmarks. What we found is surprising... 🧐
We used seminal datasets from cognitive psychology that mapped how humans actually categorize things like "birds" or "furniture" ('robin' as a typical bird). The nice thing about these datasets is that they are not crowdsourced, they're rigorous scientific benchmarks. We tested 30+ LLMs (BERT, Llama, Gemma, Qwen, etc.) using an information-theoretic framework that measures the trade-off between:
- Compression (how efficiently you organize info)
- Meaning preservation (how much semantic detail you keep)
Finding #1: The Good News LLMs DO form broad conceptual categories that align with humans significantly above chance. Surprisingly (or not?), smaller encoder models like BERT outperformed much larger models. Scale isn't everything!
Finding #2: But LLMs struggle with fine-grained semantic distinctions. They can't capture "typicality" - like knowing a robin is a more typical bird than a penguin. Their internal concept structure doesn't match human intuitions about category membership.
Finding #3: The Big Difference Here's the kicker: LLMs and humans optimize for completely different things.
- LLMs: Aggressive statistical compression (minimize redundancy)
- Humans: Adaptive richness (preserve flexibility and context)
This explains why LLMs can be simultaneously impressive AND miss obvious human-like reasoning. They're not broken - they're just optimized for pattern matching rather than the rich, contextual understanding humans use.
What this means:
- Current scaling might not lead to human-like understanding
- We need architectures that balance compression with semantic richness
- The path to AGI ( 😅 ) might require rethinking optimization objectives
Our paper gives tools to measure this compression-meaning trade-off. This could guide future AI development toward more human-aligned conceptual representations. Cool to see cognitive psychology and AI research coming together! Thanks to Chen Shani, Ph.D., who did all the work and Yann LeCun and Dan Jurafsky for their guidance
