This Stanford study examined how six major AI companies (Anthropic, OpenAI, Google, Meta, Microsoft, and Amazon) handle user data from chatbot conversations. Here are the main privacy concerns. 👀 All six companies use chat data for training by default, though some allow opt-out 👀 Data retention is often indefinite, with personal information stored long-term 👀 Cross-platform data merging occurs at multi-product companies (Google, Meta, Microsoft, Amazon) 👀 Children's data is handled inconsistently, with most companies not adequately protecting minors 👀 Transparency is limited: privacy policies are complex, hard to understand, and often lack crucial details about actual practices Practical takeaways for nonprofits' acceptable use policies and training on generative AI: ✅ Assume anything you share will be used for training - sensitive information, uploaded files, health details, biometric data, etc. ✅ Opt out when possible - proactively disable data collection for training (Meta is the one where you cannot) ✅ Information cascades through ecosystems - your inputs can lead to inferences that affect ads, recommendations, and potentially insurance or other third parties ✅ Special concern for children's data - age verification and consent protections are inconsistent Some questions to consider in acceptable use policies and to incorporate in any training: ❓ What types of sensitive information might your nonprofit staff share with generative AI? ❓ Does your nonprofit currently identify specifically what is considered “sensitive information” (beyond PII) and should not be shared with generative AI? Is this incorporated into training? ❓ Are you working with children, people with health conditions, or others whose data could be particularly harmful if leaked or misused? ❓ What would be the consequences if sensitive information or strategic organizational data ended up being used to train AI models? How might this affect trust, compliance, or your mission? 
How is this communicated in training and policy? Across the board, the Stanford research finds that developers’ privacy policies lack essential information about their practices. The researchers recommend that policymakers and developers address the data privacy challenges posed by LLM-powered chatbots through comprehensive federal privacy regulation, affirmative opt-in for model training, and filtering personal information from chat inputs by default. “We need to promote innovation in privacy-preserving AI, so that user privacy isn’t an afterthought.” How are you advocating for privacy-preserving AI? How are you educating your staff to navigate this challenge? https://lnkd.in/g3RmbEwD
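To make "filtering personal information from chat inputs by default" concrete, here is a minimal sketch of what a pre-submission filter could look like. It is an illustration only, not something taken from the Stanford study: the `redact` function and both patterns are hypothetical, and they catch only simple email addresses and phone numbers. Names, addresses, health details, and free-text identifiers would need a far more capable tool.

```python
import re

# Hypothetical, deliberately simple patterns (assumptions for illustration).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def redact(text: str) -> str:
    """Replace obvious emails and phone numbers with placeholders
    before the text is pasted or sent to a chatbot."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


prompt = "Contact jane@example.org or 555-123-4567 about the grant."
print(redact(prompt))
```

A filter like this could sit in front of whatever chat tool staff use, but it complements an acceptable use policy and training rather than replacing them.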
Training AI Models With Limited Data
-
🚀 SmolVLA is live! We just released a 450M parameter model to control robots with natural language. And it's fully open-source. SmolVLA delivers: ✅ Real-time inference ✅ Strong performance across diverse tasks ✅ Training and deployment recipes that fit on a single consumer GPU How? We gathered all the open @LeRobotHF robotics datasets on the Hugging Face Hub, cleaned them up, and used them to pretrain SmolVLA. This step alone improved downstream success rates by 26%. We also introduced asynchronous inference, so robots can act and react at the same time, a game-changer for fast control. But this isn’t just a model release. It’s a step towards accessible, community-driven robotics. Base models should be built on public data, reproducible code, and affordable hardware! 🛠️ Everything’s open: • Model weights • Code • Data • Demo • Blog 📖 Dive into the blog post to explore the architecture, benchmarks, and how to get started. Check the comments!
-
🦾 Great milestone for open-source robotics: pi0 & pi0.5 by Physical Intelligence are now on Hugging Face, fully ported to PyTorch in LeRobot and validated side-by-side with OpenPI for everyone to experiment with, fine-tune & deploy in their robots! π₀.₅ is a Vision-Language-Action model which represents a significant evolution from π₀ to address a big challenge in robotics: open-world generalization. While robots can perform impressive tasks in controlled environments, π₀.₅ is designed to generalize to entirely new environments and situations that were never seen during training. Generalization must occur at multiple levels: - Physical Level: Understanding how to pick up a spoon (by the handle) or plate (by the edge), even with unseen objects in cluttered environments - Semantic Level: Understanding task semantics, where to put clothes and shoes (laundry hamper, not on the bed), and what tools are appropriate for cleaning spills - Environmental Level: Adapting to "messy" real-world environments like homes, grocery stores, offices, and hospitals The breakthrough innovation in π₀.₅ is co-training on heterogeneous data sources. The model learns from: - Multimodal Web Data: Image captioning, visual question answering, object detection - Verbal Instructions: Humans coaching robots through complex tasks step-by-step - Subtask Commands: High-level semantic behavior labels (e.g., "pick up the pillow" for an unmade bed) - Cross-Embodiment Robot Data: Data from various robot platforms with different capabilities - Multi-Environment Data: Static robots deployed across many different homes - Mobile Manipulation Data: ~400 hours of mobile robot demonstrations This diverse training mixture creates a "curriculum" that enables generalization across physical, visual, and semantic levels simultaneously. Huge thanks to the Physical Intelligence team & contributors Model: https://lnkd.in/eAEr7Yk6 LeRobot: https://lnkd.in/ehzQ3Mqy
-
Struggling to build a data foundation that helps you deploy AI models at scale? Regulation can help. Too often in my professional life I have heard the old adage that regulation is a blocker to innovation. In my experience, what actually impedes innovation is uncertainty, specifically when relevant rules are missing, unclear, or poorly aligned. No doubt this was true for both the GDPR and AI Act, at least in the beginning. What is often overlooked, however, is that these laws also provide notable benefits: among others, guiding organizations on how to approach data-driven innovation in a structured and sensible way. ➡️ How GDPR supports data readiness Art. 5 GDPR requires, e.g., purpose limitation, data minimization, accuracy, integrity, confidentiality, and accountability. Organizations must decide which personal data they need, why, and who is responsible. This amounts to an approach to handling data that is not only responsible but also strategic - and not just for personal data. ➡️ How the AI Act builds on this Art. 6 AI Act links an AI system’s obligations to its intended use and impact on people’s health, safety, and fundamental rights. Art. 10 then mandates data governance requirements for high-risk AI systems, e.g., that training, validation, and test datasets are relevant, representative, complete, and documented. Providers must implement measures covering provenance, cleaning, annotation, assumptions, gap analysis, bias detection, and ongoing monitoring. These rules offer a practical blueprint for AI-ready data. ➡️ Why this matters for AI strategy A strong data foundation improves model performance, but also reveals when AI is not the right tool. A rules-based system might achieve the same outcome with less risk and less complexity. The decision when not to use AI should be part of any good AI strategy too. ➡️ What organizations should do ✅ Define the purpose of processing: What are you trying to achieve? How does this improve the status quo? 
What tradeoffs do you need to consider? ✅ Use Art. 5 GDPR to decide what personal data you need to achieve your processing purpose in the least intrusive way. ✅ Evaluate whether you need AI - or if a rules-based system suffices. ✅ If you do need AI, leverage the AI Act’s Art. 6 intended use test and Art. 10 data governance rules as a readiness checklist. In particular, if it looks like you would be developing or deploying a high-risk AI system, make sure you have the necessary resources to do so. ✅ Create clear roles and responsibilities along the lifecycle of data processing to continuously ensure the quality, consistency, and reliability of data. ✅ Delete data when you no longer need it. This not only saves resources, but minimizes your compliance exposure. Too often, regulation is framed as a constraint. In reality, it can help organizations plan and implement data projects in a strategic and purposeful way. #DataReadiness #AIGovernance #GDPR #AIAct #ResponsibleAI
-
If you are an organisation using AI or you are an AI developer, the Australian privacy regulator has just published some vital information about AI and your privacy obligations. Here is a summary of the new guides for businesses published today by the Office of the Australian Information Commissioner (OAIC), which articulate how Australian privacy law applies to AI and set out the regulator’s expectations. The first guide aims to help businesses comply with their privacy obligations when using commercially available AI products and to help them select an appropriate product. The second provides privacy guidance to developers using personal information to train generative AI models. GUIDE ONE: Guidance on privacy and the use of commercially available AI products Top five takeaways * Privacy obligations will apply to any personal information input into an AI system, as well as the output data generated by AI (where it contains personal information). * Businesses should update their privacy policies and notifications with clear and transparent information about their use of AI. * If AI systems are used to generate or infer personal information, including images, this is a collection of personal information and must comply with APP 3 (which deals with collection of personal info). * If personal information is being input into an AI system, APP 6 requires entities to only use or disclose the information for the primary purpose for which it was collected. * As a matter of best practice, the OAIC recommends that organisations do not enter personal information, and particularly sensitive information, into publicly available generative AI tools. GUIDE TWO: Guidance on privacy and developing and training generative AI models Top five takeaways * Developers must take reasonable steps to ensure accuracy in generative AI models. * Just because data is publicly available or otherwise accessible does not mean it can legally be used to train or fine-tune generative AI models or systems. 
* Developers must take particular care with sensitive information, which generally requires consent to be collected. * Where developers are seeking to use personal information that they already hold for the purpose of training an AI model, and this was not a primary purpose of collection, they need to carefully consider their privacy obligations. * Where a developer cannot clearly establish that a secondary use for an AI-related purpose was within reasonable expectations and related to a primary purpose, to avoid regulatory risk they should seek consent for that use and/or offer individuals a meaningful and informed ability to opt out of such a use. https://lnkd.in/gX_FrtS9
-
Federated learning enables enterprises to leverage private business data to improve Large Language Models (LLMs) while maintaining data privacy and security. This approach allows organizations to train AI models on sensitive information without sharing raw data outside their firewalls. Key providers like Google, NVIDIA, and FATE-LLM offer enterprise solutions for implementing federated learning. A notable healthcare use case demonstrates how multiple hospitals improved cardiovascular risk prediction accuracy by 29% through collaborative model training while keeping patient data secure. This technology is crucial for businesses seeking to enhance their AI capabilities while maintaining data sovereignty and regulatory compliance.
Why Federated Learning is the Killer App
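As a rough illustration of how collaborative training can work without raw data leaving each site, here is a minimal sketch of federated averaging (FedAvg) on a toy linear model. It is an assumption-laden example, not the method used by any provider or by the hospitals mentioned above: the function names, the learning rate, and the toy "hospital" datasets are all hypothetical. Only model weights are exchanged, never the records.

```python
from typing import List, Tuple

Weights = List[float]
Dataset = List[Tuple[List[float], float]]  # (features, target) pairs


def local_update(weights: Weights, data: Dataset,
                 lr: float = 0.1, epochs: int = 5) -> Weights:
    """Run a few gradient-descent steps on one client's private data."""
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2 * err * xj / len(data)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def federated_average(client_weights: List[Weights],
                      client_sizes: List[int]) -> Weights:
    """Server step: average client weights, weighted by sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


def train(clients: List[Dataset], dim: int, rounds: int = 20) -> Weights:
    """Each round, clients train locally and share only their weights."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        updates = [local_update(global_w, data) for data in clients]
        global_w = federated_average(updates, [len(d) for d in clients])
    return global_w


# Two hypothetical sites whose private data both follow y = 2x.
hospital_a = [([1.0], 2.0), ([2.0], 4.0)]
hospital_b = [([3.0], 6.0)]
print(train([hospital_a, hospital_b], dim=1))
```

Real deployments typically add protections such as secure aggregation or differential privacy on top of this, because model updates alone can still leak information about the underlying data.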
-
🚀 We are introducing SmolVLA-450M, an open-source Vision-Language-Action model for robotics! SmolVLA achieves best-in-class performance and inference speed, and the best part? It’s trained entirely on open-source datasets from the 🤖 LeRobot project hosted on the Hugging Face Hub. 🔍 Why is SmolVLA so good? Turns out that pretraining on a large, diverse and noisy collection of real-world community robotics data leads to better generalization and control. We saw a 26% boost in task success rate simply from adding community dataset pretraining! ⚡ Why is SmolVLA so fast? 1. We halved the size of SmolVLM and extracted intermediate representations 2. Introduced interleaved cross- and self-attention layers in the action expert 3. Enabled asynchronous inference so the robot acts and reacts simultaneously 💡 Unlike most academic datasets, these community-contributed datasets are naturally diverse: ✅ Multiple robots, camera angles, and manipulation tasks ✅ Real-world messiness and complexity ✅ Crowd-sourced and community-cleaned using Qwen2.5-VL for high-quality task descriptions 🌍 SmolVLA is a step toward making robotics research more affordable, reproducible, and collaborative. 📖 Want to dive deeper? Check out our blog post & start using it today: https://lnkd.in/e3Gmy8gT Huge thanks to the team who made this possible: Mustafa Shukor, Francesco Capuano, Remi Cadene, and the entire LeRobot team; the amazing HF team, Andrés Marafioti, Merve Noyan, Aritra Roy Gosthipaty, Pedro Cuenca, Loubna Ben Allal, and Thomas Wolf; and the amazing contributors to the LeRobot community: Ville Kuosmanen, Alexandre Chapin, Marina Barannikov, and more!
-
Robotics models are increasingly bulky and difficult to run directly on robots. With Remi Cadene and the team at LeRobot and Hugging Face, we’re changing that. Today, we're introducing SmolVLA, a sub-500M VLA designed for efficient training and inference. We present to the robotics and open-source community three main contributions: 1️⃣ SmolVLA is *small*. We use: - A small pretrained VLM - Fewer visual tokens - A layer-skipping mechanism in the VLM - Interleaved self-/cross-attention layers in the action expert ➡️ This allows us to train on a single consumer-grade GPU, and run SmolVLA efficiently, even on CPUs 💥 2️⃣ We ditch proprietary megadatasets, using community-contributed datasets - SmolVLA is trained end-to-end on <30k episodes - All datasets are community-contributed and publicly available on the Hugging Face Hub - Training SmolVLA is a community effort, resulting in training on ~10x less data than SOTA VLAs ➡️ Across real-world and simulated tasks, we match much larger models trained on 10x more data 💥 3️⃣ SmolVLA is deployed asynchronously, for greater adaptability - Robots shouldn’t lag (let that sink in) - With SmolVLA, we present an asynchronous inference stack decoupling action execution from planning. ➡️ This results in fast, smooth, and resource-efficient control in the real world (2x throughput!) With this, we hope to push open-source robotics research. We're releasing everything, from data, to training and inference recipes---the Hugging Face way 🤗 SmolVLA has been proudly brought to life by the SmolVLA team, Mustafa Shukor, Dana Aubakirova and yours truly 😊, standing on the shoulders of the entire LeRobot team, and with the guidance of Remi Cadene and Thomas Wolf. Interested? Check out our technical report (🔗 link in the first comment!) Thank you to the whole team behind this project & amazing co-authors Pepijn Kooijmans, Michel Aractingi, Adil Zouitine, Martino Russi, Caroline Pascal, Andrés Marafioti, and Thomas Wolf
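The idea of decoupling action execution from planning can be sketched generically as a producer/consumer loop. The following is not SmolVLA's actual inference stack (see the technical report for that); `run_async`, `plan_chunk`, `execute`, and `buffer_size` are hypothetical names chosen for illustration. A planner thread keeps a small buffer of action chunks filled while the main loop executes them, so the robot does not sit idle waiting for the next prediction.

```python
import queue
import threading
from typing import Callable, List


def run_async(plan_chunk: Callable[[int], List[str]],
              num_chunks: int,
              execute: Callable[[str], None],
              buffer_size: int = 2) -> None:
    """Execute actions while the next chunks are planned in the background."""
    chunks: queue.Queue = queue.Queue(maxsize=buffer_size)

    def planner() -> None:
        for step in range(num_chunks):
            # Blocks when the buffer is full, so planning stays
            # only a little ahead of execution.
            chunks.put(plan_chunk(step))
        chunks.put(None)  # sentinel: no more chunks

    worker = threading.Thread(target=planner, daemon=True)
    worker.start()

    while True:
        chunk = chunks.get()  # waits only if the planner has fallen behind
        if chunk is None:
            break
        for action in chunk:
            execute(action)
    worker.join()


# Toy usage: "planning" returns two named actions per chunk.
log: List[str] = []
run_async(lambda i: [f"a{i}.0", f"a{i}.1"], num_chunks=3, execute=log.append)
print(log)
```

In a real controller the planner would condition each new chunk on the latest observation, which this toy version leaves out.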
-
A health system deploys an AI coding tool. Accuracy improves measurably. The vendor asks to use operational data to refine the model for that health system's documentation patterns. The health system's counsel says no. Blanket prohibition, non-negotiable. Do you know about the HIPAA provision that creates a blanket prohibition on using Protected Health Information (PHI) for AI model training? It doesn’t exist. I’ve negotiated AI language in technology transactions from multiple vantage points over the last several years. I’ve requested “no training” language from vendors. I’ve represented healthcare organizations in vendor negotiations. And I’ve responded to this language as a health information technology company serving healthcare provider organizations, health plans, pharmaceutical companies, and more. A pattern recurs: a contractual position on “no model training with PHI” that organizations adopt reflexively but often struggle to ground in a consistent regulatory explanation. Organizational policies and upstream contractual commitments can limit how PHI is used with AI models, and those limitations may be perfectly rational. But they are business constraints, not regulatory ones. The HIPAA Privacy Rule does not prohibit using PHI for AI model training. It provides a framework of data use purposes (e.g., treatment, healthcare operations, research, proper management and administration for business associates) that help determine what permissions and safeguards apply. Also, where PHI is involved, the structural terms of the deal affect the regulatory analysis. And before the characterization analysis begins, there's a threshold question: if the training uses deidentified data under HIPAA, HIPAA's use restrictions don't apply. A "no training" clause that covers deidentified data is restricting something outside HIPAA's scope. 
The healthcare industry would be better served if more organizations worked through HIPAA’s regulatory framework before defaulting to a blanket prohibition. A carefully crafted prohibition may be appropriate in some cases. But it may also foreclose activities that are permissible and beneficial. I’ve seen firsthand how model accuracy improves when models learn from the operational patterns of the healthcare organizations they serve. I’d like to see sharper discourse about using PHI with AI (and separately, creating PHI with AI...). I'm working on a deeper analysis of the HIPAA characterization framework for AI model training using recent real world examples. If this is a conversation you're having internally or if you're negotiating these provisions, I'm interested to hear how your organization is approaching it.
-
World model trained on 44,000 hours of human videos! 🌍 NVIDIA Robotics, UC Berkeley, HKUST, and UT Austin just released DreamDojo, a foundation world model for robots trained on the largest video dataset to date for world model pretraining. The closer it gets to GTC, the more NVIDIA is cooking. 😮💨 The dataset: 44,000 hours of diverse human egocentric videos. That's 15x longer duration, 96x more skills, and 2,000x more scenes than the previously largest dataset for world model training. DreamDojo learns comprehensive physical knowledge from large-scale human data through pre-training with latent actions, then post-trains on specific robot embodiments with continuous robot actions. It generalizes strongly to diverse objects and environments after post-training. The model produces realistic action-conditioned rollouts for GR-1, G1, AgiBot, and YAM robots across wide-ranging environments and object interactions. After distillation, the model achieves long-horizon autoregressive generation with stable real-time interactions at 10 FPS for over 1 minute. The distillation pipeline enables deployment speed comparable to direct policy execution. This enables live teleoperation with real-time rollout generation, reliable policy evaluation without real-world deployment, and model-based planning for test-time improvement. Comparison with baselines shows DreamDojo generates more accurate physical interactions due to large-scale human data pretraining. The model learned physics and object manipulation priors from massive human video data, then transferred this knowledge to robot control. This is the same pattern as language models: pre-train on massive human-generated data (text → video), then fine-tune for specific tasks (completion → robot control). Congrats Jim Fan and team behind this project: https://lnkd.in/dV-nzDip ~~ ♻️ Join the weekly robotics newsletter, and never miss any news → ziegler.substack.com