Data Quality for AI


  • Jim Fan

    NVIDIA Director of AI & Distinguished Scientist. Co-Lead of Project GR00T (Humanoid Robotics) & GEAR Lab. Stanford Ph.D. OpenAI's first intern. Solving Physical AGI, one motor at a time.

    242,506 followers

    Exciting updates on Project GR00T! We discovered a systematic way to scale up robot data, tackling the most painful pain point in robotics. The idea is simple: a human collects demonstrations on a real robot, and we multiply that data 1000x or more in simulation. Let’s break it down:

    1. We use Apple Vision Pro (yes!!) to give the human operator first-person control of the humanoid. Vision Pro parses human hand pose and retargets the motion to the robot hand, all in real time. From the human’s point of view, they are immersed in another body like the Avatar. Teleoperation is slow and time-consuming, but we can afford to collect a small amount of data.

    2. We use RoboCasa, a generative simulation framework, to multiply the demonstration data by varying the visual appearance and layout of the environment. In Jensen’s keynote video, the humanoid is now placing the cup in hundreds of kitchens with a huge diversity of textures, furniture, and object placement. We only have 1 physical kitchen at the GEAR Lab in NVIDIA HQ, but we can conjure up infinite ones in simulation.

    3. Finally, we apply MimicGen, a technique to multiply the above data even more by varying the *motion* of the robot. MimicGen generates a vast number of new action trajectories based on the original human data, and filters out failed ones (e.g. those that drop the cup) to form a much larger dataset.

    To sum up, given 1 human trajectory with Vision Pro -> RoboCasa produces N (varying visuals) -> MimicGen further augments to NxM (varying motions). This is the way to trade compute for expensive human data by GPU-accelerated simulation. A while ago, I mentioned that teleoperation is fundamentally not scalable, because we are always limited by 24 hrs/robot/day in the world of atoms. Our new GR00T synthetic data pipeline breaks this barrier in the world of bits. Scaling has been so much fun for LLMs, and it's finally our turn to have fun in robotics!

    We are creating tools to enable everyone in the ecosystem to scale up with us:
    - RoboCasa: our generative simulation framework (Yuke Zhu). It's fully open-source! Here you go: http://robocasa.ai
    - MimicGen: our generative action framework (Ajay Mandlekar). The code is open-source for robot arms, but we will have another version for humanoid and 5-finger hands: https://lnkd.in/gsRArQXy
    - We are building a state-of-the-art Apple Vision Pro -> humanoid robot "Avatar" stack. Xiaolong Wang group’s open-source libraries laid the foundation: https://lnkd.in/gUYye7yt
    - Watch Jensen's keynote yesterday. He cannot hide his excitement about Project GR00T and robot foundation models! https://lnkd.in/g3hZteCG

    Finally, GEAR lab is hiring! We want the best roboticists in the world to join us on this moon-landing mission to solve physical AGI: https://lnkd.in/gTancpNK
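The "1 trajectory -> N -> NxM" arithmetic in the post above can be sketched in a few lines. This is only an illustration of the counting and filtering idea, not the RoboCasa or MimicGen API: the `Trajectory` class, `multiply_demos`, and the `toy_success` failure rule are all hypothetical names made up for this example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Trajectory:
    demo_id: int     # which human demonstration this came from
    scene_id: int    # visual/layout variant (the "N" step)
    motion_id: int   # motion variant (the "M" step)
    success: bool    # did the simulated rollout complete the task?


def multiply_demos(num_demos, n_scenes, m_motions, is_success):
    """Expand each demo into n_scenes * m_motions candidates, then keep successes."""
    candidates = [
        Trajectory(d, s, m, is_success(d, s, m))
        for d, s, m in product(range(num_demos), range(n_scenes), range(m_motions))
    ]
    kept = [t for t in candidates if t.success]
    return candidates, kept


def toy_success(demo_id, scene_id, motion_id):
    # Hypothetical stand-in for a simulator check: every 5th motion variant fails.
    return motion_id % 5 != 4


candidates, kept = multiply_demos(
    num_demos=1, n_scenes=100, m_motions=10, is_success=toy_success
)
print(len(candidates), "candidates,", len(kept), "kept after filtering")
```

With 1 demonstration, 100 scene variants, and 10 motion variants, the sketch produces 1,000 candidates; the toy rule then discards the failed rollouts, which mirrors the filtering step the post describes.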

  • Amanda Bickerstaff

    Educator | AI for Education Founder | Keynote | Researcher | LinkedIn Top Voice in Education

    92,752 followers

    As GenAI becomes more ubiquitous, research alarmingly shows that women are using these tools at lower rates than men across nearly all regions, sectors, and occupations.

    A recent paper from researchers at Harvard Business School, Berkeley, and Stanford synthesizes data from 18 studies covering more than 140k individuals worldwide.

    Their findings:
    • Women are approximately 22% less likely than men to use GenAI tools
    • Even when controlling for occupation, age, field of study, and location, the gender gap remains
    • Web traffic analysis shows women represent only 42% of ChatGPT users and 31% of Claude users

    Factors Contributing to the Gap:
    - Lack of AI Literacy: Multiple studies showed women reporting significantly lower familiarity with and knowledge about generative AI tools as the largest gender gap driver.
    - Lack of Training & Confidence: Women have lower confidence in their ability to effectively use AI tools and are more likely to report needing training before they can benefit from generative AI.
    - Ethical Concerns & Fears of Judgement: Women are more likely to perceive AI usage as unethical or equivalent to cheating, particularly in educational or assignment contexts. They’re also more concerned about being judged unfairly for using these tools.

    The Potential Impacts:
    - Widening Pay & Opportunity Gap: Considerably lower AI adoption by women creates further risk of them falling behind their male counterparts, ultimately widening the gender gap in pay and job opportunities.
    - Self-Reinforcing Bias: AI systems trained primarily on male-generated data may evolve to serve women's needs poorly, creating a feedback loop that widens existing gender disparities in technology development and adoption.

    As educators and AI literacy advocates, we face an urgent responsibility to close this gap, and simply improving access is not enough. We need targeted AI literacy training programs, organizations committed to developing more ethical GenAI, and safe and supportive communities like our Women in AI + Education community to help bridge this expanding digital divide.

    Link to the full study in the comments. And a link also to learn more or join our Women in AI + Education Community. AI for Education #Equity #GenAI #Ailiteracy #womeninAI

  • Sol Rashidi, MBA
    117,586 followers

    AI is not failing because of bad ideas; it’s "failing" at enterprise scale because of two big gaps:
    👉 Workforce Preparation
    👉 Data Security for AI

    While I speak globally on both topics in depth, today I want to educate us on what it takes to secure data for AI, because 70–82% of AI projects pause or get cancelled at POC/MVP stage (source: #Gartner, #MIT). Why? One of the biggest reasons is a lack of readiness at the data layer. So let’s make it simple: there are 7 phases to securing data for AI, and each phase has direct business risk if ignored.

    🔹 Phase 1: Data Sourcing Security - Validating the origin, ownership, and licensing rights of all ingested data.
    Why It Matters: You can’t build scalable AI with data you don’t own or can’t trace.

    🔹 Phase 2: Data Infrastructure Security - Ensuring data warehouses, lakes, and pipelines that support your AI models are hardened and access-controlled.
    Why It Matters: Unsecured data environments are easy targets for bad actors, exposing you to data breaches, IP theft, and model poisoning.

    🔹 Phase 3: Data In-Transit Security - Protecting data as it moves across internal or external systems, especially between cloud, APIs, and vendors.
    Why It Matters: Intercepted training data = compromised models. Think of it as shipping cash across town in an armored truck or on a bicycle. Your choice.

    🔹 Phase 4: API Security for Foundational Models - Safeguarding the APIs you use to connect with LLMs and third-party GenAI platforms (OpenAI, Anthropic, etc.).
    Why It Matters: Unmonitored API calls can leak sensitive data into public models or expose internal IP. This isn’t just tech debt. It’s reputational and regulatory risk.

    🔹 Phase 5: Foundational Model Protection - Defending your proprietary models and fine-tunes from external inference, theft, or malicious querying.
    Why It Matters: Prompt injection attacks are real. And your enterprise-trained model? It’s a business asset. You lock your office at night; do the same with your models.

    🔹 Phase 6: Incident Response for AI Data Breaches - Having predefined protocols for breaches, hallucinations, or AI-generated harm: who’s notified, who investigates, how damage is mitigated.
    Why It Matters: AI-related incidents are happening. Legal needs response plans. Cyber needs escalation tiers.

    🔹 Phase 7: CI/CD for Models (with Security Hooks) - Continuous integration and delivery pipelines for models, embedded with testing, governance, and version-control protocols.
    Why It Matters: Shipping models like software means risk comes faster, and so must detection. Governance must be baked into every deployment sprint.

    Want your AI strategy to succeed past MVP? Focus and lock down the data.

    #AI #DataSecurity #AILeadership #Cybersecurity #FutureOfWork #ResponsibleAI #SolRashidi #Data #Leadership

  • Prukalpa ⚡

    Founder & Co-CEO at Atlan, The Context Layer for AI

    55,501 followers

    𝟯𝟴% 𝗯𝗲𝘁𝘁𝗲𝗿 𝗔𝗜 𝗮𝗰𝗰𝘂𝗿𝗮𝗰𝘆. 𝗡𝗼 𝗻𝗲𝘄 𝗺𝗼𝗱𝗲𝗹. 𝗡𝗼 𝗻𝗲𝘄 𝗱𝗮𝘁𝗮. 𝗝𝘂𝘀𝘁 𝗯𝗲𝘁𝘁𝗲𝗿 𝗰𝗼𝗻𝘁𝗲𝘅𝘁. That’s the headline from a controlled NL-to-SQL experiment discussed by Manoj Shanmugasundaram in Metadata Weekly. Across 522 query evaluations, the only variable that changed was context quality — and it made all the difference. Concise, high-signal context (business definitions, SQL patterns, domain rules) drove a 38% accuracy gain. Verbose, catalog-style documentation? Performance dropped and costs rose. More words diluted the signal. The biggest lift wasn't on simple or extreme queries. It was on medium-complexity ones — the joins and aggregations that make up everyday analytics work — where focused context delivered a 𝟮.𝟭𝟱𝘅 𝗶𝗺𝗽𝗿𝗼𝘃𝗲𝗺𝗲𝗻𝘁. The mindset shift: metadata was built for humans to browse. Now it also needs to work for machines to reason. Most teams are still optimizing for readability, not machine usability. If your "talk to data" initiative is stalling, it might not be a model problem. It might be a context problem. Manoj breaks down what machine-usable context actually looks like — and how to get started without rebuilding your stack. Read it in Metadata Weekly 👇
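One way to picture "concise, high-signal context" is a helper that packs business definitions, SQL patterns, and domain rules into a prompt prefix under a fixed size budget. The sketch below is an assumption about how such a helper might look, not the setup used in the experiment: `build_context`, the `max_chars` budget, and the sample definitions are hypothetical.

```python
def build_context(definitions, sql_patterns, rules, max_chars=400):
    """Join high-signal snippets into one context string, stopping at max_chars."""
    entries = (
        [f"DEFINITION: {term} = {meaning}" for term, meaning in definitions.items()]
        + [f"PATTERN: {pattern}" for pattern in sql_patterns]
        + [f"RULE: {rule}" for rule in rules]
    )
    kept, used = [], 0
    for entry in entries:
        cost = len(entry) + 1  # +1 for the newline separator
        if used + cost > max_chars:
            break  # budget reached: drop the remaining, lower-priority entries
        kept.append(entry)
        used += cost
    return "\n".join(kept)


# Hypothetical sample inputs, for illustration only.
context = build_context(
    definitions={"active customer": "a customer with an order in the last 90 days"},
    sql_patterns=["JOIN orders o ON o.customer_id = c.id"],
    rules=["Revenue excludes refunded orders"],
)
print(context)
```

Entries are added in priority order, so a tight budget keeps definitions first and drops lower-priority material instead of padding the prompt.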

  • Pooja Jain

    Open to collaboration | Storyteller | Lead Data Engineer@Wavicle| Linkedin Top Voice 2025,2024 | Linkedin Learning Instructor | 2xGCP & AWS Certified | LICAP’2022

    195,814 followers

    You wouldn't cook a meal with rotten ingredients, right? Yet businesses pump messy data into AI models daily and wonder why their insights taste off. Without quality, even the most advanced systems churn out unreliable insights. Let’s keep it simple: how do we make sure our “ingredients” stay fresh?

    Start Smart:
    → Know what matters: Identify your critical data (customer IDs, revenue, transactions)
    → Pick your battles: Monitor high-impact tables first, not everything at once

    Build the Guardrails:
    → Set clear rules: Is data arriving on time? Is anything missing? Are formats consistent?
    → Automate checks: Embed validations in your pipelines (Airflow, Prefect) to catch issues before they spread
    → Test in slices: Check daily or weekly chunks first to spot problems early and fix them fast

    Stay Alert (But Not Overwhelmed):
    → Tune your alarms: Too many false alerts = team burnout. Adjust thresholds to match real patterns
    → Build dashboards: Visual KPIs help everyone see what's healthy and what's breaking

    Fix It Right:
    → Dig into logs when things break: schema changes? Missing files?
    → Refresh everything downstream: Fix the source, then update dependent dashboards and reports
    → Validate your fix: Rerun checks, confirm KPIs improve before moving on

    Now, in the era of AI, data quality deserves even sharper focus. Models amplify what data feeds them; they can’t fix your bad ingredients.
    → Garbage in = hallucinations out. LLMs amplify bad data exponentially
    → Bias detection starts with clean, representative datasets
    → Automate quality checks using AI itself: anomaly detection, schema drift monitoring
    → Version your data like code: Track lineage, changes, and rollback when needed

    Here's the amazing step-by-step guide curated by DQOps - Piotr Czarnas to dive deep into the fundamentals of Data Quality. Clean data isn’t a process; it’s a discipline.

    💬 What's your biggest data quality challenge right now?
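The three "clear rules" in the post above (is data on time, is anything missing, are formats consistent) map naturally onto small boolean checks. The following is a minimal sketch using only the standard library; the field names, the `C` plus four digits ID format, and the 24-hour threshold are assumptions for illustration, and in practice a function like this would be called from a pipeline task (for example in Airflow or Prefect).

```python
import re
from datetime import datetime, timedelta

ID_PATTERN = re.compile(r"^C\d{4}$")  # hypothetical customer ID format


def check_batch(rows, loaded_at, now, max_age=timedelta(hours=24)):
    """Return check name -> True/False for one batch of records."""
    return {
        # Timeliness: did the batch arrive within the allowed window?
        "freshness": now - loaded_at <= max_age,
        # Completeness: every row has an ID and a revenue value.
        "completeness": all(
            bool(r.get("customer_id")) and r.get("revenue") is not None
            for r in rows
        ),
        # Format consistency: every ID matches the expected pattern.
        "format": all(
            ID_PATTERN.match(r.get("customer_id") or "") is not None
            for r in rows
        ),
    }


rows = [
    {"customer_id": "C0001", "revenue": 120.0},
    {"customer_id": "c-17", "revenue": None},  # bad ID format, missing revenue
]
now = datetime(2025, 1, 1, 12, 0)
report = check_batch(rows, loaded_at=now - timedelta(hours=2), now=now)
print(report)
```

Returning a dictionary of named results makes it easy to alert on specific failures and to tune thresholds per check, which fits the "tune your alarms" advice.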

  • Aishwarya Srinivasan
    635,204 followers

    One of the hardest parts of fine-tuning models? Getting high-quality data without breaching compliance. This Synthetic Data Generator Pipeline is built to solve exactly that, and it is open-source for you to use! You can now generate task-specific, high-quality synthetic datasets without using a single piece of real data, and still fine-tune performant models. Here’s what makes it different:

    → LLM-driven config generation: Start with a simple prompt describing your task. The pipeline auto-generates YAMLs with structured I/O schemas, filters for diversity, and LLM-based evaluation criteria.
    → Streaming synthetic data generation: The system emits JSON-formatted examples (prompt, response, metadata) at scale. Each example includes row-level quality scores. You get transparency at both data and job level.
    → SFT + RFT with evaluator feedback: We use models like DeepSeek R1 as judges. Low-quality clusters are automatically identified and regenerated. Each iteration teaches the model what “good” looks like.
    → Closed-loop optimization: The pipeline fine-tunes itself, adjusting decoding params, enriching prompt structures, or expanding label schemas based on what’s missing.
    → Zero reliance on sensitive data: No PII. No customer data. This is purpose-built for enterprise, healthcare, finance, and anyone who’s building responsibly.

    And it works:
    📊 On an internal benchmark:
    - SFT with real, curated data: 79% accuracy
    - RFT with synthetic-only data: 73% accuracy

    That’s huge, especially when your hands are tied on data access. If you’re building copilots, vertical agents, or domain-specific models and want to skip the data wrangling phase, this is for you.

    Built by Fireworks AI
    🔗 Try it out: https://lnkd.in/dXXDdyuM
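The idea of row-level quality scores driving regeneration can be illustrated with a short filter over JSON Lines records. This is a sketch under assumptions, not the pipeline's actual output format: the `quality_score` field name, the 0.7 threshold, and the sample rows are hypothetical.

```python
import json


def split_by_quality(jsonl_text, threshold=0.7):
    """Split JSON Lines examples into rows to keep and rows to regenerate."""
    keep, regenerate = [], []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        # 'quality_score' is an assumed field name for the judge's score.
        target = keep if row["quality_score"] >= threshold else regenerate
        target.append(row)
    return keep, regenerate


# Hypothetical sample output, for illustration only.
sample = "\n".join([
    json.dumps({"prompt": "Summarize the ticket", "response": "...", "quality_score": 0.92}),
    json.dumps({"prompt": "Classify the intent", "response": "...", "quality_score": 0.41}),
    json.dumps({"prompt": "Extract the date", "response": "...", "quality_score": 0.70}),
])
keep, regenerate = split_by_quality(sample)
print(len(keep), "kept;", len(regenerate), "queued for regeneration")
```

Rows below the threshold are set aside instead of deleted, so a later pass can regenerate them and score them again.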

  • Jayashankar Attupurathu

    Turning AI ambition into outcomes | CTO/CTPO | Credit Suisse · HSBC · Citicorp | Building in India

    7,999 followers

    Your cleanest data might not be your most useful data for AI. We've spent decades building clean, governed, audited data estates. Structured tables. Standardised labels. Perfectly reconciled records. It works well for reporting. But AI systems don’t just learn from clean data. They learn from 𝐜𝐨𝐧𝐭𝐞𝐱𝐭-𝐝𝐫𝐢𝐯𝐞𝐧 𝐝𝐚𝐭𝐚. Sensor readings that freeze. Logs with inconsistencies. Categories that evolve over time. This is the data most systems try to eliminate. It’s also the data that often makes models robust. Because “good data” in AI isn’t about cleanliness. It’s about 𝐟𝐢𝐭 𝐟𝐨𝐫 𝐭𝐡𝐞 𝐩𝐫𝐨𝐛𝐥𝐞𝐦 𝐛𝐞𝐢𝐧𝐠 𝐬𝐨𝐥𝐯𝐞𝐝. Most enterprise data systems are optimized for: → Accuracy → Consistency → Auditability But AI systems depend on: → Variation → Edge cases → Imperfect signals That mismatch is where performance quietly lags behind. Data preparation becomes the hidden bottleneck. It doesn’t ship features. It doesn’t get board visibility. But when it fails, outputs look confident and wrong. 𝐓𝐡𝐞 𝐬𝐡𝐢𝐟𝐭 𝐢𝐬 𝐬𝐢𝐦𝐩𝐥𝐞. 𝐓𝐡𝐞 𝐞𝐱𝐞𝐜𝐮𝐭𝐢𝐨𝐧 𝐢𝐬𝐧’𝐭. Adopt these 3 moves to optimize your execution: → Redefine “good data” as use-case fit, not just cleanliness → Move teams beyond ETL into AI-specific validation → Make data preparation visible in planning and budgets The next AI advantage won’t come from better models. It will come from how well your data reflects reality, not 𝐡𝐨𝐰 𝐜𝐥𝐞𝐚𝐧 𝐢𝐭 𝐥𝐨𝐨𝐤𝐬 𝐨𝐧 𝐩𝐚𝐩𝐞𝐫. #ArtificialIntelligence #MachineLearning #DataScience #AIEngineering #TechLeadership

  • Greg Coquillo

    AI Infrastructure Product Leader | Scaling GPU Clusters for Frontier Models | Microsoft Azure AI & HPC | Former AWS, Amazon | Startup Investor | Linkedin Top Voice | I build the infrastructure that allows AI to scale

    231,594 followers

    Serious question: Which of these 12 foundations is missing in your current AI architecture? Very few talk about what actually makes AI Agents work in production. It’s not prompts. It’s not models. It’s data foundations. Agentic AI systems don’t run on magic. They run on ingestion pipelines, governed datasets, vector retrieval, streaming events, and reliable storage layers. Without strong data infrastructure, agents hallucinate, break workflows, and make unsafe decisions. This guide breaks down the 12 data foundations every production-grade agentic system needs:

    1. Data Ingestion – Brings data from apps, APIs, and files into unified raw storage.
    2. ETL / ELT Pipelines – Cleans, validates, and transforms raw inputs into analytics-ready datasets.
    3. Feature Stores – Centralize reusable features for consistent training and real-time inference.
    4. Vector Pipelines – Power RAG by chunking documents, generating embeddings, and enabling semantic retrieval.
    5. Metadata Management – Captures schemas, ownership, and tags so agents understand available data.
    6. Data Governance – Enforces policies, access controls, audits, and compliance across all data assets.
    7. Data Quality Checks – Detect anomalies early and prevent bad data from silently breaking agents.
    8. Data Lineage – Tracks data from source to consumption for traceability and impact analysis.
    9. Data Warehouses & Lakes – Provide centralized analytical storage queried by humans, models, and agents.
    10. Streaming Data – Enables real-time ingestion so agents can react instantly to events.
    11. Data Labeling – Converts raw samples into training-ready datasets through human and AI feedback.
    12. Data Versioning – Makes experiments reproducible and production rollbacks possible.

    Together, these form the operating backbone of Agentic AI. Models reason. Agents act. But data determines whether they succeed in the real world. If your agent stack lacks even a few of these layers, you don’t have Agentic AI yet - you have demos.
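The chunking step named under Vector Pipelines can be sketched as a sliding window over words. This covers only the chunking part, assuming word-based windows with overlap; embedding generation and retrieval are out of scope here, and `chunk_words` with its default sizes is a hypothetical helper, not any particular library's API.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into word windows of chunk_size, each sharing `overlap` words
    with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 120-word document, for illustration only.
document = " ".join(f"w{i}" for i in range(120))
chunks = chunk_words(document, chunk_size=50, overlap=10)
print(len(chunks), "chunks")
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of storing some words twice.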

  • Clare Kitching

    Transform your AI & data ambition into action | xQuantumBlack, xMcKinsey | Global top 100 Innovators in Data & Analytics | AI & data strategy, governance and capability building

    75,687 followers

    Data isn't the hard part. Understanding each other is. Ontology. Lineage. Semantic layers. Vector databases. I've been in data for over 15 years, and sometimes even I feel like I'm decoding a foreign language. We've turned simple ideas into jargon that makes non-data people tune out. Here's what these terms actually mean and why they matter for AI: ▶️ Ontology A shared definition of your core business concepts and how they relate. It gives AI clear concepts to reason about instead of guessing. ▶️ Entity A real world thing like a customer, product or event. It helps AI tell the difference between people, products and moments in time. ▶️ Metadata Data that explains other data. It tells AI what something means, how fresh it is and whether it can be trusted. ▶️ Physical layer Where data is stored and processed. It shapes how fast, scalable and reliable AI workloads can be. ▶️ Logical layer How data is organised conceptually, not physically. It shields AI from raw technical mess. ▶️ Semantic layer A business friendly layer with agreed definitions and metrics. It stops humans and AI arguing over what a number actually means. ▶️ Schema The formal structure of what data exists and what type it is. It gives consistency so AI knows what to expect. ▶️ Data modelling How entities and their relationships are designed. It reduces confusion in how AI interprets data. ▶️ Data virtualisation Accessing data from many sources without copying it all. It lets AI work across systems seamlessly. ▶️ Vector database A database that searches by similarity, not exact matches. It enables richer retrieval and context for AI. ▶️ Data pipeline How data flows from creation to consumption. It keeps AI fed with timely and relevant inputs. ▶️ Orchestration Coordinating when and how pipelines run. It keeps jobs reliable and in the right order. ▶️ Data quality How accurate, complete and consistent the data is. It directly affects confidence in AI outputs. 
▶️ Observability Seeing what data systems are doing and spotting issues early. It helps catch drift and weird behaviour before damage is done. ▶️ Data lineage Where data comes from, how it changes and where it’s used. It adds transparency and explainability to AI decisions. None of this is magic. But together, it’s the foundation AI stands on. What other terms would you add as essential? ♻️ Repost to help someone get their idea into action. 🔔 Follow Clare Kitching for insights on unlocking value with data & AI. 💎 Get more from me with my free newsletter here: https://lnkd.in/giQ3b6Fi
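To make "searches by similarity, not exact matches" concrete, here is a toy in-memory version of what a vector database does at query time. It is a sketch only: real systems use learned embeddings and approximate nearest-neighbor indexes, while the three-dimensional vectors, the document names, and the `top_k` helper below are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, index, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(index.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
index = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.0],
    "returns process": [0.8, 0.2, 0.1],
}
results = top_k([1.0, 0.0, 0.0], index, k=2)
print(results)
```

The query shares no exact key with any stored document, yet the two entries pointing in nearly the same direction are returned first.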

  • Andreas Horn

    Head of AIOps @ IBM || Speaker | Lecturer | Advisor

    245,293 followers

    𝗘𝘃𝗲𝗿𝘆𝗼𝗻𝗲 𝘄𝗮𝗻𝘁𝘀 “𝗔𝗜 𝘀𝗼𝗹𝘂𝘁𝗶𝗼𝗻𝘀.” 𝗔𝗹𝗺𝗼𝘀𝘁 𝗻𝗼𝗯𝗼𝗱𝘆 𝘄𝗮𝗻𝘁𝘀 “𝗱𝗮𝘁𝗮 𝗳𝘂𝗻𝗱𝗮𝗺𝗲𝗻𝘁𝗮𝗹𝘀.” That’s the real reason most GenAI projects stall after the demo. Because the bottleneck was never the model. The bottleneck is the data layer you built (or didn’t) over the last 5-10 years. 𝗪𝗵𝗮𝘁 𝗶𝘀 𝗵𝗮𝗽𝗽𝗲𝗻𝗶𝗻𝗴 𝗶𝗻 𝗽𝗿𝗮𝗰𝘁𝗶𝗰𝗲: 1. Teams start with use cases and prompts 2. Then they realize they can’t find the right data 3. Or the data is inconsistent across systems 4. Or access takes weeks because governance is missing 5. Or nobody trusts outputs because lineage and quality are unclear So the “AI initiative” becomes a political negotiation around data. Your AI capability will not exceed your data capability. If you want AI that ships, scales, and survives audits, treat this as the real roadmap: → Define data products (owners, SLAs, consumers) → Fix identity, permissions, and access paths (fast, controlled) → Instrument quality (freshness, completeness, consistency) → Build lineage you can show to Risk and Compliance → Close the loop: capture feedback from users back into data The pattern is consistent: real GenAI ships only where data is industrialized. Yes, it’s less sexy than a new agent demo. But it’s the difference between a prototype and a platform. ↓ 𝗜𝗳 𝘆𝗼𝘂 𝘄𝗮𝗻𝘁 𝘁𝗼 𝘀𝘁𝗮𝘆 𝗮𝗵𝗲𝗮𝗱 𝗮𝘀 𝗔𝗜 𝗿𝗲𝘀𝗵𝗮𝗽𝗲𝘀 𝘄𝗼𝗿𝗸 𝗮𝗻𝗱 𝗯𝘂𝘀𝗶𝗻𝗲𝘀𝘀, 𝘆𝗼𝘂 𝘄𝗶𝗹𝗹 𝗴𝗲𝘁 𝗮 𝗹𝗼𝘁 𝗼𝗳 𝘃𝗮𝗹𝘂𝗲 𝗳𝗿𝗼𝗺 𝗺𝘆 𝗳𝗿𝗲𝗲 𝗻𝗲𝘄𝘀𝗹𝗲𝘁𝘁𝗲𝗿: https://lnkd.in/dbf74Y9E
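The first roadmap item, defining data products with owners, SLAs, and consumers, can be expressed as a small record type. This is one possible shape, assuming a single freshness SLA per product; the `DataProduct` class, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class DataProduct:
    name: str
    owner: str                                      # accountable team or person
    consumers: list = field(default_factory=list)   # downstream users and systems
    freshness_sla: timedelta = timedelta(hours=24)  # maximum allowed data age

    def meets_freshness_sla(self, last_updated, now):
        """True if the data is no older than the agreed SLA."""
        return now - last_updated <= self.freshness_sla


# Hypothetical example values.
orders = DataProduct(
    name="orders_daily",
    owner="sales-data-team",
    consumers=["finance-dashboard", "support-agent"],
    freshness_sla=timedelta(hours=6),
)
now = datetime(2025, 1, 1, 12, 0)
fresh = orders.meets_freshness_sla(datetime(2025, 1, 1, 9, 0), now)
stale = orders.meets_freshness_sla(datetime(2024, 12, 31, 12, 0), now)
print(fresh, stale)
```

Writing the owner and SLA down next to the data gives quality instrumentation something explicit to measure against.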
