Important Notice: The content in this organization's repositories has been generated with AI assistance and is currently undergoing human review and verification. While we strive for accuracy, the content may contain errors, inaccuracies, or outdated information.
Status: 🔄 Human review and runtime validation in progress
Please use this content as a learning resource with appropriate caution. We recommend:
- Cross-referencing with official documentation
- Testing all code examples in a safe environment
- Reporting any errors or inaccuracies via GitHub issues or discussions
We appreciate your understanding as we continue improving content quality and accuracy.
A comprehensive, hands-on learning path for AI Infrastructure Engineers at all levels - from entry-level to principal roles.
This curriculum provides production-focused training for AI Infrastructure Engineers, covering everything from foundational Python and Kubernetes to distributed training, LLM infrastructure, MLOps, platform engineering, security, and enterprise architecture.
The organization is in a different phase than it was in late 2025: the repositories are largely in place, and the main work is now depth, validation, and human review rather than standing up missing repositories.
Total Content:
- 🏢 27 Organization Repositories (24 curriculum + 3 support)
- 📚 12 Learning Tracks (Junior → Principal levels)
- ✅ 12 Solutions Repositories
- 🎓 525+ Hands-On Exercises
- 🚀 45+ Real-World Projects
- ⏱️ 6,000+ Hours of learning material
Recent Org Updates (May 2026):
- 🚀 Engineer Solutions Buildout Complete (May 26, 2026) - All 23 exercise solutions in `ai-infra-engineer-solutions` (covering modules mod-102 through mod-110: cloud computing, containerization, Kubernetes, data pipelines, MLOps, GPU computing, monitoring, IaC, LLM infrastructure) shipped with real, runnable implementations. ~20,000 LOC across 114 source files, 500+ passing tests, every exercise has a working CLI demo. No more `# TODO: Implement` stubs in the engineer-solutions repo.
- 🧹 Placeholder Cleanup Pass (May 26, 2026) - 4 stale `PLACEHOLDER - Content Coming Soon` README banners replaced with accurate content (principal-architect-solutions, principal-engineer-solutions, team-lead-solutions, security-solutions; all repos have real content but the original scaffolding banners were never removed). Stripped "Coming Soon" Slack/Twitter/Newsletter/Website links across engineer + mlops + ml-platform learning repos. Removed 5 stale "lecture materials in development" notices from junior-engineer modules whose content has been live for weeks. Completed the mod-103 GPU-docker lecture exercise. 11 of 25 repos now show 0 placeholder markers (up from 4).
- 🎯 Capstone Projects Completed (May 26, 2026) - 11 new project specs added: 5 ML Platform capstones (platform-core, feature-store, workflow-orchestration, model-registry, developer-portal; 355 hours total) + 5 Senior Architect strategic capstones (project-402 through project-406; 285 hours total). Both tracks now structurally complete at the project layer.
- 🛡️ Security Track Completed (May 26, 2026) - All 12 security-learning modules now published (foundations → zero-trust → cryptography → network → secrets → adversarial ML → compliance → runtime → policy → supply chain → SecOps → capstone), ~335 hours, ~105K words of curriculum
- 📚 SOLUTION.md Sweep - 15 new design-rationale docs across all 12 solutions repos (project-level `SOLUTION.md` files plus track-level `SOLUTION_OVERVIEW.md` for module-only repos)
- 🔄 Full Curriculum Refresh (May 23, 2026) - 21 repos updated, 40+ commits, 350+ new exercises, and 50+ new modules
- 🗺️ Curriculum Cross-Reference - role and skill progression mapping across tracks
- 📈 Career Progression Guide - career ladder from junior through principal architect
- 📝 Engineer Answer Keys - 248 quiz questions organized across the 10 engineer modules
- 🧱 Advanced and Leadership Tracks Published - specialization, senior, architect, and leadership tiers now have live curriculum structure
Entry Level (0-2 years)
↓
Junior Engineer → Engineer
↓
Intermediate (2-4 years)
↓
┌─────────────────────┬──────────────────────┬─────────────────────────┐
│ │ │ │
MLOps Engineer ML Platform Engineer Performance Engineer Security Engineer
│ │ │ │
└─────────────────────┴──────────────────────┴─────────────────────────┘
↓
Advanced (4-6 years)
↓
Senior Engineer ────────────→ Architect
↓ ↓
Leadership (6-8 years) Advanced Arch (8-10 years)
↓ ↓
Team Lead ───────────────→ Senior Architect
↓ ↓
Principal Level (8-15+ years)
↓ ↓
Principal Engineer ──────→ Principal Architect
- Junior Engineer. Time: 440 hours. Status: ✅ Complete. Coverage: 67 hands-on exercises across 10 modules, 5 capstone projects, 53 reference solutions.
- Engineer. Time: 440 hours. Status: ✅ Complete. Coverage: 181 hands-on exercises across 10 modules, 3 production-system projects, 122 reference solutions.
- MLOps. Time: 580 hours. Status: 🟡 Published (10 modules, 5 projects; review ongoing). Coverage: 50 hands-on exercises across 10 modules, 5 capstone projects, 55 reference solutions.
- ML Platform. Time: 600-700 hours. Status: 🟡 Published (9 modules + 5 projects). Coverage: 45 hands-on exercises across 9 modules, 5 capstone projects (added May 26, 2026), 45 reference solutions.
- Performance. Time: 200-250 hours. Status: 🟡 Published (8 modules + 3 projects). Coverage: 41 hands-on exercises across 8 modules, 3 optimization-focused projects, 40 reference solutions (with autograders).
- Security. Time: 335 hours (12 modules). Status: ✅ Complete (12 modules + 5 projects + capstone). Coverage: 61 hands-on exercises across 12 modules, 5 project implementations + capstone synthesis (NorthBridge Health), 5 project-level reference solutions.
- Senior Engineer. Time: 400-500 hours. Status: 🟡 Published (10 modules + 4 projects). Coverage: 36 hands-on exercises across 10 modules, 4 capstone projects, 54 reference solutions.
- Architect. Time: 600 hours. Status: 🟡 Published (10 modules + 5 projects). Coverage: 50 hands-on exercises across 10 modules, 5 architecture projects, 55 reference solutions.
- Team Lead. Time: 500 hours. Status: 🟡 Strategic track live (5 modules + 5 projects). Coverage: 25 hands-on exercises across 5 modules, 5 leadership projects, 25 reference solutions.
- Senior Architect. Time: 420 hours. Status: 🟡 Strategic modules live (10 modules + 6 projects). Coverage: 45 hands-on exercises across 10 modules, 51 reference solutions; 6 capstone projects in single-README format (more depth still needed).
- Principal Engineer. Time: 680 hours. Status: 🟡 Strategic track live (5 modules + 5 projects). Coverage: 25 hands-on exercises across 5 modules, 5 high-impact projects, 25 reference solutions.
- Principal Architect. Time: 520 hours. Status: 🟡 Strategic track live (5 modules + 5 projects). Coverage: 25 hands-on exercises across 5 modules, 5 strategic projects, 25 reference solutions.
| Track | Status | Current Coverage | Notes |
|---|---|---|---|
| Junior Engineer | ✅ Complete | 10 modules, 5 projects, 67 exercises, 53 reference solutions | Best starting point for new learners |
| Engineer | ✅ Complete | 10 modules, 3 projects, 181 exercises, 122 reference solutions | Strongest hands-on core track |
| MLOps | 🟡 Published | 10 modules, 5 projects, 50 exercises, 55 reference solutions | Validation and review ongoing |
| ML Platform | 🟡 Published | 9 modules, 5 projects, 45 exercises, 45 reference solutions | Capstone projects added May 26, 2026; track structurally complete |
| Performance | 🟡 Published | 8 modules, 3 projects, 41 exercises, 40 reference solutions | Good depth in core modules |
| Security | ✅ Complete | 12 modules, 5 projects + capstone, 61 exercises | Full track with NorthBridge Health capstone synthesis |
| Senior Engineer | 🟡 Published | 10 modules, 4 projects, 36 exercises, 54 reference solutions | Needs continued depth passes |
| Architect | 🟡 Published | 10 modules, 5 projects, 50 exercises, 55 reference solutions | Structurally strong, still maturing |
| Senior Architect | 🟡 Strategic Live | 10 modules, 6 projects, 45 exercises, 51 reference solutions | Projects use a single-README format; deeper artifacts still needed |
| Team Lead | 🟡 Strategic Live | 5 modules, 5 projects, 25 exercises, 25 reference solutions | Leadership scaffolds are live |
| Principal Engineer | 🟡 Strategic Live | 5 modules, 5 projects, 25 exercises, 25 reference solutions | Strategic scaffolds are live |
| Principal Architect | 🟡 Strategic Live | 5 modules, 5 projects, 25 exercises, 25 reference solutions | Strategic scaffolds are live |
The audits in _meta/QUALITY_REPORT.md and _meta/EXERCISE_SOLUTION_PARITY.md (both regenerated 2026-05-26) surface the following concrete gaps. Every module exists in the planned modules' directory; every learning exercise has a corresponding solution where one is expected. The remaining work is filling in specific files and project specs.
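As a rough illustration of what an exercise/solution parity check involves, the sketch below compares two directory trees and lists exercises that have no matching solution directory. It is not the organization's actual audit code: the function name `find_missing_solutions` and the `<module>/exercises/<exercise>/` layout are assumptions made for the example.

```python
from pathlib import Path


def find_missing_solutions(learning_root: Path, solutions_root: Path) -> list[str]:
    """List exercises (relative paths) that have no matching solution directory.

    Assumed, hypothetical layout in both repos: <root>/<module>/exercises/<exercise>/
    """
    missing = []
    for exercise_dir in sorted(learning_root.glob("*/exercises/*")):
        if not exercise_dir.is_dir():
            continue  # ignore stray files such as a README next to the exercises
        relative = exercise_dir.relative_to(learning_root)
        if not (solutions_root / relative).is_dir():
            missing.append(relative.as_posix())
    return missing
```

A check like this only confirms that a directory exists. It cannot tell whether a solution is expected for a given exercise, so design-based exercises would still show up as missing.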
✅ All resolved in the May 26, 2026 pass. The audit-flagged missing files have been added:
- Junior Engineer Learning: 7 new `resources.md` files added (mod-001, 003, 004, 005, 007, 008, 010).
- Engineer Learning: `resources.md` added to mod-101-foundations.
- Senior Engineer Learning: 10 `quiz.md` files relocated from `exercises/quiz.md` to module root via `git mv` (history preserved).
- Senior Architect Learning: mod-401-enterprise-ai-strategy's empty `lecture-notes/` now contains `01-overview.md` matching the peer-module pattern.
Track scores improved correspondingly: Junior 59 → 76, Senior Engineer 51 → 75, Senior Architect 66 → 77, Engineer 55 → 79 (the audit script was updated on May 26, 2026 to recognize the engineer-track numbered-lecture convention — no more false positives).
✅ All resolved in the May 26, 2026 pass.
ML Platform Engineer — all 5 planned capstone projects now live with the full 4-file scaffold (README + architecture + requirements + STEP_BY_STEP):
- ✅ `project-01-platform-core` (80h): Self-Service ML Platform Core
- ✅ `project-02-feature-store` (70h): Enterprise Feature Store
- ✅ `project-03-workflow-orchestration` (75h): ML Workflow Orchestration Engine
- ✅ `project-04-model-registry` (70h): Model Management System
- ✅ `project-05-developer-portal` (60h): Developer Portal & SDK
Senior Architect — all 6 planned capstones now live (single-README strategic-deliverable format):
- ✅ `project-401-ai-transformation-strategy` (60h)
- ✅ `project-402-global-ai-platform-architecture` (70h)
- ✅ `project-403-responsible-ai-framework` (60h)
- ✅ `project-404-innovation-program-design` (50h)
- ✅ `project-405-industry-thought-leadership` (50h)
- ✅ `project-406-enterprise-governance-model` (55h)
- Leadership and principal tracks (Team Lead, Principal Engineer, Principal Architect): Each has the full 5-module / 5-project / 25-exercise scaffold checked in, but lecture content depth is shallower than the Engineer / MLOps / Security tracks. The strategic content is appropriate to the tier; the per-module lecture depth is the next iteration.
- ML Platform Engineer: 9 deep modules + 45 hands-on exercises + 5 capstone projects (added May 26, 2026). Track structurally complete.
- Senior Engineer: Lab structure is now 5 labs per module after the May 2026 parity pass; the labs themselves are senior-scale framings pointing back to the engineer-track for implementation depth — fuller standalone lab content is the natural next iteration.
- Engineer Solutions implementation depth: ✅ Resolved May 26, 2026. All 114 originally-scaffolded Python files across 23 exercises in modules `mod-102` through `mod-110` now have real, runnable, tested implementations (~20,000 LOC + 500+ passing tests). Each exercise ships with a working CLI demo, a pluggable abstraction for cloud-API integration, and unit + integration tests that run end-to-end without cloud credentials. Track is now structurally and substantively complete.
- Security Engineer: The parity audit lists 61 "missing solutions" for security-learning. These are design-based exercises (write a threat model, produce a DPIA, conduct a tabletop). They are answered by rubrics and the 5 project-level `SOLUTION.md` files in `ai-infra-security-solutions`, not by per-exercise reference code. The discrepancy is structural, not a content gap.
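The Engineer Solutions item above mentions a pluggable abstraction for cloud-API integration that lets tests run without cloud credentials. A minimal sketch of that general pattern follows. All names here (`ObjectStore`, `InMemoryObjectStore`, `publish_model`, the key layout) are hypothetical and are not taken from the solutions repo.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical storage interface; a real adapter would wrap a cloud SDK."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryObjectStore:
    """Test double that satisfies ObjectStore without any credentials."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def publish_model(store: ObjectStore, name: str, weights: bytes) -> str:
    """Upload model weights and return the key they were stored under."""
    key = f"models/{name}/weights.bin"  # illustrative key layout
    store.put(key, weights)
    return key
```

Because `publish_model` depends only on the interface, a test can pass an `InMemoryObjectStore`, while a deployment would pass an adapter backed by a real cloud client.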
- Runtime validation: Several Docker / Kubernetes / cloud-heavy repos still need fuller execution validation beyond static structure checks.
- Human review: AI-assisted material across the org still needs ongoing factual review, correction, and link cleanup. The Security track went through a deliberate ML-domain pass during the May 2026 build; others would benefit from a similar pass.
- Audit-script heuristics: ✅ Resolved May 26, 2026. `_meta/scripts/audit_curriculum_quality.py` now (1) recognizes the engineer-track numbered-lecture convention via a `^\d+-.+\.md$` regex check, (2) skips markdown code-block content (no longer flags `grep -r "TODO"` examples or gRPC `Stub` API names), and (3) carries a ~70-entry false-positive phrase list distinguishing teaching scaffolds (`**TODO:** Complete this Dockerfile`, "stubs with educational TODO comments", template `| CISO | TBD |` rows) from actual unfinished work. Sampled-marker count dropped from 294 → 182 across the May 26 passes.
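The three heuristics described for the audit script can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `audit_curriculum_quality.py`: only the regex and the three example phrases come from the text above, while the function names, the `MARKER` pattern, and the line-by-line approach are invented for the example.

```python
import re

NUMBERED_LECTURE = re.compile(r"^\d+-.+\.md$")  # regex quoted in the text above
MARKER = re.compile(r"\b(TODO|TBD)\b")           # assumed marker pattern
FENCE = "`" * 3                                  # markdown code-fence delimiter

# Three example entries from the text; the real list reportedly has ~70.
TEACHING_PHRASES = (
    "**TODO:** Complete this Dockerfile",
    "stubs with educational TODO comments",
    "| CISO | TBD |",
)


def is_numbered_lecture(filename: str) -> bool:
    """Heuristic 1: accept names such as 01-overview.md as lecture files."""
    return NUMBERED_LECTURE.match(filename) is not None


def count_placeholder_markers(markdown: str) -> int:
    """Count lines that look like unfinished work, applying heuristics 2 and 3."""
    count = 0
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # heuristic 2: toggle on fence open/close
            continue
        if in_fence:
            continue  # skip code-block content entirely
        if any(phrase in line for phrase in TEACHING_PHRASES):
            continue  # heuristic 3: known teaching scaffold, not a gap
        if MARKER.search(line):
            count += 1
    return count
```

A phrase allowlist is simple to maintain, but it only suppresses exact matches, so new teaching scaffolds would need new entries.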
Select based on your current experience level and career direction.
    # Example: Junior Engineer track
    git clone https://github.com/ai-infra-curriculum/ai-infra-junior-engineer-learning.git
    cd ai-infra-junior-engineer-learning

    # Read the curriculum
    cat README.md

    # Start with Module 001
    cd lessons/mod-001-python-fundamentals
    cat README.md

Work through the hands-on exercises in each module.
Use the companion solutions repository for comparison and reference.
Languages: Python, Bash, HCL (Terraform), YAML
ML Frameworks: PyTorch, TensorFlow, Scikit-learn
Orchestration: Kubernetes, Helm, ArgoCD, FluxCD
Cloud: AWS, GCP, Azure
Containers: Docker, containerd
MLOps: MLflow, Kubeflow, DVC, Feast
Monitoring: Prometheus, Grafana, Loki, Jaeger
IaC: Terraform, Pulumi
CI/CD: GitHub Actions, GitLab CI
LLMs: vLLM, Llama, Mistral, RAG systems
GPU: CUDA, NCCL, TensorRT
- Working code in core engineering, security, and solutions repos
- Metrics-driven project and exercise design
- Solution repos plus strategic templates where code is not the main deliverable
- Best practices and anti-patterns
- 525+ hands-on exercises
- 45+ real-world projects
- Full learning/solutions repo pairs
- Cross-references, quizzes, and answer keys
- Start with fundamentals
- Build to production systems
- Specialize across MLOps, platform, performance, and security
- Advance into architecture, leadership, and principal-level strategy
- Org-wide structural refresh completed in May 2026
- Human review and validation are still ongoing
- Cross-reference and answer-key maintenance are active
- Feedback is welcome through Issues and Discussions
We welcome contributions across the organization.
Ways to contribute:
- Fix broken links, stale references, or inaccurate explanations
- Add depth to thin modules, projects, or strategic artifacts
- Improve validation for runnable exercises and projects
- Report issues or suggest improvements via GitHub Discussions
- Follow the `CONTRIBUTING.md` in the specific repository you want to improve
Most curriculum repositories are MIT-licensed. See the target repository's LICENSE file for the authoritative terms.
- Issues: Use the relevant repository's GitHub Issues
- Discussions: Use organization discussions
- Docs: See Career Progression and Curriculum Cross-Reference
Current Status (May 2026):
- ✅ All `27` org repositories are live
- ✅ The May 23, 2026 chain pass refreshed `21` repos
- ✅ Junior, Engineer, and Security tracks are now fully developed entry-to-specialization paths
- ✅ All 12 solutions repos now have `SOLUTION.md` (per-project) or `SOLUTION_OVERVIEW.md` (per-track) design-rationale docs
- 🟡 Specialization and senior tracks are published and usable
- 🟡 Leadership and principal tracks are structurally live but still need depth
- ✅ ML Platform Engineer project layer is now in place (5 capstones added May 26, 2026)
- 🟡 Senior Architect needs deeper project artifacts
Current Focus (2026):
- Human review and factual verification of AI-assisted content
- Runtime validation for code-heavy projects and labs
- Deeper lecture and artifact development for leadership-tier tracks
- ML Platform and Senior Architect project layer
- Cross-reference, navigation, and link cleanup across the org
- Reduce infrastructure costs by 30-50%
- Improve GPU utilization from low baseline usage to production efficiency
- Cut deployment time from days to hours
- Scale from a first model service to multi-team platform thinking
- Covers real production concerns across serving, MLOps, platform, security, and architecture
- Includes working code, solution repos, and strategic planning artifacts
- Refreshed through the May 2026 org-wide update pass
- Aligned with role progression from junior to principal
- Clear progression path from Junior to Principal
- Multiple specialization tracks
- Leadership development included
- Portfolio-ready projects and artifacts
Start your AI Infrastructure Engineering journey today! 🚀
Choose Your Track | Quick Start | Contributing
Last Updated: May 26, 2026
Total Repositories: 27 org-wide (24 curriculum + 3 support)
Maintained by VeriSwarm.ai