rohanmistry231/AI-From-Scratch

AI From Scratch

Every Algorithm. Implemented From Scratch. No Black Boxes.



200+ algorithms across 6 domains of artificial intelligence — all implemented from scratch using pure NumPy.

No TensorFlow. No PyTorch. No HuggingFace. No scikit-learn (except in benchmarks). Just math, logic, and working code.


The Series

| # | Repo | Algorithms | Focus |
|---|------|-----------:|-------|
| 1 | ML-From-Scratch | 48 | Regression, Classification, Ensembles, Clustering, Dimensionality Reduction, Neural Nets, Optimization, Recommenders, RL, Probabilistic |
| 2 | DL-From-Scratch | 28 | CNNs, RNNs, LSTMs, GRUs, Attention, Transformers, GANs, VAEs, Normalization, Residual Networks |
| 3 | NLP-From-Scratch | 33 | Tokenization, Word Embeddings (Word2Vec, GloVe, FastText, ELMo), Sequence Labeling (HMM, CRF), Text Classification, Topic Modeling (LDA, NMF, LSA), Text Generation, Evaluation Metrics |
| 4 | LLM-From-Scratch | 28 | GPT Architectures, Attention Variants, Positional Encodings, KV Cache, MoE, RLHF, PPO, DPO, Quantization, Fine-Tuning, RAG |
| 5 | GenAI-From-Scratch | 30 | VAEs, GANs, Diffusion Models, Flow Matching, Autoregressive Models, Normalizing Flows, Energy-Based Models, Score Matching |
| 6 | TimeSeries-From-Scratch | 33 | ARIMA, SARIMA, ETS, Holt-Winters, State Space Models, Kalman Filter, Structural Time Series, Decomposition (STL), Feature Extraction, Backtesting |

Total: 200+ algorithms and growing.


Why This Series?

Most ML education stops at model.fit(). This series exists to answer the question:

"What actually happens when I call .fit()?"

Each repo in this series is built on three principles:

1. Zero Black Boxes

Every algorithm is written in raw NumPy. You can step through the code, examine every matrix multiplication, and understand exactly how predictions are made.
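
As a rough illustration of what "raw NumPy" means here, the sketch below fits a linear regression with the normal equation. The class name `NormalEquationRegressor` and its attributes are hypothetical and are not taken from the repos; they only show the kind of code you can step through line by line.

```python
import numpy as np


class NormalEquationRegressor:
    """Ordinary least squares via the normal equation (illustrative sketch)."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Prepend a column of ones so the first coefficient acts as the intercept.
        Xb = np.column_stack([np.ones(X.shape[0]), X])
        # Normal equation: (Xb^T Xb) theta = Xb^T y.
        # Solving the linear system avoids forming an explicit inverse.
        self.theta_ = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return self.theta_[0] + X @ self.theta_[1:]


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 1.0  # noiseless line y = 2x + 1
model = NormalEquationRegressor().fit(X, y)
print(model.theta_)  # intercept and slope, approximately [1, 2]
```

Everything `.fit()` does in this sketch is one matrix product and one linear solve, which is the point of the "no black boxes" principle.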

2. Production-Ready Tooling

Despite being educational, every repo includes:

  • Type hints and NumPy-style docstrings
  • pyproject.toml for pip-installable packages
  • Unit tests with pytest
  • Ruff for linting
  • Pre-commit hooks
  • CI/CD with GitHub Actions
  • Runnable examples for every algorithm
  • Jupyter notebooks for visualization
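
To make the first and third bullets concrete, here is a minimal sketch of a type-hinted function with a NumPy-style docstring and a matching pytest-style test. The function and test names are assumptions chosen for illustration, not code from the repos.

```python
import numpy as np
from numpy.typing import ArrayLike


def mean_squared_error(y_true: ArrayLike, y_pred: ArrayLike) -> float:
    """Compute the mean squared error between two arrays.

    Parameters
    ----------
    y_true : array_like
        Ground-truth target values.
    y_pred : array_like
        Predicted values, same shape as ``y_true``.

    Returns
    -------
    float
        Average of the squared differences.

    Raises
    ------
    ValueError
        If the two inputs have different shapes.
    """
    true_arr = np.asarray(y_true, dtype=float)
    pred_arr = np.asarray(y_pred, dtype=float)
    if true_arr.shape != pred_arr.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean((true_arr - pred_arr) ** 2))


def test_mean_squared_error_known_value() -> None:
    # pytest collects functions whose names start with "test_".
    # Squared errors are 0 and 4, so the mean is 2.
    assert mean_squared_error([1.0, 2.0], [1.0, 4.0]) == 2.0
```

In a real package the test would normally live in a separate `tests/` module; it is shown next to the function here only to keep the sketch self-contained.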

3. Interview-First Design

Every algorithm folder includes a 12-section README whose sections include:

  • Intuition and mathematical formulation
  • Pseudocode
  • Time and space complexity analysis
  • Interview Q&A — real questions from real interviews

Learning Roadmap

Phase 1 — ML Foundations (Start Here)

ML-From-Scratch
├── Linear Regression
├── Logistic Regression
├── Decision Trees
├── Random Forest
├── SVM
├── K-Means
├── PCA
└── Neural Networks (MLP)

Phase 2 — Deep Learning

DL-From-Scratch
├── CNN
├── RNN / LSTM / GRU
├── Attention & Transformers
├── Autoencoders & VAEs
└── GANs

Phase 3 — Natural Language Processing

NLP-From-Scratch
├── Tokenization & Preprocessing
├── Word Embeddings (Word2Vec, GloVe, FastText)
├── Sequence Labeling (HMM, CRF)
├── Text Classification
└── Topic Modeling (LDA, NMF)

Phase 4 — Large Language Models

LLM-From-Scratch
├── GPT Architecture
├── Attention Variants (Grouped, Flash, Multi-Latent)
├── Training (Pre-training, SFT, RLHF)
├── Inference (KV Cache, Speculative Decoding)
└── Applications (RAG, Agents, Quantization)

Phase 5 — Generative AI

GenAI-From-Scratch
├── VAEs & GANs
├── Diffusion Models
├── Flow Matching
├── Normalizing Flows
└── Score-Based Models

Phase 6 — Time Series

TimeSeries-From-Scratch
├── Classical (ARIMA, SARIMA, ETS)
├── State Space Models (Kalman, DLM)
├── Decomposition (STL, Seasonal)
├── Feature Extraction
└── Evaluation & Backtesting

Algorithm Index (All 200+)

ML-From-Scratch (48)

Regression: Linear Regression (Normal Equation, GD), Polynomial, Ridge, Lasso, Elastic Net

Classification: Logistic Regression, KNN, Naive Bayes (Gaussian, Multinomial), Perceptron, SVM (Linear, Kernel), Decision Tree (Classifier, Regressor)

Ensembles: Bagging, Random Forest, AdaBoost, Gradient Boosting, Stacking

Clustering: K-Means, K-Means++, Hierarchical, DBSCAN, Mean Shift, GMM

Dimensionality Reduction: PCA, SVD, LDA, t-SNE

Neural Networks: Perceptron, MLP, CNN, RNN, LSTM, Autoencoder

Optimization: Batch GD, SGD, Mini-Batch GD, Momentum, RMSProp, Adam

Recommender Systems: Collaborative Filtering, Matrix Factorization

Reinforcement Learning: Q-Learning, SARSA

Probabilistic: HMM, Apriori
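
For a feel of what one entry in this index involves, here is a minimal sketch of the Adam update rule from the Optimization group, applied to a simple quadratic. The function name `adam_minimize` and its defaults are illustrative assumptions, not the repo's API.

```python
import numpy as np


def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a function with Adam, given a callable that returns its gradient."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)  # running mean of gradients (first moment)
    v = np.zeros_like(x)  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1.0 - beta1**t)
        v_hat = v / (1.0 - beta2**t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x


target = np.array([3.0, -2.0])


def grad(x):
    # Gradient of f(x) = ||x - target||^2.
    return 2.0 * (x - target)


x_min = adam_minimize(grad, x0=[0.0, 0.0])
print(x_min)  # should land close to the target
```

On the first step the bias-corrected update is roughly `lr * sign(gradient)`, which is why Adam's early progress is largely independent of the gradient's scale.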


DL-From-Scratch (28)

Foundations: Dense Layer, Activation Functions (ReLU, Sigmoid, Tanh, Softmax, GELU, Swish), Weight Initialization (Xavier, He), Loss Functions, Batch Normalization, Layer Normalization, Dropout

Convolutional: Conv2D, MaxPooling, Flatten

Recurrent: RNN Cell, LSTM Cell, GRU Cell, Bidirectional RNN

Attention: Scaled Dot-Product Attention, Multi-Head Attention, Self-Attention, Cross-Attention

Transformer: Transformer Encoder, Transformer Decoder, Positional Encoding

Advanced: Residual Block, Skip Connection, GAN, VAE
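
As an example of how compact some of these building blocks are, the following sketch implements scaled dot-product attention with an optional boolean mask. Function names and the mask convention (`True` means "may attend") are assumptions made for this illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V, mask=None):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Positions where mask is False receive a very negative score,
        # so their softmax weight is effectively zero.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights


rng = np.random.default_rng(0)
T, d = 4, 8
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

causal_mask = np.tril(np.ones((T, T), dtype=bool))  # token i sees tokens 0..i
out, weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(out.shape, weights.sum(axis=-1))
```

With the causal mask, the first position can attend only to itself, so its output row equals the first row of `V`.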


NLP-From-Scratch (33)

Preprocessing: Tokenizer, Subword Tokenizer (BPE), Text Normalizer, Edit Distance, Spell Checker

Feature Extraction: Bag of Words, TF-IDF, N-Gram Language Model, PMI, Text Vectorizer

Word Embeddings: Word2Vec CBOW, Word2Vec Skip-Gram, GloVe, FastText, ELMo (LSTM)

Sequence Labeling: HMM, Viterbi Decoder, CRF, Maximum Entropy Classifier

Classification: Naive Bayes (Text), Logistic Regression (Text), SVM (Text), Perceptron (Text)

Topic Modeling: LDA, NMF, LSA

Generation: Beam Search, Temperature Sampling, Top-K & Top-P Sampling

Metrics: BLEU, ROUGE, Perplexity, Word Error Rate
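
The Generation entries above are small enough to sketch in a few lines. The code below shows one possible way to implement temperature scaling, top-k filtering, and top-p (nucleus) filtering over a probability vector; the function names are hypothetical and the repo's versions may differ.

```python
import numpy as np


def softmax_with_temperature(logits, temperature=1.0):
    # Temperatures below 1 sharpen the distribution; above 1 flatten it.
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose total mass reaches p, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    # Index of the first position where the cumulative mass is >= p.
    cutoff = np.searchsorted(cumulative, p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


probs = np.array([0.05, 0.5, 0.15, 0.3])
nucleus = top_p_filter(probs, p=0.75)  # keeps tokens 1 and 3
rng = np.random.default_rng(0)
token = rng.choice(len(probs), p=nucleus)
print(nucleus, token)
```

Sampling then reduces to a single `rng.choice` call over the filtered distribution, so the filters can be combined or swapped freely.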


LLM-From-Scratch (28)

Architecture: GPT, Causal Attention, Grouped Query Attention, Multi-Query Attention, Multi-Latent Attention, Flash Attention, ALiBi, RoPE, Relative Positional Encoding

Scaling: Mixture of Experts, Sparse MoE, Model Parallelism, Tensor Parallelism, Pipeline Parallelism

Training: Pre-training, Causal LM Loss, Curriculum Learning, Warmup-Cosine Schedule

Fine-Tuning: SFT, LoRA, QLoRA, Adapter

RLHF: PPO, DPO, Reward Modeling, Rejection Sampling

Inference: KV Cache, Speculative Decoding, GQA, Quantization (GPTQ, AWQ)

Applications: RAG, Tool Use, Agent Loop
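
To illustrate the KV Cache entry under Inference, the sketch below decodes a sequence one token at a time with a single attention head, caching keys and values, and checks the result against recomputing attention over the full prefix. All names (`KVCache`, `decode_step`, the weight matrices) are assumptions for this example only.

```python
import numpy as np


def attend(q, K, V):
    """Single-query scaled dot-product attention over all given positions."""
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V


class KVCache:
    """Grows by one key row and one value row per decoded token."""

    def __init__(self, d_head):
        self.keys = np.empty((0, d_head))
        self.values = np.empty((0, d_head))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])


def decode_step(x_t, Wq, Wk, Wv, cache):
    """Project only the newest token; reuse cached keys and values for the past."""
    q, k, v = x_t @ Wq, x_t @ Wk, x_t @ Wv
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)


rng = np.random.default_rng(0)
d_model, d_head, T = 6, 4, 5
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
X = rng.normal(size=(T, d_model))  # hypothetical token embeddings

cache = KVCache(d_head)
cached_out = np.stack([decode_step(x_t, Wq, Wk, Wv, cache) for x_t in X])

# Reference: recompute keys and values for the whole prefix at every step.
full_out = np.stack(
    [attend(X[t] @ Wq, X[: t + 1] @ Wk, X[: t + 1] @ Wv) for t in range(T)]
)
print(np.allclose(cached_out, full_out))
```

The cached path does one key projection and one value projection per step, while the reference recomputes them for every earlier token, which is the cost a KV cache is meant to avoid.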


GenAI-From-Scratch (30)

VAEs: VAE, Beta-VAE, Conditional VAE, VQ-VAE, VQ-VAE-2

GANs: GAN, DCGAN, Conditional GAN, InfoGAN, Wasserstein GAN, WGAN-GP, LSGAN, CycleGAN, StyleGAN, Progressive GAN, SAGAN, BigGAN

Diffusion: DDPM, DDIM, Classifier-Free Guidance, Latent Diffusion, Stable Diffusion

Advanced: Flow Matching, Autoregressive Models (PixelCNN, PixelRNN), Normalizing Flows (RealNVP, Glow), EBM, Score Matching (SMLD, NCSN)
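
As one small example from the Diffusion group, the DDPM forward (noising) process has a closed form: x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε, with ᾱ_t the cumulative product of (1 - β_s). The sketch below assumes the commonly used linear β schedule from 1e-4 to 0.02 over 1000 steps; the function names are illustrative, not the repo's.

```python
import numpy as np


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    # Noise variances increase linearly from beta_start to beta_end.
    return np.linspace(beta_start, beta_end, T)


def q_sample(x0, t, alpha_bar, noise=None, rng=None):
    """Draw x_t from q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    if noise is None:
        rng = np.random.default_rng() if rng is None else rng
        noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise


T = 1000
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)  # fraction of signal variance left at each step

x0 = np.ones((2, 2))
rng = np.random.default_rng(0)
x_early = q_sample(x0, 0, alpha_bar, rng=rng)      # almost identical to x0
x_late = q_sample(x0, T - 1, alpha_bar, rng=rng)   # almost pure Gaussian noise
print(alpha_bar[0], alpha_bar[-1])
```

Because any timestep can be sampled directly from x_0, training does not need to simulate the chain step by step.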


TimeSeries-From-Scratch (33)

Foundations: White Noise, Random Walk, Autocorrelation Function, Partial ACF, Stationarity Tests, Differencing, Lag Features, Rolling Statistics

Baselines: Naive Forecast, Seasonal Naive, Drift Method, Mean Forecast

Classical: AR, MA, ARMA, ARIMA, SARIMA, SARIMAX

Exponential Smoothing: Simple Exponential, Holt's Linear, Holt-Winters, Damped Trend, ETS

Advanced Statistical: GARCH, ARCH, VAR, Dynamic Regression

State Space: Kalman Filter, Kalman Smoother, DLM, Structural Time Series

Decomposition: Classical Decompose, STL, Seasonal Decompose, Moving Average Smoothing

Features: Time Series Features (mean, variance, trend, seasonality, entropy), Feature Engineering

Evaluation: Time Series Cross-Validation, Walk-Forward Validation, Metrics (MSE, MAE, MAPE, SMAPE, MASE)
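
The Exponential Smoothing and Evaluation entries combine naturally. The sketch below is one possible minimal version of a simple exponential smoothing forecast scored with expanding-window walk-forward validation; the function names and the choice of MAE are assumptions for illustration.

```python
import numpy as np


def ses_forecast(history, alpha):
    """One-step-ahead forecast from simple exponential smoothing."""
    level = history[0]
    for obs in history[1:]:
        # New level is a weighted blend of the latest value and the old level.
        level = alpha * obs + (1.0 - alpha) * level
    return level


def naive_forecast(history):
    """Baseline: predict that the next value equals the last observed one."""
    return history[-1]


def walk_forward_mae(y, forecast_fn, min_train):
    """Refit on y[:t] and score the prediction for y[t], for every t >= min_train."""
    errors = []
    for t in range(min_train, len(y)):
        prediction = forecast_fn(y[:t])  # only past data is visible
        errors.append(abs(y[t] - prediction))
    return float(np.mean(errors))


y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy upward trend
naive_mae = walk_forward_mae(y, naive_forecast, min_train=1)
ses_mae = walk_forward_mae(y, lambda h: ses_forecast(h, alpha=0.5), min_train=1)
print(naive_mae, ses_mae)
```

On this trending toy series the smoothed forecast lags behind the data and scores worse than the naive baseline, which is why the Baselines group is worth running before anything more elaborate.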


Getting Started

# Clone any repo
git clone https://github.com/rohanmistry231/ML-From-Scratch.git
cd ML-From-Scratch

# Install dependencies (numpy + matplotlib + sklearn for benchmarks)
pip install -r requirements.txt

# Run any algorithm
cd algorithms/01_supervised_regression/linear_regression_normal_equation/
python example.py

Each repo is fully standalone: clone one or clone all, since none depends on another.


Philosophy

| Principle | Why |
|-----------|-----|
| No ML libraries in algorithm code | The whole point is learning what's inside the box |
| NumPy only | Vectorized math is how industry implements these algorithms |
| scikit-learn only in example.py | For validation and benchmarking against known implementations |
| Every algorithm has a README | Code without understanding teaches nothing |
| Every README has Interview Q&A | Make your study immediately useful for job prep |
| Type hints & docstrings | Professional-grade, readable code |
| Tests & CI | The code actually works, not just looks good |

License

All repos are MIT licensed. Use them, learn from them, build on them.


Star History

If this series helps you learn or land a job — drop a star on any (or all) of the repos. It helps others find them.


"You don't truly understand an algorithm until you can implement it from scratch."
