Neural Network and Deep Learning
Samatrix Consulting Pvt Ltd
What is a Neural Network?
• Robot Brain: A neural network (NN) is like a "robot brain" that learns from
examples, just like how a child learns to recognize dogs vs. cats.
• Inspired by Biology: Mimics how human brain cells (neurons) work, but uses
math instead of biology.
• Learns Patterns: It finds patterns in data (e.g., photos, sounds, numbers) to make
predictions or decisions.
• Example: Filters spam emails by learning words like "free" or "win."
• Layers Do the Work:
• Input Layer: Receives data (e.g., pixels of an image).
• Hidden Layers: Process data step-by-step (like a chef cooking ingredients).
• Output Layer: Gives the final answer (e.g., "cat" or "dog").
• Used Everywhere: Powers face recognition, voice assistants (Siri/Alexa), and self-driving cars!
• Real-Life Comparison:
• Imagine a NN as a team of detectives: Each layer (detective) analyzes clues (data)
to solve a case (prediction).
Why use Neural Networks?
They're Smart Problem-Solvers!
• Learn from examples - like a baby learning cats vs. dogs by seeing many pictures
• Handle messy data - work with blurry images, typos in text, noisy sounds
• Spot hidden patterns - Finds connections humans might miss (e.g., "rain +
weekend = more pizza orders")
They're Super Flexible!
• Image tasks: Face unlock, medical scans
• Language tasks: Siri/Alexa, translation
• Real-world AI: Self-driving cars, fraud detection
Neural Networks vs. Traditional Programming
Traditional Programming:
1. You give exact rules (like a recipe).
2. Example: "If temperature > 30°C, say 'hot'; else, say 'cold'."
3. Computer blindly follows instructions.
Neural Networks:
1. You give examples (data) and let the NN figure out rules itself.
2. Example: Show 1000 pics labeled "hot" or "cold" → NN learns patterns (e.g.,
sun = hot, snow = cold).
3. Computer learns like a human!
Key Difference:
1. Traditional: You write the logic.
2. NN: The network learns the logic from data. It is stored in the weights and can be hard to interpret, even for programmers! (But it works!)
• Fun Example:
• Traditional: "If someone wears glasses, call them 'smart.'" (Rigid
rule).
• NN: Analyzes 1000 smart people—some wear glasses, some
don’t—and learns true patterns (e.g., books + lab coats = smarter
guess).
• Final Thought:
• NNs are lazy but genius—they learn from mistakes instead of
needing perfect rules!
Biological vs Artificial Neurons
What is a Neuron?
• A neuron is a tiny decision-maker inside a neural network.
• It takes some input (like numbers), does a little math, and sends an output.
• It’s like a mini calculator or a tiny brain cell.
• Each neuron connects to other neurons — like a big team!
• Neurons work together to learn patterns in data — like faces, text, or voices.
Think of a neuron like a filter: “Should I pass this info forward or not?”
Biological Neuron
• A neuron is a special cell in your brain.
• It helps you think, learn, and remember things.
• It has tiny branches called dendrites to receive signals.
• The signal goes through the cell body (like a control center).
• Then it’s sent out through the axon to other neurons.
• Neurons are like messengers in your brain — passing info super fast!
What is an Artificial Neuron (Node)?
• An artificial neuron is like a tiny calculator in a neural network.
• It takes numbers as input, does some math, and gives one output.
• It multiplies inputs by weights, adds a bias, and applies an activation function.
• It's the basic building block of neural networks.
• Works kind of like a light switch — only turns on if the signal is strong enough.
Think of it as a mini decision-maker in the brain of a computer!
• How It Works:
• Inputs: Receives multiple numbers (like temperature, humidity)
• Weights: Gives importance to each input (temperature matters more than
humidity)
• Combination: Mixes all weighted inputs together
• Decision: Uses rules to say "yes" (1) or "no" (0), or something in between
• Real-World Example - Loan Approval:
• Inputs: Salary (5,000), Credit Score (700), Debt (200)
• Neuron learns: Salary very important, debt somewhat important
• Result: Approves loan if weighted combination passes threshold
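The loan example can be sketched as a single artificial neuron in NumPy. This is only an illustration: the weights, bias, and scaled inputs below are made-up values (a real network would learn them), and the inputs are assumed to be scaled to small numbers first.

```python
import numpy as np

def sigmoid(z):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical, hand-picked values for illustration only.
# Inputs (already scaled): salary, credit score, debt.
inputs = np.array([0.5, 0.7, 0.2])
weights = np.array([2.0, 1.5, -1.0])   # debt counts against approval
bias = -1.0

z = np.dot(inputs, weights) + bias     # weighted sum plus bias
score = sigmoid(z)                     # activation function
approved = bool(score >= 0.5)          # threshold decision
print(round(float(score), 3), approved)
```

The neuron approves the loan only when the weighted combination pushes the score past the 0.5 threshold.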
Biological vs. Artificial Neurons
Biological Neuron | Artificial Neuron (Node)
Found in the human brain | Found in computer models (AI)
Passes signals between brain cells | Passes numbers between layers
Dendrites receive signals | Inputs receive numbers
Axon sends signals to next cell | Output sends result to next node
Works with electric & chemical signals | Works with math (add, multiply)
Learns from life experience | Learns from data
Very complex and powerful | Much simpler and easy to program
Grows and changes in the brain | Adjusts weights and bias during training
Architecture of Neural Networks
Structure of Neural Network
1. Input Layer
• Receives raw data (pixels, words, numbers)
• Like filling in your details on a questionnaire
2. Hidden Layers
• Where the magic happens!
• Each layer extracts patterns (edges → shapes → objects in images)
• More layers = more complex patterns learned
3. Output Layer
• Gives the final answer:
• Classification: "Cat" or "Dog"
• Prediction: "Will rain tomorrow? (87% yes)"
Input Layer
Takes in Raw Data
Images → Pixel values (0=black, 255=white)
Text → Word counts or codes
Sound → Volume/frequency numbers
Organizes the Data
Flattens images into a list of numbers
Scales values (e.g., 0 to 1)
Feeds It Forward
Passes data to hidden layers like a conveyor belt
Real-World Example: A self-driving car’s input layer:
Takes camera footage → Converts into millions of numbers (pixels) → Sends for processing
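The "flatten and scale" steps can be sketched with NumPy. The image here is random made-up data standing in for a real 28×28 grayscale picture.

```python
import numpy as np

# A made-up 28x28 grayscale image with pixel values 0..255.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28))

flat = image.reshape(-1)     # flatten into a list of 784 numbers
scaled = flat / 255.0        # scale values into the range 0 to 1
print(flat.shape, float(scaled.min()), float(scaled.max()))
```

Scaling keeps all inputs in a similar small range, which usually makes training more stable.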
Hidden Layer
Processes Information
• Takes numbers from input layer
• Mixes them like a recipe (weights = ingredient amounts)
Learns Patterns
• Finds edges in images → shapes → objects
• Spots word combinations in text (e.g., "free" + "win" = spam)
Makes Decisions
• Decides what’s important (e.g., "cat ears" > "background color")
• Passes clues to the next layer
Why It’s Cool:
• More hidden layers let the network learn more complex patterns! (But training gets slower and harder)
• Works like a detective – Finds hidden clues in data
Output Layer: The Network's Final Answer
Gives the Final Decision
• Classification: Picks a category ("Dog")
• Regression: Predicts a number ("$299,000")
• Probability: Shows confidence ("95% cat")
Formats the Answer
• Uses Softmax (for categories)
• Uses Sigmoid (for yes/no)
• Uses Linear (for numbers)
Connects to the Real World
• Sends results to apps, robots, or users
Simple Neural Network Example
• Let’s say we want to teach a computer to recognize handwritten numbers (0–9).
• We give it an image with 784 pixels (28×28) — that’s our input!
• It goes through a few hidden layers to learn patterns like curves, lines, and edges.
• The final layer gives us 10 outputs — one for each number!
• The network picks the output with the highest score as its answer.
It’s like the computer says, “Hmm… I’m 90% sure this is a 3!”
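A rough sketch of the shapes involved in the digit example. The weights are random and untrained, and the hidden size of 64 is an arbitrary assumption, so the "answer" is meaningless; the point is only how 784 inputs become 10 outputs.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.random(784)                      # stand-in for a flattened 28x28 image

# Untrained random weights; hidden size 64 is an arbitrary choice.
W1, b1 = rng.normal(0, 0.1, (64, 784)), np.zeros(64)
W2, b2 = rng.normal(0, 0.1, (10, 64)), np.zeros(10)

hidden = np.maximum(0, W1 @ x + b1)      # hidden layer with ReLU
scores = W2 @ hidden + b2                # 10 raw scores, one per digit
exp = np.exp(scores - scores.max())
probs = exp / exp.sum()                  # softmax: scores -> probabilities
digit = int(np.argmax(probs))            # highest probability wins
print(digit, probs.shape)
```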
Real Life Analogy
Input Layer = Order Ticket
• "Pepperoni, extra cheese" → Just like pixels/words entering the network
Hidden Layers = Chef's Thinking
• Remembers recipes (weights)
• Adjusts ingredients (learning)
• Tastes while cooking (backpropagation)
Output Layer = Served Pizza
• Perfect pie → Correct prediction
• Burnt crust → Error (needs adjustment!)
Why This Fits:
• Both learn from mistakes (bad pizza → better next time!)
• Complex behind-the-scenes (you only see the final result)
• Improves with practice (just like a chef's skills)
Fun Twist:
• "The more customers (data), the smarter the chef (network) becomes!"
Activation Functions
What is an Activation Function?
Decides If a Neuron Should "Fire"
• Like a brain cell deciding: "Is this important enough to react?"
Squishes Numbers Into a Useful Range
• Sigmoid: Turns any number into 0–1 (like a probability)
• ReLU: Negative → 0 | Positive → Unchanged (like a "bad news filter")
Adds Non-Linearity (Makes the Network Smarter!)
• Without it, stacked layers collapse into one linear formula, no matter how many layers you add
Example:
• Water Faucet : Controls flow (like ReLU)
• Volume Knob : Smoothly adjusts (like Sigmoid)
Why Do We Need Activation Functions?
Adds "Smart" Decisions
• Without it: Every neuron just outputs a weighted sum of its inputs, so the whole network stays linear (boring!)
• With it: Neurons can say "YES!" or "ignore this"
Allows Learning Complex Patterns
• Straight lines (no activation) → Can’t learn curves/wiggles
• Non-linear (ReLU/Sigmoid) → Can learn curves and complex shapes!
Controls Output Range
• Sigmoid: 0 to 1 (perfect for "probabilities")
• ReLU: 0 to ∞ (great for positive values)
Real-Life Analogy:
• Traffic Light (Activation Function):
• Green (≥0.5) → "GO!" (Output = 1)
• Red (<0.5) → "STOP!" (Output = 0)
Why It Matters:
• Without them: Neural Network = Fancy Calculator
• With them: Neural Network = Super Learner!
Fun Test: "Should this neuron fire?"
• Input=10 → "YES!" (ReLU)
• Input=-5 → "NOPE!" (ReLU)
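Why does a network without activation functions stay "just a calculator"? A small NumPy check, using random made-up weights, shows that two layers with no activation compute exactly the same thing as one layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))    # first layer weights (made up)
W2 = rng.normal(size=(2, 4))    # second layer weights (made up)

# Two layers with no activation in between...
two_layers = W2 @ (W1 @ x)
# ...equal one single layer whose weights are W2 @ W1.
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))   # → True

# With ReLU in between, the layers no longer merge into one in general.
with_relu = W2 @ np.maximum(0, W1 @ x)
```

This is why the non-linearity matters: it is what lets extra layers add something new.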
Sigmoid Function - The "Maybe" Button
Squishes Any Number into 0–1
• 5 → 0.99 ("Probably yes")
• -5 → 0.01 ("Probably no")
Perfect for Probabilities
• "Is this a cat?" → 0.87 = 87% confident
Smooth & Predictable
• Like a dimmer switch (not sudden on/off)
Pros & Cons:
✓ Easy to understand (0% to 100%)
✗ Slow for deep networks (gets "lazy" during training)
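A minimal sketch of Sigmoid and its slope. The slope formula s × (1 − s) is the standard derivative of the sigmoid; the input values are just examples.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_slope(x):
    s = sigmoid(x)
    return s * (1.0 - s)        # derivative of the sigmoid

for value in [-5.0, 0.0, 5.0]:
    print(value, round(float(sigmoid(value)), 2), round(float(sigmoid_slope(value)), 4))
```

At 0 the slope is 0.25, but at ±5 it is already below 0.01. Those tiny slopes at the extremes are why Sigmoid gets "lazy" in deep networks.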
Tanh Function: The "Strong Maybe" Button
Why Tanh Rocks:
Outputs range from -1 to 1, so negative values stay negative (unlike Sigmoid, which only gives 0 to 1)
Centered at 0 (helps networks learn faster)
Still smooth (like a volume knob)
Real-Life Analogy: Volume Control:
• Left (-1) = Mute
• Middle (0) = Normal
• Right (+1) = Max Volume
When to Use It:
• When your data has both positive + negative values
• In hidden layers (especially RNNs)
ReLU: The "If Positive, Say It!" Function
• What It Does:
• Input ≥ 0? → Passes it through unchanged
• Example: 3 → 3
• Input < 0? → Sets it to zero
• Example: -2 → 0
• Why We Love It:
• Super Simple: Just two rules!
• Fast & Efficient: Helps networks train quickly
• Avoids Vanishing Gradients for positive inputs (but watch for "dead neurons" that get stuck at 0)
• ReLU is the most popular activation function in deep learning!
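The two ReLU rules fit in one line of NumPy. Leaky ReLU is included as a common variant that keeps a small slope for negative inputs; the slope value `alpha=0.01` is a typical but assumed default.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)                 # negative -> 0, positive unchanged

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)    # small slope instead of a hard zero

values = np.array([3.0, -2.0, 0.0])
print(relu(values).tolist())
print(leaky_relu(values).tolist())
```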
Softmax Function: The "Pick One!"
What It Does:
• Takes a bunch of numbers (e.g., scores for "cat/dog/bird")
• Squishes them into probabilities (0% to 100%)
• All probabilities add up to 1.0 (100%)
Example
• Input Scores: [3.0, 1.0, 0.2] (Cat, Dog, Bird)
• Softmax Magic → [0.84, 0.11, 0.05]
• Output: "84% cat!" (Highest probability wins)
Key Idea:
"Softmax turns scores into a voting system where everyone gets a fair share!"
Why It’s Cool:
Perfect for classification (e.g., "Which animal is this?")
All scores compete (like a talent show)
Highlights the winner (but keeps small chances for others)
Real-Life Analogy:
• Dividing a Pizza:
• Biggest slice = Most likely answer
• But tiny slices still exist for other options!
Fun Example:
• Input: [5.0, 2.0, 1.0] (Ice Cream, Cake, Fruit)
• Softmax Says: "93.6% ice cream, 4.7% cake, 1.7% fruit!" (the exponent makes the top score stand out far more than simple sharing would)
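Softmax itself is short: raise e to the power of each score, then divide by the total. A minimal NumPy sketch (subtracting the maximum first is a common numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)   # stability trick, same result
    exp = np.exp(shifted)
    return exp / exp.sum()

animals = softmax(np.array([3.0, 1.0, 0.2]))    # Cat, Dog, Bird
desserts = softmax(np.array([5.0, 2.0, 1.0]))   # Ice Cream, Cake, Fruit
print(np.round(animals, 3), np.round(desserts, 3))
```

Note that softmax is not simple proportional sharing: a score of 5 against 2 and 1 gets far more than 5/8 of the probability.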
How to Pick the Right Activation Function
Output Layer:
• Yes/No? → Sigmoid (e.g., spam detection)
• Multiple Choices? → Softmax (e.g., cat/dog/bird)
• Any Number? → Linear (e.g., house price)
Hidden Layers:
• Usually ReLU (fast & simple)
• RNNs? → TanH (handles negatives better)
Avoid:
• Sigmoid/TanH in the hidden layers of deep networks (gradients shrink, so they get "lazy")
• Linear in hidden layers (no learning power)
Pro Tip: "Start with ReLU in hidden layers; it is a solid default for most problems!"
Project
Title - Forward & Backward Pass in Neural Networks — with NumPy
Objective –
• Understand what activation functions do in a neural network.
• Visualize and compare common activation functions: Sigmoid, Tanh, ReLU, and Leaky ReLU.
• Observe how each function transforms input values between different ranges.
• Learn why different activation functions are used for different types of problems.
• Build intuition about non-linearity, saturation, and probabilistic outputs.
• Practice using TensorFlow operations and Matplotlib for plotting.
Project File - [Link]
Forward Propagation
What is Forward Propagation
• Forward propagation is the process of moving data through a neural network.
• It starts at the input layer and flows to the output layer.
• At each layer:
• Inputs are multiplied by weights
• A bias is added
• An activation function decides what to pass next
• The final result is the network’s prediction!
• It's how a network makes a guess based on what it sees.
Real-Life Analogy:
• Like baking a cake:
Ingredients (input) → Mixing (hidden layers) → Cake (output)!
What Happens in Each Layer?
• In forward propagation, data moves from left to right through the layers.
• Here’s what each layer does:
Input Layer:
• Just receives the data (like an image or number).
• Passes it to the next layer without changing it.
Hidden Layer(s):
• Takes the data, multiplies it by weights.
• Adds a bias (a small number to adjust).
• Passes the result through an activation function to decide what to keep.
• Think of this layer as the “thinking part” of the network.
Output Layer:
• Takes the final results from hidden layers.
• Gives us the network’s guess or answer
• Example: “This is a dog” or “Number is 7”.
Real-Life Analogy:
• Like passing a message through a group — each person adds a bit of info before it
reaches the final answer!
• Input Layer: Holds raw data
• Hidden Layers: Find patterns
• Output Layer: Makes decisions
Forward Propagation – Step 1: Input → Hidden Layer
Inputs:
𝑥1 = 2, 𝑥2 = 3
Weights to Hidden Neuron:
𝑤1 = 0.4, 𝑤2 = 0.6
Bias (b):
𝑏=1
Calculation at Hidden Neuron:
𝑧 = 𝑥1 × 𝑤1 + 𝑥2 × 𝑤2 + 𝑏
𝑧 = 2 × 0.4 + 3 × 0.6 + 1
𝑧 = 0.8 + 1.8 + 1 = 3.6
Outcome: This 3.6 is passed to the activation function.
Forward Propagation – Step 2: Apply Activation Function
Let’s use ReLU activation (Rectified Linear Unit)
• Formula:
ReLU(𝑥) = max(0, 𝑥)
• Apply to our value from before:
ReLU(3.6) = max(0, 3.6) = 3.6
• Since it’s positive, the value stays the same.
• If it was negative, it would become 0.
• Output of hidden layer = 3.6
• Analogy: Like a gate that only opens for strong signals.
Forward Propagation – Step 3: Hidden → Output
• Now, we pass the result (3.6) to the output layer.
• Weight to output neuron: 𝑤 = 0.7
• Bias: 𝑏 = 0.5
• Output calculation:
𝑧 = 3.6 × 0.7 + 0.5 = 3.02
• Let’s apply Sigmoid to turn it into a probability:
$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$
$\mathrm{Sigmoid}(3.02) \approx 0.953$
Final Output: The model says: “I’m 95.3% confident this is a match (e.g., it’s a cat)!”
Summary Recap: Inputs → Multiply with weights → Add bias → Activate → Output
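The three steps can be checked in plain Python using the same inputs, weights, and biases as the slides. Note that Sigmoid is applied to the output value 3.02, not to the hidden value 3.6.

```python
import math

# Values taken from the worked example.
x1, x2 = 2.0, 3.0
w1, w2, b_hidden = 0.4, 0.6, 1.0
w_out, b_out = 0.7, 0.5

z_hidden = x1 * w1 + x2 * w2 + b_hidden    # Step 1: weighted sum + bias
a_hidden = max(0.0, z_hidden)              # Step 2: ReLU
z_out = a_hidden * w_out + b_out           # Step 3: hidden -> output
y = 1.0 / (1.0 + math.exp(-z_out))         # Sigmoid turns it into a probability

print(round(z_hidden, 2), round(z_out, 2), round(y, 3))
```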
Real Life Analogy: Cooking a Meal
Input: "Customer orders vanilla cake" (data)
Hidden Layers:
• Mixer: Combines ingredients (weights adjust flavors)
• Oven: Bakes at 180°C (ReLU = "Is it cooked? Yes/No")
Output: "Vanilla cake ready!" (prediction)
Key Idea:
"Data (order) transforms step-by-step into output (cake)!"
What Are Weights and Bias?
Weights
• A weight is a number that decides how important an input is.
• Bigger weight = input has more influence
• Think of it like turning the volume knob up or down
Bias
• A bias is like a default push added to the output.
• It’s like a default value that’s always added, no matter what the input is.
• It helps the network be more flexible and accurate when learning.
• Analogy: Like a "minimum score" to pass a test
• Weights = importance of what you say.
• Bias = Like a starting bonus in a game — you always get a few extra points, even
before you begin!
What Are Nodes and Layers?
Node (Mini Brain Cell)
• What? A tiny calculator that takes inputs, does math, gives output.
• Like a brain cell — thinking and passing the message!
• Example: One node checks "Is there a cat ear shape?".
• Real-life: Like one worker in a team assembly line.
Layer (Team Huddle)
• What? A group of nodes working together.
• Example:
• Input layer: "Eyes" (sees raw data)
• Hidden layer: "Brain" (thinks)
• Output layer: "Mouth" (answers)
Real-Life Analogy:
• Imagine a group of students (nodes) in a class (layer), all solving a problem and passing their answers forward.
Loss Functions
What is a Loss Function?
• A loss function tells the model how wrong its guess was.
• It compares the model's prediction with the real answer.
• Bigger mistake = higher loss
• Smaller mistake = lower loss
• The goal is to make the loss as small as possible during training.
Real-Life Analogy:
• It’s like a scorecard after a quiz — it shows how many answers were wrong and
how far off you were!
Example: Guessed "70% cat" → Truth = "DOG" → "Oops, that’s bad!"
Why Do We Need a Loss Function?
• It tells the neural network:
“Hey, you need to improve this part!”
• Without a loss function, the model wouldn’t know what to fix.
• The model learns by reducing the loss every time it trains.
• It’s like a teacher giving feedback:
“Try again, here’s where you messed up!”
Real-Life Analogy:
• Just like a test score tells you what to study more,
the loss tells the model how to get better!
Mean Squared Error – What Is It?
• MSE tells us how far off our predictions are from the real answers.
• We calculate the difference, then square it so it’s always positive.
• Then we average all the squared errors to get one final number.
• A smaller number = the model is doing well
• A bigger number = the model needs improvement
Formula: MSE = Average of (Prediction − Actual)²
Real-Life Analogy:
• Like guessing prices of 5 items at a store —
You write down how wrong each guess was, square those numbers, and then
average them. That’s your MSE!
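The store example can be sketched in NumPy. The five prices and guesses below are made-up numbers.

```python
import numpy as np

# Made-up prices and guesses for 5 items.
actual = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
predicted = np.array([12.0, 18.0, 33.0, 40.0, 45.0])

errors = predicted - actual         # [2, -2, 3, 0, -5]
squared = errors ** 2               # [4, 4, 9, 0, 25]
mse = float(np.mean(squared))       # (4 + 4 + 9 + 0 + 25) / 5
print(mse)
```

The single guess that was off by 5 contributes 25 of the total 42, which is how squaring makes big mistakes dominate.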
Why Use MSE?
• It's simple and clear — the smaller the number, the better.
• Squaring the error means big mistakes get punished more.
• It helps the model focus on reducing big errors first.
• It’s commonly used in tasks like:
• Predicting house prices
• Estimating temperature
• Forecasting sales
Everyday Example:
• If you’re guessing your friends’ heights, and you’re 1 cm off vs 10 cm off — MSE
makes sure the 10 cm mistake matters more!
Cross Entropy – What Is It?
• Cross Entropy tells us how good our prediction is for classification problems (like “Is this
a cat or a dog?”).
• It compares:
• The true label (the real answer)
• The predicted probabilities (what the model thinks)
• If the model is confident and right, the loss is low
• If the model is confident and wrong, the loss is high
Real-Life Analogy:
• Imagine you guessed that your friend’s secret number was 90% likely to be 5 — and it
really was 5! You did great = low loss.
• But if you guessed 90% sure it's 2 — and it's actually 5? Oops! Big mistake = high loss.
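For a single example, cross entropy is the negative log of the probability the model gave to the true answer. A small sketch of the secret-number analogy; in the "wrong" case it is assumed that only 10% was left for the real answer.

```python
import math

def cross_entropy(prob_of_true_answer):
    """Loss for one example: -log(probability given to the correct class)."""
    return -math.log(prob_of_true_answer)

confident_right = cross_entropy(0.9)   # 90% on the true answer
confident_wrong = cross_entropy(0.1)   # assumed: only 10% left for the true answer
print(round(confident_right, 3), round(confident_wrong, 3))
```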
Why Use Cross Entropy in Classification?
• It works well when we have multiple categories (like digits 0–9 or types of
animals).
• Helps the model learn to give high confidence to the correct answer.
• It’s the most common loss function for:
• Image classification
• Text classification
• Any task with multiple choices
Everyday Example:
• Like a quiz with multiple choices —
• You don’t just want the model to pick the right one, you want it to be really
confident about it!
Common Loss Functions
Root Mean Squared Error (RMSE)
• Like Mean Squared Error, but we take the square root at the end.
• Gives a result in the same units as the data (e.g., dollars, cm).
• Think: “How off were we, on average?”
Mean Squared Error (MSE)
• Measures how far off the model is by squaring the errors.
• Bigger mistakes = much more penalty.
• Great for predicting numbers (like house prices).
Mean Absolute Error (MAE)
• Just takes the absolute difference (no squaring).
• Counts each mistake in proportion to its size (no extra penalty for big errors).
• Easier to understand than MSE, but less sensitive to big errors.
Mean Absolute Percentage Error (MAPE)
• Measures error as a percentage of the actual value.
• Easy to explain — “The model was off by 10% on average.”
• Good for business and sales forecasting!
Binary Crossentropy
• Used when there are only 2 possible outcomes (yes/no, 0/1).
• Example: Is the email spam or not?
Categorical Crossentropy
• Used when there are more than 2 classes (like digits 0–9).
• Example: Is this a cat, dog, or rabbit?
Tip
• Choose the loss based on the type of problem:
• Regression = RMSE, MSE, MAE, MAPE
• Classification = Binary or Categorical Crossentropy
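The losses above can be compared side by side in NumPy. All numbers below are made-up illustration data.

```python
import numpy as np

# Made-up regression data (e.g., prices).
actual = np.array([100.0, 200.0, 400.0])
predicted = np.array([110.0, 180.0, 400.0])

errors = predicted - actual
mse = np.mean(errors ** 2)                       # squared errors, averaged
rmse = np.sqrt(mse)                              # back in the original units
mae = np.mean(np.abs(errors))                    # average absolute error
mape = np.mean(np.abs(errors / actual)) * 100    # average error in percent

# Made-up binary labels and predicted probabilities of class 1.
labels = np.array([1.0, 0.0])
probs = np.array([0.9, 0.2])
bce = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print(round(float(rmse), 2), float(mae), round(float(mape), 2), round(float(bce), 3))
```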
Gradient Descent & Learning Rate
What is Gradient Descent?
• A gradient is like a slope or the steepness of a hill.
• Gradient Descent is a way to teach the model how to improve its guesses.
• It helps the model find the best weights by checking how wrong it is (using loss).
• It takes small steps in the direction that makes the loss smaller.
• The goal? Reach the lowest point on the error graph — the best result!
• The model repeats this step again and again during training.
Real-Life Analogy: Hill Descent
• Imagine you’re standing on a hill blindfolded!
• Your goal is to walk down to the lowest point — without falling!
• You feel the slope under your feet — it tells you which way is downhill.
• You take small, careful steps in that direction.
Each step gets you closer to the bottom.
• That’s exactly how gradient descent works:
• The hill = the loss
• Your feet = the gradient
• Your steps = weight updates
• Gradient Descent = Slowly walking downhill
until you reach the lowest error!
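The downhill walk can be sketched on a toy "hill". The loss function (w − 3)², the starting point, and the learning rate are all assumptions chosen so the lowest point is easy to see (w = 3).

```python
def loss(w):
    return (w - 3) ** 2          # toy hill with its lowest point at w = 3

def gradient(w):
    return 2 * (w - 3)           # slope of the hill at w

w = 0.0                          # start somewhere on the hill
learning_rate = 0.1              # step size (assumed)

for step in range(50):
    w = w - learning_rate * gradient(w)   # step in the downhill direction

print(round(w, 4), round(loss(w), 8))
```

Each step shrinks the distance to the bottom by the same factor, so the steps get smaller as the slope flattens out.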
How Gradient Descent Works
Check the Slope (Gradient):
• "Which way is downhill?" (Math calculates the steepest drop)
Take a Step (Update Weights):
• New weight = Old weight − (Learning rate × Slope)
• Small step: Safe but slow
• Big step: Risky (might overshoot!)
Repeat Until Flat (Converge):
• Stops when the slope ≈ 0 ("Valley reached!")
Watch Out For:
• Local Valleys: Small dips that aren’t the lowest point
• Learning Rate: Like stride length; too big or too small causes problems!
Key Idea:
"Follow the slope downhill until you find the lowest point of error!"
Learning Rate
What It Does:
• Controls how big a step to take downhill
• Like choosing stride length while hiking:
• Big steps → Fast but risky (might trip!)
• Tiny steps → Safe but slow
Why It Matters:
• Too High: Misses the best solution (diverges)
• Too Low: Trains slower than a sleepy sloth
• Just Right: Finds the sweet spot efficiently
Tip: Start with ~0.01 and adjust! (Common first guess)
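A quick experiment on a toy loss (w − 3)², an assumed example with its minimum at w = 3, shows all three cases after the same 20 steps.

```python
def run(learning_rate, steps=20):
    """Gradient descent on the toy loss (w - 3)**2, starting from w = 0."""
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * 2 * (w - 3)
    return w

for lr in (0.01, 0.1, 1.1):
    print(lr, round(run(lr), 3))
```

With 0.01 the weight is still far from 3, with 0.1 it is very close, and with 1.1 every step overshoots further and the value blows up. The right value depends on the problem, which is why the learning rate is tuned.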
Gradient Descent Problems: Watch Out!
1. Local Minima (Fake Valleys)
Gets stuck in small dips (like a ball in a tiny hole)
Solution: Use momentum (keeps rolling past small bumps)
2. Vanishing Gradients (Too Flat)
Slope ≈ 0 → Updates barely change weights (snail pace)
Solution: Use ReLU activation
3. Exploding Gradients (Too Steep)
Slope = HUGE → Wild jumps (like a bouncing ball)
Solution: Clip gradients
4. Saddle Points (Flat Spots)
Slope = 0 in all directions (confused hiker)
Solution: Adaptive optimizers (Adam, RMSprop)
Gradients and Saddle Point
What is a Gradient?
• A gradient is like a slope or the steepness of a hill.
• It tells us which direction to move to make the loss
smaller.
• In math terms: It’s the rate of change of the loss
function.
• Positive slope = go downhill left
• Negative slope = go downhill right
• Zero slope = you’re flat — maybe at the bottom or
top!
What is a Local Minimum?
• A local minimum is a “small valley” in the loss
curve.
• The model thinks it found the best spot…
...but it’s not the lowest possible point.
• The gradient is zero, so it stops moving
• But there might be a better solution nearby!
• Fix: Use momentum (like a rolling ball that ignores
tiny bumps)
What is a Local Maximum?
• A local maximum is a “small peak” where the
model thinks it’s at the top.
• The slope is zero — so the model may get stuck
there.
• It’s like a false win — we don’t want to stop there.
• Fix: Use momentum (like a rolling ball that ignores
tiny bumps)
Saddle Points (The Flat Plateau)
• A saddle point is a spot on the curve where:
• It curves up like a minimum in one direction...
• ...but it curves down like a maximum in another direction!
• The gradient becomes zero, so the model may think
it’s “done”
• But it’s not the best point — it’s just tricky.
Real-Life Analogy:
• Imagine sitting on a horse saddle
• You feel downward in one direction (the dip)
And upward in another (the sides)
• That’s a saddle point!
Why Are Saddle Points a Problem?
• The model might get stuck at a saddle point — like a false flat spot.
• It’s not a real minimum, but the gradient is zero, so it stops moving.
• This can slow down learning or lead to wrong solutions.
• Saddle points are common in deep neural networks because they have many
dimensions.
Optimizers
What is an Optimizer?
• An optimizer is like a coach that helps the neural network get better at
predictions.
• It uses the loss value to decide how to adjust the weights.
• The goal: Make the model smarter step-by-step.
• Optimizers use techniques like gradient descent to find the best path.
• Different optimizers have different styles — some are faster, some are smoother.
What Do Optimizers Do?
• Adjust Learning Rates Automatically
• Like cruise control (faster on flat roads, slower on curves)
• Add Momentum
• Rolls past small bumps (avoids getting stuck in tiny valleys)
• Key Idea:
"Optimizers help gradient descent learn faster and smarter!”
• Fun Test:
• "Which optimizer would you use for most problems?”
• Start with Adam — it's like the all-in-one tool that works in most cases!
What is SGD? (Stochastic Gradient Descent)
• SGD is a basic way to train a model: it takes small steps based on randomly picked examples to
improve.
• Updates the model’s weights using one piece of data at a time (or a small
batch).
• Instead of waiting for the whole dataset, it learns faster with quick
guesses.
• It’s a little noisy — sometimes the steps go a bit off, but it usually gets to
the goal.
• It helps the model move towards lower loss, one step at a time.
Real-Life Analogy:
• Imagine learning math by solving one question at a time, checking your
mistake, and adjusting your strategy — that’s what SGD does!
Momentum: The "Rolling Ball" Trick
• Momentum is an optimizer trick that helps the model move faster in the right direction.
• It remembers past steps, so it doesn’t keep changing direction too much.
What It Does:
• Adds speed to gradient descent (remembers past steps)
• Like a ball rolling downhill: Gains momentum → rolls past small bumps!
Why It Helps:
• Escapes tiny valleys (local minima)
• Smoother path → Faster training
• Less wobbly than plain SGD
• Set momentum between 0.8–0.99 (higher = more speed but risk overshooting)
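A minimal sketch of the "rolling ball" update, using one common formulation of momentum. The toy loss (w − 3)², the learning rate, and the momentum value 0.9 are assumptions for illustration.

```python
def gradient(w):
    return 2 * (w - 3)           # slope of the toy loss (w - 3)**2

w = 0.0
velocity = 0.0                   # the "memory" of past steps
learning_rate = 0.1
momentum = 0.9

for step in range(200):
    # Keep most of the previous velocity, then add the new downhill push.
    velocity = momentum * velocity - learning_rate * gradient(w)
    w = w + velocity

print(round(w, 4))
```

Because the velocity carries over between steps, the ball can roll through small bumps, but it can also overshoot and wobble before settling.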
What is Adam Optimizer?
• Adam is a smart optimizer that combines the best of two worlds:
• It has memory like Momentum
• It adapts like RMSProp
• It adjusts the step size automatically for each weight — no need to guess!
• It works great for most deep learning problems — fast and reliable.
• That’s why it’s the go-to choice for many AI models today
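A compact sketch of the Adam update for a single weight. The hyperparameters beta1 = 0.9, beta2 = 0.999, and eps = 1e-8 are the commonly used defaults; the toy loss (w − 3)² and the learning rate are assumptions for illustration.

```python
import math

def gradient(w):
    return 2 * (w - 3)                       # slope of the toy loss (w - 3)**2

w = 0.0
m, v = 0.0, 0.0                              # running averages
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = gradient(w)
    m = beta1 * m + (1 - beta1) * g          # memory of past gradients (Momentum part)
    v = beta2 * v + (1 - beta2) * g * g      # memory of squared gradients (RMSProp part)
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(round(w, 3))
```

In a real network the same update runs for every weight, each with its own m and v, which is how each weight gets its own step size.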
Why Everyone Loves It:
• Fast (gets close to the answer quickly)
• Handles noisy data (no sweat!)
• Works out-of-the-box (just hit go!)
"Adam is like a self-driving car—it adjusts speed and direction automatically!"
What is RMSProp
RMSProp is an optimizer that adapts the step size during training.
• It focuses more on recent gradients (recent changes).
• Helps when training is bumpy or unstable — smooths things out!
• It slows down the steps if the gradients are too big — so it doesn’t
overshoot.
• Great for tricky problems like recurrent neural networks or noisy data.
• Automatically slows down for steep slopes
• Big gradient → Smaller step
• Small gradient → Bigger step
• Like a smart hiker: Short steps on cliffs, long strides on flat ground
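The "big gradient → smaller step" rule comes from dividing the learning rate by a running average of squared gradients. A small sketch, assuming a constant gradient for simplicity and typical values for `decay` and `eps`:

```python
import math

def step_scale(gradient, lr=0.001, decay=0.9, eps=1e-8, steps=100):
    """Return the learning-rate multiplier RMSProp ends up using."""
    cache = 0.0
    for _ in range(steps):
        # Running average that focuses on recent squared gradients.
        cache = decay * cache + (1 - decay) * gradient ** 2
    return lr / (math.sqrt(cache) + eps)

steep = step_scale(10.0)    # big gradients  -> scale shrinks to about lr / 10
flat = step_scale(0.1)      # small gradients -> scale grows to about lr / 0.1
print(steep, flat)
```

The actual weight update is this scale times the gradient, so steep and flat directions end up taking steps of similar size.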
Backpropagation
What is Backpropagation?
• Backpropagation is how a neural network learns from its mistakes.
• First, the network makes a prediction (forward pass).
• Then, it checks how wrong it was using a loss function.
• The mistake (error) is sent backward through the network.
• Each layer gets feedback and adjusts its weights to do better next time!
Real-Life Analogy:
• Imagine you took a test and got some answers wrong.
• The teacher gives your paper back with notes.
• You study what you missed, fix your thinking, and improve!
• That’s backpropagation — the network correcting itself after every quiz.
Why do we need Backpropagation?
Problem:
• A neural network makes a mistake (e.g., calls a cat a dog).
• But how does it know which weights to fix?
Solution: Backpropagation!
Finds the Culprits:
• Calculates which weights caused the error (using gradients).
• Like tracing back "Who messed up?" in a team project.
Updates Smartly:
• Adjusts weights proportionally to their blame.
• Big mistake? Big change. Small mistake? Tiny tweak.
Learns Fast:
• Fixes all layers at once (not just the last one).
How Neural Networks Learn from Mistakes
Step 1: Forward Pass (Guess)
• Data flows forward → Makes prediction (e.g., "80% cat")
• Oops! True label = "dog" → Loss = "How wrong?"
Step 2: Calculate Gradients (Blame)
• Math trick: Chain Rule traces error backward
• Finds which weights caused the mistake
Step 3: Update Weights (Fix)
• Adjusts weights using gradients:
• New Weight = Old Weight − (Learning Rate × Gradient)
• Key Idea: "Backprop is like a teacher grading your test and fixing each mistake!"
Step 1: Make a Guess (Forward Pass)
• The neural network gets some input data (like an image or number).
• It goes layer by layer, doing math and passing the results forward.
• At the end, it makes a prediction (e.g., “This is a 3!”).
• This part is called the forward pass.
Real-Life Analogy:
• Like answering a question on a test — you try your best based on what you
know!
Step 2: Oops! How Wrong Was I?
• After guessing, the model checks how wrong it was using a loss function.
• The bigger the error, the more it needs to fix.
• The mistake is then sent backward through the network to show where things
went wrong.
• This part is called backpropagation.
Real-Life Analogy:
• Like your teacher handing your test back with notes:
“You were close here, but way off there.”
Step 3: Fix the Weights, Try Again!
• The network uses the error to adjust the weights and biases.
• This is done using gradients — they tell which way to go and how far.
• The updated weights are used for the next round of training.
• Over time, the network gets better and better at making predictions!
Real-Life Analogy:
• Like a student fixing their answers after seeing what went wrong — and getting
better with each try!
Backprop vs. Human Learning
Example:
• Math Test:
• Missed a problem? Teacher explains which step was wrong.
• Neural Net:
• Misclassified cat? Backprop fixes exact weights responsible.
Without Backprop?
• Random guesses → Like studying without feedback!
Forward Pass vs Backward Pass
Forward Pass | Backward Pass (Backpropagation)
The model makes a prediction | The model learns from its mistakes
Data flows from input to output | Error flows from output to input
Uses weights to make decisions | Updates weights to fix mistakes
It’s like answering a question | It’s like getting feedback on your answer
Called during training or testing | Happens only during training
Chain Rule
• What It Does:
• Breaks big problems into small steps to find "who’s responsible" for
an error.
• Example:
• If your cake burns, was it the oven temp? The flour mix? The baking time?
How It Works in Neural Nets:
• Forward Pass: Predict (e.g., "Cat!").
• Backward Pass:
• Find error (e.g., "Should be dog!").
• Use chain rule to trace:
∂Loss/∂Weight = ∂Loss/∂Output × ∂Output/∂Weight
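The chain rule can be tried on one weight. The sketch below assumes a single output neuron with a Sigmoid activation and a squared-error loss; the hidden value 3.6, the weight 0.7, and the target 0 are illustrative choices. Here ∂Output/∂Weight is split one step further into (∂Output/∂z) × (∂z/∂Weight).

```python
import math

A_HIDDEN, B_OUT = 3.6, 0.5       # assumed hidden output and output bias

def predict(w):
    z = A_HIDDEN * w + B_OUT
    return 1.0 / (1.0 + math.exp(-z))        # Sigmoid output

def loss(w, target):
    return (predict(w) - target) ** 2        # squared error (assumed loss)

w, target = 0.7, 0.0
y = predict(w)

# Chain rule: dLoss/dw = dLoss/dy * dy/dz * dz/dw
dloss_dy = 2 * (y - target)
dy_dz = y * (1 - y)              # slope of the sigmoid
dz_dw = A_HIDDEN                 # z = A_HIDDEN * w + B_OUT
grad = dloss_dy * dy_dz * dz_dw

# Sanity check: nudge the weight a tiny bit and measure the change in loss.
h = 1e-6
numeric = (loss(w + h, target) - loss(w - h, target)) / (2 * h)
print(round(grad, 5), round(numeric, 5))
```

The two numbers agree, which shows the chain rule gives the same slope as actually nudging the weight.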
Key Idea:
• "Like blaming each chef in a kitchen for a burnt cake—step by step!"
Gradients in Backpropagation: The "Correction Compass"
What They Do:
• Gradients = Directions + Magnitude
• "Which way?" (Should weights increase or decrease?)
• "How much?" (Tiny tweak or big change?)
Why They Matter:
1. Blame Assignment:
• High gradient → "This weight caused a big error!"
• Low gradient → "This weight is fine, ignore it."
2. Smart Updates:
• Adjust weights using: New Weight = Old Weight − (Learning Rate × Gradient)
Quick check: "If gradient = -1.5, does the weight increase or decrease?"
(Answer: Increase! W = W − Learning Rate × (−1.5) = W + 1.5 × Learning Rate)
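The quick check in numbers, with an assumed starting weight of 0.5 and an assumed learning rate of 0.1:

```python
old_weight = 0.5        # assumed starting value
learning_rate = 0.1     # assumed
gradient = -1.5

new_weight = old_weight - learning_rate * gradient   # 0.5 - (0.1 * -1.5)
print(round(new_weight, 2))                          # → 0.65
```

A negative gradient means the loss goes down as the weight goes up, so the update pushes the weight higher.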
Challenges in Backpropagation
1. Vanishing Gradients
Problem: Tiny gradients → Updates too small → Network stops learning.
Why? Deep networks + sigmoid/tanh activations.
Fix: Use ReLU or Residual Connections.
2. Exploding Gradients
Problem: Huge gradients → Wild weight swings → Crashes training.
Why? Large weights or deep networks.
Fix: Gradient Clipping (like a speed limit).
3. Local Minima & Saddle Points
Problem: Gets stuck in "fake" low spots.
Fix: Momentum (roll past small bumps) or Adam optimizer.
4. Slow Training
Problem: Too many calculations → Takes forever.
Fix: GPUs + Mini-batches.
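Gradient clipping (the "speed limit" for exploding gradients) can be sketched in NumPy. This version clips by the overall length (norm) of the gradient, which is one common approach; the limit of 5.0 is an assumed example value.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Shrink the gradient if it is longer than max_norm; keep its direction."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

huge = np.array([30.0, 40.0])          # length 50: way over the limit
small = np.array([0.3, 0.4])           # length 0.5: left alone

print(clip_by_norm(huge, 5.0), clip_by_norm(small, 5.0))
```

The clipped gradient still points the same way, so the update keeps its direction but can no longer jump wildly.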
Project
Title - Forward & Backward Pass in Neural Networks — with NumPy
Objective –
• Manual implementation of a 2-layer NN
• How gradients flow backward to update weights
• How ReLU and affect learning
• Training using gradient descent
Project File - [Link]
Epochs, Batches, Iterations
What is an Epoch?
• An epoch means one full round through the entire training dataset.
• The model sees every single example once and tries to learn from it.
• We usually train for many epochs so the model can keep getting better.
• The more epochs, the more chances the model has to practice and improve.
• But training for too many epochs can make the model memorize instead of learn
(that’s called overfitting!).
Real-Life Analogy:
• Think of an epoch like reading your textbook cover to cover once.
• You might need to read it multiple times to really understand and remember
everything!
What is a Batch?
• A batch is a small group of training examples used at once.
• Instead of sending the whole dataset at once, we send it bit by bit.
• It helps the model train faster and use less memory.
• The batch is like a bite-sized portion of the data.
• The size of each group is called the batch size (e.g., 32, 64).
Real-Life Analogy:
• Imagine carrying groceries
• Instead of carrying all the bags at once (too heavy!),
• You carry them in smaller batches — one trip at a time.
What is an Iteration?
• An iteration is one update step during training.
• Every time the model processes one batch, that’s one iteration.
• It updates the weights based on what it learned from that batch.
• In one epoch, there are many iterations (one per batch).
• The model gets a little better with each iteration!
Real-Life Analogy:
• If an epoch is reading the whole book,
an iteration is reading just one page and learning from it.
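How the three terms fit together, using assumed numbers (1,000 training examples, batch size 32, 5 epochs):

```python
import math

num_examples = 1000     # assumed dataset size
batch_size = 32
epochs = 5

# One iteration per batch; the last batch may be smaller than the rest.
iterations_per_epoch = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (iterations_per_epoch - 1) * batch_size
total_iterations = iterations_per_epoch * epochs

print(iterations_per_epoch, last_batch_size, total_iterations)   # → 32 8 160
```

So in this setup the weights are updated 32 times per epoch and 160 times over the whole training run.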