AI for Computer Vision - Detailed Notes
(RGPV 7th Sem)
Unit I: Introduction to Image Formation and Processing
Unit II: Feature Detection, Matching and Segmentation
Unit III: Feature-based Alignment & Motion Estimation (2D/3D)
Unit IV: 3D Reconstruction Techniques
Unit V: Image-based Rendering and Recognition
✅ **Unit I: Introduction to Image Formation and Processing**
--- THEORY NOTES ---
• Image Formation Basics: An image is a 2D projection of a 3D scene, formed when light from the scene passes through the camera's lens (or aperture) and is recorded on the image plane (sensor).
• Camera Model: Describes how light rays map from the world to the image plane.
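- Pinhole Camera Model: Simplest projection model; every ray passes through a single point (the optical centre), so a 3D point (X, Y, Z) maps to the image point (x, y) = (f·X/Z, f·Y/Z), where f is the focal length.
- Perspective Projection: Objects farther away appear smaller, because image coordinates are divided by the depth Z.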
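• Radiometry & Photometry: Radiometry measures light energy physically (e.g., radiance, irradiance); photometry measures it as perceived by the human eye. The amount of light reaching each sensor element determines its pixel intensity value.
• Image Types: Binary, Grayscale, Color (represented in color spaces such as RGB or HSV).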
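A minimal NumPy sketch of pinhole projection, assuming an intrinsic matrix K with focal length f (in pixels) and principal point (cx, cy); the function name `project_points` and all numbers are illustrative, not from the syllabus.

```python
import numpy as np

def project_points(points_3d, f, cx, cy):
    """Project Nx3 camera-frame points (X, Y, Z) to Nx2 pixel coordinates
    with the pinhole model: x = f*X/Z + cx, y = f*Y/Z + cy."""
    K = np.array([[f, 0.0, cx],
                  [0.0, f, cy],
                  [0.0, 0.0, 1.0]])         # intrinsic matrix
    homog = (K @ points_3d.T).T             # rows are (f*X + cx*Z, f*Y + cy*Z, Z)
    return homog[:, :2] / homog[:, 2:3]     # perspective divide by depth Z

# Same (X, Y) offset at two depths: the farther point lands closer to the image centre
pts = np.array([[1.0, 1.0, 2.0],
                [1.0, 1.0, 4.0]])
out = project_points(pts, f=500.0, cx=320.0, cy=240.0)
print(out)
```

Doubling Z halves the offset from the principal point (250 px becomes 125 px), which is exactly the "farther objects appear smaller" effect.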
--- IMAGE PROCESSING PIPELINE ---
Image Acquisition ➝ Preprocessing ➝ Feature Extraction ➝ Analysis ➝ Recognition/Interpretation
--- IMAGE FILTERING TECHNIQUES ---
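1️⃣ Spatial Domain Filters: Operate directly on pixel values (typically by sliding a small kernel over the image)
• Smoothing Filters → Noise Reduction (Mean, Gaussian)
• Sharpening Filters → Highlight Edges (Laplacian, High-Boost)
2️⃣ Frequency Domain Processing: Apply the Fourier Transform, modify selected frequencies, then apply the inverse transform
• Low-Pass Filters → Remove high-frequency noise (smoothing)
• High-Pass Filters → Enhance edges and fine detail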
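A minimal NumPy sketch of spatial-domain filtering: one hypothetical helper `filter2d` slides a 3×3 kernel over the image, used once with a mean (smoothing) kernel and once with a Laplacian (sharpening) kernel. The toy image is an assumption; in OpenCV the equivalent calls would be cv2.filter2D, cv2.blur or cv2.GaussianBlur.

```python
import numpy as np

def filter2d(img, kernel):
    """Slide a small kernel over the image (correlation, same output size, edge padding)."""
    kh, kw = kernel.shape
    padded = np.pad(img.astype(float), ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

mean_kernel = np.ones((3, 3)) / 9.0                  # smoothing: average of 3x3 neighbourhood
laplacian = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=float)       # second-derivative operator

img = np.zeros((5, 5))
img[2, 2] = 9.0                                      # single bright "noise" pixel
smoothed = filter2d(img, mean_kernel)                # spike spread over its 3x3 neighbourhood
sharpened = img - filter2d(img, laplacian)           # Laplacian sharpening exaggerates contrast
print(smoothed[2, 2], sharpened[2, 2])
```

Smoothing spreads the spike's energy over its neighbours (noise reduction), while subtracting the Laplacian makes the spike brighter and its neighbours darker (edge/detail enhancement).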
--- EDGE DETECTION ---
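• Purpose: Detect object boundaries, i.e., locations where image intensity changes sharply
• Operators: Sobel, Prewitt, Roberts, Canny
• Canny Edge Detector Steps: Gaussian Smoothing → Gradient (magnitude & direction) → Non-max suppression → Hysteresis thresholding (high and low thresholds)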
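The gradient step used by Sobel (and inside Canny) can be sketched in NumPy as below; the helper name `sobel_magnitude` and the step-edge test image are illustrative assumptions. For real use, OpenCV offers cv2.Sobel and cv2.Canny.

```python
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude from the 3x3 Sobel operators (valid region only)."""
    img = img.astype(float)
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # d/dx
    ky = kx.T                                                        # d/dy
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            patch = img[i:i + h - 2, j:j + w - 2]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    return np.hypot(gx, gy)

# Vertical step edge: dark left half (0), bright right half (10)
img = np.zeros((6, 6))
img[:, 3:] = 10.0
mag = sobel_magnitude(img)
print(mag[2])   # strong response only at the two columns next to the step
```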
--- IMAGE TRANSFORMS ---
• Fourier Transform → represents image in frequency domain
• Wavelets → Multi-resolution image analysis
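Frequency-domain low-pass filtering with the Fourier Transform, as a NumPy sketch: transform, keep only frequencies within a radius of the DC term (an "ideal" low-pass mask, chosen here for simplicity), transform back. The helper name `ideal_lowpass`, the radius and the synthetic noisy image are assumptions.

```python
import numpy as np

def ideal_lowpass(img, radius):
    """Keep only frequencies within `radius` of the DC term, then transform back."""
    F = np.fft.fftshift(np.fft.fft2(img))          # spectrum with DC at the centre
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

rng = np.random.default_rng(0)
clean = np.tile(np.sin(np.linspace(0, 2 * np.pi, 64, endpoint=False)), (64, 1))  # smooth pattern
noisy = clean + 0.5 * rng.standard_normal((64, 64))                              # fine-grained noise
filtered = ideal_lowpass(noisy, radius=4)
err_noisy = np.abs(noisy - clean).mean()
err_filtered = np.abs(filtered - clean).mean()
print(err_noisy, err_filtered)   # error drops after low-pass filtering
```

The smooth pattern lives at a very low frequency and survives the mask, while most of the white noise energy sits at high frequencies and is discarded. A sharp cut-off like this can cause ringing on real images, which is why Gaussian low-pass masks are often preferred.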
--- PYRAMIDS ---
• Gaussian Pyramid: Repeated smoothing + downsampling
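• Laplacian Pyramid: Each level is a Gaussian level minus the upsampled next (coarser) level, so it stores the edge/detail information lost by downsampling; the original image can be reconstructed from it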
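A minimal NumPy sketch of both pyramids. Simplifying assumptions: a 5-tap binomial filter approximates the Gaussian, and upsampling is nearest-neighbour (OpenCV's cv2.pyrDown/cv2.pyrUp use a 5×5 Gaussian kernel instead). All helper names are illustrative.

```python
import numpy as np

def blur(img):
    """Separable 5-tap binomial filter, an approximation of a Gaussian."""
    k = np.array([1, 4, 6, 4, 1]) / 16.0
    p = np.pad(img, 2, mode="reflect")
    rows = sum(k[i] * p[:, i:i + img.shape[1]] for i in range(5))    # filter along x
    return sum(k[i] * rows[i:i + img.shape[0], :] for i in range(5))  # filter along y

def upsample(img, shape):
    """Nearest-neighbour 2x upsampling, cropped to the target shape."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)[:shape[0], :shape[1]]

def build_pyramids(img, levels):
    gaussian = [img.astype(float)]
    for _ in range(levels - 1):
        gaussian.append(blur(gaussian[-1])[::2, ::2])   # smooth, then keep every 2nd pixel
    laplacian = [g - upsample(g_next, g.shape)          # detail lost between two levels
                 for g, g_next in zip(gaussian[:-1], gaussian[1:])]
    laplacian.append(gaussian[-1])                      # coarsest level stored as-is
    return gaussian, laplacian

img = np.random.default_rng(1).random((32, 32))
G, L = build_pyramids(img, levels=4)
print([g.shape for g in G])   # [(32, 32), (16, 16), (8, 8), (4, 4)]

# Reconstruction: start from the coarsest level and add back the detail
recon = L[-1]
for lap in reversed(L[:-1]):
    recon = lap + upsample(recon, lap.shape)
print(np.allclose(recon, img))   # True
```

Reconstruction is exact whatever upsampling is used, because each Laplacian level was defined with the same upsampling step.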
--- OPTIMIZATION IN VISION ---
Many vision tasks, such as feature matching and segmentation, are posed as minimizing an energy (cost) function.
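As a small illustration (line fitting is chosen here as an example; it is not named in the notes), the sketch below defines a least-squares energy E(m, c) = Σ (m·xᵢ + c − yᵢ)² and minimizes it twice: in closed form with np.linalg.lstsq and iteratively with plain gradient descent. The data points and step size are assumptions.

```python
import numpy as np

# Energy E(m, c) = sum_i (m*x_i + c - y_i)^2 for fitting a line y = m*x + c
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])          # roughly y = 2x + 1 with small noise

A = np.column_stack([x, np.ones_like(x)])        # each row is [x_i, 1]
(m, c), *_ = np.linalg.lstsq(A, y, rcond=None)   # closed-form minimiser of E
print(m, c)

# The same minimum reached iteratively by gradient descent on E
params = np.zeros(2)
for _ in range(5000):
    residual = A @ params - y
    params -= 0.01 * 2 * A.T @ residual          # gradient of E is 2 A^T (A p - y)
print(params)
```

Both routes reach the same minimum (m ≈ 1.96, c ≈ 1.10); iterative methods matter when the energy has no closed-form solution.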
--- SHORT EXAM QUESTIONS ---
Q1: Define Pinhole Camera Model.
Q2: Differentiate between spatial and frequency domain filters.
Q3: Write steps of Canny edge detection.
--- LONG EXAM QUESTIONS ---
Q1: Explain image formation process with camera model and geometry.
Q2: Explain image filtering with suitable examples and diagrams.
✅ **Unit II: Feature Detection, Matching & Segmentation**
--- THEORY NOTES ---
⭐ IMP: Feature Detection identifies distinctive key points (e.g., corners, blobs) that can be reliably re-detected despite changes in scale, rotation or lighting.
Common detectors: Harris Corner, SIFT, SURF, FAST, ORB
➡ Harris Corner Detector
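• A corner is a point where shifting a small window in any direction causes a large intensity change (an edge changes in only one direction; a flat region in none)
• Uses the auto-correlation (second-moment) matrix M of image gradients; corner response R = det(M) − k·(trace M)², with k typically 0.04–0.06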
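A NumPy sketch of the Harris response R = det(M) − k·(trace M)². Assumptions: k = 0.05, a plain 3×3 box window instead of Gaussian weighting, np.gradient for derivatives, and illustrative helper names. OpenCV's built-in version is cv2.cornerHarris.

```python
import numpy as np

def window_sum(a, r=1):
    """Sum of each (2r+1)x(2r+1) neighbourhood (zero padding)."""
    p = np.pad(a, r)
    h, w = a.shape
    return sum(p[i:i + h, j:j + w] for i in range(2 * r + 1) for j in range(2 * r + 1))

def harris_response(img, k=0.05):
    Iy, Ix = np.gradient(img.astype(float))   # np.gradient returns d/drow, d/dcol
    Sxx = window_sum(Ix * Ix)                  # entries of M summed over a 3x3 window
    Syy = window_sum(Iy * Iy)
    Sxy = window_sum(Ix * Iy)
    det_M = Sxx * Syy - Sxy ** 2
    trace_M = Sxx + Syy
    return det_M - k * trace_M ** 2           # R > 0 corner, R < 0 edge, |R| small flat

img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0                         # bright square on a dark background
R = harris_response(img)
r, c = np.unravel_index(np.argmax(R), R.shape)
print(r, c)                                   # location of the strongest corner response
```

Along the middle of a side only one gradient direction is present, so det(M) = 0 and R is negative; only near the square's corners are both eigenvalues of M large, giving positive R.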
➡ SIFT (Scale Invariant Feature Transform) ⭐ IMP
• Detects scale and rotation invariant features
• Steps: Scale-space extrema detection (Difference of Gaussians) → Keypoint localization → Orientation assignment → Descriptor generation (128-dimensional vector)
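The first SIFT step, Difference-of-Gaussians (DoG) scale space, can be sketched in NumPy as below. Assumptions: a single octave, σ₀ = 1.6 (the value commonly quoted for SIFT), k = √2 chosen for illustration, and a synthetic disc-shaped blob; keypoint refinement, orientation and descriptors are not shown. Helper names are hypothetical.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with a kernel truncated at 3 sigma (reflect padding)."""
    r = int(np.ceil(3 * sigma))
    x = np.arange(-r, r + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    p = np.pad(img, r, mode="reflect")
    tmp = sum(k[i] * p[:, i:i + img.shape[1]] for i in range(2 * r + 1))
    return sum(k[i] * tmp[i:i + img.shape[0], :] for i in range(2 * r + 1))

def dog_stack(img, sigma0=1.6, k=2 ** 0.5, n=4):
    """Blur at sigma0 * k^i and subtract neighbouring levels: D_i = G(k*s) - G(s)."""
    blurred = [gaussian_blur(img, sigma0 * k ** i) for i in range(n + 1)]
    return np.stack([b2 - b1 for b1, b2 in zip(blurred[:-1], blurred[1:])])

# A bright disc of radius 3 px centred at (20, 20)
yy, xx = np.mgrid[:41, :41]
img = ((yy - 20) ** 2 + (xx - 20) ** 2 <= 9).astype(float)
D = dog_stack(img)
scale, y, x = np.unravel_index(np.argmax(np.abs(D)), D.shape)
print(scale, y, x)   # the strongest DoG extremum sits at the blob centre
```

In full SIFT, a candidate keypoint is a pixel that is larger or smaller than all 26 neighbours in the 3×3×3 (x, y, scale) neighbourhood of this stack.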
➡ SURF (Speeded-Up Robust Features)
• Faster than SIFT using box filters and integral images
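The integral image is what makes SURF's box filters fast: after one cumulative-sum pass, the sum over any rectangle costs four lookups regardless of its size. A minimal NumPy sketch (helper names are illustrative; OpenCV provides cv2.integral):

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x]; a leading zero row/column simplifies lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] with 4 lookups, whatever the box size."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(1, 26, dtype=float).reshape(5, 5)
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 4, 4), img[1:4, 1:4].sum())   # both 117.0
```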
➡ FAST & ORB
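• FAST is an extremely fast corner detector: a pixel is a corner if enough contiguous pixels on a 16-pixel circle around it are all brighter, or all darker, than it by a threshold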
• ORB combines FAST + BRIEF descriptors for real-time apps (e.g., robotics)
--- FEATURE MATCHING ---
⭐ IMP: Used to find correspondences between images
Methods: SSD (Sum of Squared Differences), NCC (Normalized Cross Correlation), Hamming distance for binary descriptors such as ORB
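A NumPy sketch of the three similarity measures, plus brute-force nearest-neighbour matching of binary descriptors with Hamming distance. The patches and the random 32-byte (256-bit) descriptors are synthetic assumptions; in OpenCV, cv2.BFMatcher with cv2.NORM_HAMMING does the matching for ORB.

```python
import numpy as np

def ssd(a, b):
    return np.sum((a - b) ** 2)                         # 0 for identical patches

def ncc(a, b):
    a, b = a - a.mean(), b - b.mean()
    return np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))   # in [-1, 1]

def hamming(d1, d2):
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())   # number of differing bits

def match_binary(desc_a, desc_b):
    """For each descriptor in A, the index of the closest descriptor in B."""
    return [min(range(len(desc_b)), key=lambda j: hamming(d, desc_b[j])) for d in desc_a]

patch = np.array([[10.0, 20.0], [30.0, 40.0]])
brighter = 2 * patch + 5                                      # same pattern, different lighting
print(ssd(patch, brighter), ncc(patch, brighter))             # SSD is large, NCC is 1.0

rng = np.random.default_rng(0)
desc_b = rng.integers(0, 256, size=(5, 32), dtype=np.uint8)   # five 256-bit descriptors
desc_a = desc_b[[3, 0]].copy()
desc_a[0, 0] ^= 1                                             # flip one bit: still closest to #3
matches = match_binary(desc_a, desc_b)
print(matches)                                                # [3, 0]
```

NCC ignores the brightness/contrast change that makes SSD large, which is why NCC is preferred under lighting changes.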
➡ RANSAC (Random Sample Consensus) ⭐ Most Asked
• Removes false matches (outliers) by repeatedly fitting a model to a random minimal sample and keeping the model with the most inliers
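RANSAC on the simplest model, a 2D line, as a NumPy sketch (the threshold, iteration count and synthetic "false matches" are assumptions). In feature matching the model would instead be a homography or fundamental matrix, e.g. cv2.findHomography with the cv2.RANSAC flag.

```python
import numpy as np

def ransac_line(points, n_iters=200, threshold=0.5, seed=0):
    """Fit y = m*x + c robustly: sample 2 points, count inliers, keep the best model."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = points[rng.choice(len(points), 2, replace=False)]
        if x1 == x2:
            continue                                    # vertical pair: skip this sample
        m = (y2 - y1) / (x2 - x1)
        c = y1 - m * x1
        residuals = np.abs(points[:, 1] - (m * points[:, 0] + c))
        inliers = residuals < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all inliers of the best model (least squares)
    m, c = np.polyfit(points[best_inliers, 0], points[best_inliers, 1], 1)
    return m, c, best_inliers

x = np.arange(20, dtype=float)
good = np.column_stack([x, 3 * x + 2])                              # points on y = 3x + 2
bad = np.array([[2.0, 40.0], [5.0, -10.0], [8.0, 70.0], [15.0, 0.0]])  # false matches
m, c, inliers = ransac_line(np.vstack([good, bad]))
print(round(m, 3), round(c, 3), int(inliers.sum()))                 # slope ≈ 3, intercept ≈ 2, 20 inliers
```

A plain least-squares fit over all 24 points would be pulled away by the four outliers; RANSAC ignores them because no model through them gathers many inliers.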
--- IMAGE SEGMENTATION ---
⭐ IMP: Process of dividing image into meaningful regions
➡ Thresholding-based Segmentation
• Otsu Method: Finds optimal threshold by maximizing variance between classes
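A NumPy sketch of Otsu's method for 8-bit images: try every threshold t and keep the one maximizing the between-class variance w₀·w₁·(μ₀ − μ₁)². The two-cluster synthetic image is an assumption; OpenCV's version is cv2.threshold with the cv2.THRESH_OTSU flag.

```python
import numpy as np

def otsu_threshold(img):
    """Return the threshold t (0-255) maximising between-class variance for uint8 images."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                       # grey-level probabilities
    levels = np.arange(256)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):                     # class 0: levels < t, class 1: levels >= t
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:t] * p[:t]).sum() / w0
        mu1 = (levels[t:] * p[t:]).sum() / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t

rng = np.random.default_rng(0)
dark = rng.normal(60, 10, 500)
bright = rng.normal(180, 10, 500)
img = np.clip(np.concatenate([dark, bright]), 0, 255).astype(np.uint8)
t = otsu_threshold(img)
mask = img >= t                                 # foreground = bright pixels
print(t)                                        # falls in the gap between the two clusters
```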
➡ Region-based Segmentation
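• Region growing (start from seed pixels and add similar neighbours); Split-and-merge (split non-uniform regions, merge similar adjacent ones)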
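Region growing as a breadth-first search, sketched in NumPy. The similarity rule (4-connected neighbours within `tol` of the seed intensity) is an assumption; other variants compare against the running region mean. The helper name and toy image are illustrative.

```python
from collections import deque

import numpy as np

def region_grow(img, seed, tol):
    """Grow a region from `seed`, adding 4-connected neighbours whose intensity
    differs from the seed value by at most `tol`."""
    h, w = img.shape
    region = np.zeros((h, w), dtype=bool)
    region[seed] = True
    queue = deque([seed])
    seed_val = float(img[seed])
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if not (0 <= ny < h and 0 <= nx < w) or region[ny, nx]:
                continue
            if abs(float(img[ny, nx]) - seed_val) <= tol:
                region[ny, nx] = True
                queue.append((ny, nx))
    return region

img = np.array([[10, 12, 90, 91],
                [11, 13, 92, 90],
                [10, 95, 93, 12],
                [12, 11, 10, 11]])
grown = region_grow(img, (0, 0), tol=5)
print(grown.astype(int))
```

The dark pixel at the right of row 2 joins the region only because it is connected through the bottom row; region growing respects connectivity, unlike global thresholding.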
➡ Clustering-based Segmentation ⭐ IMP
• K-Means clustering: groups pixels based on similarity
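A minimal NumPy sketch of K-means segmentation, assuming each pixel's RGB value is its feature vector (position is ignored). The tiny 2×3 image and the helper name `kmeans` are illustrative; OpenCV provides cv2.kmeans.

```python
import numpy as np

def kmeans(values, k, n_iters=20, seed=0):
    """Plain k-means on an (N, D) array: assign to nearest centre, then recompute centres."""
    rng = np.random.default_rng(seed)
    centres = values[rng.choice(len(values), k, replace=False)]
    for _ in range(n_iters):
        dists = np.linalg.norm(values[:, None, :] - centres[None, :, :], axis=2)  # (N, k)
        labels = dists.argmin(axis=1)
        centres = np.array([values[labels == j].mean(axis=0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return labels, centres

# Tiny 2x3 "image" of RGB pixels: three reddish and three bluish
img = np.array([[[250, 10, 10], [240, 20, 15], [10, 10, 245]],
                [[245, 5, 20], [15, 25, 250], [5, 15, 240]]], dtype=float)
labels, centres = kmeans(img.reshape(-1, 3), k=2)
print(labels.reshape(img.shape[:2]))   # pixels with the same label form one segment
```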
➡ Graph-based Segmentation ⭐ IMP
• Graph Cut: minimizes cut cost to separate foreground & background
➡ Watershed Segmentation ⭐ Important for diagrams
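• Treats the gradient-magnitude image as a topographic surface; when it is "flooded" from its local minima, the lines where different basins meet (watershed lines) become the region boundaries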
--- SHORT EXAM QUESTIONS ---
Q1: Define feature detection (IMP)
Q2: Difference between SIFT and SURF
Q3: What is RANSAC? Why used? ⭐
--- LONG EXAM QUESTIONS (Repeated in RGPV) ---
Q1: Explain SIFT algorithm with steps ⭐⭐
Q2: Explain image segmentation techniques with examples ⭐⭐
✅ **Unit III: Feature-based Alignment & Motion Estimation (2D/3D)**
--- THEORY NOTES ---
➡ ⭐ IMP: Pose Estimation
• Determines camera location + orientation relative to the object
• Uses feature correspondences and geometric constraints
➡ Triangulation
• Recovers a 3D point by intersecting the back-projected rays of its 2D projections in two or more views (camera matrices known)
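Linear (DLT) triangulation, sketched in NumPy: each view contributes two equations x·(p₃ᵀX) − p₁ᵀX = 0 and y·(p₃ᵀX) − p₂ᵀX = 0, and the 3D point is the SVD null vector. The intrinsics, the 0.2-unit baseline and the test point are assumptions; OpenCV's counterpart is cv2.triangulatePoints.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two 3x4 camera matrices
    and its pixel coordinates x1, x2 in the two views."""
    A = np.array([x1[0] * P1[2] - P1[0],
                  x1[1] * P1[2] - P1[1],
                  x2[0] * P2[2] - P2[0],
                  x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                      # null vector of A (least-squares solution)
    return X[:3] / X[3]             # back from homogeneous coordinates

def project(P, X):
    x = P @ X
    return x[:2] / x[2]

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                  # camera at the origin
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])  # camera centre at x = 0.2

X_true = np.array([0.1, -0.05, 2.0, 1.0])
X_est = triangulate(P1, P2, project(P1, X_true), project(P2, X_true))
print(X_est)                                                       # ≈ [0.1, -0.05, 2.0]
```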
➡ ⭐ IMP: Structure from Motion (SfM)
• Recovers 3D scene + camera motion from multiple images
• Used in 3D mapping, AR, drones
➡ ⭐ Most Asked: Optical Flow
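• Estimates the apparent motion (u, v) of each pixel between consecutive frames from intensity changes
• Assumption (brightness constancy): I(x, y, t) = I(x + u, y + v, t + 1); linearizing gives the optical flow constraint Ix·u + Iy·v + It = 0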
Popular Methods:
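• Lucas-Kanade → Sparse estimation (assumes constant flow within a small window; solves by least squares)
• Horn-Schunck → Dense estimation (adds a global smoothness constraint on the flow field)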
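The core of Lucas-Kanade for one window, as a NumPy sketch: stack the constraint Ix·u + Iy·v + It = 0 for every pixel in the window and solve for (u, v) by least squares. The smooth synthetic frames shifted by (0.5, 0.25) px are an assumption; no pyramid or iteration is used here (cv2.calcOpticalFlowPyrLK adds both).

```python
import numpy as np

def lucas_kanade_window(I1, I2, win):
    """Estimate one flow vector (u, v) for a window by least squares on
    Ix*u + Iy*v + It = 0 (one equation per pixel)."""
    Iy, Ix = np.gradient(I1)                       # spatial derivatives (rows = y, cols = x)
    It = I2 - I1                                   # temporal derivative
    A = np.column_stack([Ix[win].ravel(), Iy[win].ravel()])
    b = -It[win].ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)  # solves (A^T A) [u, v] = A^T b
    return u, v

# Smooth synthetic frame, then the same pattern moved by (u, v) = (0.5, 0.25) pixels
y, x = np.mgrid[0:40, 0:40].astype(float)
frame1 = np.sin(0.3 * x) + np.cos(0.2 * y)
frame2 = np.sin(0.3 * (x - 0.5)) + np.cos(0.2 * (y - 0.25))
win = (slice(5, 35), slice(5, 35))                 # interior window (avoids border derivatives)
u, v = lucas_kanade_window(frame1, frame2, win)
print(round(u, 2), round(v, 2))                    # close to 0.5 and 0.25
```

The estimate is only approximate because the constraint is a first-order linearization; this is why practical LK works on small motions or uses image pyramids.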
➡ Bundle Adjustment ⭐⭐ Highly Asked in Exams
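• Optimization technique in SfM that jointly refines the 3D points and camera parameters by minimizing the total reprojection error (distance between observed and reprojected image points)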
➡ ⭐ Most Repeated: Camera Calibration
• Estimation of intrinsic (focal length, principal point, lens distortion) + extrinsic (rotation, translation) parameters of the camera
➡ Layered Motion Estimation
• Separates motion into different layers for better tracking
--- ✅ OpenCV Code Examples (IMP for Viva) ---
➡ Lucas-Kanade Optical Flow
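import cv2
cap = cv2.VideoCapture(0)  # default webcam
lk_params = dict(winSize=(15, 15), maxLevel=2)  # 15x15 window, 2 pyramid levels
ret, old_frame = cap.read()
old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)
p0 = cv2.goodFeaturesToTrack(old_gray, 100, 0.3, 7)  # Shi-Tomasi corners to track
while True:
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal LK: new positions p1 of points p0; st == 1 where tracking succeeded
    p1, st, err = cv2.calcOpticalFlowPyrLK(old_gray, gray, p0, None, **lk_params)
    good = p1[st == 1]
    for x, y in good:
        cv2.circle(frame, (int(x), int(y)), 3, (0, 255, 0), -1)
    cv2.imshow("Optical Flow", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
    old_gray, p0 = gray, good.reshape(-1, 1, 2)
cap.release()
cv2.destroyAllWindows()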
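➡ Camera Calibration (Chessboard Example)
import glob
import cv2
import numpy as np
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)  # stop rule for sub-pixel refinement
objp = np.zeros((6*7, 3), np.float32)  # 3D corners of a 7x6 inner-corner chessboard on the Z = 0 plane
objp[:, :2] = np.mgrid[0:7, 0:6].T.reshape(-1, 2)
objpoints, imgpoints = [], []  # matching 3D world points / 2D image points
for fname in glob.glob("calib/*.jpg"):  # hypothetical folder of chessboard photos
    gray = cv2.cvtColor(cv2.imread(fname), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, (7, 6), None)
    if found:
        objpoints.append(objp)
        imgpoints.append(cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria))
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(objpoints, imgpoints, gray.shape[::-1], None, None)  # K, dist = intrinsics; rvecs, tvecs = extrinsics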
--- SHORT EXAM QUESTIONS ---
Q1: What is optical flow? ⭐
Q2: Define structure from motion (SfM).
Q3: What is camera calibration? ⭐
--- LONG EXAM QUESTIONS (Repeated in RGPV) ---
Q1: Explain optical flow with its different methods ⭐⭐
Q2: Explain Structure from Motion with example diagrams ⭐⭐
✅ **Unit IV: 3D Reconstruction Techniques**
--- THEORY NOTES ---
➡ ⭐ Stereo Vision (IMP + Diagrams)
• Uses two or more camera views to estimate depth
• Works like human eyes → Binocular disparity enables depth estimation
• Steps: Feature Matching → Disparity Map → Depth Map Computation
📌 Depth Formula:
Depth (Z) = (f × B) / Disparity (d)
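where f = focal length (pixels), B = baseline (distance between the two camera centres), d = x_L − x_R (horizontal shift of a matched point between the two images, in pixels); Z comes out in the units of B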
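A worked example of Z = (f × B) / d in Python; the focal length, baseline and disparities below are illustrative numbers, not from the notes.

```python
def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Z = f * B / d; with f and d in pixels, Z comes out in the units of B (metres here)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive (zero disparity = point at infinity)")
    return f_px * baseline_m / disparity_px

# Illustrative numbers: f = 700 px, B = 0.12 m
depths = {d: depth_from_disparity(700, 0.12, d) for d in (84, 42, 21)}
for d, z in depths.items():
    print(d, z)
```

Halving the disparity doubles the depth (84 px → 1 m, 42 px → 2 m, 21 px → 4 m). Since depth is inversely proportional to disparity, a one-pixel matching error matters much more for distant points.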
📝 Diagram (Concept - Text Representation):
[Camera-L] ---- Object ---- [Camera-R]
Matching points shift → Disparity → Depth
➡ ⭐ Shape-from-X Methods (Mostly Asked)
• Recover 3D shape using different cues:
1️⃣ Shape-from-Shading
• Uses variations in brightness to estimate surface orientation
• Assumes: Single light source, uniform material
2️⃣ Shape-from-Silhouette
• Intersecting the silhouettes (outlines) from multiple calibrated views gives the visual hull
3️⃣ Shape-from-Motion (SfM) ⭐ Repeated in Exams
• Uses motion between frames to build 3D structure
➡ 3D Point Cloud Representation
• Collection of 3D points representing object shape
• Used in robotics, AR, autonomous navigation
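A common way to obtain a point cloud is to back-project a depth map (e.g., from stereo) through the pinhole model: X = (u − cx)·Z / f, Y = (v − cy)·Z / f. A NumPy sketch, assuming known intrinsics f, cx, cy and a toy depth map where 0 means "no depth"; the helper name is illustrative.

```python
import numpy as np

def depth_to_point_cloud(depth, f, cx, cy):
    """Back-project every pixel (u, v) with depth Z to X = (u - cx) Z / f, Y = (v - cy) Z / f."""
    v, u = np.indices(depth.shape)            # row index = v (y), column index = u (x)
    Z = depth
    X = (u - cx) * Z / f
    Y = (v - cy) * Z / f
    points = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]           # drop pixels with no valid depth

depth = np.array([[2.0, 2.0, 0.0],
                  [2.0, 4.0, 2.0]])          # toy depth map in metres; 0 = missing
cloud = depth_to_point_cloud(depth, f=2.0, cx=1.0, cy=0.5)
print(cloud.shape)                            # (5, 3): five valid pixels become five 3D points
```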
➡ Volumetric Reconstruction
• Represents shape using voxels (3D pixels)
• Example: CT scan volume model
➡ Surface Reconstruction ⭐ Viva Question
• Creates surface mesh from point cloud
• Output: Triangular mesh
--- APPLICATIONS ---
• AR/VR, Robotics, Medical Imaging, Gaming, Drone Mapping
--- SHORT EXAM QUESTIONS ---
Q1: Define disparity in stereo vision. ⭐
Q2: What is point cloud?
Q3: Define volumetric reconstruction.
--- LONG EXAM QUESTIONS (Frequently Asked in RGPV) ---
Q1: Explain stereo vision with depth estimation formula + diagram ⭐⭐
Q2: Describe different Shape-from-X methods ⭐⭐
✅ **Unit V: Deep Learning for Computer Vision**
1️⃣ Introduction to Deep Learning for Vision
Deep Learning uses neural networks with multiple layers to learn image patterns. It learns features such as edges, textures and object parts automatically from data, unlike traditional computer vision, where features are designed by hand.
2️⃣Convolutional Neural Networks (CNN)
CNNs are the backbone of modern computer vision. They consist of layers that learn
hierarchical features:
• Low-level: edges, corners
• Mid-level: shapes
• High-level: full objects
Main Components of CNN:
1️⃣Convolution Layer – extracts features using filters/kernels
2️⃣Activation Function (ReLU) – introduces non-linearity
3️⃣Pooling Layer – reduces spatial size (Max Pooling is common)
4️⃣Fully Connected Layer – final classification
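The first three components can be traced by hand with a NumPy sketch: a 3×3 convolution, ReLU, and 2×2 max pooling. The vertical-edge kernel is hand-set here for illustration (a real CNN learns its kernels), and the helper names are hypothetical.

```python
import numpy as np

def conv2d(x, k):
    """'Valid' 2D convolution as used in CNNs (really cross-correlation, no kernel flip)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return np.array([[np.sum(x[i:i + kh, j:j + kw] * k) for j in range(ow)] for i in range(oh)])

def relu(x):
    return np.maximum(x, 0)                     # negative responses become 0

def max_pool(x, size=2):
    """Non-overlapping max pooling (stride = size); leftover rows/columns are dropped."""
    h, w = x.shape[0] // size * size, x.shape[1] // size * size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

x = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)  # vertical edge between columns 1 and 2
k = np.array([[-1, 0, 1]] * 3, dtype=float)       # hand-made vertical-edge kernel
fmap = relu(conv2d(x, k))                         # 3x3 feature map
pooled = max_pool(fmap)                           # 1x1 after 2x2 pooling
print(fmap)
print(pooled)
```

The feature map is high only where the kernel overlaps the edge, and pooling keeps the strongest response while shrinking the spatial size.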
CNN Block Diagram
Input Image → Convolution → ReLU → Pooling → Flatten → Fully Connected → Softmax Output
Important CNN Terms
• Stride – step size of filter movement
• Padding – adds border pixels (usually zeros) around the input; "same" padding keeps the output size equal to the input size
• Feature Map – output of convolution layer
• Kernel – small filter matrix
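Stride, padding and kernel size together fix the output size: out = ⌊(n + 2p − k) / s⌋ + 1. A small Python sketch (the function name is illustrative) shows where the 32 × 13 × 13 input size of fc1 in this unit's PyTorch example comes from, for a 28×28 single-channel input:

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial size after a conv or pooling layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

after_conv = conv_output_size(28, kernel=3)                    # 3x3 conv, no padding
after_pool = conv_output_size(after_conv, kernel=2, stride=2)  # 2x2 max pooling
same_pad = conv_output_size(28, kernel=3, padding=1)           # padding of 1 for a 3x3 kernel
print(after_conv, after_pool)                                  # 26 13
print(same_pad)                                                # 28: 'same' padding keeps size
```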
3️⃣Popular CNN Architectures
✅ LeNet-5 – early CNN for handwritten digit recognition (MNIST)
✅ AlexNet – deeper CNN using ReLU & dropout
✅ VGGNet – uses 3×3 convolutions repeatedly
✅ ResNet – introduces skip connections that ease training of very deep networks (mitigate vanishing gradients)
ResNet Block Structure:
Input → Convolution Layers → Output + Input (Skip Connection) → ReLU
4️⃣Transfer Learning
Pre-trained models like VGG16, ResNet50 are trained on large datasets like ImageNet.
We reuse their learned features and train only final layers for new tasks.
Advantages:
• Less data required
• Faster training
• Often higher accuracy when the new dataset is small
5️⃣CNN Code Example (PyTorch)
A simple CNN image classifier example:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
# Data Preprocessing: tensor in [0, 1], then normalize to roughly [-1, 1]
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
# MNIST: 28x28 single-channel digit images (matches Conv2d(1, ...) and 32*13*13 below)
train_data = datasets.MNIST(root='./data', train=True,
                            download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_data,
                                           batch_size=64, shuffle=True)
# CNN Model
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)        # 1x28x28 -> 32x26x26
        self.pool = nn.MaxPool2d(2, 2)           # 32x26x26 -> 32x13x13
        self.fc1 = nn.Linear(32 * 13 * 13, 10)   # 10 digit classes
    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = self.pool(x)
        x = x.view(-1, 32 * 13 * 13)             # flatten
        x = self.fc1(x)                          # raw class scores (logits)
        return x
model = CNN()
criterion = nn.CrossEntropyLoss()                # applies log-softmax internally
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training Loop
for epoch in range(1):
    for images, labels in train_loader:
        optimizer.zero_grad()
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
print("Training Completed!")
6️⃣Applications of Deep Learning in Vision
• Face Recognition
• Self-driving Cars
• Medical Imaging
• Object Detection (YOLO)
• Image Segmentation (U-Net)
✅ Unit-5 Completed ✅