Computer Vision After Deep Learning — Complete Study Notes
Computer Vision
After Deep Learning
Complete Study Notes — From Zero to Expert
Topics Covered:
1. Image Basics — Reading, Color Spaces, Resizing, Normalization, Thresholding
2. Feature Detection — Edges, Corners, Keypoints, Descriptors
3. Image Transformation — Geometric, Filtering, Morphological, Intensity
4. CV Pipeline — Acquisition → Preprocessing → Features → Models → Evaluation
SECTION 1: IMAGE BASICS
Before jumping into algorithms, you need to understand what an image actually is in a computer's
memory, and how to work with it using Python libraries.
1.1 What is a Digital Image?
A digital image is simply a 2D grid (matrix) of pixels. Each pixel stores a color value. Depending on the
image type, a pixel can store:
• Grayscale: a single number (0 = black, 255 = white, values in between = shades of grey)
• Color (RGB): three numbers — one each for Red, Green, and Blue channels
• Color (RGBA): four numbers — RGB plus Alpha (transparency)
So a 640×480 grayscale image is a matrix of shape (480, 640) — 480 rows, 640 columns.
A 640×480 color image is a matrix of shape (480, 640, 3) — the extra dimension holds R, G, B values.
🔑 Key Insight
Images are just numbers! Every operation in CV is essentially math on these numbers.
1.2 Image Reading & Writing — OpenCV vs PIL
Two main libraries handle image I/O in Python:
OpenCV (cv2)                            PIL / Pillow
Industry standard for CV tasks          Great for general image manipulation
Reads images in BGR order (not RGB!)    Reads images in RGB order
Very fast (written in C++)              Easier API for beginners
# OpenCV
import cv2
img = cv2.imread("[Link]")
cv2.imwrite("[Link]", img)
# PIL
from PIL import Image
img_pil = Image.open("[Link]")
img_pil.save("[Link]")
⚠️Critical Warning
OpenCV stores colors as BGR (Blue-Green-Red), NOT the standard RGB. This is a common source
of bugs. Always convert when mixing OpenCV with other libraries. Use cv2.cvtColor(img,
cv2.COLOR_BGR2RGB) to convert.
After reading, the image is a NumPy array. You can inspect it:
import cv2
import numpy as np
img = cv2.imread("[Link]")
print(img.shape)   # e.g. (480, 640, 3) → height, width, channels
print(img.dtype)   # uint8 → each value is 0-255
print(img[0, 0])   # pixel at row=0, col=0 → [B, G, R] values
1.3 Grayscale Conversion
Converting a color image to grayscale means collapsing 3 channels into 1. Instead of storing (R,G,B)
per pixel, we store a single brightness value.
The formula used is NOT a simple average. Human eyes are more sensitive to green light, so the
formula is weighted:
Grayscale = 0.299 × R + 0.587 × G + 0.114 × B
In code:
# OpenCV (input is BGR, output is single-channel gray)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# PIL
gray_pil = img_pil.convert("L") # L = luminance = grayscale
Result shape changes from (H, W, 3) to (H, W) — no more channel dimension.
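As a rough illustration of the weighted formula, the NumPy sketch below applies the same three coefficients to a tiny hand-made RGB array. The function name rgb_to_gray and the sample pixels are invented for this example, and OpenCV's own conversion may round slightly differently.

```python
import numpy as np

def rgb_to_gray(rgb):
    """Weighted luminance sum; expects an (H, W, 3) array in R, G, B order."""
    weights = np.array([0.299, 0.587, 0.114])
    gray = rgb.astype(np.float64) @ weights   # collapses the channel axis
    return np.round(gray).astype(np.uint8)

# Hypothetical 1x3 image: pure red, pure green, pure blue
pixels = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=np.uint8)
gray = rgb_to_gray(pixels)
print(gray.shape)   # (1, 3): the channel dimension is gone
print(gray)         # green comes out brightest, blue darkest
```

Note that the input here is in RGB order; an array read with OpenCV would need its channels reversed first.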
💡 Why grayscale?
Many CV algorithms only need brightness information, not color. Grayscale is faster to process and
reduces memory. Edge detectors, most feature detectors, etc., all operate on grayscale images.
1.4 Color Space Conversion
A color space is a mathematical way of representing color. Different tasks need different color spaces.
RGB ↔ Grayscale
Already covered above. RGB has 3 channels (Red, Green, Blue). Grayscale has 1.
RGB ↔ HSV
HSV stands for Hue, Saturation, Value:
• Hue: the actual color (red=0°, yellow=60°, green=120°, cyan=180°, blue=240°, magenta=300°)
• Saturation: how vivid/pure the color is (0 = grey, 1 = full color)
• Value: brightness (0 = black, 1 = fully bright)
Why use HSV? Because it separates color identity (hue) from lighting conditions (value). Detecting a
red ball is MUCH easier in HSV — you just look for pixels where hue is near 0° regardless of lighting.
# OpenCV
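hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
bgr = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)   # converts back to BGR (not RGB)
# In OpenCV (8-bit images): Hue is 0-179 (degrees / 2, not 0-360), S and V are 0-255
# Red wraps around the hue axis. Lower red: H=0-10, upper red: H=170-179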
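To see why hue is robust to lighting, the sketch below uses Python's standard-library colorsys module (not OpenCV) to convert a bright red and a dim red to HSV. The helper name hue_degrees and the sample colors are assumptions made for the example.

```python
import colorsys

def hue_degrees(r, g, b):
    """Return (hue in degrees, saturation, value) for 8-bit R, G, B inputs."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s, v

bright_red = hue_degrees(255, 0, 0)
dark_red = hue_degrees(100, 0, 0)    # same color under dimmer lighting
print(bright_red)   # hue 0, full saturation, full value
print(dark_red)     # hue still 0; only the value drops

# OpenCV's 8-bit HSV stores hue as degrees / 2 so that it fits in 0-179
opencv_hue_of_blue = hue_degrees(0, 0, 255)[0] / 2
print(opencv_hue_of_blue)
```

Both reds share the same hue and differ only in value, which is the property that makes hue-based color masks tolerant of brightness changes.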
BGR ↔ RGB
Since OpenCV uses BGR and most other tools use RGB, this conversion is the most common:
rgb = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2RGB) # BGR → RGB
bgr = cv2.cvtColor(rgb_img, cv2.COLOR_RGB2BGR) # RGB → BGR
# Or with NumPy slicing (reverses the channel axis; returns a view, so add .copy() if a contiguous array is needed)
rgb = bgr_img[:, :, ::-1]
1.5 Image Resizing & Interpolation Methods
Resizing means changing the number of pixels in an image. When you make an image bigger
(upsample) or smaller (downsample), you need to estimate values for pixels that don't exist exactly —
this is called interpolation.
Nearest Neighbor Interpolation
The simplest method. For each new pixel position, find the nearest pixel in the original image and just
copy its value. Very fast, but produces blocky/pixelated results when upsampling.
• Best for: pixel art, masks, label maps (where you must NOT average labels)
• Worst for: photos (looks very blocky when enlarged)
resized = cv2.resize(img, (new_width, new_height),
                     interpolation=cv2.INTER_NEAREST)
Bilinear Interpolation
Uses the 4 nearest pixels and takes a weighted average based on distance. The result is much
smoother than nearest neighbor.
• Best for: general use, real-time applications
• Works well for upsampling and mild downsampling; for shrinking by a large factor, OpenCV's documentation recommends cv2.INTER_AREA
resized = cv2.resize(img, (new_width, new_height),
                     interpolation=cv2.INTER_LINEAR) # default
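The "weighted average of the 4 nearest pixels" can be written out directly. This is a minimal sketch for a single sample point, not how cv2.resize is implemented internally; the function name bilinear_sample and the 2×2 test image are made up for illustration.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample a 2D array at fractional (x, y), where x is the column and y the row."""
    h, w = img.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)   # clamp at the image border
    dx, dy = x - x0, y - y0
    top = (1 - dx) * img[y0, x0] + dx * img[y0, x1]
    bottom = (1 - dx) * img[y1, x0] + dx * img[y1, x1]
    return (1 - dy) * top + dy * bottom

tiny = np.array([[0.0, 10.0],
                 [20.0, 30.0]])
center = bilinear_sample(tiny, 0.5, 0.5)   # equal weight on all four pixels
corner = bilinear_sample(tiny, 0.0, 0.0)   # lands exactly on a pixel
print(center, corner)
```

Nearest neighbor would instead return one of the four original values unchanged, which is why it is the safe choice for label maps.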
Bicubic Interpolation
Uses the 16 nearest pixels (4×4 neighborhood) with cubic polynomial weighting. Usually gives sharper
results than bilinear, at the cost of some overshoot (ringing) near strong edges.
• Best for: high-quality upscaling (enlarging photos)
• Slower than bilinear
resized = cv2.resize(img, (new_width, new_height),
                     interpolation=cv2.INTER_CUBIC)
Method              When to Use
Nearest Neighbor    Masks/labels, speed priority
Bilinear            General use, real-time
Bicubic             High quality, photo enlargement
1.6 Image Normalization
Raw pixel values are integers 0–255. Neural networks and many algorithms work much better when
input values are small numbers (like 0.0–1.0 or -1.0 to 1.0). Normalization is the process of rescaling
pixel values.
Min-Max Normalization (Scale to 0–1)
img_float = img.astype(np.float32) / 255.0
# Now all values are in range [0.0, 1.0]
Standardization (Zero mean, unit variance)
mean = np.mean(img)
std = np.std(img)
img_norm = (img.astype(np.float32) - mean) / std   # assumes std > 0 (image is not constant)
# Values are now centered around 0
ImageNet Normalization (for pretrained models)
# Standard values used when training on ImageNet
mean = [0.485, 0.456, 0.406] # per channel (R, G, B)
std = [0.229, 0.224, 0.225]
import torchvision.transforms as T
normalize = T.Normalize(mean=mean, std=std)   # expects a float tensor already scaled to [0, 1]
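For readers without torchvision, the same per-channel arithmetic can be sketched in NumPy. This assumes an (H, W, 3) image in RGB order; the function name normalize_rgb and the all-white test image are hypothetical.

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406])   # per channel (R, G, B)
std = np.array([0.229, 0.224, 0.225])

def normalize_rgb(img_uint8):
    """Scale an (H, W, 3) uint8 RGB image to [0, 1], then standardize each channel."""
    scaled = img_uint8.astype(np.float32) / 255.0
    return (scaled - mean) / std          # broadcasting applies the per-channel values

white = np.full((2, 2, 3), 255, dtype=np.uint8)   # hypothetical all-white image
out = normalize_rgb(white)
print(out[0, 0])   # (1 - mean) / std for each channel
```

An image loaded with OpenCV is BGR, so its channels would have to be reversed before these RGB statistics apply.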
💡 Why normalize?
Pixel values 0-255 can cause problems: gradients become unstable, learning is slow, and different
images have wildly different scales. Normalization puts all inputs on equal footing.
1.7 Image Thresholding
Thresholding converts a grayscale image into a binary image — each pixel becomes either pure black
(0) or pure white (255). It is the simplest form of image segmentation (separating objects from
background).
Binary Thresholding
You pick a threshold value T. Every pixel above T becomes white (255), every pixel at or below T
becomes black (0).
# Simple binary threshold
T = 127 # threshold value
ret, binary = cv2.threshold(gray, T, 255, cv2.THRESH_BINARY)
# THRESH_BINARY: pixel > T → 255, else → 0
# THRESH_BINARY_INV: pixel > T → 0, else → 255 (inverted)
# THRESH_TRUNC: pixel > T → T, else → pixel (clips highs)
# THRESH_TOZERO: pixel > T → pixel, else → 0 (kills darks)
Adaptive Thresholding
Problem with binary thresholding: if the image has uneven lighting (e.g., a document with shadows), a
single T value won't work — the shadow areas will be too dark. Solution: use a different threshold for
each local region of the image.
adaptive = cv2.adaptiveThreshold(
    gray,                              # source image (8-bit grayscale)
    255,                               # max value
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,    # use Gaussian weighted average
    cv2.THRESH_BINARY,                 # output type
    11,                                # block size (must be odd): size of local region
    2                                  # C: subtract this constant from mean
)
# ADAPTIVE_THRESH_MEAN_C: threshold = mean of block - C
# ADAPTIVE_THRESH_GAUSSIAN_C: threshold = Gaussian-weighted mean - C
🔑 When to use what?
Binary threshold: uniform lighting, simple documents. Adaptive threshold: uneven lighting, real-world
photos of text/documents, complex scenes.
SECTION 2: FEATURE DETECTION
Features are distinctive, informative parts of an image — edges, corners, blobs — that summarize
what's in the image without having to look at every pixel. Feature detection finds these interesting
locations.
2.1 Edge Detection
An edge is a place in an image where pixel intensity changes rapidly — for example, the boundary
between a white wall and a dark door. Mathematically, edges are where the image gradient (rate of
change) is large.
We detect this by convolving the image with special filters (kernels) that compute derivatives.
🧮 What is convolution?
Convolution is sliding a small matrix (kernel) over the image. At each position, you multiply the kernel
values with the corresponding pixel values and sum them up. The result captures local patterns like
edges or blurs.
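The sliding multiply-and-sum can be sketched in a few lines of NumPy. Strictly speaking, sliding the kernel without flipping it is cross-correlation (true convolution flips the kernel first), and that is also what cv2.filter2D computes. The function name slide_kernel and the step-edge test image are invented for this example; the kernel is the standard 3×3 x-derivative (Sobel) kernel.

```python
import numpy as np

def slide_kernel(img, kernel):
    """Multiply-and-sum a kernel at every position where it fits entirely ('valid' mode)."""
    kh, kw = kernel.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for r in range(out_h):
        for c in range(out_w):
            window = img[r:r + kh, c:c + kw]
            out[r, c] = np.sum(window * kernel)
    return out

# Hypothetical test image: dark left half, bright right half (a vertical edge)
step = np.array([[0, 0, 0, 10, 10, 10]] * 5, dtype=np.float64)

kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)  # x-derivative kernel
ky = kx.T                                                              # y-derivative kernel

gx = slide_kernel(step, kx)
gy = slide_kernel(step, ky)
print(gx[0])   # nonzero only at the columns that straddle the edge
print(gy[0])   # all zeros: nothing changes in the vertical direction
```

The x-derivative kernel fires on the vertical edge while the y-derivative kernel stays silent, which is the behavior the edge detectors in this section rely on.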
Sobel Operator
The Sobel operator uses two 3×3 kernels to compute the gradient in the X direction (which responds to
vertical edges) and the Y direction (which responds to horizontal edges) separately.
# Sobel kernels:
# Gx (x-derivative):    Gy (y-derivative):
# -1  0 +1              -1 -2 -1
# -2  0 +2               0  0  0
# -1  0 +1              +1 +2 +1
Gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3) # x-direction
Gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3) # y-direction
magnitude = np.sqrt(Gx**2 + Gy**2) # edge strength
direction = np.arctan2(Gy, Gx) # gradient direction (radians)
• Gx detects vertical edges (changes left-right)
• Gy detects horizontal edges (changes up-down)
• Magnitude = sqrt(Gx² + Gy²) gives the full edge strength
• Direction = atan2(Gy, Gx) gives the gradient angle, which is perpendicular to the edge itself
Prewitt Operator
Very similar to Sobel, but simpler — all weights are equal (no center-weighted emphasis). Slightly less
noise-resistant than Sobel.
# Gx: Gy:
# -1 0 +1 -1 -1 -1
# -1 0 +1 0 0 0
# -1 0 +1 +1 +1 +1
# No direct OpenCV function; use filter2D
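Kx = np.array([[-1,0,1],[-1,0,1],[-1,0,1]], dtype=np.float32)
Ky = np.array([[-1,-1,-1],[0,0,0],[1,1,1]], dtype=np.float32)
# Use a float output depth: with ddepth=-1 a uint8 input clips negative responses to 0
Gx = cv2.filter2D(gray, cv2.CV_32F, Kx)
Gy = cv2.filter2D(gray, cv2.CV_32F, Ky)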
Roberts Cross Operator
The oldest and simplest edge detector. Uses 2×2 kernels to compute diagonal gradients.
# Gx: Gy:
# +1 0 0 +1
# 0 -1 -1 0
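Kx = np.array([[1,0],[0,-1]], dtype=np.float32)
Ky = np.array([[0,1],[-1,0]], dtype=np.float32)
# Float output depth keeps the negative responses that a uint8 result would clip to 0
Gx = cv2.filter2D(gray, cv2.CV_32F, Kx)
Gy = cv2.filter2D(gray, cv2.CV_32F, Ky)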
Roberts is very sensitive to noise because it uses only 2×2 neighborhoods. Mostly used for education
today.
Laplacian Operator
Instead of computing gradients separately in X and Y, Laplacian computes the second derivative
directly. It detects edges in ALL directions at once.
# Laplacian kernel:
# 0 1 0
# 1 -4 1
# 0 1 0
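# ksize=1 (the default) applies exactly the kernel above; ksize=3 uses Sobel-based second derivatives
laplacian = cv2.Laplacian(gray, cv2.CV_64F, ksize=3)
# VERY sensitive to noise! Always blur first:
blurred = cv2.GaussianBlur(gray, (5,5), 0)
laplacian = cv2.Laplacian(blurred, cv2.CV_64F)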
⚠️Laplacian & Noise
Laplacian amplifies noise because it looks at second derivatives. Always apply Gaussian blur before
Laplacian. This combination is called LoG (Laplacian of Gaussian).
Canny Edge Detector
The gold standard of edge detection. Canny is a multi-step algorithm that produces thin, clean,
accurate edges. It is the most widely used edge detector.
How Canny works step by step:
• Step 1 — Gaussian Blur: Smooth the image to reduce noise
• Step 2 — Sobel Gradient: Compute gradient magnitude and direction at every pixel
• Step 3 — Non-Maximum Suppression: Thin edges to 1 pixel wide (keep only the local maximum
along the gradient direction)
• Step 4 — Double Thresholding: Classify pixels as strong edges (above high threshold), weak
edges (between thresholds), or non-edges (below low threshold)
• Step 5 — Edge Tracking by Hysteresis: Keep weak edges ONLY if they connect to strong edges
edges = cv2.Canny(
    gray,             # input (8-bit grayscale)
    threshold1=50,    # low threshold
    threshold2=150,   # high threshold
    apertureSize=3    # Sobel kernel size
)
# Rule of thumb: high:low ratio should be 2:1 or 3:1
# Lower thresholds = more edges detected
# Higher thresholds = fewer, stronger edges only
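Steps 4 and 5 (double thresholding and hysteresis) are the least obvious part of Canny, so here is a minimal sketch of just those two steps on a precomputed gradient-magnitude array. The function name hysteresis, the use of >= for the comparisons, and the one-row test array are assumptions for illustration; OpenCV's internal implementation differs in its details.

```python
import numpy as np
from collections import deque

def hysteresis(magnitude, low, high):
    """Keep strong pixels plus any weak pixels 8-connected to an accepted pixel."""
    strong = magnitude >= high
    weak = (magnitude >= low) & ~strong
    keep = strong.copy()
    queue = deque(zip(*np.nonzero(strong)))   # start the search from every strong pixel
    rows, cols = magnitude.shape
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and weak[rr, cc] and not keep[rr, cc]:
                    keep[rr, cc] = True      # weak pixel touches an accepted edge pixel
                    queue.append((rr, cc))
    return keep

# Hypothetical one-row gradient magnitude: weak, strong, weak, gap, isolated weak
mag = np.array([[0, 60, 200, 60, 0, 60]], dtype=np.float64)
edges = hysteresis(mag, low=50, high=150)
print(edges)
```

The two weak pixels next to the strong one survive, while the isolated weak pixel at the end is discarded even though it has the same magnitude.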
2.2 Corner & Keypoint Detection
A corner is a point where two edges meet — a place that is distinctive in BOTH horizontal and vertical
directions. Corners are great landmarks for matching images because they are unique and repeatable.
Harris Corner Detector
Harris computes, for each pixel, how much the image changes if we shift a small window in any
direction. If it changes a lot in all directions → corner. Changes in only one direction → edge. No
change → flat region.
Mathematically, it computes a 2×2 matrix M (structure tensor) from image gradients, then:
R = det(M) - k × trace(M)²
# where k is a sensitivity parameter (usually 0.04-0.06)
# R >> 0 → corner
# R << 0 → edge
# R ≈ 0 → flat region
harris = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)
harris = cv2.dilate(harris, None) # enlarge corners for visibility
img[harris > 0.01 * harris.max()] = [0, 0, 255] # mark red (BGR order)
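The sign rules for R are easy to check by hand. The sketch below evaluates the response formula on three hand-picked 2×2 structure tensors; the matrices and the function name harris_response are illustrative assumptions, not values taken from a real image.

```python
import numpy as np

def harris_response(M, k=0.04):
    """R = det(M) - k * trace(M)^2 for a 2x2 structure tensor M."""
    return np.linalg.det(M) - k * np.trace(M) ** 2

corner = np.array([[10.0, 0.0], [0.0, 10.0]])   # strong gradients in both directions
edge = np.array([[10.0, 0.0], [0.0, 0.0]])      # strong gradient in one direction only
flat = np.zeros((2, 2))                          # no gradient at all

for name, M in [("corner", corner), ("edge", edge), ("flat", flat)]:
    print(name, harris_response(M))
```

The corner gives a clearly positive R (about 84), the edge a negative one (about -4), and the flat patch gives zero, matching the three cases listed above.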
Shi-Tomasi Corner Detector
An improvement on Harris. Instead of using det(M) - k×trace(M)², it uses the minimum eigenvalue of M:
R = min(λ1, λ2)
# If min eigenvalue > threshold → corner
Result: better corners, especially useful for tracking (used in Lucas-Kanade optical flow).
corners = cv2.goodFeaturesToTrack(
    gray,
    maxCorners=100,      # return at most N corners
    qualityLevel=0.01,   # minimum quality (0-1)
    minDistance=10,      # minimum pixels between corners
)
corners = corners.astype(int)   # np.int0 was removed in NumPy 2.0
for c in corners:
    x, y = c.ravel()
    cv2.circle(img, (int(x), int(y)), 5, (0,255,0), -1)
FAST (Features from Accelerated Segment Test)
Harris and Shi-Tomasi can be too slow for real-time applications on limited hardware. FAST is designed for speed.
How FAST works:
• For each pixel p, look at a circle of 16 pixels around it (Bresenham circle, radius 3)
• If N or more consecutive pixels on the circle are ALL brighter than p+threshold, or ALL darker
than p-threshold → p is a corner
• Typical N = 12 (FAST-12) or 9 (FAST-9)
fast = cv2.FastFeatureDetector_create(threshold=10, nonmaxSuppression=True)
keypoints = fast.detect(gray, None)
img_kp = cv2.drawKeypoints(img, keypoints, None, color=(0,255,0))
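The segment test described in the bullets above can be sketched in plain Python. This version takes the 16 circle intensities as a ready-made list (extracting them from the Bresenham circle is left out), and it omits the high-speed pre-check and non-maximum suppression that real implementations add. The function name and the sample rings are made up for illustration.

```python
def fast_segment_test(center, ring, threshold, n=12):
    """Return True if n contiguous ring pixels are all brighter or all darker than center."""
    brighter = [v > center + threshold for v in ring]
    darker = [v < center - threshold for v in ring]
    for flags in (brighter, darker):
        wrapped = flags + flags[:n - 1]   # lets a run continue past the end of the circle
        run = 0
        for flag in wrapped:
            run = run + 1 if flag else 0
            if run >= n:
                return True
    return False

# Hypothetical 16-pixel rings around a center pixel of intensity 100
corner_ring = [150] * 12 + [100] * 4                 # 12 contiguous bright pixels
textured_ring = [150, 100] * 8                       # bright pixels never adjacent
wrapping_ring = [150] * 6 + [100] * 4 + [150] * 6    # bright run crosses the start index

print(fast_segment_test(100, corner_ring, threshold=10))
print(fast_segment_test(100, textured_ring, threshold=10))
print(fast_segment_test(100, wrapping_ring, threshold=10))
```

Because the test only compares intensities and counts a run, it needs no gradients or matrix math, which is where the speed comes from.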
⚡ Speed
FAST is ~10-100x faster than Harris. It is used in real-time SLAM (Simultaneous Localization and
Mapping) and AR applications.
2.3 Feature Descriptors
Detecting a keypoint (location) is just the first step. We also need to describe WHAT the region around
that keypoint looks like — so we can match it against keypoints in other images.
A descriptor is a vector of numbers that encodes the appearance around a keypoint. Good descriptors
are:
• Invariant to rotation (the same descriptor if the image is rotated)
• Invariant to scale (same descriptor if image is zoomed in/out)
• Invariant to illumination changes
SIFT (Scale-Invariant Feature Transform)
SIFT is the gold standard descriptor. Developed by David Lowe in 1999-2004. Steps:
• Scale-space extrema detection: Find keypoints at multiple scales using Difference of Gaussian
(DoG)
• Keypoint localization: Reject low-contrast/edge points, keep only stable ones
• Orientation assignment: Compute dominant gradient orientation → rotation invariance
• Descriptor creation: Divide 16×16 region into 4×4 sub-regions, compute 8-bin histogram of
gradient orientations in each → 4×4×8 = 128-dimensional vector
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
# descriptors shape: (N_keypoints, 128)
img_sift = cv2.drawKeypoints(img, keypoints, None,
                             flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
⚠️Patent note
SIFT was patented (expired in 2020). Before 2020, you needed opencv-contrib for it. Now it's in the
main OpenCV package.
SURF (Speeded-Up Robust Features)
SURF approximates SIFT using integral images and box filters — roughly 3-7x faster than SIFT. The
descriptor is 64-dimensional (fast) or 128-dimensional (accurate).
# Requires an opencv-contrib build with the non-free algorithms enabled
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
keypoints, descriptors = surf.detectAndCompute(gray, None)
# descriptors shape: (N_keypoints, 64) or (N_keypoints, 128)
ORB (Oriented FAST and Rotated BRIEF)
ORB is free, fast, and almost as good as SIFT/SURF. It combines FAST keypoint detection with BRIEF
descriptors, adding rotation invariance. Created by OpenCV lab as a patent-free alternative.
orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(gray, None)
# descriptors: binary, stored as a uint8 array of shape (N, 32), i.e. 32 bytes × 8 = 256 bits each
# Matching ORB descriptors: use Hamming distance (XOR of bits)
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(desc1, desc2)
matches = sorted(matches, key=lambda x: x.distance)
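Hamming distance on byte-packed binary descriptors is just "XOR, then count the 1 bits". A minimal NumPy sketch, using made-up 2-byte descriptors in place of real 32-byte ORB rows (the function name is also an assumption):

```python
import numpy as np

def hamming_distance(d1, d2):
    """Count differing bits between two uint8 descriptor rows."""
    xor = np.bitwise_xor(d1, d2)
    return int(np.unpackbits(xor).sum())

# Hypothetical 2-byte descriptors (real ORB rows have 32 bytes)
a = np.array([0b11110000, 0b00000000], dtype=np.uint8)
b = np.array([0b00001111, 0b00000001], dtype=np.uint8)
print(hamming_distance(a, b))   # 9: eight bits differ in the first byte, one in the second
```

Because the comparison is pure bit arithmetic, binary descriptors can be matched much faster than float descriptors that need Euclidean distances.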
BRIEF (Binary Robust Independent Elementary Features)
BRIEF is purely a descriptor (not a detector — needs a separate keypoint detector like FAST). It is
extremely fast because it uses binary strings.
How BRIEF works: around each keypoint, randomly sample pairs of pixels (p, q). For each pair,
compare: if p is brighter than q → bit=1, else bit=0. Concatenate all these bits → a 128, 256, or 512-bit
binary string.
# BRIEF alone, paired here with the STAR detector (both are in opencv-contrib's xfeatures2d module)
star = cv2.xfeatures2d.StarDetector_create()
brief = cv2.xfeatures2d.BriefDescriptorExtractor_create()
keypoints = star.detect(gray, None)
keypoints, descriptors = brief.compute(gray, keypoints)
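The pairwise-comparison idea can be sketched without OpenCV. This is a BRIEF-like toy, not the actual BRIEF implementation: real BRIEF smooths the patch first and uses specific sampling patterns, whereas here the pairs are drawn uniformly with a fixed seed. All names and the random test patch are assumptions.

```python
import numpy as np

def make_pairs(n_bits, patch_size, seed=0):
    """Draw n_bits random pixel-pair locations once; reuse them for every keypoint."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, patch_size, size=(n_bits, 4))   # rows of (r1, c1, r2, c2)

def brief_like(patch, pairs):
    """One bit per pair: 1 if the first pixel is brighter than the second."""
    p = patch[pairs[:, 0], pairs[:, 1]]
    q = patch[pairs[:, 2], pairs[:, 3]]
    return (p > q).astype(np.uint8)

pairs = make_pairs(n_bits=256, patch_size=16)
rng = np.random.default_rng(1)
patch = rng.integers(0, 200, size=(16, 16))    # hypothetical 16x16 grayscale patch
bits = brief_like(patch, pairs)
brighter_bits = brief_like(patch + 40, pairs)  # same patch under uniformly brighter light
print(bits.shape, int(np.sum(bits != brighter_bits)))
```

Adding a constant brightness offset changes no comparison outcome, so the two bit strings are identical; this is one reason such descriptors tolerate simple illumination changes.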
Descriptor    Key Properties
SIFT          128-dim float, scale+rot invariant, most accurate
SURF          64-dim float (128-dim optional), faster than SIFT, non-free in OpenCV
ORB           32-byte (256-bit) binary, free, fast, good accuracy
BRIEF         Pure binary, needs separate detector, ultrafast
SECTION 3: IMAGE TRANSFORMATION
Transformations change the image's appearance — repositioning, filtering, shaping, or adjusting
intensity. They are fundamental building blocks in any CV pipeline.
3.1 Geometric Transformations
Geometric transformations change where pixels are located (spatial mapping) without changing their
values.
Translation
Moving the image by (tx, ty) pixels — horizontally and/or vertically.
tx, ty = 100, 50 # shift right 100px, down 50px
M = np.float32([[1, 0, tx],
[0, 1, ty]])
translated = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))
Scaling
Scaling resizes the image (or part of it) by a factor. cv2.resize does this, but you can also use
warpAffine:
# Simple resize (covered in 1.5):
scaled = cv2.resize(img, None, fx=2.0, fy=2.0) # 2x larger
# Or with transformation matrix (scale by sx, sy):
sx, sy = 2.0, 2.0
M = np.float32([[sx, 0, 0],
                [0, sy, 0]])
scaled = cv2.warpAffine(img, M, (int(img.shape[1]*sx), int(img.shape[0]*sy)))
Rotation
Rotating the image by an angle θ around a center point. OpenCV's getRotationMatrix2D makes this
easy.
center = (img.shape[1]//2, img.shape[0]//2) # center of image as (x, y)
angle = 45 # degrees counter-clockwise
scale = 1.0 # 1.0 = no resize
M = cv2.getRotationMatrix2D(center, angle, scale)
rotated = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))
🧮 The rotation matrix
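For angle θ and scale s, cv2.getRotationMatrix2D returns M = [[α, β, (1−α)·cx − β·cy], [−β, α, β·cx + (1−α)·cy]]
with α = s·cos θ and β = s·sin θ, where (cx, cy) is the center. The last column is the shift that makes the
rotation happen around that center (not just the origin); the sine signs differ from the textbook matrix
because the image y-axis points down.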
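To make the role of the shift terms concrete, the sketch below builds a 2×3 rotation matrix by hand, following the formula given in the OpenCV documentation for getRotationMatrix2D (α = scale·cos θ, β = scale·sin θ; note the sine signs are arranged for image coordinates, where y points down). The function names and the sample center are assumptions for the example.

```python
import numpy as np

def rotation_matrix_2d(center, angle_deg, scale=1.0):
    """2x3 matrix following the formula in the OpenCV docs for getRotationMatrix2D."""
    cx, cy = center
    a = scale * np.cos(np.radians(angle_deg))
    b = scale * np.sin(np.radians(angle_deg))
    return np.array([[a, b, (1 - a) * cx - b * cy],
                     [-b, a, b * cx + (1 - a) * cy]])

def apply_affine(M, point):
    """Map an (x, y) point through a 2x3 affine matrix."""
    x, y = point
    return M @ np.array([x, y, 1.0])

M = rotation_matrix_2d(center=(50, 40), angle_deg=90)
print(apply_affine(M, (50, 40)))   # the center stays where it is
print(apply_affine(M, (60, 40)))   # a point to the right of center ends up above it
```

The point 10 pixels to the right of the center moves to 10 pixels above it (smaller y), which appears as a counter-clockwise turn on screen.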
Affine Transformation
Affine transformation is the general form of translation + rotation + scaling + shearing. It maps three
points to three new points, and everything in between follows linearly.
Properties preserved: parallel lines stay parallel, and ratios of distances along the same line stay the same (angles and lengths generally do not).
# Define 3 source points and 3 destination points
pts1 = np.float32([[50,50], [200,50], [50,200]])
pts2 = np.float32([[10,100], [200,50], [100,250]])
M = cv2.getAffineTransform(pts1, pts2)
warped = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))
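Three point pairs determine an affine map because they give six equations for the six unknowns in the 2×3 matrix. A minimal NumPy sketch of that solve, reusing the point values from the listing above (the function name affine_from_points is made up, and this is not how OpenCV implements it internally):

```python
import numpy as np

def affine_from_points(src, dst):
    """Solve for the 2x3 matrix M such that M @ [x, y, 1] = [x', y'] for three point pairs."""
    A = np.hstack([src, np.ones((3, 1))])   # rows of [x, y, 1]
    # A @ M.T = dst, so M.T is the solution of a 3x3 linear system
    return np.linalg.solve(A, dst).T

pts1 = np.array([[50, 50], [200, 50], [50, 200]], dtype=np.float64)
pts2 = np.array([[10, 100], [200, 50], [100, 250]], dtype=np.float64)
M = affine_from_points(pts1, pts2)
mapped = (M @ np.hstack([pts1, np.ones((3, 1))]).T).T
print(np.round(mapped, 6))   # reproduces pts2
```

The solve fails if the three source points lie on one line, since the system is then singular.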
Perspective (Homography) Transformation
Affine uses 3 points; perspective uses 4 points. Perspective can correct for camera angle — for
example, making a tilted document look flat (bird's eye view).
Properties preserved: straight lines stay straight, BUT parallel lines may converge.
# Source: 4 corners of a document in a photo
pts_src = np.float32([[100,50],[400,30],[420,350],[80,370]])
# Destination: what those corners should become (rectangle)
pts_dst = np.float32([[0,0],[300,0],[300,400],[0,400]])
H = cv2.getPerspectiveTransform(pts_src, pts_dst)
warped = cv2.warpPerspective(img, H, (300, 400))
3.2 Filtering & Smoothing
Filtering replaces each pixel with a function of its neighborhood. Smoothing blurs the image to reduce
noise or detail.
Gaussian Blur
The most common blur. Each pixel is replaced with a weighted average of its neighbors, where weights
follow a 2D Gaussian (bell curve) centered on that pixel.
Why? Gaussian blur is a simple, effective low-pass filter for reducing noise and is the basis of many
algorithms (Canny, DoG, etc.).
blurred = cv2.GaussianBlur(
    img,
    (5, 5),     # kernel size: must be odd (3, 5, 7, 9, ...)
    sigmaX=0    # standard deviation; 0 = auto-compute from kernel size
)
# Larger kernel = more blur
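The weights themselves are easy to generate. This sketch builds a normalized Gaussian kernel from an explicit sigma; the function name gaussian_kernel is an assumption, and OpenCV's own kernels may differ slightly (for example when sigma is derived from the kernel size).

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Build a normalized size x size Gaussian kernel (size should be odd)."""
    half = size // 2
    coords = np.arange(-half, half + 1)
    g1d = np.exp(-(coords ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g1d, g1d)        # the 2D Gaussian is separable
    return kernel / kernel.sum()       # weights sum to 1 so brightness is preserved

k = gaussian_kernel(5, sigma=1.0)
print(k.shape, k.sum())
print(k[2, 2] > k[2, 1] > k[2, 0])    # weights fall off away from the center
```

Separability is also why Gaussian blur is cheap in practice: a 2D blur can be done as one horizontal pass followed by one vertical pass.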
Median Blur
For each pixel, collect all values in the kernel window, sort them, and replace with the median value.
Unlike Gaussian blur, median blur is very effective at removing impulse noise (salt-and-pepper noise) while
preserving edges.
median = cv2.medianBlur(img, ksize=5) # ksize must be odd
# Great for removing salt-and-pepper noise
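A small experiment shows why the median handles impulse noise better than averaging. The 3×3 filter below is a slow, loop-based sketch with edge replication at the border (a choice made here for simplicity); the function name and the test image are invented.

```python
import numpy as np

def median_filter3(img):
    """3x3 median filter with edge replication at the border."""
    padded = np.pad(img, 1, mode="edge")
    out = np.empty_like(img)
    for r in range(img.shape[0]):
        for c in range(img.shape[1]):
            out[r, c] = np.median(padded[r:r + 3, c:c + 3])
    return out

flat = np.full((5, 5), 10, dtype=np.uint8)
noisy = flat.copy()
noisy[2, 2] = 255                              # a single "salt" pixel
cleaned = median_filter3(noisy)
mean_blurred_center = noisy[1:4, 1:4].mean()   # what a 3x3 box average would give
print(cleaned[2, 2], round(float(mean_blurred_center), 2))
```

The median restores the outlier to the background value exactly, while an average only dilutes it and smears it over the neighborhood.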
Bilateral Filter
Gaussian blur blurs edges too — it doesn't know if neighbors are on the same surface. Bilateral filter is
edge-preserving: it averages nearby pixels BUT only weights pixels that have SIMILAR intensity values
(in addition to spatial proximity).
bilateral = cv2.bilateralFilter(
    img,
    d=9,             # diameter of pixel neighborhood
    sigmaColor=75,   # how different in color can still be averaged
    sigmaSpace=75    # how far spatially can still be averaged
)
# Smooths textures while keeping edges sharp
Filter       Best Use Case
Gaussian     General noise reduction, preprocessing
Median       Salt-and-pepper noise removal
Bilateral    Smoothing while preserving edges
3.3 Morphological Operations
Morphological operations work on binary (black/white) images. They probe the image with a small
shape called a structuring element (SE) and transform pixel values based on whether the SE fits or
touches foreground pixels.
Erosion
Erosion shrinks white (foreground) regions. A pixel stays white only if ALL pixels under the SE are
white. Effect: removes small white noise, shrinks objects, separates touching objects.
kernel = np.ones((5,5), np.uint8) # structuring element
eroded = cv2.erode(binary_img, kernel, iterations=1)
Dilation
Dilation grows white (foreground) regions. A pixel becomes white if ANY pixel under the SE is white.
Effect: fills small holes, grows objects, connects nearby objects.
dilated = cv2.dilate(binary_img, kernel, iterations=1)
Opening
Opening = Erosion THEN Dilation. Net effect: removes small white noise/specs while approximately
preserving the size of larger objects.
opened = cv2.morphologyEx(binary_img, cv2.MORPH_OPEN, kernel)
# = cv2.dilate(cv2.erode(img, kernel), kernel)
Closing
Closing = Dilation THEN Erosion. Net effect: fills small black holes/gaps while approximately preserving
the size of larger objects.
closed = cv2.morphologyEx(binary_img, cv2.MORPH_CLOSE, kernel)
# = cv2.erode(cv2.dilate(img, kernel), kernel)
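The "ALL pixels" and "ANY pixel" rules translate directly into a minimum and a maximum over the window. The sketch below uses a square structuring element and treats everything outside the image as background (an assumption made here; OpenCV's default border handling is different). The function names and the test image are made up.

```python
import numpy as np

def erode(binary, size=3):
    """Pixel stays 1 only if every pixel under the size x size square is 1."""
    pad = size // 2
    padded = np.pad(binary, pad, mode="constant", constant_values=0)
    out = np.zeros_like(binary)
    for r in range(binary.shape[0]):
        for c in range(binary.shape[1]):
            out[r, c] = padded[r:r + size, c:c + size].min()
    return out

def dilate(binary, size=3):
    """Pixel becomes 1 if any pixel under the size x size square is 1."""
    pad = size // 2
    padded = np.pad(binary, pad, mode="constant", constant_values=0)
    out = np.zeros_like(binary)
    for r in range(binary.shape[0]):
        for c in range(binary.shape[1]):
            out[r, c] = padded[r:r + size, c:c + size].max()
    return out

img = np.zeros((7, 9), dtype=np.uint8)
img[2:5, 1:4] = 1        # a 3x3 object
img[1, 7] = 1            # an isolated noise pixel
opened = dilate(erode(img))   # opening = erosion, then dilation
print(opened)
```

Erosion wipes out the lone pixel and shrinks the object to its center; the following dilation grows the object back to its original 3×3 size, but the noise pixel has nothing left to grow from.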
Operation    Effect
Erosion      Shrinks whites, removes tiny specks
Dilation     Grows whites, fills tiny holes
Opening      Removes noise (small white blobs)
Closing      Fills gaps (small black holes)
3.4 Intensity Transformations: Histogram Equalization
Histogram Equalization is a technique to improve the contrast of an image. It redistributes pixel
intensities so they span the full range (0–255) more uniformly.
Intuition: if most pixels are bunched between 100–150 (low contrast), equalization stretches them to
cover 0–255 (high contrast).
# Global Histogram Equalization
equalized = cv2.equalizeHist(gray) # input must be 8-bit grayscale
# CLAHE: Contrast Limited Adaptive Histogram Equalization
# Works on local tiles instead of globally (better for real images)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
equalized_clahe = clahe.apply(gray)
# For color images: convert to LAB, equalize L channel only
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
lab[:,:,0] = clahe.apply(lab[:,:,0])
result = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
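Global equalization boils down to using the image's cumulative histogram as a lookup table. The sketch below follows the common textbook formulation; it is not guaranteed to match cv2.equalizeHist bit for bit, and the function name equalize and the synthetic test image are assumptions.

```python
import numpy as np

def equalize(gray):
    """Global histogram equalization for a uint8 image via its cumulative histogram."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]                   # CDF value at the darkest level present
    scale = max(gray.size - cdf_min, 1)         # guard against a constant image
    lut = np.clip(np.round((cdf - cdf_min) / scale * 255), 0, 255).astype(np.uint8)
    return lut[gray]

# Hypothetical low-contrast image: every level from 100 to 150 appears equally often
low_contrast = np.tile(np.arange(100, 151, dtype=np.uint8), (4, 1))
stretched = equalize(low_contrast)
print(low_contrast.min(), low_contrast.max())   # 100 150
print(stretched.min(), stretched.max())         # 0 255
```

The mapping is monotonic, so the brightness ordering of pixels is preserved while the used range is stretched from 100-150 to 0-255.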
💡 CLAHE vs Global
Global equalization can over-amplify noise in flat regions. CLAHE limits contrast enhancement per tile
and is much better for real-world images like medical scans and photographs.
SECTION 4: THE COMPLETE CV PIPELINE
A CV pipeline is an end-to-end system that takes raw images as input and produces useful outputs
(class labels, matched images, recognized objects, etc.). Each stage feeds into the next.
Big Picture
Raw Image → Acquisition & Input → Preprocessing → Feature Extraction → Feature Representation
→ Model Training → Prediction & Inference → Evaluation
4.1 Image Acquisition & Input
The first stage: get images into your system in the right format.
Image Loading
import cv2
import numpy as np
from pathlib import Path
# Load single image
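# cv2.imread returns None (no exception) if a file cannot be read
img = cv2.imread("[Link]") # BGR, uint8
img = cv2.imread("[Link]", cv2.IMREAD_GRAYSCALE) # grayscale directly (flag value 0)
# Load all images from a folder
folder = Path('dataset/cats/')
images = [cv2.imread(str(p)) for p in folder.glob('*.jpg')]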
Image Resizing (for consistency)
Models require all input images to be the same size. Resize every image to a fixed resolution.
TARGET = (224, 224) # width, height — standard for many models
resized = cv2.resize(img, TARGET, interpolation=cv2.INTER_LINEAR)
Color Space Conversion
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # for single-channel
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # for plotting/PIL
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV) # for color segmentation
4.2 Preprocessing
Clean and standardize images before extracting features. The quality of preprocessing directly affects
how good your features will be.
Noise Reduction
# Remove Gaussian noise
denoised = cv2.GaussianBlur(gray, (5,5), 0)
# Remove salt-and-pepper noise
denoised = cv2.medianBlur(gray, 5)
# Edge-preserving smoothing
denoised = cv2.bilateralFilter(img, 9, 75, 75)
Normalization
img_norm = img.astype(np.float32) / 255.0 # scale to [0,1]
Thresholding
_, binary = cv2.threshold(gray, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU) # Otsu: auto-finds best T
adaptive = cv2.adaptiveThreshold(gray, 255,
                                 cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
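Otsu's "auto-finds best T" means trying every threshold and keeping the one that best separates the two resulting groups, measured by between-class variance. The sketch below is a plain brute-force version for illustration; the function name otsu_threshold and the synthetic test image are assumptions, and tie-breaking may differ from OpenCV.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold t maximizing between-class variance (pixels <= t vs. > t)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)
    best_t, best_var = 0, -1.0
    for t in range(255):
        w0 = prob[:t + 1].sum()            # weight of the dark class
        w1 = 1.0 - w0                      # weight of the bright class
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:t + 1] * prob[:t + 1]).sum() / w0
        mu1 = (levels[t + 1:] * prob[t + 1:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

# Hypothetical bimodal image: dark background at 50, bright object at 200
img = np.full((10, 10), 50, dtype=np.uint8)
img[3:7, 3:7] = 200
t = otsu_threshold(img)
binary = (img > t).astype(np.uint8) * 255
print(t, int(binary.sum()) // 255)   # threshold found, number of white pixels
```

Any threshold between the two intensity peaks separates this image perfectly; the search returns the first such value it encounters.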
Contrast Enhancement
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
enhanced = clahe.apply(gray)
4.3 Feature Extraction
Features are compact numerical representations of the image content. Instead of working with raw
224×224 = 50,176 pixel values, we extract a much smaller set of informative numbers.
Edge Features (Sobel, Canny)
# Sobel edges
Gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
Gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
edges_sobel = np.sqrt(Gx**2 + Gy**2)
# Canny edges
edges_canny = cv2.Canny(gray, 50, 150)
Corner Features (Harris, FAST)
# Harris corners
harris = cv2.cornerHarris(np.float32(gray), 2, 3, 0.04)
# FAST keypoints
fast = cv2.FastFeatureDetector_create(threshold=10)
kps = fast.detect(gray, None)
Local Descriptors (SIFT, SURF, ORB)
sift = cv2.SIFT_create()
orb = cv2.ORB_create(nfeatures=500)
kps_sift, desc_sift = sift.detectAndCompute(gray, None)
kps_orb, desc_orb = orb.detectAndCompute(gray, None)
HOG — Histogram of Oriented Gradients
HOG is one of the most important feature descriptors for object detection (made famous by pedestrian
detection). It describes the distribution of gradient orientations in local regions of the image.
How HOG works:
• Divide the image into small cells (e.g., 8×8 pixels)
• For each cell, compute gradient magnitude and direction at every pixel
• Build a histogram of gradient directions (9 bins covering 0°–180°) weighted by magnitude
• Normalize histograms over larger blocks (e.g., 2×2 cells)
• Concatenate all normalized block histograms → feature vector
from skimage.feature import hog
features, hog_image = hog(
    gray,
    orientations=9,             # number of orientation bins
    pixels_per_cell=(8, 8),     # cell size
    cells_per_block=(2, 2),     # block size for normalization
    visualize=True              # also return visualization
)
print(features.shape) # e.g., (3780,) for a 64x128 (width x height) image
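The 3780 figure can be checked with simple arithmetic, assuming blocks slide one cell at a time (the usual HOG setup). The helper name hog_feature_length is made up for this check.

```python
def hog_feature_length(height, width, cell=8, block=2, bins=9):
    """Number of HOG values when blocks slide one cell at a time."""
    cells_y, cells_x = height // cell, width // cell
    blocks_y = cells_y - block + 1
    blocks_x = cells_x - block + 1
    return blocks_y * blocks_x * block * block * bins

print(hog_feature_length(128, 64))   # 3780, the figure quoted for a 64x128 window
```

For the 64×128 window there are 8×16 cells, hence 7×15 = 105 overlapping blocks, each contributing 2×2×9 = 36 values.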
4.4 Feature Representation
After extracting per-image features, we need to represent them in a form suitable for ML models.
Feature Vector Construction
For fixed-size features (HOG, pixel values), just flatten to 1D:
feature_vec = hog_features.flatten() # or hog_features itself
# Each image → one row of a matrix X
X = np.array([extract_features(img) for img in images])
Bag of Visual Words (BoVW)
SIFT/ORB produce a variable number of descriptors per image (can't just stack them). BoVW solves
this:
• Step 1 — Extract SIFT descriptors from ALL images in the dataset
• Step 2 — Cluster all descriptors into K clusters using KMeans → these are the 'visual words'
• Step 3 — For each image: count how many of its SIFT descriptors fall into each cluster → a
histogram of size K
• Step 4 — This K-dimensional histogram is the image's feature vector
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
# Step 1-2: Build vocabulary
all_descs = np.vstack(all_descriptors)   # all_descriptors: list of per-image (N_i, 128) arrays
kmeans = KMeans(n_clusters=500, random_state=42)
kmeans.fit(all_descs)
# Step 3: Encode each image
def encode_bovw(desc, kmeans, K=500):
    labels = kmeans.predict(desc)
    hist, _ = np.histogram(labels, bins=K, range=(0,K))
    return hist.astype(float)
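The encoding step can be tried without scikit-learn by using a tiny hand-written vocabulary in place of fitted KMeans centers. Everything here (the 2-D toy descriptors, the three "visual words", the function names) is hypothetical and only meant to show how images with different descriptor counts end up with vectors of the same length.

```python
import numpy as np

def assign_words(descriptors, vocabulary):
    """Index of the nearest visual word (Euclidean distance) for each descriptor."""
    diffs = descriptors[:, None, :] - vocabulary[None, :, :]
    return np.argmin(np.sum(diffs ** 2, axis=2), axis=1)

def bovw_histogram(descriptors, vocabulary):
    """Fixed-length count of how many descriptors fall on each visual word."""
    words = assign_words(descriptors, vocabulary)
    return np.bincount(words, minlength=len(vocabulary)).astype(float)

# Hypothetical 2-D "descriptors" and a 3-word vocabulary (real SIFT rows are 128-D)
vocabulary = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
image_a = np.array([[1.0, 1.0], [9.0, 0.5], [0.2, 0.1], [11.0, -1.0]])   # 4 descriptors
image_b = np.array([[0.0, 9.0]])                                         # 1 descriptor
hist_a = bovw_histogram(image_a, vocabulary)
hist_b = bovw_histogram(image_b, vocabulary)
print(hist_a, hist_b)
```

Both histograms have length 3 regardless of how many descriptors each image produced, so they can be stacked as rows of a feature matrix.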
Feature Normalization
from sklearn.preprocessing import StandardScaler, normalize
scaler = StandardScaler() # zero mean, unit variance
X_scaled = scaler.fit_transform(X)
X_l2 = normalize(X, norm='l2') # L2 norm (unit length vectors)
Dimensionality Reduction (PCA)
If feature vectors are too large, PCA projects them to a lower-dimensional space while keeping
maximum variance.
from sklearn.decomposition import PCA
pca = PCA(n_components=100) # keep 100 principal components
X_pca = pca.fit_transform(X_scaled)
print(f'Explained variance: {pca.explained_variance_ratio_.sum():.2%}')
4.5 Model Training Using Extracted Features
Once we have feature vectors X (shape: N_images × N_features) and labels y, we train traditional ML
models.
Linear Models
Logistic Regression — Learns a linear decision boundary. Despite the name, it's a classification
algorithm. Probability output via sigmoid function.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=1.0, max_iter=1000)
lr.fit(X_train, y_train)
preds = lr.predict(X_test)
Linear SVM — Finds the hyperplane with maximum margin between classes. Best for high-dimensional
feature spaces.
from sklearn.svm import LinearSVC
svm = LinearSVC(C=1.0)
svm.fit(X_train, y_train)
Margin-Based Models: Kernel SVM
When data is not linearly separable, the kernel trick maps data to a higher-dimensional space where it
IS separable — without explicitly computing that space.
• RBF kernel: most common, good for image features
• Polynomial kernel: useful for specific geometric patterns
from sklearn.svm import SVC
svm_rbf = SVC(kernel='rbf', C=10, gamma='scale')
svm_rbf.fit(X_train, y_train)
# C: regularization (smaller = more regularization)
# gamma: how far influence of each training example reaches
Distance-Based Models: k-Nearest Neighbors (kNN)
Classify a new image by finding the K most similar images in the training set and taking a majority vote.
No training needed — just store all examples!
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train)
# For ORB binary descriptors: use Hamming distance
# (unpack the uint8 bytes to bits first, e.g. with np.unpackbits, so that individual bits are compared)
knn_bin = KNeighborsClassifier(n_neighbors=5, metric='hamming')
Tree-Based Models: Decision Tree & Random Forest
Decision Tree: Recursively splits data using feature thresholds to minimize impurity (Gini or entropy).
Easy to visualize.
Random Forest: Trains many decision trees on random subsets of the data and features, then combines
their predictions by voting. Usually more accurate and more robust to overfitting than a single tree.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
dt = DecisionTreeClassifier(max_depth=10)
dt.fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, max_depth=20)
rf.fit(X_train, y_train)
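The impurity that a decision tree minimizes when choosing a split can be computed by hand. The sketch below implements Gini impurity and the weighted impurity of a candidate split; the function names and toy labels are illustrative only.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of the two children of a candidate split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


parent = ["cat", "cat", "dog", "dog"]
print(gini(parent))                                    # 0.5 (maximally mixed, two classes)
print(split_impurity(["cat", "cat"], ["dog", "dog"]))  # 0.0 (a perfect split)
```

At each node the tree tries many feature thresholds and keeps the one with the lowest weighted impurity.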
Probabilistic Models: Naive Bayes & GMM
Naive Bayes: Applies Bayes' theorem assuming features are conditionally independent. Very fast and
surprisingly effective for text and image histograms.
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
Gaussian Mixture Models (GMM): Models the feature distribution of each class as a mixture of
Gaussian distributions. More flexible than a single Gaussian.
import numpy as np
from sklearn.mixture import GaussianMixture
# Fit one GMM per class
gmms = {}
for cls in np.unique(y_train):
    gmm = GaussianMixture(n_components=3)
    gmm.fit(X_train[y_train == cls])
    gmms[cls] = gmm
# Predict: pick class with highest log-likelihood
preds = [max(gmms, key=lambda c: gmms[c].score(x.reshape(1, -1)))
         for x in X_test]
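To show what "highest log-likelihood" means, the sketch below evaluates a one-dimensional Gaussian mixture by hand and picks the class whose mixture explains a value best. The class names and mixture parameters are hypothetical; in practice they would be fitted from data, and real features are multi-dimensional.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def mixture_log_likelihood(x, weights, means, variances):
    """log p(x) under a 1-D Gaussian mixture (weights must sum to 1)."""
    p = sum(w * gaussian_pdf(x, m, v)
            for w, m, v in zip(weights, means, variances))
    return math.log(p)


# Hypothetical per-class mixtures over a single brightness feature in [0, 1]
class_models = {
    "dark":   ([0.5, 0.5], [0.1, 0.3], [0.01, 0.01]),
    "bright": ([0.5, 0.5], [0.7, 0.9], [0.01, 0.01]),
}


def classify(x):
    """Pick the class whose mixture assigns x the highest log-likelihood."""
    return max(class_models,
               key=lambda c: mixture_log_likelihood(x, *class_models[c]))


pred = classify(0.8)
```

This comparison ignores class priors, so it implicitly assumes the classes are equally likely.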
4.6 Prediction & Inference
Image Classification
Assign one label to an entire image (e.g., 'cat', 'dog', 'car').
def classify_image(img, model, scaler, pca=None):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    features = extract_hog(gray)  # HOG or SIFT/BoVW
    features = scaler.transform(features.reshape(1, -1))
    if pca is not None:
        features = pca.transform(features)
    return model.predict(features)[0]
Image Matching
Find if two images contain the same object/scene. Use descriptor matching:
# FLANN-based matcher (fast for SIFT)
FLANN_INDEX_KDTREE = 1
index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=5)
search_params = dict(checks=50)
flann = cv2.FlannBasedMatcher(index_params, search_params)
matches = flann.knnMatch(desc1, desc2, k=2)
# Lowe's ratio test: keep only good matches
good = [m for m, n in matches if m.distance < 0.7 * n.distance]
print(f'{len(good)} good matches')
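The ratio test itself is independent of OpenCV. The sketch below applies it to plain (best, second-best) distance pairs so the filtering rule is easy to follow; the function name and distances are made up for illustration.

```python
def ratio_test(pairs, ratio=0.7):
    """Return indices of matches whose best distance is clearly below the second best.

    pairs: list of (best_distance, second_best_distance), one per query descriptor.
    """
    return [i for i, (d1, d2) in enumerate(pairs) if d1 < ratio * d2]


# Hypothetical distances for four query descriptors
pairs = [(10.0, 100.0),  # distinctive match: kept
         (50.0, 55.0),   # two candidates almost equally close: rejected
         (30.0, 40.0),   # 30 is not below 0.7 * 40 = 28: rejected
         (20.0, 30.0)]   # 20 is below 0.7 * 30 = 21: kept
good = ratio_test(pairs)
```

A lower ratio keeps fewer but more reliable matches; values around 0.7 to 0.8 are typical.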
Similarity-Based Retrieval
Given a query image, retrieve the most similar images from a database (like a visual search engine).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
query_feat = extract_features(query_img).reshape(1, -1)
similarities = cosine_similarity(query_feat, database_features)
top_k = np.argsort(similarities[0])[::-1][:5]  # top 5 results
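For reference, the same ranking can be written without any libraries. The sketch below computes cosine similarity by hand and returns the indices of the best matches; the function names and the tiny 2-D "database" are hypothetical stand-ins for real feature vectors.

```python
import math


def cosine(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, database, k=2):
    """Indices of the k database vectors most similar to query, best first."""
    scores = [cosine(query, v) for v in database]
    return sorted(range(len(database)), key=lambda i: scores[i], reverse=True)[:k]


database = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
result = top_k([2.0, 0.1], database, k=2)  # vector 0 is nearly parallel, vector 2 is next
```

Cosine similarity depends only on direction, not magnitude, so it is insensitive to overall feature scaling.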
4.7 Evaluation
How do we know if our CV system is working well? We use quantitative metrics.
Accuracy
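Percentage of correctly classified images. Simple, but it can be misleading on imbalanced datasets: a
model that always predicts the majority class of a 95/5 split still scores 95%.
from sklearn.metrics import accuracy_score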
acc = accuracy_score(y_true, y_pred)
print(f'Accuracy: {acc:.2%}')
Precision & Recall
For each class:
• Precision = TP / (TP + FP) — Of all images predicted as class X, how many actually are X? (No
false alarms)
• Recall = TP / (TP + FN) — Of all images that ARE class X, how many did we catch? (No
misses)
• F1 Score = 2 × (Precision × Recall) / (Precision + Recall) — Harmonic mean, balances both
from sklearn.metrics import precision_score, recall_score, f1_score
print(precision_score(y_true, y_pred, average='macro'))
print(recall_score(y_true, y_pred, average='macro'))
print(f1_score(y_true, y_pred, average='macro'))
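The formulas above can be checked by counting TP, FP, and FN directly. The sketch below does this per class and then averages the F1 scores, which is what the 'macro' average means; the function name and toy labels are illustrative.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Return (precision, recall, f1) for one class, using 0.0 when undefined."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


y_true = ["cat", "cat", "cat", "dog"]
y_pred = ["cat", "cat", "dog", "dog"]

cat = per_class_metrics(y_true, y_pred, "cat")  # TP=2, FP=0, FN=1
dog = per_class_metrics(y_true, y_pred, "dog")  # TP=1, FP=1, FN=0
macro_f1 = (cat[2] + dog[2]) / 2                # unweighted mean over classes
```

Macro averaging treats every class equally regardless of how many images it has, which makes it a reasonable default for imbalanced datasets.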
Confusion Matrix
A table showing actual vs predicted labels for all classes. Diagonal = correct predictions. Off-diagonal =
errors. Shows WHICH classes are being confused.
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted'); plt.ylabel('Actual')
plt.show()
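Building the matrix by hand shows exactly what each cell counts. In this sketch, rows are actual classes and columns are predicted classes, matching the axis labels above; the function name and labels are hypothetical.

```python
def confusion_matrix_counts(y_true, y_pred, labels):
    """Rows = actual class, columns = predicted class, in the order of labels."""
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1
    return cm


labels = ["cat", "dog", "car"]
y_true = ["cat", "cat", "dog", "dog", "car"]
y_pred = ["cat", "dog", "dog", "dog", "car"]

cm = confusion_matrix_counts(y_true, y_pred, labels)
# cm[0][1] == 1: one actual "cat" was predicted as "dog"
accuracy = sum(cm[i][i] for i in range(len(labels))) / len(y_true)
```

Accuracy is the sum of the diagonal divided by the total count, so the confusion matrix contains everything needed to derive the other metrics.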
Cross-Validation (Image-Level)
Don't evaluate on the training set! Cross-validation splits data into K folds, trains on K-1 folds, tests on
1, repeats K times. Gives an honest estimate of generalization performance.
from sklearn.model_selection import cross_val_score, StratifiedKFold
# 5-fold stratified CV (maintains class proportions per fold)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f'CV Accuracy: {scores.mean():.2%} ± {scores.std():.2%}')
⚠️Important
For image datasets, make sure images from the same scene or sequence don't appear in BOTH train
and test folds (data leakage). Use GroupKFold if images are grouped.
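The grouping rule can be sketched without any libraries: assign whole groups to folds so that no group is split between train and test. The round-robin assignment below is a simplification for illustration, and the function name and scene IDs are hypothetical; scikit-learn's GroupKFold is the maintained implementation of this idea.

```python
def group_folds(groups, n_folds):
    """Yield (train_indices, test_indices) so that no group spans both sides."""
    unique = sorted(set(groups))
    fold_of = {g: i % n_folds for i, g in enumerate(unique)}  # round-robin assignment
    for fold in range(n_folds):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


# Hypothetical scene IDs: images 0-1 come from scene "a", 2-3 from "b", 4-5 from "c"
groups = ["a", "a", "b", "b", "c", "c"]
splits = list(group_folds(groups, n_folds=3))

for train, test in splits:
    train_groups = {groups[i] for i in train}
    test_groups = {groups[i] for i in test}
    print(sorted(test_groups), "held out; overlap:", train_groups & test_groups)
```

Every fold holds out complete scenes, so near-duplicate frames from one sequence can never inflate the test score.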
QUICK REFERENCE SUMMARY
All Color Space Conversions
cv2.COLOR_BGR2GRAY # BGR → Grayscale
cv2.COLOR_BGR2RGB # BGR → RGB
cv2.COLOR_RGB2BGR # RGB → BGR
cv2.COLOR_BGR2HSV # BGR → HSV
cv2.COLOR_HSV2BGR # HSV → BGR
cv2.COLOR_BGR2LAB # BGR → LAB (perceptual)
cv2.COLOR_LAB2BGR # LAB → BGR
All Interpolation Methods
cv2.INTER_NEAREST # Nearest Neighbor — fastest, blocky
cv2.INTER_LINEAR # Bilinear — fast, smooth (default)
cv2.INTER_CUBIC # Bicubic — slower, best quality
cv2.INTER_LANCZOS4 # Lanczos — very high quality
cv2.INTER_AREA # Best for downsampling (shrinking)
Feature Detector & Descriptor Quick Comparison
Method   Type         Descriptor Size    Speed       Use Case
SIFT     Det+Desc     128 floats         Slow        Accurate matching
SURF     Det+Desc     64 floats          Medium      Faster SIFT alternative
ORB      Det+Desc     32 bytes (binary)  Fast        Free real-time alternative
FAST     Detector     n/a                Very Fast   Real-time keypoint detection
BRIEF    Descriptor   32 bytes (binary)  Very Fast   Needs separate detector
Harris   Detector     n/a                Medium      Corner analysis
HOG      Descriptor   3780+ floats       Medium      Object classification
ML Models Quick Comparison
Model           Best For               Pros                        Cons
Logistic Reg    Baseline               Fast, interpretable         Linear only
Linear SVM      High-dim features      Fast, good generalization   Linear only
Kernel SVM      Non-linear data        Very accurate               Slow on large data
kNN             Similarity retrieval   No training needed          Slow prediction
Decision Tree   Interpretability       Fast, visual                Overfits easily
Random Forest   General use            Robust, accurate            Memory intensive
Naive Bayes     Text/histogram         Very fast                   Assumes independence
GMM             Density estimation     Models distributions        Complex tuning