0% found this document useful (0 votes)
5 views704 pages

Understanding Color Spaces in Computer Vision

Uploaded by

tdfyn2mf4s
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views704 pages

Understanding Color Spaces in Computer Vision

Uploaded by

tdfyn2mf4s
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Color Spaces

CSCE 4603 – Fundamentals of Computer Vision


Dr. Mohamed Sedky
Today’s class
• What are the different Color Spaces?

• How the image is formed inside a camera?

• What does the intensity of a pixel tell us?


Color spaces
• How can we represent color?

[Link]
Color spaces
• A color space is a mathematical representation of continuous spectral
colors in a three-dimensional vector space, using three primary
colors, which allows color analysis and manipulation.
• Different color spaces can be derived by defining different primary
colors.
• Color spaces are developed according to output devices, physiological
or physical properties.
• A number of standard color space specifications are well developed.
However, different definitions can often be found for the same color
space. For more information about color spaces.

[Link]
Color spaces
• Color spaces can be classified as:
• Device independent
• Device dependent
• Device independent color components are the same on all output
devices.
• Device dependent color spaces will have different components for
different output devices.
• Color spaces can be also classified as user-oriented models which try
to build a bridge between the user and the hardware used to
manipulate color.

[Link]
Color spaces: RGB 0,1,0
Default color space
• The RGB camera output maps spectral power
densities to a triple of numerical components
that are the mathematical coordinates of R, G
and B.
• It is device dependent color space used for 1,0,0
capture and display devices.
• It is an additive color system. 0,0,1
• RGB space may be modelled as a cube with the
three axes corresponding to red, green and RGB cube
blue.
• The bottom corner, when red = green = blue = 0 • Easy for devices
is black, while the opposite top corner, where • But not perceptual
red = green = blue = 255 (for an 8-bit per • Where do the grays live?
channel display system), is white.
• Where is hue and saturation?

Image from: [Link]


Color spaces: RGB
• The RGB color space suffers from the
high correlation among the R, G and B R
(G=0,B=0)
components; if the intensity changes, all
the three components will change
accordingly.
• Therefore, the measurement of a color G
in RGB space does not represent color (R=0,B=0)

differences in a uniform scale; it is


difficult to evaluate the similarity of
two colors from their distance in RGB
space. B
(R=0,G=0)

Image from: [Link]


Grey-scale
• grey-scale represents the 1D projection
of the three channels of colour images
R
that is regarded as a simple way of (G=0,B=0)

combining color information.


• In Rec. 709 standard, the weights to
compute true CIE luminance from linear G
red, green and blue are: (R=0,B=0)

𝑌709 = 0.212 𝑅 + 0.7154 𝐺 + 0.0721 𝐵


B
(R=0,G=0)

Image from: [Link]


Color spaces: YCbCr
Fast to compute, good for compression,
used by TV
Y=0 Y=0.5 Y
(Cb=0.5,Cr=0.5)

Cr

Cb
(Y=0.5,Cr=0.5)

Cb
Y=1

Cr
(Y=0.5,Cb=05)
Y CbCr
• The Y CbCr color space separates RGB into luminance and chrominance information.
• Luminance information is stored as a single component (Y), and chrominance
information is stored as two color-difference components (Cb and Cr).
• Cb represents the difference between the blue component and a reference value.
• Cr represents the difference between the red component and a reference value.
• Digital video widely uses this color space.
• Given R, G and B in the range of [0, 1], with Y in the range of [16, 235] and Cb, Cr, in
the range of [16, 240].
𝑌ሗ 16 65.841 128.553 24.966 𝑅ሗ
𝐶ሗ𝑏 = 128 + −37.797 −0.587 112 𝐺ሗ
𝐶ሗ𝑟 128 112 −93.786 −18.214 𝐵ሗ
HSV
• Hue, Saturation, Value (Intensity) HSV is a non-
linear transformation from the RGB camera
output, which is more intuitive to human vision.
• It separates color information of an image from its
intensity information.
• Color information is represented by Hue and
Saturation values, while Intensity describes the
brightness of an image pixel.
• Hue represents basic colors, and is determined by
the dominant wavelength in the spectral
distribution of light wavelengths. It is the location
of the peak in the spectral distribution.
• The Saturation is a measure of the purity of the
color, and signifies the amount of white light mixed
with the Hue. It is the height of the peak relative to
the entire spectral distribution.
3 𝑅−𝐵 𝑅+𝐺+𝐵 min 𝑅, 𝐺, 𝐵
𝐻, 𝐼 , 𝑆 = (arctan( ), ,1 − )
2𝑅 − 𝐺 − 𝐵 3 𝐼
Color spaces: HSV
Intuitive color space

H
(S=1,V=1)

S
(H=1,V=1)

V
(H=1,S=0)
HSV

• Hue, Saturation, Value (Intensity)


• RGB cube on its vertex
• Decouples the three components (a bit)
• Use rgb2hsv() and hsv2rgb() in Matlab
Slide by Steve Seitz
HSV
• The main drawback of the hue that it has a nonremovable
singularity near the axis of the color cylinder, where a
slight change of input R, G and B values can cause a large
jump in the transformed values.
• Hue and Saturation play little role in distinguishing colors if
the intensity of the color lies close to white or black.
• The singularities may create discontinuities in the
representation of colours.
• The hue values near the singularity are numerically
unstable.
• That is why pixels having low saturation are left unassigned
to any regions in many segmentation algorithms.
• There are some variants of HSI color spaces, such as HSB
Hue Saturation Brightness, HSL Hue Saturation Lightness,
and HSV Hue Saturation Value.

Slide by Steve Seitz


CIE 1976 XYZ
• CIE 1976 XYZ and CIE 1976 L*a*b* are device-independent color spaces developed by the
International Commission on Illumination, known by the acronym CIE.
• These color spaces model colors according to the typical sensitivity of the three types of
cone cells in the human eye.
• The XYZ color space is the original model developed by the CIE.
• The Y channel represents the luminance of a color.
• The Z channel approximately relates to the amount of blue in an image, but the value
of Z in the XYZ color space is not identical to the value of B in the RGB color space.
• The X channel does not have a clear color analogy.
• However, if you consider the XYZ color space as a 3-D coordinate system, then X values lie
along the axis that is orthogonal to the Y (luminance) axis and the Z axis.
• Device-independent color spaces include the effect of the illumination source, called the
reference white point.
• The source imparts a color hue to the raw image data according to the color temperature
of the illuminant. For example, sunlight during sunrise or sunset imparts a yellow hue to
an image, whereas sunlight around noontime imparts a blue hue.
The CIE XYZ primaries are theoretical, device-independent components where Y represents luminance, and X and Z model color perception.

Slide by Steve Seitz


Color spaces: L*a*b*
• The L*a*b* color space provides a more
perceptually uniform color space than
the XYZ model.
• Colors in the L*a*b* color space can
exist outside the RGB gamut (the valid
set of RGB colors).
• “Perceptually uniform” color space
Color spaces: L*a*b*
• L* is the luminance or brightness of the
image. Values are in the range [0, 100],
where 0 specifies black and 100 specifies L
white. As L* increases, colors become (a=0,b=0)

brighter.
• a* is the amount of red or green tones in
the image. A large positive a* value
corresponds to red/magenta. A large a
negative a* value corresponds to green. (L=65,b=0)

• b* is the amount of yellow or blue tones


in the image. A large positive b* value
corresponds to yellow. A large
negative b* value corresponds to blue. b
(L=65,a=0)
Thank you
• Next class: Image Formation and Reflection Models
Image Formation
Reflection Models

CSCE 4603 – Fundamentals of Computer Vision


Dr. Mohamed Sedky
Today’s class
• How the image is formed inside a camera?

• What are the different reflection models?

• What does the intensity of a pixel tell us?


Image formation

Let’s design a camera


– Idea 1: put a piece of film in front of an object
– Do we get a reasonable image?
Pinhole camera

Idea 2: add a barrier to block off most of the rays


• This reduces blurring
• The opening known as the aperture
Pinhole camera
f

f = focal length
c = center of the camera
How does a pixel get its value?

Light emitted

Fraction of light reflects


into camera

Lens
Sensor
Slide credit: Derek Hoiem
How does a pixel get its value?
• Major factors
• Illumination strength and direction Light emitted
• Surface geometry
• Surface material Light reflected to
camera
• Nearby surfaces
• Camera gain/exposure
Sensor

Slide credit: Derek Hoiem


A photon’s life choices
• Absorption light source

• Diffusion
λ
• Reflection
• Transparency
• Refraction
?
• Fluorescence
• Subsurface scattering
• Phosphorescence
• Interreflection
A photon’s life choices
• Absorption light source

• Diffusion
λ
• Reflection
• Transparency
• Refraction
• Fluorescence
• Subsurface scattering
• Phosphorescence
• Interreflection
A photon’s life choices
• Absorption light source

• Diffuse Reflection
λ
• Reflection
• Transparency
• Refraction
• Fluorescence
• Subsurface scattering
• Phosphorescence
• Interreflection
A photon’s life choices
• Absorption light source

• Diffusion
λ
• Specular Reflection
• Transparency
• Refraction
• Fluorescence
• Subsurface scattering
• Phosphorescence
• Interreflection
A photon’s life choices
• Absorption light source

• Diffusion
λ
• Reflection
• Transparency
• Refraction
• Fluorescence
• Subsurface scattering
• Phosphorescence
• Interreflection
A photon’s life choices
• Absorption light source

• Diffusion
λ
• Reflection
• Transparency
• Refraction
• Fluorescence
• Subsurface scattering
• Phosphorescence
• Interreflection
A photon’s life choices
• Absorption light source

• Diffusion
λ1
• Reflection
• Transparency λ2
• Refraction
• Fluorescence
• Subsurface scattering
• Phosphorescence
• Interreflection
A photon’s life choices
• Absorption light source

• Diffusion
λ
• Reflection
• Transparency
• Refraction
• Fluorescence
• Subsurface scattering
• Phosphorescence
• Interreflection
A photon’s life choices
• Absorption light source

• Diffusion
t=1
• Reflection
• Transparency t=n
• Refraction
• Fluorescence
• Subsurface scattering
• Phosphorescence
• Interreflection
A photon’s life choices
• Absorption light source

• Diffusion
λ
• Reflection
• Transparency
• Refraction
• Fluorescence
• Subsurface scattering
• Phosphorescence (Specular Interreflection)

• Interreflection
Lambertian Reflectance
• In computer vision, the complexity of light transport is mostly
ignored.
• Surfaces are often assumed to be ideal diffuse reflectors with no
dependence on viewing direction.

[Link]
Basic models of reflection
• Specular: light bounces off at the incident angle
• E.g., mirror specular reflection incoming light

Θ Θ

• Diffuse: light scatters in all directions


• E.g., brick, cloth, rough wood
diffuse reflection incoming light

Slide credit: Derek Hoiem


Lambertian reflectance model
• Some light is absorbed (function of albedo 𝜌)
• Remaining light is scattered (diffuse reflection)
• Examples: soft cloth, concrete, matte paints
light source light source

diffuse reflection
𝜌
absorption
(1 − 𝜌)

Slide credit: Derek Hoiem


Specular Reflection
Flickr, by suzysputnik

• Reflected direction depends on light


orientation and surface normal
• E.g., mirrors are fully specular
• Most surfaces can be modeled with a
mixture of diffuse and specular components
light source

specular reflection

Θ Θ

Flickr, by piratejohnny
Slide credit: Derek Hoiem
Most surfaces have both specular and diffuse
components
• Specularity = spot where specular reflection dominates (typically
reflects light source)

Typically, specular component is small

Slide credit: Derek Hoiem Photo: [Link]


Intensity and Surface Orientation
Intensity depends on illumination angle
Why? Less light comes in at oblique angles.

𝜌 = Albedo: fraction of light that is reflected


𝑺 = directional source
𝑵 = surface normal
I = reflected intensity

Slide credit: Forsyth


1

2
Recap
absorption
• When light hits a typical surface
• Some light is absorbed (1-𝜌)
• More absorbed for low albedos
diffuse
reflection
• Some light is reflected diffusely
• Independent of viewing direction

• Some light is reflected specularly


specular
• Light bounces off (like a mirror), depends on reflection
viewing direction
Θ Θ
Slide credit: Derek Hoiem
Reflection models

Lambertian: Mirrored: reflection Glossy: reflection mostly


reflection all diffuse all specular diffuse, some specular
Dichromatic reflection model

• The dichromatic model represents the reflected light of an inhomogeneous


dielectric material as a linear combination of diffuse and specular reflections.
• Each of these parts is further divided into two elements, one accounting for
the geometry and another purely spectral.

𝑅 = 𝜔𝑑 ‫׬‬Ω 𝐸( 𝜆)𝑆(𝜆)𝑄𝑅 (𝜆) + 𝜔


෥𝑠 ‫׬‬Ω 𝐸( 𝜆)𝑄𝑅 (𝜆)
𝐺 = 𝜔𝑑 ‫׬‬Ω 𝐸( 𝜆)𝑆(𝜆)𝑄𝐺 (𝜆) + 𝜔
෥𝑠 ‫׬‬Ω 𝐸( 𝜆)𝑄𝐺 (𝜆)
𝐵 = 𝜔𝑑 ‫׬‬Ω 𝐸( 𝜆)𝑆(𝜆)𝑄𝐵 (𝜆) + 𝜔
෥𝑠 ‫׬‬Ω 𝐸( 𝜆)𝑄𝐵 (𝜆)
Dichromatic reflection model
• The assumptions made by the dichromatic reflection
model are:
1. There is a single light source, that can be a point source
or an area source;
2. The illumination has a constant SPD across the scene;
3. The amount of illumination can vary across the scene.
• For what concerns the surface properties, the model
assumes that:
1. The surface is opaque;
2. The surface is not optically active (no fluorescence);
3. The colorant is uniformly distributed
Shading reflection model
• Shading model is an approximated model derived from the dichromatic
reflection model.
• For images of scenes which contain Lambertian surfaces; shading reflection
model adopt two approximations to the dichromatic model: diffuse only
reflection and narrowband sensor sensitivity approximation.
• Diffuse only reflection approximation:
• Assuming Lambertian surfaces, theoretically, an image taken by a digital color
camera (for diffuse only reflection) can be described as:
𝐼𝑐 = 𝜔𝑑 න 𝐸( 𝜆)𝑆(𝜆)𝑄𝑐 (𝜆)
• Narrowband sensor sensitivity approximation:
Ω
– The second approximation that may be taken is ideal camera sensors, combined
with the first approximation (diffuse only reflection).
– For narrowband sensor sensitivity, camera sensors are assumed ideal and are
expressed using the Dirac delta function as:
𝑄𝑐 = 𝛿(𝜆 − 𝜆𝑐 )
Shading reflection model
• The observed image intensity at a pixel x
can then be modelled as the product of two
components: the illumination E() from the
light source(s) in the scene and the
reflectance S() of the object surface. Thus
each color channel becomes:
𝐼𝑐 = 𝐸 𝜆𝑐 𝑆 𝜆𝑐 = 𝐸𝑐 𝑆𝑐
Intrinsic image
• Only the reflectance component S contains information about the
objects visible in the scene.
• A type of illumination-invariant feature, intrinsic image, can be derived
by first filtering out the illumination component E from the image.
• If the illumination has lower spatial frequency content than the
reflectance component S, a homomorphic filter can be used to separate
the two components of the intensity signal.
• That is, logarithms are taken on both sides of the shading model to
obtain:
ln(𝐼𝑐 ) = ln 𝐸 𝜆𝑐 + ln(𝑆 𝜆𝑐 )
Intrinsic image
• Since the lower frequency component is now additive, it can be
eliminated using a high-pass filter. The intrinsic image can thus
be estimated as:

𝑆𝑐 = 𝑒 F(ln 𝐸 𝜆𝑐 +ln(𝑆 𝜆𝑐 ))

• where F(・) is a high-pass filter.


• The intrinsic image can be provided as input to the decision rule
step of a change detection process.
Things to remember
• Important terms: diffuse/specular
reflectance, albedo

• Color vision: physics of light, trichromacy,


color consistency, color spaces (RGB, HSV,
Lab)

• Observed intensity depends on


• light sources,
• geometry/material of reflecting surface,
• surrounding objects,
• camera settings
Thank you
• Next class: Vision Systems
Vision Systems

CSCE 4603 – Fundamentals of Computer Vision


Dr. Mohamed Sedky
Today’s class
• Describe vision systems

• How CCD cameras work?

• How the human visual system works?

• Explain diffrerent illuminants


Vision system
• The light emitted by sources of
illumination and modulated by
surfaces in the scene arrives at the
capturing sensors of the color vision
system that is observing the scene.
• The vision system senses the
captured electromagnetic signal and
then transforms the information
carried by light into a color image of Lens Based Camera Obscura, 1568
the physical world.
CCD Cameras
• Color Charge Coupled Device (CCD) camera,
which is an example of a typical vision system,
contains a set of sensors which convert
electromagnetic energy into electric signals,
which are, then sampled and quantized.
• Conventional CCD cameras insert color filters,
with different spectral sensitivity to the various
wavelengths, over each sensory element,
typically red, green, and blue filters in order to
obtain color information.
• The sensor Irradiance is a measure of signal that
reaches the camera’s sensor.
• Apart from the spectral sensitivity of the colour
filters, the formation of digital
• image color values includes other factors, such as
lens characteristics, and the electronics of the
camera.
CCD Cameras
• A 3-CCD camera has three CCD
sensors for each pixel element.
• However a single- CCD camera,
captures one color component with
each CCD sensor and the two
missing color components are
estimated from the adjacent existing
sensors using Bayer filters.
Color Sensing in Camera (RGB)
• 3-chip vs. 1-chip: quality vs. cost
• Why more green?

Why 3 colors?
[Link]
Slide by Steve Seitz
Color Image R

B
Images in Python
• Images represented as a matrix
• Suppose we have a NxM RGB image called “im”
– im(0,0,0) = top-left pixel value in R-channel
– im(y, x, b) = y pixels down, x pixels to right in the bth
channel
– im(N-1, M-1, 2) = bottom-right pixel in B-channel
column
row 0.92 0.93 0.94 0.97 0.62 0.37 0.85 0.97 0.93 0.92 0.99 R
0.95 0.89 0.82 0.89 0.56 0.31 0.75 0.92 0.81 0.95 0.91
0.89 0.72 0.51
0.92
0.55
0.93
0.51
0.94
0.42
0.97
0.57
0.62
0.41
0.37
0.49
0.85
0.91
0.97
0.92
0.93 0.92 0.99 G
0.96 0.95 0.88 0.94 0.56 0.46 0.91 0.87 0.90 0.97 0.95
0.95 0.89 0.82 0.89 0.56 0.31 0.75 0.92 0.81 0.95 0.91
0.71 0.81 0.81
0.89
0.87
0.72
0.57
0.51
0.92
0.37
0.55
0.93
0.80
0.51
0.94
0.88
0.42
0.97
0.89
0.57
0.62
0.79
0.41
0.37
0.85
0.49
0.85
0.91
0.97
0.92
0.93 0.92 0.99
B
0.49 0.62 0.60 0.58 0.50 0.60 0.58 0.50 0.61 0.45 0.33
0.96 0.95 0.88 0.94 0.56 0.46 0.91 0.87 0.90 0.97 0.95
0.86 0.84 0.74 0.58 0.95
0.51 0.89
0.39 0.82
0.73 0.89
0.92 0.56
0.91 0.31
0.49 0.75
0.74 0.92 0.81 0.95 0.91
0.71 0.81 0.81 0.87 0.57 0.37 0.80 0.88 0.89 0.79 0.85
0.96 0.67 0.54 0.85 0.89
0.48 0.72
0.37 0.51
0.88 0.55
0.90 0.51
0.94 0.42
0.82 0.57
0.93 0.41 0.49 0.91 0.92
0.49 0.62 0.60 0.58 0.50 0.60 0.58 0.50 0.61 0.45 0.33
0.69 0.49 0.56 0.66 0.96
0.43 0.95
0.42 0.88
0.77 0.94
0.73 0.56
0.71 0.46
0.90 0.91
0.99 0.87 0.90 0.97 0.95
0.86 0.84 0.74 0.58 0.51 0.39 0.73 0.92 0.91 0.49 0.74
0.79 0.73 0.90 0.67 0.71
0.33 0.81
0.61 0.81
0.69 0.87
0.79 0.57
0.73 0.37
0.93 0.80
0.97 0.88 0.89 0.79 0.85
0.96 0.67 0.54 0.85 0.48 0.37 0.88 0.90 0.94 0.82 0.93
0.91 0.94 0.89 0.49 0.49
0.41 0.62
0.78 0.60
0.78 0.58
0.77 0.50
0.89 0.60
0.99 0.58
0.93 0.50 0.61 0.45 0.33
0.69 0.49 0.56 0.66 0.43 0.42 0.77 0.73 0.71 0.90 0.99
0.86 0.84 0.74 0.58 0.51 0.39 0.73 0.92 0.91 0.49 0.74
0.79 0.73 0.90 0.67 0.33 0.61 0.69 0.79 0.73 0.93 0.97
0.96 0.67 0.54 0.85 0.48 0.37 0.88 0.90 0.94 0.82 0.93
0.91 0.94 0.89 0.49 0.41 0.78 0.78 0.77 0.89 0.99 0.93
0.69 0.49 0.56 0.66 0.43 0.42 0.77 0.73 0.71 0.90 0.99
0.79 0.73 0.90 0.67 0.33 0.61 0.69 0.79 0.73 0.93 0.97
Dynamic range and camera response
• Typical scenes have a huge dynamic range
• Camera response is roughly linear in the mid
range (15 to 240) but non-linear at the extremes
• called saturation or under-saturation
• The human perception of brightness, under
common illumination conditions (neither pitch
black nor blindingly bright), follows an
approximate power function, with greater
sensitivity to relative differences between darker
tones than between lighter tones.
• If images are not gamma-encoded, they allocate
too many bits or too much bandwidth to
highlights that humans cannot differentiate, and
too few bits or too little bandwidth to shadow
values that humans are sensitive to and would
require more bits/bandwidth to maintain the
same visual quality.[
Camera Gamma Curve
𝑒𝑛𝑐𝑜𝑑𝑒𝑑 𝑏𝑟𝑖𝑔ℎ𝑡𝑛𝑒𝑠𝑠 = 𝑚𝑒𝑎𝑠𝑢𝑟𝑒𝑑 𝑏𝑟𝑖𝑔ℎ𝑡𝑛𝑒𝑠𝑠 𝛾

𝛾 will typically take values less than 1

• A pixel with value of “100” is not necessarily twice as bright as a pixel with value “50”.
Camera Gamma Curve

• The ITU-R Recommendation BT.709 transfer function defines gamma-compensated values


`R, `G and `B from R, G and B as :

• where `I represents a vector of `R `G and `B . Since gamma-correction is already applied


• by the camera as standard practice, most digital image data should be interpreted as `R `G
`B and not RGB.
• The prime symbols in this equation, and in these to follow, denote non-linear components.
Human Visual System (JVS)
• The image processing in the HVS does
not start in the brain but rather in the
retina, which begins immediately.
• In addition, the retina and the optic
nerve originate as an outgrowth of
vertebrate embryonic development.
• The retina's inner surface is parallel to
the light that opens the eye.
• The photo sensors of the retina are
reversed, not in the direction of light.
The Retinal cells are divided into
photoreceptor cells and neuronal cells
Receptive Field
• In the visual system, volumes in visual
space are receptive fields.
• The area of space (e.g., the visual field)
through which a stimulus influences
neuron firing, e.g., cone, bipolar cell, or
ganglion cell) is called the
photoreceptor's or neuron's receptive
field.
• The receptive field is always determined
relative to the given receptor or neuron.
• In other words, the stimulus is the activity
we observe which affects the neuron's
activity or state.
Color Vision
• The fact that the retina fovea
comprises three different types of
cones, each with different sensitivity in
the wavelength spectrum of Short (S),
Medium (M) and Long (L), lets us see
colours.
• The trichromatic theory is called the
three-dimensional nature of colour
vision.
Color Constancy

Color Constancy: the ability to perceive the


invariant color of a surface despite ecological
Variations in the conditions of observation.

Another of these hard inverse problems:


Physics of light emission and surface reflection
underdetermine perception of surface color

© Stephen E. Palmer, 2002


So far: light→surface→camera
• Called a local illumination model
• But much light comes from surrounding surfaces

From Koenderink slides on image texture and the flow of light


Inter-reflection is a major source of light
Inter-reflection affects the apparent color of objects
Models of light sources
• Distant point source
• One illumination direction
• E.g., sun

• Area source
• E.g., white walls, diffuser lamps, sky

• Ambient light
• Substitute for dealing with interreflections

• Global illumination model


• Account for interreflections in modeled scene
Illuminants
• A light source illuminates the material surface 10000 C

by emitting electromagnetic waves composed .


by a mixture of energy at different wavelengths.
• The emitted light denotes the visible part of the 5000 C

Energy
electromagnetic energy that covers wavelengths
from approximately 400 nm to 700 nm.
2000 C
• The visible spectrum represents only a small 700 C

portion of the complete electromagnetic


spectrum. 0 400 700 1000 2000 300

• The power emitted at each wavelength gives Visible


Region
Wavelength (nm)

the Spectral Power Distribution (SPD) of the


source (the illuminant).
Black-body radiator
• Black-body (Planckian) radiator is a special type of theoretical light that
emits radiation, due only to thermal excitation, at a maximum rate for
its given temperature (ToK) and absorbs all the radiations that strikes it.
Plank’s formula
• The SPD of a black-body may be described as a function of its absolute
temperature and wavelength by Planck’s formula:

c2 −c 2
E ( ) = c1. .(e
−5  .T
− 1)  c1. .e
−1 −5  .T

• Where 𝐸(𝜆) is the illuminant spectral power distribution in Watts / m2 /


wavelength interval; 𝜆 is the wavelength in m; 𝑐1 = 3.74183.10-16 Watt.m2;
𝑐2 = 1.4388.10-2 [Link]; T is the color temperature of the illuminant, in oK.
• The temperature of a Planckian radiator is called color temperature which
uniquely specifies the color of the source.
Correlated Color Temperature (CCT)
• An important measure that can be used to describe the colour properties
of a light source is its Correlated Color Temperature (CCT).
• The CCT of a source represents the color temperature of a black-body
radiator that has nearly the same color as the source.
• CCT is a measure of the shade of whiteness of a light source, by
comparison with a blackbody.
• For example, the CCT of a typical incandescent lighting is 2700oK which is
yellowish-white.
• CCT for Halogen lighting is 3000oK. Fluorescent lamps are manufactured to
a chosen CCT by altering the mixture of phosphors inside the tube.
Correlated Color Temperature (CCT)
• Warm-white fluorescents have CCT of 2700oK and are popular for
residential lighting. Neutral-white fluorescents have a CCT of 3000oK or
3500oK.
• Cool-white fluorescents have a CCT of 4100oK and are popular for office
lighting.
• Daylight fluorescents have a CCT of 5000oK to 6500oK, which is bluish-
white.
• Another way of describing the color properties of a typical light source
(i.e. candle, lamp or sun) may be done by means of standard illuminants,
such as the Commission Internationale de l’Eclairage (CIE) illuminants.
Questions
A. Why is (2) brighter than (1)?
6 Each points to the asphalt.
B. Why is (4) darker than (3)?
3 (4) points to the marking.
C. Why is (5) brighter than (3)?
8 5 7 Each points to the side of
the wooden block.
9 4 D. Why isn’t (6) black, given
that there is no direct path
from it to the sun?
E. Why (7) brighter than (8)?
Both point to the yellow
1 paints.
2 F. Why is (9) green, given that
the sun light contains all
visible wavelengths?
What does the intensity of a pixel tell us?

im(234, 452) = 0.58

0.92 0.93 0.94 0.97 0.62 0.37 0.85 0.97 0.93 0.92 0.99
0.95 0.89 0.82 0.89 0.56 0.31 0.75 0.92 0.81 0.95 0.91
0.89 0.72 0.51 0.55 0.51 0.42 0.57 0.41 0.49 0.91 0.92
0.96 0.95 0.88 0.94 0.56 0.46 0.91 0.87 0.90 0.97 0.95
0.71 0.81 0.81 0.87 0.57 0.37 0.80 0.88 0.89 0.79 0.85
0.49 0.62 0.60 0.58 0.50 0.60 0.58 0.50 0.61 0.45 0.33
0.86 0.84 0.74 0.58 0.51 0.39 0.73 0.92 0.91 0.49 0.74
0.96 0.67 0.54 0.85 0.48 0.37 0.88 0.90 0.94 0.82 0.93
0.69 0.49 0.56 0.66 0.43 0.42 0.77 0.73 0.71 0.90 0.99
0.79 0.73 0.90 0.67 0.33 0.61 0.69 0.79 0.73 0.93 0.97
0.91 0.94 0.89 0.49 0.41 0.78 0.78 0.77 0.89 0.99 0.93
The light of the poor pixel
• A pixel’s brightness is determined by
• Light source (strength, direction, color)
• Surface orientation
• Surface material and albedo
• Reflected light and shadows from surrounding surfaces
• Gain on the sensor

• A pixel’s brightness tells us nothing by itself


And yet we can interpret images…

• Key idea: for nearby scene points, most factors do not change much
• The information is mainly contained in local differences of brightness
Darkness = Large Difference in Neighboring Pixels
What differences in intensity tell us about
shape?

• Changes in surface normal


• Texture
• Proximity
• Indents and bumps
• Grooves and creases

Photos Koenderink slides on image texture and the flow of light


Shadows as cues

From Koenderink slides on image texture and the flow of light


Slide: Forsyth
Color constancy
• Interpret surface in terms of albedo or “true color”, rather
than observed intensity
• Humans are good at it
• Computers are not nearly as good
One source of constancy: local comparisons
[Link]
Perception of Intensity

from Ted Adelson


Perception of Intensity

from Ted Adelson


Color Correction
• Simple idea: multiply R, G, and B values by separate
constants
𝑟ǁ 𝛼𝑟 0 0 𝑟
𝑔෤ = 0 𝛼𝑔 0 𝑔
𝑏෨ 0 0 𝛼𝑏 𝑏

• How to choose the constants?


– “White world” assumption: brightest pixel is white
• Divide by largest value
– “Gray world” assumption: average value should be gray
• E.g., multiply r channel by avg(r) /avg((r+g+b)/3)
– White balancing: choose a reference as the white or gray color
Things to remember
• Important terms: diffuse/specular reflectance,
albedo

• Color vision: physics of light, trichromacy, color


consistency, color spaces (RGB, HSV, Lab)

• Observed intensity depends on


• light sources,
• geometry/material of reflecting surface,
• surrounding objects,
• camera settings

• Objects cast light and shadows on each other

• Differences in intensity are primary cues for shape


Thank you
• Next class: Image Filters
Corner Detection

CSCE 4603 – Fundamentals of Computer Vision


Dr. Mohamed Sedky
What have we learned so far?
• Light and color
– What an image records
• Filtering in spatial domain
• Filtering = weighted sum of neighboring pixels
• Smoothing, sharpening, measuring texture
• Filtering in frequency domain
• Filtering = change frequency of the input image
• Denoising, sampling, image compression
• Image pyramid and template matching
• Filtering = a way to find a template
• Image pyramids for coarse-to-fine search and
multi-scale detection
• Edge detection
• Canny edge = smooth -> derivative -> thin ->
threshold -> link
• Finding straight lines, binary image analysis
Today’s class
• What is interest point?

• Corner detection
Applications
• Key points are used for:
• Image alignment
• 3D reconstruction
• Motion tracking
• Robot navigation
• Indexing and database retrieval
• Object recognition
Why extract features?
• Motivation: panorama stitching
• We have two images – how do we combine them?
Why extract features?
• Motivation: panorama stitching
• We have two images – how do we combine them?

Step 1: extract features


Step 2: match features
Why extract features?
• Motivation: panorama stitching
• We have two images – how do we combine them?

Step 1: extract features


Step 2: match features
Step 3: align images
Image matching

by Diva Sian

by swashford
Harder case

by Diva Sian by scgbt


Planar object instance recognition
Database of planar objects Instance recognition
3D object recognition
Database of 3D objects 3D objects recognition
Recognition under occlusion
Location Recognition
Robot Localization
Image matching
NASA Mars Rover images

Where are the corresponding points?


Pick a point in the image.
Find it again in the next image.

What type of feature would you select?


Pick a point in the image.
Find it again in the next image.

What type of feature would you select?


Pick a point in the image.
Find it again in the next image.

What type of feature would you select?


a corner
How do you find a corner?
How do you find a corner?
[Moravec 1980]

Easily recognized by looking through a small window

Shifting the window should give large change in intensity


Corner Detection: Basic Idea
• How does the window change when you shift it?
• Shifting the window in any direction causes a big
change

“flat” region: “edge”: “corner”:


no change in all no change along the significant change in
directions edge direction all directions
Credit: S. Seitz, D. Frolova, D. Simakov
Harris corner detection: the math
Consider shifting the window W by (u,v)
• how do the pixels in W change?
W
• compare each pixel before and after by
summing up the squared differences (SSD)
• this defines an SSD “error” E(u,v):
Small motion assumption
Taylor Series expansion of I:

If the motion (u,v) is small, then first order approximation is good

Plugging this into the formula on the previous slide…


Harris corner detection: the math

Using the small motion assumption,


W
replace I with a linear approximation
(Shorthand: )
Corner detection: the math

• Thus, E(u,v) is locally approximated as a quadratic form


The second moment matrix
The surface E(u,v) is locally approximated by a quadratic form.

Let’s try to understand its shape.


Visualizing quadratics
Equation of a circle

Equation of a ‘bowl’ (paraboloid)

If you slice the bowl at

what do you get?


Visualizing quadratics
Equation of a circle

Equation of a ‘bowl’ (paraboloid)

If you slice the bowl at

what do you get?


Visualizing quadratics

can be written in matrix form like this…


Visualizing quadratics

‘sliced at 1’
Visualizing quadratics

What happens if you increase


coefficient on x?

and slice at 1
Visualizing quadratics

What happens if you increase


coefficient on x?

and slice at 1
decrease width in x!
Visualizing quadratics

What happens if you increase


coefficient on y?

and slice at 1
Visualizing quadratics

decrease width in y
What happens if you increase
coefficient on y?

and slice at 1
Visualizing quadratics

can be written in matrix form like this…

Corner detection is better when you need distinct, repeatable features for matching, tracking, or identifying points of interest in images with significant texture or structure.

What’s the shape?


What are the eigenvectors?
What are the eigenvalues?
Visualizing quadratics

can be written in matrix form like this…

Result of Singular Value Decomposition (SVD)

eigenvalues
eigenvectors along diagonal

Inverse sqr of
axis of the ‘ellipse length of the
slice’ quadratic along
the axis
Visualizing quadratics
Eigenvectors Eigenvalues

Eigenvectors
Inverse sqr of the size of the axis

Eigenvector
Eigenvector
Recall:

you can smash this bowl in the y direction

you can smash this bowl in the x direction


Visualizing quadratics Eigenvalues

Eigenvectors Eigenvectors

Inverse sqr of length of axis

Eigenvector
Eigenvector
Eigenvalues

Eigenvectors Eigenvectors
Eigenvalues

Eigenvectors Eigenvectors
We will need this to understand the…

Error function for Harris Corners


The surface E(u,v) is locally approximated by a quadratic form
E(u,v)

Horizontal edge:
v
u
E(u,v)

Vertical edge:
v
u
General case
The shape of H tells us something about the distribution
of gradients around a pixel
We can visualize H as an ellipse with axis lengths
determined by the eigenvalues of H and orientation
determined by the eigenvectors of H
max, min : eigenvalues of H
Ellipse equation: direction of the
fastest change
u 
[u v] H   = const direction of the
slowest change
v 
(max)-1/2
(min)-1/2
Quick eigenvalue/eigenvector review
The eigenvectors of a matrix A are the vectors x that satisfy:

The scalar  is the eigenvalue corresponding to x


• The eigenvalues are found by solving:

• In our case, A = H is a 2x2 matrix, so we have

• The solution:

Once you know , you find x by solving


Corner detection: the math

xmin

xmax

Eigenvalues and eigenvectors of H


• Define shift directions with the smallest and largest change in error
• xmax = direction of largest increase in E
• max = amount of increase in direction xmax
• xmin = direction of smallest increase in E
• min = amount of increase in direction xmin
Corner detection: the math
How are max, xmax, min, and xmin relevant for feature detection?
• What’s our feature scoring function?
Corner detection: the math
How are max, xmax, min, and xmin relevant for feature detection?
• What’s our feature scoring function?
Want E(u,v) to be large for small shifts in all directions
• the minimum of E(u,v) should be large, over all unit vectors [u v]
• this minimum is given by the smaller eigenvalue (min) of H
Interpreting the eigenvalues
Classification of image points using eigenvalues of M:

2 “Edge”
2 >> 1 “Corner”
1 and 2 are large,
1 ~ 2 ;
E increases in all
directions

1 and 2 are small;


E is almost constant “Flat” “Edge”
in all directions region 1 >> 2

1
Corner detection summary
Here’s what you do
• Compute the gradient at each point in the image
• Create the H matrix from the entries in the gradient
• Compute the eigenvalues.
• Find points with large response (min > threshold)
• Choose those points where min is a local maximum as features
Corner detection summary
Here’s what you do
• Compute the gradient at each point in the image
• Create the H matrix from the entries in the gradient
• Compute the eigenvalues.
• Find points with large response (min > threshold)
• Choose those points where min is a local maximum as features
The Harris operator
min is a variant of the “Harris operator” for feature detection

• The trace is the sum of the diagonals, i.e., trace(H) = h11 + h22
• Very similar to min but less expensive (no square root)
• Called the “Harris Corner Detector” or “Harris Operator”
• Lots of other detectors, this is one of the most popular
The Harris operator

Harris
operator
Harris corner detector
Finding corners
(a.k.a. PCA)
[Link] image gradients over
small region

[Link] mean from each image


gradient

[Link] the covariance matrix

[Link] eigenvectors and


eigenvalues

[Link] threshold on eigenvalues to


detect corners
1. Compute image gradients over a small region
(not just a single pixel)
1. Compute image gradients over a small region
(not just a single pixel)

array of x gradients

array of y gradients
Visualization of gradients
image

X derivative

Y derivative
What does the distribution tell you about the region?
distribution reveals edge orientation and magnitude
How do you quantify orientation and magnitude?
2. Subtract the mean from each image gradient
2. Subtract the mean from each image gradient

plot intensities

constant intensity
gradient

intensities along the line


2. Subtract the mean from each image gradient

plot intensities

constant intensity
gradient

intensities along the line

subtract mean

plot of image gradients


2. Subtract the mean from each image gradient

plot intensities

constant intensity
gradient

intensities along the line

subtract mean

data is centered
plot of image gradients
(‘DC’ offset is removed)
3. Compute the covariance matrix
3. Compute the covariance matrix

=sum( .* )
array of x gradients array of y gradients

Where does this covariance matrix come from?


Easily recognized by looking through a small window

Shifting the window should give large change in intensity

“flat” region: “edge”: “corner”:


no change in all no change along the edge significant change in all
directions direction directions

[Moravec 1980]
Some mathematical background…

Error function
Change of intensity for the shift [u,v]:

Error Window Shifted Intensity


function function intensity

Window function w(x,y) = or

1 in window, 0 outside Gaussian


Error function approximation
Change of intensity for the shift [u,v]:

First-order Taylor expansion of I(x,y) about (0,0)


(bilinear approximation for small shifts)
Bilinear approximation
For small shifts [u,v] we have a ‘bilinear approximation’:

Change in
appearance for a
shift [u,v]

where M is a 2×2 matrix computed from image derivatives:

‘second moment’ matrix


‘structure tensor’
By computing the gradient covariance matrix…

we are fitting a quadratic to the gradients over a


small image region
Visualization of a quadratic
The surface E(u,v) is locally approximated by a quadratic form
Which error surface indicates a good image feature?

What kind of image patch do these surfaces represent?


Which error surface indicates a good image feature?

flat
Which error surface indicates a good image feature?

flat edge
‘line’
Which error surface indicates a good image feature?

flat edge corner


‘line’ ‘dot’
4. Compute eigenvalues and eigenvectors
4. Compute eigenvalues and eigenvectors
eigenvalue

eigenvector
4. Compute eigenvalues and eigenvectors
eigenvalue

eigenvector

1. Compute the determinant of


(returns a polynomial)
4. Compute eigenvalues and eigenvectors
eigenvalue

eigenvector

1. Compute the determinant of


(returns a polynomial)

2. Find the roots of polynomial


(returns eigenvalues)
4. Compute eigenvalues and eigenvectors
eigenvalue

eigenvector

1. Compute the determinant of


(returns a polynomial)

2. Find the roots of polynomial


(returns eigenvalues)

3. For each eigenvalue, solve


(returns eigenvectors)
eig(M)
Visualization as an ellipse
Since M is symmetric, we have

We can visualize M as an ellipse with axis lengths determined by the


eigenvalues and orientation determined by R

direction of the fastest


change
Ellipse equation:
direction of the
slowest change
(max)-1/2
(min)-1/2
interpreting eigenvalues
2
2 >> 1

What kind of image patch does


each region represent?

1 >> 2
1
interpreting eigenvalues
2
‘horizontal’ corner
edge

2 >> 1

1 ~ 2

flat 1 >> 2
‘vertical’
edge

1
interpreting eigenvalues
2
‘horizontal’ corner
edge

2 >> 1

1 ~ 2

flat 1 >> 2
‘vertical’
edge

1
interpreting eigenvalues
2
‘horizontal’ corner
edge

2 >> 1

1 ~ 2

flat 1 >> 2
‘vertical’
edge

1
5. Use threshold on eigenvalues to detect corners
5. Use threshold on eigenvalues to detect corners

2

Think of a function to
score ‘cornerness’

flat

1
5. Use threshold on eigenvalues to detect corners

2

strong corner
Think of a function to
score ‘cornerness’

flat

1
5. Use threshold on eigenvalues to detect corners
^
(a function of )
2
corner

Use the smallest eigenvalue as


the response function

flat

1
5. Use threshold on eigenvalues to detect corners
^
(a function of )
2
corner

Eigenvalues need to be
bigger than one.

Can compute this more efficiently…

flat

1
5. Use threshold on eigenvalues to detect corners
^
(a function of )
2
corner

R<0 R>0

R<0
flat

1
Harris & Stephens (1988)

Kanade & Tomasi (1994)

Nobel (1998)
Harris Detector
[Link] and [Link]. “A Combined Corner and Edge Detector.”1988.

1. Compute x and y derivatives of image

2. Compute products of derivatives at every pixel

3. Compute the sums of the products of derivatives at


each pixel
Harris Detector
[Link] and [Link]. “A Combined Corner and Edge Detector.”1988.

4. Define the matrix at each pixel

5. Compute the response of the detector at each pixel

6. Threshold on value of R; compute non-max suppression.


Yet another option:
Different criteria

Harris criterion
Corner response
Thresholded corner response
Non-maximal suppression
Design a program to detect corners
(hint: use image gradients)
Harris Detector: Invariance Properties
• Rotation

Ellipse rotates but its shape (i.e. eigenvalues)


remains the same

Corner response is invariant to image rotation


Harris Detector: Invariance Properties
• Affine intensity change: I → aI + b
✓ Only derivatives are used =>
invariance to intensity shift I → I + b
✓ Intensity scale: I → a I

R R
threshold

x (image coordinate) x (image coordinate)


Partially invariant to affine intensity change
The Harris detector is not invariant to changes in …
The Harris corner detector is not
invariant to scale

edge!
corner!
Harris Detector: Invariance Properties
• Scaling

Corner

All points will be


classified as edges

Not invariant to scaling


Scale invariant detection
Suppose you’re looking for corners

Key idea: find scale that gives local maximum of f


• in both position and scale
• One definition of f: the Harris operator
Multi-scale detection
How can we make a feature detector scale-invariant?
How can we automatically select the scale?
Thank you
• Next class: SIFT Detector and SIFT Descriptor
Multi-scale Blob Detection

CSCE 4603 – Fundamentals of Computer Vision


Dr. Mohamed Sedky
What have we learned so far?
• Light and color
– What an image records
• Filtering in spatial domain
• Filtering = weighted sum of neighboring pixels
• Smoothing, sharpening, measuring texture
• Filtering in frequency domain
• Filtering = change frequency of the input image
• Denoising, sampling, image compression
• Image pyramid and template matching
• Filtering = a way to find a template
• Image pyramids for coarse-to-fine search and multi-
scale detection
• Edge detection
• Canny edge = smooth -> derivative -> thin -> threshold -
> link
• Finding straight lines, binary image analysis
• Corner detection
• Harris corner = smooth -> SSD of X derivative-> SSD of Y
derivative-> SSD of X.Y Derivatives -> Eigen values of H
matrix -> error function -> threshold -> non-maximal
suppression
Today’s class
• What is an Interest Point?
• Detecting Blobs
• SIFT Detector
• SIFT Descriptor
Multi-scale detection
How can we make a feature detector scale-invariant?
How can we automatically select the scale?
Multi-scale blob detection
Intuitively…
Find local maxima in both position and scale

f f
Image 1 Image 2

s1 region size s2 region size


Formally…
Laplacian filter

Highest response when the signal has the


same characteristic scale as the filter
characteristic scale - the scale that
produces peak filter response

characteristic scale
we need to search over characteristic scales
What happens if you apply different Laplacian filters?

Full size 3/4 size


jet color scale
blue: low, red: high
What happened when you applied different Laplacian filters?

Full size 3/4 size


peak!
peak!
What happened when you applied different Laplacian filters?

Full size 3/4 size


2.1 4.2 6.0

9.8 15.5 17.0

peak!
2.1 4.2 6.0

9.8 15.5 17.0

maximum response
optimal scale
2.1 4.2 6.0 9.8 15.5 17.0

Full size image


2.1 4.2 6.0 9.8 15.5 17.0

3/4 size image


optimal scale
2.1 4.2 6.0 9.8 15.5 17.0
maximum
response

Full size image


2.1 4.2 6.0 9.8 15.5 17.0
maximum
response

3/4 size image


local maximum

4.2

cross-scale maximum local maximum

6.0

local maximum

9.8
How would you implement scale selection?
Implementation
• Instead of computing f for larger and larger windows, we can
implement using a fixed window size with a Gaussian pyramid

(sometimes need to create in-


between levels, e.g. a ¾-size image)
implementation

For each level of the Gaussian pyramid

compute feature response (e.g. Harris, Laplacian)

For each level of the Gaussian pyramid

if local maximum and cross-scale

save scale and location of feature


Advantages of local features
Locality
• features are local, so robust to occlusion and clutter
Quantity
• hundreds or thousands in a single image
Distinctiveness:
• can differentiate a large database of objects
Efficiency
• real-time performance achievable
Thank you

• Next class: feature detectors and desriptors


Feature Detector and
Decriptor

CSCE 4603 – Fundamentals of Computer Vision


Dr. Mohamed Sedky
What have we learned so far?
• Light and color
– What an image records
• Filtering in spatial domain
• Filtering = weighted sum of neighboring pixels
• Smoothing, sharpening, measuring texture
• Filtering in frequency domain
• Filtering = change frequency of the input image
• Denoising, sampling, image compression
• Image pyramid and template matching
• Filtering = a way to find a template
• Image pyramids for coarse-to-fine search and multi-
scale detection
• Edge detection
• Canny edge = smooth -> derivative -> thin -> threshold -
> link
• Finding straight lines, binary image analysis
• Corner detection
• Harris corner = smooth -> SSD of X derivative-> SSD of Y
derivative-> SSD of X.Y Derivatives -> Eigen values of H
matrix -> error function -> threshold -> non-maximal
suppression
Today’s class
• Designing feature descriptors
• SIFT Detector
• SIFT Descriptor
Designing feature descriptors
Geometric transformations

objects will appear at different scales,


translation and rotation
Photometric transformations
What is the best descriptor for an image feature?
Image patch
Just use the pixel values of the patch

1 2 3

4 5 6 ( 1 2 3 4 5 6 7 8 9 )
7 8 9 vector of intensity values

Perfectly fine if geometry and appearance is unchanged


(a.k.a. template matching)
Tiny Images
Just down-sample it!
Simple, fast, robust to small affine transforms.
Image patch
Just use the pixel values of the patch

1 2 3

4 5 6 ( 1 2 3 4 5 6 7 8 9 )
7 8 9 vector of intensity values

Perfectly fine if geometry and appearance is unchanged


(a.k.a. template matching)

What are the problems?


Image patch
Just use the pixel values of the patch

1 2 3

4 5 6 ( 1 2 3 4 5 6 7 8 9 )
7 8 9 vector of intensity values

Perfectly fine if geometry and appearance is unchanged


(a.k.a. template matching)

What are the problems?


How can you be less sensitive to absolute intensity values?
Image gradients
Use pixel differences

1 2 3

4 5 6 ( - + + - - + )
7 8 9 vector of x derivatives
‘binary descriptor’

Feature is invariant to absolute intensity values

What are the problems?


Image gradients
Use pixel differences

1 2 3

4 5 6 ( - + + - - + )
7 8 9 vector of x derivatives

Feature is invariant to absolute intensity values

What are the problems?


How can you be less sensitive to deformations?
Color histogram
Count the colors in the image using a histogram

colors

Invariant to changes in scale and rotation

What are the problems?


Color histogram
Count the colors in the image using a histogram

colors

Invariant to changes in scale and rotation

What are the problems?


Color histogram
Count the colors in the image using a histogram

colors

Invariant to changes in scale and rotation

What are the problems?


How can you be more sensitive to spatial layout?
Spatial histograms
Compute histograms over spatial ‘cells’

Retains rough spatial layout


Some invariance to deformations
What are the problems?
Spatial histograms
Compute histograms over spatial ‘cells’

Retains rough spatial layout


Some invariance to deformations
What are the problems?
How can you be completely invariant to rotation?
Orientation normalization
Use the dominant image gradient direction to
normalize the orientation of the patch

save the orientation angle along with

What are the problems?


Discriminative power

Raw pixels Sampled Locally orderless Global histogram

Generalization power
HOG (Histograms of Oriented
Gradients) descriptor
HOG (Histograms of Oriented Gradients)
Dalal, Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005

histogram of ‘unsigned’
gradients
Cell
(8x8 pixels)
gradient magnitude histogram
(one for each cell)

soft binning
Block
(2x2 cells)

Concatenate and L-2 normalization

Single scale, no dominant orientation


Pedestrian detection
1 cell step size visualization

128 pixels
15 x 7 x 4 x 36 =
16 cells
3780
15 blocks

64 pixels
8 cells
7 blocks
Redundant representation due to overlapping blocks
How many times is each inner cell encoded?

[Link]
SIFT (Scale-Invariant Feature
Transform)
Fast NLoG Approximation: DoG
Difference of Gaussian 𝐷𝑜𝐺 = 𝑛𝑠𝜎 − 𝑛𝜎 ≈ (𝑠 − 1)𝜎 2 𝛻 2 𝑛𝜎

Normalized Laplacian of Gaussian


NLoG

𝐷𝑜𝐺 ≈ (𝑠 − 1)𝜎 2 𝛻 2 𝑛𝜎
What Is A Useful Signature Function?
• Difference-of-Gaussian = “blob” detector

K. Grauman, B. Leibe
Difference-of-Gaussian (DoG)

- =

K. Grauman, B. Leibe
DoG – Efficient Computation
• Computation in Gaussian scale pyramid
𝑠4𝜎
𝑠3𝜎
𝑠2𝜎
𝑠𝜎
Sampling with 𝜎
step  =2 𝑠4𝜎

𝑠3𝜎

𝑠2𝜎

1 𝑠𝜎
Original image  =2 4
𝜎

K. Grauman, B. Leibe
Find local maxima in position-scale
space of Difference-of-Gaussian
𝑠4𝜎

𝑠3𝜎

Lxx ( ) + Lyy ( ) 𝑠 2 𝜎

𝑠𝜎

 List of
𝜎 (x, y, s)

K. Grauman, B. Leibe
Results: Difference-of-Gaussian

K. Grauman, B. Leibe
SIFT (Scale Invariant
Feature Transform)
SIFT describes both a detector and descriptor

1. Multi-scale extrema detection

2. Keypoint localization

3. Orientation assignment

4. Keypoint descriptor
1. Multi-scale extrema
detection
𝑠4𝜎
𝑠3𝜎
Second octave 𝑠2𝜎
𝑠𝜎
𝜎

𝑠4𝜎
𝑠3𝜎
First octave

𝑠2𝜎
𝑠𝜎
𝜎
Gaussian Difference of Gaussian (DoG)
Gaussian

Laplacian
Scale-space extrema

Scale of Gaussian variance Selected if larger


than all 26
neighbors

Difference of Gaussian (DoG)


2. Keypoint localization
2nd order Taylor series approximation of DoG scale-space

Take the derivative and solve for extrema

Additional tests to retain only strong features


3. Orientation assignment
For a keypoint, L is the Gaussian-smoothed image with the closest scale,

x-derivative y-derivative

Two underconstrained problems in computer vision are:

1. **Optical Flow Estimation** – It has one equation (brightness constancy) but two unknowns (motion in x and y), making it underconstrained per pixel.

2. **Shape from Shading** – Multiple 3D shapes can produce the same shading in an image, leading to ambiguity in recovering surface geometry from intensity alone.
Detection process returns

location scale orientation


4. Keypoint descriptor
Image Gradients SIFT descriptor
(4 x 4 pixel per cell, 4 x 4 cells) (16 cells x 8 directions = 128 dims)

Gaussian weighting
(sigma = half width)
Orientation Normalization
[Lowe, SIFT, 1999]

• Compute orientation histogram


• Select dominant orientation
• Normalize: rotate to fixed orientation

0 2
T. Tuytelaars, B. Leibe
SIFT descriptor
Histograms of gradient directions over spatial regions
SIFT descriptor
Histograms of gradient directions over spatial regions
SIFT descriptor
Histograms of gradient directions over spatial regions
SIFT descriptor
Histograms of gradient directions over spatial regions
Local Descriptors: SIFT Descriptor

Histogram of oriented
gradients
• Captures important texture
information
• Robust to small translations /
[Lowe, ICCV 1999]
affine deformations
K. Grauman, B. Leibe
Scale Invariant Feature Transform (SIFT)
Basic idea:
• Take 16x16 square window around detected feature
• Compute edge orientation (angle of the gradient - 90) for each pixel
• Throw out weak edges (threshold gradient magnitude)
• Create histogram of surviving edge orientations

angle histogram

Adapted from slide by David Lowe


SIFT descriptor (Full version)
• Divide the 16x16 window into a 4x4 grid of cells (2x2 case shown below)
• Compute an orientation histogram for each cell
• 16 cells * 8 orientations = 128 dimensional descriptor

Adapted from slide by David Lowe


Details of Lowe’s SIFT algorithm
• Run DoG detector
– Find maxima in location/scale space
– Remove edge points
• Find all major orientations
– Bin orientations into 36 bin histogram
• Weight by gradient magnitude
• Weight by distance to center (Gaussian-weighted mean)
– Return orientations within 0.8 of peak
• Use parabola for better orientation fit
• For each (x,y,scale,orientation), create descriptor:
– Sample 16x16 gradient mag. and rel. orientation
– Bin 4x4 samples into 4x4 histograms
– Threshold values to max of 0.2, divide by L2 norm
– Final descriptor: 4x4x8 normalized histograms

Lowe IJCV 2004


SIFT Example

sift

868 SIFT features


Local Descriptors
• The ideal descriptor should be
– Robust
– Distinctive
– Compact
– Efficient

• Most available descriptors focus on edge/gradient information


– Capture texture information
– Color rarely used

K. Grauman, B. Leibe
SIFT Results: Scale Invariance
SIFT Results: Rotation Invariance
SIFT Results: Robustness to Clutter
Feature matching
• Essentially comparing two arrays of data.

• Given a feature in I1, how to find the best match in I2?


• Define distance function that compares two descriptors
• Test all the features in I2, find the one with min distance

• Let 𝐻1 (𝑘) and 𝐻2 (𝑘) be two arrays of data of length 𝑁.


Comparing SIFT Descriptors: L2 Distance:
• L2 Distance:

𝑑(𝐻1 , 𝐻2 ) = ෍(𝐻1 𝑘 − 𝐻2 𝑘 )2
𝑘

• Smaller the distance metric, better the match.


• Perfect match when 𝑑(𝐻1 , 𝐻2 ) =0
Comparing SIFT Descriptors: Normalized Correlation

• Normalized Correlation:

σ𝑘[ 𝐻1 𝑘 − 𝐻1 𝐻2 𝑘 − 𝐻2 ]
𝑑 𝐻1 , 𝐻2 =
σ𝑘 𝐻1 𝑘 − 𝐻1 2 σ𝑘 𝐻2 𝑘 − 𝐻2 2

1 𝑁
where 𝐻𝑖 = σ𝑘=1 𝐻𝑖 𝑘
𝑁

• Larger the distance metric, better the match.


• Perfect match when 𝑑(𝐻1 , 𝐻2 ) =1
Comparing SIFT Descriptors: Intersection
• Intersection:

𝑑 𝐻1 , 𝐻2 = σ𝑘 𝑚𝑖𝑛(𝐻1 𝑘 , 𝐻2 𝑘 )

• Larger the distance metric, better the match.


Feature distance
How to define the difference between two features f1, f2?
• Simple approach: L2 distance, ||f1 - f2 ||
• can give good scores to ambiguous (incorrect) matches

f1 f2

I1 I2
Feature distance
How to define the difference between two features f1, f2?
• Better approach: ratio distance = ||f1 - f2 || / || f1 - f2’ ||
• f2 is best SSD match to f1 in I2
• f2’ is 2nd best SSD match to f1 in I2
• gives large values for ambiguous matches

f1 f2' f2

I1 I2
Feature matching example

51 matches
Feature matching example

58 matches
Evaluating the results
How can we measure the performance of a feature matcher?

50
75
200

feature distance
True/false positives
How can we measure the performance of a feature matcher?
50
true match
75
200
false match

feature distance

The distance threshold affects performance


• True positives = # of detected matches that are correct
• Suppose we want to maximize these—how to choose threshold?
• False positives = # of detected matches that are incorrect
• Suppose we want to minimize these—how to choose threshold?
Evaluating the results
How can we measure the performance of a feature matcher?

0.7

true
# true positives
positive
# correctly matched features (positives)
rate
“recall”

0 0.1 false positive rate 1


# false positives
# incorrectly matched features (negatives)
1 - “precision”
Evaluating the results
How can we measure the performance of a feature matcher?

ROC curve (“Receiver Operator Characteristic”)


1

0.7

true
# true positives
positive
# correctly matched features (positives)
rate
“recall”

0 0.1 false positive rate 1


# false positives
# incorrectly matched features (negatives)
1 - “precision”
Local Descriptors: SURF
• Fast approximation of SIFT idea
➢ Efficient computation by 2D box filters &
integral images
 6 times faster than SIFT
➢ Equivalent quality for object
identification

• GPU implementation available


Many other efficient descriptors ➢ Feature extraction @ 200Hz
are also available (detector + descriptor, 640×480 img)
➢ [Link]

[Bay, ECCV’06], [Cornelis, CVGPU’08]


K. Grauman, B. Leibe
Choosing a detector
• What do you want it for?
– Precise localization in x-y: Harris
– Good localization in scale: Difference of Gaussian
– Flexible region shape: MSER

• Best choice often application dependent


– Harris-/Hessian-Laplace/DoG work well for many natural categories
– MSER works well for buildings and printed things

• Why choose?
– Get more points with more detectors

• There have been extensive evaluations/comparisons


– [Mikolajczyk et al., IJCV’05, PAMI’05]
– All detectors/descriptors shown here work well
Comparison of Keypoint Detectors

Tuytelaars Mikolajczyk 2008


Choosing a descriptor
• Again, need not stick to one

• For object instance recognition or stitching, SIFT or variant is a good


choice
Things to remember
• Keypoint detection: repeatable
and distinctive
• Corners, blobs, stable regions
• Harris, DoG

• Descriptors: robust and selective


• spatial histograms of orientation
• SIFT
Thank you

• Next class: feature tracking and optical flow


Feature Tracking

CSCE 4603 – Fundamentals of Computer Vision


Dr. Mohamed Sedky
What have we learned so far?
• Light and color
– What an image records
• Filtering in spatial domain
• Filtering = weighted sum of neighboring pixels
• Smoothing, sharpening, measuring texture
• Filtering in frequency domain
• Filtering = change frequency of the input image
• Denoising, sampling, image compression
• Image pyramid and template matching
• Filtering = a way to find a template
• Image pyramids for coarse-to-fine search and multi-
scale detection
• Edge detection
• Canny edge = smooth -> derivative -> thin -> threshold -
> link
• Finding straight lines, binary image analysis
• Corner detection
• Harris corner = smooth -> SSD of X derivative-> SSD of Y
derivative-> SSD of X.Y Derivatives -> Eigen values of H
matrix -> error function -> threshold -> non-maximal
suppression
Previous class
• Interest point/keypoint/feature
detectors
• Harris: detects corners
• DoG: detects peaks/troughs

• Interest point/keypoint/feature
descriptors
• SIFT (do read the paper)

• Feature matching f 2 ' f2


• Ratio distance = ||f1 - f2 || / || f1 - f2’ || f1
• Remove 90% false matches, 5% of true
matches in Lowe’s study
I1 I2
Today’s class: recovering motion
• Feature tracking
• Extract visual features (corners, textured areas) and “track” them over multiple frames
Relating spatial and temporal derivatives helps estimate motion by linking pixel movement to changes in brightness over time.

• Optical flow
• Recover image motion at each pixel from spatio-temporal image brightness variations

Two problems, one registration method

B. Lucas and T. Kanade. An iterative image registration technique with an application to


stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, 1981.
Today’s class: optical flow
• Method to estimate apparent motion of scene points from a
sequence of images.
• Topics
• Motion field and optical flow
• Optical flow constraint equation
• Lucas-Kanade method
• Coarse-to-Fine flow estimation
Motion and perceptual organization
• Even “impoverished” motion data can evoke a strong percept

G. Johansson, “Visual Perception of Biological Motion and a Model For Its Analysis",
Perception and Psychophysics 14, 201-211, 1973.
Motion and perceptual organization
• Even “impoverished” motion data can evoke a strong percept
The optical flow model estimates motion by assuming constant brightness and linking spatial and temporal intensity changes.

G. Johansson, “Visual Perception of Biological Motion and a Model For Its Analysis",
Perception and Psychophysics 14, 201-211, 1973.
Uses of motion
• Estimating 3D structure
• Segmenting objects based on motion cues
• Learning and tracking dynamical models
• Recognizing events and activities
• Improving video quality (motion stabilization)
Video Stabilization

Remove Camera Shake


Video Retiming

Produce Slow-Motion Effect


Optical Mouse

Estimating Mouse Movement


Traffic Monitoring

Finding Velocoties of Vehicles


Face Tracking

Tracking of Facial Features


Feature tracking
• Many problems, such as structure from motion require matching
points
• If motion is small, tracking is an easy way to get them
Optical flow
• Motion of brightness patterns in the image

Image Sequence Optical Flow


(2 frames)
Velocity of
• Ideally, Optical Flow = Motion Field Brightness Pattern
Optical flow
• Motion of brightness patterns in the image

Image Sequence Optical Flow


(2 frames)
Velocity of
• Ideally, Optical Flow = Motion Field Brightness Pattern
Motion field
• The motion field is the projection of the 3D scene motion into the
image

What would the motion field of a non-rotating ball moving towards the camera look like?
Optical flow
• Definition: optical flow is the apparent motion of brightness patterns
in the image
• Ideally, optical flow would be the same as the motion field
• Have to be careful: apparent motion can be caused by lighting
changes without any actual motion
• Think of a uniform rotating sphere under fixed lighting vs. a stationary sphere
under moving illumination
When is optical Flow ≠ Motion Field?

Barber Pole
Illusion
When is optical Flow ≠ Motion Field?

Donguri Wave
Motion Illusion
When is optical Flow ≠ Motion Field?

Donguri Wave
Motion Illusion
Optical Flow

𝑡 𝑡 + 𝛿𝑡
Optical Flow

𝑡 𝑡 + 𝛿𝑡

Optical Flow:
(𝑥, 𝑦) (𝑥 + 𝛿𝑥, 𝑦 + 𝛿𝑦) 𝜹𝒙 𝜹𝒚
𝒖, 𝒗 = ( , )
Displacement: (𝜹𝒙, 𝜹𝒚) 𝜹𝒕 𝜹𝒕
Feature tracking - Challenges

• Figure out which features can be tracked


• Efficiently track across frames
• Some points may change appearance over time (e.g.,
due to rotation, moving into shadows, etc.)
• Drift: small errors can accumulate as appearance
model is updated
• Points may appear or disappear: need to be able to
add/delete tracked points
Feature tracking

I(x,y,t) I(x,y,t+1)
• Given two subsequent frames, estimate the point translation

• Key assumptions of Lucas-Kanade Tracker


• Brightness constancy: projection of the same point looks the same in
every frame
• Small motion: points do not move very far
• Spatial coherence: points move like their neighbors
Smoothness

most objects in the world are rigid or


deform elastically
moving together coherently

we expect optical flow fields to be smooth


Optical Flow

Problem Definition

Given two consecutive image frames,


estimate the motion of each pixel

Assumptions

Brightness constancy

Small motion
Optical Flow (Problem definition)

Estimate the motion


(flow) between these two
consecutive images
How is this different from estimating a 2D transform?
Optical Flow (Key Assumptions)

Color Constancy
(Brightness constancy for intensity images)

Implication: allows for pixel to pixel comparison


(not image features)

Small Motion
(pixels only move a little bit)

Implication: linearization of the brightness


constancy constraint
Optical Flow (Approach)

Look for nearby pixels with the same color


(small motion) (color constancy)
Assumption 1
Brightness constancy

Scene point moving through image sequence


Assumption 1
Brightness constancy

Scene point moving through image sequence


Assumption 1
Brightness constancy

Scene point moving through image sequence

Assumption:Brightness of the point will remain the same


Assumption 1
Brightness constancy
Scene point moving through image sequence

Assumption:Brightness of the point will remain the same

constant
Assumption 2
Small motion
Assumption 2
Small motion
Assumption 2
Small motion

Optical flow (velocities): Displacement:


Assumption 2
Small motion

Optical flow (velocities): Displacement:

For a really small space-time step…


𝐼 𝑥 + 𝛿𝑥, 𝑦 + 𝛿𝑦, 𝑡 + 𝛿𝑡 = 𝐼(𝑥, 𝑦, 𝑡)
… the brightness between two consecutive image frames
is the same
𝐼 𝑥 + 𝛿𝑥, 𝑦 + 𝛿𝑦, 𝑡 + 𝛿𝑡 = 𝐼(𝑥, 𝑦, 𝑡)
For small space-time step, brightness of a point is the same
𝐼 𝑥 + 𝛿𝑥, 𝑦 + 𝛿𝑦, 𝑡 + 𝛿𝑡 = 𝐼(𝑥, 𝑦, 𝑡)
For small space-time step, brightness of a point is the same

Insight:
If the time step is really small,
we can linearize the intensity function
𝐼 𝑥 + 𝛿𝑥, 𝑦 + 𝛿𝑦, 𝑡 + 𝛿𝑡 = 𝐼(𝑥, 𝑦, 𝑡)

Multivariable Taylor Series Expansion


(First order approximation, two variables)
𝐼 𝑥 + 𝛿𝑥, 𝑦 + 𝛿𝑦, 𝑡 + 𝛿𝑡 = 𝐼(𝑥, 𝑦, 𝑡)

Multivariable Taylor Series Expansion


(First order approximation, two variables)

assuming small
motion
Multivariable Taylor Series Expansion
(First order approximation, two variables)

partial derivative

assuming small
motion
fixed point

cancel terms
Multivariable Taylor Series Expansion
(First order approximation, two variables)

assuming small
motion

cancel terms
Multivariable Taylor Series Expansion
(First order approximation, two variables)

assuming small
motion

divide by
take limit
Multivariable Taylor Series Expansion
(First order approximation, two variables)

assuming small
motion

divide by
take limit
Multivariable Taylor Series Expansion
(First order approximation, two variables)

assuming small
motion

divide by
take limit

Brightness Constancy
Equation
Brightness
Constancy Equation

shorthand notation
(x-flow) (y-flow)

vector form
(1 x 2) (2 x 1)
(putting the math aside for a second…)

What do the term of the


brightness constancy equation represent?
(putting the math aside for a second…)

What do the term of the


brightness constancy equation represent?

Image gradients
(at a point p)
(putting the math aside for a second…)

What do the term of the


brightness constancy equation represent?

flow velocities

Image gradients
(at a point p)
(putting the math aside for a second…)

What do the term of the


brightness constancy equation represent?

flow velocities

Image gradients temporal gradient


(at a point p)

How do you compute these terms?


How do you compute …

spatial derivative
How do you compute …

spatial derivative

Forward difference
Sobel filter
Scharr filter

How do you compute …

spatial derivative temporal derivative

Forward difference
Sobel filter
Scharr filter

How do you compute …

spatial derivative temporal derivative

Forward difference frame differencing


Sobel filter
Scharr filter

Frame differencing

1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
1
1
1
10 10 10 10
10 10 10 10
10 10 10 10
- 1
1
1
1
1
1
1 1
10 10 10
10 10 10
1

= 0
0
0
9
9
9
9
0
0
9
0
0
9
0
0
1 10 10 10 10 1 1 10 10 10 0 9 0 0 0

(example of a forward difference)


Example:
x x
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 10 10 10 10
3 x 3 patch 1 1 1 1 1
1 10 10 10 10 1 1 10 10 10
1 10 10 10 10 1 1 10 10 10
1 10 10 10 10 1 1 10 10 10
y y

x x x
- 0 0 0 - - - - - - 0 0 0 0 0
- 0 0 0 - 0 0 0 0 0 0 0 0 0 0
- 9 0 0 - 0 9 9 9 9 0 9 9 9 9
- 9 0 0 - - 0 0 0 0 0 9 0 0 0
- 9 0 0 - 0 0 0 0 0 0 9 0 0 0
- 9 0 0 - - - - - - 0 9 0 0 0
y y
-1 y

-1 0 1 0
1
How do you compute …

spatial derivative optical flow temporal derivative

Forward difference How do you compute this? frame differencing


Sobel filter
Scharr filter

How do you compute …

spatial derivative optical flow temporal derivative

Forward difference We need to solve for this! frame differencing


Sobel filter (this is the unknown in the optical
Scharr filter flow problem)

How do you compute …

spatial derivative optical flow temporal derivative

Forward difference frame differencing


Sobel filter
Scharr filter Solution lies on a line

Cannot be found uniquely
with a single constraint
Solution lies on a straight line

many combinations of u and v will satisfy the equality

The solution cannot be determined uniquely with a


single constraint (a single pixel)
spatial derivative optical flow temporal derivative

How can we use the brightness constancy equation to


estimate the optical flow?
unknown

known

We need at least ____ equations to solve for 2 unknowns.


unknown

known

Where do we get more equations (constraints)?


Horn-Schunck Lucas-Kanade
Optical Flow (1981) Optical Flow (1981)

brightness constancy
method of differences
small motion

‘smooth’ flow ‘constant’ flow


(flow can vary from pixel to pixel) (flow is constant for all pixels)

global method local method


(dense) (sparse)
Constant flow
Where do we get more equations (constraints)?

Assume that the surrounding patch (say 5x5) has


‘constant flow’
Assumptions:
Flow is locally smooth
Neighboring pixels have same displacement

Using a 5 x 5 image patch, gives us 25 equations


Assumptions:
Flow is locally smooth
Neighboring pixels have same displacement

Using a 5 x 5 image patch, gives us 25 equations


Assumptions:
Flow is locally smooth
Neighboring pixels have same displacement

Using a 5 x 5 image patch, gives us 25 equations

Matrix form
Assumptions:
Flow is locally smooth
Neighboring pixels have same displacement

Using a 5 x 5 image patch, gives us 25 equations


Overconstrained linear system

How many equations? How many unknowns? How do we solve this?


Least squares approximation

is equivalent to solving
Least squares approximation

is equivalent to solving

Square Matrix

To obtain the least squares solution solve:

where the summation is over each pixel p in patch P

Pseudo Inverse
Least squares approximation

is equivalent to solving

To obtain the least squares solution solve:

where the summation is over each pixel p in patch P

Sometimes called ‘Lucas-Kanade Optical Flow’


(can be interpreted to be a special case of the LK method with a translational warp model)
Conditions for solvability
Optimal (u, v) satisfies Lucas-Kanade equation

When is this solvable? I.e., what are good points to track?


• ATA should be invertible
• ATA should not be too small due to noise
– eigenvalues 1 and  2 of ATA should not be too small
• ATA should be well-conditioned
–  1/  2 should not be too large ( 1 = larger eigenvalue)

Does this remind you of anything?

Criteria for Harris corner detector


Where have you seen this before?

=
Where have you seen this before?

Harris Corner Detector!


Low-texture region

– gradients have small magnitude


– small 1, small 2
Edge

– gradients very large or very small


– large 1, small 2
High-texture region

– gradients are different, large magnitudes


– large 1, large 2
Implications
• Corners are when λ1, λ2 are big; this is also when
Lucas-Kanade optical flow works best
• Corners are regions with two different directions of
gradient (at least)
• Corners are good places to compute flow!

What happens when you have no ‘corners’?


Dealing with larger movements: Original (x,y) position
Iterative refinement
1. Initialize (x’,y’) = (x,y) It = I(x’, y’, t+1) - I(x, y, t)
2. Compute (u,v) by

2nd moment matrix for feature


displacement
patch in first image

3. Shift window by (u, v): x’=x’+u; y’=y’+v;


4. Recalculate It
5. Repeat steps 2-4 until small change
• Use interpolation for subpixel values
Iterative Refinement
• Iterative Lukas-Kanade Algorithm
1. Estimate displacement at each pixel by solving Lucas-
Kanade equations
2. Warp I(t) towards I(t+1) using the estimated flow field
- Basically, just interpolation
3. Repeat until convergence

86
* From Khurram Hassan-Shafique CAP5415 Computer Vision 2003
Coarse-to-fine optical flow estimation

run iterative L-K


warp & upsample

run iterative L-K


.
.
.

image J1 image
image I2

Gaussian pyramid of image 1 (t) Gaussian pyramid of image 2 (t+1)


Image Warping
Example

* From Khurram Hassan-Shafique CAP5415 Computer Vision 2003


Multi-resolution registration

* From Khurram Hassan-Shafique CAP5415 Computer Vision 2003


Optical Flow Results
Lucas-Kanade without
pyramids

Fails in areas of
large motion

* From Khurram Hassan-Shafique CAP5415 Computer Vision 2003


Optical Flow Results

Lucas-Kanade with pyramids

* From Khurram Hassan-Shafique CAP5415 Computer Vision 2003


Summary of LKT tracking
• Find a good point to track (harris corner)

• Use intensity second moment matrix and difference across frames to


find displacement

• Iterate and use coarse-to-fine search to deal with larger movements

• When creating long tracks, check appearance of registered patch


against appearance of initial patch to find points that have drifted
Implementation issues
• Window size
• Small window more sensitive to noise and may miss larger motions (without
pyramid)
• Large window more likely to cross an occlusion boundary (and it’s slower)
• 15x15 to 31x31 seems typical

• Weighting the window


• Common to apply weights so that center matters more (e.g., with Gaussian)
Why not just do local template matching?

• Slow (need to check more locations)

• Does not give subpixel alignment (or becomes much slower)


• Even pixel alignment may not be good enough to prevent drift

• May be useful as a step in tracking if there are large movements


Errors in Lucas-Kanade
• The motion is large
• Possible Fix: Keypoint matching

• A point does not move like its neighbors


• Possible Fix: Region-based matching

• Brightness constancy does not hold


• Possible Fix: Gradient constancy
Things to remember
• Major contributions from Lucas, Tomasi, Kanade
• Tracking feature points
• Optical flow
• Stereo (later)
• Structure from motion (later)

• Key ideas
• By assuming brightness constancy, truncated Taylor expansion leads to simple
and fast patch matching across frames
• Coarse-to-fine registration
Next week
• Fitting and Alignment
Fitting

CSCE 4603 – Fundamentals of Computer Vision


Dr. Mohamed Sedky
What have we learned so far?
• Light and color
– What an image records
• Filtering in spatial domain
• Filtering = weighted sum of neighboring pixels
• Smoothing, sharpening, measuring texture
• Filtering in frequency domain
• Filtering = change frequency of the input image
• Denoising, sampling, image compression
• Image pyramid and template matching
• Filtering = a way to find a template
• Image pyramids for coarse-to-fine search and multi-
scale detection
• Edge detection
• Canny edge = smooth -> derivative -> thin -> threshold -
> link
• Finding straight lines, binary image analysis
• Corner detection
• Harris corner = smooth -> SSD of X derivative-> SSD of Y
derivative-> SSD of X.Y Derivatives -> Eigen values of H
matrix -> error function -> threshold -> non-maximal
suppression
Where are we?
• Interest points
• Find distinct and repeatable points in images
• Harris-> corners, DoG -> blobs
• SIFT -> feature descriptor
• Feature tracking and optical flow
• Find motion of a keypoint/pixel over time
• Lucas-Kanade:
• brightness consistency, small motion, spatial coherence
• Handle large motion:
• iterative update + pyramid search
• Fitting and alignment (this class)
• find the transformation parameters that
best align matched points
• Object instance recognition (next class)
• Keypoint-based object instance recognition and search
Today’s class: Fitting and alignment
• Find the transformation parameters that best align matched points
Review: Harris Corner Detector

• Second moment matrix


 I x2 ( D ) I x I y ( D )
 ( I ,  D ) = g ( I )    1. Image Ix Iy
 I x I y ( D ) I y ( D ) 
2
derivatives
(optionally, blur first)

Ix2 Iy 2 IxIy
2. Square of
derivatives
det M = 12
trace M = 1 + 2
3. Gaussian g(Ix2) g(Iy2) g(IxIy)
filter g(I)

4. Cornerness function – both eigenvalues are strong


har = det[ ( I , D)] −  [trace( ( I , D)) 2 ] =
g ( I x2 ) g ( I y2 ) − [ g ( I x I y )]2 −  [ g ( I x2 ) + g ( I y2 )]2
5 5. Non-maxima suppression har
Review: Find local maxima in position-
scale space of Difference-of-Gaussian




Lxx ( ) + Lyy ( ) 



 List of
(x, y, s)

K. Grauman, B. Leibe
Review: SIFT Descriptor

Histogram of oriented
gradients
• Captures important texture
information
• Robust to small translations /
[Lowe, ICCV 1999]
affine deformations
K. Grauman, B. Leibe
Review: Lucas-Kanade Tracker

I(x,y,t) I(x,y,t+1)

I ( x, y, t ) = I ( x + u, y + v, t + 1) Brightness consistency
I ( x + u, y + v, t + 1)  I ( x, y , t ) + I x  u + I y  v + I t
Small motion
I x  u + I y  v + It  0

Spatial coherence
Dealing with larger movements: Iterative
Original (x,y) position
refinement
1. Initialize (x’,y’) = (x,y) It = I(x’, y’, t+1) - I(x, y, t)
2. Compute (u,v) by

2nd moment matrix for feature


patch in first image displacement

3. Shift window by (u, v): x’=x’+u; y’=y’+v;


4. Recalculate It
5. Repeat steps 2-4 until small change
• Use interpolation for subpixel values
Dealing with larger movements:
coarse-to-fine registration

run iterative L-K


upsample

run iterative L-K


.
.
.

image J1 image
image I2

Gaussian pyramid of image 1 (t) Gaussian pyramid of image 2 (t+1)


Fitting [ˈfidiNG]:
find the parameters of a model that
best fit the data

Alignment [əˈlīnmənt]:
find the parameters of the
transformation that best align
matched points
Fitting and alignment
• Choose a parametric model to represent a set of
features

simple model: lines simple model: circles

complicated model: face shape

complicated model: car


Source: K. Grauman
Finding boundaries
Where are the object boundaries?
Human annotated boundaries
Edge detection
Multi-scale edge detection
Edge strength does not necessarily
correspond to our perception of boundaries
Where are the object boundaries?
Human annotated boundaries
Edge detection
Defining boundaries are hard for us too
Where is the boundary of the mountain top?
Lines are hard to find

Original image Edge detection Thresholding

Noisy edge image


Incomplete boundaries
Applications

Autonomous Vehicles tissue engineering behavioral genetics


(lane line detection) (blood vessel counting) (earthworm contours)

Autonomous Vehicles Computational Photography


(semantic scene segmentation) (image inpainting)
Line fitting
Line fitting
Given: Many pairs

Find: Parameters

Minimize: Average square


distance:

Using:

Note:

How can we solve this minimization?


Line fitting
Given: Many pairs

Find: Parameters

Minimize: Average square distance:

Using:

Note: σ𝑖(𝑦𝑖 − 𝑦ഥ𝑖 )𝑥𝑖


𝑚=
σ𝑖(𝑥𝑖 − 𝑥ഥ𝑖 )𝑥𝑖

What are some problems with the approach?


Problems with parameterizations

Where is the line that minimizes E?


Problems with parameterizations
Where is the line that minimizes E?

Huge E!
Problems with parameterizations
Where is the line that minimizes E?

Line that minimizes E!!


Problems with parameterizations
Where is the line that minimizes E?

Line that minimizes E!!


(Error E must be formulated carefully!)

How can we deal with this?


Line fitting is easily setup as a maximum likelihood problem
… but choice of model is important

What optimization are we solving here?


Problem with “vertical” least squares
• Not rotation-invariant
• Fails completely for vertical
lines

Slide from S. Lazebnik


Problems with noise

Least-squares error fit Squared error heavily penalizes outliers


Model fitting is difficult because…

• Extraneous data: clutter or multiple models


– We do not know what is part of the model?
– Can we pull out models with a few parts from much larger
amounts of background clutter?
• Missing data: only some parts of model are present
• Noise

• Cost:
– It is not feasible to check all combinations of features by fitting a
model to each possible subset

So what can we do?


Line parameterizations
Slope intercept form

slope y-intercept

What are m and b?


Slope intercept form

slope y-intercept
Double intercept form

x-intercept y-intercept

What are x and y?


Double intercept form

x-intercept y-intercept

Derivation:
(Similar slope)
Normal Form

What are rho and theta?


Normal Form

Derivation:

plug into:
Hough transform
Hough transform

• Generic framework for detecting a parametric model

• Edges don’t have to be connected

• Lines can be occluded

• Key idea: edges vote for the possible models


Image and parameter
space
variables

parameters

Image space
Image and parameter
space
variables variables

parameters parameters

a line
becomes a
point

Image space Parameter space


Image and parameter
space
variables

parameters

What would a point in image space


become in parameter space?

Image space
Image and parameter
space
variables variables

parameters parameters

a point
becomes a
line

Image space Parameter space


Image and parameter
space
variables variables

parameters parameters

two points
become
?

Image space Parameter space


Image and parameter
space
variables variables

parameters parameters

two points
become
?

Image space Parameter space


Image and parameter
space
variables variables

parameters parameters

three points
become
?

Image space Parameter space


Image and parameter
space
variables variables

parameters parameters

three points
become
?

Image space Parameter space


Image and parameter
space
variables variables

parameters parameters

four points
become
?

Image space Parameter space


Image and parameter
space
variables variables

parameters parameters

four points
become
?

Image space Parameter space


How would you find the best
fitting line?

Image space Parameter space

Is this method robust to measurement noise?


Is this method robust to outliers?
Line Detection by Hough Transform

Algorithm:

[Link] Parameter Space

[Link] Accumulator Array Parameter Space

[Link]
1 1

4. For each image edge


1 1
1 1

For each element in 2


1 1
If lies on the line: 1 1

Increment 1 1

5. Find local maxima in


Problems with
parameterization
How big does the accumulator need to be for the parameterization ?

1 1
1 1
1 1
2
1 1
1 1
1 1
Problems with
parameterization
How big does the accumulator need to be for the parameterization ?

1 1
1 1
1 1
2
1 1
1 1
1 1

The space of m is huge! The space of c is huge!


Better Parameterization
Use normal form:

Given points find

Image Space
Hough Space Sinusoid

?
(Finite Accumulator Array Size)

Hough Space
Image and parameter
space
variables parameters

parameters variables

a point
becomes?

Image space Parameter space


Image and parameter
space
variables parameters

parameters variables

a point
becomes a
wave

Image space Parameter space


Image and parameter
space
variables

parameters

a line
becomes?

Image space Parameter space


Image and parameter
space
variables

parameters

a line
becomes a
point

Image space Parameter space


Image and parameter
space
variables

parameters

a line
becomes?

Image space Parameter space


Image and parameter
space
variables

parameters

a line
becomes a
point

Image space Parameter space


Image and parameter
space
variables

parameters

a line
becomes a
point

Image space Parameter space


Image and parameter
space
variables

parameters

a line
becomes a
point

Image space Parameter space


Image and parameter
space
variables

parameters

a line
becomes a
point

Image space Parameter space


Image and parameter
space
variables

parameters

a line
becomes a
point

Image space Parameter space


Image and parameter
space
variables

parameters
Wait …why is rho negative?

a line
becomes a
point

Image space Parameter space


Image and parameter
space
variables

parameters

same line through


the point

a line
becomes a
point

Image space Parameter space


There are two ways to
write the same line:

Positive rho version:

Negative rho version:

Recall:
Image and parameter
space
variables

parameters

same line through


the point

a line
becomes a
point

Image space Parameter space


Image and parameter
space
variables

parameters

two points
become
?

Image space Parameter space


Image and parameter
space
variables

parameters

three points
become
?

Image space Parameter space


Image and parameter
space
variables

parameters

four points
become
?

Image space Parameter space


Implementation
1. Initialize accumulator H to all zeros

2. For each edge point (x,y) in the image


For θ = 0 to 180
ρ = x cos θ + y sin θ
H(θ, ρ) = H(θ, ρ) + 1
end
end

3. Find the value(s) of (θ, ρ) where H(θ, ρ)


is a local maximum

4. The detected line in the image is given by


ρ = x cos θ + y sin θ

NOTE: Watch your coordinates. Image origin is top left!


Image space Votes
Basic shapes
(in parameter space)

can you guess the shape?


Basic shapes
(in parameter space)

line
Basic shapes
(in parameter space)

line rectangle
Basic shapes
(in parameter space)

line rectangle circle


Basic Shapes
More complex image
In practice, measurements are noisy…

Image space Votes


Too much noise …

Image space Votes


Real-world example

Original Edges parameter space Hough Lines


RANSAC
RANdom Sample Consensus
RANSAC:
RANdom Sample Consensus

 𝑀 = 14
Algorithm:
1. Sample (randomly) the number of points (s) required to fit the model
Typically s is the minimum samples to fit a model.)
2. Fit the model to the randomly chosen samples.
3. Count the number M of data points (inliers) that fit the model within a
measure of error 𝜖
4. Repeat Steps 1-3 N times
5. Choose the mode; that has the largest number of M
RANSAC:
RANdom Sample Consensus

Least Square RANSAC Iteration


Fitting 1
Inliers: 2 Inliers: 4
RANSAC:
RANdom Sample Consensus

RANSAC Iteration RANSAC Iteration


2 i
Inliers: 3 Inliers: 20
function [m, b] = ransacfit(x, y)
% y = mx + b
N = 200;
thresh = 0.03;
bestcount = 0;

for k = 1:N
rp = randperm(numel(x));
tx = x(rp(1:2));
ty = y(rp(1:2));
m = (ty(2)-ty(1)) ./ (tx(2)-tx(1));
b = ty(2)-m*tx(2);

nin = sum(abs(y-m*x-b)<thresh);
if nin > bestcount
bestcount = nin;
inliers = (abs(y - m*x - b) < thresh);
end
end
% total least square fitting on inliers
[m, b] = total_lsqfit(x(inliers), y(inliers));
Hough Circles
Let’s assume radius known
parameters parameters

variables variables

What is the dimension of the parameter space?


parameters parameters

variables variables

Image space Parameter space

What does a point in image space correspond to in parameter space?


parameters parameters

variables variables
parameters parameters

variables variables
parameters parameters

variables variables
parameters parameters

variables variables
What if radius is unknown?
parameters parameters

variables variables
What if radius is unknown?
parameters parameters

variables variables

If radius is not known: 3D Hough Space!

Use Accumulator array

Surface shape in Hough space is complicated


Using Gradient Information
Gradient information can save lot of computation:

Edge Location
Edge Direction

Assume radius is known:

Need to increment only one point in accumulator!


parameters parameters

variables variables
parameters parameters

variables variables
Pennie Hough detector Quarter Hough detector
Pennie Hough detector Quarter Hough detector
1. Image → Canny
2. Canny → Hough votes
3. Hough votes → Edges
Find peaks and post-process
Hough transform example

[Link]
Hough transform for circles
• Circle: center (a,b) and radius r Equation of circle?
( xi − a) 2 + ( yi − b) 2 = r 2
• For a fixed radius r Equation of set of
circles that all pass
through a point?
b

Image space Hough space a


119
Adapted by Devi Parikh from: Kristen Grauman
Hough transform for circles
• Circle: center (a,b) and radius r
( xi − a) 2 + ( yi − b) 2 = r 2
• For a fixed radius r

Intersection:
most votes for
center occur
here.

Image space Hough space


120
Kristen Grauman
Hough transform for circles
• Circle: center (a,b) and radius r
( xi − a) 2 + ( yi − b) 2 = r 2
• For an unknown radius r

b
a
Image space Hough space

121
Kristen Grauman
Hough transform for circles
• Circle: center (a,b) and radius r
( xi − a) 2 + ( yi − b) 2 = r 2
• For an unknown radius r

b
a
Image space Hough space

122
Kristen Grauman
Hough transform for circles
• Circle: center (a,b) and radius r
( xi − a) 2 + ( yi − b) 2 = r 2
• For an unknown radius r, known gradient direction

Image space Hough space

123
Kristen Grauman
The Hough transform …

Deals with occlusion well?


Detects multiple instances?
Robust to noise?
Good computational complexity?
Easy to set parameters?
Can you use Hough Transforms for other objects,
beyond lines and circles?

Do you have to use edge detectors to


vote in Hough Space?
Which algorithm should I use?
✓ If we know which points belong to the line, how do we find the
“optimal” line parameters?
✓Least squares

✓ What if there are outliers?


✓Robust fitting, RANSAC

• What if there are many lines?


• Voting methods: RANSAC, Hough transform

Slide credit: S. Lazebnik


Things to remember
• Least Squares Fit
• closed form solution
• robust to noise
• not robust to outliers
• Hough transform
• robust to noise and outliers
• can fit multiple models
• only works for a few parameters (1-4 typically)
• RANSAC
• robust to noise and outliers
• works with a moderate number of parameters (e.g, 1-8)
Next Lecture
• Alignment
Alignment

CSCE 4603 – Fundamentals of Computer Vision


Dr. Mohamed Sedky
What have we learned so far?
• Light and color
– What an image records
• Filtering in spatial domain
• Filtering = weighted sum of neighboring pixels
• Smoothing, sharpening, measuring texture
• Filtering in frequency domain
• Filtering = change frequency of the input image
• Denoising, sampling, image compression
• Image pyramid and template matching
• Filtering = a way to find a template
• Image pyramids for coarse-to-fine search and multi-
scale detection
• Edge detection
• Canny edge = smooth -> derivative -> thin -> threshold -
> link
• Finding straight lines, binary image analysis
• Corner detection
• Harris corner = smooth -> SSD of X derivative-> SSD of Y
derivative-> SSD of X.Y Derivatives -> Eigen values of H
matrix -> error function -> threshold -> non-maximal
suppression
Where are we?
• Interest points
• Find distinct and repeatable points in images
• Harris-> corners, DoG -> blobs
• SIFT -> feature descriptor
• Feature tracking and optical flow
• Find motion of a keypoint/pixel over time
• Lucas-Kanade:
• brightness consistency, small motion, spatial coherence
• Handle large motion:
• iterative update + pyramid search
• Fitting and alignment (this class)
• find the transformation parameters that
best align matched points
• Object instance recognition (next class)
• Keypoint-based object instance recognition and search
Today’s class: Alignment
• Find the transformation parameters that best align matched points

• 2D transformations.

• Projective geometry 101.

• Transformations in projective geometry.

• Classification of 2D transformations.

• Determining unknown 2D transformations.

• Determining unknown image warps.


Today’s class: Alignment
• 2x2 Image Transformations
• 3x3 Image Transformations
• Computing Homography
What is an image?

A (grayscale)
image is a 2D
function.
grayscale image

What is the range of


the image function f? domain
What types of image transformations can we do?

Filtering Warping

changes pixel values changes pixel locations


What types of image transformations can we do?

Filtering Warping

changes range of image function changes domain of image function


Warping example: feature matching
Warping example: feature matching
Warping example: feature matching
• object recognition
• 3D reconstruction
• augmented reality
• image stitching

How do you compute the transformation?


Warping example: feature matching
Given a set of matched feature points:

point in one point in the


image other image

and a transformation:

transformation
parameters
function

find the best estimate of the parameters

What kind of transformation functions are there?


2D transformations
Parametric (global) warping
T

p = (x,y) p’ = (x’,y’)
Transformation T is a coordinate-changing machine:
p’ = T(p)

What does it mean that T is global?


• Is the same for any point p
• can be described by just a few numbers (parameters)

For linear transformations, we can represent T as a matrix


p’ = Tp
 x'   x
 y ' = T  y 
   
2D transformations

translation rotation aspect

original

affine perspective cylindrical


Scaling (Stretching or Squishing)

S
Scale
Scaling (Stretching or Squishing)
Scaling (Stretching or Squishing)

How would you implement scaling?

Scale

• Each component multiplied by a scalar


• Uniform scaling - same scalar for each component
Scaling (Stretching or Squishing)

What’s the effect of using


Scale different scale factors?

• Each component multiplied by a scalar


• Uniform scaling - same scalar for each component
Scaling (Stretching or Squishing)

matrix representation of scaling:


Scale

scaling matrix S

• Each component multiplied by a scalar


• Uniform scaling - same scalar for each component
Skew (Shear)

How would you implement shearing?

Shear
Skew (Shear)

Shear or in matrix form:


Rotation

How would you implement rotation?

rotation around
the origin
Rotation

rotation around
the origin
Rotation
Polar coordinates…
x = r cos (φ)
y = r sin (φ)
x’ = r cos (φ + θ)
y’ = r sin (φ + θ)

Trigonometric Identity…
x’ = r cos(φ) cos(θ) – r sin(φ) sin(θ)
𝑟 rotation around
y’ = r sin(φ) cos(θ) + r cos(φ) sin(θ)
the origin
Substitute…
𝑟 x’ = x cos(θ) - y sin(θ)
y’ = x sin(θ) + y cos(θ)
φ
Rotation

or in matrix form:

rotation around
the origin
2D planar and linear transformations

𝒙 = 𝑓 𝒙; 𝑝

𝑥′ 𝑥
′ =𝑴 𝑦
𝑦

parameters 𝑝 point 𝒙
2D planar and linear transformations
Scale Flip across y

Rotate Flip across origin

Shear Identity
2D translation

T
𝑡𝑦
Scale
𝑡𝑥
2D translation

How would you implement translation?


2D translation

𝑥ư = 𝑥 + 𝑡𝑥
𝑦ư = 𝑦 + 𝑡𝑦

What about matrix representation?


2D translation

𝑥ư = 𝑥 + 𝑡𝑥
𝑦ư = 𝑦 + 𝑡𝑦

What about matrix representation?

Not possible.
Projective geometry 101
Homogeneous coordinates
heterogeneous homogeneous
coordinates coordinates

𝑥 𝑑𝑒𝑓
𝑥
𝑦 𝑦 add a 1 here

1
• Represent 2D point with a 3D vector
Homogeneous coordinates
heterogeneous homogeneous
coordinates coordinates

𝑥 𝑑𝑒𝑓
𝑥 𝑎𝑥
𝑦 𝑦 ≝ 𝑎𝑦
1 𝑎
• Represent 2D point with a 3D vector
• 3D vectors are only defined up to scale
2D translation

What about matrix representation


using homogeneous coordinates?
2D translation

What about matrix representation


using heterogeneous coordinates?
2D translation using homogeneous coordinates
Homogeneous coordinates
Conversion: Special points:

• heterogeneous → homogeneous • point at infinity

• homogeneous → heterogeneous • undefined

• scale invariance
Projective geometry
image plane

image point in
pixel coordinates

image point in X is a projection of a point


homogeneous P on the image plane
coordinates

What does scaling X correspond to?


Transformations in projective geometry
2D transformations in heterogeneous coordinates
Re-write these transformations as 3x3 matrices:

?
translation scaling

? ?
rotation shearing
2D transformations in heterogeneous coordinates
Re-write these transformations as 3x3 matrices:

translation scaling

? ?
rotation shearing
2D transformations in heterogeneous coordinates
Re-write these transformations as 3x3 matrices:

translation scaling

?
rotation shearing
2D transformations in heterogeneous coordinates
Re-write these transformations as 3x3 matrices:

translation scaling

rotation shearing
Matrix composition
Transformations can be combined by matrix multiplication:

p’ = ? ? ? p
Matrix composition
Transformations can be combined by matrix multiplication:

p’ = translation(tx,ty) rotation(θ) scale(s,s) p

Does the multiplication order matter?


Classification of 2D transformations
Classification of 2D transformations
Classification of 2D transformations

?
?
?
?
?
Classification of 2D transformations

Translation:

How many degrees of freedom?


Classification of 2D transformations

Euclidean (rigid):
rotation + translation

Are there any values that are related?


Classification of 2D transformations

Euclidean (rigid):
rotation + translation

How many degrees of freedom?


Classification of 2D transformations
which other matrix values will
change if this increases?

Euclidean (rigid):
rotation + translation
Classification of 2D transformations
what will happen to the
image if this increases?

Euclidean (rigid):
rotation + translation
Classification of 2D transformations
what will happen to the
image if this increases?

Euclidean (rigid):
rotation + translation
Classification of 2D transformations

Similarity:
uniform scaling + rotation
+ translation

Are there any values that are related?


Classification of 2D transformations

multiply these four by scale s

Similarity:
uniform scaling + rotation
+ translation

How many degrees of freedom?


Classification of 2D transformations
what will happen to the
image if this increases?

Similarity:
uniform scaling + rotation
+ translation
Classification of 2D transformations

Affine transform:
uniform scaling + shearing
+ rotation + translation

Are there any values that are related?


Classification of 2D transformations

Affine transform:
uniform scaling + shearing
+ rotation + translation

Are there any values that are related?


similarity shear
Classification of 2D transformations

Affine transform:
uniform scaling + shearing
+ rotation + translation

How many degrees of freedom?


similarity shear
Affine transformations
Affine transformations are combinations of
• arbitrary (4-DOF) linear transformations; and
• translations

Properties of affine transformations:


• origin does not necessarily map to origin
• lines map to lines
• parallel lines map to parallel lines
• ratios are preserved
• compositions of affine transforms are also affine transforms
Does the last coordinate w ever change?
Affine transformations
Affine transformations are combinations of
• arbitrary (4-DOF) linear transformations; and
• translations

Properties of affine transformations:


• origin does not necessarily map to origin
• lines map to lines
• parallel lines map to parallel lines
• ratios are preserved
• compositions of affine transforms are also affine transforms
Nope! But what does that mean?
How to interpret affine transformations here?
image plane

image point in
pixel coordinates

image point in X is a projection of a point


heterogeneous P on the image plane
coordinates
Projective transformations (aka homographies)
Projective transformations are combinations of
• affine transformations; and
• projective wraps
How many degrees of freedom?
Properties of projective transformations:
• origin does not necessarily map to origin
• lines map to lines
• parallel lines do not necessarily map to parallel lines
• ratios are not necessarily preserved
• compositions of projective transforms are also projective
transforms
Projective transformations (aka homographies)
Projective transformations are combinations of
• affine transformations; and
• projective wraps
8 DOF: vectors (and therefore
Properties of projective transformations: matrices) are defined up to scale)
• origin does not necessarily map to origin
• lines map to lines
• parallel lines do not necessarily map to parallel lines
• ratios are not necessarily preserved
• compositions of projective transforms are also projective
transforms
How to interpret projective transformations here?
image plane

image point in
pixel coordinates

image point in X is a projection of a point


heterogeneous P on the image plane
coordinates
Determining unknown (affine) 2D transformations
Determining unknown transformations
Suppose we have two triangles: ABC and DEF.

B E

D
C

A
F
Determining unknown transformations
Suppose we have two triangles: ABC and DEF.
• What type of transformation will map A to D, B to E, and C to F?

B E

D
C

A
F
Determining unknown transformations
Suppose we have two triangles: ABC and DEF.
• What type of transformation will map A to D, B to E, and C to F?
• How do we determine the unknown parameters?
B E

Important: We will see a


different procedure for
dealing with D
homographies! C

A
F

Affine transform:
How many degrees of
uniform scaling + shearing
freedom do we have?
+ rotation + translation
Determining unknown transformations
Suppose we have two triangles: ABC and DEF.
• What type of transformation will map A to D, B to E, and C to F?
• How do we determine the unknown parameters?
B E

D
C

A
F
unknowns
• One point correspondence gives
how many equations?
• How many point
point correspondences correspondences do we need?
Determining unknown transformations
Suppose we have two triangles: ABC and DEF.
• What type of transformation will map A to D, B to E, and C to F?
• How do we determine the unknown parameters?
B E

D
C

A
F
unknowns

How do we solve this for M?


point correspondences
Least Squares Error
Least Squares Error What is
this?

What is What is
this? this?
Least Squares Error Euclidean
(L2) norm

squared!

predicted measured
location location
Least Squares Error

Residual (projection
error)
Least Squares Error

What is the free variable?


What do we want to optimize?
Find parameters that minimize squared error
General form of linear least squares

(Warning: change of notation. x is a vector of parameters!)

(matrix form)
Determining unknown transformations
Why can we drop
Affine transformation:
the last line?

Vectorize transformation
parameters:

Stack equations from point


correspondences:

Notation in system form:


Fitting an affine transformation

• Linear system with six unknowns


• Each match gives us two linearly independent
equations: need at least three to solve for the
transformation parameters
General form of linear least squares

(Warning: change of notation. x is a vector of parameters!)

(matrix form)

This function is quadratic.


How do you find the root of a quadratic?
Solving the linear system
Convert the system to a linear least-squares problem: In Matlab:

x = A \ b
Expand the error:

Minimize the error:


Set derivative to 0

Note: You almost never want to


Solve for x
compute the inverse of a matrix.
Linear least squares estimation only works when the transform function is ?
Linear least squares estimation only works when the transform function is linear! (duh)

Also doesn’t deal well with outliers


Determining unknown image warps
Determining unknown image warps
Suppose we have two images.
• How do we compute the transform that takes one to the other?

T(x, y)

y y’
x x’
f(x, y) g(x’, y’)
Forward warping
Suppose we have two images.
• How do we compute the transform that takes one to the other?

T(x, y)

y y’
x x’
f(x, y) g(x’, y’)
1. Form enough pixel-to-pixel correspondences between two images
2. Solve for linear transform parameters as before
3. Send intensities f(x,y) in first image to their corresponding location in the second image
Forward warping
Suppose we have two images.
• How do we compute the transform that takes one to the other?

T(x, y)

y y’
x x’
f(x, y) g(x’, y’)
what is the problem
1. Form enough pixel-to-pixel correspondences between two images with this?
2. Solve for linear transform parameters as before
3. Send intensities f(x,y) in first image to their corresponding location in the second image
Forward warping
Pixels may end up between two points
• How do we determine the intensity of each point?

f(x,y) g(x’,y’)
T(x,y)

y y’
x x’
Forward warping
Pixels may end up between two points
• How do we determine the intensity of each point?
✓ We distribute color among neighboring pixels (x’,y’) (“splatting”)

f(x,y) g(x’,y’)
T(x,y)

y y’
x x’

• What if a pixel (x’,y’) receives intensity from more than one pixels (x,y)?
Forward warping
Pixels may end up between two points
• How do we determine the intensity of each point?
✓ We distribute color among neighboring pixels (x’,y’) (“splatting”)

f(x,y) g(x’,y’)
T(x,y)

y y’
x x’

• What if a pixel (x’,y’) receives intensity from more than one pixels (x,y)?
✓ We average their intensity contributions.
Alignment as fitting
• Previous lectures: fitting a model to features in one
image M

xi Find model M that minimizes

 residual( x , M )
i
i

• Alignment: fitting a model to a transformation


between pairs of features (matches) in two images
xi
xi' Find transformation T
T that minimizes

 residual(T ( x ), x)
i
i i
What if you want to align but have no prior matched pairs?

• Hough transform and RANSAC not applicable

• Important applications

Medical imaging: match brain Robotics: match point clouds


scans or contours
Iterative Closest Points (ICP) Algorithm
Goal: estimate transform between two dense sets of
points

1. Initialize transformation (e.g., compute difference in


means and scale)
2. Assign each point in {Set 1} to its nearest neighbor in
{Set 2}
3. Estimate transformation parameters
• e.g., least squares or robust least squares
4. Transform the points in {Set 1} using estimated
parameters
5. Repeat steps 2-4 until change is very small
Alignment
• Alignment: find parameters of model that maps one
set of points to another

• Typically want to solve for a global transformation


that accounts for most true correspondences

• Difficulties
• Noise (typically 1-3 pixels)
• Outliers (often 30-50%)
• Many-to-one matches or multiple objects
Projective Transformations (homography)
• The transformation between two views of a planar
surface

• The transformation between images from two


cameras that share the same center
Application: Panorama stitching

Source: Hartley & Zisserman


Object Instance Recognition
A1
1. Match keypoints to object A3
model A2

Matched
keypoints
2. Solve for affine
transformation parameters
Affine
Parameters
3. Score by inliers and choose
solutions with score above This Class
# Inliers
threshold
Choose hypothesis with max
score above threshold
Overview of Keypoint Matching
1. Find a set of
distinctive key-
points
A1
2. Define a region
around each
A2 A3 keypoint

3. Extract and
normalize the
region content
fA fB
4. Compute a local
N pixels

descriptor from the


e.g. color e.g. color
normalized region
N pixels d( f A, fB )  T
5. Match local
descriptors
K. Grauman, B. Leibe
Finding the objects (overview)

Input
Image Stored
Image

1. Match interest points from input image to database image


2. Matched points vote for rough position/orientation/scale of
object
3. Find position/orientation/scales that have at least three votes
4. Compute affine registration and matches using iterative least
squares with outlier check
5. Report object if there are at least T matched points
Matching Keypoints
• Want to match keypoints between:
1. Query image
2. Stored image containing the object

• Given descriptor x0, find two nearest neighbors x1, x2


with distances d1, d2

• x1 matches x0 if d1/d2 < 0.8


• This gets rid of 90% false matches, 5% of true matches in
Lowe’s study
Finding the objects (in detail)
1. Match interest points from input image to database image
2. Get location/scale/orientation using Hough voting
• In training, each point has known position/scale/orientation
wrt whole object
• Matched points vote for the position, scale, and orientation
of the entire object
• Bins for x, y, scale, orientation
• Wide bins (0.25 object length in position, 2x scale, 30 degrees orientation)
• Vote for two closest bin centers in each direction (16 votes total)
3. Geometric verification
• For each bin with at least 3 keypoints
• Iterate between least squares fit and checking for inliers and
outliers
4. Report object if > T inliers (T is typically 3, can be computed to
match some probabilistic threshold)
Examples of recognized objects
View interpolation
• Training
– Given images of different
viewpoints
– Cluster similar viewpoints
using feature matches
– Link features in adjacent
views

• Recognition
– Feature matches may be
spread over several
training viewpoints
 Use the known links to
“transfer votes” to other
viewpoints [Lowe01]

Slide credit: David Lowe


Applications
• Sony Aibo
(Evolution Robotics)

• SIFT usage
– Recognize
docking station
– Communicate
with visual cards

• Other uses
– Place recognition
– Loop closure in SLAM

K. Grauman, B. Leibe 108


Slide credit: David Lowe
Location Recognition

Training

[Lowe04]
Slide credit: David Lowe
Another application: category
recognition
• Goal: identify what type of object is in the image
• Approach: align to known objects and choose
category with best match

“Shape matching and object recognition using low distortion correspondence”, Berg et
al., CVPR 2005: [Link]
Examples of Matches
Examples of Matches
Next week
• Object instance recognition
Computer Vision Q&A
Your Name
May 27, 2025

Contents

1
Questions and Answers
Questions and Answers
Question 1 (1 point)
What is a homography between two images?
Answer:
A homography is a transformation between two images that relates corresponding points
in the two images. It is a projective transformation that maps points from one plane to
another and is represented by a 3 × 3 matrix H. A homography accounts for effects such
as translation, rotation, scaling, and perspective distortion.
Mathematically, the homography transformation is written as:

x′ = Hx,

where:

• x = [x, y, 1]T is a point in the first image in homogeneous coordinates,

• x′ = [x′ , y ′ , 1]T is the corresponding point in the second image,

• H is the 3 × 3 homography matrix.

The homography matrix H has 8 degrees of freedom because it is defined up to a scale


factor, meaning the matrix can be scaled by any non-zero constant without changing the
transformation. Homographies are used in applications such as:

• Image stitching (panoramas),

• Perspective correction,

• Augmented reality (overlaying virtual objects on images).

Question 2 (2 points)
Explain the distinction between motion field and optical flow in an image.
Use a diagram to illustrate your answer.
Answer:
The distinction between motion field and optical flow can be explained as follows:

• Motion Field: The motion field represents the actual projection of 3D scene points
onto the 2D image plane as they move through time. It describes the true physical
motion of points in the scene.

• Optical Flow: Optical flow, on the other hand, refers to the apparent motion of
brightness patterns in an image sequence. It estimates pixel movement between
frames based on the intensity values in images. However, it may not always corre-
spond to the actual motion field due to occlusions, lighting changes, or textureless
regions.

2
The relationship between the two can be summarized:
Motion Field ̸= Optical Flow (in general)
However, in ideal conditions (e.g., Lambertian surfaces and small displacements), optical
flow approximates the motion field.

motion_field_vs_optical_flow.png

Figure 1: Difference between Motion Field and Optical Flow. Motion field represents
actual motion in 3D, while optical flow estimates 2D apparent motion.

Question 3 (4 points)
Using appropriate mathematical expressions, define the following operations
commonly used in computer vision and briefly explain their function and
applications:
1. Convolution:
Convolution is a mathematical operation used to combine two functions to pro-
duce a third function. In computer vision, it is widely used for filtering images.
Mathematically, convolution is defined as:
XX
(f ∗ g)(x, y) = f (i, j)g(x − i, y − j),
i j

where f is the input image, g is the kernel (filter), and (x, y) are coordinates.
Convolution is fundamental to operations like image smoothing, sharpening, and
feature extraction.

3
2. Correlation:
Correlation is similar to convolution but involves a direct alignment of the kernel
without flipping it. It is defined as:
XX
(f ⋆ g)(x, y) = f (i, j)g(x + i, y + j).
i j

Correlation is used in template matching, where we slide a kernel over the image
to find matches.

3. Edge Detection by Second-Derivative Zero-Crossings:


Edge detection using second-derivative zero-crossings involves applying the Lapla-
cian operator, which computes the second derivatives of the image. Edges are
located at points where the second derivative changes sign (zero-crossings). Math-
ematically:
2 ∂ 2f ∂ 2f
∇ f (x, y) = + 2.
∂x2 ∂y
This method detects edges with high precision but is sensitive to noise. The Lapla-
cian of Gaussian (LoG) filter is often used to reduce noise before detecting edges.

4. Invariant Transform:
Invariant transforms are mathematical operations that produce outputs unchanged
under specific transformations (e.g., rotation, scaling, or translation). For example:

T (f (x, y)) = f ′ (x, y),

where f is the input image, and f ′ is the transformed image. Invariant transforms
are widely used in object recognition and image matching, where features must
remain consistent under different viewpoints or scales.

Question 4 (6 points)
Explain the notion of scale-space and how it is used in various areas of com-
puter vision. Include the following:

1. Pyramidal representations of image structure across successive scales of


blurred undersampling.

2. Edge detection operators that extract edges at particular scales of anal-


ysis, but not at others.

3. The behaviour of zero-crossings, their trajectories and “fingerprints” in


scale-space.

Answer:
Scale-space is a multi-scale representation of an image, where the image is analyzed at
multiple levels of resolution or scales. It addresses the problem of finding structures or
features in images that exist at different spatial scales, such as fine details (small scales)
or large structures (large scales).

4
The scale-space is generated by convolving an image I(x, y) with a Gaussian kernel
G(x, y; σ), where σ controls the scale. Mathematically, the scale-space representation
L(x, y; σ) is defined as:
L(x, y; σ) = I(x, y) ∗ G(x, y; σ),
where:
1 − x2 +y2 2
G(x, y; σ) = e 2σ .
2πσ 2
Here, σ determines the amount of smoothing (blurring).

a. Pyramidal Representations of Image Structure: Pyramidal representation is


a discrete approximation of the continuous scale-space, where an image is successively
smoothed and sub-sampled (reduced in resolution). It forms a pyramid-like structure of
images:

• At the base of the pyramid, the original image is preserved.

• As we go up the pyramid, each level represents a blurred and downsampled


version of the image.

This representation reduces computational cost and is useful for:

• Image compression,

• Multi-scale feature extraction,

• Efficient object detection and recognition.

b. Edge Detection Operators at Different Scales: Edge detection in scale-space


involves applying edge detection operators (e.g., gradient or Laplacian filters) to images
at multiple scales. Key points include:

• Fine-scale edge detectors capture small details, but may be noisy.

• Coarse-scale edge detectors capture large-scale structures, but miss finer details.

Edges detected at smaller scales may not persist at larger scales. Therefore, edges that
consistently appear across multiple scales are more robust and meaningful for computer
vision tasks.

c. Behaviour of Zero-Crossings in Scale-Space: Zero-crossings occur where the


second derivative of the image intensity passes through zero, often detected using the
Laplacian of Gaussian (LoG) filter. In scale-space:

• Zero-crossings can appear, disappear, or move as the scale increases (trajectories in


scale-space).

• The set of all zero-crossings across scales forms a fingerprint in scale-space, which
encodes the structure of the image.

Zero-crossing trajectories are useful for:

• Multi-scale edge detection,

5
• Detecting object boundaries,
• Extracting meaningful image features at different resolutions.

Question 5 (2 points)
Why is it important that images are not under-sampled if you wish to accu-
rately track the images of objects?
Answer:
Under-sampling occurs when the sampling rate is insufficient to capture the variations
in an image. Inaccurate sampling can lead to loss of information and aliasing, where
high-frequency details are misrepresented as lower frequencies.
The importance of avoiding under-sampling for object tracking includes:
• Preserving Fine Details: Under-sampling can blur or distort object edges, mak-
ing it difficult to detect and track features like corners or edges.
• Avoiding Aliasing: Aliasing introduces artifacts that can mislead object tracking
algorithms. For instance, patterns in objects may appear shifted or incorrectly
matched.
• Ensuring Spatial Accuracy: Accurate sampling ensures that objects’ positions,
motion trajectories, and shapes are precisely captured.
To avoid under-sampling, the Nyquist criterion should be satisfied:
fs ≥ 2fmax ,
where fs is the sampling frequency and fmax is the highest frequency in the image.
Proper sampling is critical for applications such as motion estimation, optical flow,
and object tracking in computer vision.

Question 6 (3 points)
The Harris corner detection algorithm computes a 2 × 2 matrix at each pixel
based on the first derivatives at that point and then computes the two eigen-
values of the matrix, λ1 and λ2 , where λ1 ≤ λ2 . How can these two values be
used to label each pixel as either a locally smooth region (S), an edge point
(E), or a corner point (C)? Give your answer by specifying “Label pixel S if
...”, “Label pixel E if ...” and “Label pixel C if ...”
Answer:

The Harris corner detection algorithm analyzes the local structure of the image based
on the eigenvalues of the 2 × 2 structure tensor (autocorrelation matrix) defined as:
 2 
Ix Ix Iy
M= ,
Ix Iy Iy2
where Ix and Iy are the gradients of the image along the x- and y-directions, respectively.
The eigenvalues λ1 and λ2 represent the intensity variations in orthogonal directions.
The classification into smooth regions, edge points, and corner points is done as follows:

6
• Label pixel S (Smooth region) if:

λ1 ≈ 0 and λ2 ≈ 0.

Explanation: Both eigenvalues are small, meaning there is little to no variation in


intensity in any direction. This indicates a locally smooth or flat region.

• Label pixel E (Edge point) if:

λ1 ≈ 0 and λ2 > 0.

Explanation: One eigenvalue is large (significant variation in one direction), and the
other is close to zero. This indicates an edge where intensity changes significantly
along one direction but not the other.

• Label pixel C (Corner point) if:

λ1 > 0 and λ2 > 0.

Explanation: Both eigenvalues are large, meaning there is significant variation in


intensity in two orthogonal directions. This indicates a corner where the intensity
changes in all directions.

To summarize:

S: λ1 ≈ 0, λ2 ≈ 0, E: λ1 ≈ 0, λ2 > 0, C: λ1 > 0, λ2 > 0.

Question 7 (4 points)
Given the two eigenvalues specified in (Question 6), explain in English the
rationale and difference(s) between detecting corner points using the crite-
rion λ1 > T1 versus the criterion λ1 λ2 > T2 , where T1 and T2 are appropriate
thresholds.
Answer:

The two criteria for detecting corner points using eigenvalues of the structure tensor
M are as follows:

• Criterion 1: λ1 > T1
This condition considers only the magnitude of the smaller eigenvalue λ1 relative
to a threshold T1 .
Rationale:

– A large value of λ1 implies significant variation in at least one direction.


– However, this condition alone does not guarantee that the variation occurs in
multiple directions.
– As a result, this criterion may detect both edges and corners because edges
typically have one large eigenvalue.

7
Difference: The criterion λ1 > T1 can lead to false positives for edges, as it does
not account for the second eigenvalue λ2 .
• Criterion 2: λ1 λ2 > T2
This condition considers the product of both eigenvalues relative to a threshold T2 .
Rationale:
– The product λ1 λ2 becomes large only if both eigenvalues are large.
– Since corners exhibit significant intensity variation in two orthogonal direc-
tions, both eigenvalues must be non-negligible for a corner to be detected.
– This criterion is more robust for corner detection because it avoids detecting
edges where only one eigenvalue is large.
Difference: The criterion λ1 λ2 > T2 is stricter than λ1 > T1 because it ensures that
intensity variation occurs in all directions, making it more reliable for detecting
corners.

Summary of Differences:
• λ1 > T1 detects regions with strong variation in at least one direction (both edges
and corners).
• λ1 λ2 > T2 detects regions with strong variation in two orthogonal directions (corner
points only).
Thus, the second criterion (λ1 λ2 > T2 ) is preferred for detecting corners because it
avoids misclassifying edges as corners.

Question 8 (4 points)
a. Explain the key principles underlying the Scale-Invariant Feature Trans-
form (SIFT).
b. What is it used for?
c. What goals does it achieve?
d. How does it achieve them?
Answer:

[label=)] Part (a)Key principles underlying the Scale-Invariant Fea-


ture Transform (SIFT):
The SIFT algorithm is designed to detect and describe local features in im-
ages that are invariant to scale, rotation, and partially invariant to illumination
changes and affine transformations. Its key principles include:
1. • Scale-space extrema detection: SIFT uses a Difference of Gaussian (DoG)
function to detect stable keypoints across multiple scales.
• Keypoint localization: The position, scale, and orientation of keypoints are
refined to ensure stability under noise and transformations.
• Orientation assignment: A consistent orientation is assigned to each key-
point based on local image gradient directions, enabling rotation invariance.

8
• Keypoint descriptor: A descriptor is constructed by analyzing the gradient
magnitude and direction around the keypoint in a local neighborhood. The
descriptor is represented as a 128-dimensional vector.

2. What is it used for?


SIFT is widely used in computer vision tasks that involve matching features between
images. Applications include:

• Image stitching and panorama creation


• Object recognition
• Scene reconstruction
• Motion tracking
• Image registration

3. What goals does it achieve?


The main goals of SIFT are:

• Scale invariance: Detect features reliably at different scales.


• Rotation invariance: Match features regardless of image orientation.
• Robustness: Be robust to noise, illumination changes, and partial occlusions.
• Distinctiveness: Generate highly distinctive feature descriptors to uniquely
identify keypoints.

4. How does it achieve them?


SIFT achieves its goals through a systematic multi-step process:

(a) Scale-space construction: The image is convolved with Gaussian filters at


different scales to create a scale-space. The Difference of Gaussian (DoG) is
used to efficiently approximate Laplacian of Gaussian for detecting keypoints.
(b) Keypoint detection: Local extrema in the DoG images are identified across
scale and space.
(c) Keypoint localization and refinement: Keypoints are refined using inter-
polation to accurately determine their position, scale, and response strength.
(d) Orientation assignment: For each keypoint, the dominant gradient orien-
tation is computed to make the descriptor rotation-invariant.
(e) Descriptor generation: A local region around the keypoint is divided into
sub-regions, and gradients are computed within each. These gradients are used
to form a 128-dimensional descriptor vector.
(f) Feature matching: Keypoints and their descriptors are compared across
images using similarity metrics (e.g., Euclidean distance).

This multi-step pipeline ensures that SIFT can reliably detect and describe features
that are invariant to scale, rotation, and partially to affine transformations and
lighting changes.

9
Question 9 (3 points)
What special difficulties would you expect to have in computing optical flow
given:
a. Spatially aliased images?
b. Temporally aliased image sequence?
c. What can you do to deal with these situations?
Answer:

1. Difficulties with spatially aliased images:


Spatial aliasing occurs when the image resolution is insufficient to capture high-
frequency details of the scene, causing different image features to become indistin-
guishable. In the context of optical flow:

• High-frequency details such as fine edges, textures, or small objects may not
be accurately represented.
• Features that are spatially close together can overlap or appear as a single
feature, leading to ambiguous motion estimation.
• Optical flow algorithms rely on spatial gradients, and aliasing causes incorrect
gradient computation, resulting in inaccurate flow vectors.

2. Difficulties with temporally aliased image sequences:


Temporal aliasing arises when the frame rate is insufficient to capture the motion
of objects, leading to undersampling in time. This results in:

• Rapid motion appearing as discontinuous or “jumping” motion in successive


frames.
• Ambiguities where objects appear to move backward (the “wagon-wheel ef-
fect”) or faster than their actual motion.
• Optical flow algorithms failing because the motion between consecutive frames
exceeds the assumptions of small displacements.

3. Solutions to deal with these situations:

• For spatial aliasing:


– Use higher resolution images to capture finer spatial details.
– Apply pre-processing techniques like anti-aliasing filters (low-pass filter-
ing) to reduce high-frequency content that cannot be represented.
– Use multi-scale methods (e.g., pyramidal representations) to compute op-
tical flow at coarser scales first, followed by finer scales.
• For temporal aliasing:
– Increase the frame rate to ensure that object motion between consecutive
frames is small.
– Apply motion blur to smooth out rapid motion, making it easier to esti-
mate flow.

10
– Use robust methods such as multi-frame optical flow estimation, which
integrates information across multiple frames rather than relying on two
consecutive frames.

Question 10 (4 points)
For an image I(x, y), define its gradient vector field ∇I(x, y).

1. Why is this vector field a useful thing to compute?


The gradient vector field ∇I(x, y) provides information about the rate of change of
intensity in an image. It is useful for several reasons:

• Edge detection: The gradient highlights regions of rapid intensity change,


which typically correspond to edges in the image.
• Feature extraction: The gradient is used in many computer vision tech-
niques to identify important features in an image, such as corners, contours,
and boundaries.
• Image processing: The gradient is often used in image enhancement, filter-
ing, and segmentation tasks, as it helps detect texture, shape, and structure
in the image.

2. Define the gradient magnitude over the image plane (x, y).
The gradient magnitude at a point (x, y) in the image is defined as the Euclidean
norm of the gradient vector:
s 
2  2
∂I ∂I
|∇I(x, y)| = +
∂x ∂y
∂I ∂I
where ∂x and ∂y are the partial derivatives of the image intensity I(x, y) with
respect to the spatial coordinates x and y. The gradient magnitude indicates the
strength of the intensity change at each pixel.

3. Define the gradient direction over the image plane (x, y).
The gradient direction at a point (x, y) is the angle of the gradient vector relative
to the horizontal axis and is given by:
∂I
!
∂y
θ(x, y) = tan−1 ∂I
∂x

This direction represents the orientation of the steepest ascent in intensity, indicat-
ing the direction of maximum change in the image.

4. Explain how the gradient vector field is used in the Canny edge detector,
the main steps in its use, and its advantages.
The Canny edge detector uses the gradient vector field to detect edges in an image
by following these main steps:

11
• Step 1: Smoothing (Gaussian filtering): The image is smoothed using a
Gaussian filter to reduce noise that may interfere with edge detection.
• Step 2: Gradient computation: The gradient vector field of the smoothed
image is computed using Sobel or similar filters. This step calculates both the
gradient magnitude and direction at each pixel.
• Step 3: Non-maximum suppression: After calculating the gradient mag-
nitude, non-maximum suppression is applied to thin the edges by keeping only
local maxima in the gradient magnitude along the gradient direction.
• Step 4: Edge tracking by hysteresis:
– High gradient magnitude pixels are considered strong edges, and low mag-
nitude pixels are considered weak.
– Weak pixels are included in the final edge set if they are connected to
strong edges, forming continuous edges while eliminating noise.

Advantages of the Canny edge detector:

• It is effective at detecting true edges while minimizing false edges.


• The algorithm is robust to noise due to the initial smoothing step.
• It produces thin edges using non-maximum suppression, making edge detection
more precise.
• The hysteresis step ensures that weak edges that are connected to strong edges
are not discarded, leading to more complete edge detection.

Question 11 (5 points)
1. Define the gradient vector field ∇2 f (x, y) over an image f (x, y), and ex-
plain what makes it useful. Contrast its features and capabilities with
the ∇2 Gσ (x, y) (Laplacian of a Gaussian) operator.
The operator ∇2 f (x, y) is the Laplacian of the image function f (x, y). It is defined
as the sum of the second partial derivatives of the image intensity f (x, y) with
respect to x and y:
2 ∂ 2 f (x, y) ∂ 2 f (x, y)
∇ f (x, y) = +
∂x2 ∂y 2
The Laplacian operator measures the local rate of change of the gradient in an
image and is commonly used to detect regions of rapid intensity change, such as
edges and corners. It is particularly useful for identifying points of zero-crossing,
where the intensity changes from increasing to decreasing or vice versa.
The Laplacian of a Gaussian (∇2 Gσ (x, y)) operator is a combination of two opera-
tions:  
2 2 1 − x2 +y2 2
∇ Gσ (x, y) = ∇ e 2σ
2πσ 2
This operator involves first applying a Gaussian smoothing function with a standard
deviation σ and then computing the Laplacian of the smoothed image. The key
difference is that the Laplacian of a Gaussian smooths the image before applying

12
the Laplacian operator, which helps suppress noise and enhance the detection of
edges at different scales. The ∇2 Gσ (x, y) operator is more robust to noise than
∇2 f (x, y) due to the smoothing effect.

2. Contrast with ∇2 Gσ (x, y):

• ∇2 f (x, y) is a direct Laplacian, sensitive to noise and high-frequency details.


• ∇2 Gσ (x, y) involves smoothing via Gaussian filtering, which helps reduce noise
before detecting edges and can be more effective in detecting edges at multiple
scales.

3. Identify their respective orders as differential operators:

• The operator ∇2 f (x, y) is a second-order differential operator because it in-


volves the second derivatives of the image function f (x, y).
• The operator ∇2 Gσ (x, y) is also a second-order differential operator, as it
involves the second derivative of a Gaussian function applied to the image.
However, it first applies Gaussian smoothing, which alters the way the second
derivatives are computed compared to the direct Laplacian operator.

4. Explain how they can be implemented:

• Laplacian (∇2 f (x, y)):


– The Laplacian can be implemented using discrete finite difference approx-
imations. For example, a common discrete kernel for the Laplacian is:
 
0 1 0
Laplacian kernel = 1 −4 1
0 1 0

– This kernel is convolved with the image to calculate the Laplacian at each
pixel.
• Laplacian of Gaussian (∇2 Gσ (x, y)):
– This can be implemented in two steps: first, apply a Gaussian filter
(smoothing), and then compute the Laplacian of the resulting image.
– The Gaussian kernel is defined as:
1 − x2 +y2 2
Gσ (x, y) = e 2σ
2πσ 2
– After smoothing, apply the Laplacian filter, which can also be approxi-
mated using discrete finite differences, such as the following kernel:
 
1 1 1
∇2 Gσ = 1 −8 1
1 1 1

13
• Discuss any neurobiological analogues for both:
In neurobiology, the operations of the Laplacian and Laplacian of Gaussian
are somewhat analogous to the way the human visual system detects edges
and contours:
– The Laplacian (∇2 f (x, y)) is analogous to the way simple cells in the visual
cortex (V1) respond to changes in contrast or the boundaries between
light and dark regions. These cells are sensitive to the first derivative of
intensity and help detect edges.
– The Laplacian of Gaussian (∇2 Gσ (x, y)) is similar to the behavior of com-
plex cells that are tuned to edges and contours at different scales. These
cells often respond to the second derivative of intensity and are thought to
integrate information across a region of visual space, making them more
robust to noise and more sensitive to significant features like edges at
various scales.

Question 12 (2 points)
(a) Would you aim to include, as design goals, the standard geometric
visual illusions as well?
No, in a machine vision system, especially one designed for a visual prosthesis
for the blind, the goal is to restore or enhance vision in a way that is as
close to natural human vision as possible. Standard geometric visual illusions,
which are artifacts of human perception, would not be a desirable design goal.
These illusions often arise due to limitations or biases in the way human brains
process visual stimuli, and they could lead to confusion or misinterpretation
of the visual data for users relying on the prosthesis.
(b) If such errors of vision emerged as unintended consequences of your
design, would you consider them to be features, or bugs, and why?
If visual illusions emerged as unintended consequences, they would be consid-
ered bugs, not features. The purpose of a prosthetic system is to replicate
or approximate normal vision, and introducing illusions would hinder the sys-
tem’s ability to provide accurate visual information. Illusions are generally
undesirable because they interfere with perception and could lead to misin-
terpretation of visual cues, which could be especially harmful in contexts like
navigation or object identification.

Question 13 (3 points)
(a) Provide a 3 × 3 discrete filter kernel array that approximates the
Laplacian operator.
A common 3x3 discrete kernel that approximates the Laplacian operator is:
 
0 1 0
1 −4 1
0 1 0

14
This kernel computes the Laplacian by approximating the second-order partial
derivatives of the image.
(b) Explain what the Laplacian might be used for:
The Laplacian operator is commonly used for edge detection in image process-
ing. It highlights regions where the intensity changes rapidly, such as edges,
and is useful in applications like object recognition, segmentation, and fea-
ture extraction. The Laplacian can also be used for image sharpening, as it
highlights areas where there are significant intensity transitions.
(c) What is the significance of the sum of all of the coefficients in the
filter?
The sum of the coefficients in the Laplacian filter kernel is typically zero. This
property ensures that the filter detects differences in intensity across the image
rather than affecting the overall brightness of the image. A zero sum ensures
that the filter responds only to relative intensity changes (edges) rather than
to global image intensity, which is important for preserving the contrast in the
image while detecting edges or regions of rapid intensity change.

Question 14 (2 points)
Extraction of visual features from images often involves convolution with filters
that are themselves constructed from combinations of differential operators. One
example is the Laplacian of a Gaussian Gσ (x, y) having scale parameter σ, gener-
ating the filter ∇2 Gσ (x, y) for convolution with the image I(x, y). Explain in detail
each of the following three operator sequences, where ∗ signifies two-dimensional
convolution.

(a) ∇2 [Gσ (x, y) ∗ I(x, y)]


In this case, the image I(x, y) is first convolved with the Gaussian kernel
Gσ (x, y), producing a smoothed version of the image. The Laplacian operator
∇2 is then applied to this smoothed image. The result is a filter that detects
areas where the image intensity changes rapidly at different scales, as deter-
mined by the parameter σ. The convolution with the Gaussian smooths out
high-frequency noise, and the Laplacian then highlights regions of significant
intensity changes (edges) by detecting zero-crossings of the second derivative.
This sequence is often used for multi-scale edge detection and blob detection.
(b) Gσ (x, y) ∗ ∇2 I(x, y)
In this sequence, the Laplacian ∇2 I(x, y) is first applied to the image I(x, y),
which computes the second derivative of the image at each point, detecting
rapid intensity changes (edges). The result is then convolved with the Gaus-
sian kernel Gσ (x, y). This process results in the blurring or smoothing of the
second-derivative image, which reduces high-frequency noise and makes the
edge detection less sensitive to small fluctuations in intensity. The combina-
tion enhances the edge detection by making it more robust to noise and small
variations in the image, but it may lose some finer details compared to the
first sequence.

15
(c) What are the differences amongst their effects on the image?
The key difference between the two sequences lies in the order of operations and
the resulting effects on the image. - In ∇2 [Gσ (x, y) ∗ I(x, y)], the convolution
with the Gaussian smooths the image first, and then the Laplacian detects
edges or regions with rapid intensity change. This sequence is more sensitive to
noise, but it provides a more detailed edge detection. - In Gσ (x, y) ∗ ∇2 I(x, y),
the Laplacian is applied first to detect edges, and then the Gaussian smooths
the result. This sequence is more robust to noise and small intensity variations
but may miss finer edge details because it operates on a second-derivative
image that has already been modified by the Laplacian.

Question 15 (2 points)
Explain the notion of scale-space and how it is used in various areas of computer
vision. Include the following:

(a) Pyramidal representations of image structure across successive scales


of blurred undersampling.
Scale-space is a framework used in computer vision to represent an image at
different levels of resolution. It involves creating a series of images, each repre-
senting a different scale, typically achieved by progressively blurring the image
and then downsampling it. This process is useful for analyzing structures in
images at multiple levels of detail, capturing both fine and coarse features.
Pyramidal representations, such as Gaussian pyramids, are commonly used to
achieve this. In a Gaussian pyramid, the image is progressively blurred using
a Gaussian filter with increasing standard deviation σ, and downsampled at
each level. This representation helps in tasks like multi-scale object detec-
tion, feature matching, and image segmentation, where features may appear
at different scales in the image.
(b) Edge detection operators that extract edges at particular scales of
analysis, but not at others.
Scale-space theory is particularly useful in edge detection, as edges in images
are scale-dependent. The notion of extracting edges at particular scales in-
volves using edge detection operators that work differently depending on the
scale of the image. For example, the Canny edge detector or Laplacian of
Gaussian (LoG) operator can be applied at multiple scales to detect edges
that appear at different levels of resolution. When an operator is applied at a
given scale, it captures the edges that are significant at that scale but ignores
smaller or larger edges that do not correspond to the chosen scale. By applying
these operators at multiple scales, one can extract edges of varying significance
and ensure robustness in tasks like object detection and image segmentation.

16
Question 16 (4 points)
(a) Describe the Canny edge detection algorithm, including how the
edge strength and orientation is calculated.
The Canny edge detection algorithm is a multi-step process used to identify
edges in an image. It aims to detect areas with rapid intensity changes and is
considered one of the most effective edge detection techniques due to its accu-
racy and robustness to noise. The steps involved in the Canny edge detection
are as follows:
i. Smoothing: The image is first smoothed using a Gaussian filter to reduce
noise and fine details. The amount of smoothing is controlled by the
standard deviation σ of the Gaussian kernel.
ii. Gradient calculation: The gradients of the image in the horizontal and
vertical directions are calculated using operators such as the Sobel opera-
tor. This step helps to identify areas where intensity changes significantly.
The gradient magnitude is computed as:
q
G(x, y) = G2x (x, y) + G2y (x, y)

where Gx and Gy are the gradients in the x- and y-directions, respectively.


iii. Edge strength: The edge strength at each pixel is the magnitude of the
gradient vector G(x, y), which indicates how strong the intensity change
is at that point.
iv. Edge orientation: The edge orientation (or direction) at each pixel is
calculated using the arctangent of the gradient components:
 
Gy (x, y)
θ(x, y) = arctan
Gx (x, y)

This gives the direction of the edge at each pixel, which is important for
accurate edge localization.
v. Non-maximum suppression: To thin the edges, a non-maximum sup-
pression step is performed. This involves comparing each pixel’s gradient
magnitude with the magnitudes of the pixels along the gradient direction.
If a pixel’s magnitude is not the maximum along the gradient direction,
it is suppressed.
vi. Edge tracking by hysteresis: Finally, edge tracking by hysteresis is
used to determine which edges are real. Pixels with a gradient magnitude
above a high threshold are considered strong edges, and pixels with a
magnitude between a low and high threshold are considered weak edges.
Weak edges that are connected to strong edges are retained, while others
are discarded.
(b) What is the spatial orientation tensor and how is it used?
The spatial orientation tensor, also known as the structure tensor or the second
moment matrix, is a representation of the local image structure at each pixel.
It captures the directionality of image features (such as edges or corners) by
considering the local gradients at that pixel. The tensor is constructed from

17
the gradients of the image in both the x- and y-directions and is given by:
 
Ix2 (x, y) Ix (x, y)Iy (x, y)
J(x, y) =
Ix (x, y)Iy (x, y) Iy2 (x, y)

where Ix and Iy are the partial derivatives of the image I(x, y) in the x- and
y-directions, respectively. The eigenvalues of this matrix can be used to assess
the local structure of the image:
• If both eigenvalues are small, the region is flat (i.e., no edge or corner).
• If one eigenvalue is large and the other is small, the region represents an
edge.
• If both eigenvalues are large, the region represents a corner.
The spatial orientation tensor is particularly useful for analyzing the direction-
ality and structure of edges and corners in an image. It is widely used in corner
detection algorithms and in applications where local image structure needs to
be understood for tasks such as feature matching and object recognition.

Question 17 (4 points)
Let the following matrix H denote a general 2D planar transformation:
 
h11 h12 h13
H = h21 h22 h23 
h31 h32 h33

For each of the following special cases, write down:

(a) Euclidean
• Parameterization of the hij ’s: In the Euclidean case, the transformation
includes rotation, translation, and possibly reflection. The matrix H is
parameterized by:  
cos θ − sin θ tx
H =  sin θ cos θ ty 
0 0 1
where θ is the rotation angle and tx , ty are the translations in the x- and
y-directions, respectively.
• Degrees of freedom: The Euclidean transformation has 3 degrees of free-
dom: 2 for the translation and 1 for the rotation.
• Number of correspondences: To estimate the transformation, 3 correspon-
dences are needed (since each correspondence provides 2 equations for the
parameters).
(b) Similarity

18
• Parameterization of the hij ’s: A similarity transformation includes scal-
ing, rotation, and translation. The matrix H is parameterized as:
 
s cos θ −s sin θ tx
H =  s sin θ s cos θ ty 
0 0 1

where s is the scaling factor, θ is the rotation angle, and tx , ty are the
translations.
• Degrees of freedom: The similarity transformation has 4 degrees of free-
dom: 1 for scaling, 1 for rotation, and 2 for translation.
• Number of correspondences: To estimate the transformation, 4 correspon-
dences are needed.
(c) Affine
• Parameterization of the hij ’s: An affine transformation includes rotation,
translation, scaling, and shearing. The matrix H is parameterized as:
 
h11 h12 tx
H = h21 h22 ty 
0 0 1

where h11 , h12 , h21 , h22 are the shearing and scaling parameters, and tx , ty
are the translations.
• Degrees of freedom: The affine transformation has 6 degrees of freedom:
2 for translation, 2 for scaling/shearing, and 2 for rotation.
• Number of correspondences: To estimate the transformation, 3 correspon-
dences are needed (since each correspondence provides 2 equations).
(d) Projective
• Parameterization of the hij ’s: A projective transformation (or homogra-
phy) includes all the affine transformations as well as perspective distor-
tion. The matrix H is parameterized as:
 
h11 h12 h13
H = h21 h22 h23 
h31 h32 h33

where all 8 parameters (except for scale) are independent.


• Degrees of freedom: The projective transformation has 8 degrees of free-
dom.
• Number of correspondences: To estimate the transformation, 4 correspon-
dences are needed (since each correspondence provides 2 equations).

Question 18 (5 points)
(a) Describe when it would be suitable to use the Prewitt masks.

19
The Prewitt masks are suitable for edge detection in images, particularly when
detecting edges in a noisy environment or when the image has strong gradients in
the horizontal or vertical directions. The Prewitt operator is a simple convolution-
based edge detection technique that is commonly used for detecting edges in images
by approximating the gradient in the horizontal and vertical directions.
Prewitt masks are specifically useful in cases where:

• The edges are relatively well-defined and the image is not excessively noisy.
• You want to detect horizontal and vertical edges in an image.
• A simpler edge-detection method is required, as it is less computationally
expensive compared to more complex methods such as Sobel.

(b) Use a diagram to illustrate these masks.


The Prewitt masks for horizontal and vertical edge detection are:
 
−1 0 1
Prewitt Horizontal Mask(Gx ) = −1 0 1
−1 0 1
 
−1 −1 −1
Prewitt Vertical Mask(Gy ) =  0 0 0
1 1 1

(c) Sketch the result when one of these masks is applied to the greyscale
image below.
Given the following grayscale image:
 
6 7 7 8 8 9
6 7 8 34 38 39
 
7 8 9 36 38 39
Image = 
7

 6 36 35 37 40
6 6 8 36 18 39
3 7 8 35 20 39

We will apply the Prewitt masks to the central region of the image, covering the
area from the second to the fifth row and second to the fifth column.
 
7 7 8 34
8 9 36 38
Central region = 
6 36 35

37
6 8 36 18

Let’s apply the Prewitt Horizontal Mask (for example) to this central region:
 
  7 7 8 34
−1 0 1 8 9 36 38
Convolution = −1 0 1 ∗ 
6 36 35

37
−1 0 1
6 8 36 18

20
The result is computed by multiplying each element of the mask with the corre-
sponding image element and summing the products.
Similarly, apply the Prewitt Vertical Mask to the central region, and compute the
result.
The final edge magnitudes will be calculated using:
q
G= G2x + G2y

where Gx and Gy are the results of applying the horizontal and vertical Prewitt
masks, respectively.

Question 19 (5 points)
(a) Define the gradient vector field ∇f (x, y) over an image f (x, y), and
explain what makes it useful. Contrast its features and capabilities with
the ∇2 Gσ (x, y) (Laplacian of a Gaussian) operator.
The gradient vector field of an image f (x, y) is a vector field that describes the rate
and direction of change in intensity at each point in the image. It is defined as:
 
∂f ∂f
∇f (x, y) = ,
∂x ∂y

where ∂f
∂x
and ∂f
∂y
represent the partial derivatives of the image with respect to the
horizontal and vertical directions, respectively. The gradient is useful in image
processing because it highlights edges and boundaries in the image. Strong gradi-
ent magnitudes indicate areas of rapid intensity change, typically corresponding to
edges.
The gradient is used for tasks such as edge detection (e.g., Canny edge detector),
image segmentation, and texture analysis. The gradient provides information about
the orientation and magnitude of intensity changes in the image.
On the other hand, the Laplacian of a Gaussian (LoG) operator, ∇2 Gσ (x, y), is
defined as:

∂ 2 Gσ (x, y) ∂ 2 Gσ (x, y)
∇2 Gσ (x, y) = +
∂x2 ∂y 2
where Gσ (x, y) is a Gaussian kernel with scale parameter σ. The LoG operator
is used primarily for edge detection, especially in scale-space analysis, and is more
sensitive to the presence of corners and blobs in an image compared to the gradient.
It is useful for detecting edges with varying thickness and is often applied in multi-
scale approaches.
(b) Identify their respective orders as differential operators.

21
The gradient ∇f (x, y) is a first-order differential operator because it computes
the first derivatives of the image along the horizontal and vertical directions. It
measures the rate of change of intensity.
The Laplacian of a Gaussian ∇2 Gσ (x, y) is a second-order differential operator
because it involves second-order derivatives. The Laplacian operator calculates the
second derivatives, which help in identifying regions of the image where intensity
changes rapidly in all directions (i.e., edges or regions of interest).
(c) Explain how they can be implemented.
- The gradient can be implemented using discrete approximations of the partial
derivatives, such as the Sobel operator or the Prewitt operator, which convolve the
image with kernel masks to approximate the gradients in the horizontal and vertical
directions.

   
−1 0 1 −1 −2 −1
Sobel Operator (Horizontal) = −2 0 2 Sobel Operator (Vertical) =  0 0 0
−1 0 1 1 2 1

- The Laplacian of a Gaussian can be implemented by first applying a Gaussian


filter to the image (to smooth it), followed by applying the Laplacian operator
(second derivative). In discrete terms, the LoG operator can be implemented using a
convolution of the image with a precomputed kernel for the Laplacian of a Gaussian.

∂2 ∂2
 
1 − x2 +y2 2 2
Gσ (x, y) = e 2σ ∇ Gσ (x, y) = + Gσ (x, y)
2πσ 2 ∂x2 ∂y 2

(d) Discuss any neurobiological analogues for both.


In neurobiology, the gradient operator has an analogy in the receptive fields of
simple cells in the primary visual cortex (V1). These cells respond to edges and
boundaries in the visual scene, similar to how the gradient operator detects changes
in intensity. These cells are sensitive to the orientation and direction of edges, which
corresponds to the information provided by the gradient vector field.
The Laplacian of a Gaussian has an analogy in the neurobiological concept of edge-
detecting neurons that are sensitive to both the intensity changes and the scale of
the changes. The LoG operator can be compared to the behavior of complex cells
in the visual cortex, which detect more abstract features like corners and blobs at
multiple scales.

Question 20 (5 points)
Explain the Iterative Closest Point (ICP) Algorithm to estimate the
transform between two dense sets of points.
The Iterative Closest Point (ICP) algorithm is used to estimate the rigid trans-
formation (rotation and translation) between two dense sets of points, commonly

22
referred to as the source set P and the target set Q. The goal is to align the source
point cloud with the target point cloud as accurately as possible.
Steps in the ICP Algorithm:
1. **Initialization**: - Start with an initial guess for the transformation T, typically
a 4×4 matrix combining rotation R and translation t. - Apply this transformation
to the source set P.
2. **Closest Point Association**: - For each point p ∈ P (transformed source
set), find the closest point q ∈ Q in the target set. This step involves a nearest
neighbor search, which can be computationally intensive for large datasets. - The
correspondence pairs are recorded as (pi , qi ) for all i.
3. **Error Metric Definition**: - Define the alignment error as the sum of squared
distances between each correspondence:
X
E(T) = ∥qi − (Rpi + t)∥2
i

- The goal is to minimize this error by updating R and t iteratively.


4. **Update Transformation**: - Compute the optimal transformation T = (R, t)
that minimizes the error metric E(T). This can be done using methods such as
the Singular Value Decomposition (SVD): - Compute the centroids of the two point
clouds. - Align the centroids and calculate the optimal rotation R using SVD. -
Compute the optimal translation t to align the centroids.
5. **Transform Source Set**: - Apply the updated transformation T to the source
set P.
6. **Iterate Until Convergence**: - Repeat Steps 2–5 until the change in the
alignment error E(T) is below a specified threshold or the maximum number of
iterations is reached.
Key Properties of ICP: - The algorithm assumes a rigid-body transformation (no
scaling or deformation). - It minimizes the point-to-point distance by default,
though variations exist to minimize point-to-plane or other distance metrics.
Applications: - ICP is widely used in computer vision, robotics, and 3D modeling
for tasks such as: - Aligning 3D scans or point clouds. - Object pose estimation. -
SLAM (Simultaneous Localization and Mapping).
Strengths and Weaknesses: - **Strengths**: - Simple and intuitive. - Can achieve
high accuracy for small initial misalignments. - **Weaknesses**: - Sensitive to the
initial transformation guess (can converge to a local minimum). - Computationally
expensive due to the nearest neighbor search. - Assumes sufficient overlap between
the two point sets.

Question 21 (5 points)
1. Explain the Hough Transform.
The Hough Transform is a feature extraction technique used in image analysis and
computer vision to detect specific shapes, such as lines, circles, and other parametric

23
curves. The method maps points from the image space to a parameter space where
the desired shape can be identified as an intersection or peak in the accumulator
space.
Steps in the Hough Transform for Lines: 1. **Parametric Representation**: - A
line in 2D can be expressed parametrically as:
ρ = x cos θ + y sin θ,
where: - ρ is the perpendicular distance from the origin to the line, - θ is the angle
between the line’s normal and the x-axis, - (x, y) are the coordinates of a point on
the line.
2. **Voting in Parameter Space**: - Each edge point (x, y) in the image contributes
votes to all possible (ρ, θ) pairs in the parameter space that could define a line
passing through the point. - This results in a sinusoidal curve in the (ρ, θ) space.
3. **Accumulator Array**: - The parameter space is discretized into an accumula-
tor array. - Votes are tallied for each (ρ, θ) combination. Peaks in the accumulator
indicate potential lines.
4. **Peak Detection**: - Peaks in the accumulator array correspond to lines in the
image. These peaks are identified and mapped back to the image space.
Applications: - Detecting lines in edge-detected images (e.g., road lane detection). -
Identifying shapes like circles, ellipses, or other geometric forms through extensions.

2. Can the Hough Transform be used with shapes other than circles and
lines? If so, how?
Yes, the Hough Transform can be extended to detect shapes other than circles and
lines by modifying the parameterization to match the equation of the desired shape.
Below are some examples:
a. Detecting Circles: 1. **Parametric Representation**: - A circle is defined by its
center (a, b) and radius r:
(x − a)2 + (y − b)2 = r2 .
- The parameter space becomes (a, b, r).
2. **Voting Process**: - For each edge point (x, y), votes are cast for all possible
(a, b, r) combinations that could define a circle passing through the point. - This
results in a 3D accumulator array where peaks correspond to potential circles.
b. Detecting Arbitrary Shapes: 1. **Parametric Representation**: - Shapes can
be represented by a set of equations, such as ellipses or parabolas. - For example,
an ellipse can be parameterized as:
(x − a)2 (y − b)2
+ = 1,
rx2 ry2
where (a, b) is the center, and rx , ry are the semi-major and semi-minor axes.
2. **Template Matching**: - For complex shapes, the Hough Transform can be
adapted by using a shape template. Edge points vote for transformations (e.g.,
translation, rotation, scaling) that align the template with the image edges.

24
3. **Generalized Hough Transform**: - For non-parametric or irregular shapes, a
reference table is constructed with information about the relative position of edge
points. - Votes are cast based on this table to detect the shape in the image.

Advantages of the Hough Transform: - Robust to noise and partial occlusions. -
Handles shapes parametrically or via templates.
Limitations: - Computationally expensive for high-dimensional parameter spaces.
- Sensitive to parameter discretization.

Question 22 (5 points)
1. Explain how to make the Harris Corner Detector scale-invariant.
The Harris Corner Detector is inherently not scale-invariant because it relies on a
fixed window size for evaluating local image gradients. To achieve scale invariance,
the following steps are applied:

Steps to Make the Harris Corner Detector Scale-Invariant:
1. **Incorporate Scale-Space Analysis:** - Perform the corner detection over a
scale-space representation of the image. - A scale-space representation is con-
structed by convolving the image I(x, y) with a Gaussian kernel at multiple scales
σ:
Iσ (x, y) = Gσ (x, y) ∗ I(x, y),
where Gσ (x, y) is the Gaussian kernel with standard deviation σ.
2. **Adjust the Structure Tensor:** - At each scale σ, compute the structure tensor:
 
Gσ ∗ Ix2 Gσ ∗ Ix Iy
Mσ = ,
Gσ ∗ Ix Iy Gσ ∗ Iy2

where Ix and Iy are the image derivatives in the x and y directions.


3. **Multi-Scale Corner Response:** - Compute the corner response R at each
scale using the Harris corner measure:

R = det(Mσ ) − k · trace2 (Mσ ),

where k is a constant (typically k = 0.04 to 0.06).


4. **Non-Maximum Suppression Across Scales:** - Perform non-maximum sup-
pression in both spatial and scale dimensions to find the most prominent corner
points. - A point is considered a corner if it is a local maximum in its neighborhood
across all scales.
5. **Assign the Scale to Each Corner:** - Associate the scale σ at which the corner
response R is maximized to the detected corner point.

25
Why This Makes the Detector Scale-Invariant: - By analyzing the image
across multiple scales, the detector becomes sensitive to features at varying sizes.
- The use of Gaussian smoothing ensures that details irrelevant to the scale being
analyzed are suppressed. - Non-maximum suppression across scales ensures that
corners are detected consistently, regardless of the image’s scaling.

Applications: - The scale-invariant Harris Corner Detector is often used in tasks
where images may vary in size, such as object recognition or image matching.

Question 23 (5 points)
1. Explain how bilateral and guided image filters work.

Bilateral Filter

The bilateral filter is an edge-preserving and noise-reducing filter used in image


processing. It combines spatial and intensity information to smooth the image while
preserving edges.
1. **Definition:** The output of the bilateral filter at a pixel (x, y) is given by:
1 X
Ifiltered (x, y) = I(x′ , y ′ ) · Gspatial · Gintensity ,
W (x, y)
(x′ ,y ′ )∈Ω

where:

• Ω: The neighborhood around (x, y).


• Gspatial : Gaussian kernel measuring spatial proximity.
• Gintensity : Gaussian kernel measuring intensity similarity.
• W (x, y): Normalization factor ensuring the sum of weights is 1:
X
W (x, y) = Gspatial · Gintensity .
(x′ ,y ′ )∈Ω

2. **Key Features:** - Smooths regions of similar intensity. - Preserves edges by


assigning smaller weights to pixels with significantly different intensities.
3. **Applications:** - Image denoising while retaining details. - Preprocessing for
edge detection and segmentation.

26
Guided Filter

The guided filter is a more efficient edge-preserving filter that leverages a guidance
image to refine the filtering process.
1. **Definition:** The guided filter computes the filtered output Ifiltered based on a
guidance image G, which may or may not be the same as the input image I:

Ifiltered (x, y) = a(x, y) · G(x, y) + b(x, y),

where:

• a(x, y) and b(x, y) are coefficients computed locally over a window Ω.


• These coefficients minimize the difference between the output and the input I
in a least-squares sense.

2. **Key Steps:** - Compute local statistics (mean and variance) in Ω. - Estimate


a(x, y) and b(x, y):

covΩ (G, I)
a(x, y) = , b(x, y) = µΩ (I) − a(x, y) · µΩ (G),
varΩ (G) + ϵ

where µΩ and covΩ are the mean and covariance within Ω, and ϵ is a regularization
term. - Compute the final output using the guidance image.
3. **Key Features:** - Efficient and linear in complexity. - Suitable for applications
requiring real-time performance.
4. **Applications:** - Joint image filtering. - Image matting and feathering. - High
dynamic range (HDR) compression.

Comparison of Bilateral and Guided Filters

• Edge Preservation: Both filters preserve edges, but the guided filter is more
computationally efficient.
• Guidance: The bilateral filter relies only on the input image, while the guided
filter uses an explicit guidance image.
• Flexibility: The guided filter can be tailored to specific tasks by adjusting
the guidance image.

Question 23 (5 points)
1. Explain how image interpolation works and its uses.

27
Image Interpolation

Image interpolation refers to the process of estimating the values of pixels at


non-integer locations in an image. This is commonly used when resizing, rotating,
or otherwise transforming an image.
1. **Definition:** The pixel value at a non-integer coordinate (x′ , y ′ ) is estimated
using known values of neighboring pixels.
2. **Mathematical Expression:**
X
I(x′ , y ′ ) = I(i, j) · W (x′ − i, y ′ − j),
i,j

where:

• I(i, j): Intensity of a pixel at integer location (i, j).


• W : Interpolation kernel that determines how neighboring pixels contribute.

3. **Uses of Image Interpolation:** - **Image Resizing**: To enlarge or shrink


images while maintaining visual quality. - **Geometric Transformations**: For
rotating, warping, or aligning images. - **Medical Imaging**: For reconstruct-
ing high-resolution images from low-resolution scans. - **Computer Vision Appli-
cations**: For tasks such as object detection, feature alignment, and panoramic
stitching.

2. Explain the difference between ideal, nearest-neighbor, linear, and
Gaussian reconstructions.

Ideal Reconstruction

1. **Definition:** The ideal reconstruction uses the sinc function to perfectly in-
terpolate pixel values. The sinc function is the Fourier transform of a rectangular
function in the frequency domain.
X
Iideal (x′ , y ′ ) = I(i, j) · sinc(x′ − i) · sinc(y ′ − j).
i,j

2. **Key Features:** - Perfect reconstruction in theory but impractical due to


infinite kernel size. - Sensitive to aliasing.

Nearest-Neighbor Interpolation

1. **Definition:** Nearest-neighbor interpolation assigns the value of the nearest


pixel to the non-integer location:

INN (x′ , y ′ ) = I round(x′ ), round(y ′ ) .




28
2. **Key Features:** - Fast and computationally efficient. - Produces blocky and
jagged artifacts, especially for enlargements.

Linear Interpolation

1. **Definition:** Linear interpolation estimates the pixel value by linearly weight-


ing the four closest pixels:
X
Ilinear (x′ , y ′ ) = I(i, j) · (1 − |x′ − i|) · (1 − |y ′ − j|).
i,j

2. **Key Features:** - Balances computational efficiency and quality. - Produces


smoother results compared to nearest-neighbor interpolation.

Gaussian Reconstruction

1. **Definition:** Gaussian reconstruction uses a Gaussian kernel to weigh neigh-


boring pixels based on their distance:

(x′ − i)2 + (y ′ − j)2


 
′ ′ 1 X
IGaussian (x , y ) = I(i, j) · exp − ,
W i,j 2σ 2

where W is a normalization factor.


2. **Key Features:** - Provides smooth results. - Preserves edges better than linear
interpolation for appropriate kernel size σ. - Computationally more expensive.

Comparison of Methods

Method Advantages Disadvantages


Ideal Reconstruction Perfect quality in theory Impractical for real-time
Nearest-Neighbor Fast and simple Produces blocky artifacts
Linear Interpolation Smooth results, computationally efficient May blur edges
Gaussian Reconstruction Smooth and edge-preserving Computationally expensi

29
Examination #1

CS 766: Computer Vision

October 21, 2004

Last (Family) Name: _________SOLUTIONS______________

First Name: ___________________________________________

Problem Score Out of


____________________________

1 _______ 15

2 _______ 15

3 _______ 20

4 _______ 10

5 _______ 15

6 _______ 10

Total _______ 85
CS 766 Exam #1 Fall 2004

1. [15] Camera Calibration


The relationship between a 3D point at world coordinates (X,Y,Z) and its corresponding 2D
pixel at image coordinates (u,v) can be defined as a projective transformation using a 3 × 4
camera projection matrix P.

(a) [3] Can the matrix P incorporate any lens distortions that might be in the camera?
Briefly explain.

No because lens distortions are usually highly nonlinear,


which cannot be represented in P.

(b) [4] Give two lists, one specifying the intrinsic camera parameters and the other
giving the extrinsic camera parameters.

The five intrinsic parameters are the focal length, f,


principal point coordinates, (u0, v0), and the pixel size
scaling parameters, ku, kv.
The six extrinsic parameters are the three translational and
three rotational parameters that define the rigid body motion
between the world coordinate frame and the camera coordinate
frame.

(c) [4] Show how P can be decomposed into a product of matrices that contain
elements expressed in terms of the intrinsic and extrinsic camera parameters.

 fk u 0 u 0   r11 r12 r13 tx 


P = K[R | T] =  0 fk v v0   r21 r22 r23 t y 
 0 0 1   r31 r32 r33 t z 

(d) [4] Give the main steps of an algorithm for computing the matrix P from a single
image of a known 3D “calibration object.”

Since P is a 3 x 4 matrix with 11 unknowns (the overall scale of P


does not matter), observing a 3D scene containing at least 6 known
points (in a non-degenerate configuration) is sufficient. Each point
gives a pair of equations of the form
u = su/s = (p11X + p12Y + p13Z + p14) / (p31X + p32Y + p33Z + p34)
v = sv/s = (p21X + p22Y + p23Z + p24) / (p31X + p32Y + p33Z + p34)
Rewrite these 12 equations in the form Ap = 0, where p is a 12 x 1
vector of unknowns, and A is a 12 x 12 matrix of coefficients. Use
least squares to solve for p by computing the eigenvector
corresponding to the smallest eigenvalue of ATA. Note: This is only an
approximate solution and often this is then used as a starting point
for a nonlinear optimization.

2 of 8
CS 766 Exam #1 Fall 2004

2. [15] Planar Transformations


The planar façade of a building is captured in an image taken by a camera. Assume this
plane corresponds to the world coordinate frame’s Z=0 plane, and scene point (X,Y) on the
building projects to image pixel coordinates (u,v).

(a) [3] What is the planar projective transformation that describes the relationship
between (X,Y) and (u,v)? Give your answer using homogeneous coordinates.

 su   p11 p12 p13   X 


 sv  =  p p 22 p 23   Y 
   21
 s   p31 p32 p33   1 

(b) [1] How many degrees of freedom does this transformation have?

(c) [1] How many point correspondences are required to determine this transformation?

(d) [2] Would having more correspondences than your answer to (c) be helpful in any
way? If no, briefly explain why not. If yes, explain how they could be used.

Yes, if more points are available, the calibration accuracy can be


improved by using linear least squares.

(e) [2] Give one invariant of a planar projective transformation.

Invariants include concurrency, collinearity, order of contact,


tangent discontinuities and cusps, cross-ratio of 4 collinear
points, and measurements in a canonical frame.

(f) [2] Give one invariant of a planar affine transformation that is not an invariant for a
planar projective transformation.

Invariants include parallelism, ratio of areas, ratio of lengths on


collinear or parallel lines (e.g., midpoints).

(g) [4] If the building has sets of lines on it running parallel to both the X and Y axes,
how could we use the corresponding lines in the image to determine if the building
plane is parallel to the image plane?

If both sets have their vanishing point at infinity then the


building is fronto-parallel.

3 of 8
CS 766 Exam #1 Fall 2004

3. [20] Edge Detection


(a) [5] Show how an approximation to the first derivative of an image can be obtained by
convolving the image with the kernel [1 −1] where the image is defined as

[56 64 79 98 115 126 132 133]

Ignore computing a value for the first and last image pixels (in other words, your result
will be 6 values). In addition to showing the result of the convolution, indicate where
edges would be detected and why.

8 15 19 17 11 6 1

Intensity discontinuities occur at the maxima of the first


derivative, which this approximates. Here the maximum of 19
corresponds to the position where an edge is detected,
corresponding to a position between the pixels with
intensities 79 and 98 (or associated with one of these two
pixels).

(b) [6] What property of the coefficients of a kernel ensures that an appropriate output is
obtained for regions of constant intensity in an image when

(i) The kernel is approximating a first derivative.

The kernel values sum to 0.

(ii) The kernel is approximating a second derivative.

The kernel values sum to 0.

(iii) The kernel is approximating a Gaussian.

The kernel values sum to 1 (after normalization).

(c) [4] Describe a major advantage or disadvantage of using an isotropic operator instead
of a non-isotropic operator with respect to the following issues:
(i) Computational efficiency.

Isotropic operators such as the Laplacian of a Gaussian are


generally more computationally efficient than non-isotropic
operators because they can be implemented using a single convolution
(or multiplication in the frequency domain) followed by zero-
crossing detection. Directional operators such as the Canny
operator require a search for a local maxima.

4 of 8
CS 766 Exam #1 Fall 2004

(ii) Noise in the image.

Directional operators are generally more robust to noise by


smoothing pixels which are on only one side of an edge. Also, an
operator which uses higher order derivatives is more sensitive to
noise.

(d) [5] What is the purpose of (i) non-maxima suppression and of (ii) hysteresis that are
done in the Canny edge detector?

(i) Non-maxima suppression thins responses so that only a


single pixel in the gradient direction will be detected, thus
produces a one-pixel wide sequence of edge points.
(ii) Hysteresis thresholding simultaneously attempts to
eliminate weak edge points that are caused by noise while
filling in gaps between strong edge points where only weak
edge response is detected.

5 of 8
CS 766 Exam #1 Fall 2004

4. [10] Corner Detection


(a) [2] When would detecting corners be more appropriate than detecting edges as an
initial step in an application using computer vision?

Detecting corners would be more appropriate when only a sparse set


of points are needed, especially facilitating the detection of
corresponding points in multiple images. This is used, for example,
in stereopsis and motion tracking.

(b) [4] The Harris corner detection algorithm computes a 2 × 2 matrix at each pixel based
on the first derivatives at that point and then computes the two eigenvalues of the
matrix, λ1 and λ2, where λ1 ≤ λ2. How can these two values be used to label each pixel
as either a locally smooth region (S), an edge point (E), or a corner point (C)? Give
your answer by specifying “Label pixel S if …”, “Label pixel E if …” and “Label pixel C if
…”

The Harris corner detector first computes R = λ1λ2 –k(λ1 + λ2)2 and
then
labels a pixel S if |R| ≈ 0 (or, alternatively, λ1 ≈ λ2 ≈ 0);
labels a pixel E if R < T1 < 0 (or, alternatively, λ1 ≈ 0
(corresponding to the direction of the edge) and λ2 is large
(corresponding to the normal direction at the edge)); or
labels a pixel C if R > T2 > 0 (or, alternatively, λ1 and λ1 are both
large).

(c) [4] Given the two eigenvalues specified in (b), explain in English the rationale and
difference(s) between detecting corner points using the criterion λ1 > T1 (which is used
in the Tomasi and Kanade algorithm) versus the criterion λ1λ2 > T2 (which is used in
the Harris algorithm), where T1 and T2 are appropriate thresholds.

Since λ1 ≤ λ2, λ2 corresponds to the direction of maximum change in


intensity and λ1 corresponds to the direction of minimum change in
intensity. Therefore, the criterion λ1 > T1 is used to test if
there is more than one direction with a strong edge present. The
criterion λ1λ2 > T2 could hold if λ2 >> λ1 and yet λ1 is relatively
small compared to λ2. The criterion λ1 > T1 ensures that both
eigenvalues are large as defined by T1. Ideally, we’d like to
ensure that both eigenvalues are large and they are of similar size.

6 of 8
CS 766 Exam #1 Fall 2004

5. [15] Active Contours


The energy functional that is used with active contours (snakes) usually contains the three
terms: Econtinuity, Esmoothness, and Eimage (the first two terms are often combined to define a term
called Eint ).

(a) [9] For each of these three terms explain briefly what it measures, how it is
defined, and what happens if the term is omitted.

Econtinuity = dv/ds2 measures the degree of rigidity or elasticity of


the contour. If it is omitted, there can be discontinuities along
the contour. Removing it can also cause the snaxels to “bunch up.”

Esmoothness = d2v/ds22 measures the degree of bending or stiffness of


the contour. Omitting it allows tangent discontinuities along the
contour. This may cause the snake to become quite jagged and may
overfit noisy data.

Eimage measures features in the image that act as either attraction or


repulsion forces on the contour. For example, using edge magnitude
as a term makes the contour attracted to edges in the image.
Omitting this term means that the evolution of the snake will not be
influenced by the image data at all; with usual definitions of the
total energy functional this will mean that the contour will shrink
to a point or a line.

(b) [3] What is the effect of giving a negative weight to Econtinuity?

This will cause snaxels to bunch up and tend to cause the contour to
expand.

(c) [3] How could the energy functional definition be specified so as to cause the
contour to expand?

One approach would be to modify Eimage so that it also includes a


“balloon” force in the normal direction of the contour at each
snaxel point.

A second approach, given in the online reading, is to run the snake


algorithm, then segment the resulting contour to eliminate high
energy segments. Grow each end of each segment in the direction of
its tangents; then run the snake algorithm on each segment.
Finally, merge all of the final segments.

7 of 8
CS 766 Exam #1 Fall 2004

6. [10] Segmentation using Normalized Graph Cut


(a) [3] Define cut(A,B) and explain intuitively what it measures.

cut ( A, B) = ∑ aff (i, j ) measures


i∈ A, j∈B
the similarity between two groups

of pixels defined by the disjoint sets A and B.

(b) [5] Say we want to find “regions” corresponding to object boundaries by grouping
connected chains of “strong” edge points that are adjacent and their gradients
imply a smooth contour. Define an affinity measure for this purpose that uses the
magnitude, mag(∇I(p)), and direction, direc(∇I(p)), of the gradient, ∇I, at a pixel p in
image I to compute affedge(x,y) for a pair of adjacent pixels x and y. Include any
additional computational steps before computing the affinity that make sense for
obtaining good results. Briefly explain your definition.

Intuitively, if there is an edge between pixels x and y, then


x and y are on opposite sides of the edge and the
dissimilarity should be high, meaning the affinity should be
low. On the other hand, if the edge directions at both x and
y are nearly the same, then x and y are likely on the same
contour and the affinity should be high. Thus affinity can be
defined by measuring the angle between the gradient directions
at the two pixels, weighted by the gradient magnitude. If the
angle is near 180 degrees the affinity is low and if it is
near 0 degrees, then the affinity is high. One way to do this
is to compute the dot product of the two gradient vectors,
∇I(x) ⋅ ∇I(y), which computes the cosine of the angle between
these vectors. Non-maxima suppression should be done first so
that “thick” contours are not created.

(c) [2] Say we want to segment images into regions of uniform texture using a
measure of the texture at each pixel by computing some function over a (2n+1) ×
(2n+1) neighborhood centered on the pixel. Briefly explain why using this function
to define the texture affinity between two pixels, afftexture(x,y), will be biased
depending on how close the pixels are to a region boundary.

Using a fixed window to compute the texture at a pixel will be


“polluted” when the pixel is near a texture boundary because
some of the pixels in the window will be in one region and
other pixels will be in a different region, each with their
own texture characteristics.

8 of 8
Student Name: ____________________________ Student Number: _________________

CSCE460301 - Fundamental of Computer Vision (Fall 2024)

Final Exam - Practice

EXAMINATION INSTRUCTIONS:
· Do not turn this page until asked to do so.
· Exam time is 120 minutes.
· Write your name and Number in the indicated places.
· Put the answers on the same question sheet, do not use any additional papers, even for
scratch.
· Write all steps used to get the final answers.
· Read the exam instructions carefully.
· This exam is for assessment purposes only, and will not be available for review.
· Read the honesty policy and sign the following statement.
· There are 15 pages including this page. Please make sure your exam has these many pages.

Academic Integrity Policy


By signing the line below, you acknowledge that you will uphold the academic integrity standards of
AUC.
Violation of the policy incurs consequences that may include but are not limited to failing the course.
I am presenting this exam as entirely my effort.
Signature: _______________

▶ Page 1 | 11
________________________________________________________________________________________

QUESTION 1 1 point
What is a homography between two images?

Homography is a technique for transformation and it is computed between


two images of the same planar surface to describe how one point in one image corresponds
to the other image

QUESTION 2 2 points

Explain the distinction between motion field and optical flow in an image. Use a diagram to illustrate
your answer.

Ideally, Optical Flow is equal to the Motion field. However, the optical flow is more
of the motion of the brightness patterns in one image (the velocity of the brightness pixels),
while the motion field is the actual velocity of an the object has moved from to
that is being compared to.

QUESTION 3 4 points

Using appropriate mathematical expressions, define the following operations commonly used in
computer vision and briefly explain their function and applications:
a. Convolution
b. Correlation
c. edge detection by second-derivative zero-crossings
d. invariant transform

Convolution => filters like sobel, median, blurring...


Correlation => pattern recongition, Template matching
zero crossing => detecting high frequencies such as edges after second moment derivatives
Variant transform:

Invariant Transform or SIFT is a feature descriptor or detector where it is used in


detecting keypoints or unique blobs in an image so we can match the features from
one image to another. it is called invariant because it wont matter if the image got transofrmed:
sheared, translated, or rotated, feature will still be detected and matched successfully.

 Page 2 | 15
___________________________________________________________________________

QUESTION 4 6 points

Explain the notion of scale-space and how it is used in various areas of computer vision. Include the
following:
a. Pyramidal representations of image structure across successive scales of blurred
undersampling.
b. Edge detection operators that extract edges at particular scales of analysis, but not at others.
c. The behaviour of zero-crossings, their trajectories and “fingerprints” in scale-space.

a. First we get the pyramidal representations of the images using gaussian blurring and then
undersampling it. this is so useful in application like Hybrid image such as overlaying two aligned
shapes on top of each others while having one retains either frequencies. anotehr application is
the object detection where it is useful to try laplacian filters with different sigmas on different
image scale so we can find the maximum response.

[Link] Canny edge detector for example, this operator is used to retain the high frequencies
in an image after getting the gradients in both direction and retain the magnitude. So, it is a
powerful operator that is used to detect boundaries and edges using certain thresholds.

c. zero crossing is leverged using the Laplacian of the Gaussian operator which is resembling
the second derivative of an edge detector. it is widely used application is feature detection as well.
whenever the intensity in images changes, those trajectories appear more.
Zero-crossings are used in scale-space to detect edges by identifying points
where the image intensity undergoes a significant change in curvature.
In scale-space, these zero-crossings are tracked across different scales.
The trajectories of zero-crossings provide information about the object boundaries
at different scales and can reveal features that are robust across scales

QUESTION 5 2 points

Why is it important that images are not under-sampled if you wish to accurately track the images of
objects?

Using any technique of tracking objects will be through detecting keypoints or descriptors
like SIFT or using harris operator. When an image is undersampled, it becomes harder
for those appraoches to extract those features

undersamples images do lose alot of details and the target her is to find distinctive
features that can we use for tracking the motio of an object

 Page 3 | 15
________________________________________________________________________________________

QUESTION 6 3 points

The Harris corner detection algorithm computes a 2 × 2 matrix at each pixel based on the first
derivatives at that point and then computes the two eigenvalues of the matrix, λ 1 and λ2, where λ1 ≤ λ2.
How can these two values be used to label each pixel as either a locally smooth region (S), an edge
point (E), or a corner point (C)? Give your answer by specifying “Label pixel S if …”, “Label pixel E if
…” and “Label pixel C if …”

There are three labels wherther an flat region, an edge, or a corner which is the goal to find.
if the Lambda 1 and lambda 2 are both very low, then they are labeled as a flat area region.
if lamdba 1 and lambda 2 are significantly high and are close to each others in value, then
this si definitely a corner point. the last case is when one eigen value lambda 1 is significantly
higher than the otehr, this is an egde point.

QUESTION 7 4 points
Given the two eigenvalues specified in (QUESTION 6), explain in English the rationale and
difference(s) between detecting corner points using the criterion λ1 > T1 versus the criterion λ1λ2 > T2,
where T1 and T2 are appropriate thresholds?

If there is lambda1 > T1, that only detects one eigen value whether it is larger than the threshold
or not. if using lambda12>T2 is ddetecting whether the product is producing the expected
product when the two eigen values are near each other and signifciantly high.

 Page 4 | 15
___________________________________________________________________________

QUESTION 8 4 points

a. Explain the key principles underlying the Scale-Invariant Feature Transform (SIFT).
b. What is it used for?
c. What goals does it achieve?
d. How does it achieve them?

SIFT is a keypoints detector and descriptor that detects the blobs.


it is used for blob detection and feature matching where one feature is present in two images.
how? by using the difference of gaussian operator which is an optimized version of laplacian
of gaussian with a sigma squared. where it produces a laplacian level in different scale of an image
and get the features of that image . additioanlly a histogram for every image is constructed then
the orientation is being normalized and extracted for each pixel. then a 4x4 batch size is being divide
on he image so every 4x4 is in one batch with different orientations each with weights and unique bins
depedning on their angle

 Page 5 | 15
________________________________________________________________________________________

QUESTION 9 3 points
What special difficulties would you expect to have in computing optical flow given
a. Spatially aliased images?
b. Temporally aliased image sequence?
c. What can you do to deal with these situations?

a. Spatially aliased images would affect the optical flow veclocities of the brightness pixels
to be detected successfully since the pixels themselves would be distorted/ loss
of motion details. this alias happens because of the sampling rate is two low.
solutions: high sampling rate and multi scale on the same images using pyramids to detect
flow on

b. it is also result of limited frame rate. this can cause the motion to miss some keypoints
that was already its velocity changed from one place to another.
Solution is to increase the frame rate as well.

 Page 6 | 15
___________________________________________________________________________

QUESTION 10 4 points
For an image I(x,y), define its gradient vector field ∇ I(x,y).
a. Why is this vector field a useful thing to compute?
b. Define the gradient magnitude over the image plane (x,y).
c. Define the gradient direction over the image plane (x,y).
d. Explain how the gradient vector field is used in the Canny edge detector, the main
steps in its use, and its advantages.

a. The gradient vector will be useful for defining the gradient in teh ydirection and xdirection
where we can then use it in edge detection in canny for example or object detection or even in
corner detection by harris operation.

[Link] gradient magnitude is square root of x swuared + y swuared to result in all the edges in all directions

c. the gradient direction is tan^-1 of (di/dy/di/dx)^2

[Link] the image is being blurred by gaussian operator, then the gradient in x and y is obtained by usinh
some edge detection filter in both directions (fisrt derivative of gaussian). then the magnitude is
obtained to get the edge in both orientations, its advantages is that is its a robust approach for detecting
texture, features and even used as part of corner detection.

 Page 7 | 15
________________________________________________________________________________________

QUESTION 11 5 points

a. Define the gradient vector field ∇2 f(x,y) over an image f(x,y), and explain what makes
it useful. Contrast its features and capabilities with the ∇2Gσ(x,y) (Laplacian of a
Gaussian) operator shown below.
b. Identify their respective orders as differential operators,
c. explain how they can be implemented, and
d. discuss any neurobiological analogues for both.

a. this gradient vector field is the laplacian of the gaussian filter (Second derivative) with derivative
of the gaussian. derivative of the gaussian or LOG both detect edges pretty good. but the LOG
is more robust and accurate since it uses the zero crossing as edge detector threshold.

b. Laplacian is the second order derivative while the otehr is first order.

c. The first one is taking the laplacian operator and apply it on the gaussian filter.
the second is applying any filter of the edge detecttion such as sobel or prewitt on the gaussian direct

[Link] retina performs similar tasks through its network of retinal ganglion cells,
which process visual information before sending it to the brain similar to the laplacian.

 Page 8 | 15
___________________________________________________________________________

QUESTION 12 2 points
Suppose you were trying to design a machine vision system based as closely as possible upon
human vision; for example, perhaps a visual prosthesis for the blind.
a. Would you aim to include, as design goals, the standard geometric visual illusions as
well?
b. If such errors of vision emerged as unintended consequences of your design, would
you consider them to be features, or bugs, and why?

a. yes! i would. since i am planning

QUESTION 13 3 points
a. Provide a 3 × 3 discrete filter kernel array that approximates the Laplacian operator.
b. Explain what the Laplacian might be used for,
c. What is the significance of the sum of all of the coefficients in the filter?

0 -1 0
-1 4 -1 -1 -1 -1
-1 8 -1 the laplacian is the residual block that can retain the
0 -1 0 high frequencies of the image. it used in many ways:
-1 -1 -1
edge detection where i can apply laplacian to the
gaussian, image pyramid where i can reconstruct
the image by upscaling and add teh residual block to
it. it is also commonly used in sift detector where it can
be implemented on different scales and see the max
responses.
So i can check teh zero sum property. and to see if it is sum upto zero or not to check if it needs to
be normalized or does it maintain teh overall intensity or not.

 Page 9 | 15
________________________________________________________________________________________

QUESTION 14 2 points
Extraction of visual features from images often involves convolution with filters that are
themselves constructed from combinations of differential operators.
One example is the Laplacian of a Gaussian Gσ(x,y) having scale parameter σ, generating the
filter ∇2Gσ(x,y) for convolution with the image I(x,y). Explain in detail each of the following
three operator sequences, where ∗ signifies two-dimensional convolution.
a. ∇2 [Gσ(x,y) ∗ I(x,y)]
b. Gσ(x,y) ∗∇2I(x,y)
c. What are the differences amongst their effects on the image?

a. fisrt gaussian of the image then laplacian


b. fisrt laplacian then gaussian
c. their effect on them images ar ethe same since they are convolved they are having the
commutative property it wont matter which convolved first.

 Page 10 | 15
___________________________________________________________________________

QUESTION 15 2 points
Explain the notion of scale-space and how it is used in various areas of computer vision.
Include the following:
(a) Pyramidal representations of image structure across successive scales of blurred
undersampling.
(b) Edge detection operators that extract edges at particular scales of analysis, but
not at others.

a. This is used in different computer vision areas such as hybrid images for visualizing
the different scal of the image ro assess how good it was merged together. additionally, it is
used to extract different laplacian residuals blocks that is used in retaining the high frequencies
of the images. it is also used in blob and feature detection by SIFT where the different scales is
then been applied laplacian to it with different sigmas so we can see the maximum resonse
of the blov

b.

 Page 11 | 15
________________________________________________________________________________________

QUESTION 16 4 points
a. Describe the Canny edge detection algorithm, including how the edge strength and
orientation is calculated.
b. What is the spatial orientation tensor and how is it used?

Canny edge detector first smoothes the image with a sigma specific. then the gradient is
obtained in the x and y directions. after that the magnitude is extracted and the orientations.
for making it more efficeint, canny uses non maxiimum suppression to retain only the thin
edges eliminating the other noises. also we are using histerises thresholfing where there are
two thresholds on high and one low to detect the strong edges and retain the weak edges
if it is only connected to one of the strong edges

 Page 12 | 15
___________________________________________________________________________

QUESTION 17 4 points
Let the following matrix H denote a general 2D planar transformation:

ℎ11 ℎ12 ℎ13


𝐻 = [ℎ21 ℎ22 ℎ23 ]
ℎ31 ℎ32 33

For each of the following special cases:


a Euclidean
b Similarity
c Affine
d Projective
Write down,
i. the parameterization of the hij’s,
ii. the number of degrees of freedom, and
iii. the number of correspondences needed to estimate H.

a.
i. It has two: rotation and translation
degree of freedom: 3 theta and 2 for translaton

b.
i. Scaling + rotation + translation
dof: 4

Affine:Scaling, shearing, translation, and rotation:


DOF: 6

Projective : 8 dofs

 Page 13 | 15
________________________________________________________________________________________

QUESTION 18 5 points
a. Describe when it would be suitable to use the Prewitt masks.
b. Use a diagram to illustrate these masks.
c. Sketch the result when one of these masks is applied to the greyscale image below.
You only need to compute the values for the central region of the image that is fully
covered by the mask.
6 7 7 8 8 9

6 7 8 34 38 39

7 8 9 36 38 39

7 6 36 35 37 40

6 6 8 36 18 39

3 7 8 35 20 39

It is suitable to use prewitt masks when we need to retain the edges in the y direction and x firections as shown below:

prewitt vertical:
1 0 -1
1 0 -1
1 0 -1

 Page 14 | 15
___________________________________________________________________________

[This page has been intentionally left blank]

 Page 15 | 15

You might also like