Artificial Neural Network Tutorial
Instructor: Young H. Cho, Ph.D.
This tutorial is a fast introduction to the most commonly used form of artificial neural
network, written for readers with minimal prior knowledge of the algorithm. Because students
come to this tutorial with different levels of familiarity with the prerequisite topics below,
review them as needed before you begin.
Python
o Official Python Documentation – [Link]
o Real Python Tutorials – [Link]
o W3Schools Python Tutorial – [Link]
NumPy
o NumPy Official Documentation – [Link]
o NumPy Tutorial (W3Schools) – [Link]
o SciPy/NumPy Guide – [Link]
Artificial Neural Network Basics
o Neural Networks and Deep Learning (Michael Nielsen, free book) –
[Link]
o Deep Learning Specialization (Coursera, Andrew Ng) –
[Link]
o CS231n: Convolutional Neural Networks for Visual Recognition –
[Link]
PART – I: Basics
1. Crash Introduction of ANN-ReLU network
In this example, we will build a basic Multi-Layer Perceptron. This is one of the most common
types of artificial neural networks (ANN) used today. The perceptron was the earliest model of a
neuron (1950s, Rosenblatt). It takes inputs, multiplies each by a weight, sums them up with a bias,
and applies an activation function (e.g., step, sigmoid, ReLU).
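As a minimal sketch of the perceptron described above, the snippet below uses a step activation and hand-picked weights (an assumption for illustration, not learned values) that make it behave like a logical AND gate.

```python
import numpy as np

def perceptron(x, w, b):
    """Weighted sum of the inputs plus a bias, followed by a step activation."""
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

# Hypothetical weights that make the perceptron act as a logical AND gate.
w = np.array([1.0, 1.0])
b = -1.5

outputs = [perceptron(np.array(x, dtype=float), w, b)
           for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [0, 0, 0, 1]
```

Only the input (1, 1) pushes the weighted sum above zero, so only that input fires.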
An MLP is just many perceptrons stacked in layers:
Input layer → receives raw data.
Hidden layers → apply weighted sums and nonlinear activations.
Output layer → produces predictions (e.g., regression values, classification
probabilities).
An example structure may look like the following:
A single perceptron can only separate things with a straight line (linear decision boundary). By
adding hidden layers with nonlinear activations, MLPs can represent complex, nonlinear
relationships. The more layers/neurons, the more powerful (but also harder to train).
When you hear MLP, it usually means a feedforward neural network (no recurrence, no
convolution). Training is done via backpropagation + gradient descent. The network is called fully
connected (every neuron in one layer connects to every neuron in the next).
2. First MLP Code
If we wanted to classify points in 2D space (x, y) into 2 classes, an MLP might look like:
Input layer: 2 nodes
Hidden layer 1: 4 nodes with ReLU
Hidden layer 2: 4 nodes with ReLU
Output layer: 2 nodes (probability of Class A vs Class B)
For gradient descent, we will implement the Stochastic Gradient Descent (SGD) algorithm. SGD
is an iterative optimization algorithm: it starts from randomly initialized weights, then
repeatedly computes the gradient of the loss on a small random batch of training samples and
moves the weights a small step in the opposite direction, which reduces the loss.
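The update rule can be sketched on a toy problem. The example below is an illustration under stated assumptions: the data (y = 3x + 1), the batch size, and the learning rate are all made up, and the model is a single linear unit rather than an MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): y = 3x + 1 with no noise.
x = rng.uniform(-1, 1, 200)
y = 3.0 * x + 1.0

w, b = rng.normal(), rng.normal()   # random starting weights
lr = 0.1                            # learning rate (step size)

for step in range(2000):
    i = rng.integers(0, len(x), size=16)   # random mini-batch
    err = (w * x[i] + b) - y[i]            # prediction error on the batch
    grad_w = np.mean(err * x[i])           # d(0.5*err^2)/dw, averaged
    grad_b = np.mean(err)                  # d(0.5*err^2)/db, averaged
    w -= lr * grad_w                       # step against the gradient
    b -= lr * grad_b

print(round(w, 2), round(b, 2))            # should land close to 3.0 and 1.0
```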
Since gradient descent needs a differentiable loss function, we will use Mean Squared Error
(MSE) to compute the “loss” between the desired output (the target value from the training
data) and the predicted output computed with the current weights, which the algorithm then
updates.
MSE is a loss function for regression tasks:
Measures the average squared difference between predicted outputs and true outputs.
Lower MSE → predictions closer to true values.
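A small worked example of MSE, using made-up target and prediction values:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

# Squared errors are 0.25, 0.0 and 1.0; their average is the MSE.
mse = np.mean((y_pred - y_true) ** 2)
print(mse)  # (0.25 + 0.0 + 1.0) / 3, about 0.4167
```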
In a supervised learning task where the goal is to assign inputs to discrete classes, a
different loss is usually a better fit. For example, predicting whether an image is a cat, dog,
or rabbit typically uses softmax + cross-entropy as the loss. Softmax converts raw outputs into
probabilities over classes, while cross-entropy measures how well the predicted probabilities
match the true class labels.
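To make the softmax + cross-entropy pairing concrete, here is a minimal sketch for a single sample. The three logits are made-up scores for cat, dog, and rabbit, and the true class is assumed to be cat (index 0).

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])   # raw scores for cat, dog, rabbit (made up)
true_class = 0                       # assume the image is a cat

shifted = logits - logits.max()                   # subtract the max for stability
probs = np.exp(shifted) / np.exp(shifted).sum()   # softmax
ce = -np.log(probs[true_class])                   # cross-entropy for this sample

print(probs.round(3), round(ce, 3))  # roughly [0.659 0.242 0.099] and 0.417
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1.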
To better understand this type of network, run the following example code on your machine with
Python and the NumPy library. It implements a minimal 2-layer MLP (one hidden layer plus an
output layer) with SGD and backprop.
import math, random
import numpy as np
def relu(x):
    return np.maximum(0.0, x)
def d_relu(x):
    return (x > 0).astype(np.float32)
def softmax(logits):
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)
def one_hot(y, num_classes):
    Y = np.zeros((y.size, num_classes))
    Y[np.arange(y.size), y] = 1.0
    return Y
class MLP2:
    """
    2-layer MLP:
        hidden = ReLU(X @ W1 + b1)
        out = hidden @ W2 + b2
    mode: 'regression' (MSE) or 'classification' (softmax CE)
    """
    def __init__(self, in_dim, hidden_dim, out_dim, mode='regression', seed=0):
        rng = np.random.default_rng(seed)
        k1 = math.sqrt(2.0/in_dim)  # He init for ReLU
        k2 = math.sqrt(2.0/hidden_dim)
        self.W1 = rng.normal(0, k1, (in_dim, hidden_dim))
        self.b1 = np.zeros((1, hidden_dim))
        self.W2 = rng.normal(0, k2, (hidden_dim, out_dim))
        self.b2 = np.zeros((1, out_dim))
        assert mode in ('regression', 'classification')
        self.mode = mode
    def forward(self, X):
        self.X = X
        self.z1 = X @ self.W1 + self.b1
        self.h1 = relu(self.z1)
        self.z2 = self.h1 @ self.W2 + self.b2
        if self.mode == 'classification':
            self.probs = softmax(self.z2)
            return self.probs
        return self.z2  # regression raw output
    def loss(self, y):
        if self.mode == 'regression':
            # y: shape (N, out_dim)
            diff = self.z2 - y
            return 0.5 * np.mean(np.sum(diff*diff, axis=1))
        else:
            # y: integer labels shape (N,)
            N = y.shape[0]
            # cross-entropy
            logp = -np.log(self.probs[np.arange(N), y] + 1e-12)
            return np.mean(logp)
    def backward(self, y):
        N = y.shape[0]
        if self.mode == 'regression':
            # dL/dz2 = (z2 - y)
            dz2 = (self.z2 - y) / N
        else:
            # dL/dz2 = probs - onehot(y)
            Y = one_hot(y, self.probs.shape[1])
            dz2 = (self.probs - Y) / N
        # Gradients for layer 2
        dW2 = self.h1.T @ dz2
        db2 = dz2.sum(axis=0, keepdims=True)
        # Backprop through ReLU
        dh1 = dz2 @ self.W2.T
        dz1 = dh1 * d_relu(self.z1)
        # Gradients for layer 1
        dW1 = self.X.T @ dz1
        db1 = dz1.sum(axis=0, keepdims=True)
        return dW1, db1, dW2, db2
    def step(self, grads, lr=1e-2, weight_decay=0.0):
        dW1, db1, dW2, db2 = grads
        # L2 regularization on weights (not biases)
        if weight_decay > 0:
            dW1 += weight_decay * self.W1
            dW2 += weight_decay * self.W2
        self.W1 -= lr * dW1
        self.b1 -= lr * db1
        self.W2 -= lr * dW2
        self.b2 -= lr * db2
def batch_iter(X, y, batch_size, shuffle=True, seed=0):
    rng = np.random.default_rng(seed)
    idx = np.arange(X.shape[0])
    if shuffle:
        rng.shuffle(idx)
    for i in range(0, len(idx), batch_size):
        j = idx[i:i+batch_size]
        yield X[j], y[j]
Training Function:
def train(model, X, y, epochs=200, lr=1e-2, batch_size=64, weight_decay=0.0,
          verbose=True):
    losses = []
    for e in range(1, epochs+1):
        for Xb, yb in batch_iter(X, y, batch_size, shuffle=True, seed=e):
            model.forward(Xb)
            grads = model.backward(yb)
            model.step(grads, lr=lr, weight_decay=weight_decay)
        # track loss full-batch
        model.forward(X)
        L = model.loss(y)
        losses.append(L)
        if verbose and (e % max(1, epochs//10) == 0):
            print(f"epoch {e:4d}/{epochs} loss={L:.4f}")
    return losses
The above code implements a two-layer fully connected neural network and its training function
using only basic NumPy functions. The input layer takes data (features). The hidden layer
applies a linear transform, then passes the result through a ReLU activation. The output layer
applies another linear transform. Depending on the mode, the raw outputs are used directly
(regression) or passed through softmax to produce class probabilities (classification).
The basic training tasks are:
1. Forward pass: compute predictions.
2. Loss: measure prediction error.
3. Backward pass: compute gradients using the chain rule.
4. Step: update weights with SGD.
5. Repeat over epochs.
Step by Step Breakdown
def relu(x):
    return np.maximum(0.0, x)
def d_relu(x):
    return (x > 0).astype(np.float32)
def softmax(logits):
    logits = logits - logits.max(axis=1, keepdims=True)  # stability trick
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)
def one_hot(y, num_classes):
    Y = np.zeros((y.size, num_classes))
    Y[np.arange(y.size), y] = 1.0
    return Y
ReLU: Rectified Linear Unit, ReLU(z) = max(0, z).
It introduces non-linearity so networks can approximate complex functions.
d_relu: derivative of ReLU: 1 if z > 0, else 0. Used in backprop.
softmax: converts raw scores (“logits”) into probabilities that sum to 1.
one_hot: turns class labels into vectors (needed for cross-entropy loss).
class MLP2:
    def __init__(self, in_dim, hidden_dim, out_dim, mode='regression', seed=0):
        rng = np.random.default_rng(seed)
        k1 = math.sqrt(2.0/in_dim)  # He init for ReLU
        k2 = math.sqrt(2.0/hidden_dim)
        self.W1 = rng.normal(0, k1, (in_dim, hidden_dim))
        self.b1 = np.zeros((1, hidden_dim))
        self.W2 = rng.normal(0, k2, (hidden_dim, out_dim))
        self.b2 = np.zeros((1, out_dim))
        self.mode = mode
Architecture: input → hidden → output.
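Initialization: “He initialization” for ReLU layers → samples weights from a zero-mean normal
distribution with variance 2/n_in, where n_in is the number of inputs to the layer.
This keeps the scale of activations and gradients stable from layer to layer.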
Mode: 'regression' or 'classification' determines how we interpret outputs.
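The effect of the initialization scale can be checked numerically. The sketch below is an illustration only: the layer sizes and the "naive" standard deviation of 0.01 are assumptions chosen for comparison, not values from the tutorial code.

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, hidden_dim, n = 256, 256, 2000

X = rng.normal(size=(n, in_dim))                      # unit-variance inputs
W_he = rng.normal(0, np.sqrt(2.0 / in_dim), (in_dim, hidden_dim))
W_small = rng.normal(0, 0.01, (in_dim, hidden_dim))   # naive small init

h_he = np.maximum(0.0, X @ W_he)
h_small = np.maximum(0.0, X @ W_small)

# Mean squared activation after ReLU: He init keeps it near the
# input's value (about 1.0); the small init shrinks it sharply.
ms_he = np.mean(h_he ** 2)
ms_small = np.mean(h_small ** 2)
print(ms_he, ms_small)
```

If the signal shrinks like this at every layer, the gradients in a deeper network shrink with it, which is the problem He initialization is meant to avoid.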
def forward(self, X):
    self.X = X
    self.z1 = X @ self.W1 + self.b1
    self.h1 = relu(self.z1)
    self.z2 = self.h1 @ self.W2 + self.b2
    if self.mode == 'classification':
        self.probs = softmax(self.z2)
        return self.probs
    return self.z2
Step 1: Compute the first layer pre-activation: z1 = X @ W1 + b1
Step 2: Apply activation function (ReLU): h1 = ReLU(z1)
Step 3: Compute the second layer pre-activation (output layer): z2 = h1 @ W2 + b2
Step 4: Compute softmax probabilities (for classification): probs = softmax(z2)
We also store intermediate values (z1, h1, etc.) because they’re needed for backprop.
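Tracing array shapes through the four steps is a useful sanity check. The sizes below (5 samples, 2 inputs, 4 hidden units, 3 classes) are hypothetical and chosen only to keep the output small.

```python
import numpy as np

rng = np.random.default_rng(0)
N, in_dim, hidden_dim, out_dim = 5, 2, 4, 3   # hypothetical sizes

X = rng.normal(size=(N, in_dim))
W1, b1 = rng.normal(size=(in_dim, hidden_dim)), np.zeros((1, hidden_dim))
W2, b2 = rng.normal(size=(hidden_dim, out_dim)), np.zeros((1, out_dim))

z1 = X @ W1 + b1                 # Step 1: (N, hidden_dim)
h1 = np.maximum(0.0, z1)         # Step 2: (N, hidden_dim)
z2 = h1 @ W2 + b2                # Step 3: (N, out_dim)
e = np.exp(z2 - z2.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)   # Step 4: each row sums to 1

for name, a in [("z1", z1), ("h1", h1), ("z2", z2), ("probs", probs)]:
    print(name, a.shape)
```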
def loss(self, y):
    if self.mode == 'regression':
        diff = self.z2 - y
        return 0.5 * np.mean(np.sum(diff*diff, axis=1))
    else:
        N = y.shape[0]
        logp = -np.log(self.probs[np.arange(N), y] + 1e-12)
        return np.mean(logp)
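The loss function uses MSE for regression (scaled by 0.5 so that its gradient is simply
z2 - y) and cross-entropy on the softmax probabilities for classification. The 1e-12 term
guards against taking log(0).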
def backward(self, y):
    N = y.shape[0]
    if self.mode == 'regression':
        dz2 = (self.z2 - y) / N
    else:
        Y = one_hot(y, self.probs.shape[1])
        dz2 = (self.probs - Y) / N
    dW2 = self.h1.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    dh1 = dz2 @ self.W2.T
    dz1 = dh1 * d_relu(self.z1)
    dW1 = self.X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)
    return dW1, db1, dW2, db2
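Backward propagation is the chain rule applied in reverse. Start from the derivative of the loss
with respect to the outputs, then flow backward through output layer → hidden layer → input layer.
Step 1: Output layer error: dz2 = (z2 - y) / N for regression, or
dz2 = (probs - onehot(y)) / N for classification
Step 2: Gradients for second layer weights and biases: dW2 = h1.T @ dz2, db2 = sum of dz2 over
the batch
Step 3: Backpropagate through ReLU: dh1 = dz2 @ W2.T, dz1 = dh1 * ReLU’(z1)
where ReLU’(z1) is 0 for inactive neurons and 1 for active neurons.
Step 4: Gradients for first layer weights and biases: dW1 = X.T @ dz1, db1 = sum of dz1 over
the batch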
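One common way to gain confidence in a hand-written backward pass is a numerical gradient check. The sketch below is self-contained and does not use the MLP2 class; it rebuilds a tiny regression network with made-up sizes and compares the analytic gradient for W1 against central differences.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = rng.normal(size=(6, 2))
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(1, 4))
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=(1, 2))

def loss_fn(W1):
    h1 = np.maximum(0.0, X @ W1 + b1)
    z2 = h1 @ W2 + b2
    return 0.5 * np.mean(np.sum((z2 - y) ** 2, axis=1))

# Analytic gradient for W1, following Steps 1-4.
z1 = X @ W1 + b1
h1 = np.maximum(0.0, z1)
dz2 = (h1 @ W2 + b2 - y) / X.shape[0]
dz1 = (dz2 @ W2.T) * (z1 > 0)
dW1 = X.T @ dz1

# Numerical gradient by central differences, one weight at a time.
eps = 1e-6
num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss_fn(Wp) - loss_fn(Wm)) / (2 * eps)

max_diff = np.max(np.abs(num - dW1))
print(max_diff)   # should be tiny if the backward pass is correct
```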
def step(self, grads, lr=1e-2, weight_decay=0.0):
    dW1, db1, dW2, db2 = grads
    if weight_decay > 0:
        dW1 += weight_decay * self.W1
        dW2 += weight_decay * self.W2
    self.W1 -= lr * dW1
    self.b1 -= lr * db1
    self.W2 -= lr * dW2
    self.b2 -= lr * db2
Weights are updated using SGD. The parameter weight_decay adds L2 regularization to help
prevent overfitting. Learning rate (step size) can be adjusted.
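Adding weight_decay * W to the gradient is the same as adding the penalty 0.5 * weight_decay * sum(W**2) to the loss. A small sketch with made-up numbers shows the resulting shrinkage when the data gradient is zero:

```python
import numpy as np

W = np.array([[1.0, -2.0], [0.5, 3.0]])
weight_decay = 0.1

penalty = 0.5 * weight_decay * np.sum(W ** 2)   # L2 term added to the loss
grad_penalty = weight_decay * W                 # its gradient with respect to W

# With a zero data gradient, one SGD step simply shrinks every weight.
lr = 0.5
W_new = W - lr * grad_penalty
print(penalty)       # 0.05 * 14.25 = 0.7125
print(W_new / W)     # every entry equals 1 - lr * weight_decay = 0.95
```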
def batch_iter(X, y, batch_size, shuffle=True, seed=0):
    rng = np.random.default_rng(seed)
    idx = np.arange(X.shape[0])
    if shuffle:
        rng.shuffle(idx)
    for i in range(0, len(idx), batch_size):
        j = idx[i:i+batch_size]
        yield X[j], y[j]
This function splits the dataset into small random batches. Mini-batches need less memory per
update than the full dataset, the noise they add to the gradient can help the weights escape
poor local minima, and the weights still converge through stochastic approximation.
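A quick way to see the batching behavior is to run a similar generator on a tiny array. The helper name `batches` and the data below are hypothetical and exist only for this demonstration.

```python
import numpy as np

def batches(X, y, batch_size, seed=0):
    """Yield shuffled mini-batches; the last one may be smaller."""
    idx = np.random.default_rng(seed).permutation(X.shape[0])
    for i in range(0, len(idx), batch_size):
        j = idx[i:i + batch_size]
        yield X[j], y[j]

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10)

sizes = [len(yb) for _, yb in batches(X, y, batch_size=4)]
seen = np.sort(np.concatenate([yb for _, yb in batches(X, y, batch_size=4)]))
print(sizes)  # [4, 4, 2]
```

Every sample appears exactly once per pass, and 10 samples with a batch size of 4 leave a final batch of 2.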
def train(model, X, y, epochs=200, lr=1e-2, batch_size=64, weight_decay=0.0,
          verbose=True):
    losses = []
    for e in range(1, epochs+1):
        for Xb, yb in batch_iter(X, y, batch_size, shuffle=True, seed=e):
            model.forward(Xb)
            grads = model.backward(yb)
            model.step(grads, lr=lr, weight_decay=weight_decay)
        model.forward(X)
        L = model.loss(y)
        losses.append(L)
        if verbose and (e % max(1, epochs//10) == 0):
            print(f"epoch {e:4d}/{epochs} loss={L:.4f}")
    return losses
Training loops over epochs; within each epoch it runs a forward pass, a backward pass, and a
weight update for every mini-batch. At the end of each epoch, the loss over the full dataset is
computed and recorded.
3. Application Example 1 — 1D regression (piecewise-linear)
Goal: learn the function f(x) = max(0, 0.5x + 0.2) + 0.3*max(0, -x+0.5) with noise.
This shows how ReLU builds piecewise linear fits.
# Synthetic data
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (400, 1))
def target(x):
    return np.maximum(0, 0.5*x + 0.2) + 0.3*np.maximum(0, -x + 0.5)
y = target(X) + 0.05*rng.normal(size=(400,1))
# Model: 1->32->1
m_reg = MLP2(in_dim=1, hidden_dim=32, out_dim=1, mode='regression', seed=1)
train(m_reg, X, y, epochs=500, lr=1e-2, batch_size=64, weight_decay=1e-4)
# Test few points
xt = np.linspace(-2,2,9).reshape(-1,1)
pred = m_reg.forward(xt)
print(np.hstack([xt, target(xt), pred]))
What’s happening:
ReLU units carve the x-axis into regions where each unit is active/inactive.
The network sums these regions to approximate any piecewise-linear function.
Backprop computes gradients of MSE wrt each parameter and nudges weights by SGD.
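The target function here is itself a sum of two ReLU terms, so a network with only two hidden units can represent it exactly. The sketch below uses hand-picked weights (not learned ones) to show this; the trained 32-unit model has to discover an equivalent combination from noisy data.

```python
import numpy as np

def target(x):
    return np.maximum(0, 0.5 * x + 0.2) + 0.3 * np.maximum(0, -x + 0.5)

# Hand-picked weights (not learned): one hidden unit per ReLU term.
W1 = np.array([[0.5, -1.0]])      # (1, 2): slopes inside each ReLU
b1 = np.array([[0.2, 0.5]])       # (1, 2): offsets inside each ReLU
W2 = np.array([[1.0], [0.3]])     # (2, 1): output weights
b2 = np.array([[0.0]])

x = np.linspace(-2, 2, 9).reshape(-1, 1)
pred = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.max(np.abs(pred - target(x))))   # zero up to floating-point error
```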
4. Application Example 2 — Binary classification (non-linear)
We’ll classify a two-circle (bullseye) dataset, which is linearly inseparable but easy for a ReLU network.
# Concentric rings dataset
def rings(n=600, inner_r=0.5, gap=0.2, noise=0.06, seed=0):
    rng = np.random.default_rng(seed)
    n2 = n//2
    theta1 = rng.uniform(0, 2*np.pi, n2)
    r1 = inner_r + noise*rng.normal(size=n2)
    x1 = np.c_[r1*np.cos(theta1), r1*np.sin(theta1)]
    theta2 = rng.uniform(0, 2*np.pi, n-n2)
    r2 = inner_r + gap + noise*rng.normal(size=n-n2)
    x2 = np.c_[r2*np.cos(theta2), r2*np.sin(theta2)]
    X = np.vstack([x1, x2])
    y = np.array([0]*n2 + [1]*(n-n2))
    return X, y
Xb, yb = rings(n=800, inner_r=0.6, gap=0.5, noise=0.07, seed=1)
m_bin = MLP2(in_dim=2, hidden_dim=64, out_dim=2, mode='classification', seed=2)
train(m_bin, Xb, yb, epochs=200, lr=5e-3, batch_size=64, weight_decay=1e-4)
probs = m_bin.forward(Xb)
preds = probs.argmax(axis=1)
acc = (preds == yb).mean()
print("train accuracy:", acc)
What’s happening:
The hidden ReLU layer combines many straight-line pieces into a closed, roughly circular decision boundary between the rings.
Softmax turns logits into class probabilities; cross-entropy penalizes wrong classes.
Gradients backprop through softmax→linear→ReLU→linear.
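To see why this dataset needs a nonlinear model, the sketch below compares a linear classifier fit by least squares with a simple radius threshold. The data is regenerated inline with the same ring radii and noise as above; the threshold 0.85 is an assumption chosen halfway between the two rings.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
theta = rng.uniform(0, 2 * np.pi, 2 * n)
r = np.concatenate([0.6 + 0.07 * rng.normal(size=n),    # inner ring, class 0
                    1.1 + 0.07 * rng.normal(size=n)])   # outer ring, class 1
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
y = np.array([0] * n + [1] * n)

# Linear classifier: least-squares fit on [x, y, 1] to targets -1/+1.
A = np.c_[X, np.ones(len(X))]
w, *_ = np.linalg.lstsq(A, 2.0 * y - 1.0, rcond=None)
lin_acc = np.mean((A @ w > 0) == y)

# Nonlinear rule: threshold the radius halfway between the rings.
rad_acc = np.mean((np.hypot(X[:, 0], X[:, 1]) > 0.85) == y)
print(lin_acc, rad_acc)
```

A single straight line cannot enclose the inner ring, so the linear fit stays far below the radius rule; the MLP has to learn something equivalent to the radius feature from the raw coordinates.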
5. Application Example 3 — Multiclass classification (3 blobs)
def three_blobs(n=900, seed=0):
    rng = np.random.default_rng(seed)
    means = np.array([[0,0], [2.5, 0.5], [-2.0, 1.5]])
    cov = np.array([[0.4,0.0],[0.0,0.4]])
    Xs, ys = [], []
    for k, m in enumerate(means):
        Xk = rng.multivariate_normal(m, cov, size=n//3)
        yk = np.full(n//3, k)
        Xs.append(Xk)
        ys.append(yk)
    return np.vstack(Xs), np.concatenate(ys)
Xm, ym = three_blobs(900, seed=4)
m_multi = MLP2(in_dim=2, hidden_dim=64, out_dim=3, mode='classification', seed=3)
train(m_multi, Xm, ym, epochs=200, lr=5e-3, batch_size=64)
print("train acc:", (m_multi.forward(Xm).argmax(1)==ym).mean())
What’s happening:
The model learns multiple linear pieces (via ReLUs) to separate 3 clusters.
PART – II: Application
Use the Kaggle Dogs vs. Cats dataset ([Link]).
It contains images stored in folders like this:
data/
├── train/
│ ├── dog/
│ └── cat/
└── test/
We will construct a small CNN step by step using only NumPy operations to gain a deeper
understanding of its underlying mechanics.
1. Imports
import numpy as np
import os
from PIL import Image
import matplotlib.pyplot as plt
2. Load and Preprocess Images
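def load_data(data_dir, seed=0):
    X, y = [], []
    for label, folder in enumerate(['cat', 'dog']):
        path = os.path.join(data_dir, folder)
        for file in sorted(os.listdir(path)):
            img = Image.open(os.path.join(path, file)).convert('RGB').resize((64, 64))
            X.append(np.array(img) / 255.0)
            y.append(label)
    X, y = np.array(X), np.array(y)
    # Shuffle so the 80/20 split below contains both classes
    # (without this, the held-out 20% would be dogs only)
    perm = np.random.default_rng(seed).permutation(len(X))
    return X[perm], y[perm]
X, y = load_data('data/train')
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]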
3. Helper Functions
def relu(x):
    return np.maximum(0, x)
def relu_derivative(x):
    return (x > 0).astype(float)
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)
def mse_grad(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.size
4. Initialize Parameters
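conv_filter = np.random.randn(3, 3, 3) * 0.1
# 64x64 input -> 62x62 after the 3x3 convolution -> 31x31 after 2x2 pooling,
# so the flattened feature vector has 31 * 31 = 961 entries
W_fc = np.random.randn(31 * 31, 1) * 0.01
b_fc = np.zeros((1,))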
5. Forward Pass
def conv2d(img, kernel):
    # Valid (no padding) convolution of an H x W x C image with one kernel
    h, w, c = img.shape
    kh, kw, kc = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            region = img[i:i+kh, j:j+kw, :]
            out[i, j] = np.sum(region * kernel)
    return out
def maxpool(img, size=2):
    # Non-overlapping max pooling over size x size windows
    h, w = img.shape
    new_h, new_w = h // size, w // size
    pooled = np.zeros((new_h, new_w))
    for i in range(new_h):
        for j in range(new_w):
            pooled[i, j] = np.max(img[i*size:(i+1)*size, j*size:(j+1)*size])
    return pooled
def forward(img):
    global conv_filter, W_fc, b_fc
    conv_out = conv2d(img, conv_filter)
    relu_out = relu(conv_out)
    pooled = maxpool(relu_out)
    flat = pooled.flatten()
    fc_out = relu(np.dot(flat, W_fc) + b_fc)
    return fc_out, flat, pooled, relu_out
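The size of the flattened vector that feeds the fully connected layer follows from the image, kernel, and pooling sizes. The helper functions below are hypothetical names introduced for this sketch; they apply the standard output-size formulas for a convolution and for non-overlapping pooling.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window):
    """Output width/height of non-overlapping pooling (floor division)."""
    return size // window

h = conv_out(64, 3)        # 64x64 image, 3x3 kernel, no padding -> 62
p = pool_out(h, 2)         # 2x2 max pooling -> 31
flat_len = p * p           # single-channel feature map, flattened -> 961
print(h, p, flat_len)      # 62 31 961
```

The fully connected weight matrix needs exactly flat_len rows, otherwise the dot product in the forward pass fails with a shape error.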
6. Training with MSE Loss
We’ll train only the fully connected layer for simplicity.
lr = 0.001
epochs = 10
for epoch in range(epochs):
    total_loss = 0
    for i in range(len(X_train)):
        img = X_train[i]
        label = np.array([y_train[i]])
        y_pred, flat, pooled, relu_out = forward(img)
        loss = mse_loss(label, y_pred)
        total_loss += loss
        # Chain rule through the MSE loss and the output ReLU
        grad_y = mse_grad(label, y_pred) * relu_derivative(y_pred)
        grad_W = np.outer(flat, grad_y)
        grad_b = grad_y
        W_fc -= lr * grad_W
        b_fc -= lr * grad_b
    print(f"Epoch {epoch+1}/{epochs} | Loss: {total_loss/len(X_train):.4f}")
7. Evaluation
correct = 0
for i in range(len(X_test)):
    y_pred, _, _, _ = forward(X_test[i])
    # y_pred is a length-1 array; index it before converting to int
    prediction = int(y_pred[0] > 0.5)
    if prediction == y_test[i]:
        correct += 1
accuracy = correct / len(X_test)
print(f"Test Accuracy: {accuracy*100:.2f}%")
Write evaluation code that automatically trains on one set of images and then tests your
model on large batches of unseen images.
Measure the time it takes to process the images.
Submit all of the relevant results in the report.
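For the timing task, one possible approach is a small wall-clock helper built on `time.perf_counter`. The names `time_call` and `dummy_workload` are hypothetical; the dummy function is only a stand-in for your own training or evaluation function.

```python
import time

def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time (seconds) of fn(*args) over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in workload; replace with your own train or evaluate function.
def dummy_workload(n):
    return sum(i * i for i in range(n))

elapsed = time_call(dummy_workload, 100_000)
print(f"{elapsed:.6f} s")
```

Taking the best of several runs reduces the influence of background activity on the machine; report the number of images processed alongside the time so results can be compared per image.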
PART – III: GPU Acceleration
Now that you understand how a CNN works internally, you’ll build and train the same model
using PyTorch. Please keep in mind that you must write your own code for the same system.
1. Imports
Import the following PyTorch modules:
- torch
- torch.nn
- torch.nn.functional
- torch.optim
- torchvision.datasets and transforms
2. Using CUDA-driven GPU Library
Check if a GPU is available and set the device accordingly:
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
```
Hint: Always move both your model and tensors to this device to utilize GPU.
3. Define Transformations and Load Dataset
Create a transform pipeline that:
- Resizes images to (64, 64)
- Converts them to tensors
Then use `datasets.ImageFolder()` to load your data and `DataLoader` to batch and shuffle the
samples.
Try: Print a batch of image shapes to confirm it’s loading correctly.
4. Define the CNN Model
Define a class `CatDogCNN(nn.Module)` that includes:
- Two convolution layers with ReLU activation and max pooling
- One or two fully connected layers
- ReLU activation for all layers
Tip: Each convolution should increase the channel depth (e.g., 3→16→32).
Use `F.relu()` for activation and `F.max_pool2d()` for pooling.
5. Move Model to GPU
After defining your model, move it to the device:
```python
model = CatDogCNN().to(device)
```
Hint: Failing to move tensors and model to the same device will cause runtime errors.
6. Choose Loss Function and Optimizer
Since you’re using ReLU for the output, you can use `nn.MSELoss()` as the loss function.
For optimization, use Adam (`torch.optim.Adam`) with a learning rate of 0.001.
7. Implement Training Loop
Write a training loop that:
1. Iterates through all epochs
2. For each batch:
- Move `images` and `labels` to the same device as the model
- Clear gradients (`optimizer.zero_grad()`)
- Compute outputs
- Calculate loss
- Backpropagate and update weights
Hint: `images = images.to(device)` and `labels = labels.to(device)` (`.to()` returns a new tensor rather than moving the original in place)
8. Evaluate the Model on GPU
After training, evaluate your model:
Disable gradient computation using `torch.no_grad()`
Move images and labels to the device
Compute outputs and accuracy
Write evaluation code that automatically trains on one set of images and then tests your
model on large batches of unseen images.
Measure the time it takes to process the images.
Compare the performance of the NumPy version against the GPU-accelerated version of your
model.
Submit all of the relevant results in the report.