7. Write a program to show a Back Propagation Network for the XOR
function with Binary Input and Output
import numpy as np

np.random.seed(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(y):
    # y is already sigmoid(x)
    return y * (1.0 - y)
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

input_size = 2
hidden_size = 2
output_size = 1
lr = 0.5 # learning rate
epochs = 10000 # number of training iterations

W1 = np.random.uniform(-1.0, 1.0, (input_size, hidden_size))
B1 = np.zeros((1, hidden_size))
W2 = np.random.uniform(-1.0, 1.0, (hidden_size, output_size))
B2 = np.zeros((1, output_size))
for epoch in range(epochs):
    Z1 = np.dot(X, W1) + B1 # (4, hidden_size)
    A1 = sigmoid(Z1) # hidden activations
    Z2 = np.dot(A1, W2) + B2 # (4, 1)
    A2 = sigmoid(Z2) # output activations

    loss = np.mean(0.5 * (Y - A2) ** 2)

    dA2 = A2 - Y # derivative of MSE wrt A2
    dZ2 = dA2 * sigmoid_deriv(A2) # (4,1)
    dW2 = np.dot(A1.T, dZ2) / X.shape[0]
    dB2 = np.mean(dZ2, axis=0, keepdims=True)

    dA1 = np.dot(dZ2, W2.T) # (4, hidden_size)
    dZ1 = dA1 * sigmoid_deriv(A1)
    dW1 = np.dot(X.T, dZ1) / X.shape[0]
    dB1 = np.mean(dZ1, axis=0, keepdims=True)

    W2 -= lr * dW2
    B2 -= lr * dB2
    W1 -= lr * dW1
    B1 -= lr * dB1

    if epoch % 1000 == 0 or epoch == epochs - 1:
        print(f"Epoch {epoch:5d} Loss: {loss:.6f}")
print("\nTrained predictions on XOR inputs:")
Z1 = np.dot(X, W1) + B1
A1 = sigmoid(Z1)
Z2 = np.dot(A1, W2) + B2
A2 = sigmoid(Z2)
print(np.hstack((X, Y, A2, np.round(A2))))

print("\nWeights and biases:")
print("W1:\n", W1)
print("B1:\n", B1)
print("W2:\n", W2)
print("B2:\n", B2)

preds = np.round(A2)
accuracy = np.mean(preds == Y)
print(f"\nAccuracy (after rounding): {accuracy * 100:.1f}%")
Output:
Epoch     0 Loss: 0.143150
Epoch  1000 Loss: 0.124974
Epoch  2000 Loss: 0.124862
Epoch  3000 Loss: 0.124477
Epoch  4000 Loss: 0.121065
Epoch  5000 Loss: 0.070273
Epoch  6000 Loss: 0.018014
Epoch  7000 Loss: 0.008040
Epoch  8000 Loss: 0.004882
Epoch  9000 Loss: 0.003431
Epoch  9999 Loss: 0.002619
Trained predictions on XOR inputs:
[[0. 0. 0. 0.07022023 0. ]
[0. 1. 1. 0.92223364 1. ]
[1. 0. 1. 0.92231458 1. ]
[1. 1. 0. 0.06270003 0. ]]
Weights and biases:
W1:
[[-5.50637624 5.67926358]
[ 5.69407248 -5.46834499]]
B1:
[[2.81493644 2.78774586]]
W2:
[[-6.15130337]
[-6.15441065]]
B2:
[[9.01782245]]
Accuracy (after rounding): 100.0%
Explanation:
import numpy as np
Imports the NumPy library and gives it the short name np, so you can use
NumPy functions and arrays with the np. prefix.
np.random.seed(42)
Sets the random-number generator seed to 42 so any subsequent random
numbers (like initial weights) are reproducible every run.
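A minimal sketch of what the seed provides: reseeding with the same value
replays the same random draws. The variable names first and second are
illustrative only.

```python
import numpy as np

np.random.seed(42)
first = np.random.uniform(-1.0, 1.0, (2, 2))

np.random.seed(42)  # reseed with the same value
second = np.random.uniform(-1.0, 1.0, (2, 2))

# Same seed, same sequence of draws
print(np.array_equal(first, second))  # True
```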
def sigmoid(x):
Starts the definition of a function named sigmoid that takes one argument
x.
return 1.0 / (1.0 + np.exp(-x))
Computes the sigmoid activation σ(x) = 1 / (1 + e^{-x}) elementwise for
input x; maps real values into the range (0,1).
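As a quick illustration, the function can be evaluated at a few sample
points (the inputs -10, 0 and 10 are arbitrary choices): large negative
inputs give values near 0, zero gives exactly 0.5, and large positive
inputs give values near 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Elementwise evaluation on an array of sample inputs
values = sigmoid(np.array([-10.0, 0.0, 10.0]))
print(values)
print(values[1])  # 0.5, since sigmoid(0) = 1 / (1 + 1)
```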
def sigmoid_deriv(y):
Starts the definition of a function named sigmoid_deriv that expects y,
which should already be sigmoid(x).
return y * (1.0 - y)
Returns the derivative of sigmoid with respect to its input, using the
identity σ'(x) = σ(x) * (1 - σ(x)). This expects y = σ(x).
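One way to convince yourself of this identity is to compare it with a
central finite difference. This is only a sanity-check sketch; the sample
points and the step size h are arbitrary assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(y):
    # y is already sigmoid(x)
    return y * (1.0 - y)

x = np.array([-2.0, 0.0, 1.5])
h = 1e-5  # finite-difference step (assumed value)

# Central difference approximation of d(sigmoid)/dx
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
# Identity-based derivative: note that the sigmoid OUTPUT is passed in
analytic = sigmoid_deriv(sigmoid(x))

print(np.allclose(numeric, analytic, atol=1e-8))  # True
```

Passing x itself instead of sigmoid(x) to sigmoid_deriv would give a wrong
result, which is why the argument is named y.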
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]], dtype=float)
Creates the input matrix X as a NumPy array with four rows (samples) and
two columns (features). Each row is one XOR input pair. dtype=float
ensures numeric (floating point) math.
Y = np.array([[0], [1], [1], [0]], dtype=float)
Creates the target/output column vector Y with four rows corresponding to
XOR outputs. It has shape (4,1) and is float for gradient math.
input_size = 2
Stores the number of input neurons/features (2) in a variable used for
shaping weights.
hidden_size = 2
Stores the number of hidden neurons (2). Two hidden units are sufficient
to represent XOR.
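To see why two hidden units are enough, here is a sketch with hand-picked
weights (these are assumptions chosen for illustration, not the weights
the program learns): one hidden unit approximates OR, the other
approximates AND, and the output unit computes "OR and not AND".

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# Hand-picked, illustrative weights (not learned)
W1 = np.array([[20.0, 20.0],
               [20.0, 20.0]])
B1 = np.array([[-10.0, -30.0]])   # unit 0 ~ OR, unit 1 ~ AND
W2 = np.array([[20.0], [-20.0]])  # output ~ OR and not AND
B2 = np.array([[-10.0]])

A1 = sigmoid(np.dot(X, W1) + B1)
A2 = sigmoid(np.dot(A1, W2) + B2)

print(np.round(A2).ravel())  # [0. 1. 1. 0.]
```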
output_size = 1
Stores the number of output neurons (1) — the network predicts a single
scalar per input.
lr = 0.5 # learning rate
Sets the learning rate lr to 0.5; this scales how big each gradient descent
update is. The comment labels it.
epochs = 10000 # number of training iterations
Sets the number of training iterations (full passes over the dataset) to
10,000. The comment explains its meaning.
W1 = np.random.uniform(-1.0, 1.0, (input_size, hidden_size))
Initializes the input-to-hidden weight matrix W1 with random values
drawn uniformly from the half-open interval [-1.0, 1.0). Its shape is
(2,2): rows correspond to input features, columns to hidden neurons.
B1 = np.zeros((1, hidden_size))
Initializes the hidden-layer bias B1 as a row vector of zeros with shape
(1,2). This will broadcast across the 4 samples when added.
W2 = np.random.uniform(-1.0, 1.0, (hidden_size, output_size))
Initializes the hidden-to-output weight matrix W2 with random values
drawn uniformly from [-1.0, 1.0). Its shape is (2,1): rows are hidden
units, the column is the single output unit.
B2 = np.zeros((1, output_size))
Initializes the output-layer bias B2 as zeros with shape (1,1).
for epoch in range(epochs):
Begins the training loop that will run epochs times; epoch counts from 0 to
epochs-1. Each iteration performs one forward and backward pass over
the full dataset (batch gradient descent).
Z1 = np.dot(X, W1) + B1 # (4, hidden_size)
Computes the pre-activation of the hidden layer: Z1 = X · W1 + B1.
np.dot(X, W1) multiplies shape (4,2) × (2,2) → (4,2); adding B1 (1,2) uses
broadcasting to add the bias to every sample.
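The shapes and the broadcasting can be checked with a small sketch. The
weight and bias values below are made up so the arithmetic is easy to
follow by hand; they are not the program's values.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Illustrative values only
W1 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
B1 = np.array([[10.0, 20.0]])  # shape (1, 2)

# (4,2) x (2,2) -> (4,2); B1 is broadcast across all 4 rows
Z1 = np.dot(X, W1) + B1

print(Z1.shape)  # (4, 2)
print(Z1)
# [[10. 20.]
#  [13. 24.]
#  [11. 22.]
#  [14. 26.]]
```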
A1 = sigmoid(Z1) # hidden activations
Applies the sigmoid activation elementwise to Z1, producing hidden-layer
activations A1 with shape (4,2).
Z2 = np.dot(A1, W2) + B2 # (4, 1)
Computes the pre-activation of the output layer: Z2 = A1 · W2 + B2.
Shapes: (4,2) × (2,1) → (4,1); add bias B2 (1,1) via broadcasting.
A2 = sigmoid(Z2) # output activations
Applies sigmoid to Z2 to get the network's predicted outputs A2 (4,1),
values in (0,1).
loss = np.mean(0.5 * (Y - A2) ** 2)
Computes the scalar loss (half the mean squared error): for each sample
compute 0.5*(target - output)^2, then take the mean across samples. The
0.5 cancels the factor of 2 that appears when the square is
differentiated.
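A worked example of this formula, assuming a hypothetical network that
outputs 0.5 for every input (roughly what an undecided sigmoid network
does): every sample contributes 0.5 * 0.5^2 = 0.125, so the mean is 0.125.

```python
import numpy as np

Y = np.array([[0.0], [1.0], [1.0], [0.0]])
# Hypothetical outputs: the network answers 0.5 for every input
A2 = np.full((4, 1), 0.5)

loss = np.mean(0.5 * (Y - A2) ** 2)
print(loss)  # 0.125
```

The loss values just under 0.125 in the first few thousand epochs of the
output above are consistent with outputs still sitting close to 0.5.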
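dA2 = A2 - Y # derivative of MSE wrt A2
Computes the derivative of the per-sample loss 0.5*(Y-A2)^2 w.r.t. the
network output A2, which is A2 - Y. Shape (4,1). The 1/N factor from the
mean is applied later, when the weight and bias gradients are averaged
over the samples.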
dZ2 = dA2 * sigmoid_deriv(A2) # (4,1)
Applies the chain rule: derivative w.r.t. pre-activation Z2 equals dA2 *
σ'(Z2). Since sigmoid_deriv expects the sigmoid output, we pass A2.
Elementwise multiply gives shape (4,1).
dW2 = np.dot(A1.T, dZ2) / X.shape[0]
Computes the gradient of the loss w.r.t. W2: A1^T · dZ2 yields shape
(2,1). Dividing by X.shape[0] (4) averages the gradient across samples
(batch gradient).
dB2 = np.mean(dZ2, axis=0, keepdims=True)
Computes gradient of the loss w.r.t. bias B2 by averaging dZ2 across
samples, resulting in shape (1,1). keepdims=True preserves 2D shape for
broadcasting consistency.
dA1 = np.dot(dZ2, W2.T) # (4, hidden_size)
Backpropagates the gradient to the hidden activations: dA1 = dZ2 ·
W2^T. Shapes: (4,1) × (1,2) → (4,2). This represents how changes in
hidden activations change loss.
dZ1 = dA1 * sigmoid_deriv(A1)
Applies elementwise multiplication with the derivative of the sigmoid to
get gradient w.r.t. pre-activation Z1. sigmoid_deriv(A1) returns shape
(4,2), so dZ1 is (4,2).
dW1 = np.dot(X.T, dZ1) / X.shape[0]
Computes gradient w.r.t. W1 as X^T · dZ1 with shapes (2,4) × (4,2) →
(2,2), then divides by 4 to average over samples.
dB1 = np.mean(dZ1, axis=0, keepdims=True)
Computes gradient w.r.t. hidden bias B1 by averaging dZ1 across samples,
returning shape (1,2).
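The backward-pass formulas can be sanity-checked against a numerical
gradient. The sketch below does this for W1 only, using fixed made-up
weights and a helper loss_fn that is not part of the original program;
the step size h is also an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss_fn(X, Y, W1, B1, W2, B2):
    # Hypothetical helper: forward pass followed by the same loss
    A1 = sigmoid(np.dot(X, W1) + B1)
    A2 = sigmoid(np.dot(A1, W2) + B2)
    return np.mean(0.5 * (Y - A2) ** 2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
# Fixed illustrative weights (not the seeded random ones)
W1 = np.array([[0.3, -0.7], [0.5, 0.2]])
B1 = np.zeros((1, 2))
W2 = np.array([[0.6], [-0.4]])
B2 = np.zeros((1, 1))

# Analytic gradient of W1, following the same steps as the training loop
A1 = sigmoid(np.dot(X, W1) + B1)
A2 = sigmoid(np.dot(A1, W2) + B2)
dZ2 = (A2 - Y) * A2 * (1.0 - A2)
dZ1 = np.dot(dZ2, W2.T) * A1 * (1.0 - A1)
dW1 = np.dot(X.T, dZ1) / X.shape[0]

# Numerical gradient of W1 by central differences
h = 1e-5
num_dW1 = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp = W1.copy(); Wp[i, j] += h
        Wm = W1.copy(); Wm[i, j] -= h
        num_dW1[i, j] = (loss_fn(X, Y, Wp, B1, W2, B2)
                         - loss_fn(X, Y, Wm, B1, W2, B2)) / (2 * h)

print(np.allclose(dW1, num_dW1, atol=1e-8))  # True
```

The same loop could be repeated for W2, B1 and B2 if a full check is
wanted.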
W2 -= lr * dW2
Updates the output weights W2 by subtracting the learning-rate-scaled
gradient (gradient descent step).
B2 -= lr * dB2
Updates the output bias B2 similarly.
W1 -= lr * dW1
Updates input-to-hidden weights W1 with gradient descent.
B1 -= lr * dB1
Updates hidden bias B1 with gradient descent.
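The same update rule, parameter -= lr * gradient, can be seen in
isolation on a one-variable toy problem. The function f(w) = (w - 3)^2
and the values of lr and the step count are assumptions chosen for
illustration; they are unrelated to the XOR network.

```python
# Minimize f(w) = (w - 3)^2 with plain gradient descent
w = 0.0
lr = 0.1  # illustrative learning rate

for step in range(100):
    grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2
    w -= lr * grad          # same form as W1 -= lr * dW1

print(round(w, 6))  # 3.0
```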
if epoch % 1000 == 0 or epoch == epochs - 1:
Checks whether to print progress: either every 1000 epochs, or the very
last epoch.
print(f"Epoch {epoch:5d} Loss: {loss:.6f}")
If the condition is met, prints the current epoch number and the loss
formatted to 6 decimal places so you can observe training progress.
print("\nTrained predictions on XOR inputs:")
After training finishes, prints a header line announcing that final
predictions follow.
Z1 = np.dot(X, W1) + B1
Recomputes hidden pre-activations using the final trained W1 and B1.
A1 = sigmoid(Z1)
Computes final hidden activations.
Z2 = np.dot(A1, W2) + B2
Computes final output pre-activations.
A2 = sigmoid(Z2)
Computes the final network outputs for the training inputs (raw values in
(0,1)).
print(np.hstack((X, Y, A2, np.round(A2))))
Horizontally stacks and prints the input X, target Y, raw output A2, and
rounded output np.round(A2) (0 or 1). This shows inputs, expected
outputs, predicted probabilities, and final discrete predictions side-by-side.
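A small sketch of the stacking step on two made-up rows (the A2 values
are illustrative, not the trained outputs): arrays with the same number
of rows are joined column-wise into one table.

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 1.0]])
Y = np.array([[1.0], [0.0]])
A2 = np.array([[0.92], [0.06]])  # illustrative raw outputs

# Columns: x1, x2, target, raw output, rounded output
table = np.hstack((X, Y, A2, np.round(A2)))
print(table.shape)  # (2, 5)
print(table)

# np.round rounds exact halves to the nearest even value
print(np.round(0.5))  # 0.0
```

An output of exactly 0.5 would therefore round to 0, not 1.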
print("\nWeights and biases:")
Prints a header announcing that the learned weights and biases will be
displayed.
print("W1:\n", W1)
Prints the final W1 matrix (input→hidden weights).
print("B1:\n", B1)
Prints the final B1 bias row for the hidden layer.
print("W2:\n", W2)
Prints the final W2 matrix (hidden→output weights).
print("B2:\n", B2)
Prints the final B2 bias scalar for the output layer.
preds = np.round(A2)
Rounds the final raw outputs A2 to 0 or 1 and stores them in preds.
accuracy = np.mean(preds == Y)
Computes accuracy as the mean of the boolean array preds == Y.
Booleans convert to 1/0, so the mean is the fraction of correct predictions.
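For example, with a hypothetical set of predictions in which one of the
four is wrong, the mean of the comparison gives 3/4:

```python
import numpy as np

Y = np.array([[0.0], [1.0], [1.0], [0.0]])
# Hypothetical predictions: the third one is wrong
preds = np.array([[0.0], [1.0], [0.0], [0.0]])

# preds == Y is [[True], [True], [False], [True]]; True counts as 1
accuracy = np.mean(preds == Y)
print(accuracy)  # 0.75
```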
print(f"\nAccuracy (after rounding): {accuracy * 100:.1f}%")
Prints the accuracy as a percentage with one decimal place.