Introduction
Handwritten digit classification is a foundational challenge in machine learning and computer vision, with
applications ranging from automated postal services to digitized document processing.(3) The Modified
National Institute of Standards and Technology (MNIST) dataset, introduced by LeCun et al. (1998), has served as a
benchmark for over two decades, offering 70,000 grayscale images of handwritten digits from 0 to 9 that capture
the variability of real-world handwriting.(3) This task demands robust models capable of capturing complex
spatial patterns while remaining computationally tractable.(11)
A single-layer perceptron can classify only linearly separable patterns; it cannot handle patterns that are not linearly
separable. This limitation motivated Multi-Layer Perceptrons (MLPs), a class of fully connected feedforward
neural networks that emerged as an early and enduring solution(1) to this problem. By using non-linear activation functions
and hierarchical layers, MLPs map flattened pixel intensities (784 inputs) to class probabilities over 10 outputs
(digits 0 to 9), learning through backpropagation to minimize prediction errors.(1)(3) Introduced by Rumelhart
et al. (1986), backpropagation enabled MLPs to approximate complex functions, which established
their usefulness in digit recognition.(1) Although Convolutional Neural Networks (CNNs) later surpassed MLPs in accuracy
by exploiting spatial hierarchies (LeCun et al., 1998), MLPs remain pivotal for understanding neural network
fundamentals and benchmarking algorithmic innovation.(11)
Background
Handwritten digit classification on MNIST with a Multi-Layer Perceptron is a core computer vision and machine learning
problem, with uses in automated postal sorting, bank check processing, and optical character recognition.(3)
A Multi-Layer Perceptron (MLP) is a fully connected feedforward neural network designed to model non-linear
relationships through its input, hidden, and output layers. Using non-linear activation functions such as
sigmoid, tanh, or ReLU, MLPs can learn sophisticated patterns in data. We use ReLU here because it is cheap to
compute, introduces non-linearity, mitigates vanishing gradients, and promotes sparse activations. For MNIST, an MLP
accepts flattened pixel values (a 784-dimensional input vector) and produces a probability for each of the 10 digit
classes, 0 through 9.(3)
The network learns by updating its weights via backpropagation, which computes the gradients that gradient descent uses
to reduce prediction error. MLPs are well suited to MNIST because they can express non-linear patterns, but they have
limitations: they are computationally costly, sensitive to hyperparameter settings (e.g., number of layers, learning
rate, number of neurons), and prone to overfitting.(5)(7) Despite these
limitations, MLPs are a starting point for neural networks and a basis for comparison with more sophisticated approaches
such as CNNs.(11)
This project uses an MLP to identify handwritten digits in the MNIST dataset. The goal is to investigate the
impact of architectural choices and activation functions. By analyzing accuracy and training dynamics, the study aims to
demonstrate the feasibility of MLPs for this task and how to maximize their performance.(3)
Literature Review
The MNIST dataset, introduced by LeCun et al. (1998), revolutionized handwritten digit classification by providing a
standardized benchmark of 70,000 grayscale images (28x28 pixels) of digits 0–9. This dataset became a cornerstone for
evaluating machine learning models due to its accessibility and representation of real-world handwriting variations. While
originally designed to test convolutional neural networks (CNNs), MNIST also facilitated the exploration of Multi-Layer
Perceptrons (MLPs), establishing a baseline for comparing architectural innovations in neural networks. (3)
Rumelhart et al. (1986) pioneered backpropagation as a training algorithm for neural networks, enabling MLPs to learn
complex mappings through gradient descent. By minimizing prediction errors iteratively, this work laid the foundation for
applying MLPs to digit recognition and other pattern recognition tasks. Backpropagation became a cornerstone of neural
network training, allowing MLPs to model non-linear relationships in data through hidden layers.(1)
LeCun et al. (1989) demonstrated early applications of neural networks to handwritten character recognition, exploring both
MLPs and CNNs. Their work highlighted the potential of neural networks to automate digit classification, though
computational limitations of the era constrained scalability. This study set the stage for later advancements in optimizing
MLPs for image-based tasks like MNIST.(2)
In their 1998 work, LeCun et al. developed LeNet-5, a CNN architecture that achieved state-of-the-art accuracy on MNIST.
While LeNet-5 showcased the superiority of CNNs for spatial data, it also underscored MLPs’ role as a simpler, more
interpretable alternative for educational purposes and baseline comparisons.(3)
Simard et al. (2003) investigated preprocessing techniques such as normalization and deskewing to enhance MLP
performance on MNIST. Their findings revealed that while MLPs benefited from these methods, their dense connectivity
led to higher computational costs compared to CNNs, emphasizing the trade-off between simplicity and efficiency.(8)
Glorot and Bengio (2010) addressed training challenges in deep MLPs by proposing Xavier initialization, a weight
initialization method that stabilized gradient flow during backpropagation. This innovation mitigated vanishing/exploding
gradient issues, making MLPs more viable for training on datasets like MNIST and enabling deeper architectures.(5)
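As a rough illustration of the idea, the sketch below draws weights from the Glorot/Xavier uniform distribution, whose bound is sqrt(6 / (fan_in + fan_out)). The function name `xavier_uniform`, the seed, and the layer shape are choices made for this example only; the implementation in this report uses 0.01-scaled random weights instead.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)          # arbitrary seed for reproducibility
W = xavier_uniform(784, 128, rng)       # shape of a 784 -> 128 dense layer
limit = np.sqrt(6.0 / (784 + 128))      # bound used by the sampler above
target_var = 2.0 / (784 + 128)          # variance of U(-limit, limit) = limit**2 / 3
```

The sample variance of `W` should sit close to `target_var`, which is the property that keeps activation and gradient magnitudes roughly constant across layers.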
Kingma and Ba (2014) introduced the Adam optimizer, an adaptive learning rate algorithm that accelerated MLP
convergence on MNIST. By dynamically adjusting learning rates based on gradient moments, Adam improved training
efficiency and reliability, becoming a widely adopted optimization tool for neural networks.(10)
Goodfellow et al. (2016) critiqued MLPs’ limitations in handling high-dimensional data like images, noting their dense
connectivity led to parameter explosion and inefficiency compared to CNNs. Their analysis highlighted MLPs’ role as a
foundational model rather than a state-of-the-art solution for tasks requiring spatial invariance.(11)
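To make the parameter-count concern concrete, the sketch below counts weights and biases for a chain of dense layers, assuming the 784-128-64-10 layout used in the Implementation section of this report. The helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

mlp_params = dense_param_count([784, 128, 64, 10])
first_layer = 784 * 128 + 128            # parameters in the input layer alone
first_layer_share = first_layer / mlp_params

print(mlp_params)                         # 109386
```

Under these assumptions about 92% of the parameters sit in the first layer, because every one of the 784 pixels is connected to every hidden unit.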
Methodology
The Multi-Layer Perceptron (MLP) learns through an iterative process that involves:
1. Forward propagation.
2. Loss calculation.
3. Backward propagation (gradient computation).
4. Parameter updates using Adam optimization.
5. Performance evaluation.
Workflow and Mathematical Intuition
Epoch Initialization
Randomizes sample order to prevent sequence bias
Ensures different batch compositions each epoch
Batch Processing (Mini-Batch)
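Divides the training set into mini-batches of 32 samples (batch_size=32): 60,000 samples give 1,875 batches, while the 80/20 split of all 70,000 images used in the Implementation section gives 56,000 training samples and 1,750 batches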
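The shuffling and batching steps can be sketched on toy data as follows. The helper `num_batches`, the toy arrays, and the batch size of 4 are illustrative assumptions; only the sample counts and batch_size=32 come from this report.

```python
import math
import numpy as np

def num_batches(n_samples, batch_size):
    """Number of mini-batches per epoch, counting a final partial batch."""
    return math.ceil(n_samples / batch_size)

rng = np.random.default_rng(0)
X = np.arange(10).reshape(10, 1)      # toy inputs, one feature each
y = np.arange(10)                     # toy labels aligned with X

perm = rng.permutation(len(X))        # one shared permutation per epoch
X_shuf, y_shuf = X[perm], y[perm]     # inputs and labels stay paired

# Slice the shuffled data into batches of 4; the last batch is smaller
batches = [X_shuf[i:i + 4] for i in range(0, len(X_shuf), 4)]

print(num_batches(60000, 32))         # 1875
print(num_batches(56000, 32))         # 1750
```

Applying the same permutation to inputs and labels is what keeps each image attached to its label after shuffling.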
A. Forward Pass
Layer-by-layer calculation
Dense Layer
Z = XW+b
ReLU Activation
A = max(0, Z)
Softmax Activation (final layer)
$$\mathrm{softmax}(Z)_i = \frac{e^{Z_i}}{\sum_{j} e^{Z_j}}$$
Output: Probability distribution for each of 10 classes
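The forward pass above can be sketched on a tiny network. The layer sizes, random weights, and function names here are assumptions for illustration; subtracting the row maximum before exponentiating is a standard trick that leaves the softmax result unchanged while avoiding overflow.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0, z)

def softmax(z):
    """Row-wise softmax with the row maximum subtracted for stability."""
    shifted = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.random((2, 4))                                  # 2 samples, 4 features
W1, b1 = rng.normal(size=(4, 3)), np.zeros((1, 3))      # hidden layer
W2, b2 = rng.normal(size=(3, 5)), np.zeros((1, 5))      # output layer, 5 classes

A1 = relu(X @ W1 + b1)            # Z = XW + b, then A = max(0, Z)
P = softmax(A1 @ W2 + b2)         # each row is a probability distribution
```

Each row of `P` sums to 1, so it can be read as the predicted class probabilities for one sample.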
B. Loss Calculation
Categorical Cross-Entropy:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(p_{i,c})$$
Measures difference between predicted probabilities and true labels
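A small worked example of this loss follows, using made-up probabilities for two samples and three classes. It implements the formula as written (sum over classes, mean over the N samples); note that the function in the Implementation section averages over all N x C entries instead, which scales the value by 1/C.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-15):
    """Mean over samples of -sum_c y_c * log(p_c); clipping avoids log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Made-up one-hot labels and predicted probabilities
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])

loss = cross_entropy(y_true, y_pred)
expected = -(np.log(0.7) + np.log(0.8)) / 2   # only the true-class terms survive
```

Because the labels are one-hot, only the log-probability assigned to the correct class contributes for each sample.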
C. Backward Pass
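Output Layer Gradient
Simplified derivative for softmax + cross-entropy:
$$\frac{\partial L}{\partial Z^{[L]}} = \hat{Y} - Y$$
Hidden Layer Gradients:
ReLU derivative:
$$\frac{\partial A}{\partial Z} = \begin{cases} 1 & \text{if } Z > 0 \\ 0 & \text{otherwise} \end{cases}$$
Chain rule application (m is the number of samples in the batch):
$$\frac{\partial L}{\partial W^{[l]}} = \frac{1}{m} \left(A^{[l-1]}\right)^{T} \frac{\partial L}{\partial Z^{[l]}}$$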
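The simplified output-layer gradient can be checked numerically. The sketch below compares (Ŷ − Y)/m, the gradient of the batch-mean loss, against central finite differences on random logits; the toy shapes, seed, and step size h are assumptions for this example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

def mean_ce(z, y):
    """Batch-mean cross-entropy as a function of the logits z."""
    return -np.mean(np.sum(y * np.log(softmax(z)), axis=1))

rng = np.random.default_rng(0)
Z = rng.normal(size=(3, 4))              # logits: 3 samples, 4 classes
Y = np.eye(4)[[0, 2, 1]]                 # one-hot labels

# Analytic gradient: (Y_hat - Y), divided by m because the loss is a mean
analytic = (softmax(Z) - Y) / Z.shape[0]

# Central finite differences, one logit at a time
numeric = np.zeros_like(Z)
h = 1e-6
for i in range(Z.shape[0]):
    for j in range(Z.shape[1]):
        Zp, Zm = Z.copy(), Z.copy()
        Zp[i, j] += h
        Zm[i, j] -= h
        numeric[i, j] = (mean_ce(Zp, Y) - mean_ce(Zm, Y)) / (2 * h)

max_diff = np.max(np.abs(analytic - numeric))
```

A `max_diff` near floating-point noise indicates that the closed-form expression matches the true derivative.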
D. Parameter Updates (Adam Optimization)
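Update biased moment estimates
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$u_t = \beta_2 u_{t-1} + (1 - \beta_2) g_t^2$$
Compute bias-corrected estimates
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
$$\hat{u}_t = \frac{u_t}{1 - \beta_2^t}$$
Update parameters (α is the learning rate, ε a small constant for numerical stability)
$$\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{u}_t} + \epsilon}$$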
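A single Adam step on a two-element parameter vector is sketched below, following the standard update rule with the moment symbols m and u used above. The function name `adam_step` and the toy values are assumptions for this example.

```python
import numpy as np

def adam_step(theta, g, m, u, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameters and moment estimates."""
    m = beta1 * m + (1 - beta1) * g            # first moment (mean of gradients)
    u = beta2 * u + (1 - beta2) * g ** 2       # second moment (mean of squares)
    m_hat = m / (1 - beta1 ** t)               # bias correction for zero init
    u_hat = u / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(u_hat) + eps)
    return theta, m, u

theta = np.array([1.0, -2.0])                  # toy parameters
g = np.array([0.5, -4.0])                      # toy gradient
theta1, m1, u1 = adam_step(theta, g, np.zeros(2), np.zeros(2), t=1)
```

At t = 1 the bias-corrected estimates reduce to g and g squared, so each parameter moves by almost exactly the learning rate in the direction opposite to its gradient sign, regardless of the gradient's magnitude.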
Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sns

# Load MNIST (70,000 images), scale pixels to [0, 1], one-hot encode labels
mnist = fetch_openml('mnist_784', version=1, parser='auto')
X = mnist["data"].to_numpy() / 255.0
y = mnist["target"].to_numpy().astype(int).reshape(-1, 1)
encoder = OneHotEncoder(sparse_output=False)
y = encoder.fit_transform(y)
# 80/20 split: 56,000 training and 14,000 test images
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

class DenseLayer:
    def __init__(self, n_inputs, n_neurons, activation=None):
        # Small random weights, zero biases
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
        self.activation = activation

    def forward(self, inputs):
        self.inputs = inputs
        # Affine transform Z = XW + b
        self.output = np.dot(inputs, self.weights) + self.biases
        if self.activation == 'relu':
            self.output = np.maximum(0, self.output)
        elif self.activation == 'softmax':
            # Subtract the row maximum for numerical stability
            exp_values = np.exp(self.output - np.max(self.output, axis=1,
                                                     keepdims=True))
            self.output = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        return self.output
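
class NeuralNetwork:
    def __init__(self):
        self.layers = []

    def add_layer(self, layer):
        self.layers.append(layer)

    def forward(self, X):
        # Feed the output of each layer into the next
        for layer in self.layers:
            X = layer.forward(X)
        return X

    def predict(self, X):
        # Class with the highest predicted probability
        return np.argmax(self.forward(X), axis=1)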

class AdamOptimizer:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = {}   # first-moment estimates, one entry per layer
        self.v = {}   # second-moment estimates (u_t in the Methodology)
        self.t = 0    # step counter used for bias correction

    def update(self, layers):
        self.t += 1
        for i, layer in enumerate(layers):
            if i not in self.m:
                self.m[i] = np.zeros_like(layer.weights)
                self.v[i] = np.zeros_like(layer.weights)
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * layer.dweights
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * (layer.dweights ** 2)
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            layer.weights -= self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)
            # Biases take a plain gradient step (no moment estimates)
            layer.biases -= self.learning_rate * layer.dbiases
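
def categorical_crossentropy(y_true, y_pred):
    epsilon = 1e-15
    # Clip to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # np.mean averages over all N*C entries, so the value is 1/C (here 1/10)
    # of the per-sample cross-entropy defined in the Methodology
    return -np.mean(y_true * np.log(y_pred))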

# 784 -> 128 -> 64 -> 10 network
model = NeuralNetwork()
model.add_layer(DenseLayer(784, 128, activation='relu'))
model.add_layer(DenseLayer(128, 64, activation='relu'))
model.add_layer(DenseLayer(64, 10, activation='softmax'))
optimizer = AdamOptimizer(learning_rate=0.001)
epochs = 10
batch_size = 32
history = {'train_acc': [], 'test_acc': [], 'loss': []}

for epoch in range(epochs):
    # Reshuffle the training set each epoch
    permutation = np.random.permutation(X_train.shape[0])
    X_train_shuffled = X_train[permutation]
    y_train_shuffled = y_train[permutation]
    epoch_loss = []
    for i in range(0, X_train.shape[0], batch_size):
        X_batch = X_train_shuffled[i:i+batch_size]
        y_batch = y_train_shuffled[i:i+batch_size]
        output = model.forward(X_batch)
        loss = categorical_crossentropy(y_batch, output)
        epoch_loss.append(loss)
        # Output-layer gradient for softmax + cross-entropy (summed over the
        # batch, not divided by the batch size)
        error = output - y_batch
        model.layers[-1].dweights = np.dot(model.layers[-2].output.T, error)
        model.layers[-1].dbiases = np.sum(error, axis=0, keepdims=True)
        # Propagate the error back through the hidden layers
        for l in range(len(model.layers)-2, -1, -1):
            # ReLU derivative: 1 where the layer output is positive, else 0
            error = np.dot(error, model.layers[l+1].weights.T) * \
                    (model.layers[l].output > 0)
            inputs = X_batch if l == 0 else model.layers[l-1].output
            model.layers[l].dweights = np.dot(inputs.T, error)
            model.layers[l].dbiases = np.sum(error, axis=0, keepdims=True)
        optimizer.update(model.layers)
    # End-of-epoch metrics
    history['loss'].append(np.mean(epoch_loss))
    train_preds = model.predict(X_train)
    test_preds = model.predict(X_test)
    history['train_acc'].append(np.mean(train_preds == np.argmax(y_train, axis=1)))
    history['test_acc'].append(np.mean(test_preds == np.argmax(y_test, axis=1)))
    print(f"Epoch {epoch+1}/{epochs} - Loss: {history['loss'][-1]:.4f} | "
          f"Train Acc: {history['train_acc'][-1]:.4f} | "
          f"Test Acc: {history['test_acc'][-1]:.4f}")

test_predictions = model.predict(X_test)
final_accuracy = np.mean(test_predictions == np.argmax(y_test, axis=1))
print(f"\nFinal Test Accuracy: {final_accuracy:.4f}")

def plot_metrics(history):
    plt.figure(figsize=(15, 5))
    # Left panel: accuracy per epoch
    plt.subplot(1, 2, 1)
    plt.plot(history['train_acc'], label='Train Accuracy', marker='o')
    plt.plot(history['test_acc'], label='Test Accuracy', marker='o')
    plt.title('Accuracy Evolution', fontsize=14)
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.grid(True)
    plt.xticks(range(epochs), range(1, epochs+1))
    # Right panel: training loss per epoch
    plt.subplot(1, 2, 2)
    plt.plot(history['loss'], label='Loss', color='red', marker='o')
    plt.title('Training Loss', fontsize=14)
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.grid(True)
    plt.xticks(range(epochs), range(1, epochs+1))
    plt.tight_layout()
    plt.savefig('training_metrics.png', dpi=300)
    plt.show()

plot_metrics(history)

def plot_samples(X, y, preds, num=12):
    plt.figure(figsize=(15, 7))
    plt.suptitle('Sample Predictions', fontsize=16)
    # Random test images; the 3x4 grid holds up to 12 samples
    indices = np.random.choice(len(X), num)
    for i, idx in enumerate(indices):
        plt.subplot(3, 4, i+1)
        plt.imshow(X[idx].reshape(28, 28), cmap='gray')
        # Green title for a correct prediction, red for a wrong one
        plt.title(f"Pred: {preds[idx]}\nTrue: {np.argmax(y[idx])}",
                  color='green' if preds[idx] == np.argmax(y[idx]) else 'red')
        plt.axis('off')
    plt.tight_layout()
    plt.savefig('sample_predictions.png', dpi=300)
    plt.show()

plot_samples(X_test, y_test, test_predictions)

def visualize_pixel_values(index=1):
    img = X_test[index].reshape(28, 28)
    true_label = np.argmax(y_test[index])
    pred_label = model.predict(X_test[index:index+1])[0]
    fig = plt.figure(figsize=(8, 8))
    ax = fig.add_subplot(111)
    ax.imshow(img, cmap='gray')
    ax.set_title(f"True: {true_label} | Pred: {pred_label}", fontsize=14)
    n_rows, n_cols = img.shape
    thresh = img.max() / 2.5
    # Write each pixel's intensity on top of the image
    for row in range(n_rows):
        for col in range(n_cols):
            val = round(img[row][col], 2) if img[row][col] != 0 else 0
            ax.annotate(str(val), xy=(col, row),
                        horizontalalignment='center',
                        verticalalignment='center',
                        color='white' if img[row][col] < thresh else 'black',
                        fontsize=8)
    plt.axis('off')
    plt.show()

visualize_pixel_values(index=1)

# Rows are true digits, columns are predicted digits
cm = confusion_matrix(np.argmax(y_test, axis=1), test_predictions)
plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix', fontsize=16)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.savefig('confusion_matrix.png', dpi=300)
plt.show()

# Save the trained parameters of all three layers
np.savez('mnist_model.npz',
         weights0=model.layers[0].weights,
         biases0=model.layers[0].biases,
         weights1=model.layers[1].weights,
         biases1=model.layers[1].biases,
         weights2=model.layers[2].weights,
         biases2=model.layers[2].biases)
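The saved arrays can be read back with `np.load` and used for inference without the training classes. The sketch below is one possible way to do this, assuming the `weights{i}`/`biases{i}` key names from the save call above; the helpers `load_params` and `predict` are hypothetical, and the round trip uses a small toy model in a temporary directory rather than the real `mnist_model.npz`.

```python
import os
import tempfile
import numpy as np

def load_params(path, n_layers=3):
    """Read (weights, biases) pairs stored under weights{i}/biases{i} keys."""
    with np.load(path) as data:
        return [(data[f"weights{i}"], data[f"biases{i}"]) for i in range(n_layers)]

def predict(params, X):
    """ReLU hidden layers, then argmax of the final logits.

    Softmax is skipped because it does not change which class is largest.
    """
    a = X
    for W, b in params[:-1]:
        a = np.maximum(0, a @ W + b)
    W, b = params[-1]
    return np.argmax(a @ W + b, axis=1)

# Round trip with a toy two-layer model (4 -> 3 -> 2), not the trained network
rng = np.random.default_rng(0)
toy = {"weights0": rng.normal(size=(4, 3)), "biases0": np.zeros((1, 3)),
       "weights1": rng.normal(size=(3, 2)), "biases1": np.zeros((1, 2))}
path = os.path.join(tempfile.mkdtemp(), "toy_model.npz")
np.savez(path, **toy)

params = load_params(path, n_layers=2)
preds = predict(params, rng.random((5, 4)))
```

For the model trained above, `load_params('mnist_model.npz')` followed by `predict(params, X_test)` would be the corresponding call, provided the file has been written.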
Results
The model achieved a final test accuracy of 97.10% and a cross-entropy loss of 0.0025. Training accuracy plateaued at
97.97%.
Figure 1. Training and test accuracy and training loss over 10 epochs.
Figure 2. Sample Predictions on the Test Set.
Each image shows a digit from the test set along with its predicted and true labels. Most predictions match the
true labels, indicating good model performance.
Figure 3. Confusion Matrix of digit classifier
Each row represents the actual digit, and each column represents the predicted digit. Diagonal values show correct
predictions, while off-diagonal values indicate misclassification.
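Per-digit performance can be read off a confusion matrix laid out this way (rows actual, columns predicted) by dividing each diagonal entry by its row sum. The sketch below uses a made-up 3-class matrix, not the results of this study, and the helper name `per_class_recall` is hypothetical.

```python
import numpy as np

def per_class_recall(cm):
    """Fraction of each actual class that was predicted correctly."""
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)

# Made-up counts for illustration only
cm = np.array([[50, 2, 0],
               [3, 45, 2],
               [0, 5, 43]])

recall = per_class_recall(cm)          # one value per actual class
overall = np.trace(cm) / cm.sum()      # overall accuracy

print(round(overall, 2))               # 0.92
```

Applied to the 10x10 matrix from `confusion_matrix`, the same computation shows which digits the classifier misses most often.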
Figure 4. Pixel Intensity Heatmap of a Digit "4".
This visualization shows grayscale intensity values (from 0 to 1) of each pixel in the image. The model correctly
predicted the digit as 4.
Conclusion
The Multi-Layer Perceptron (MLP) trained consistently and achieved a test accuracy of
97.10%. With the Adam optimizer and the categorical cross-entropy loss, the training and test metrics
over 10 epochs show efficient gradient updates and stable learning dynamics. The close alignment of
training accuracy (97.97%) and test accuracy suggests that the architecture, with two hidden layers of
128 and 64 neurons and ReLU activations, strikes a balance between
generalization and capacity. This performance suggests that techniques such as adaptive moment
estimation and softmax-based probability outputs are effective for digit classification.
References
1. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors.
Nature.
2. LeCun, Y., Boser, B., Denker, J. S., et al. (1989). Backpropagation applied to handwritten zip code recognition.
Neural Computation.
3. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document
recognition. Proceedings of the IEEE.
4. Simard, P. Y., Steinkraus, D., & Platt, J. C. (2003). Best practices for convolutional neural networks applied to
visual document analysis. ICDAR.
5. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
AISTATS.
6. Tang, Y. (2013). Deep learning using linear support vector machines. ICML Workshop.
7. Srivastava, N., Hinton, G., Krizhevsky, A., et al. (2014). Dropout: A simple way to prevent neural networks from
overfitting. JMLR.
8. Simard, P. Y., Steinkraus, D., & Platt, J. C. (2003). Best practices for convolutional neural networks applied to
visual document analysis. ICDAR.
9. Sabour, S., Frosst, N., & Hinton, G. E. (2017). Dynamic routing between capsules. NeurIPS.
10. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. ICLR.
11. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
12. Cireşan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification.
CVPR.