Multi-Layer Perceptron (MLP) Implementation in
TensorFlow 2
This Python script implements a multi-layer perceptron (MLP), a type of
feedforward neural network, using TensorFlow 2 to perform multi-class
classification on the MNIST dataset of handwritten digits (0-9). The MLP
consists of two hidden layers and an output layer. Unlike a single-layer
perceptron, it can capture non-linear patterns in the data.
Below is a detailed breakdown of the code, its components, and the rationale
behind its design.
1. Importing Libraries
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist
• tensorflow: The core library for building, training, and evaluating neural
networks in TensorFlow 2.
• numpy: Provides efficient numerical operations for array manipulation,
used here for preprocessing and data handling.
• matplotlib.pyplot: Used to visualize the training loss over epochs, aiding
in performance analysis.
• mnist: A built-in TensorFlow module providing access to the MNIST
dataset, which contains 60,000 training and 10,000 test grayscale images
(28x28 pixels) of handwritten digits (0-9).
2. Loading and Preprocessing the Data
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)
• Loading the MNIST Dataset:
o The mnist.load_data() function returns two tuples: (x_train,
y_train) for training (60,000 samples) and (x_test, y_test) for
testing (10,000 samples).
o Each image is a 28x28 grayscale array, and each label is an integer
(0-9) representing the digit.
• Preprocessing the Input Data (x_train, x_test):
o reshape(-1, 784): Flattens each 28x28 image into a 1D array of 784
values (28 × 28 = 784) to match the input requirements of a fully
connected layer.
o astype("float32"): Converts pixel values from 8-bit unsigned integers
to 32-bit floats, the default floating-point type of Keras layers.
o / 255.0: Normalizes pixel values from the range [0, 255] to [0, 1],
which improves training convergence by scaling inputs to a
consistent range.
• Preprocessing the Labels (y_train, y_test):
o to_categorical(y_train, 10): Converts integer labels (e.g., 5) into
one-hot encoded vectors (e.g., [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]). This is
necessary for multi-class classification with categorical cross-
entropy loss.
• Purpose of Preprocessing:
o Flattening ensures compatibility with the dense layers.
o Normalization and type conversion enhance training stability and
efficiency.
o One-hot encoding aligns the labels with the model’s output format
(probabilities for 10 classes).
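The preprocessing steps above can be tried without downloading MNIST. The sketch below uses plain NumPy to apply the same flatten, cast, scale, and one-hot steps to a small random stand-in batch. The arrays fake_images and fake_labels are made up for illustration, and indexing into np.eye is used here as an assumed equivalent of to_categorical, not as the script's own method.

```python
import numpy as np

# Hypothetical stand-in for a few MNIST images: uint8 pixels in [0, 255].
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)
fake_labels = np.array([5, 0, 9, 3])

# Flatten each 28x28 image to 784 values, cast to float32, scale to [0, 1].
x = fake_images.reshape(-1, 784).astype("float32") / 255.0

# One-hot encode: row i of the 10x10 identity matrix encodes class i.
y = np.eye(10, dtype="float32")[fake_labels]

print(x.shape, x.dtype)
print(y[0])  # one-hot vector for label 5
```

Each row of y contains a single 1 at the index of its label, which is the format the categorical cross-entropy loss expects.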
3. Defining Parameters
learning_rate = 0.001
training_epochs = 20
batch_size = 100
n_hidden_1 = 256 # 1st layer num features
n_hidden_2 = 256 # 2nd layer num features
n_input = 784 # MNIST data input (img shape: 28*28)
n_classes = 10 # MNIST total classes (0-9 digits)
• learning_rate = 0.001: Controls the step size of weight updates during
optimization. A smaller learning rate makes convergence more stable but
may require more epochs.
• training_epochs = 20: Specifies 20 complete passes through the training
dataset, balancing training time and model performance.
• batch_size = 100: Divides the training data into mini-batches of 100
samples for gradient updates, offering a trade-off between
computational efficiency and gradient accuracy.
• n_hidden_1 = 256, n_hidden_2 = 256: Defines the number of neurons in
the first and second hidden layers, respectively. These layers enable the
model to learn complex, non-linear patterns.
• n_input = 784: Matches the flattened image size (28 × 28).
• n_classes = 10: Corresponds to the 10 digit classes (0-9).
• Design Choice:
o The choice of 256 neurons per hidden layer provides sufficient
capacity to model the complexity of MNIST digits while keeping
computational costs manageable.
o A batch size of 100 is a common choice for MNIST, balancing
memory usage and training stability.
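To make the interaction between batch size and epoch count concrete, the short calculation below derives how many weight updates these settings imply. It is only arithmetic on the values listed above; the names n_train, steps_per_epoch, and total_updates are introduced here for illustration.

```python
import math

# Values taken from the parameter list above.
n_train = 60_000        # MNIST training samples
batch_size = 100
training_epochs = 20

# Number of mini-batches (weight updates) in one pass over the data.
steps_per_epoch = math.ceil(n_train / batch_size)

# Total weight updates over the whole run.
total_updates = steps_per_epoch * training_epochs

print(steps_per_epoch)  # 600
print(total_updates)    # 12000
```

Halving the batch size would double the number of updates per epoch, at the cost of noisier gradient estimates.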
4. Building the Model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(n_hidden_1, activation='sigmoid',
                          input_shape=(n_input,)),
    tf.keras.layers.Dense(n_hidden_2, activation='sigmoid'),
    tf.keras.layers.Dense(n_classes)
])
• The model is constructed using Keras' Sequential API, which allows
stacking layers in a linear sequence:
o First Hidden Layer: A Dense (fully connected) layer with 256 units,
sigmoid activation, and an input shape of (784,). The sigmoid
function maps inputs to the open interval (0, 1), introducing
non-linearity.
o Second Hidden Layer: Another Dense layer with 256 units and
sigmoid activation, further transforming the features learned in
the first layer.
o Output Layer: A Dense layer with 10 units (one per digit class),
outputting raw logits (no activation applied here, as softmax is
handled implicitly by the loss function).
• Architecture Overview:
o This MLP has three layers: two hidden layers (256 neurons each)
and one output layer (10 neurons).
o Unlike a single-layer perceptron, the hidden layers allow the
model to learn hierarchical feature representations, making it
capable of handling non-linearly separable data.
o The sigmoid activation is used for historical reasons and simplicity,
though modern MLPs often use ReLU for faster convergence.
• Parameters:
o First layer: (784 × 256 + 256 biases) = 200,960 parameters.
o Second layer: (256 × 256 + 256 biases) = 65,792 parameters.
o Output layer: (256 × 10 + 10 biases) = 2,570 parameters.
o Total: 200,960 + 65,792 + 2,570 = 269,322 trainable parameters.
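The layer shapes and parameter counts above can be checked with a small NumPy sketch of the same architecture. This is not the Keras model itself: the random initial weights, the params list, and the forward helper are assumptions made for illustration, and Keras initializes its weights differently.

```python
import numpy as np

n_input, n_hidden_1, n_hidden_2, n_classes = 784, 256, 256, 10
layer_sizes = [n_input, n_hidden_1, n_hidden_2, n_classes]

rng = np.random.default_rng(0)
# One (weights, biases) pair per Dense layer; small random values for illustration.
params = [
    (rng.normal(0.0, 0.05, size=(fan_in, fan_out)), np.zeros(fan_out))
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Return logits: sigmoid on hidden layers, no activation on the output."""
    *hidden, (w_out, b_out) = params
    for w, b in hidden:
        x = sigmoid(x @ w + b)
    return x @ w_out + b_out

per_layer = [w.size + b.size for w, b in params]
total_params = sum(per_layer)

logits = forward(rng.random((5, n_input)), params)
print(per_layer, total_params)
print(logits.shape)  # one row of 10 logits per input sample
```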
5. Compiling the Model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
• optimizer=Adam(...): Uses the Adam optimizer, an adaptive gradient-based
method that combines momentum with RMSProp-style per-parameter
scaling. It is well-suited for deep learning due to its efficiency and
robustness.
• learning_rate=0.001: A standard choice for Adam, balancing speed and
stability.
• loss=CategoricalCrossentropy(from_logits=True): Computes the
categorical cross-entropy loss between one-hot encoded labels and the
model’s logits output. The from_logits=True setting tells the loss to
apply softmax internally, which is more numerically stable than applying
softmax in the model and taking the logarithm afterwards.
• metrics=['accuracy']: Tracks the classification accuracy (fraction of
correctly classified samples) during training and evaluation.
• Why Adam?: Compared to SGD (used in the single-layer perceptron),
Adam adapts the learning rate for each parameter, leading to faster
convergence, especially for deeper networks like this MLP.
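To illustrate why computing the loss from logits is numerically safer, the sketch below implements softmax cross-entropy in NumPy using the standard max-subtraction trick. The function cross_entropy_from_logits is a hypothetical helper written for this explanation; it shows the idea and is not a copy of TensorFlow's internal implementation.

```python
import numpy as np

def cross_entropy_from_logits(logits, one_hot):
    """Mean categorical cross-entropy computed directly from raw logits."""
    # Subtracting the row maximum leaves softmax unchanged but avoids overflow in exp.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(one_hot * log_probs).sum(axis=1).mean())

logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 0.0, -1000.0]])   # second row would overflow a naive exp
labels = np.array([[1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0]])

loss = cross_entropy_from_logits(logits, labels)
print(loss)
```

A naive version that first computes exp(1000) would overflow to infinity on the second row, whereas the shifted form stays finite.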
6. Training the Model
avg_set = []    # training loss per epoch
epoch_set = []  # epoch numbers (1-based)

class LossHistory(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        avg_set.append(logs['loss'])
        epoch_set.append(epoch + 1)
        if epoch % 1 == 0:  # always true, so every epoch is printed
            print(f"Epoch: {epoch + 1:04d}, cost={logs['loss']:.9f}")

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=training_epochs,
                    verbose=0,
                    callbacks=[LossHistory()])
• Initialization:
o avg_set: Stores the average training loss for each epoch, used for
plotting.
o epoch_set: Stores epoch numbers (1 to 20) for plotting.
• Custom Callback (LossHistory):
o A custom Keras callback that runs at the end of each epoch.
o Records the epoch’s loss (logs['loss']) in avg_set.
o Stores the epoch number (epoch + 1) in epoch_set.
o Prints the epoch number and loss every epoch (e.g., Epoch: 0001,
cost=0.123456789).
• Training Process:
o model.fit: Trains the model for 20 epochs, processing the training
data in mini-batches of 100 samples.
o verbose=0: Disables default Keras logging to rely on the custom
callback for output.
o The training loop updates the model’s weights using
backpropagation and the Adam optimizer, minimizing the
categorical cross-entropy loss.
• Batch Processing:
o The training set (60,000 samples) is divided into 600 batches
(60,000 / 100 = 600).
o Each epoch processes all 600 batches, computing gradients and
updating weights for each batch.
• Output:
o The callback prints the loss for each epoch, providing insight into
the model’s learning progress.
o Example output: Epoch: 0001, cost=0.123456789, showing how
the loss decreases as training progresses.
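Keras handles batching inside fit, but the idea can be sketched in a few lines of NumPy. The generator iterate_minibatches below is a hypothetical helper, not Keras's actual implementation; it assumes the data is reshuffled once per epoch, which matches the default shuffle=True behavior of fit.

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that cover a shuffled dataset once (one epoch)."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(iterate_minibatches(60_000, 100, rng))

print(len(batches))       # number of weight updates in one epoch
print(len(batches[0]))    # samples per update
```

In a hand-written training loop, each yielded index array would select the rows of x_train and y_train used for one gradient step.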
7. Evaluating the Model
test_loss, test_accuracy = model.evaluate(x_test, y_test, verbose=0)
print(f"MODEL accuracy: {test_accuracy:.4f}")
• Evaluation:
o model.evaluate: Computes the loss and accuracy on the test set
(10,000 samples).
o verbose=0: Suppresses detailed evaluation logs, returning only the
final metrics.
• Output:
o Prints the test accuracy (e.g., MODEL accuracy: 0.9780), indicating
the fraction of correctly classified test samples.
o The test loss is also computed but not printed, though it could be
accessed via test_loss.
• Significance:
o The test accuracy reflects the model’s generalization to unseen
data, a critical measure of performance.
o For MNIST, an MLP with this architecture typically achieves ~97-
98% accuracy, competitive but slightly below modern
convolutional neural networks (CNNs).
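Because the output layer produces logits rather than probabilities, accuracy is determined by which logit is largest in each row. The NumPy sketch below shows that computation on made-up values; accuracy_from_logits is a hypothetical helper and the logits are invented for illustration.

```python
import numpy as np

def accuracy_from_logits(logits, one_hot_labels):
    """Fraction of rows whose largest logit matches the one-hot label."""
    predicted = logits.argmax(axis=1)
    actual = one_hot_labels.argmax(axis=1)
    return float((predicted == actual).mean())

# Hypothetical logits for four samples and three classes.
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.5, 0.3],
                   [0.1, 0.2, 3.0],
                   [1.0, 2.0, 0.5]])
labels = np.eye(3)[[0, 1, 2, 0]]   # true classes: 0, 1, 2, 0

print(accuracy_from_logits(logits, labels))  # 0.75
```

Softmax does not change the position of the largest value, so applying it before argmax would give the same predictions.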
8. Plotting the Training Loss
plt.plot(epoch_set, avg_set, 'o', label='MLP Training phase')
plt.ylabel('cost')
plt.xlabel('epoch')
plt.legend()
plt.show()
• Visualization:
o Uses Matplotlib to plot the training loss (avg_set) against epoch
numbers (epoch_set).
o The 'o' argument draws a circular marker at each data point, with
no connecting line, so each epoch's loss is easy to read off.
o Labels the y-axis as “cost” (loss) and the x-axis as “epoch.”
o Adds a legend to identify the plot as the “MLP Training phase.”
• Purpose:
o The plot visualizes the model’s learning curve, showing how the
loss decreases over epochs.
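o A downward trend indicates successful training. A plateau may
suggest insufficient capacity or a learning rate that is too low,
while spikes often point to a learning rate that is too high.
Overfitting cannot be detected from the training loss alone; it
requires comparing against a validation or test loss.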
• Expected Outcome:
o The loss typically decreases steadily, reflecting the model’s ability
to fit the training data.
9. What the Code Does
The script implements a multi-layer perceptron to classify MNIST handwritten
digits (0-9). Key features include:
• Architecture: Two hidden layers (256 neurons each with sigmoid
activation) and an output layer (10 neurons, logits output).
• Training: Uses the Adam optimizer to minimize categorical cross-entropy
loss over 20 epochs with a batch size of 100.
• Data Processing: Flattens and normalizes MNIST images and converts
labels to one-hot encoded vectors.
• Evaluation: Measures test accuracy to assess generalization (typically
~97-98%).
• Visualization: Plots the training loss to monitor convergence.
• Improvements Over Single-Layer Perceptron:
o The addition of hidden layers enables the MLP to learn non-linear
relationships, unlike the single-layer perceptron, which is limited
to linearly separable data.
o The Adam optimizer and sigmoid activations improve training
efficiency and expressiveness compared to the simpler SGD-based
single-layer model.
10. Additional Context
• MNIST Dataset:
o A benchmark dataset in machine learning, consisting of 70,000
grayscale images (60,000 training, 10,000 test) of handwritten
digits.
o Each image is 28x28 pixels, and labels are integers (0-9).
o Widely used for testing classification algorithms due to its
simplicity and well-understood properties.
• Why MLP?:
o MLPs work well on small, fixed-size inputs such as flattened
MNIST images and serve as a foundational neural network model.
o While not as powerful as CNNs for image data, MLPs are simpler
to implement and understand, making them ideal for educational
purposes.
• Limitations:
o The use of sigmoid activations can lead to vanishing gradients,
slowing training compared to ReLU.
o The model may not achieve state-of-the-art accuracy (~99%+ with
CNNs) due to its inability to exploit spatial structure in images.
o Overfitting is a risk as networks grow deeper, though the 60,000
MNIST training samples make it less severe for a model of this size.
• Potential Improvements:
o Replace sigmoid with ReLU activation for faster convergence.
o Increase the number of epochs or adjust the learning rate for
better performance.
o Use a CNN for higher accuracy on image data.
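The vanishing-gradient limitation mentioned above comes from the shape of the sigmoid derivative. The sketch below compares the derivatives of sigmoid and ReLU in NumPy; the helper names are introduced here for illustration, and the final figure is a simplified best case that ignores the weight matrices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z == 0

def relu_grad(z):
    return (z > 0).astype(float)  # exactly 1 for positive inputs

z = np.array([-4.0, 0.0, 4.0])
print(sigmoid_grad(z))
print(relu_grad(z))

# Best case for sigmoid: the gradient is scaled by at most 0.25 per layer.
print(0.25 ** 2)   # 0.0625 across the two hidden layers of this MLP
```

With ReLU, active units pass gradients through unscaled, which is one reason it tends to converge faster in deeper networks.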
11. Example Output
During training, the script produces output like:
Epoch: 0001, cost=0.623456789
Epoch: 0002, cost=0.345678901
...
Epoch: 0020, cost=0.098765432
Training phase finished
MODEL accuracy: 0.9780
The plot displays a decreasing loss curve, confirming that the model learns
effectively. The final test accuracy (~97-98%) indicates strong performance on
the MNIST classification task.