UNIT-5
Practical Methodology
Contents
● Performance Metrics
● Default Baseline Models
● Determining Whether to Gather More Data
● Selecting Hyperparameters
● Debugging Strategies
● Example: Multi-Digit Number Recognition
● Applications: Vision, NLP, Speech
Performance metrics
Performance metrics are quantitative measures used to evaluate how well a machine learning
model is performing. They help you understand:
● Whether your model is making accurate predictions
● How good or bad your model is under different conditions (e.g., class imbalance, regression vs
classification)
● What kind of errors it’s making, and how serious those errors are
Different metrics used
● Confusion Matrix
● Accuracy
● Precision
● Recall or Sensitivity
● Specificity
● F1 Score
● Log Loss
● Area under the curve (AUC)
● MAE – Mean Absolute Error
● MSE – Mean Squared Error
Confusion Matrix
A confusion matrix counts predictions against actual classes. For a binary problem it has four cells:
● True Positives (TP): The actual class of the data point is 1 (True), and the model also predicts 1 (True).
Example: A person actually has cancer (1), and the model classifies the case as cancer (1).
Result: True Positive
● True Negatives (TN): The actual class is 0 (False), and the predicted class is also 0 (False).
Example: A person does not have cancer (0), and the model classifies the case as not cancer (0).
Result: True Negative
● False Positives (FP): The actual class of the data point is 0 (False), but the predicted class is 1 (True).
"False" because the prediction is incorrect; "Positive" because the model predicted the positive class.
Example: A person does not have cancer, but the model classifies the case as cancer.
Result: False Positive
● False Negatives (FN): The actual class is 1 (True), but the model predicts 0 (False).
"False" because the prediction is wrong; "Negative" because the model incorrectly predicted the negative class.
Example: A person has cancer, but the model classifies the case as no cancer.
Result: False Negative
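The four cells can be counted directly from paired labels. The sketch below is a minimal illustration; the function name `confusion_counts` and the eight screening results are hypothetical and not part of the original notes.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1          # has cancer, predicted cancer
        elif actual == 0 and predicted == 0:
            tn += 1          # no cancer, predicted no cancer
        elif actual == 0 and predicted == 1:
            fp += 1          # no cancer, predicted cancer
        else:
            fn += 1          # has cancer, predicted no cancer
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical screening results for eight people.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
```

Every metric in the following sections is a ratio built from these four counts.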
Accuracy
● Accuracy in classification problems refers to the proportion of correct predictions made by the model
out of all predictions made.
● Formula: (TP + TN) / (TP + FP + FN + TN)
● Numerator: Includes only the correct predictions: True Positives (TP) and True Negatives (TN).
(These are highlighted in red in the figure.)
● Denominator: Includes all predictions, both correct and incorrect: TP, FP, FN, TN.
When to Use Accuracy
Accuracy is a good performance measure when the target variable classes are nearly balanced.
Example:
60% of classes in a fruit image dataset are apples, and 40% are oranges.
A model that predicts correctly 97% of the time whether the fruit is an apple or orange performs well here.
In such balanced datasets, accuracy is a meaningful metric.
When Not to Use Accuracy
Accuracy is not reliable when the dataset is highly imbalanced, i.e., one class heavily outnumbers the other.
Example:
In a cancer detection dataset with 100 people, only 5 have cancer.
Let’s say a poor model predicts everyone as Non-Cancer:
● It correctly classifies the 95 healthy people.
● It misses all 5 cancer patients, classifying them as Non-Cancer.
Despite this, the model’s accuracy is 95%, which looks good but is misleading, because the model
completely fails to identify cancer.
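The imbalanced example can be checked with a few lines of arithmetic. This is a small sketch; the helper name `accuracy` is an assumption, and the counts are the ones from the cancer example (100 people, 5 with cancer, everyone predicted Non-Cancer).

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# The model predicts Non-Cancer for all 100 people.
tp, fn = 0, 5    # all 5 cancer patients are missed
tn, fp = 95, 0   # all 95 healthy people are classified correctly

acc = accuracy(tp, tn, fp, fn)
cancer_cases_found = tp
print(f"accuracy = {acc:.2f}, cancer cases found = {cancer_cases_found}")
```

The accuracy is high while the number of detected cancer cases is zero, which is why accuracy alone is not enough on imbalanced data.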
Precision
Precision is a performance metric that tells us what proportion of positive predictions are actually positive.
In the cancer example: of the people predicted as having cancer, what proportion actually do have cancer?
● Formula: TP / (TP + FP)
● It focuses only on the positive predictions made by the model.
Example:
In a cancer prediction dataset of 100 people, only 5 actually have cancer.
● Let’s assume the model is very poor and predicts everyone as having cancer.
● That means:
TP + FP = 100 (all 100 are predicted as positive)
Only 5 of them actually have cancer → TP = 5, FP = 95
Precision = 5 / (5 + 95) = 5%
Even though the model catches all the true cases, the precision is very low because it wrongly predicts many
people as positive.
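The same example as a short calculation, using precision = TP / (TP + FP). The function name `precision` and the choice to return 0.0 when the model makes no positive predictions are assumptions made for this sketch.

```python
def precision(tp, fp):
    """Share of positive predictions that are truly positive."""
    if tp + fp == 0:
        return 0.0  # assumed convention: no positive predictions were made
    return tp / (tp + fp)


# Everyone is predicted as having cancer: 5 are right, 95 are wrong.
prec = precision(tp=5, fp=95)
print(f"precision = {prec:.2f}")
```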
Recall or Sensitivity (True Positive Rate)
● Definition: Recall, also known as sensitivity or true positive rate (TPR), measures the model's ability to
identify all relevant instances of a particular class. It answers the question, "Of all the actual positives, how
many were correctly predicted by the model?".
● Formula: TP / (TP + FN)
○ TP: True Positives (correctly predicted positive cases)
○ FN: False Negatives (actual positives incorrectly predicted as negatives)
● Example: In a spam detection system, recall measures the percentage of all actual spam emails that were
correctly identified as spam.
● When to Use: Crucial when minimizing false negatives is important. For example, in disease detection,
missing a positive case (false negative) can have serious consequences.
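A brief sketch of the recall formula. The spam counts are hypothetical; the second call reuses the earlier cancer example in which a poor model predicts everyone as positive.

```python
def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)


# Hypothetical spam filter: 50 spam emails, 40 caught, 10 missed.
spam_recall = recall(tp=40, fn=10)

# Cancer example where everyone is predicted positive: no case is missed.
all_positive_recall = recall(tp=5, fn=0)

print(spam_recall, all_positive_recall)
```

The second result shows why recall should not be read alone: a model that labels everyone positive reaches perfect recall while its precision is only 5%.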
Specificity (True Negative Rate)
● Definition: Specificity measures the model's ability to correctly identify the instances that do not
belong to the positive class. It's the proportion of actual negatives that are correctly identified as
negative.
● Formula: TN / (TN + FP)
○ TN: True Negatives (correctly predicted negative cases)
○ FP: False Positives (actual negatives incorrectly predicted as positives)
● Example: In a medical diagnostic test, specificity indicates how well the test can correctly identify
healthy individuals as negative for the disease.
● When to Use: Important when minimizing false positives is critical, for instance, in diagnostic tests
where a false positive might lead to unnecessary further investigations or treatment.
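A small sketch of specificity and its complement, the false positive rate, which is the quantity plotted on the x-axis of a ROC curve. The counts (100 healthy people, 10 flagged by mistake) are hypothetical.

```python
def specificity(tn, fp):
    """Share of actual negatives correctly identified as negative."""
    return tn / (tn + fp)


def false_positive_rate(tn, fp):
    """Share of actual negatives wrongly flagged as positive (1 - specificity)."""
    return fp / (tn + fp)


# Hypothetical diagnostic test on 100 healthy people.
tn, fp = 90, 10
spec = specificity(tn, fp)
fpr = false_positive_rate(tn, fp)
print(spec, fpr)
```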
F1 Score
● Definition: The F1 Score is the harmonic mean of precision and recall. It provides a balanced measure,
especially useful when dealing with imbalanced datasets where accuracy can be misleading.
● Formula: 2 * (Precision * Recall) / (Precision + Recall)
○ Precision: TP / (TP + FP) - Measures the accuracy of positive predictions.
○ Recall: TP / (TP + FN) - Measures the completeness of positive predictions.
● When to Use: Ideal when you need to balance between false positives and false negatives, particularly
in situations with imbalanced datasets where both precision and recall are important, like fraud
detection, medical diagnosis, or spam filtering.
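The harmonic mean is dominated by the smaller of the two values. The sketch below compares it with the arithmetic mean, using the precision (5%) and recall (100%) of the "predict everyone as cancer" model from the earlier example; the function name and the zero-division convention are assumptions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.05, 1.0
f1 = f1_score(precision, recall)
arithmetic_mean = (precision + recall) / 2

print(f"F1 = {f1:.3f}, arithmetic mean = {arithmetic_mean:.3f}")
```

The arithmetic mean would suggest a mediocre model, while the F1 score stays close to the poor precision.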
Log Loss (Binary Cross-Entropy Loss)
● Definition: Log loss, also known as binary cross-entropy loss, measures the performance of a
classification model that outputs a prediction with a probability value between 0 and 1. It
quantifies the difference between the predicted probabilities and the true labels, penalizing the
model for assigning low probabilities to the correct classes. A lower log loss value indicates
better performance.
● Formula (Binary Classification, per example): -(y * log(p) + (1 - y) * log(1 - p)); the log loss of a
dataset is the average of this value over all examples.
○ y: The true label (0 or 1).
○ p: The predicted probability for the positive class (between 0 and 1).
● When to Use: When you need to evaluate the uncertainty of a model's predictions and compare
models based on how well-calibrated their probability estimates are.
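A minimal sketch of the binary log loss averaged over a dataset. Clipping the probabilities with a small `eps` is an assumption added here to avoid log(0); the labels and probabilities are made up for illustration.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; p_pred holds P(class = 1) per example."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1, 0]
confident_right = log_loss(y_true, [0.9, 0.1, 0.9, 0.1])
unsure = log_loss(y_true, [0.5, 0.5, 0.5, 0.5])
confident_wrong = log_loss(y_true, [0.1, 0.9, 0.1, 0.9])

print(confident_right, unsure, confident_wrong)
```

The three cases are ordered as expected: confident and correct predictions give the lowest loss, and confident but wrong predictions are penalized most heavily.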
Area Under the Curve (AUC) - ROC Curve
● Definition: AUC, or Area Under the Curve, is a performance metric that measures a model's ability to
distinguish between classes. It is the area under the Receiver Operating Characteristic (ROC) curve, which
plots the true positive rate (recall/sensitivity) against the false positive rate (1 - specificity) at various
threshold settings.
● Range: AUC values range from 0 to 1.
○ 0.5 indicates random guessing.
○ 1.0 indicates perfect discrimination.
● When to Use: Useful for comparing the performance of different models, particularly in binary classification
tasks and when dealing with imbalanced datasets.
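AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it from that pairwise definition, with ties counted as half, rather than by integrating the ROC curve; the function name and the four scores are illustrative assumptions.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

The pairwise loop is quadratic in the number of examples, so it is only meant for small illustrations.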
MAE – Mean Absolute Error
● Definition: MAE measures the average of the absolute differences between predicted values and
actual values in a regression problem. It provides a direct interpretation of how far off the
predictions are on average.
● Formula: (1/n) * Σ |actual - predicted|
● When to Use: In regression tasks, when the dataset may contain outliers, as MAE is less
sensitive to them compared to MSE.
MSE – Mean Squared Error
● Definition: MSE calculates the average of the squared differences between predicted values and actual
values in a regression problem. It penalizes larger errors more heavily due to the squaring of the
differences.
● Formula: (1/n) * Σ (actual - predicted)^2
● When to Use: In regression tasks when larger errors are particularly undesirable, such as when
predicting house prices or optimizing pricing models.
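The different sensitivity of MAE and MSE to outliers can be seen on a toy example. The values below are made up: one set of predictions is off by 1 everywhere, the other is exact except for a single error of 10.

```python
def mae(actual, predicted):
    """Mean of absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean of squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [10.0, 12.0, 11.0, 13.0]
small_errors = [11.0, 11.0, 12.0, 12.0]  # every prediction off by 1
one_outlier = [10.0, 12.0, 11.0, 23.0]   # a single prediction off by 10

print(mae(actual, small_errors), mse(actual, small_errors))
print(mae(actual, one_outlier), mse(actual, one_outlier))
```

The single large error raises MAE from 1.0 to 2.5 but raises MSE from 1.0 to 25.0, because squaring amplifies it.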
Default Baseline Models
Default baseline models are the initial models or algorithms chosen to quickly build a working end-to-end
system for a machine learning or deep learning task. These models serve as starting points to:
● Set a benchmark for performance,
● Identify obvious flaws or limitations,
● Guide further development and tuning.
They don’t need to be perfect — their main goal is to test feasibility and help improve the model in later
iterations.
Key Characteristics of Default Baseline Models:
● Simple and fast to implement: for example, logistic regression for linearly separable data.
● Chosen based on input structure and task: CNNs for images, RNNs for sequences, feedforward networks for
tabular data.
● Good enough to learn basic patterns: not necessarily the final solution, but effective for early development.
● Includes good training practices: for example, early stopping, dropout, and basic optimization strategies
(e.g., SGD or Adam).
Why Use Default Baseline Models?
● To avoid wasting time on overcomplicated models too early.
● To quickly test whether a problem is solvable.
● To compare performance when trying more advanced approaches later.
● To help in debugging and understanding the data and problem.
Determining Whether to Gather More Data
● First evaluate performance of the initial end-to-end system.
● If training performance is poor, focus on improving the model—not data.
● Increase model capacity or tune hyperparameters like the learning rate.
● Check for noisy or insufficient features in training data.
● If training is good but test performance is poor, model is overfitting.
● In such cases, gathering more data is often the best solution.
● Consider feasibility: data is easier to collect in web-scale applications than in medical settings.
● Alternatives: apply regularization (e.g., dropout, weight decay).
● Use training size vs. test error curves to estimate how much more data is needed.
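One way to use a training-size vs. test-error curve is to fit a trend and extrapolate. The sketch below assumes the error follows a power law, error = a * n^(-b), which is only an approximation: real curves usually flatten out at some error floor. The function names and the measurements are hypothetical.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    b = -slope
    a = math.exp(y_mean - slope * x_mean)
    return a, b


def examples_needed(a, b, target_error):
    """Solve a * n**(-b) = target_error for n."""
    return (a / target_error) ** (1.0 / b)


# Hypothetical measurements: each doubling of the data cuts test error by 20%.
sizes = [1000, 2000, 4000, 8000]
errors = [0.20, 0.16, 0.128, 0.1024]

a, b = fit_power_law(sizes, errors)
needed = examples_needed(a, b, target_error=0.08)
print(f"b = {b:.3f}, estimated examples needed = {needed:.0f}")
```

Because the curve is fitted on a log scale, training-set sizes are best chosen by repeated doubling rather than in equal steps.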
Selecting Hyperparameters
Selecting hyperparameters means choosing the right settings that control how a deep
learning algorithm learns and performs, either by manual tuning or automatic methods.
● Manual Hyperparameter Tuning
● Automatic Hyperparameter Optimization Algorithms
● Grid Search
● Random Search
● Model-Based Hyperparameter Optimization (Bayesian Optimization)
Manual Hyperparameter Tuning
Definition:
Manual tuning is the process of adjusting hyperparameters by hand, typically using trial and error. It is often
guided by human intuition and prior experience.
Process:
● Choose a set of hyperparameters based on intuition or past results.
● Train the model and evaluate its performance.
● Adjust the hyperparameters based on results.
● Repeat until performance is satisfactory.
Advantages:
● Simple and intuitive.
● No need for complex tools or algorithms.
● Useful for small models or when only a few hyperparameters exist.
Disadvantages:
● Time-consuming and inefficient for large parameter spaces.
● Subject to human bias.
● Not scalable or reproducible.
Automatic Hyperparameter Optimization Algorithms
Definition:
Automatic hyperparameter optimization involves using algorithms to search for the best hyperparameters,
reducing human effort and often improving performance.
Goal:
Minimize or maximize an objective function (e.g., validation loss or accuracy) over the space of
hyperparameters.
Common Algorithms:
● Grid Search
● Random Search
● Bayesian Optimization
● Evolutionary Algorithms
● Hyperband
● Gradient-based optimization (for differentiable settings)
Advantages:
● Systematic and reproducible.
● More effective exploration of complex search spaces.
● Can save significant time compared to manual tuning.
Disadvantages:
● Requires computational resources.
● May still be expensive for deep models.
Grid Search
Definition:
Grid search systematically explores all possible combinations of hyperparameters from a predefined
set of values.
Process:
1. Define a grid (discrete values) for each hyperparameter.
2. Train and validate the model for every combination.
3. Choose the combination with the best performance.
Example:
Learning Rate = [0.01, 0.1],
Batch Size = [32, 64]
→ 2 × 2 = 4 combinations to test.
Advantages:
● Simple to implement.
● Exhaustive—guarantees the best combination from the grid.
Disadvantages:
● Computationally expensive.
● Scales poorly with number of hyperparameters and their ranges.
● Might test irrelevant or suboptimal values.
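The grid from the example above can be enumerated with `itertools.product`. In this sketch `validation_loss` is a hypothetical stand-in for "train the model and return its validation loss"; its formula is made up so that the example runs without a real model.

```python
from itertools import product


def validation_loss(learning_rate, batch_size):
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (learning_rate - 0.1) ** 2 + 0.001 * abs(batch_size - 64)


grid = {
    "learning_rate": [0.01, 0.1],
    "batch_size": [32, 64],
}

# Every combination of one value per hyperparameter.
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

best = min(combinations, key=lambda cfg: validation_loss(**cfg))
print(len(combinations), best)
```

Adding a third hyperparameter with k values multiplies the number of training runs by k, which is the poor scaling noted in the disadvantages.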
Random Search
Definition:
Random search selects hyperparameter combinations at random rather than exhaustively testing all possible
combinations.
Process:
● Define a distribution or list of possible values for each hyperparameter.
● Randomly sample combinations and evaluate performance.
Advantages:
● More efficient than grid search when some hyperparameters are more influential than others.
● Can explore a wider range with fewer evaluations.
● Simple and parallelizable.
Disadvantages:
● May miss optimal regions due to randomness.
● Still not ideal for very high-dimensional spaces.
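A minimal random search sketch. The objective `validation_loss` is hypothetical and deliberately depends strongly on the learning rate and only weakly on dropout, the situation in which random search tends to beat a grid. Sampling the learning rate log-uniformly is a common choice and is an assumption here, as are the ranges.

```python
import math
import random


def validation_loss(learning_rate, dropout):
    """Hypothetical objective: sensitive to learning rate, barely to dropout."""
    return (math.log10(learning_rate) + 2.0) ** 2 + 0.01 * dropout


def sample_config(rng):
    """Draw one configuration from the assumed search distributions."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform in [1e-4, 1e-1]
        "dropout": rng.uniform(0.0, 0.5),
    }


def random_search(n_trials, seed=0):
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        loss = validation_loss(**cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss


few_cfg, few_loss = random_search(n_trials=5)
many_cfg, many_loss = random_search(n_trials=50)
print(few_loss, many_loss)
```

Each trial is independent of the others, so the loop can be split across machines without coordination.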
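Equation (probability of finding the best configuration):
Let p be the probability that a single randomly sampled configuration is among the best (for example, within
the top 5% of the search space), and let n be the number of independent trials.
Probability of not finding the best after n trials: (1 - p)^n
Probability of finding it at least once in n trials: 1 - (1 - p)^n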
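As a worked example, assume each trial independently lands in the best region with probability p, so the chance of missing it n times in a row is (1 - p)^n. The sketch below evaluates this and solves for the number of trials needed; the function names and the 5% and 95% figures are illustrative.

```python
import math


def prob_hit(p, n):
    """Probability that at least one of n independent trials is in the best region."""
    return 1.0 - (1.0 - p) ** n


def trials_needed(p, confidence):
    """Smallest n with prob_hit(p, n) >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


p = 0.05  # the best region covers 5% of the search space
n = trials_needed(p, confidence=0.95)
print(n, prob_hit(p, n))
```

The required number of trials depends only on p and the desired confidence, not on how many hyperparameters there are.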
Model-Based Hyperparameter Optimization (Bayesian Optimization)
Definition:
Model-based optimization builds a probabilistic model of the objective function and uses it to select the
most promising hyperparameter combinations to evaluate.
Process:
1. Start with some random evaluations.
2. Fit a surrogate model (e.g., Gaussian Process, Tree Parzen Estimator).
3. Use acquisition function (like Expected Improvement) to suggest next hyperparameters.
4. Update the model with new results and repeat.
Advantages:
● Efficient for expensive evaluations.
● Finds good hyperparameters with fewer trials.
● Takes prior evaluations into account.
Disadvantages:
● More complex to implement.
● Requires additional computation to maintain the surrogate model.
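Equation (Bayesian Update):
● Bayes’ theorem used in modeling: P(f | D) = P(D | f) * P(f) / P(D)
Where:
f is the objective function,
D is the set of observed results,
P(f) is the prior belief about f, P(D | f) is the likelihood of the observations, and P(f | D) is the
posterior that serves as the surrogate model.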
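The process described above (initial evaluations, surrogate fit, acquisition function, update) can be sketched in pure Python for one hyperparameter scaled to [0, 1]. This is a toy illustration under several assumptions: a Gaussian Process surrogate with an RBF kernel, Expected Improvement for minimization, a fixed candidate grid, and a made-up `objective`. The length scale, noise term and all function names are hypothetical, and real projects would normally use a dedicated library.

```python
import math


def rbf(a, b, length=0.3):
    """RBF kernel; the length scale is an assumed value."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)
    v = solve(K, k_star)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v)), 1e-12)
    return mean, math.sqrt(var)


def expected_improvement(mean, std, best):
    """Expected Improvement when lower objective values are better."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def objective(x):
    """Hypothetical validation loss as a function of one scaled hyperparameter."""
    return (x - 0.65) ** 2


# Step 1: a few initial evaluations (fixed here instead of random, for repeatability).
xs = [0.0, 0.5, 1.0]
ys = [objective(x) for x in xs]
grid = [i / 100 for i in range(101)]

for _ in range(8):
    best = min(ys)
    candidates = [c for c in grid if c not in xs]  # never re-evaluate a point
    # Steps 2 and 3: fit the surrogate and pick the point with the highest EI.
    x_next = max(candidates,
                 key=lambda c: expected_improvement(*gp_posterior(xs, ys, c), best))
    # Step 4: evaluate the objective and add the result to the observations.
    xs.append(x_next)
    ys.append(objective(x_next))

best_x = xs[ys.index(min(ys))]
print(best_x, min(ys))
```

The surrogate is refit from scratch at every step, which is affordable only because the number of observations is tiny compared with the cost of one real training run.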
Debugging strategies
1. Start simple
● Simplify the problem: Begin with a small dataset and a simplified version of your model architecture.
This helps to isolate potential issues and ensures the basic setup works correctly before adding
complexity.
● Overfit a single batch: Test if your model can successfully overfit a single batch of data, meaning it
can achieve perfect or near-perfect performance on that specific batch. If the model fails to overfit a
small batch, it often indicates a fundamental issue with the model architecture, implementation, or data
processing.
● Compare to a known result: Use off-the-shelf components or benchmark models to compare your
model's performance against a known, successful implementation. If the benchmark model performs
significantly better, the problem is likely in your own implementation.
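The "overfit a single batch" check can be illustrated without a deep learning framework. The sketch below trains a tiny logistic regression by gradient descent on one four-example batch (the logical AND function, which is linearly separable) and checks that the loss falls and every example is fit. The model, data, learning rate and step count are all assumptions chosen for illustration; with a real network the same check is run on one real mini-batch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)


def batch_loss(w, b, xs, ys):
    """Mean binary cross-entropy over the batch."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(predict_proba(w, b, x), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)


def overfit_batch(xs, ys, steps=2000, lr=0.5):
    """Full-batch gradient descent on one fixed batch; returns weights and loss history."""
    w, b = [0.0, 0.0], 0.0
    losses = [batch_loss(w, b, xs, ys)]
    n = len(xs)
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in zip(xs, ys):
            err = predict_proba(w, b, x) - y  # gradient of the loss w.r.t. the logit
            grad_w[0] += err * x[0]
            grad_w[1] += err * x[1]
            grad_b += err
        w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
        b -= lr * grad_b / n
        losses.append(batch_loss(w, b, xs, ys))
    return w, b, losses


# One small batch: inputs and labels of the AND function.
batch_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
batch_y = [0, 0, 0, 1]

w, b, losses = overfit_batch(batch_x, batch_y)
predictions = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in batch_x]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}, predictions: {predictions}")
```

If the loss on a single batch does not fall toward zero in a setup like this, the fault is more likely in the code (loss, gradients, data pipeline) than in the amount of data.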
2. Check your data
● Data preprocessing and formatting: Verify that the input data is correctly preprocessed, scaled, and
formatted according to your model's expectations. Incorrect data handling can lead to poor model
performance or even prevent training.
● Missing values, outliers, and imbalances: Examine your dataset for missing values, outliers, or class
imbalances. Address these issues through imputation, outlier handling techniques, or data
augmentation to ensure data quality and avoid biasing the model.
● Visualize the data: Visualize the data distribution, feature importance, and relationships between
variables to identify inconsistencies or potential issues.
3. Monitor training progress
● Track loss and accuracy: Monitor the loss and accuracy metrics during training to detect
overfitting, underfitting, or unstable training. Tools like TensorBoard and MLflow can help
visualize these metrics.
● Visualize activations and gradients: Observe the activations of hidden units and the gradient
values to detect issues like vanishing or exploding gradients. These issues can hinder the
training process and prevent the network from learning effectively.
● Learning curves: Plot learning curves to analyze how the model's performance on the training
and validation sets changes over time. This helps in identifying underfitting or overfitting early
in the process.
4. Evaluate and refine
● Bias-Variance Decomposition: Analyze the bias and variance of your model to determine whether
underfitting or overfitting is the primary issue. Underfitting suggests the model is too simple, while
overfitting indicates it's memorizing the training data.
● Error Analysis: Manually examine misclassified examples to understand the types of errors your
model is making. This can provide valuable insights for improving the model or data collection
process.
● Hyperparameter Tuning: Systematically explore different hyperparameter combinations (e.g.,
learning rate, regularization factors, batch size) to find the optimal settings for your model. Techniques
like grid search, random search, or Bayesian optimization can be employed for this.
● Address underfitting: If the model is underfitting, consider increasing model complexity (e.g., adding
more layers or neurons), trying different architectures, or adding more features.
● Address overfitting: If the model is overfitting, try collecting more training data, applying
regularization techniques (e.g., L1/L2 regularization, dropout), or using data augmentation.
5. Leverage debugging tools
● IDEs and Built-in Debuggers: Utilize the debugging features within your IDE (like Visual
Studio Code or PyCharm) to set breakpoints, inspect variables, and step through the code
execution.
● Visualization Tools: Tools like TensorBoard can visualize network architectures, layer
activations, and training progress, offering deeper insights into the model's behavior.
● Logging and Monitoring: Implement effective logging to capture and analyze information about
the program's execution, especially in production environments where interactive debugging
might not be feasible.
6. Document and track experiments
● Version control: Use version control systems like Git to track changes in your codebase and easily
revert to previous versions. This helps in identifying when a bug was introduced and what changes
might have caused it.
● Experiment tracking tools: Use tools like MLflow or Neptune to track model performance,
hyperparameters, and datasets across different experiments. This helps in reproducing experiments and
understanding the impact of various changes.
● Documentation: Record the debugging process, including bug descriptions, reproduction steps,
changes made, and solutions found. This documentation serves as a valuable resource for future
debugging challenges and facilitates knowledge sharing within a team.
Example: Multi-Digit Number Recognition
Multi-digit number recognition, often using Convolutional Neural Networks (CNNs), presents specific
debugging challenges beyond the general ones outlined previously.
Let's explore these challenges and some debugging strategies, referencing the MNIST dataset as a common
example.
Specific challenges in multi-digit recognition
● Segmentation: Unlike single-digit recognition, multi-digit images require accurately segmenting
individual digits before classifying them. Incorrect segmentation can lead to misclassification.
● Variable number of digits: Images can contain varying numbers of digits, making the model design
and output interpretation more complex.
● Spatial Relationships: Beyond individual digit classification, understanding the order and spatial
arrangement of the digits is critical to reconstructing the complete number.
● Touching or overlapping digits: Handwritten digits can often be connected or partially overlap,
complicating segmentation and leading to errors.
Debugging strategies
1. Data inspection and preprocessing
● Visualize the training data: Scrutinize the multi-digit training images for inconsistencies, particularly
regarding the spacing between digits, character sizes, and any augmentation strategies employed.
Ensure your preprocessing correctly handles these variations.
● Test data preprocessing with single digits: Before training on multi-digit images, confirm your
single-digit recognition pipeline works flawlessly by testing on segmented digits. This helps isolate
potential preprocessing issues before introducing the complexity of multi-digit sequences.
● Normalize pixel values: Ensure that pixel values are normalized to a consistent range (e.g., 0 to 1) for
stable training.
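Pixel normalization amounts to one division per pixel. The sketch below assumes 8-bit grayscale images stored as nested lists with values from 0 to 255; the function name and the sample image are hypothetical.

```python
def normalize_pixels(image, max_value=255.0):
    """Scale raw pixel intensities to the range [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]


# Hypothetical 2x3 grayscale patch with 8-bit intensities.
raw = [[0, 128, 255], [64, 192, 32]]
scaled = normalize_pixels(raw)
print(scaled)
```

The same scaling has to be applied at training time and at prediction time, otherwise the model sees inputs in a range it was never trained on.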
2. Model architecture and training
● Start with a simple model: Begin with a relatively simple CNN architecture that can learn to classify
individual digits accurately. Gradually introduce complexity, like handling variable digit lengths or
incorporating techniques for learning spatial relationships.
● Track training progress (loss and accuracy): Monitor the loss function and accuracy on both
training and validation sets. High training accuracy but low validation accuracy often indicates
overfitting.
● Intermediate layer analysis: Visualize the feature maps generated by intermediate layers of your
CNN to see how the model extracts features. This can help identify if the model is learning the
relevant features or if there are issues like vanishing or exploding gradients impacting the feature
learning process.
● Overfit a small batch: Test if your model can successfully overfit a small batch of training data. If it
can't, there may be fundamental issues with the model architecture or loss function.
● Experiment with different loss functions: For multi-digit recognition, depending on your model, you
might need a different loss function than the simple categorical cross-entropy used for single digits.
Consider specialized loss functions or approaches to account for the sequential nature or variable
length of digits.
3. Error analysis and visualization
● Examine misclassified multi-digit images: Analyze cases where the model makes mistakes. Are the
errors due to incorrect segmentation of digits, misclassifying an individual digit, or failing to
understand the sequence/order?
● Confusion Matrix: Use a confusion matrix to pinpoint which digits are frequently misclassified as
others. This can inform where to focus improvements.
● Visualize predictions with bounding boxes: For object detection-based approaches to multi-digit
recognition, visualize the predicted bounding boxes around each digit to identify if the model is
correctly segmenting the digits.
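Error analysis for multi-digit output can separate three things: digit-level confusions, wrong sequence lengths, and whole-number accuracy. The sketch below is one possible way to do this, assuming numbers are represented as digit strings; the function names and the sample predictions are hypothetical.

```python
def digit_confusion(true_numbers, predicted_numbers):
    """10x10 matrix: rows are true digits, columns are predicted digits.

    Pairs whose lengths differ are counted separately, because their digits
    cannot be aligned position by position.
    """
    matrix = [[0] * 10 for _ in range(10)]
    length_errors = 0
    for t_num, p_num in zip(true_numbers, predicted_numbers):
        if len(t_num) != len(p_num):
            length_errors += 1
            continue
        for t, p in zip(t_num, p_num):
            matrix[int(t)][int(p)] += 1
    return matrix, length_errors


def sequence_accuracy(true_numbers, predicted_numbers):
    """Share of numbers where every digit (and the length) is correct."""
    correct = sum(t == p for t, p in zip(true_numbers, predicted_numbers))
    return correct / len(true_numbers)


true_numbers = ["127", "4501", "38", "77"]
predicted_numbers = ["121", "4501", "38", "7"]

matrix, length_errors = digit_confusion(true_numbers, predicted_numbers)
seq_acc = sequence_accuracy(true_numbers, predicted_numbers)
print(f"7 read as 1: {matrix[7][1]}, length errors: {length_errors}, "
      f"sequence accuracy: {seq_acc}")
```

Whole-number accuracy is stricter than per-digit accuracy, since a single wrong digit makes the entire number wrong.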
4. General debugging practices
● Version control and documentation: Track changes in your code and document experiments to help
recreate or revert to previous versions if needed.
● Utilize debugging tools: Use tools like TensorBoard to visualize training progress, layer activations,
and other aspects of the model's behavior.
By combining these strategies and systematically addressing the unique challenges of multi-digit
recognition, you can effectively debug your deep learning models and achieve higher accuracy in your tasks.
Applications: Vision, NLP, Speech
These are three key branches of Artificial Intelligence (AI), each with diverse and impactful applications:
● Computer Vision: Teaches machines to see and interpret visual information like humans do. It involves
tasks like object recognition, image analysis, and scene understanding.
● Natural Language Processing (NLP): Enables computers to understand, interpret, and generate human
language in both written and spoken forms.
● Speech Recognition & Synthesis: Deals with converting spoken language into text (recognition) and
generating human-like speech from text (synthesis).
Computer Vision
● Autonomous Vehicles: Detecting obstacles, traffic signs, and lanes for self-driving cars.
● Medical Imaging: Assisting in diagnosing diseases by analyzing X-rays, MRIs, and CT scans
for anomalies.
● Facial Recognition: Unlocking phones, identifying individuals in surveillance, and enhancing
security measures.
● Retail: Tracking inventory, analyzing customer behavior, and enabling "just walk out" shopping
experiences.
● Manufacturing: Quality control, defect detection, and automating assembly lines.
● Agriculture: Monitoring crop health, detecting pests, and optimizing resource use.
● Security and Surveillance: Identifying intruders and suspicious activities.
● Augmented Reality (AR) and Gaming: Creating immersive experiences by blending virtual
objects with the real world.
● Sports Analysis: Tracking player movements, analyzing performance, and even assisting with
referee calls.
● Remote Physiotherapy: Monitoring and correcting patient movements during at-home
rehabilitation exercises.
Natural Language Processing (NLP)
● Chatbots and Virtual Assistants: Understanding and responding to user queries for customer service,
support, and information retrieval.
● Machine Translation: Translating text or speech between languages, such as Google Translate.
● Sentiment Analysis: Analyzing text to determine the emotional tone or sentiment (positive, negative,
neutral) towards a product, service, or brand.
● Text Classification: Categorizing documents, emails, or social media posts based on topics, intent, or
emotions.
● Information Extraction: Identifying and extracting specific information like names, locations, dates,
or keywords from text.
● Text Summarization: Generating concise summaries of long documents or articles.
● Predictive Text and Autocorrect: Suggesting words and correcting typos in real-time while
typing.
● Email Filtering: Categorizing and filtering emails, such as spam detection.
● Market Intelligence: Analyzing social media, reviews, and other sources to gain insights into
customer preferences and market trends.
● Hiring and Recruitment: Streamlining the resume screening and candidate selection processes.
● Healthcare: Analyzing medical records, assisting with diagnostics, and automating
documentation.
● Legal: Automating legal discovery and analyzing documents.
Speech Recognition & Synthesis
● Voice Assistants: Enabling hands-free interaction with devices like smartphones, smart
speakers, and cars through voice commands.
● Speech-to-Text Transcription: Converting spoken words into written text for documentation,
captions, or searches.
● Voice Search: Finding information or controlling devices by speaking queries instead of typing.
● Voice Biometrics: Identifying and authenticating individuals based on their unique voice
characteristics for security purposes.
● Accessibility Features: Providing voice control for individuals with disabilities who may
struggle with traditional input methods.
● Customer Service: Automating interactions in call centers and allowing self-service options.
● Education: Assisting with language learning, providing feedback on pronunciation, and helping
with note-taking.
● Healthcare: Allowing doctors to dictate patient notes, analyze speech patterns for potential
diseases, and control medical equipment hands-free.
● Automotive: Controlling navigation, infotainment, and other vehicle functions through voice
commands while driving.
● Generative AI Integration: Enabling voice dictation with large language models to enhance,
organize, and expand ideas or even generate code.
● Multimodal Voice Experiences: Combining speech recognition with other sensory inputs like
computer vision for more interactive and contextual responses.