Unit – I Part II
Introduction to ML
Training and Test Data
• In machine learning, data is often split into different sets to ensure the
model can learn effectively and be evaluated fairly.
• The two most important sets are:
▪ Training Data
▪ Test Data
[Figure: a Dataset split into a Training Set and a Test Set]
Training Data
Definition: Training data is the part of the dataset that is used to teach or train the
machine learning model. It contains input features along with their corresponding
correct outputs (labels), especially in supervised learning.
The model learns how to make predictions from these examples.
Purpose: The model adjusts its internal parameters to minimize errors on this
dataset. It learns the underlying patterns, relationships, and features from this data.
Characteristics:
• Usually the larger portion of the entire dataset.
• Data distribution ideally reflects real-world scenarios the model will encounter.
• Often labeled (especially in supervised learning).
• Helps avoid underfitting by providing ample examples.
Example:
In building a spam filter, thousands of emails labeled as "spam" or "not spam"
form the training set. The model learns the characteristics of spam emails from
this set.
Testing Data
• Definition: Test data is a separate dataset not seen by the model during
training.
• Purpose: It is used after training to evaluate the model's performance and
generalization ability, i.e., how well the model predicts on unseen data.
Characteristics:
• Always kept separate from the training data to prevent data leakage.
• Typically, smaller in size than training data, but should be representative.
• Helps detect overfitting (when a model memorizes training data but performs poorly on
new data).
Example:
Using a test set of emails the model has never encountered, we assess how
accurately the spam filter can classify them.
Training and Testing
• Training Phase: input → Learning Algorithm → Logic/Model
• Testing Phase: input → Logic/Model → Output (Prediction)
Key Differences Between Training and Test Data
| Feature | Training Data | Test Data |
|---|---|---|
| Purpose | To train and teach the model | To evaluate the performance of the trained model |
| Exposure to Model | Used directly to fit the model parameters | Not shown during training |
| Size | Usually larger (70%-80% of the dataset) | Usually smaller (20%-30% of the dataset) |
| Labels | Labeled (in supervised learning) | Labeled, for performance comparison |
| Role in Model Life Cycle | Adjusts model parameters | Measures accuracy, precision, recall, etc. |
| Data Distribution | Should represent real-world data distribution | Should be from the same distribution as training but unseen |
| Risk | Risk of overfitting if model is too tailored | Risk of underestimating model performance if too small or biased |
| Usage | Model learns patterns | Model predictions are compared to actual outcomes |
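The split described in the table can be sketched with the standard library alone. This is a minimal illustration, not a prescribed method: the function name `train_test_split`, the fixed seed, and the toy `emails` list are assumptions made for the example.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset: 10 labelled emails, identified here only by an ID.
emails = [(i, "spam" if i % 3 == 0 else "not spam") for i in range(10)]
train_set, test_set = train_test_split(emails, test_fraction=0.2)
print(len(train_set), len(test_set))
```

Shuffling before splitting helps both sets follow the same distribution, and keeping the two lists disjoint is what prevents the test data from leaking into training.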
Learning System
A learning system in machine learning is a framework that enables
machines to learn from data, recognize patterns, and make informed
decisions with minimal human guidance.
The goal is for the system to generalize well from its training data to
new, unseen data, supporting applications such as prediction,
classification, and recommendation.
Key Components of a Learning System
• Data Input: The foundation of any learning system; quality and quantity of
data directly affect performance.
• Learning Algorithm: The method or procedure by which the system learns
(e.g., Linear Regression, Neural Networks, Decision Trees).
• Model Output: Predictions, classifications, or recommendations generated
by the model.
• Feedback Mechanism: Compares predictions versus actual outcomes to
iteratively improve the model.
Designing a Learning System
1. Define the Problem
2. Collecting Data
3. Preparing the Data
4. Choosing the Model
5. Training the Model
6. Evaluating the Model
7. Hyperparameter Tuning
8. Model Deployment (Making Predictions)
Step 1: Define the Problem
• The first and most critical step in designing a machine learning system is to clearly
define the problem the system is intended to solve.
• This involves understanding the domain context and user needs, identifying the type
of learning task (e.g., classification, regression, clustering), and specifying the
desired output based on input data.
• A well-defined problem provides the foundation for all subsequent stages in the
development process.
• Misidentification at this stage can lead to the selection of inappropriate algorithms,
incorrect data handling, and ultimately ineffective solutions.
• Example: In the context of email filtering, the objective might be to classify
incoming emails as either spam or not spam. This is a binary classification problem,
where the input is the email content and the output is one of two predefined
categories.
Step 2: Collecting data
• Once the problem is clearly defined, the next step is to collect data that the machine
learning system will learn from. The data should be relevant, accurate, and large
enough to help the model understand different situations.
• The quality of your data is very important. If the data is wrong, incomplete, or biased,
the model will not perform well. Data can be collected from various sources like
files, databases, websites, sensors, or user inputs.
• Example: For an email spam filter, you need a dataset of emails, where each email is
labeled as spam or not spam. These labels help the model learn what spam usually
looks like.
Step 3: Preparing the data
• After collecting the data, the next step is to prepare it so that the machine learning
model can understand and learn from it. It is also known as data preprocessing.
• Raw data often contains errors, missing values, or unwanted information that must be
cleaned and organized.
• This step may include:
▪ Removing duplicate or irrelevant data
▪ Filling in or removing missing values
▪ Normalizing or scaling values to bring them to the same range and format
▪ Splitting the data into training and testing sets
The training set contains the data from which the model learns, and the testing set is used
to check how much the model has learnt.
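The cleaning operations listed for this step can be sketched as follows. This is only one possible approach: the helper name `prepare`, the use of `None` for missing values, mean imputation, and min-max scaling are assumptions chosen for illustration, and the feature columns are hypothetical.

```python
def prepare(rows):
    """Deduplicate rows, fill missing values with the column mean, then min-max scale.

    Assumes every column has at least one known (non-None) value.
    """
    # 1. Remove exact duplicates while keeping the original order.
    unique = list(dict.fromkeys(rows))
    n_cols = len(unique[0])
    cleaned = [list(r) for r in unique]
    for c in range(n_cols):
        known = [r[c] for r in cleaned if r[c] is not None]
        mean = sum(known) / len(known)
        # 2. Fill missing values (None) with the column mean.
        for r in cleaned:
            if r[c] is None:
                r[c] = mean
        # 3. Scale the column to the range [0, 1].
        lo, hi = min(r[c] for r in cleaned), max(r[c] for r in cleaned)
        for r in cleaned:
            r[c] = (r[c] - lo) / (hi - lo) if hi > lo else 0.0
    return cleaned

# Hypothetical features per email: (number of links, number of words)
raw = [(0, 100), (4, None), (0, 100), (2, 300)]
print(prepare(raw))   # [[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]]
```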
• Data preprocessing improves the quality of the data and ensures that the machine
learning model is able to interpret it correctly.
• Good data preparation helps the model learn better and make more accurate
predictions.
• Example: In the email spam filter example, you might remove symbols and
stopwords (like "the", "is", "at") from email texts, and then divide the emails into two
parts — one to train the model and one to test it.
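The text clean-up in the spam filter example might look like the sketch below. The stopword set is a small illustrative sample rather than a complete list, and `clean_email` is a hypothetical helper name.

```python
import re

STOPWORDS = {"the", "is", "at", "a", "an", "to", "of", "and"}  # illustrative, not exhaustive

def clean_email(text):
    """Lowercase the text, keep only runs of letters, and remove stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

print(clean_email("WIN a FREE prize!!! Claim it at the link $$$"))
# ['win', 'free', 'prize', 'claim', 'it', 'link']
```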
Step 4: Choosing the Model
• Once the data is ready, the next step is to choose a suitable machine learning model. A
model is a system or method that learns from data to make predictions or decisions.
• Choosing the right model is very important, as it affects how well your learning
system works.
• Factors to consider when choosing a model include the type of problem (classification,
regression, clustering, etc.), the size and type of the data, the accuracy needed, and the
complexity of the problem.
• Example:
In the email spam filter example, you could choose a simple classification model that
looks at word patterns in emails and learns which ones are often found in spam. If
your dataset is small and the task is straightforward, a simple model like that will
work well and save time.
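One candidate for such a simple word-pattern model is a Naive Bayes classifier. The source does not name a specific algorithm, so this choice is an assumption; the sketch below is a minimal version with Laplace smoothing, trained on four made-up emails.

```python
import math
from collections import Counter

def train_naive_bayes(examples):
    """examples: list of (list_of_words, label). Returns the counts the model needs."""
    word_counts = {"spam": Counter(), "not spam": Counter()}
    label_counts = Counter()
    for words, label in examples:
        word_counts[label].update(words)
        label_counts[label] += 1
    vocab = set(word_counts["spam"]) | set(word_counts["not spam"])
    return word_counts, label_counts, vocab

def predict(model, words):
    """Return the label with the higher log-probability (Laplace smoothing).

    Assumes both labels occurred at least once in the training examples.
    """
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in word_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical, already-cleaned training emails.
training = [
    (["win", "free", "prize"], "spam"),
    (["free", "offer", "click"], "spam"),
    (["meeting", "agenda", "monday"], "not spam"),
    (["project", "report", "monday"], "not spam"),
]
model = train_naive_bayes(training)
print(predict(model, ["free", "prize"]))      # spam
print(predict(model, ["monday", "report"]))   # not spam
```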
Step 5: Training the Model
• After selecting the right model, the next important step is to train the model. Training
means teaching the model to learn patterns from the data by showing it many examples.
• The model studies the data to understand how input features (like words in an email) are
linked to the correct output (like spam or not spam).
• During training, the model makes predictions, compares them with the correct answers,
and then adjusts itself to reduce mistakes. This process continues until the model learns to
make accurate predictions on most of the training data.
• Avoiding overfitting and underfitting: While training, it is important to make sure the
model learns just the right amount: not too little (underfitting, where the model performs
poorly on training as well as test data) and not too much (overfitting, where the model
performs well on training data but poorly on test data).
• Example: In the spam filter case, the model is trained using many labeled emails. It
learns which words or styles are common in spam. If trained well (without overfitting or
underfitting), it can correctly detect spam even in new emails it hasn't seen before.
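The predict, compare, and adjust cycle described in this step can be illustrated with a perceptron, one of the simplest trainable models. The choice of perceptron, the two count features, and the toy data are assumptions for the sake of the example; the source does not prescribe a particular algorithm.

```python
def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """samples: list of (features, label) with label 1 (spam) or 0 (not spam)."""
    n_features = len(samples[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in samples:
            # 1. Predict with the current parameters.
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            prediction = 1 if activation > 0 else 0
            # 2. Compare with the correct answer.
            error = label - prediction
            # 3. Adjust the parameters only when the prediction was wrong.
            if error != 0:
                mistakes += 1
                weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
                bias += learning_rate * error
        if mistakes == 0:   # every training example is classified correctly
            break
    return weights, bias

def predict(weights, bias, features):
    return 1 if sum(w * x for w, x in zip(weights, features)) + bias > 0 else 0

# Hypothetical features: (count of "free", count of "meeting")
data = [((3, 0), 1), ((2, 0), 1), ((0, 2), 0), ((0, 1), 0), ((1, 0), 1)]
weights, bias = train_perceptron(data)
print([predict(weights, bias, x) for x, _ in data])   # [1, 1, 0, 0, 1]
```

On this small, linearly separable dataset the loop stops after the second pass, once a full pass produces no mistakes.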
Step 6: Evaluating the Model
• Once the model is trained, the next step is to evaluate how well it performs.
• This means testing the model using new data that it hasn't seen before, called test data.
• Evaluation helps us know if the model is:
▪ Learning correctly
▪ Making useful predictions
▪ Ready to be used in a real application
• Example: You trained a spam email filter using past emails. Now, you test it on a new
set of 200 emails. The model correctly identifies 180 emails and makes 20 mistakes.
You calculate accuracy and other metrics to understand how reliable the model is
before using it in a real email system.
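The metrics for this example can be computed as sketched below. The totals (200 emails, 180 correct, 20 mistakes) come from the example; how the 20 mistakes divide into missed spam and false alarms is not given in the source, so the split used here (8 and 12) is an assumption made only to show precision and recall.

```python
def evaluate(actual, predicted, positive="spam"):
    """Return accuracy, precision and recall for a binary classifier."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)   # spam caught
    fp = sum(a != positive and p == positive for a, p in pairs)   # false alarms
    fn = sum(a == positive and p != positive for a, p in pairs)   # spam missed
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# 200 test emails: 180 correct, 20 wrong.
# Assumed breakdown of the 20 mistakes: 8 missed spam, 12 false alarms.
actual    = ["spam"] * 60 + ["spam"] * 8     + ["not spam"] * 12 + ["not spam"] * 120
predicted = ["spam"] * 60 + ["not spam"] * 8 + ["spam"] * 12     + ["not spam"] * 120
metrics = evaluate(actual, predicted)
print(metrics)
```

Accuracy is 180 / 200 = 0.90 regardless of how the mistakes are distributed; precision and recall depend on the assumed breakdown.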
Step 7: Hyperparameter Tuning
After evaluating the model, the next step is to improve its performance by adjusting
certain settings known as hyperparameters. This process is called hyperparameter
tuning.
• Hyperparameters: These are settings chosen before training the
model, and they control how the learning happens. They are not learned from the
data but must be set manually or through experiments. Each machine learning
algorithm has its own hyperparameters. Examples: learning rate, number of trees in a
Random Forest, maximum depth of a decision tree, and number of neurons or hidden
layers in a neural network.
Example: Suppose your spam filter model uses a decision tree. If the maximum depth
is too small, the model may not capture enough detail (underfitting). If it’s too large, it
may memorize the training data (overfitting). By tuning the depth to the right level,
the model performs better on new emails.
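Tuning "through experiments" usually means trying several candidate values and keeping the one that scores best on held-out data. To stay dependency-free, the sketch below swaps the decision tree from the example for a k-nearest-neighbours classifier, whose hyperparameter `k` plays a similar role to tree depth. The 1-D dataset and the separate validation split are assumptions introduced for illustration.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, data, k):
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

# Hypothetical 1-D feature (e.g. number of suspicious words) with labels.
# The point (3, "spam") is deliberate noise inside the "not spam" region.
train = [(0, "not spam"), (1, "not spam"), (2, "not spam"), (3, "spam"), (4, "not spam"),
         (6, "spam"), (7, "spam"), (8, "spam"), (9, "spam")]
validation = [(0.5, "not spam"), (1.5, "not spam"), (3.4, "not spam"),
              (5.5, "spam"), (8.5, "spam")]

candidates = [1, 3, 9]   # values of the hyperparameter k to try
scores = {k: accuracy(train, validation, k) for k in candidates}
best_k = max(scores, key=scores.get)
print(scores, best_k)    # {1: 0.8, 3: 1.0, 9: 0.4} 3
```

Here `k = 1` follows the noisy point too closely (similar to an overly deep tree), `k = 9` always predicts the majority class (similar to an overly shallow tree), and the middle value performs best on the held-out data.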
Step 8: Model deployment (Making Predictions)
• The final step is model deployment. This means placing the model into real-world use, so
it can start making predictions on new, unseen data.
• Deployment is when the model leaves the lab or classroom and becomes part of an actual
application, website, or system used by people.
• What happens during deployment:
▪ The trained model is saved and integrated into a software system or app.
▪ It receives new input data (not used during training).
▪ It processes the input and makes predictions or decisions.
▪ The results are used by users or other systems.
Example: Suppose you’ve built a spam detection model for emails. After training and
testing, you deploy it into an email application. Now, whenever a new email arrives, the
model checks it in real-time and marks it as spam or not spam — helping users without
needing manual review.
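A minimal sketch of the save, load, and predict cycle is shown below. The JSON file format, the file name `spam_model.json`, the word weights, and the helper names are all illustrative assumptions; real deployments vary widely.

```python
import json
import os
import tempfile

def save_model(model, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(model, f)

def load_model(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def classify(model, email_text):
    """Score an incoming email with the stored word weights."""
    words = email_text.lower().split()
    score = model["bias"] + sum(model["weights"].get(w, 0.0) for w in words)
    return "spam" if score > 0 else "not spam"

# Hypothetical trained parameters; in practice these come out of the training step.
trained = {"weights": {"free": 2.0, "prize": 1.5, "meeting": -2.0}, "bias": -1.0}

path = os.path.join(tempfile.gettempdir(), "spam_model.json")
save_model(trained, path)          # done once, after training and testing

deployed = load_model(path)        # done inside the email application
print(classify(deployed, "Claim your FREE prize"))     # spam
print(classify(deployed, "Meeting moved to Monday"))   # not spam
```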
Issues in Machine Learning
• Data-Related Issues
▪ Inadequate Training Data
▪ Poor Quality of Data
▪ Non-Representative Training Data
▪ Data Bias
▪ Irrelevant Features
▪ Data Privacy and Security
• Model-Related Issues
▪ Overfitting and Underfitting
▪ Lack of Explainability
▪ Model Drift / Concept Drift
▪ High Computational Cost
• Deployment and Maintenance Issues
▪ Monitoring and Maintenance
▪ Slow Implementation
• Human Resource and Process Issues
▪ Lack of Skilled Resources
▪ Process Complexity of Machine Learning
• Ethical and Social Issues
▪ Lack of Fairness and Accountability
▪ Environmental Impact
Data Related Issues
1. Inadequate Training Data
It means that the machine learning model doesn’t get enough examples to learn the
patterns properly. A model trained on too little data will struggle to generalize, make
poor predictions and often give unreliable results.
2. Poor Quality of Data
This refers to data that is incomplete, incorrect, inconsistent, outdated, duplicate, or
noisy. If the data fed into a machine learning model is of poor quality, the model will
learn wrong patterns or fail to learn useful ones. Poor Quality issues include:
| Issue | Explanation | Example |
|---|---|---|
| Missing Data | Some values are blank or not recorded. | A customer profile is missing the age or gender field. |
| Inaccurate Data | Data contains wrong or false information. | A student's test score is recorded as 800 out of 100. |
| Duplicate Entries | The same record appears multiple times. | A patient appears twice in the system with slightly different IDs. |
| Inconsistent Data | Different formats or labels are used for the same information. | Country names listed as "USA", "U.S.", "United States". |
| Outdated Data | Information is old or no longer valid. | An employee still marked as “active” who left the company a year ago. |
| Noisy Data | Contains irrelevant, random, or meaningless information. | Social media posts with random emojis or irrelevant hashtags. |
| Typos and Spelling Errors | Human input mistakes during data entry. | Product name written as "samsang" instead of "Samsung". |
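Several of these issues can be checked automatically before training. The sketch below covers missing values, out-of-range scores, exact duplicates, and inconsistent country labels. The record layout, the 0 to 100 score range, and the alias mapping are assumptions for illustration; near-duplicates with slightly different IDs would need fuzzier matching than shown here.

```python
COUNTRY_ALIASES = {"usa": "United States", "u.s.": "United States",
                   "united states": "United States"}   # illustrative mapping

def quality_report(records):
    """Count missing, out-of-range and exactly duplicated records in a list of dicts."""
    report = {"missing": 0, "inaccurate": 0, "duplicates": 0}
    seen = set()
    for rec in records:
        if any(value is None for value in rec.values()):
            report["missing"] += 1
        score = rec.get("score")
        if score is not None and not 0 <= score <= 100:
            report["inaccurate"] += 1          # e.g. 800 out of 100
        key = tuple(sorted(rec.items(), key=lambda kv: kv[0]))
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
    return report

def normalize_country(name):
    """Map spelling variants of the same country to a single label."""
    return COUNTRY_ALIASES.get(name.strip().lower(), name)

# Hypothetical student records.
records = [
    {"name": "Asha", "score": 85, "country": "USA"},
    {"name": "Ben", "score": 800, "country": "U.S."},
    {"name": "Chen", "score": None, "country": "United States"},
    {"name": "Asha", "score": 85, "country": "USA"},
]
print(quality_report(records))   # {'missing': 1, 'inaccurate': 1, 'duplicates': 1}
print({normalize_country(r["country"]) for r in records})   # {'United States'}
```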
3. Non-representative training data
It means that the dataset used to train a machine learning model does not accurately
reflect the real-world data the model will face after deployment.
In other words, the training data covers only a part of the actual scenario.
The model will likely make incorrect predictions when exposed to different or unseen
types of data and will fail to generalize.
4. Data Bias
Data bias happens when the data used to train a machine learning model is unfair,
unbalanced, or distorted, leading the model to learn incorrect or discriminatory patterns.
This bias can result in models that favor certain groups or types of data, ignore or
misrepresent others, and produce unfair, inaccurate, or even harmful outcomes.
5. Irrelevant Features
Irrelevant features are input variables (columns in your dataset) that do not help the
model make better predictions—they are unrelated or weakly related to the target/output.
Including irrelevant features adds noise to the model, increases training time, reduces
model accuracy, and makes the model harder to interpret.
• Example: House price prediction dataset containing house color and owner name.
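One rough way to spot weakly related features is to measure their correlation with the target. The sketch below uses the Pearson coefficient on made-up house data; the colour encoding, the prices, and the 0.3 cut-off are assumptions. Pearson correlation only captures linear relationships, so a low value is a hint rather than proof that a feature is irrelevant.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0      # a constant column carries no information
    return cov / math.sqrt(var_x * var_y)

# Hypothetical house data; colour is encoded as an arbitrary integer code.
area_sqft  = [800, 1000, 1200, 1500, 2000, 2500]
color_code = [1, 3, 2, 3, 1, 2]
price      = [80, 100, 120, 150, 200, 250]    # in thousands

THRESHOLD = 0.3   # illustrative cut-off, not a standard value
for name, column in [("area_sqft", area_sqft), ("color_code", color_code)]:
    r = pearson(column, price)
    verdict = "keep" if abs(r) >= THRESHOLD else "candidate to drop"
    print(f"{name}: r = {r:.2f} -> {verdict}")
```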
6. Data Privacy and Security
• Data Privacy: Protecting individuals’ personal or sensitive information from
being exposed, misused, or accessed without consent.
• Data Security: Preventing unauthorized access, theft, or tampering with data
during collection, storage, processing, or sharing.
• In machine learning, models are trained on large amounts of data—often
containing personal details, such as names, health records, financial info, or
behavior patterns. If not handled properly, this data can pose serious ethical,
legal, and reputational risks.
Model Related Issues
7. Overfitting and underfitting
Overfitting and underfitting are common problems in machine learning that
happen when the model doesn’t learn the right patterns from the data.
| Concept | Meaning | Problem Caused | Example |
|---|---|---|---|
| Overfitting | Model learns too much, including noise and irrelevant details | Works well on training data, fails on new/unseen data | Training the same model on every tiny detail in the training images; it memorizes those exact pictures but fails to recognize new ones. |
| Underfitting | Model learns too little, missing important patterns | Performs poorly even on training data | Training a model to recognize cats and dogs using only 2–3 very simple features (like size and color); it performs poorly because it doesn’t capture enough patterns. |
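The two symptoms in the table can be turned into a rough diagnostic by comparing training and test accuracy. The function name `diagnose` and the thresholds `low` and `gap` are illustrative assumptions, not standard values.

```python
def diagnose(train_accuracy, test_accuracy, low=0.7, gap=0.1):
    """Rough rule of thumb; `low` and `gap` are illustrative thresholds."""
    if train_accuracy < low:
        return "underfitting"          # poor even on the training data
    if train_accuracy - test_accuracy > gap:
        return "overfitting"           # good on training data, poor on new data
    return "reasonable fit"

print(diagnose(0.62, 0.60))   # underfitting
print(diagnose(0.99, 0.71))   # overfitting
print(diagnose(0.91, 0.89))   # reasonable fit
```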
8. Lack of Explainability
An ML model is often considered a “black box”: its internal decision-making
process is hidden or too complex for humans to understand. Lack of explainability
means you can’t clearly answer the question:
“Why did the model make that particular prediction or decision?”
Example: A loan application is rejected by an ML model, but the bank can’t explain
why. Was it the credit score, the income, or something else?
9. Model Drift
Model Drift (or Concept Drift) occurs when the data patterns change over time, but
your model was trained on older data. As a result, the model’s predictions become less
accurate or irrelevant over time.
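A simple way to look for drift is to compare a feature's recent values with the values seen during training. The sketch below measures how far the recent mean has moved, in units of the training standard deviation. The feature, the data, and the threshold of 2.0 are assumptions; production systems typically use more robust statistical tests.

```python
import statistics

def drift_score(training_values, recent_values):
    """How many training standard deviations the recent mean has moved.

    Assumes the training values are not all identical (non-zero spread).
    """
    base_mean = statistics.mean(training_values)
    base_std = statistics.stdev(training_values)
    return abs(statistics.mean(recent_values) - base_mean) / base_std

# Hypothetical feature: number of links per email.
training_links = [1, 2, 1, 3, 2, 2, 1, 3]
recent_links   = [5, 6, 4, 7, 5, 6]

score = drift_score(training_links, recent_links)
if score > 2.0:      # illustrative threshold
    print(f"possible drift (score = {score:.1f}); consider retraining")
```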
10. High Computational Cost
Training or running a machine learning model can require a lot of time, memory, and
processing power because of large datasets, high-dimensional data, and model
complexity.
Deployment and Maintenance Issues
11. Monitoring and Maintenance
Over time, the data changes, user behavior shifts, and model performance may degrade.
Monitoring and maintenance refer to the ongoing process of:
• Tracking the model’s performance
• Updating it when needed
• Ensuring it continues to perform accurately and fairly
If this is not done properly, the model may become outdated, biased, or even harmful.
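Tracking performance can be as simple as comparing each period's accuracy with the accuracy measured at deployment. In this sketch the weekly figures, the baseline, and the tolerance are hypothetical, and `needs_attention` is an illustrative helper name.

```python
def needs_attention(weekly_accuracy, baseline, tolerance=0.05):
    """Return the weeks (0-based) whose accuracy fell more than `tolerance` below baseline."""
    return [week for week, acc in enumerate(weekly_accuracy)
            if baseline - acc > tolerance]

# Hypothetical accuracy measured on freshly labelled emails each week.
history = [0.90, 0.91, 0.89, 0.84, 0.80]
print(needs_attention(history, baseline=0.90))   # [3, 4]
```

Weeks flagged this way are candidates for investigation and, if the drop persists, for retraining the model on newer data.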
12. Slow implementation
Implementing ML models often takes a lot of time. The delays can be due to technical,
organizational, or process-related bottlenecks.
Human Resource and Process Issues
13. Lack of Skilled Resources
There is a shortage of professionals with the right knowledge and experience to:
• Build effective machine learning models
• Prepare and manage data
• Deploy and maintain models in real-world systems
• Interpret and explain model outputs to stakeholders
This skills gap can slow down or even derail machine learning projects.
14. Process complexity in Machine Learning
Designing and developing an ML system is a complex process because many steps, tools,
skills, and decisions are involved in building a working ML solution, from raw data to
final deployment and beyond.
Ethical and Social Issues
15. Lack of Fairness and Accountability
Lack of fairness means that a machine learning model treats certain groups unfairly —
often because of biased data or design choices.
Lack of accountability means that no one is clearly responsible for what a model does,
especially when it causes harm or unfair outcomes.
These are ethical and social issues that can have serious consequences.
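One of several possible fairness checks is to compare how often a model gives a favourable outcome to each group (often called the demographic parity difference). The loan decisions and group labels below are made up for illustration, and a gap on its own does not establish the cause of the disparity.

```python
def approval_rates(decisions):
    """decisions: list of (group, approved). Returns approval rate per group."""
    totals, approved = {}, {}
    for group, ok in decisions:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + (1 if ok else 0)
    return {g: approved[g] / totals[g] for g in totals}

# Hypothetical loan decisions produced by a model for two groups.
decisions = ([("A", True)] * 8 + [("A", False)] * 2 +
             [("B", True)] * 4 + [("B", False)] * 6)
rates = approval_rates(decisions)
gap = max(rates.values()) - min(rates.values())
print(rates, f"gap = {gap:.2f}")   # {'A': 0.8, 'B': 0.4} gap = 0.40
```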
16. Environmental Impact
The environmental impact of machine learning refers to the energy and resource
consumption required to build, train, and run ML models—especially large ones like deep
learning and generative AI. These processes contribute to carbon emissions, electronic
waste, and overall environmental degradation.
Applications of Machine Learning
1. Image and Video Analysis
ML models can learn to interpret visual data and identify patterns in images or videos.
• Facial Recognition – Identifies or verifies a person’s identity from images or videos. It is
used in phone unlocking, airport security and tagging people in photos.
• Object Detection – Locates and labels multiple objects in a scene (e.g., detecting traffic signs
for self-driving cars).
• Medical Imaging – Assists doctors by detecting tumors, fractures, or diseases in X-rays,
MRIs, and CT scans.
2. Natural Language Processing (NLP)
NLP allows machines to understand, process, and generate human language.
• Text Classification – Automatically sorts text into categories, such as spam vs. non-spam
emails.
• Sentiment Analysis – Determines whether the tone of text is positive, negative, or neutral,
useful for customer feedback analysis.
• Language Translation – Converts text or speech from one language to another (e.g., Google
Translate, Microsoft Translator).
3. Recommendation Systems
ML predicts what users might like based on their past behavior.
• E-commerce Recommendations – Suggests products that a customer is likely to buy (e.g.,
“You might also like” on Amazon).
• Content Streaming – Recommends movies, TV shows, or music based on past
viewing/listening habits (e.g., Netflix, Spotify).
4. Speech and Audio Processing
ML can process sound waves to recognize or analyze speech and audio patterns.
• Speech Recognition – Converts spoken language into written text (e.g., Siri, Google
Assistant).
• Speaker Identification – Recognizes a person’s identity from their voice (e.g., banking apps
with voice authentication).
5. Healthcare
ML is transforming diagnosis, treatment, and hospital management.
• Disease Prediction – Analyzes patient data to predict diseases such as diabetes or heart issues
before symptoms appear.
• Drug Discovery – Speeds up the creation of new medicines by predicting how chemical
compounds will react.
• Medical Imaging – ML algorithms help in identifying tumors in X-rays and MRIs.
• Personalized Treatment Plans – Tailors treatment based on patient history and genetics.
• Predictive Analytics – Predicts patient readmissions and epidemic outbreaks.
6. Finance
ML helps detect fraud, manage risks, and make investment decisions.
• Fraud Detection – Identifies suspicious transactions by spotting unusual patterns.
• Stock Market Prediction – Uses historical trends and real-time data to forecast stock prices.
• Risk Management – Assessing market or credit risk using historical data.
7. Marketing and Sales
• Customer Segmentation – Grouping users based on behavior for targeted marketing.
• Recommendation Systems – Like those on Amazon or Netflix.
• Churn Prediction – Identifying users who are likely to stop using a service.
• Sentiment Analysis – Analyzing customer feedback and social media posts.
8. Manufacturing
• Predictive Maintenance – Forecasting equipment failure before it occurs.
• Quality Control – Detecting defects using computer vision.
• Supply Chain Optimization – Forecasting demand and optimizing logistics.
9. Transportation
• Autonomous Vehicles – Self-driving cars use ML for perception and control.
• Traffic Prediction – Services such as Google Maps and Uber use ML to suggest optimal routes.
• Fleet Management – Predicting delivery times and maintenance needs.
10. Cybersecurity
• Intrusion Detection Systems (IDS) – Detecting unauthorized access or attacks.
• Spam Filtering – Classifying emails using ML algorithms.
• Phishing Detection – Identifying fraudulent websites and emails.
11. Agriculture
ML helps improve farming efficiency and productivity.
• Crop Health Monitoring – Uses drones and sensors to detect plant diseases or pest
infestations early.
• Yield Prediction – Estimates future harvest quantities for better supply planning.
12. Education
ML is personalizing and automating learning.
• Personalized Learning – Suggests learning material based on a student’s strengths and
weaknesses.
• Automated Grading – Evaluates assignments, quizzes, or exams instantly, saving teachers’
time.
• Dropout Prediction – Identifying at-risk students early.