
Introduction to Machine Learning Basics

Machine learning (ML) enables computers to learn from data and make decisions without explicit programming, with applications in various fields such as healthcare and finance. It encompasses different types including supervised, unsupervised, and reinforcement learning, each with unique methodologies for pattern recognition and decision-making. The importance of quality data is emphasized, as it underpins the effectiveness of ML models, while challenges such as data bias and interpretability are also discussed.

Uploaded by

lavanya.d
Copyright
© All Rights Reserved

Introduction to Machine Learning

Machine learning (ML) allows computers to learn and make decisions without being
explicitly programmed. It involves feeding data into algorithms to identify patterns
and make predictions on new data. Machine learning is used in various applications,
including image and speech recognition, natural language processing, and recommender
systems.

Why do we need Machine Learning?


Machine learning algorithms learn from data, train on patterns, and solve or predict
complex problems beyond the scope of traditional programming. They drive better
decision-making and tackle intricate challenges efficiently.
Here’s why ML is indispensable across industries:
1. Solving Complex Business Problems
Traditional programming struggles with tasks like image recognition, natural language
processing (NLP), and medical diagnosis. ML, however, thrives by learning from examples
and making predictions without relying on predefined rules.
Example Applications:
 Image and speech recognition in healthcare.
 Language translation and sentiment analysis.
2. Handling Large Volumes of Data
With the internet’s growth, the data generated daily is immense. ML effectively processes
and analyzes this data, extracting valuable insights and enabling real-time predictions.
Use Cases:
 Fraud detection in financial transactions.
 Social media platforms like Facebook and Instagram predicting personalized feed
recommendations from billions of interactions.
3. Automating Repetitive Tasks
ML automates time-intensive and repetitive tasks with precision, reducing manual effort
and reliance on error-prone manual processes.
Examples:
 Email Filtering: Gmail uses ML to keep your inbox spam-free.
 Chatbots: ML-powered chatbots resolve common issues like order tracking and
password resets.
 Data Processing: Automating large-scale invoice analysis for key insights.
4. Personalized User Experience
ML enhances user experience by tailoring recommendations to individual preferences. Its
algorithms analyze user behavior to deliver highly relevant content.
Real-World Applications:
 Netflix: Suggests movies and TV shows based on viewing history.
 E-Commerce: Recommends products you’re likely to purchase.
5. Self-Improvement in Performance
ML models evolve and improve with more data, making them smarter over time. They
adapt to user behavior and refine their performance.
Examples:
 Voice Assistants (e.g., Siri, Alexa): Learn user preferences, improve voice recognition,
and handle diverse accents.
 Search Engines: Refine ranking algorithms based on user interactions.
 Self-Driving Cars: Enhance decision-making using millions of miles of data from
simulations and real-world driving.
What Makes a Machine “Learn”?
A machine “learns” by recognizing patterns and improving its performance on a task based
on data, without being explicitly programmed.
The process involves:
1. Data Input: Machines require data (e.g., text, images, numbers) to analyze.
2. Algorithms: Algorithms process the data, finding patterns or relationships.
3. Model Training: Machines learn by adjusting their parameters based on the input data
using mathematical models.
4. Feedback Loop: The machine compares predictions to actual outcomes and corrects
errors (via optimization methods like gradient descent).
5. Experience and Iteration: Repeating this process with more data improves the
machine’s accuracy over time.
6. Evaluation and Generalization: The model is tested on unseen data to ensure it
performs well on real-world tasks.
In essence, machines “learn” by continuously refining their understanding through data-
driven iterations, much like humans learn from experience.
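The loop described above (predict, compare with the actual outcome, correct the parameters, repeat) can be sketched in a few lines. This is only a minimal illustration, assuming a one-parameter model y = w * x, made-up data that follows y = 2x, and an arbitrarily chosen learning rate; none of these come from the original notes.

```python
# Toy data (hypothetical): points that follow y = 2x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def train(xs, ys, learning_rate=0.01, epochs=500):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0  # model parameter, starts with no knowledge
    n = len(xs)
    for _ in range(epochs):
        # Feedback loop: compare predictions with the actual outcomes.
        errors = [w * x - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to w.
        grad = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        # Correct the parameter with a small step against the gradient.
        w -= learning_rate * grad
    return w

w = train(xs, ys)
prediction = w * 5.0  # evaluate on an input that was not in the training data
```

After training, `w` is very close to 2, so the model also gives a sensible answer for the unseen input 5.0.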
Importance of Data in Machine Learning
Data is the foundation of machine learning (ML). Without quality data, ML models cannot
learn, perform, or make accurate predictions.
 Data provides the examples from which models learn patterns and relationships.
 High-quality and diverse data improves model accuracy and generalization.
 Data ensures models understand real-world scenarios and adapt to practical applications.
 Features derived from data are critical for training models.
 Separate datasets for validation and testing assess how well the model performs on
unseen data.
 Data fuels iterative improvements in ML models through feedback loops.
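As a rough sketch of the point about separate validation and test datasets, the helper below shuffles a dataset and cuts it into three parts. The function name `split_dataset` and the 70/15/15 proportions are assumptions chosen for illustration, not values from the original notes.

```python
import random

def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle rows and split them into train, validation and test lists."""
    rows = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]      # whatever remains is held out for testing
    return train, val, test

data = list(range(100))                # stand-in for 100 labeled examples
train, val, test = split_dataset(data)
```

The model is fitted on `train`, tuned against `val`, and only scored once on `test`, so the final score reflects data the model has never seen.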
Types of Machine Learning
1. Supervised learning
Supervised learning is a type of machine learning where a model is trained on labeled data
—meaning each input is paired with the correct output. The model learns by comparing its
predictions with the actual answers provided in the training data.
Both classification and regression problems are supervised learning problems.
Example: Consider the following data regarding patients entering a clinic. The data
consists of the gender and age of the patients and each patient is labeled as “healthy” or
“sick”.
Gender    Age    Label
M         48     sick
M         67     sick
F         53     healthy
M         49     sick
F         32     healthy
M         34     healthy
M         21     healthy

In this example, supervised learning uses the labeled data to train a model that can
predict the label (“healthy” or “sick”) for new patients based on their gender and age. For
instance, if a new patient (e.g., male, 50 years old) visits the clinic, the model classifies
the patient as “healthy” or “sick” based on the patterns it learned during training.
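To make the clinic example concrete, here is a minimal sketch that classifies a new patient with a 1-nearest-neighbour rule, one of many possible supervised methods. The distance function and its `gender_penalty` value are assumptions made purely for illustration; the notes do not prescribe a particular algorithm.

```python
# Training data copied from the clinic table: (gender, age, label).
patients = [
    ("M", 48, "sick"), ("M", 67, "sick"), ("F", 53, "healthy"),
    ("M", 49, "sick"), ("F", 32, "healthy"), ("M", 34, "healthy"),
    ("M", 21, "healthy"),
]

def distance(gender_a, age_a, gender_b, age_b, gender_penalty=10):
    """Age gap in years plus a fixed (assumed) penalty if the genders differ."""
    return abs(age_a - age_b) + (0 if gender_a == gender_b else gender_penalty)

def predict(gender, age):
    """Return the label of the most similar training patient (1-NN)."""
    nearest = min(patients, key=lambda p: distance(gender, age, p[0], p[1]))
    return nearest[2]

label = predict("M", 50)  # nearest training patient is (M, 49, sick)
```

For the new patient (male, 50), the closest labeled example is the 49-year-old male, so the sketch predicts “sick”.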
2. Unsupervised learning
Unsupervised learning algorithms draw inferences from datasets consisting of input data
without labeled responses. The observations carry no classification or category labels, so
the algorithm has to find structure in the inputs on its own.
Example: Consider the following data regarding patients entering a clinic. The dataset
includes unlabeled data, where only the gender and age of the patients are available, with
no health status labels.
Gender    Age
M         48
M         67
F         53
M         49
F         34
M         21

Here, an unsupervised learning technique is used to find patterns or groupings in the data,
such as clustering patients by age or gender. For example, the algorithm might group
patients into clusters such as “younger patients” and “older patients” without any
knowledge of their health status.
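Grouping the patients by age can be sketched with a very small one-dimensional k-means. The choice of two clusters, the starting centres (youngest and oldest age) and the function name `kmeans_1d` are assumptions for illustration only.

```python
ages = [48, 67, 53, 49, 34, 21]  # ages from the unlabeled clinic table

def kmeans_1d(values, iterations=10):
    """Split values into two clusters with a basic 1-D k-means."""
    centers = [float(min(values)), float(max(values))]  # simple initial guess
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            # Assign each value to the closer of the two centres.
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[idx].append(v)
        # Move each centre to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters, centers

clusters, centers = kmeans_1d(ages)
```

No labels are used anywhere: the “younger” and “older” groups emerge from the ages alone.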
3. Reinforcement Learning
Reinforcement Learning (RL) trains an agent to act in an environment by maximizing
rewards through trial and error. Unlike other machine learning types, RL doesn’t provide
explicit instructions.
Instead, the agent learns by:
 Exploring Actions: Trying different actions.
 Receiving Feedback: Rewards for correct actions, punishments for incorrect ones.
 Improving Performance: Refining strategies over time.
Example: Identifying a Fruit
The system receives an input (e.g., an apple) and initially makes an incorrect prediction
(“It’s a mango”). Feedback is provided to correct the error (“Wrong! It’s an apple”), and
the system updates its model based on this feedback.
Over time, it learns to respond correctly (“It’s an apple”) when encountering similar inputs,
improving accuracy through trial, error, and feedback.

Beyond these three types of machine learning, two additional approaches have gained
significant attention in modern machine learning: Self-Supervised Learning and
Semi-Supervised Learning.
Self-Supervised Learning
Self-supervised learning (SSL) is a machine learning technique that trains models on unlabeled
data by generating the labels, or supervisory signals, from the data itself.
Semi-Supervised Learning
Semi-supervised learning is a type of machine learning that falls in between supervised and
unsupervised learning. It uses a small amount of labeled data and a large amount of
unlabeled data to train a model. The goal of semi-supervised learning is to learn a function
that can accurately predict the output variable from the input variables.
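One common way to combine a few labeled examples with many unlabeled ones is self-training (pseudo-labeling). The sketch below is a minimal illustration under stated assumptions: the data, the “young”/“old” labels and the nearest-centroid model are all hypothetical and are not taken from the original notes.

```python
# Hypothetical data: a few labeled ages and a larger unlabeled pool.
labeled = [(25, "young"), (30, "young"), (65, "old"), (70, "old")]
unlabeled = [22, 28, 35, 40, 58, 62, 68, 75]

def centroids(examples):
    """Mean value per label."""
    groups = {}
    for value, label in examples:
        groups.setdefault(label, []).append(value)
    return {label: sum(vs) / len(vs) for label, vs in groups.items()}

def predict(value, cents):
    """Label of the nearest centroid."""
    return min(cents, key=lambda label: abs(value - cents[label]))

# Step 1: fit on the small labeled set only.
cents = centroids(labeled)
# Step 2: pseudo-label the unlabeled pool with the current model.
pseudo = [(v, predict(v, cents)) for v in unlabeled]
# Step 3: refit on labeled plus pseudo-labeled data.
cents = centroids(labeled + pseudo)
```

In practice only confident pseudo-labels are usually kept; this sketch keeps all of them to stay short.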

Benefits of Machine Learning


 Enhanced Efficiency and Automation: ML automates repetitive tasks, freeing up
human resources for more complex work. It also streamlines processes, leading to
increased efficiency and productivity.
 Data-Driven Insights: ML can analyze vast amounts of data to identify patterns and
trends that humans might miss. This allows for better decision-making based on real-
world data.
 Improved Personalization: ML personalizes user experiences across various platforms.
From recommendation systems to targeted advertising, ML tailors content and services
to individual preferences.
 Advanced Automation and Robotics: ML empowers robots and machines to perform
complex tasks with greater accuracy and adaptability. This is revolutionizing fields like
manufacturing and logistics.
Challenges of Machine Learning
 Data Bias and Fairness: ML algorithms are only as good as the data they are trained
on. Biased data can lead to discriminatory outcomes, requiring careful data selection and
monitoring of algorithms.
 Security and Privacy Concerns: As ML relies heavily on data, security breaches can
expose sensitive information. Additionally, the use of personal data raises privacy
concerns that need to be addressed.
 Interpretability and Explainability: Complex ML models can be difficult to
understand, making it challenging to explain their decision-making processes. This lack
of transparency can raise questions about accountability and trust.
 Job Displacement and Automation: Automation through ML can lead to job
displacement in certain sectors. Addressing the need for retraining and reskilling the
workforce is crucial.
Evolution of Machine Learning
 Machine learning is a subset of Artificial Intelligence (AI) that uses algorithms to
learn from data. If you have followed AI in the news recently, for example in connection
with fending off cyber attacks or with accidents involving self-driving cars, this will
be easy to understand. The idea that machines can “learn” without being
programmed by humans goes back further than recent years, though; think of the
first computers in the history of machine learning and how they could only do one thing
at a time. For a practical example of this concept in action, consider how Clickworker
facilitated the training of face recognition software through its crowd-sourced
approach.
 One of the main reasons that machine learning is important for business use today is
that it helps businesses be more efficient in their processes and provide better
customer service through AI-assisted chatbots or automated emails. It also provides
great tools to help teach other people about new things like geography or historical
events!

Machine Learning Paradigms:

Machine learning (ML) is a dynamic field dedicated to developing methods that enable
machines to learn from extensive datasets and make predictions. The learning paradigms in
ML are categorized by the kind of human intervention (supervision) they involve, and each
has its own approach to handling data and its own purposes and applications.

Supervised and Unsupervised learning

Supervised Learning (SL)

Supervised learning involves labelled datasets, where each data observation is paired with a
corresponding class label. Algorithms in supervised learning aim to build a mathematical
function that maps input features to desired output values based on these labeled examples.
Common applications include classification and regression.
Fig: Stages in Supervised Learning

Fig: Understanding Supervised Learning pictorially

Unsupervised Learning

In unsupervised learning, algorithms work with unlabeled data to identify patterns and
relationships. These methods uncover commonalities within the data without predefined
categories. Techniques such as clustering and association rules fall under unsupervised
learning.

Fig: Stages in Unsupervised Learning

Fig: Understanding Unsupervised Learning pictorially

Semi-supervised Learning

Semi-supervised learning strikes a balance by combining a small amount of labelled data with
a larger pool of unlabeled data. This approach leverages the benefits of both supervised and
unsupervised learning paradigms, making it a cost-effective and efficient method for training
models when the labeled data is limited.

Fig: Understanding Semi-supervised Learning pictorially

Self-supervised Learning (SSL)


In scenarios where obtaining high-quality labeled data is challenging, self-supervised learning
emerges as a solution. In this paradigm, models are pre-trained using unlabeled data, and data
labels are generated automatically during subsequent iterations. SSL transforms unsupervised
ML problems into supervised ones, enhancing learning efficiency. This paradigm is
particularly relevant with the rise of large language models.
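The idea that labels are generated automatically from unlabeled data can be sketched with next-word prediction, the kind of task used to pre-train language models. The function name `next_word_pairs` and the context size are illustrative assumptions.

```python
def next_word_pairs(text, context_size=2):
    """Turn raw text into (context, target) training pairs.

    The "label" for each context is simply the word that follows it,
    so no human annotation is needed.
    """
    words = text.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])
        pairs.append((context, words[i]))
    return pairs

pairs = next_word_pairs("the cat sat on the mat")
```

Each pair, such as (("the", "cat"), "sat"), can then be fed to an ordinary supervised learner, which is how an unsupervised problem becomes a supervised one.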

Reinforcement Learning

Reinforcement learning focuses on enabling intelligent agents to learn tasks through trial-and-
error interactions with dynamic environments. Without the need for labelled datasets, agents
make decisions to maximize a reward function. This autonomous exploration and learning
approach is crucial for tasks where explicit programming is challenging.

Fig: Action-reward feedback loop. An agent takes actions in an environment; the outcome is
interpreted as a reward and a representation of the state, which are fed back to the agent.

Action-Reward Feedback Loop:

Reinforcement learning operates on an action-reward feedback loop, where agents take
actions, receive rewards, and observe the environment’s state. This iterative process allows
the agent to autonomously learn optimal actions to maximize positive feedback.
Learning by rote
Rote learning in machine learning is a simple learning pattern that involves memorizing new
information and storing it for future use. It involves the accumulation of information over
time, which can impact the speed and accuracy of recognition as the database grows.
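Rote learning amounts to a lookup table. The class below is a minimal sketch (the name `RoteLearner` is hypothetical) that shows both its strength, exact recall of what it has seen, and its limit, no answer for anything new.

```python
class RoteLearner:
    """Memorizes every (question, answer) pair it is shown."""

    def __init__(self):
        self.memory = {}

    def learn(self, question, answer):
        # Store the pair as-is; nothing is generalized.
        self.memory[question] = answer

    def recall(self, question):
        # Returns None for anything that was never memorized.
        return self.memory.get(question)

learner = RoteLearner()
learner.learn("2 + 2", "4")
known = learner.recall("2 + 2")
unknown = learner.recall("2 + 3")
```

Here `known` is "4", while `unknown` is `None` because the learner never saw that question.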
Learn by induction
"Learn by induction" in machine learning refers to the process where a model learns general
rules or patterns by analyzing specific examples from a dataset. The model makes inferences
about broader concepts from the observed data, which allows it to predict outcomes on unseen
data. This is the core principle of most machine learning algorithms and is considered the
primary method of learning in the field.
Key points about inductive learning:
 Generalization:
The key goal of inductive learning is to generalize from the training data to make accurate
predictions on new, unseen data.
 Example-based learning:
Models learn by observing labelled examples where input data is paired with corresponding
outputs.
 Pattern recognition:
The algorithm identifies patterns and relationships within the data to build a model that can
represent the underlying structure.
 Inductive bias:
This refers to the assumptions a model makes about the data, which can be built into the
algorithm and influence how it learns patterns.
How it works:
1. Training phase:
The model is presented with a labeled dataset containing input data and corresponding target
values.
2. Pattern extraction:
The model analyzes the data, identifying recurring patterns and relationships between features
and targets.
3. Model building:
Based on the identified patterns, the model builds a mathematical representation that can be
used to make predictions on new data.
Examples of inductive learning algorithms:
 Decision trees: Learn by splitting data based on features to create a set of rules for
classification.
 Naive Bayes classifiers: Based on Bayes' theorem, making probabilistic predictions based on
features.
 Neural networks: Learn complex patterns through a layered structure, adapting weights to
improve predictions.
Important considerations:
 Overfitting:
If a model learns the training data too closely, it may not generalize well to unseen data.
 Underfitting:
If a model is too simple, it may not capture important patterns in the data.
 Data quality:
The quality of the training data significantly impacts the model's ability to learn accurate
patterns.
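Induction can be illustrated with the simplest possible decision tree, a single threshold rule (a decision stump) learned from labeled examples. The data and the feature (hours studied versus passing an exam) are hypothetical and serve only to show a general rule being induced from specific cases.

```python
# Hypothetical labeled examples: (hours_studied, passed_exam).
examples = [(1, False), (2, False), (3, False), (5, True), (6, True), (8, True)]

def learn_threshold(examples):
    """Pick the threshold t for which the rule `x >= t` makes the fewest mistakes."""
    candidates = sorted({x for x, _ in examples})

    def mistakes(t):
        return sum((x >= t) != label for x, label in examples)

    return min(candidates, key=mistakes)

threshold = learn_threshold(examples)

def predict(x):
    """Apply the induced general rule to any input, seen or unseen."""
    return x >= threshold
```

The rule learned from six examples also answers for inputs such as 4 or 7 hours, which never appeared in the training data. With noisier data, a rule that fits every training point exactly would risk the overfitting described above.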
Reinforcement Learning
Reinforcement Learning: An Overview
Reinforcement Learning (RL) is a branch of machine learning focused on making decisions
to maximize cumulative rewards in a given situation. Unlike supervised learning, which
relies on a training dataset with predefined answers, RL involves learning through
experience. In RL, an agent learns to achieve a goal in an uncertain, potentially complex
environment by performing actions and receiving feedback through rewards or penalties.
Key Concepts of Reinforcement Learning
 Agent: The learner or decision-maker.
 Environment: Everything the agent interacts with.
 State: A specific situation in which the agent finds itself.
 Action: A move the agent can make; the set of all possible moves is the action space.
 Reward: Feedback from the environment based on the action taken.
How Reinforcement Learning Works
RL operates on the principle of learning optimal behaviour through trial and error. The
agent takes actions within the environment, receives rewards or penalties, and adjusts its
behaviour to maximize the cumulative reward. This learning process is characterized by the
following elements:
 Policy: A strategy used by the agent to determine the next action based on the current
state.
 Reward Function: A function that provides a scalar feedback signal based on the state
and action.
 Value Function: A function that estimates the expected cumulative reward from a given
state.
 Model of the Environment: A representation of the environment that helps in planning
by predicting future states and rewards.
Example: Navigating a Maze
The problem is as follows: we have an agent and a reward, with many hurdles in between.
The agent is supposed to find the best possible path to reach the reward. The following
example illustrates the problem.
The original figure shows a robot, a diamond, and fire. The goal of the robot is to get the
reward, which is the diamond, and avoid the hurdles, which are the fire. The robot learns by
trying the possible paths and then choosing the path that gives it the reward with the fewest
hurdles. Each right step gives the robot a reward and each wrong step subtracts from the
robot's reward. The total reward is calculated when it reaches the final reward,
the diamond.
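The robot, diamond and fire example can be sketched with tabular Q-learning, one standard RL algorithm; the notes do not name a specific algorithm, so this is a swapped-in technique. The 3x3 grid layout, the reward values (+10 diamond, -10 fire, -1 per move) and the hyperparameters are all assumptions chosen for illustration.

```python
import random

ROWS, COLS = 3, 3
START, DIAMOND, FIRE = (0, 0), (2, 2), (1, 1)   # assumed layout
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply an action; return (next_state, reward, done)."""
    dr, dc = ACTIONS[action]
    r = min(max(state[0] + dr, 0), ROWS - 1)    # walls keep the robot in the grid
    c = min(max(state[1] + dc, 0), COLS - 1)
    nxt = (r, c)
    if nxt == DIAMOND:
        return nxt, 10.0, True
    if nxt == FIRE:
        return nxt, -10.0, True
    return nxt, -1.0, False                     # small cost per move favours short paths

def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table of action values Q[(state, action)] by trial and error."""
    rng = random.Random(seed)
    q = {((r, c), a): 0.0
         for r in range(ROWS) for c in range(COLS) for a in ACTIONS}
    for _ in range(episodes):
        state, done = START, False
        while not done:
            if rng.random() < epsilon:          # explore: try a random action
                action = rng.choice(list(ACTIONS))
            else:                               # exploit: best known action
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # Q-learning update: move the estimate toward reward + discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
    return q

def greedy_path(q, max_steps=20):
    """Follow the best learned action from START until the episode ends."""
    state, path = START, [START]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path

q = train()
path = greedy_path(q)
```

After training, following the highest-valued action from the start leads to the diamond while steering around the fire cell.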

Main points in Reinforcement learning –


 Input: The input should be an initial state from which the model will start.
 Output: There are many possible outputs, as there are a variety of solutions to a particular
problem.
 Training: The training is based upon the input. The model returns a state, and the user
decides whether to reward or punish the model based on its output.
 The model continues to learn.
 The best solution is decided based on the maximum reward.

Difference between Reinforcement learning and Supervised learning:


Reinforcement learning | Supervised learning
Reinforcement learning is all about making decisions sequentially. In simple words, the output depends on the state of the current input, and the next input depends on the output of the previous input. | In supervised learning, the decision is made on the initial input, or the input given at the start.
In reinforcement learning the decisions are dependent, so labels are given to sequences of dependent decisions. | In supervised learning the decisions are independent of each other, so labels are given to each decision.
Example: chess game, text summarization. | Example: object recognition, spam detection.

Types of Reinforcement:
1. Positive: Positive reinforcement occurs when an event that follows a particular
behaviour increases the strength and frequency of that behaviour. In other words, it
has a positive effect on behaviour.
Advantages of positive reinforcement:
 Maximizes performance
 Sustains change for a long period of time
Disadvantage:
 Too much reinforcement can lead to an overload of states, which can diminish the
results
2. Negative: Negative reinforcement is the strengthening of behaviour because a
negative condition is stopped or avoided.
Advantages of negative reinforcement:
 Increases behaviour
 Provides defiance to a minimum standard of performance
Disadvantage:
 It only provides enough to meet the minimum behaviour
Elements of Reinforcement Learning
i) Policy: Defines the agent’s behaviour at a given time.
ii) Reward Function: Defines the goal of the RL problem by providing feedback.
iii) Value Function: Estimates long-term rewards from a state.
iv) Model of the Environment: Helps in predicting future states and rewards for planning.
Types of Data in ML
Data types are a way of classification that specifies which type of value a variable
can store and which mathematical, relational, or logical operations can be applied to the
variable without causing an error. In machine learning, it is very important to know the
appropriate data types of the independent and dependent variables, as this provides the
basis for selecting classification or regression models. Incorrect identification of data
types leads to incorrect modelling, which in turn leads to an incorrect solution.

Here I will discuss the different data types with suitable examples.

Different Types of Data Types

Data types are broadly classified into

1. Quantitative
2. Qualitative

Fig: Different Data Types

1. Quantitative Data Type

This data type consists of numerical values: anything that is measured by numbers,
e.g., profit, quantity sold, height, weight, temperature, etc.

It is again of two types.

A.) Discrete Data Type

Numeric data that takes discrete values or whole numbers. A value of this type has no
proper meaning if expressed in decimal format. Its values can be counted.

E.g., number of cars you have, number of marbles in a container, students in a class, etc.

Fig: Discrete Data Types

B.) Continuous Data Type

Numerical measures that can take any value within a certain range. A value of this type
has true meaning when expressed in decimal format. Its values cannot be counted but
are measured, and the number of possible values is infinite.

E.g., height, weight, time, area, distance, measurement of rainfall, etc.

Fig: Continuous Data Types

2. Qualitative Data Type

These are data types that cannot be expressed in numbers. They describe
categories or groups and are hence known as the categorical data type.

This can be divided into:

A. Structured Data:
This type of data is either numbers or words. It can take numerical values, but
mathematical operations cannot be performed on it. This type of data is expressed in
tabular format.

E.g., Sunny=1, Cloudy=2, Windy=3, or binary data like 0 or 1, good or bad, etc.

Fig: Structured Data

B. Unstructured Data:
This type of data does not have a proper format and is therefore known as
unstructured data. It comprises textual data, sounds, images, videos, etc.

Fig: Unstructured Data
Besides these, there are other types, referred to as data type preliminaries or data
measures:

1. Nominal
2. Ordinal
3. Interval
4. Ratio
These are also referred to as the different scales of measurement.
I. Nominal Data Type:
This is used to express names or labels that are not ordered or measurable.

E.g., male or female (gender), race, country, etc.

Fig: Gender (Female, Male), An Example of Nominal Data Type

II. Ordinal Data Type:

This is also a categorical data type like nominal data, but it has a natural ordering
associated with it.

E.g., Likert rating scale, shirt sizes, ranks, grades, etc.

Fig: Rating (Good, Average, Poor), An Example of Ordinal Data Type
III. Interval Data Type:

This is numeric data that has a proper order and meaningful differences between values,
but no true zero. Here zero does not mean a complete absence of the quantity; it is just
another value on the scale. This is the local scale.

E.g., temperature measured in degrees Celsius, time, SAT score, credit score, pH, etc.
The difference between values is meaningful. In this case, there is no absolute zero.
Fig: Temperature, An Example of Interval Data Type

IV. Ratio Data Type:

This quantitative data type is the same as the interval data type but has an absolute
zero. Here zero means complete absence, and the scale starts from zero. This is the
global scale.

E.g., temperature in Kelvin, height, weight, etc.

Fig: Weight, An Example of Ratio Data Type
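The measurement scale determines how a column should be turned into numbers for a model. The sketch below shows one common convention, one-hot encoding for nominal data and integer ranks for ordinal data; the helper names and the example categories are assumptions for illustration.

```python
def one_hot(value, categories):
    """Nominal data: one 0/1 column per category, no order implied."""
    return [1 if value == c else 0 for c in categories]

def ordinal_code(value, ordered_levels):
    """Ordinal data: integer rank that preserves the natural order."""
    return ordered_levels.index(value)

countries = ["India", "Japan", "Brazil"]   # nominal (hypothetical list)
sizes = ["S", "M", "L", "XL"]              # ordinal: S < M < L < XL

country_vec = one_hot("Japan", countries)
size_rank = ordinal_code("L", sizes)

# Ratio data supports meaningful ratios because its zero is absolute:
kelvin_ratio = 300.0 / 150.0               # 300 K really is twice 150 K
# The same statement is not meaningful for interval data such as Celsius.
```

Encoding a nominal column with integer ranks would wrongly suggest an order (for example that one country is "greater" than another), which is why the two scales are treated differently.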
The stages in machine learning
The stages include data collection, model development, and model deployment.
Data collection
 Data collection: The first step in the machine learning process
 Data preparation: The process of cleaning, normalizing, and organizing data so it can be
used for analysis and modeling
Model development
 Model engineering: The process of creating, training, and refining machine learning models
 Model evaluation: The process of assessing the model's performance
 Hyperparameter tuning and optimization: The process of adjusting the model's
hyperparameters (settings chosen before training, such as the learning rate) to improve its
performance
Model deployment
 Model deployment: The process of putting the machine learning model into production so it
can be used to make predictions
 Monitoring and maintenance: The process of continuously monitoring and improving the
model's performance
ML LifeCycle
The machine learning lifecycle is a process that guides the development and deployment of
machine learning models in a structured way. It consists of several steps, each of which
plays a crucial role in ensuring the success and effectiveness of the machine learning
model. By following the machine learning lifecycle we can solve complex problems, gain
data-driven insights, and create scalable and sustainable models. The steps are:
1. Problem Definition
2. Data Collection
3. Data Cleaning and Preprocessing
4. Exploratory Data Analysis (EDA)
5. Feature Engineering and Selection
6. Model Selection
7. Model Training
8. Model Evaluation and Tuning
9. Model Deployment
10. Model Monitoring and Maintenance

Fig: Machine Learning Lifecycle
Step 1: Problem Definition
In this initial phase we need to identify the business problem and frame it. By framing the
problem in a comprehensive manner, the team establishes a foundation for the machine
learning lifecycle. Crucial elements such as project objectives, desired outcomes and the
scope of the task are carefully defined during this stage.
Here are some steps for problem definition:
 Collaboration: Work together with stakeholders to understand and define the business
problem.
 Clarity: Clearly write down the objectives, desired outcomes and scope of the task.
 Foundation: Establish a solid foundation for the machine learning process by framing
the problem comprehensively.
Step 2: Data Collection
After problem definition, the machine learning lifecycle progresses to data collection. This
phase involves the systematic collection of datasets that can be used as raw data to train
the model. The quality and diversity of the data collected directly impact the robustness
and generalization of the model.
During data collection we must consider the relevance of the data to the defined problem,
ensuring that the selected datasets contain all necessary features and characteristics. A
well-organized approach to data collection helps in effective model training, evaluation and
deployment, ensuring that the resulting model is accurate and can be used in real-world
scenarios.
Here are some basic features of Data Collection:
 Relevance: Collected data should be relevant to the defined problem and include
necessary features.
 Quality: Ensure data quality by considering factors like accuracy and ethical use.
 Quantity: Gather sufficient data volume to train a robust model.
 Diversity: Include diverse datasets to capture a broad range of scenarios and patterns.
Step 3: Data Cleaning and Preprocessing
With datasets in hand, we now need to do data cleaning and preprocessing. Raw data is
often messy and unstructured, and using it directly for training can lead to poor
accuracy and to the model capturing spurious relationships in the data. Data cleaning
involves addressing issues such as missing values, outliers and inconsistencies in data
that could compromise the accuracy and reliability of the machine learning model.
Preprocessing is done by standardizing formats, scaling values and encoding categorical
variables, creating a consistent and well-organized dataset. The objective is to refine the
raw data into a format that is meaningful for analysis and training. Data cleaning and
preprocessing ensure that the model is trained on high-quality and reliable data.
Here are the basic features of Data Cleaning and Preprocessing:
 Data Cleaning: Address issues such as missing values, outliers and inconsistencies in
the data.
 Data Preprocessing: Standardize formats, scale values, and encode categorical
variables for consistency.
 Data Quality: Ensure that the data is well-organized and prepared for meaningful
analysis.
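The three operations named in this step (handling missing values, scaling values, encoding categorical variables) can be sketched on a tiny table. The records, column names and the choice of mean imputation and min-max scaling are assumptions for illustration; other strategies are equally valid.

```python
# Hypothetical raw records with a missing age and a categorical column.
raw = [
    {"age": 25, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 35, "city": "Pune"},
    {"age": 45, "city": "Delhi"},
]

def clean_and_preprocess(rows):
    # Cleaning: fill missing ages with the mean of the known ages.
    known = [r["age"] for r in rows if r["age"] is not None]
    mean_age = sum(known) / len(known)
    ages = [r["age"] if r["age"] is not None else mean_age for r in rows]

    # Scaling: min-max scale ages into [0, 1] (assumes ages are not all equal).
    lo, hi = min(ages), max(ages)
    scaled = [(a - lo) / (hi - lo) for a in ages]

    # Encoding: map each city name to an integer code.
    codes = {city: i for i, city in enumerate(sorted({r["city"] for r in rows}))}
    return [{"age_scaled": s, "city_code": codes[r["city"]]}
            for s, r in zip(scaled, rows)]

processed = clean_and_preprocess(raw)
```

The missing age is replaced by 35 (the mean of 25, 35 and 45), and every row ends up with purely numeric, consistently scaled values.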
Step 4: Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is used to uncover insights hidden in the data and to
understand the dataset's structure. EDA reveals patterns, trends and characteristics that
may not be visible to the naked eye. These insights can be used to make informed decisions.
Visualizations help present statistical summaries in an easy and understandable way. They
also help in making choices about feature engineering, model selection and other critical
aspects.
Here are the basic features of Exploratory Data Analysis:
 Exploration: Use statistical and visual tools to explore patterns in data.
 Patterns and Trends: Identify underlying patterns, trends and potential challenges
within the dataset.
 Insights: Gain valuable insights for informed decision-making in later stages.
 Decision Making: Use EDA for feature engineering and model selection.
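A first pass of EDA often starts with a statistical summary and a quick picture of the distribution. The sketch below uses Python's standard `statistics` module on a made-up column; the sales figures are hypothetical.

```python
import statistics

# Hypothetical numeric column, e.g. monthly sales figures.
sales = [120, 135, 128, 400, 131, 125, 138]

summary = {
    "count": len(sales),
    "mean": statistics.mean(sales),
    "median": statistics.median(sales),
    "stdev": statistics.stdev(sales),
    "min": min(sales),
    "max": max(sales),
}

# A large gap between mean and median hints at skew or outliers.
gap = summary["mean"] - summary["median"]

# A crude text histogram: one '#' per 50 units.
for value in sales:
    print(f"{value:>4} " + "#" * (value // 50))
```

Here the mean sits well above the median, which points to the single value of 400 as a potential outlier worth investigating before modelling.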
Step 5: Feature Engineering and Selection
Feature engineering and selection is a transformative process that involves choosing and
shaping the features used for model prediction. Feature selection refines the pool of
variables, identifying the most relevant ones to enhance model efficiency and effectiveness.
Feature engineering involves creating new features, often by transforming existing ones,
for prediction. This creative process requires domain expertise
and a deep understanding of the problem, ensuring that the engineered features contribute
meaningfully to model prediction. Together they help accuracy while minimizing computational
complexity.
Here are the basic features of Feature Engineering and Selection:
• Feature Engineering: Create new features or transform existing ones to better capture
patterns and relationships.
• Feature Selection: Identify the subset of features that most significantly impacts the model's
performance.
• Domain Expertise: Use domain knowledge to engineer features that contribute
meaningfully to prediction.
• Optimization: Balance the set of features for accuracy while minimizing computational
complexity.
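The sketch below shows both ideas on a tiny, invented housing dataset: a new feature (area) is engineered from two existing ones, and features are then selected by the absolute value of their correlation with the target. The column names, the prices and the 0.8 threshold are assumptions for this example. Correlation filtering is only one simple selection technique, not the only way to do it.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical housing records and target prices.
features = {
    "length_m": [5, 8, 10, 12, 6],
    "width_m": [4, 5, 6, 8, 7],
    "house_number": [7, 3, 9, 1, 5],
}
price = [40, 80, 120, 190, 85]

# Feature engineering: domain knowledge says price depends on floor area.
features["area_m2"] = [
    l * w for l, w in zip(features["length_m"], features["width_m"])
]

def select_features(features, target, threshold=0.8):
    """Keep features whose absolute correlation with the target is high."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(name for name, score in scores.items() if score >= threshold)

selected = select_features(features, price)
```

Here the engineered area feature is kept, while the house number, which carries little information about price, is dropped.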
Step 6: Model Selection
Model selection is a very important part of building a good machine learning model: we need to
find a model that aligns with the defined problem and the characteristics of the dataset.
It determines the algorithmic framework for prediction. The choice depends on the nature of the
data, the complexity of the problem and the desired outcomes.
Here are the basic features of Model Selection:
• Alignment: Select a model that aligns with the defined problem and characteristics of
the dataset.
• Complexity: Consider the complexity of the problem and the nature of the data when
choosing a model.
• Decision Factors: Evaluate factors like performance, interpretability and scalability
when selecting a model.
• Experimentation: Experiment with different models to find the best fit for the problem.
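One way to experiment with different models is to fit each candidate on the training data and compare their errors on held-out validation data. The sketch below does this for two deliberately simple candidates, a mean baseline and a one-variable least-squares line. The data and all names (fit_mean, fit_line, mse) are hypothetical, and real projects would also weigh interpretability and scalability.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Baseline candidate: always predict the average training target."""
    avg = mean(ys)
    return lambda x: avg

def fit_line(xs, ys):
    """Second candidate: one-variable least-squares line."""
    mx, my = mean(xs), mean(ys)
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, split into training and validation parts.
train_x, train_y = [1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 8.1, 9.8, 12.1]
val_x, val_y = [7, 8], [14.0, 16.1]

candidates = {"mean_baseline": fit_mean, "line": fit_line}
val_errors = {
    name: mse(fit(train_x, train_y), val_x, val_y)
    for name, fit in candidates.items()
}
best_name = min(val_errors, key=val_errors.get)  # lowest validation error wins
```

Comparing on validation data rather than training data keeps the choice from favoring a model that merely memorizes the training set.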
Step 7: Model Training
With the model selected, the machine learning lifecycle moves to the model training process.
This process involves exposing the model to historical data, allowing it to learn patterns,
relationships and dependencies within the dataset.
Model training is an iterative process in which the algorithm adjusts its parameters to
minimize errors and enhance predictive accuracy. During this phase the model fine-tunes
itself to fit the data better and to optimize its ability to make predictions.
A rigorous training process aims for a model that works well on new, unseen data, so that
its predictions are reliable in real-world scenarios.
Here are the basic features of Model Training:
• Training Data: Expose the model to historical data to learn patterns, relationships and
dependencies.
• Iterative Process: Train the model iteratively, adjusting parameters to minimize errors
and enhance accuracy.
• Optimization: Fine-tune the model to optimize its predictive capabilities.
• Validation: Train the model rigorously so that it stays accurate on new, unseen data.
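To make the iterative adjustment of parameters concrete, here is a minimal sketch of gradient descent fitting a line y = w*x + b. The data, the learning rate and the number of epochs are assumptions chosen so that this toy example converges. It illustrates the idea rather than any particular library's training loop.

```python
def train(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    history = []  # error after each pass, to watch it shrink
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errors) / n)
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Adjust the parameters a small step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history

# Hypothetical historical data generated from y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
w, b, history = train(xs, ys)
```

The recorded history shows the error falling from pass to pass, which is the "minimize errors" behavior described above.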
Step 8: Model Evaluation and Tuning
Model evaluation involves rigorous testing against validation or test datasets to measure the
accuracy of the model on new, unseen data. We can use metrics like accuracy, precision,
recall and F1 score to check model effectiveness.
Evaluation is critical because it provides insight into the model's strengths and weaknesses. If the
model fails to achieve the desired performance levels, we may need to tune the model again and
adjust its hyperparameters to enhance predictive accuracy. This iterative cycle of evaluation
and tuning is crucial for achieving the desired level of model robustness and reliability.
Here are the basic features of Model Evaluation and Tuning:
• Evaluation Metrics: Use metrics like accuracy, precision, recall and F1 score to
evaluate model performance.
• Strengths and Weaknesses: Identify the strengths and weaknesses of the model
through rigorous testing.
• Iterative Improvement: Initiate model tuning to adjust hyperparameters and enhance
predictive accuracy.
• Model Robustness: Repeat tuning until the desired levels of model robustness and
reliability are reached.
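The four metrics named above can be computed directly from the counts of true and false positives and negatives. The sketch below does that and then runs a very small tuning loop over a decision threshold, used here as a stand-in for a hyperparameter. The labels, scores, candidate thresholds and function names are invented for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def tune_threshold(y_true, scores, candidates=(0.3, 0.5, 0.7)):
    """Pick the candidate threshold with the highest F1 on validation data."""
    def f1_at(threshold):
        preds = [int(s >= threshold) for s in scores]
        return classification_metrics(y_true, preds)["f1"]
    return max(candidates, key=f1_at)

# Hypothetical validation labels and model scores.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]

baseline = classification_metrics(y_true, [int(s >= 0.5) for s in scores])
best_threshold = tune_threshold(y_true, scores)
```

Tuning should use validation data, with a separate test set held back for the final check, so that the reported performance is not inflated by the tuning itself.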
Step 9: Model Deployment
After successful evaluation, the machine learning model is ready for deployment in a real-world
application. Model deployment involves integrating the predictive model with existing
systems, allowing the business to use it for informed decision-making.
Here are the basic features of Model Deployment:
• Integration: Integrate the trained model into existing systems or processes for
real-world application.
• Decision Making: Use the model's predictions for informed decision-making.
• Practical Solutions: Deploy the model to turn theoretical insights into practical
solutions that address business needs.
• Continuous Improvement: Monitor model performance and make adjustments as
necessary to maintain effectiveness over time.
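A minimal sketch of what integration and monitoring can look like: the trained parameters are saved to a file, loaded again by the serving side, used for a prediction, and a simple check flags when recent errors grow too large. The JSON file format, the parameter names w and b, the error limit and every function name are assumptions for this example. Production deployments typically involve a model server or API in addition to this.

```python
import json
import os
import tempfile

def save_model(params, path):
    """Persist learned parameters so another process can load them."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(params, f)

def load_model(path):
    """Load previously saved parameters."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def predict(params, x):
    """Serve a prediction from a linear model y = w*x + b."""
    return params["w"] * x + params["b"]

def needs_attention(recent_abs_errors, limit=1.0):
    """Monitoring hook: flag the model when the average error exceeds a limit."""
    return sum(recent_abs_errors) / len(recent_abs_errors) > limit

# Training side: store the (hypothetical) trained parameters.
trained = {"w": 2.0, "b": 1.0}
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(trained, model_path)

# Serving side: load the parameters and answer a request.
served = load_model(model_path)
prediction = predict(served, 3.0)
```

When the monitoring check fires, the usual response is to investigate the incoming data and retrain or retune the model.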
CRISP MODEL - CRISP-DM FRAMEWORK
The CRISP-DM (Cross Industry Standard Process for Data Mining) framework is a standard
process for solving analytical problems. It comprises a six-phase workflow and was designed
to be flexible, so it suits a wide variety of projects.
What is CRISP in ML?
CRISP-ML(Q) stands for Cross-Industry Standard Process for Machine Learning with Quality
Assurance.
What are the 6 CRISP-DM Phases?
I. Business Understanding
II. Data Understanding
III. Data Preparation
IV. Modeling
V. Evaluation
VI. Deployment
Figure: CRISP-DM Waterfall (Horizontal Slicing)
CRISP-DM

Creating a machine learning system involves more than just selecting a model, training it, and
applying it to new data. There are frameworks that help us organize machine learning
projects.

One such framework is CRISP-DM — the Cross-Industry Standard Process for Data Mining.
It was invented quite long ago, in 1996, but in spite of its age, it’s still applicable to today’s
problems.

According to CRISP-DM, the machine learning process has six steps:

1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment

Each phase covers typical tasks:

• In the business understanding step, we try to identify the problem, to understand how
we can solve it, and to decide whether machine learning will be a useful tool for
solving it.
• In the data understanding step, we analyze available datasets and decide whether we
need to collect more data.
• In the data preparation step, we transform the data into tabular form that we can use as
input for a machine learning model.
• When the data is prepared, we move to the modeling step, in which we train a model.
• After the best model is identified, there’s the evaluation step, where we evaluate the
model to see if it solves the original business problem and measure its success at
doing that.
• Finally, in the deployment step, we deploy the model to the production environment.

Example
Suppose we want to build a spam detection system: for each email we get, we want to
determine if it’s spam or not. If it is, we want to put it into the “spam” folder.
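
To make the goal concrete, the sketch below routes an email to the "spam" folder or the inbox. The keyword-based score is only a toy stand-in for a trained model, and the keyword list, the threshold and the function names are all hypothetical.

```python
# Hypothetical keyword list; a real system would learn from labeled emails.
SPAM_WORDS = {"winner", "free", "prize", "urgent"}

def spam_score(text):
    """Fraction of words in the email that appear in the keyword list."""
    words = [w.strip(".,!?:;").lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(w in SPAM_WORDS for w in words) / len(words)

def route(email_text, threshold=0.2):
    """Decide which folder an incoming email should go to."""
    return "spam" if spam_score(email_text) >= threshold else "inbox"

folder_a = route("URGENT: you are a winner, claim your free prize!")
folder_b = route("Meeting moved to 3pm, see agenda attached")
```

The routing step stays the same whatever produces the score, so a trained classifier can later replace the keyword rule without changing how emails are filed.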
