Data Sources for Machine Learning Models

The document provides an overview of machine learning, detailing the steps involved in the process, including data collection, preparation, model selection, training, evaluation, parameter tuning, and making predictions. It also categorizes machine learning into supervised, unsupervised, and reinforcement learning, explaining various algorithms associated with each type. Key points about data sources and the importance of data quality are emphasized throughout the text.

Uploaded by

Chaitali Patil

Machine Learning

By [Link] Mhatre
Application
1. Collecting Data:

 As you know, machines initially learn from the data that you give them.
 It is of the utmost importance to collect reliable data so that your machine learning
model can find the correct patterns.
 The quality of the data that you feed to the machine will determine how accurate your
model is.
 If you have incorrect or outdated data, you will have wrong outcomes or predictions
which are not relevant.
 Make sure you use data from a reliable source, as it will directly affect the outcome of
your model.
 Good data is relevant, contains very few missing and repeated values, and has a good
representation of the various subcategories/classes present.

Figure 1: Collecting Data

Some popular sources for datasets include:

Kaggle (a large repository with diverse data types), FiveThirtyEight (curated
datasets from data journalism), AWS Public Datasets (datasets across various
domains like genomics and climate), World Bank Open Data (global development
data), EU Open Data Portal (datasets from European Union institutions), Pew
Research Center (datasets related to public opinion), and GitHub
(user-contributed datasets).

Key points about these sources:

Kaggle:
Well-known for hosting machine learning competitions and providing a vast collection of
datasets.

FiveThirtyEight:
Offers high-quality datasets often accompanied by code used for visualizations in their news
articles.

AWS Public Datasets:

Access to datasets across various fields, hosted on Amazon S3 storage.

World Bank:
Comprehensive data source on global development topics like poverty, education, and health.

EU Open Data Portal:

Public access to datasets from European Union institutions on topics like government, health,
and energy.

Pew Research Center:

Datasets based on surveys, providing insights into public opinion and societal trends.

GitHub:
Hosts user-contributed datasets.

2. Preparing the Data:

After you have your data, you have to prepare it. You can do this by:
 Putting together all the data you have and randomizing it. This helps make sure that
the data is evenly distributed and that the ordering does not affect the learning process.
 Cleaning the data: removing unwanted rows and columns, handling missing and duplicate
values, converting data types, and so on. You might even have to restructure the
dataset by changing its rows, columns, or indexes.
 Visualizing the data to understand how it is structured and how the various variables
and classes relate to each other.
 Splitting the cleaned data into two sets: a training set and a testing set. The training set
is the set your model learns from. The testing set is used to check the accuracy of your
model after training.
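
The preparation steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `prepare`, the `(features, label)` row layout, the 20% test fraction, and the sample rows are all hypothetical and do not come from the original slides.

```python
import random

def prepare(rows, test_fraction=0.2, seed=0):
    """Clean, shuffle, and split a list of (features, label) rows."""
    # Cleaning: drop rows with missing values and exact duplicates.
    seen = set()
    cleaned = []
    for features, label in rows:
        if label is None or any(v is None for v in features):
            continue
        key = (tuple(features), label)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((features, label))
    # Randomizing: shuffle so the original ordering cannot affect learning.
    random.Random(seed).shuffle(cleaned)
    # Splitting: hold out the last part of the shuffled data as the testing set.
    n_test = max(1, round(len(cleaned) * test_fraction))
    split = len(cleaned) - n_test
    return cleaned[:split], cleaned[split:]

# Hypothetical rows: one duplicate and one row with a missing value.
rows = [
    ((5.1, 3.5), "a"), ((5.1, 3.5), "a"),
    ((4.9, None), "a"),
    ((6.2, 2.9), "b"), ((5.9, 3.0), "b"),
    ((4.7, 3.2), "a"), ((6.7, 3.1), "b"),
]
train_set, test_set = prepare(rows)
print(len(train_set), len(test_set))
```

In practice a library such as pandas or scikit-learn would usually handle these steps; the sketch only shows the order of operations.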

Figure 2: Cleaning and Visualizing Data

3. Choosing a Model:

 A machine learning model determines the output you get after running a machine
learning algorithm on the collected data.
 It is important to choose a model that is relevant to the task at hand. Over the
years, scientists and engineers have developed various models suited to different tasks,
such as speech recognition, image recognition, and prediction.
 Apart from this, you also have to check whether your model is suited to numerical or
categorical data and choose accordingly.

Figure 3: Choosing a model

4. Training the Model:
 Training is the most important step in machine learning. In training, you pass the
prepared data to your machine learning model to find patterns and make
predictions.
 It results in the model learning from the data so that it can accomplish the task set.
 Over time, with training, the model gets better at predicting.

Figure 4: Training a model

5. Evaluating the Model:

 After training your model, you have to check how it is performing.
 This is done by testing the performance of the model on previously unseen data.
 The unseen data used is the testing set that you split off from your data earlier.
 If you test on the same data that was used for training, you will not get an
accurate measure, because the model has already seen that data and finds the same
patterns in it as it did before.
 This will give you disproportionately high accuracy.
 When you use the testing data, you get a more accurate measure of how well, and how
fast, your model will perform.

Figure 5: Evaluating a model

 Unseen data:
• This is a broader term encompassing any data that the model has not been
trained on, including potential real-world scenarios that the model might
encounter.

 Testing data:
• A carefully selected subset of unseen data used to test the model's
accuracy and ability to generalize to new situations.

How they are used:

• Training phase: The model learns patterns from the "training data".
• Evaluation phase: Once the model is trained, it is tested on the "testing
data" to assess its performance on unseen examples.
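
To illustrate why accuracy measured on the training data is misleading, here is a minimal sketch using a 1-nearest-neighbour classifier written from scratch. The data points, the noisy label, and the function names are hypothetical and were chosen only to make the effect visible.

```python
def nearest_label(train, x):
    """1-nearest-neighbour: return the label of the closest training point."""
    closest = min(train, key=lambda row: abs(row[0] - x))
    return closest[1]

def accuracy(train, data):
    """Fraction of points in `data` whose predicted label matches the true one."""
    hits = sum(1 for x, y in data if nearest_label(train, x) == y)
    return hits / len(data)

# Hypothetical 1-D data: the true rule is "label 1 when x > 5".
# The training point (4, 1) carries a deliberately wrong (noisy) label.
train = [(1, 0), (2, 0), (3, 0), (4, 1), (6, 1), (7, 1), (8, 1)]
test = [(1.5, 0), (3.8, 0), (4.4, 0), (5.5, 1), (7.5, 1)]

train_acc = accuracy(train, train)  # every point is its own nearest neighbour
test_acc = accuracy(train, test)    # unseen points expose the memorized noise
print(train_acc)  # 1.0
print(test_acc)   # 0.6
```

The model scores perfectly on the data it memorized, while the testing set gives the more realistic figure.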
6. Parameter Tuning:

 Once you have created and evaluated your model, see if its accuracy can be
improved in any way.
 This is done by tuning the parameters of your model.
 Parameters, in this sense, are the settings of the model that the programmer generally
chooses before training (often called hyperparameters).
 At a particular value of each parameter, the accuracy will be at its maximum.
 Parameter tuning refers to finding these values.
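
As a rough sketch of parameter tuning, the snippet below tries several values of k for a hand-written k-nearest-neighbours classifier and keeps the value with the highest accuracy on held-out data. The choice of k-NN, the candidate values (1, 3, 5), and the data are assumptions made for illustration only.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points closest to x."""
    neighbours = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def accuracy(train, data, k):
    """Fraction of held-out points classified correctly for a given k."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

# Hypothetical data; (4, 1) is a noisy training label.
train = [(1, 0), (2, 0), (3, 0), (4, 1), (6, 1), (7, 1), (8, 1)]
held_out = [(1.5, 0), (3.8, 0), (4.4, 0), (5.5, 1), (7.5, 1)]

# Try each candidate value of the parameter and record the accuracy.
scores = {k: accuracy(train, held_out, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)  # first value reaching the top score
print(scores, best_k)
```

Libraries typically automate this loop (for example with a grid search), but the idea is the same: evaluate each setting and keep the best one.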

Figure 6: Parameter Tuning

7. Making Predictions:

 In the end, you can use your model on unseen data to make accurate predictions.

Types of Machine Learning:
There are three important types of machine learning algorithms:

 Supervised Learning
 Unsupervised Learning
 Reinforcement Learning

Supervised Learning:

 Supervised learning is the most popular paradigm for performing machine learning
operations. It is widely used for data where there is a precise mapping between
inputs and outputs.
 The dataset, in this case, is labeled: every example comes with the correct output, and
the algorithm learns from the features to carry out prediction or classification accordingly.
 As training progresses, the algorithm learns to identify the relationship between the
inputs and the outputs, so that we can predict a new outcome.
 As a result, supervised learning algorithms are task-oriented.
 As we provide more and more examples, the algorithm learns the task better and
yields more accurate output.
 Some of the algorithms that come under supervised learning are as follows:

Linear Regression:

 Linear regression is a supervised machine learning algorithm where the
predicted output is continuous and has a constant slope.
 It is used to predict values within a continuous range (e.g. sales, price) rather than
trying to classify them into categories (e.g. cat, dog).
 It is used whenever we want to predict the value of a variable based on the value
of another variable.
 The variable we want to predict is called the dependent variable (or sometimes,
the outcome variable).
 In linear regression, we measure the linear relationship between two or more
variables. Based on this relationship, we make predictions that follow
this linear pattern.
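
A minimal sketch of fitting a straight line with ordinary least squares is shown below, for the single-variable case only. The function name `fit_line` and the sample points (which lie exactly on y = 2x + 1) are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # hypothetical data on the line y = 2x + 1

slope, intercept = fit_line(xs, ys)
predicted = slope * 6 + intercept  # predict the dependent variable at x = 6
print(slope, intercept, predicted)  # 2.0 1.0 13.0
```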
Random Forest:

 Random forest is a flexible, easy-to-use machine learning algorithm that
produces, even without hyper-parameter tuning, an excellent result most of the
time.
 It is also one of the most used algorithms, due to its simplicity and versatility
(it can be used for both classification and regression tasks).
 Random forests are an ensemble learning method for classification, regression,
and other tasks. They construct many decision trees and output the mode of the
individual trees' predicted classes (for classification) or the mean of their
predictions (for regression).
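
The aggregation step described above (mode for classification, mean for regression) can be illustrated as follows. This sketch covers only how the trees' outputs are combined, not how the trees are built; the per-tree outputs are hypothetical values.

```python
from statistics import mean, mode

# Hypothetical outputs of five already-trained trees for one input.
tree_votes = ["spam", "ham", "spam", "spam", "ham"]  # classification
tree_estimates = [210.0, 190.0, 205.0, 195.0, 200.0]  # regression

forest_class = mode(tree_votes)      # most common class among the trees
forest_value = mean(tree_estimates)  # average of the trees' predictions
print(forest_class, forest_value)    # spam 200.0
```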
Gradient Boosting:

 Gradient boosting is a machine learning technique for regression and
classification problems, which produces a prediction model in the form of an
ensemble of weak prediction models, typically decision trees.
 The trees are added one after another, each one correcting the errors of those
before it, so that a collection of weak decision trees results in a powerful model.
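
One way to picture boosting is the small regression sketch below, which assumes squared-error loss so that each new weak learner is simply fitted to the current residuals. The weak learners here are one-split "stumps" rather than full decision trees, and the data, learning rate, and function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Weak learner: one threshold split on x minimising squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.1, 3.0, 5.2, 5.1]  # hypothetical training data

one_round = boost(xs, ys, rounds=1)
many_rounds = boost(xs, ys, rounds=20)
print(mse(one_round, xs, ys), mse(many_rounds, xs, ys))
```

On the training data, the error after many rounds is lower than after one round, which is the sense in which many weak models add up to a stronger one.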

Support Vector Machine:

 SVMs are powerful classifiers that separate a dataset into two classes with the
help of a hyperplane.
 Support vector machines are effective in high-dimensional spaces.
 They are still effective in cases where the number of dimensions is greater than
the number of samples.
Logistic Regression:

 Logistic regression is a statistical model that in its basic form uses a logistic
function to model a binary dependent variable, although more complex extensions exist.
 In regression analysis, logistic regression (or logit regression) is estimating
the parameters of a logistic model (a form of binary regression).
 It makes use of an S-shaped (sigmoid) curve, generated by the logistic function,
to categorize the data into their respective classes.
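
The S-shaped curve mentioned above can be sketched directly. The weight, bias, and 0.5 threshold below are hypothetical values chosen for illustration; in a real model the weight and bias would be estimated from data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, weight, bias, threshold=0.5):
    """Return (class, probability) for a single input value x."""
    probability = sigmoid(weight * x + bias)
    return (1 if probability >= threshold else 0), probability

# Hypothetical, hand-picked parameters (not fitted to any data).
weight, bias = 1.5, -6.0
low = predict_class(2, weight, bias)   # probability well below 0.5
high = predict_class(6, weight, bias)  # probability well above 0.5
print(sigmoid(0))  # 0.5
print(low[0], high[0])  # 0 1
```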
Artificial Neural Networks:

 Artificial neural networks are loosely modeled after the human brain, and they learn
from the data over time. They form the basis of a large area of machine learning called
deep learning.
Unsupervised Learning:

 In the case of an unsupervised learning algorithm, the data is not explicitly labeled into different
classes; that is, there are no labels. The model learns from the data by finding implicit
patterns.
 Unsupervised learning algorithms group the data based on densities, structures, similar
segments, and other similar features. Some unsupervised learning algorithms are based on
Hebbian learning.
 Cluster analysis is one of the most widely used techniques in unsupervised learning.
Let us look at some of the important algorithms that come under unsupervised learning.
Clustering:

Clustering, also known as cluster analysis, is a technique for grouping objects so that
the objects in one group are more similar to each other than to the objects in other groups.
Some of the essential clustering techniques are as follows:

a. K-means:
The aim of the k-means clustering algorithm is to partition the n observations in the data
into k clusters such that each observation belongs to the cluster with the nearest mean.
This mean serves as the prototype of the cluster.
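
A minimal one-dimensional sketch of the k-means loop (assign each point to the nearest mean, then recompute the means) might look like this. The data points, the starting centroids, and the fixed iteration count are assumptions for illustration; real implementations choose starting centroids more carefully and stop when the assignments no longer change.

```python
def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing cluster means."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each point joins the cluster with the nearest mean.
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]  # hypothetical observations
centroids, clusters = k_means(points, centroids=[0.0, 5.0])
print(centroids)  # [1.5, 8.5]
print(clusters)   # [[1.0, 1.5, 2.0], [8.0, 8.5, 9.0]]
```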

b. DBSCAN:
This is a clustering method that groups the data based on density. It groups
together points that are closely packed in the space and marks the points that
lie in low-density regions as outliers.

c. Hierarchical clustering:
In this form of clustering, a hierarchy of clusters is built.

Classification vs. Regression

Classification:
 The target variables are discrete.
 Problems like spam email classification and disease prediction are solved using
classification algorithms.
 We try to find the best possible decision boundary, which can separate the two
classes with the maximum possible separation.
 Evaluation metrics like Precision, Recall, and F1-Score are used to evaluate the
performance of classification algorithms.
 Here we face binary classification or multi-class classification problems.
 Input data are independent variables and a categorical dependent variable.
 The classification algorithm's task is mapping the input value (x) to the discrete
output variable (y).
 Output is categorical labels.
 Objective is to predict categorical/class labels.
 Example use cases are spam detection, image recognition, and sentiment analysis.
 Examples of classification algorithms are: Logistic Regression, Decision Trees,
Random Forest, Support Vector Machines (SVM), K-Nearest Neighbors (K-NN), Naive Bayes,
Neural Networks, Multi-layer Perceptron (MLP), etc.

Regression:
 The target variables are continuous.
 Problems like house price prediction and rainfall prediction are solved using
regression algorithms.
 We try to find the best-fit line, which can represent the overall trend in the data.
 Evaluation metrics like Mean Squared Error, R2-Score, and MAPE are used to evaluate
the performance of regression algorithms.
 Here we face linear regression models as well as non-linear models.
 Input data are independent variables and a continuous dependent variable.
 The regression algorithm's task is mapping the input value (x) to the continuous
output variable (y).
 Output is continuous numerical values.
 Objective is to predict continuous numerical values.
 Example use cases are stock price prediction, house price prediction, and demand
forecasting.
 Examples of regression algorithms are: Linear Regression, Polynomial Regression,
Ridge Regression, Lasso Regression, Support Vector Regression (SVR), Decision Trees for
Regression, Random Forest Regression, K-Nearest Neighbors (K-NN) Regression, Neural
Networks for Regression, etc.

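
For reference, the evaluation metrics named in the comparison can be computed by hand as in the sketch below. The labels and values are made-up examples, and the function names are illustrative rather than taken from any particular library.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Classification metrics for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def mean_squared_error(y_true, y_pred):
    """Regression metric: average squared difference between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical classification results: 2 true positives, 1 false positive, 1 false negative.
labels_true = [1, 0, 1, 1, 0, 0]
labels_pred = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(labels_true, labels_pred)

# Hypothetical regression results: squared errors are 1, 0, and 4.
values_true = [3.0, 5.0, 7.0]
values_pred = [2.0, 5.0, 9.0]
error = mean_squared_error(values_true, values_pred)
print(precision, recall, f1, error)
```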
Thank you……………
