Introduction to Machine Learning Basics

Uploaded by

Amrita P
Copyright
© All Rights Reserved

INTRODUCTION TO MACHINE LEARNING

MODULE I
Artificial intelligence (AI) refers to the development of programs
that behave intelligently and mimic human intelligence through a set
of algorithms.
Machine learning (ML) is a subset of AI that uses algorithms that
learn from data to make predictions. These predictions can be
generated through supervised learning, where algorithms learn
patterns from existing labeled data, or unsupervised learning, where
they discover general patterns in unlabeled data. ML models can
predict numerical values based on historical data, categorize events
as true or false, and cluster data points based on commonalities.
In other words, machine learning is a discipline of AI that gives
machines the ability to learn automatically from data and past
experience, identifying patterns to make predictions with minimal
human intervention.
Deep learning, on the other hand, is a subfield of machine learning
dealing with algorithms based essentially on multi-layered artificial
neural networks (ANN) that are inspired by the structure of the
human brain. Unlike conventional machine learning algorithms, deep
learning algorithms are more complex, non-linear, and hierarchical.
They can learn from enormous amounts of data and often produce highly
accurate results. Language translation, image recognition, and
personalized medicine are some examples of deep learning
applications.
Traditional Programming

Traditional programming is a manual process: a person (the
programmer) writes the program, formulating and coding the rules by
hand.

In machine learning, on the other hand, the algorithm automatically
formulates the rules from the data.

Machine Learning Programming

Unlike traditional programming, machine learning is an automated
process. It can increase the value of your embedded analytics in many
areas, including data prep, natural language interfaces, automatic
outlier detection, recommendations, and causality and significance
detection. All of these features help speed user insights and reduce
decision bias.
For example, if you feed in customer demographics and transactions
as input data and use historical customer churn rates as your output
data, the algorithm will formulate a program that can predict if a
customer will churn or not. That program is called a predictive
model.

You can use this model to predict business outcomes in any situation
where you have input and historical output data:

1. Identify the business question you would like to ask.
2. Identify the historical input.
3. Identify the historically observed output (i.e., data samples for
when the condition is true and for when it’s false).
For instance, if you want to predict who will pay the bills late,
identify the input (customer demographics, bills) and the output (pay
late or not), and let the machine learning use this data to create your
model.
In summary, traditional programming is rule-based and deterministic,
relying on human-crafted logic, whereas machine learning is data-
driven and probabilistic, relying on patterns learned from data.
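The contrast can be sketched in a few lines of Python. This is a toy illustration only: the late-payment history, the single feature (number of previous late payments), and the hand-picked threshold of 5 are all invented for the example.

```python
# Hypothetical history: (number of previous late payments, paid late this time?)
history = [(0, False), (1, False), (2, False), (3, True), (4, True), (6, True)]

def rule_based(prev_late):
    # Traditional programming: a person chooses the rule (threshold 5) by hand.
    return prev_late >= 5

def learn_threshold(samples):
    # Minimal "learning": try each observed value as a threshold and keep
    # the one that classifies the most historical samples correctly.
    best_t, best_correct = None, -1
    for t in sorted({x for x, _ in samples}):
        correct = sum((x >= t) == label for x, label in samples)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

threshold = learn_threshold(history)

def learned_model(prev_late):
    # The rule now comes from the data rather than from the programmer.
    return prev_late >= threshold
```

With this data the learned threshold is 3, so `learned_model(3)` flags a customer that the hand-written rule would miss. A real predictive model would use many features and a proper learning algorithm.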

A machine can learn if it can gain more data to improve its
performance.

How does Machine Learning work

A machine learning system learns from previous data, builds
prediction models, and predicts the output for new data whenever it
receives it. In general, more training data helps build a better
model, which in turn improves the accuracy of the predicted output.

The Machine Learning algorithm's operation is depicted in the
following block diagram:

Classification of Machine Learning

At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
In supervised learning, sample labeled data are provided to the
machine learning system for training, and the system then predicts the
output based on the training data.

The system uses the labeled data to build a model that captures the
relationship between the inputs and their labels. After the training
and processing are done, we test the model with sample data to see if
it can accurately predict the output.

The mapping of the input data to the output data is the objective of
supervised learning.

Supervised learning can be grouped further into two categories of
algorithms:

o Classification
o Regression

Regression is when the variable to predict is numerical, whereas
classification is when the variable to predict is categorical. For
example, regression would use age to predict income, while
classification would use age to predict a category, such as whether a
person makes a specific purchase.

Within supervised learning, various algorithms are used:

• Linear regression
• Logistic regression
• Decision trees
• Random forest
• Gradient boosting
• Artificial neural networks

2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns
without any supervision.

The machine is trained on a set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that
data without any supervision. The goal of unsupervised learning is to
restructure the input data into new features or into groups of
objects with similar patterns.

These algorithms are tasked with finding patterns and relationships
within the data without any prior knowledge of the data’s meaning.
Unsupervised machine learning algorithms find hidden patterns in data
without any human intervention, i.e., we don’t give the expected
output to our model. The model is trained on input values only and
discovers the groups or patterns on its own.

In unsupervised learning, we don't have a predetermined result. The
machine tries to find useful insights from a large amount of data.
Unsupervised learning can be further classified into two categories
of algorithms:

o Clustering
o Association
3) Reinforcement Learning
Reinforcement learning (RL) is a feedback-based learning method in
which a learning agent gets a reward for each right action and a
penalty for each wrong action. The agent learns automatically from
this feedback and improves its performance. In reinforcement
learning, the agent interacts with the environment and explores it.
The goal of the agent is to collect the most reward points, and hence
it improves its performance.

Unlike supervised learning, which relies on a training dataset with
predefined answers, RL involves learning through experience. In RL,
an agent learns to achieve a goal in an uncertain, potentially complex
environment by performing actions and receiving feedback through
rewards or penalties (trial and error).
STEPS IN ML

Machine learning (ML) involves several steps, which help to create a
model that can learn from data and make predictions or decisions.
Here's a general overview of the key steps in the machine learning
process:

1. Define the Problem

• Understand the business or research problem.
• Determine the type of problem (e.g., classification, regression,
clustering).
• Set clear goals for the ML model (e.g., accuracy, performance).
o Accuracy is a specific metric that measures the proportion of
correct predictions made by the model out of all predictions.
o Performance is a broader term that encompasses various aspects
of how well a model works. It often refers to the overall
effectiveness of the model in solving the problem, measured by
multiple metrics rather than accuracy alone (precision, recall,
F1 score, etc.).
2. Collect and Prepare the Data

• Data Collection: Gather relevant data from different sources
(databases, APIs, sensors, etc.).
• Data Cleaning: Handle missing values, remove duplicates,
correct errors, and normalize or standardize the data (convert
features to a common scale).
• Data Exploration: Perform exploratory data analysis (EDA) to
understand the dataset (e.g., summary statistics, visualizations).
• Feature Engineering: Create new features or modify existing
ones to improve model performance.

3. Split the Data

• Divide the data into training, validation, and testing datasets.
o Training Data: Used to train the model.
o Validation Data: Used to tune the model’s
hyperparameters and select the best model.
o Testing Data: Used to evaluate the final model's
performance.
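A minimal sketch of such a split in plain Python follows. The 70/15/15 proportions, the fixed shuffle seed, and the function name `split_data` are assumptions made for illustration; the notes do not prescribe particular ratios.

```python
import random

def split_data(rows, train=0.70, val=0.15, seed=0):
    """Shuffle rows and cut them into training, validation, and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)        # fixed seed for a repeatable split
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    train_part = rows[:n_train]
    val_part = rows[n_train:n_train + n_val]
    test_part = rows[n_train + n_val:]       # everything left over
    return train_part, val_part, test_part

# 100 placeholder records standing in for a real dataset.
train_set, val_set, test_set = split_data(range(100))
```

Shuffling before cutting keeps each part representative of the whole. The test part should stay untouched until the final evaluation.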

4. Select a Model

• Choose an appropriate machine learning algorithm or model for
the problem.
o For supervised learning, models include linear regression,
decision trees, random forests, support vector machines,
etc.
o For unsupervised learning, models like k-means clustering,
hierarchical clustering, and PCA may be chosen.
o For deep learning, neural networks can be considered.

5. Train the Model

• Use the training dataset to fit the model and learn the
relationships between the features and target variables.
• The model parameters are adjusted to minimize error or
optimize a specific objective (e.g., minimize a loss function).
6. Evaluate the Model

• Assess the model's performance using the validation dataset.
• Common evaluation metrics include accuracy, precision, recall,
F1 score, mean squared error (MSE), etc.
• Analyze if the model is underfitting (too simple) or overfitting
(too complex).
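The classification metrics named above can be computed from the counts of true and false positives and negatives. The sketch below assumes binary labels encoded as 1 (positive) and 0 (negative); the labels and predictions are made up for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical validation labels and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Here 8 of 10 predictions are correct, so accuracy is 0.8, while precision, recall, and F1 are each 0.75.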

7. Tune the Model

• Adjust hyperparameters to improve model performance (e.g.,
learning rate, regularization parameters).
• Techniques like grid search, random search, or Bayesian
optimization can be used for hyperparameter tuning.

8. Test the Model

• Evaluate the final model on the test dataset to check its
performance on unseen data.
• Ensure the model generalizes well to new, real-world data.

9. Deploy the Model

• Integrate the trained model into a production environment for
real-time predictions or batch processing.
• Set up monitoring and logging to track the model’s performance
in production.

10. Monitor and Maintain the Model

• Continuously monitor the model’s performance, detect drift in
data or performance, and retrain the model if necessary.
• Update the model periodically as new data becomes available or
as business requirements evolve.

Each of these steps is iterative, and the process may involve revisiting
earlier steps based on insights gained from later ones.
Applications of ML
• Image Recognition
• Speech Recognition (AI-based vehicle commands, Alexa)
• Recommender Systems (a very common example is YouTube, which
recommends new videos and content based on the user’s past
searches)
• Self-Driving Cars
• Online Fraud Detection
• Stock Market Trading
• Spam Detection
• Medical Diagnosis
• Traffic Prediction
• Virtual Personal Assistants (virtual assistance)
Feature Selection Techniques
Feature selection:
Feature selection is a process that chooses a subset of features from
the original features by removing the redundant, irrelevant, or
noisy features so that the feature space is optimally reduced
according to a certain criterion.
While developing a machine learning model, only a few variables in
the dataset are useful for building the model; the remaining features
are either redundant or irrelevant. If we input the dataset with all
these redundant and irrelevant features, it may reduce the overall
performance and accuracy of the model. Hence it is very important to
identify and select the most appropriate features from the data and
remove the irrelevant or less important ones, which is done with the
help of feature selection.
A feature is an attribute that has an impact on a problem or is
useful for the problem, and choosing the important features for the
model is known as feature selection.
Every machine learning process depends on feature engineering, which
mainly consists of two processes:
• Feature Selection
• Feature Extraction
Feature selection picks a subset of the original feature set, whereas
feature extraction creates new features. Feature selection reduces
the number of input variables for the model by using only relevant
data, which helps reduce overfitting. (Overfitting happens when a
model learns too much from the training data, including details that
don’t matter, such as noise or outliers.)
Here we are given a set of features, and we assign a score or weight
to each feature using measures such as entropy, variance, and the
capacity to maintain local similarity. The features with the highest
weights are selected, and the others are removed from the feature
set.
Alternatively, we are given a set of features and generate subsets
of them. Each subset is given to the machine learning algorithm, the
process is repeated for all subsets, and the subset that gives the
maximum performance is taken as the final one.
There are two methods for generating the subsets of features:
forward wrapper methods and backward wrapper methods.
Forward wrapper method
We start with an empty feature set. We then add one feature to the
set, train a model, and measure its performance, repeating this as
features are added.
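The forward wrapper idea can be sketched as a greedy loop. In the sketch below, `toy_score` is a hypothetical stand-in for "train a model on this subset and measure its performance", and the feature names and usefulness values are invented. The stopping rule (stop when no single addition improves the score) is also an assumption.

```python
def forward_selection(features, score):
    """Greedy forward wrapper: add the single best feature each round."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining:
        # Score every candidate subset formed by adding one more feature.
        candidate_score, candidate = max(
            (score(selected + [f]), f) for f in remaining
        )
        if candidate_score <= best_score:
            break  # no single addition improves the score, so stop
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score

# Hypothetical usefulness of each feature (not taken from any real dataset).
usefulness = {"age": 0.30, "income": 0.20, "user_id": 0.01}

def toy_score(subset):
    # Stand-in for model performance: reward useful features and
    # charge a small cost per feature to mimic the harm of irrelevant ones.
    return sum(usefulness[f] for f in subset) - 0.05 * len(subset)

chosen, final_score = forward_selection(usefulness, toy_score)
```

With these made-up numbers the loop keeps `age` and `income` and rejects `user_id`, because adding it lowers the score. In practice the score would come from validating a trained model on each candidate subset.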
Here we are given a set of features and try to identify the
correlation of each feature with the target (e.g., if x increases, y
also increases).
If a feature is highly correlated with the target variable, we retain
it; otherwise it is eliminated.
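A correlation-based selection of this kind can be sketched with the Pearson correlation coefficient. The choice of Pearson correlation, the cutoff of 0.5, and the toy feature values are assumptions made for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical features and target (exam score) for five students.
features = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [8, 7, 9, 7, 8],
}
target = [52, 58, 61, 70, 74]

CUTOFF = 0.5  # assumed threshold for "highly correlated"
selected = [name for name, values in features.items()
            if abs(pearson(values, target)) >= CUTOFF]
```

`hours_studied` rises with the target and is kept, while `shoe_size` shows almost no linear relationship and is dropped. The absolute value is used so that strongly negative correlations are retained as well.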

Some machine learning algorithms perform this kind of selection
themselves, e.g., the decision tree.
While building a decision tree, we start with the feature that has
the highest information gain. (Information gain is a measure used to
determine which feature should be used to split the data at each
internal node of the decision tree.)
At the next level, we again select the feature with the highest
information gain.
So rather than considering all features, we only consider the
features that have the most importance when building the ML model.

Advantages and Disadvantages of Feature Selection

Entropy
Entropy is a concept from information theory that measures the
impurity of the sample values. It is defined by the following
formula:

Entropy(S) = – Σ p(c) log2 p(c), summed over all classes c in S

where:
• S represents the data set for which entropy is calculated
• c represents the classes in set S
• p(c) represents the proportion of data points that belong to class
c out of the total number of data points in set S

For a two-class problem, entropy values fall between 0 and 1. If all
samples in data set S belong to one class, then entropy equals zero.
If half of the samples are in one class and the other half are in
another class, entropy is at its highest, 1. To select the best
feature to split on and find the optimal decision tree, the attribute
whose split leaves the smallest amount of entropy should be used.
Information gain represents the difference in entropy before and
after a split on a given attribute. The attribute with the highest
information gain will produce the best split, as it does the best job
of classifying the training data according to its target
classification. Information gain is usually represented with the
following formula, where a is an attribute and Sv is the subset of S
for which a has the value v:

Gain(S, a) = Entropy(S) – Σ (|Sv| / |S|) Entropy(Sv), summed over all values v of a

Imagine that we have the following arbitrary dataset:


For this dataset, the entropy is 0.94. This can be calculated by finding
the proportion of days where “Play Tennis” is “Yes”, which is 9/14,
and the proportion of days where “Play Tennis” is “No”, which is
5/14. Then, these values can be plugged into the entropy formula
above.

Entropy (Tennis) = -(9/14) log2(9/14) – (5/14) log2 (5/14) = 0.94
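The same calculation can be checked with a short Python sketch. The function takes the number of samples in each class; the name `entropy` and the count-based interface are choices made for this illustration.

```python
from math import log2

def entropy(counts):
    """Entropy (base 2) of a data set, given the sample count per class."""
    total = sum(counts)
    # Classes with zero samples contribute nothing and are skipped.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# 9 days with "Play Tennis" = Yes and 5 days with "Play Tennis" = No.
tennis_entropy = entropy([9, 5])
```

Rounded to two decimals, `tennis_entropy` is 0.94, matching the hand calculation. An even split such as `entropy([7, 7])` gives the two-class maximum of 1.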

We can then compute the information gain for each of the attributes
individually. For example, the information gain for the attribute,
“Humidity” would be the following:

Gain (Tennis, Humidity) = (0.94) – (7/14)*(0.985) – (7/14)*(0.592) = 0.151

As a recap,
- 7/14 represents the proportion of values where humidity equals
“high” to the total number of humidity values. In this case, the
number of values where humidity equals “high” is the same as the
number of values where humidity equals “normal”.

- 0.985 is the entropy when Humidity = “high”

- 0.592 is the entropy when Humidity = “normal”
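The recap can be reproduced in code. The Yes/No counts per humidity value used below (3 Yes and 4 No for "high", 6 Yes and 1 No for "normal") are an assumption: the data table is not reproduced in these notes, and the counts were chosen because they are consistent with the 9/5 overall split and give the entropies 0.985 and 0.592 quoted above.

```python
from math import log2

def entropy(counts):
    """Entropy (base 2) of a data set, given the sample count per class."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Entropy of the parent minus the weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in child_counts)
    return entropy(parent_counts) - weighted

# [Yes, No] counts: whole data set, then Humidity = high and normal.
# The per-value counts are assumed, as noted in the text above.
gain_humidity = information_gain([9, 5], [[3, 4], [6, 1]])
```

Without intermediate rounding the result is about 0.152. The hand calculation gives 0.151 because it starts from entropies already rounded to two or three decimals.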

Then, repeat the calculation for information gain for each attribute in
the table above, and select the attribute with the highest information
gain to be the first split point in the decision tree. In this case, outlook
produces the highest information gain. From there, the process is
repeated for each subtree.

Descriptive Statistics
The first step of any data-related process is the collection of data.
After data collection, data can be sorted, analyzed, and used in
various methods and formats, depending on the project’s needs.
While analyzing a dataset, we use statistical methods to arrive at a
conclusion. Two types of statistical methods are widely used in
data analysis: descriptive and inferential.
Descriptive statistics is a means of describing features of a data set by
yielding summaries about the data samples. It aids in improving data
analysis and identifying the dataset’s trend.
Measures of central tendency, including mean, median, and mode, are
important statistical concepts used in machine learning for several
purposes. They help summarize and understand the data, which can
influence how machine learning models are trained and evaluated.
Types of Descriptive Statistics

There are various dimensions in which this data can be described.
The three main dimensions used for describing data are the central
tendency, dispersion, and the shape of the data.

Descriptive Statistics Based on the Central Tendency of Data

The central tendency of data is the center of the distribution of
data. It describes where the data is located. The three most widely
used measures of the “center” of the data are Mean, Median, and
Mode.

Mean

The “Mean” is the average of the data. The average can be
identified by summing up all the numbers and then dividing the sum
by the number of observations.

Mean = (X1 + X2 + X3 + … + Xn) / n

Example:
Data – 10,20,30,40,50 and Number of observations = 5
Mean = [ 10+20+30+40+50 ] / 5
Mean = 30

The mean may be influenced by outliers. An outlier is a data point
that differs significantly from other observations. It can cause
serious problems in analysis.

Example:

Data – 10,20,30,40,200
Mean = [ 10+20+30+40+200 ] / 5
Mean = 60

Here the single outlier 200 pulls the mean from 30 up to 60.

Solution for the outliers problem: Removing the outliers while
taking averages will give us better results.
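The effect of the outlier can be reproduced with Python's standard `statistics` module. The cutoff of 100 used to drop the outlier is an arbitrary choice for this example; real outlier rules depend on the data.

```python
from statistics import mean

clean = [10, 20, 30, 40, 50]
with_outlier = [10, 20, 30, 40, 200]

mean_clean = mean(clean)              # average of the well-behaved data
mean_outlier = mean(with_outlier)     # the value 200 drags the average up

# Drop values above an assumed cutoff before averaging.
trimmed = [x for x in with_outlier if x <= 100]
mean_trimmed = mean(trimmed)
```

The three results are 30, 60, and 25: one extreme value doubles the mean, and removing it brings the average back in line with the remaining observations.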

Median

It is the 50th percentile of the data. In other words, it is exactly
the center point of the data. The median is found by ordering the
data, splitting it into two equal parts, and then taking the number
in the middle. It is a robust way to find the center of the data.

Note that, unlike the mean, the median is not strongly affected by
outliers.

Example:

Odd number of data – 10,20,30,40,50
Median is 30.
Even number of data – 10,20,30,40,50,60

Find the middle 2 data and take the mean of those two values.
Here, 30 and 40 are middle values.

Now, add them and divide the result by 2
(30 + 40) / 2 = 35
Median is 35
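Both cases can be checked with `statistics.median` from the Python standard library, which sorts the data and averages the two middle values when the count is even. The third list is added here only to illustrate the point about outliers.

```python
from statistics import median

odd_data = [10, 20, 30, 40, 50]
even_data = [10, 20, 30, 40, 50, 60]
with_outlier = [10, 20, 30, 40, 200]

median_odd = median(odd_data)            # middle value of five items
median_even = median(even_data)          # mean of the two middle values
median_outlier = median(with_outlier)    # unchanged by the extreme value 200
```

The results are 30, 35, and 30. Replacing 50 with 200 leaves the median at 30, whereas the mean of the same data jumps to 60.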

Mode

The mode of the data is the most frequently occurring data or
elements in a dataset. If an element occurs the highest number of
times, it is the mode of that data. If no number in the data is
repeated, then that data has no mode. There can be more than one
mode in a dataset if two values have the same frequency, which is
also the highest frequency.

Example:

Data – 1,3,4,6,7,3,3,5,10, 3
Mode is 3, because 3 has the highest frequency (4 times)
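In Python, `statistics.mode` returns a single most frequent value and `statistics.multimode` returns all of them. The second and third lists below are invented to show the multi-mode and no-repeat cases.

```python
from statistics import mode, multimode

data = [1, 3, 4, 6, 7, 3, 3, 5, 10, 3]
single_mode = mode(data)                 # 3 occurs four times

two_modes = multimode([1, 1, 2, 2, 3])   # 1 and 2 tie for the highest count
no_repeats = multimode([4, 5, 6])        # every value occurs exactly once
```

One difference from the definition above: when no value is repeated, `multimode` returns every value (here `[4, 5, 6]`) instead of reporting "no mode", so that case needs to be handled separately if the textbook convention is wanted.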

Descriptive Statistics Based on the Dispersion of Data

The dispersion is the “spread of the data”. It measures how far the
data is spread. In some datasets, the data values are closely
located near the mean. In other datasets, the values spread
widely from the mean. You can measure the dispersion
of data using the Interquartile Range (IQR), range, standard
deviation, and variance.

Measures of variability, also called measures of dispersion, help
quantify the spread or distribution of observations in a dataset.


1. Inter Quartile Range (IQR)

The Interquartile Range (IQR) is the range between the first and
third quartiles, i.e., the difference between the third quartile
(Q3) and the first quartile (Q1).

IQR = Q3 – Q1

Quartiles are special percentiles.
1st Quartile Q1 is the same as the 25th percentile.
2nd Quartile Q2 is the same as the 50th percentile.
3rd Quartile Q3 is the same as the 75th percentile.

Quartiles

Quartiles are values that divide a dataset into four equal parts, which
makes them useful for understanding the spread and central tendency
of the data. There are three quartiles:
• First Quartile (Q1) or Lower Quartile: This is the 25th
percentile of the data. It represents the value below which 25%
of the data fall. It’s the median of the lower half of the dataset.
• Second Quartile (Q2) or Median: This is the 50th percentile of
the data. It represents the middle value of the dataset, splitting
the data into two halves. If the dataset has an odd number of
elements, it’s the middle value; if even, it’s the average of the
two middle values.
• Third Quartile (Q3) or Upper Quartile: This is the 75th
percentile of the data. It represents the value below which 75%
of the data fall, and above which 25% of the data fall. It's the
median of the upper half of the dataset.

Q1 = [(n+1)/4]th item
Q2 = [(n+1)/2]th item
Q3 = [3(n+1)/4]th item

Quartiles Examples
Question 1: Find the quartiles of the following data: 4, 6, 7, 8, 10,
23, 34.
Solution: Here the numbers are arranged in the ascending order and
number of items, n = 7
Lower quartile, Q1 = [(n+1)/4]th item
Q1 = (7+1)/4 = 2nd item = 6
Median, Q2 = [(n+1)/2]th item
Q2 = (7+1)/2 = 4th item = 8
Upper quartile, Q3 = [3(n+1)/4]th item
Q3 = 3(7+1)/4 = 6th item = 23
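The item-position formulas can be turned into a small function. The formulas do not say what to do when a position is fractional (for example, when n = 6), so the linear interpolation between neighboring items used below is an assumption, as is the requirement of at least three values.

```python
def quartiles(data):
    """Q1, Q2, Q3 using the (n+1)/4, (n+1)/2, 3(n+1)/4 item positions."""
    values = sorted(data)
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values for these position formulas")

    def item(position):
        # position is 1-based and may be fractional.
        lower = int(position)
        fraction = position - lower
        if fraction == 0:
            return values[lower - 1]
        # Assumed rule: interpolate linearly between the two nearest items.
        return values[lower - 1] + fraction * (values[lower] - values[lower - 1])

    return item((n + 1) / 4), item((n + 1) / 2), item(3 * (n + 1) / 4)

q1, q2, q3 = quartiles([4, 6, 7, 8, 10, 23, 34])
iqr = q3 - q1
```

For the data in Question 1 this returns 6, 8, and 23, so the interquartile range is 17.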

The Interquartile Range for this data is therefore

IQR = Q3 – Q1 = 23 – 6 = 17

Percentile

In statistics, a percentile is a term that describes how a score
compares to other scores from the same set. It is expressed as the
percentage of values in a set of data scores that fall below a given
value.

That is, it is a measure of the position of a particular data point.
When the data points are divided into 100 equal parts, there are 99
percentiles.

Example 1: The scores obtained by 10 students are 38, 47, 49, 58, 60,
65, 70, 79, 80, 92. Using the percentile formula, calculate the
percentile for score 70?

Solution:

Given:
Scores obtained by students are 38, 47, 49, 58, 60, 65, 70, 79, 80, 92

Number of scores below 70 = 6

Using the percentile formula,

Percentile = (Number of Values Below “x” / Total Number of Values) × 100

Percentile of 70

= (6/10) × 100

= 0.6 × 100 = 60

Therefore, the percentile for score 70 = 60%

Example 2: The weights of 10 people were recorded in kg as 35, 41,
42, 56, 58, 62, 70, 71, 90, 77. Find the percentile for the weight
58 kg.

Solution:

Given:

Weight of the people are 35, 41, 42, 56, 58, 62, 70, 71, 77, 90

Number of people with weight below 58 kg = 4

Using the formula for percentile,

Percentile = (Number of Values Below “x” / Total Number of Values) × 100

Percentile for weight 58 kg

= (4/10) × 100

= 0.4 × 100 = 40%

Therefore, the percentile for weight 58 kg = 40%


Example 3: In a college, a list of scores of 10 students is announced.
The scores are 56, 45, 69, 78, 72, 94, 82, 80, 63, 59. Using the
percentile formula, find the 70th percentile.

Solution: Arrange the data in ascending order - 45, 56, 59, 63, 69, 72,
78, 80, 82, 94

Find the rank,

Rank = Percentile ÷ 100

Rank = 70 ÷ 100 = 0.7

So, the rank is 0.7

Using the rank to find the position in the ordered data,

Position = Rank × Total number of values in the data set

Position = 0.7 × 10

Position = 7

Now, counting past the first 7 values from left to right we reach 80,
so exactly 7 of the 10 values lie below 80. In other words, 70% of
the values are below 80.

Therefore, the 70th percentile is 80.
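The three examples can be reproduced with two small functions. Both follow the "values below" definition used in these notes; other percentile conventions exist and give slightly different answers. The function names are chosen for this illustration, and `value_at_percentile` assumes distinct values and a percentile for which p × n / 100 is a whole number.

```python
def percentile_rank(values, x):
    """Percentage of values strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

def value_at_percentile(values, p):
    """Value in sorted order that has p% of the values below it."""
    ordered = sorted(values)
    position = p * len(ordered) // 100          # how many values must lie below
    position = min(position, len(ordered) - 1)  # stay inside the list
    return ordered[position]

scores = [38, 47, 49, 58, 60, 65, 70, 79, 80, 92]      # Example 1
weights = [35, 41, 42, 56, 58, 62, 70, 71, 90, 77]     # Example 2
college = [56, 45, 69, 78, 72, 94, 82, 80, 63, 59]     # Example 3

rank_70 = percentile_rank(scores, 70)
rank_58kg = percentile_rank(weights, 58)
p70_value = value_at_percentile(college, 70)
```

These give 60 for the score 70, 40 for the weight 58 kg, and 80 as the 70th percentile of the college scores, in agreement with the worked solutions.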

2. Range

The range describes the difference between the largest and smallest
data points in our data set.

Range = Largest data value – smallest data value

Example: range of visits to the library in the past year
Ordered data set: 0, 3, 3, 12, 15, 24
Range: 24 – 0 = 24
3. Standard deviation
The standard deviation (s or SD) is the average amount of variability
in your dataset. It tells you, on average, how far each score lies from
the mean. The larger the standard deviation, the more variable the
data set is.

There are six steps for finding the standard deviation:

1. List each score and find their mean.
2. Subtract the mean from each score to get the deviation from the
mean.
3. Square each of these deviations.
4. Add up all of the squared deviations.
5. Divide the sum of the squared deviations by N – 1.
6. Find the square root of the number you found.

For example, for the library visits data set (15, 3, 12, 0, 24, 3)
the steps give s = 9.18, so you can say that on average, each score
deviates from the mean by 9.18 points.
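The six steps translate directly into code. The sketch below applies them to the library visits data set from the variance example (15, 3, 12, 0, 24, 3); the function name `sample_std` is a choice made for this illustration.

```python
from math import sqrt

def sample_std(scores):
    """Sample standard deviation, following the six steps one by one."""
    n = len(scores)
    mean = sum(scores) / n                        # step 1: find the mean
    deviations = [x - mean for x in scores]       # step 2: deviation from mean
    squared = [d ** 2 for d in deviations]        # step 3: square each deviation
    total = sum(squared)                          # step 4: add them up
    variance = total / (n - 1)                    # step 5: divide by N - 1
    return sqrt(variance)                         # step 6: take the square root

visits = [15, 3, 12, 0, 24, 3]
s = sample_std(visits)
```

The mean is 9.5, the squared deviations sum to 421.5, and dividing by 5 gives 84.3, so `s` rounds to 9.18. The standard library function `statistics.stdev` uses the same N – 1 convention.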

4. Variance
The variance is the average of squared deviations from the mean.
Variance reflects the degree of spread in the data set. The more spread
the data, the larger the variance is in relation to the mean.

To find the variance, simply square the standard deviation. The
symbol for variance is s².

Variance of visits to the library in the past year
Data set: 15, 3, 12, 0, 24, 3
s = 9.18

s² = 84.3

In ML, measures of central tendency are used for the following:

b. If there is a significant difference between the mean and
median, it could indicate the presence of skewed data or
outliers.
c. When you have missing values in your dataset, you might
replace missing data with the mean or median (for
numerical data) or the mode (for categorical data). This
helps preserve the overall distribution of the data.
d. The mean is used in normalization or standardization to
scale features in a dataset. Subtracting the mean from each
data point and dividing by the standard deviation ensures
that the data is centered around zero and has a consistent
scale.
e. Measures of central tendency can be used to create new
features that capture the central value of different subsets
of the data. For instance, you might create features like
“average purchase amount per customer” or “average
temperature for a region”.
f. Assessing Model Performance: Measures like mean
squared error (MSE) or mean absolute error (MAE)
are themselves means (of the squared or absolute errors) and
help in evaluating how well a model fits the data.
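Points c and d can be sketched together: fill a missing value with the median, then standardize the feature. The ages below are invented, `None` is used to mark the missing entry, and the sample standard deviation (N – 1) is assumed for scaling.

```python
from statistics import mean, median, stdev

# Hypothetical numerical feature with one missing value.
ages = [22, 25, None, 31, 28]

# Point c: replace the missing entry with the median of the known values.
known = [a for a in ages if a is not None]
fill_value = median(known)
imputed = [fill_value if a is None else a for a in ages]

# Point d: subtract the mean and divide by the standard deviation.
mu, sigma = mean(imputed), stdev(imputed)
standardized = [(a - mu) / sigma for a in imputed]
```

The missing age is filled with 26.5, and after scaling the feature has mean 0 and standard deviation 1. For a categorical column the same pattern would use the mode as the fill value.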
MODULE II
Regression
Regression in machine learning refers to a supervised learning
technique where the goal is to predict a continuous numerical
value based on one or more independent features. It finds
relationships between variables so that predictions can be made.
We have two types of variables in regression:

Dependent Variable (Target): The variable we are trying to
predict, e.g., house price.
Independent Variables (Features): The input variables that
influence the prediction, e.g., locality, number of rooms.
Variables: A variable is any characteristic, number, or
quantity that can be measured or counted.

Types of variables:

1. Numerical variables.

2. Categorical Variables.

3. Mixed Variables.

1. Numerical Variables:

Numerical variables store numerical values. They are further divided
into 2 categories based on the type of numerical values stored.

• Continuous Variable: This variable stores continuous
numerical values, like Salary ($10,000), Height (5.8 feet),
Price ($10.50).
• Discrete Variable: This variable stores whole numbers or
counts. It does not store floating-point numbers. Examples:
Number of apples, Number of items.

2. Categorical Variables:

Categorical variables store categorical or string values. They are
further divided into 3 categories.

• Ordinal Variable: The values stored have some order.
Examples: Grades (A, B, C) have an order associated with them,
A>B>C. Size (S, M, L): S<M<L.
• Nominal Variable: The values have no inherent order.
Example: City (Mumbai, Delhi, Pune)
• Date Time Variable: This variable stores a date only, a time
only, or both date and time.

3. Mixed Variables:

This variable stores data which is a combination of both
numeric and categorical values. Example: Seat
Number (A10), Postal Code (XX123).
Regression analysis applies when the output variable is a
real or continuous value, such as “salary” or “weight”.
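As a minimal illustration of predicting a continuous value, the sketch below fits a straight line with ordinary least squares. Both the choice of this fitting method and the data (number of rooms as the independent variable, house price in thousands as the dependent variable) are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: rooms (feature) and price in thousands (target).
rooms = [1, 2, 3, 4]
price = [50, 70, 90, 110]

slope, intercept = fit_line(rooms, price)
predicted_price = slope * 5 + intercept   # prediction for a 5-room house
```

The fitted line is price = 20 × rooms + 30, so a 5-room house is predicted at 130 (thousand). Real data will not fall exactly on a line, and the fit then minimizes the squared prediction errors.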
