Exploratory Data Analysis Techniques

Unit 2 notes of Data Science.

Uploaded by

send2masir
Copyright
© © All Rights Reserved

Chapter 1: Data Analysis

Exploratory Data Analysis (EDA)


Exploratory Data Analysis (EDA) examines and visualizes data to understand its main
characteristics, identify patterns, spot anomalies, and test hypotheses. It helps summarize the data
and uncover insights before applying more advanced data analysis techniques.

Steps Involved in Exploratory Data Analysis


1. Understand the Data
Familiarize yourself with the data set, understand the domain, and identify the objectives of the
analysis.
2. Data Collection
Collect the required data from various sources such as databases, web scraping, or APIs.
3. Data Cleaning
●​ Handle missing values: Impute or remove missing data.
●​ Remove duplicates: Ensure there are no duplicate records.
●​ Correct data types: Convert data types to appropriate formats.
●​ Fix errors: Address any inconsistencies or errors in the data.
4. Data Transformation
●​ Normalize or standardize the data if necessary.
●​ Create new features through feature engineering.
●​ Aggregate or disaggregate data based on analysis needs.
5. Data Integration
Integrate data from various sources to create a complete data set.
6. Data Exploration
●​ Univariate Analysis: Analyze individual variables using summary statistics and
visualizations (e.g., histograms, box plots).
●​ Bivariate Analysis: Analyze the relationship between two variables with scatter plots,
correlation coefficients, and cross-tabulations.
●​ Multivariate Analysis: Investigate interactions between multiple variables using pair plots
and correlation matrices.
7. Data Visualization
Visualize data distributions and relationships using visual tools such as bar charts, line charts,
scatter plots, heatmaps, and box plots.
8. Descriptive Statistics
Calculate central tendency measures (mean, median, mode) and dispersion measures (range,
variance, standard deviation).
9. Identify Patterns and Outliers
Detect patterns, trends, and outliers in the data using visualizations and statistical methods.
10. Hypothesis Testing
Formulate and test hypotheses using statistical tests (e.g., t-tests, chi-square tests) to validate
assumptions or relationships in the data.
11. Data Summarization
Summarize findings with descriptive statistics, visualizations, and key insights.
12. Documentation and Reporting
Document the EDA process, findings, and insights in a clear, structured way.
Create reports and presentations to convey results to stakeholders.
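
As a rough illustration of steps 3 and 8, the sketch below cleans a tiny data set and summarizes it using only Python's standard library. The records, the column meaning, and the choice of median imputation are assumptions made for the example, not part of the notes.

```python
from statistics import mean, median, stdev

# Hypothetical raw records: (customer_id, age); None marks a missing age.
raw = [(1, 34), (2, None), (3, 29), (3, 29), (4, 41), (5, 38)]

# Step 3 (remove duplicates): drop repeated records while preserving order.
deduped = list(dict.fromkeys(raw))

# Step 3 (handle missing values): impute with the median of the observed ages.
observed = [age for _, age in deduped if age is not None]
fill = median(observed)
clean = [(cid, age if age is not None else fill) for cid, age in deduped]

# Step 8 (descriptive statistics): central tendency and dispersion.
ages = [age for _, age in clean]
summary = {
    "n": len(ages),
    "mean": mean(ages),
    "median": median(ages),
    "stdev": stdev(ages),
    "range": max(ages) - min(ages),
}
print(summary)
```

In practice a data-frame library would usually handle these steps; the plain-Python version is only meant to make each step visible.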

EDA tools
Specific statistical functions and techniques you can perform with EDA tools include:
●​ Clustering and dimension reduction techniques, which help create graphical displays of
high-dimensional data containing many variables.​

●​ Univariate visualization of each field in the raw dataset, with summary statistics.​

●​ Bivariate visualizations and summary statistics that allow you to assess the relationship
between each variable in the dataset and the target variable you’re looking at.​

● Multivariate visualizations, for mapping and understanding interactions between different
fields in the data.

● K-means clustering, an unsupervised learning method in which data points are assigned to
K groups (K being the number of clusters) based on their distance from each group's
centroid. The data points closest to a particular centroid are clustered under the same
category. K-means clustering is commonly used in market segmentation, pattern
recognition, and image compression.

●​ Predictive models, such as linear regression, use statistics and data to predict outcomes.
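
The K-means idea described in the list above can be sketched in a few lines. This is a minimal version of the usual assign-then-update loop, assuming 2-D points and hand-picked starting centroids; the data and the function name `kmeans` are hypothetical.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visually separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

Real implementations typically choose starting centroids randomly or with a seeding heuristic and run several restarts; fixed starting points are used here only to keep the example deterministic.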
Types of EDA
There are four primary types of EDA:
●​ Univariate non-graphical
●​ Univariate graphical
●​ Multivariate non-graphical
●​ Multivariate graphical
Univariate non-graphical
This is the simplest form of data analysis, where the data being analyzed consists of just one
variable. Since it’s a single variable, it doesn’t deal with causes or relationships. The main
purpose of univariate analysis is to describe the data and find patterns that exist within it. This
can include measures of central tendency (like the mean or median), measures of spread (like the
range or standard deviation), and measures of shape (like skewness or kurtosis).
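
A small sketch of these non-graphical summaries, assuming a made-up sample and the simple moment-based (population) formulas for skewness and kurtosis; other software may apply bias corrections and report slightly different values.

```python
from statistics import mean, median, pstdev

values = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample

n = len(values)
m = mean(values)                     # central tendency
med = median(values)
spread = max(values) - min(values)   # range
sd = pstdev(values)                  # population standard deviation

# Shape: moment-based skewness and kurtosis (no bias correction).
skewness = sum((x - m) ** 3 for x in values) / (n * sd ** 3)
kurtosis = sum((x - m) ** 4 for x in values) / (n * sd ** 4)  # 3.0 for a normal distribution

print(m, med, spread, round(sd, 3), round(skewness, 3), round(kurtosis, 3))
```

A positive skewness here reflects the longer tail on the right-hand side of the sample.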

Univariate graphical
Non-graphical methods don’t provide a full picture of the data. Graphical methods are therefore
required. Common types of univariate graphics include:
●​ Stem-and-leaf plots, which show all data values and the shape of the distribution.

●​ Histograms, a bar plot in which each bar represents the frequency (count) or proportion
(count/total count) of cases for a range of values.

●​ Box plots, which graphically depict the five-number summary of minimum, first quartile,
median, third quartile, and maximum.
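
The numbers behind a box plot and a histogram can be computed without any plotting library. The sketch below assumes made-up observations, a bin width of 5, and the "inclusive" quartile method of Python's `statistics.quantiles`; quartile conventions differ between tools, so other software may give slightly different quartiles.

```python
from collections import Counter
from statistics import quantiles

values = [3, 5, 7, 8, 12, 13, 14, 18, 21]  # hypothetical observations

# Five-number summary, as drawn by a box plot.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
five_number = (min(values), q1, q2, q3, max(values))

# Histogram counts with a bin width of 5 (each bin is labeled by its lower edge).
bin_width = 5
bins = Counter((v // bin_width) * bin_width for v in values)

print(five_number)
for lower in sorted(bins):
    # A crude text histogram: one '#' per observation in the bin.
    print(f"{lower:>2}-{lower + bin_width - 1:<2} {'#' * bins[lower]}")
```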
Multivariate non-graphical
Multivariate data arises from more than one variable. Multivariate non-graphical EDA
techniques generally show the relationship between two or more variables of the data through
cross-tabulation or statistics.
Examples include calculating correlation and covariance matrices to understand how variables
change together, or using techniques like cross-tabulation to examine relationships between
categorical variables.
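
To make these measures concrete, here is a minimal sketch that computes sample covariance, Pearson correlation, and a cross-tabulation by hand. The variables (`hours`, `score`, `method`, `result`) and their values are invented for illustration.

```python
from collections import Counter
from statistics import mean

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

# Two hypothetical numeric variables.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 68, 74]
cov = covariance(hours, score)
r = correlation(hours, score)

# Two hypothetical categorical variables, cross-tabulated by counting pairs.
method = ["lecture", "lecture", "online", "online", "online", "lecture"]
result = ["pass", "fail", "pass", "pass", "fail", "pass"]
crosstab = Counter(zip(method, result))

print(cov, round(r, 4))
print(dict(crosstab))
```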

Multivariate graphical
Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The
most used graphic is a grouped bar plot or bar chart with each group representing one level of
one of the variables and each bar within a group representing the levels of the other variable.
Other common types of multivariate graphics include:
●​ Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show
how much one variable is affected by another.

● Multivariate chart, which is a graphical representation of the relationships between
factors and a response.

● Run chart, which is a line graph of data plotted over time.

● Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a
two-dimensional plot.

● Heat map, which is a graphical representation of data where values are depicted by color.

Exploratory data analysis languages
Some of the most common data science programming languages used to perform EDA include:
● Python: An interpreted, object-oriented programming language with dynamic semantics.
Its high-level, built-in data structures, combined with dynamic typing and dynamic
binding, make it very attractive for rapid application development, as well as for use as a
scripting or glue language to connect existing components together. Python can be used
in EDA to identify missing values in a data set, which is important so you can
decide how to handle missing values for machine learning.

● R: An open-source programming language and free software environment for statistical
computing and graphics, supported by the R Foundation for Statistical Computing. The R
language is widely used by statisticians and data scientists for statistical analysis and
data analysis.

Techniques to analyze data

What is Hypothesis Testing?


Defining Hypotheses
Hypothesis testing is a statistical method that allows us to evaluate assumptions about a
population based on sample data. It involves two key hypotheses:
1.​ Null Hypothesis (H₀): Represents the default assumption or status quo. It assumes
no significant effect or relationship exists in the data.
2.​ Alternative Hypothesis (H₁): Represents the claim or effect we aim to test. It
contradicts the null hypothesis, suggesting there is a significant effect or
relationship.
Key Terms in Hypothesis Testing
●​ P-value: The probability of obtaining results at least as extreme as the observed
results, assuming the null hypothesis is true. A smaller p-value indicates stronger
evidence against H₀.
●​ Significance Level (α): A threshold (commonly 0.05) for deciding whether to
reject H₀. If p-value < α, H₀ is rejected.
●​ Test Statistic: A value calculated from the sample data that helps determine
whether to reject H₀. Examples include t-statistics and z-scores.
●​ Power of the Test: The probability of correctly rejecting H₀ when H₁ is true. A
higher power indicates a more reliable test.
Why do we use Hypothesis Testing?
Hypothesis testing is a fundamental tool in data science and statistics that helps make data-driven
decisions by objectively evaluating assumptions. Here’s why it is essential:
1. Objective Decision-Making
Hypothesis testing provides a structured and mathematical approach to validate claims. Instead
of relying on intuition or guesswork, decisions are based on evidence derived from data.
2. Real-World Applications
●​ Medicine: Determining the effectiveness of a new drug compared to an existing
treatment.
●​ Business: Evaluating the impact of a marketing campaign on sales.
●​ Social Sciences: Investigating behavioral trends or societal changes.
3. Reducing Uncertainty
Data often contains random variations. Hypothesis testing helps separate significant effects from
random noise, leading to more reliable conclusions.
4. Model Evaluation in Data Science
In machine learning, hypothesis testing is used to compare models, select features, and validate
assumptions about data distributions. It ensures that data-driven models are built on robust
statistical principles.

One-Tailed and Two-Tailed Tests


Definition and Differences
Hypothesis testing can be classified into two types based on the direction of the test:
1.​ One-Tailed Test
●​ Used when the research hypothesis specifies a direction of the effect or
relationship (greater than or less than).
●​ Example: Testing if a new marketing strategy increases sales compared
to the old strategy.
2.​ Two-Tailed Test
●​ Used when the research hypothesis does not specify a direction,
focusing only on whether a difference exists.
●​ Example: Testing if a new teaching method has any impact (positive or
negative) on student performance compared to the old method.
Key Differences
●​ A one-tailed test is directional and checks for effects in a specific direction (e.g.,
greater than).
●​ A two-tailed test is non-directional and checks for any significant difference,
regardless of the direction.
Visual Illustration
●​ One-Tailed Test: The rejection region is on one end of the distribution curve.
●​ Two-Tailed Test: The rejection regions are on both ends of the distribution curve.
When to Use Each Test
●​ Use a one-tailed test when prior research or domain knowledge suggests a
specific direction.
●​ Use a two-tailed test when the direction of the effect is uncertain or both
directions are important to investigate.
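
The difference between the two kinds of test shows up directly in the p-value. The sketch below assumes a z-test with a made-up test statistic of 1.8 and uses the standard normal distribution from Python's `statistics` module.

```python
from statistics import NormalDist

z = 1.8        # hypothetical test statistic from a z-test
alpha = 0.05
std_normal = NormalDist()  # mean 0, standard deviation 1

# One-tailed (H1: the effect is greater than under H0): area in the upper tail only.
p_one_tailed = 1 - std_normal.cdf(z)

# Two-tailed (H1: any difference): area in both tails beyond |z|.
p_two_tailed = 2 * (1 - std_normal.cdf(abs(z)))

print(round(p_one_tailed, 4), p_one_tailed < alpha)
print(round(p_two_tailed, 4), p_two_tailed < alpha)
```

With this statistic the one-tailed test rejects H₀ at α = 0.05 while the two-tailed test does not, which is why the direction of the test should be chosen before looking at the data.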
What are Type I and Type II Errors in Hypothesis Testing?
Type I Error (False Positive)
●​ A Type I error occurs when the null hypothesis (H₀) is rejected even though it is
true.
●​ Example: A medical test wrongly concludes that a patient has a disease when they
do not.
●​ Implication: This error leads to false alarms and incorrect conclusions about the
presence of an effect or relationship.
Type II Error (False Negative)
●​ A Type II error occurs when the null hypothesis (H₀) is not rejected even though it
is false.
●​ Example: A medical test fails to detect a disease that the patient actually has.
●​ Implication: This error can cause missed opportunities to identify significant
effects or relationships.
Key Differences
●​ Type I Error: Mistakenly concluding there is an effect when none exists.
●​ Type II Error: Failing to detect an effect when one exists.
Balancing the Two Errors
●​ Reducing one type of error often increases the other.
●​ The choice of significance level (α) impacts the likelihood of each error:
●​ Lower α reduces Type I error but increases Type II error.
●​ Higher α reduces Type II error but increases Type I error.
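
The trade-off can be made concrete for a one-tailed z-test. The sketch below assumes the true mean lies a fixed number of standard errors above the null value (the `shift` parameter, a made-up quantity for this example) and computes the Type II error rate β for several choices of α.

```python
from statistics import NormalDist

def type_ii_error(alpha, shift):
    """Beta for a one-tailed z-test when the true mean is `shift` standard errors above H0."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha)  # reject H0 when z exceeds this value
    # Under H1 the statistic is centered at `shift`, so beta = P(z <= critical).
    return std_normal.cdf(critical - shift)

for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error(alpha, shift=2.0)
    print(f"alpha={alpha:.2f}  beta={beta:.3f}  power={1 - beta:.3f}")
```

For a fixed sample size and effect, β rises as α is lowered; collecting more data (a larger `shift` in standard-error units) is the usual way to reduce both.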
Real-World Importance
●​ In medicine, minimizing Type I errors is crucial for ensuring patient safety.
●​ In business, minimizing Type II errors can help identify impactful strategies.
How does Hypothesis Testing work?
Hypothesis testing involves a systematic process to evaluate assumptions about a dataset. Here’s
a step-by-step breakdown:
Step 1: Define Null and Alternative Hypotheses
●​ Formulate the null hypothesis (H₀) and alternative hypothesis (H₁) based on the
research question.
●​ Example: H₀: The new teaching method has no effect on student
performance.
●​ H₁: The new teaching method improves student performance.
Step 2: Choose the Significance Level (α)
●​ Select the significance level (commonly 0.05 or 5%), which determines the
threshold for rejecting H₀.
●​ Lower α values make the test more stringent, reducing the likelihood of
a Type I error.
Step 3: Collect and Analyze Data
●​ Gather a random and representative sample from the population.
●​ Perform preliminary analysis to clean and summarize the data.
Step 4: Calculate the Test Statistic
●​ Compute the test statistic (e.g., t-statistic, z-statistic) based on the selected
hypothesis test.
●​ The test statistic quantifies how far the sample data deviates from H₀.
Step 5: Compare Test Statistic to Critical Value or P-Value
●​ Compare the test statistic to a critical value or calculate the p-value:
●​ If p-value < α: Reject H₀.
●​ If p-value ≥ α: Do not reject H₀.
Step 6: Interpret the Results
●​ Determine whether the results support the alternative hypothesis (H₁).
●​ Example: If p-value = 0.03 and α = 0.05, reject H₀ and conclude the
new teaching method improves student performance.
●​ Consider both statistical significance (p-value) and practical significance (effect
size and real-world impact).
Calculating Test Statistics
Calculating test statistics is a crucial step in hypothesis testing, as it determines whether the null
hypothesis should be rejected. Different types of tests are used depending on the nature of the
data and the research question.
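1. Z-Statistics
When to Use: Z-tests are used for large sample sizes (n > 30) with a known population variance.
Formula:
$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$
Where:
● $\bar{x}$ : Sample mean
● $\mu$ : Population mean under H₀
● $\sigma$ : Population standard deviation
● $n$ : Sample size
2. T-Statistics
When to Use: T-tests are used for small sample sizes (n ≤ 30) or when the population variance is
unknown.
Formula:
$$ t = \frac{\bar{x} - \mu}{S / \sqrt{n}} $$
Where:
● $\bar{x}$ : Sample mean
● $\mu$ : Population mean under H₀
● $S$ : Sample standard deviation
● $n$ : Sample size
3. Chi-Square Test
When to Use: Chi-square tests are used for categorical data to test the relationship between
variables or the goodness of fit.
Formula:
$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$
Where:
● $O$ : Observed frequency
● $E$ : Expected frequency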
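
The three statistics named above can be computed directly from their standard textbook definitions. The sketch below uses one-sample forms of the z and t statistics and a goodness-of-fit form of chi-square; the function names and all data values are hypothetical.

```python
import math
from statistics import mean, stdev

def z_statistic(sample, mu0, sigma):
    """z = (sample mean - mu0) / (sigma / sqrt(n)), with known population sigma."""
    return (mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))

def t_statistic(sample, mu0):
    """t = (sample mean - mu0) / (S / sqrt(n)), with S estimated from the sample."""
    return (mean(sample) - mu0) / (stdev(sample) / math.sqrt(len(sample)))

def chi_square_statistic(observed, expected):
    """chi-square = sum over categories of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

sample = [52, 48, 51, 49, 50, 53, 47, 50]   # hypothetical measurements
z = z_statistic(sample, mu0=49, sigma=2)     # sigma=2 is an assumed known value
t = t_statistic(sample, mu0=49)
chi2 = chi_square_statistic(observed=[18, 22, 20], expected=[20, 20, 20])

print(round(z, 4), round(t, 4), round(chi2, 4))
```

Turning these statistics into p-values requires the matching reference distribution (normal, t, or chi-square with the right degrees of freedom), which statistical libraries provide.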
Real-Life Examples of Hypothesis Testing
1. Healthcare: Testing the Effectiveness of a New Drug
A pharmaceutical company wants to determine if a new drug reduces blood pressure more
effectively than an existing drug.
●​ Hypotheses:
H₀: The new drug has the same effect on blood pressure as the existing drug.
H₁: The new drug reduces blood pressure more effectively.
●​ Method:
●​ Collect blood pressure data from two groups: one taking the new drug
and the other taking the existing drug.
●​ Perform a two-sample t-test to compare the means of the two groups.
●​ Result:​
If the p-value < 0.05, reject H₀ and conclude that the new drug is more effective.
2. Business: Evaluating a Marketing Campaign
A company launches a new marketing campaign and wants to know if it increases sales
compared to the previous quarter.
●​ Hypotheses:
H₀: The new marketing campaign does not increase sales.
H₁: The new marketing campaign increases sales.
●​ Method:
●​ Analyze sales data before and after the campaign.
●​ Perform a one-tailed z-test if the sample size is large.
●​ Result:​
If the test statistic exceeds the critical value or the p-value < 0.05, reject H₀ and
conclude that the campaign has boosted sales.
Limitations of Hypothesis Testing
While hypothesis testing is a powerful tool for data analysis, it has certain limitations that users
must consider to avoid misinterpretation or misuse.
1. Misuse of P-Values
● A p-value is the probability of observing data at least as extreme as the sample,
assuming the null hypothesis is true. It does not measure the size or importance
of an effect.
●​ Misinterpreting a small p-value as proof of practical significance can lead to
erroneous conclusions.
2. Susceptibility to P-Hacking
●​ Researchers might deliberately manipulate data or perform repeated tests to obtain
significant results (a practice known as p-hacking).
●​ This undermines the integrity of the analysis and increases the risk of Type I
errors.
3. Dependency on Sample Size
●​ Small sample sizes can result in unreliable conclusions due to insufficient
statistical power.
● Conversely, very large samples may flag practically negligible effects as statistically
significant.
4. Simplistic Binary Decision-Making
● Hypothesis testing often reduces conclusions to a binary decision (reject or fail to
reject H₀), which oversimplifies the nuanced nature of real-world data.
●​ Focusing solely on statistical significance can overlook practical relevance.

What is an ANOVA Test?


ANOVA stands for Analysis of Variance, a statistical test used to compare the means of
three or more groups. It analyzes the variance within groups and between groups.
The primary objective is to assess whether the variance between group means is large
relative to the variance within the groups. If it is, that suggests the differences
between the group means are unlikely to be due to chance alone.
Mathematically, ANOVA breaks down the total variability in the data into two
components:
●​ Within-Group Variability: Variability caused by differences within individual
groups, reflecting random fluctuations.
●​ Between-Group Variability: Variability caused by differences between the
means of the different groups.
Figure: F-statistic used to compute ANOVA.

The test produces an F-statistic, which shows the ratio between between-group and
within-group variability. If the F-statistic is sufficiently large, it indicates that at least one
of the group means is significantly different from the others.
To understand this better, consider a scenario where you are asked to assess
students' performance (exam scores) under three teaching methods: lecture,
interactive workshop, and online learning. ANOVA can help us assess whether the
teaching method has a statistically significant effect on exam performance.
The Two Types of ANOVA Test
There are two types of ANOVA: one-way and two-way. Depending on the number of
independent variables and how they interact with each other, both are used in different
scenarios.
1. One-way ANOVA
A one-way ANOVA test is used when there is one independent variable with two or
more groups. The objective is to determine whether a significant difference exists
between the means of different groups.
In our example, we can use one-way ANOVA to compare the effectiveness of the three
different teaching methods (lecture, workshop, and online learning) on student exam
scores. The teaching method is the independent variable with three groups, and the
exam score is the dependent variable.
●​ Null Hypothesis (H₀): The mean exam scores of students across the three
teaching methods are equal (no difference in means).
●​ Alternative Hypothesis (H₁): At least one group’s mean significantly differs.
Figure: Comparison of the null and alternative hypotheses.

The one-way ANOVA test will tell us if the variation in student exam scores can be
attributed to the differences in teaching methods or if it’s likely due to random chance.
One-way ANOVA is effective when analyzing the impact of a single factor across
multiple groups, which makes it simple to interpret. However, it does not account for
interaction between multiple independent variables; that is where two-way
ANOVA becomes necessary.
2. Two-way ANOVA
Two-way ANOVA is used when there are two independent variables, each with two or
more groups. The objective is to analyze how both independent variables influence the
dependent variable.
Let’s assume you are interested in the relationship between teaching methods and
study techniques and how they jointly affect student performance. The two-way ANOVA
is suitable for this scenario. Here we test three hypotheses:
●​ The main effect of factor 1 (teaching method): Does the teaching method
influence student exam scores?
●​ The main effect of factor 2 (study technique): Does the study technique affect
exam scores?
●​ Interaction effect: Does the effectiveness of the teaching method depend on the
study technique used?
For example, two-way ANOVA could reveal that students using the lecture method
perform better in group study, and those using online learning might perform better in
individual study. Understanding these interactions gives a deeper insight into how
different factors together impact outcomes.
ANOVA vs. T-Test
You might be wondering: When should I choose an ANOVA over a t-test? The t-test and
ANOVA are used to compare means between groups, but the choice between them
depends on the number of groups being compared and the complexity of the data
structure.
When to use a T-Test
A t-test is appropriate when comparing the means of two groups. For instance, if we
wanted to compare the exam scores of students using just two teaching
methods — lecture and workshop — a t-test would be enough. There are two types of
t-tests:
●​ Independent T-Test: Compares two independent groups (e.g., lecture vs.
workshop).
●​ Paired T-Test: Compares means from the same group at different times (e.g.,
student performance before and after using a particular teaching method).
When to use ANOVA
On the other hand, ANOVA is used when comparing the means of three or more
groups. Our study includes three teaching methods (lecture, workshop, and online
learning), so something more than a t-test is required. Using multiple t-tests for each
pair of groups would increase the risk of Type I error (false positives), whereas ANOVA
handles the comparison in one test and controls for this error.
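
A quick calculation shows how the risk grows. The sketch below assumes each pairwise t-test is run at the same α and, as a simplification, that the tests are independent; under that assumption the chance of at least one false positive is 1 − (1 − α)^m for m tests.

```python
from math import comb

def familywise_error_rate(groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one false positive)."""
    pairs = comb(groups, 2)              # every pair of groups gets its own t-test
    return pairs, 1 - (1 - alpha) ** pairs

for k in (3, 4, 5):
    pairs, fwer = familywise_error_rate(k)
    print(f"{k} groups -> {pairs} t-tests, familywise error rate ~ {fwer:.3f}")
```

With three teaching methods, three pairwise tests at α = 0.05 already push the overall false-positive risk to roughly 14%, well above the nominal 5%.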
ANOVA Test Assumptions
All statistical tests have assumptions that must be met to ensure valid results.
Here are the assumptions that need to be satisfied for ANOVA:
1. Independence of observations
The observations (data points) must be independent of each other. In the example,
students’ exam scores in one teaching method should not influence the scores of
students in another method.
2. Homogeneity of variances
The variances within each group should be approximately equal. ANOVA assumes that
the variability of exam scores within each teaching method group is roughly the same.
This can be tested using Levene’s test, which checks for equal variances.
3. Normal distribution
The data within each group should follow a normal distribution. In our teaching method
example, the exam scores for students in each teaching group (Lecture, Workshop,
Online learning) should ideally be normally distributed.
If any assumptions are violated, the test results may be invalid. In such cases, it is
essential to consider using a non-parametric test.
One-way ANOVA steps:
Null Hypothesis (H₀): All group means are equal -> μ1 = μ2 = μ3 = … = μk, where k is the number
of groups.
Alternative hypothesis (H₁): At least one of the groups has a different mean.
Decision Rule: If p-value ≥ significance level, fail to reject the null hypothesis; otherwise reject it.

The steps to perform the one-way ANOVA test are given below (assuming k groups with n
observations each):
Step 1: Calculate the mean for each group. -> g1, g2, g3, …, gk
Step 2: Calculate the overall (grand) mean. With equal group sizes, this is done by adding
all the group means and dividing by the number of groups. -> Og = (g1 + g2 + g3 + … + gk)/k
Step 3: Calculate the sum of squares between groups (SSB). ->
SSB = n × [(g1 − Og)² + (g2 − Og)² + (g3 − Og)² + … + (gk − Og)²],
where k is the total no. of groups and n is the no. of samples in each group.
Step 4: Calculate the between-groups degrees of freedom. (df = k − 1)
Step 5: Calculate the sum of squares of errors (SSE). ->
(i1 − g1)² + (i2 − g1)² + (i3 − g1)² + … + (in − g1)²,
where n = no. of samples in a group and i represents individual values of the
group g1. Repeat this step for each group. Add the results of all groups.
Step 6: Calculate the degrees of freedom of errors. -> k(n − 1)
Step 7: Determine the MSB and the MSE.
MSB = Step 3 / Step 4 = SSB/(k − 1), MSE = Step 5 / Step 6 = SSE/(k(n − 1))
Step 8: Find the F test statistic.
F = MSB/MSE
Step 9: Find p_value(F).
If the p_value ≥ 0.05, fail to reject the null hypothesis; otherwise reject it.
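
Steps 1 to 8 can be sketched as a short function. This version assumes equally sized groups and weights the between-group sum of squares by the group size n, which is the standard definition; the exam scores and the function name `one_way_anova_f` are made up for illustration.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for k groups of equal size n."""
    k = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equally sized groups")

    group_means = [mean(g) for g in groups]                       # Step 1
    grand_mean = mean(group_means)                                # Step 2
    ssb = n * sum((gm - grand_mean) ** 2 for gm in group_means)   # Step 3
    df_between = k - 1                                            # Step 4
    sse = sum((x - gm) ** 2                                       # Step 5
              for g, gm in zip(groups, group_means) for x in g)
    df_within = k * (n - 1)                                       # Step 6
    msb, mse = ssb / df_between, sse / df_within                  # Step 7
    return msb / mse, df_between, df_within                       # Step 8

# Hypothetical exam scores for the three teaching methods.
lecture = [70, 72, 74]
workshop = [80, 82, 84]
online = [75, 77, 79]
f_stat, df_b, df_w = one_way_anova_f([lecture, workshop, online])
print(f_stat, df_b, df_w)
```

Step 9 (the p-value) needs the F distribution with (df_between, df_within) degrees of freedom, which is available in statistical libraries but not in Python's standard library.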

Chapter 2: What is Machine Learning?

Machine learning is a subset of artificial intelligence (AI). It is focused on teaching
computers to learn from data and to improve with experience, instead of being
explicitly programmed to do so. In machine learning, algorithms are trained to find
patterns and correlations in large data sets and to make the best decisions and
predictions based on that analysis. Machine learning applications improve with use and
become more accurate the more data they have access to.

Applications of machine learning are all around us: in our homes, our shopping carts,
our entertainment media, and our healthcare.

How does machine learning work?


Machine learning comprises different types of machine learning models, using various algorithmic
techniques. Depending upon the nature of the data and the desired outcome, one of four learning models
can be used: supervised, unsupervised, semi-supervised, or reinforcement. Within each of those models,
one or more algorithmic techniques may be applied – relative to the data sets in use and the intended
results. Machine learning algorithms are basically designed to classify things, find patterns, predict
outcomes, and make informed decisions. Algorithms can be used one at a time or combined to achieve
the best possible accuracy when complex and more unpredictable data is involved.

What is Supervised learning?


Supervised learning, as the name suggests, works like a teacher or supervisor guiding
the machine. In this approach we train the machine using labeled data, which means
each input has the correct output (an answer or category) attached to it. After
training, the machine is given a new set of examples so that it can apply what it
learned from the labeled data and produce the correct outcome.
For example, a labeled dataset of images of Elephant, Camel and Cow would have
each image tagged with either "Elephant", "Camel" or "Cow."

Example to Understand
Imagine we have a basket full of different fruits that we want the machine to identify.
The machine first looks at the image of a fruit and extracts features like its shape, color
and texture. Then it compares these features to the fruits it has already learned during
training. If the new fruit’s features closely match those of an apple, the machine will
predict that the fruit is an apple.
For example, suppose we train the machine by showing it fruits one by one:
●​ If the fruit is round, has a small depression at the top and is red, it is labeled
as an Apple.
●​ If the fruit is long, curved and greenish-yellow, it is labeled as a Banana.
Now after this training, if we give the machine a new fruit (say a banana) from the
basket and ask it to identify it, the machine will use what it has learned during training.
It will analyze the shape and color of the new fruit and classify it as a Banana placing it
in the correct category. In this way, the machine learns from the training data (the
basket with labeled fruits) and applies that knowledge to recognize new, unseen fruits.
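
The fruit example can be sketched with a very simple classifier. The notes do not name an algorithm, so this sketch uses a 1-nearest-neighbour rule as one possible choice; the two features (roundness and length) and all values are invented for illustration.

```python
import math

# Hypothetical labeled training data: (roundness from 0 to 1, length in cm) -> label.
training = [
    ((0.90, 7.0), "Apple"),
    ((0.95, 8.0), "Apple"),
    ((0.20, 18.0), "Banana"),
    ((0.25, 20.0), "Banana"),
]

def predict(features):
    """1-nearest-neighbour: return the label of the closest training example."""
    _, label = min(training, key=lambda item: math.dist(item[0], features))
    return label

print(predict((0.30, 19.0)))  # a long, curved fruit -> Banana
print(predict((0.92, 7.5)))   # a round fruit -> Apple
```

Because the features are on different scales, length dominates the distance here; a practical version would normally rescale the features first.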

Applications of Supervised learning


It can be used to solve a variety of problems, including:
1. Image classification: It can automatically classify images into categories
such as animals, objects or scenes, which helps in tasks like image
search, content moderation and image-based product recommendations.
2. Medical diagnosis: It can assist in medical diagnosis by analyzing patient
data such as medical images, test results and patient history to identify
patterns that suggest specific diseases or conditions.
3. Fraud detection: It can analyze financial transactions and identify
patterns that indicate fraudulent activity, which helps financial institutions
prevent fraud and protect their customers.
4. Natural language processing (NLP): It plays an important role in NLP tasks
including sentiment analysis, machine translation and text summarization,
which enables machines to understand and process human language
effectively.

Advantages of Supervised learning


1.​ It learns from labeled examples to make accurate predictions on new, unseen
data.
2. With more data and training, these models increase their accuracy, which
leads to better performance and more reliable predictions.
3.​ It works well for many tasks from detecting spam emails to predicting house
prices as it has the ability to handle various computational challenges.
4.​ It can handle both classification (sorting data into categories) and regression
(predicting numbers) which makes it flexible for different problems.
Disadvantages of Supervised learning
1. It requires a well-labeled dataset where each input has a corresponding
output. Creating such datasets takes a lot of time, money and effort, and the
labels can sometimes contain mistakes, which makes supervised learning hard to use.
2. It works well on many tasks but can struggle with very complex or
unstructured problems, such as understanding patterns or abstract ideas that
don't relate to what it was trained on.
3. These models can sometimes overfit the training data, which means they
perform well on training data but poorly on new, unseen data.
4. These models often need constant updating with new labeled data to stay
accurate as real-world data changes over time.
What is Unsupervised learning?
Unsupervised learning is a part of machine learning that works differently from
supervised learning because there is no teacher (supervisor) involved to guide the
machine. In this approach the machine is given data that has no labels or categories. It
analyzes the data on its own to find patterns, groups or relationships without any prior
knowledge. The machine learns by discovering hidden structures within the data
without being told what the correct output should be.
For example, unsupervised learning can analyze animal data and group the animals by
their traits and behavior. These groups might represent different species which allows
the machine to organize animals without any prior labels or categories.

Example to understand
Imagine we have a machine learning model trained on many unlabeled images of dogs
and cats. The model has never seen a labeled example that says “dog” or “cat”,
so it has no names for these animals.
Now, if we give the model a new image that contains both dogs and cats, it won’t be
able to label them directly as “dog” or “cat.” Instead, it will group parts of the image
based on similarities and differences in features like shape or texture. It might separate
the image into two groups, one with dog-like features and the other with cat-like features.
This happens because unsupervised learning doesn’t rely on prior knowledge or
training with labeled data. It finds patterns and organizes data on its own, which helps in
discovering information that wasn’t given before.
Application of Unsupervised learning
Unsupervised learning can be used to solve a variety of problems which includes:
1. Anomaly detection: It can identify unusual patterns or behaviors in data, which
helps in the detection of fraud, security breaches or system problems.
2.​ Scientific discovery: It can show hidden relationships and patterns in
scientific data which leads to new insights and ideas.
3.​ Recommendation systems: It finds similarities in user behavior and
preferences to recommend products, movies or music that align with their
interests.
4.​ Customer segmentation: It can identify groups of customers with similar
characteristics which allows businesses to target marketing campaigns and
improve customer service more effectively.
Advantages of Unsupervised learning
1.​ It doesn’t need labeled data so we can start working with large datasets
more easily and quickly.
2. It can reduce large amounts of data to simpler forms without
losing important patterns, which makes the data more manageable.
3.​ It discovers patterns and relationships in the data that were previously
unknown which offers valuable insights.
4.​ By analyzing unlabeled data, it shows meaningful trends and groups that
help us to understand our data deeply.
Disadvantages of Unsupervised learning
1.​ Without labeled answers, it’s difficult to tell how accurate or effective the
model is.
2.​ Lack of clear guidance can lead to less precise results for complex problems.
3.​ After grouping the data, we may need to check and label these groupings
which can be time-consuming.
4.​ Missing data, outliers or noise in the data can easily affect the quality of the
results.

What is a Cluster?
A cluster is a group of similar things that is kept apart from things that are different.
Clusters help organize and categorize items by their similarities.

For example, let's consider a bunch of different vegetables as one group. Similarly,
there could be another bunch of various fruits as a separate group. Here, in this case,
fruits and vegetables are two clusters with similar items.

What is Clustering in Data Mining?

Clustering is a method used to categorize similar data points according to their
attributes or characteristics. It discovers patterns present in a large dataset. While doing
cluster analysis, we divide the data set into groups based on data similarity and then
assign labels to the groups.
Suppose you wish to arrange a large number of books. Your objective is to organize
them so that people can easily find related books based on their requirements.
Using data analysis methods such as clustering, you can examine the attributes of books,
such as their category, writer, language, and topic. Using clustering algorithms, you can
group similar books based on these qualities.
People typically use clustering in data mining for various objectives such as data
exploration, pattern recognition, anomaly detection, customer segmentation, and
recommendation systems.

Features of Clustering in Data Mining

There are many features of Clustering in Data Mining such as:


● Many clustering algorithms scale to large amounts of data, although the
cost depends on the algorithm chosen.
● Data clustering can produce clear and relevant findings that provide valuable
insights.
● Some clustering algorithms (density-based ones, for example) can recognize
clusters with irregular or complicated shapes, not only simple geometrical forms.
● Clustering is versatile because, with a suitable similarity measure, it can
group many kinds of data, such as numbers and categories.
● Some methods can also cope with noisy data, for example by marking noise
points as outliers.

Applications of Clustering in Data Mining

Clustering in data mining has various applications across different domains. Here are
some key applications:

● Customer Segmentation: It is used to group customers based on similar
purchasing behavior. It helps businesses tailor marketing strategies for
specific customer segments, enhancing customer satisfaction and loyalty.
●​ Anomaly Detection: It is used to identify unusual patterns or outliers in
datasets. It is useful for fraud detection, network security, or any scenario
where abnormal behavior needs to be flagged.
●​ Image Segmentation: It is used to group pixels in an image with similar
attributes. It facilitates object recognition, image editing, and computer
vision applications.
●​ Document Clustering: It is used to group documents based on content
similarity. It streamlines information retrieval, categorizes documents, and
aids in organizing large document collections.
●​ Genomic Data Analysis: It is used to group genes or DNA sequences with
similar characteristics. It assists in understanding genetic patterns,
identifying potential disease markers, and enhancing biomedical research.

Example for Clustering Algorithm

Explanation

This example performs k-means clustering. It generates sample data using “make_blobs”
from “sklearn.datasets”, with 100 data points drawn around 5 cluster centers. The
k-means algorithm repeatedly assigns each data point to the nearest centroid and then
updates the centroid positions. After clustering is done, we obtain the cluster labels and
centroid coordinates. In the plot, each cluster is given its own color so the clusters are
easy to tell apart, and each centroid is marked with an “X” symbol.
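
The code listing for this example did not survive in these notes. As a stand-in, here is a
minimal plain-Python sketch of the two k-means steps described above (assign each point
to its nearest centroid, then move each centroid to the mean of its points). The helper
`make_points`, the cluster centers, the spread and the seeds are assumptions made for
illustration; it is not the original code, which used scikit-learn's `make_blobs`.

```python
import random

def make_points(centers, per_center, spread, seed=0):
    """Scatter 2-D points around each center (a simple stand-in for make_blobs)."""
    rng = random.Random(seed)
    return [(cx + rng.gauss(0, spread), cy + rng.gauss(0, spread))
            for cx, cy in centers for _ in range(per_center)]

def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # start from k random data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:                 # nothing changed, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                          # keep the old centroid if a cluster is empty
                centroids[c] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return labels, centroids

# Hypothetical data: 5 centers with 20 points each, 100 points in total.
true_centers = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 20)]
points = make_points(true_centers, per_center=20, spread=0.5)
labels, centroids = kmeans(points, k=5)
print(centroids)
```

Because the starting centroids are picked at random, k-means can settle on different
groupings from different seeds; running it several times and keeping the best result is a
common precaution.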

Data Mining Clustering Methods

There are different types of clustering methods:

1. Partitioning Based Method
2. Density Based Method
3. Hierarchical Method
4. Fuzzy Clustering

1. K-means Clustering (Partitioning Based Method)

This method divides the data into non-hierarchical groups. One of the most common
examples of partitioning-based methods is K-Means Clustering. It is also known as the
centroid-based method.
In this method, 'n' objects or data items are divided into 'k' partitions, and each
partition acts as a single cluster. The value of k should be less than or equal to n.
The partitions that are formed should satisfy two conditions:

● Each partition should have at least one object.
● Each object should belong to exactly one partition.
For example

Let's take an example to understand this better. Suppose you have to organize a large
number of books. Instead of sorting them all at once, you use a partitioning-based
method. In this approach, you start by dividing the books into groups based on genre, like
fiction, non-fiction, mystery, and so on. After doing this, you obtain subsets of books.
You can further divide them according to the names of the authors, which splits
the problem into still smaller subsets. Using a partitioning-based method, you can break
down a large problem into smaller ones.

2. Density Based Method

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised
machine learning algorithm that clusters data points based on their density. It connects
regions that contain many points into clusters, regardless of their shapes, while sparse
areas separate the dense regions from each other in the data space.

Let's understand this by an example. Suppose you have 100 balloons of different colors.
They are randomly scattered in an area. Some are closely spaced to each other, while
others are farther apart. Now, if we apply Density-based clustering to this case, we will
start selecting a random balloon, and if other balloons are at an arm's length distance,
we will include them in that cluster. We will keep expanding this cluster until there are
no more balloons nearby.
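
The balloon procedure can be sketched in plain Python as below. This is a simplified
illustration rather than a production implementation: `eps` plays the role of the
"arm's length" distance, and `min_pts` (the minimum number of balloons within reach for an
area to count as dense) is a standard DBSCAN parameter that the balloon story leaves out.
The coordinates are made up.

```python
def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    xi, yi = points[i]
    return [j for j, (x, y) in enumerate(points)
            if (x - xi) ** 2 + (y - yi) ** 2 <= eps ** 2]

def dbscan(points, eps, min_pts):
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:          # too few balloons nearby: call it noise for now
            labels[i] = NOISE
            continue
        labels[i] = cluster                   # start a new cluster from this dense point
        queue = list(neighbors)
        while queue:                          # keep expanding while balloons are in reach
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster           # noise next to a dense area becomes a border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # only dense points spread the cluster further
                queue.extend(j_neighbors)
        cluster += 1
    return labels

# Hypothetical balloon positions: two tight groups and one stray balloon.
balloons = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(balloons, eps=1.5, min_pts=3)
print(labels)  # -1 marks a balloon that belongs to no cluster
```

Unlike k-means, the number of clusters is not chosen in advance; it follows from `eps` and
`min_pts`.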

3. Hierarchical Method

Hierarchical clustering is an unsupervised learning method that builds new clusters
from previously formed ones. In its bottom-up (agglomerative) form, it starts by treating
each data point as its own group. Then it repeatedly combines the most similar groups, so
that each new group contains similar items and differs from the other groups.
Let’s understand this with the help of an example.
Explanation
This example performs hierarchical clustering using the
“AgglomerativeClustering” algorithm, imported from
“sklearn.cluster”. The data consists of 100 data points around 5 cluster centers,
generated with “make_blobs” from “sklearn.datasets”. In agglomerative clustering, all data
points are first considered individual clusters. The algorithm then iteratively merges the
closest clusters. The sequence of merges can be drawn as a dendrogram that shows how
clusters are combined. We specify the number of clusters we want, and each cluster is
given its own color in the plot so the clusters are easy to tell apart.
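
The listing for this example is missing from the notes. The merging loop itself can be
sketched in plain Python as follows. This sketch assumes single linkage (the distance
between two clusters is the distance between their closest members); scikit-learn's
`AgglomerativeClustering` uses Ward linkage by default, so results can differ. The five
points are made up.

```python
def cluster_distance(points, c1, c2):
    """Single linkage: squared distance between the closest pair of members."""
    return min((points[i][0] - points[j][0]) ** 2 + (points[i][1] - points[j][1]) ** 2
               for i in c1 for j in c2)

def agglomerative(points, n_clusters):
    clusters = [[i] for i in range(len(points))]   # every point starts as its own cluster
    merges = []                                    # merge history (what a dendrogram draws)
    while len(clusters) > n_clusters:
        # Find the pair of clusters that are closest to each other.
        a, b = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda pair: cluster_distance(points, clusters[pair[0]],
                                                     clusters[pair[1]]))
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]    # merge b into a
        del clusters[b]
    return clusters, merges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agglomerative(points, n_clusters=3)
print(clusters)  # lists of point indices, one list per cluster
```

This brute-force version recomputes every pairwise distance on each pass, which is fine for
a handful of points but slow for large datasets.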

4. Fuzzy Clustering

Fuzzy clustering is a method used in data analysis to group data points into clusters
based on their similarities. Unlike regular clustering methods that assign each data
point to only one cluster, fuzzy clustering allows for a more flexible and uncertain
approach.
It gives each data point a membership value that shows how much it belongs to each
cluster. The fuzzy clustering algorithm calculates these membership values and adjusts
them gradually to get the best clustering outcome.
For example
Consider a simple example of fuzzy clustering with fruits. You have decided that you
want to group fruits based on two conditions. First is sweetness, and second is color.
You have four fruits: an apple, an orange, a pear, and a pineapple.
Apple is extremely sweet and red; orange is moderately sweet and orange in color; pear
is extremely sweet and green; and last, pineapple is sweet and yellow.
Using fuzzy clustering, we assign each fruit a degree of membership in each group. For
example, the apple has a high degree of membership in both the extremely sweet group and
the red group, while the pineapple belongs only partly to the sweet groups. Fuzzy
clustering allows us to consider various features of the fruits and express their
memberships in different groups.
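
To make "degree of membership" concrete, here is a small sketch of the fuzzy c-means
update rule on a single feature. The sweetness scores (on a 0 to 10 scale), the two
starting centers and the fuzzifier `m = 2` are all assumptions made for illustration; the
notes do not give numeric values.

```python
def memberships(x, centers, m=2.0):
    """Degree of membership of value x in each cluster (fuzzy c-means rule)."""
    dists = [abs(x - c) for c in centers]
    if 0.0 in dists:                       # x sits exactly on a center
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** power for d_k in dists) for d_j in dists]

def fuzzy_c_means(values, centers, m=2.0, iterations=20):
    for _ in range(iterations):
        u = [memberships(x, centers, m) for x in values]
        # Each center moves to the membership-weighted mean of all values.
        centers = [sum(u[i][j] ** m * values[i] for i in range(len(values))) /
                   sum(u[i][j] ** m for i in range(len(values)))
                   for j in range(len(centers))]
    return centers, [memberships(x, centers, m) for x in values]

# Hypothetical sweetness scores.
fruits = ["apple", "orange", "pear", "pineapple"]
sweetness = [9.0, 6.0, 9.0, 7.5]
centers, u = fuzzy_c_means(sweetness, centers=[5.0, 10.0])

for name, row in zip(fruits, u):
    print(name, [round(value, 2) for value in row])  # memberships in (less sweet, very sweet)
```

Each row of memberships sums to 1, so a fruit can belong mostly to one group and partly to
the other, which is the difference from hard clustering methods such as k-means.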

Why is Clustering Required in Data Mining?

Clustering in data mining is essential for several reasons:

● Pattern Identification: Clustering helps identify patterns or groups
within large datasets, revealing inherent structures and relationships among
data points.
●​ Data Summarization: It simplifies the complexity of datasets by grouping
similar data points, making it easier to understand and analyze.
●​ Anomaly Detection: Clustering aids in detecting outliers or anomalies by
highlighting data points that do not conform to the patterns of their assigned
clusters.
●​ Data Compression: It can compress large datasets by representing them
with cluster prototypes, reducing the storage space needed.
●​ Decision Making: Clustering assists in decision-making processes by
providing insights into the natural groupings present in the data.

Classification
●​ Classification algorithms are used to categorize data into a class or category.
● It can be performed on both structured and unstructured data.
● Classification can be of three types: binary classification, multiclass classification,
and multilabel classification.

For example, categorizing emails as spam or not spam is a classification task
(binary classification).
The algorithms we are going to cover are:
1. Logistic regression
2. Naive Bayes
3. K-Nearest Neighbors
4. Support Vector Machine
5. Decision Tree
1. Logistic Regression
Logistic regression is a basic yet important classification algorithm in machine learning
that uses one or more independent variables to determine an outcome. It models the
relationship between a categorical dependent variable and a set of independent
variables. Instead of a straight line, the fitted curve is S-shaped (the logistic, or
sigmoid, function), so its output can be read as a probability between 0 and 1.
Pros:
●​ It is a very simple and efficient algorithm.
●​ Low variance.
●​ Provides probability score for observations.
Cons:
●​ Bad at handling a large number of categorical features.
●​ It assumes that the data is free of missing values and predictors are independent
of each other.
Example:
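The example that originally followed is missing from the notes. As an illustration, the
sketch below fits a one-feature logistic regression with plain gradient descent. The data
(hours studied versus pass/fail), the learning rate and the number of epochs are all
assumptions; it is meant to show the S-shaped function and the probability output, not a
tuned implementation.

```python
import math

def sigmoid(z):
    """The S-shaped logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, learning_rate=0.1, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        preds = [sigmoid(w * x + b) for x in xs]
        # Gradient of the average log-loss with respect to w and b.
        grad_w = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(p - y for p, y in zip(preds, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical data: hours studied and whether the student passed (1) or failed (0).
hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]
w, b = train(hours, passed)

p_low = sigmoid(w * 1 + b)    # estimated probability of passing after 1 hour
p_high = sigmoid(w * 6 + b)   # estimated probability of passing after 6 hours
print(round(p_low, 3), round(p_high, 3))
```

A prediction is turned into a class by thresholding the probability, commonly at 0.5.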
2. Naive Bayes
Naive Bayes is based on Bayes’ theorem together with an assumption of independence
among predictors. The classifier assumes that the presence of a particular feature in a
class is not related to the presence of any other feature/variable.
Naive Bayes classifiers are of three common types: Multinomial Naive Bayes, Bernoulli
Naive Bayes, and Gaussian Naive Bayes.
Bayes' Theorem:
○​ Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
○ The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) × P(A) / P(B)

Where,
P(A|B) is Posterior probability: Probability of hypothesis A given the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the
hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
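
A short worked example may help. The probabilities below are made up for illustration:
A is "the email is spam" and B is "the email contains the word offer".

```python
# Hypothetical probabilities, chosen only to illustrate the formula.
p_spam = 0.2                    # P(A): prior probability that an email is spam
p_offer_given_spam = 0.6        # P(B|A): likelihood of seeing "offer" in spam
p_offer_given_not_spam = 0.05   # P(B|not A): likelihood of seeing "offer" in normal mail

# P(B): marginal probability of the evidence, summed over both hypotheses.
p_offer = p_offer_given_spam * p_spam + p_offer_given_not_spam * (1 - p_spam)

# P(A|B): posterior probability that an email containing "offer" is spam.
p_spam_given_offer = p_offer_given_spam * p_spam / p_offer
print(round(p_spam_given_offer, 2))
```

Seeing the word raises the probability of spam from the prior of 0.2 to a posterior of
0.75, because the word is much more likely in spam than in normal mail.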

Pros:
●​ This algorithm works very fast.
●​ It can also be used to solve multi-class prediction problems as it’s quite useful
with them.
●​ This classifier performs better than other models with less training data if the
assumption of independence of features holds.
Cons:
● It assumes that all the features are independent. While this might sound fine in
theory, in real life it is hard to find a set of truly independent features.

3. K-Nearest Neighbor Algorithm


You must have heard of a popular saying:
“Birds of a feather flock together.”
KNN works on the very same principle. It classifies a new data point according to
the majority class among its K nearest neighbors, where K is the number
of neighbors to be considered. KNN captures the idea of similarity (sometimes called
distance, proximity, or closeness) using basic distance formulas such as
Euclidean distance, Manhattan distance, etc.

Choosing the right value for K
To select the K that’s right for the data you want to train, run the KNN algorithm several
times with different values of K and choose that value of K which reduces the number of
errors on unseen data.
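
The procedure of trying several values of K can be sketched as follows. The training
points (including one deliberately mislabeled point), the held-out points and the candidate
values of K are invented for illustration, and squared Euclidean distance is assumed.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Return the majority label among the k training points nearest to query."""
    by_distance = sorted(
        train,
        key=lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2)
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

def error_rate(train, held_out, k):
    wrong = sum(knn_predict(train, point, k) != label for point, label in held_out)
    return wrong / len(held_out)

# Hypothetical training data; the last point is a noisy "A" sitting close to class "B".
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
         ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B"), ((7, 7), "B"),
         ((5.5, 5.5), "A")]
# Held-out points that the model has not seen.
held_out = [((1.5, 1.5), "A"), ((6.5, 6.5), "B"), ((3, 3), "A"), ((5, 5), "B")]

errors = {k: error_rate(train, held_out, k) for k in (1, 3, 5)}
best_k = min(errors, key=errors.get)   # smallest K among those with the lowest error
print(errors, best_k)
```

With K = 1 the single noisy neighbor decides the label of the point (5, 5); with a larger K
the vote of the surrounding "B" points outweighs it. Odd values of K avoid tied votes when
there are two classes.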
Pros:
● KNN is simple and easy to implement.
● There’s no need to build a model, tune several parameters, or make additional
assumptions, unlike some of the other classification algorithms.
●​ It can be used for classification, regression, and search. So, it is flexible.
Cons:
●​ The algorithm gets significantly slower as the number of examples and/or
predictors/independent variables increase.

4. SVM
SVM stands for Support Vector Machine. This is a supervised machine learning
algorithm that is used for both classification and regression problems, although
it is mostly used for classification. The basic concept of the Support
Vector Machine can be understood through a simple example. Imagine you have
two tags, green and blue, and the data has two features, x and y.
We want a classifier that, given a pair of (x,y) coordinates, outputs either green or
blue. Plot the labeled training data on a plane and then try to find a line (a hyperplane,
once there are more dimensions) that separates the data points of the two colors as
clearly as possible.

This works when the data is linearly separable. If the data is not, SVM uses the
kernel trick: the data is mapped into a higher-dimensional space, where it can
become linearly separable into two groups.
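
A tiny example of "increasing the dimension": the one-dimensional points below cannot be
split by a single threshold because the green points sit between the blue ones, but after
adding x squared as a second feature a horizontal line separates them. The data, the
feature map and the threshold 2.5 are assumptions for illustration. The function `kernel`
shows the "trick" part: it returns the same value as the dot product in the new space
without building the new features explicitly.

```python
# Hypothetical 1-D data: blue on the outside, green in the middle.
xs = [-3, -2, -1, 0, 1, 2, 3]
labels = ["blue", "blue", "green", "green", "green", "blue", "blue"]

def feature_map(x):
    """Lift a 1-D point into 2-D by adding x squared as a second coordinate."""
    return (x, x * x)

def kernel(x, z):
    """Equals the dot product feature_map(x) . feature_map(z), computed directly."""
    return x * z + (x * z) ** 2

def classify(x, threshold=2.5):
    # In the lifted space, the horizontal line "second coordinate = threshold"
    # separates the two colors.
    _, height = feature_map(x)
    return "blue" if height > threshold else "green"

print([classify(x) for x in xs])

# The kernel value matches the explicit dot product in the lifted space.
a, b = feature_map(2), feature_map(3)
print(kernel(2, 3), a[0] * b[0] + a[1] * b[1])
```

A real SVM also chooses the separating hyperplane with the widest margin; this sketch only
shows why the lifted data becomes separable.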
Pros:
●​ SVM works relatively well when there is a clear margin of separation between
classes.
●​ SVM is more effective in high-dimensional spaces.
Cons:
●​ SVM is not suitable for large data sets.
●​ SVM does not perform very well when the data set has more noise i.e. when
target classes are overlapping. So, it needs to be handled.

5. Decision Tree
The decision tree is one of the most popular machine learning algorithms. Decision trees
are used for both classification and regression problems. They mimic human
decision-making, so it is easy to follow the logic the model uses to interpret the data.
Unlike black-box algorithms such as SVM or neural networks, a decision tree shows the
rules behind each prediction.

For example, if we are classifying a person as fit or unfit, the decision tree asks a
sequence of questions about the person and ends in a leaf labeled fit or unfit.
So, in short, a decision tree is a tree where each internal node represents a
feature/attribute, each branch represents a decision (a rule), and each leaf represents an
outcome. This outcome may be categorical or continuous: categorical in the case of
classification and continuous in the case of regression.
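
The node, branch and leaf structure can be written down directly as a nested dictionary.
The questions and outcomes below are hypothetical (the figure for the fit/unfit example is
not reproduced in these notes), and the tree is written by hand rather than learned from
data.

```python
# Each internal node asks about one feature; each branch is an answer; each leaf is an outcome.
tree = {
    "question": "age_under_30",
    "yes": {"question": "eats_lots_of_pizza", "yes": "unfit", "no": "fit"},
    "no": {"question": "exercises_in_morning", "yes": "fit", "no": "unfit"},
}

def classify(node, person):
    """Follow the branches that match the person's answers until a leaf is reached."""
    while isinstance(node, dict):
        node = node["yes"] if person[node["question"]] else node["no"]
    return node

person = {"age_under_30": True, "eats_lots_of_pizza": False, "exercises_in_morning": False}
print(classify(tree, person))
```

A learning algorithm builds such a tree automatically by choosing, at each node, the
feature and split that best separate the classes.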
Pros:
●​ When compared to other algorithms, decision trees require less effort for data
preparation while pre-processing.
●​ They do not require normalization of data and scaling as well.
● A decision tree model is intuitive and easy to explain to technical
teams as well as to stakeholders.
Cons:
●​ If even a small change is done in the data, that can lead to a large change in the
structure of the decision tree causing instability.
● Calculations can sometimes become more complex than with other algorithms.
● Decision trees can take longer to train than simpler models.

Overfitting, underfitting, and the bias-variance tradeoff are foundational concepts
in machine learning. They are important because they describe the state of a model based
on its performance. The best way to understand these terms is to see them as a
tradeoff between the bias and the variance of the model. Let's understand the
phenomenon of overfitting and underfitting.

Overfitting occurs when a statistical model or machine learning algorithm captures
the noise of the data. Intuitively, overfitting occurs when the model or the algorithm
fits the data too well. Specifically, overfitting occurs if the model or algorithm shows
low bias but high variance. Overfitting is often a result of an excessively complicated
model, and it can be prevented by fitting multiple models and using validation or
cross-validation to compare their predictive accuracies on held-out data.

Underfitting occurs when a statistical model or machine learning algorithm cannot
capture the underlying trend of the data. Intuitively, underfitting occurs when the
model or the algorithm does not fit the data well enough. Specifically, underfitting
occurs if the model or algorithm shows low variance but high bias. Underfitting is
often a result of an excessively simple model.
Both overfitting and underfitting lead to poor predictions on new data sets.
Let's now look at bias and variance in simpler terms.

What is Bias?
Bias is the difference between the average prediction of our model and the correct value
which we are trying to predict. A model with high bias pays very little attention to the
training data and oversimplifies the problem.
Rule of thumb: high bias shows up as a high error on the training data.

What is Variance?
Variance is the variability of the model's prediction for a given data point, that is, how
much the prediction would change if the model were trained on a different sample of data.
A model with high variance pays a lot of attention to the training
data and does not generalize to data it hasn’t seen before.
Rule of thumb: high variance shows up as a test error that is much higher than the
training error.
To make the concepts clearer, they can be looked at in two parts: bias and variance
for regression models and for classification models.
Considering Regression models:
Figure 1: Bias and Variance for Regression Model
We can see clearly that Model-1 and Model-3 are underfitting and overfitting
respectively.
Model-1 has not captured the trend properly because the model is too simple, so
both its training and test accuracy suffer.
As discussed earlier, bias shows up as error on the training set, while variance shows
up as extra error on the test set. Model-1 has low training and test accuracy, i.e. it
has high bias (high training error). Its test error is high as well, but it is about as
high as the training error, so the problem is bias rather than variance. This matches the
earlier statement that underfitting means high bias and low variance.
Similarly, Model-3 has fitted the training data too well, which is why it
fails on the test data (low test accuracy). Since the training accuracy for Model-3 is
high and its test accuracy is low, Model-3 has low bias (low training error) and
high variance (high test error).
Model-2 is in the “just right” condition: it performs well on the training set as well as
the test set. It has high training accuracy (low bias, low training error) and high test
accuracy (low variance, low test error).
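
The three situations can be reproduced with a small experiment. Everything in this sketch
is an assumption made for illustration: the data follow the trend y = 2x with a fixed
alternating "noise" of +1 and -1 on the training set, the test targets are noise-free, and
the three models are a constant (too simple), a least-squares line (about right) and a
memorizer that returns the target of the nearest training point (too flexible).

```python
def mse(model, xs, ys):
    """Mean squared error of a model (a function of x) on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x = list(range(10))
train_y = [2 * x + (1 if x % 2 == 0 else -1) for x in train_x]   # trend plus fixed noise
test_x = [x + 0.4 for x in range(9)]
test_y = [2 * x for x in test_x]                                  # noise-free trend

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)

def underfit(x):
    """Model-1: ignores x and always predicts the training mean."""
    return mean_y

slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y)) /
         sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x

def good_fit(x):
    """Model-2: the least-squares straight line."""
    return intercept + slope * x

def overfit(x):
    """Model-3: memorizes the training set and copies the nearest training target."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

for name, model in [("Model-1", underfit), ("Model-2", good_fit), ("Model-3", overfit)]:
    print(name,
          "train MSE:", round(mse(model, train_x, train_y), 2),
          "test MSE:", round(mse(model, test_x, test_y), 2))
```

Model-1 has a large error on both sets (high bias), Model-3 has zero training error but a
clearly larger test error (high variance), and Model-2 has the lowest test error of the
three.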

Chapter 3: Regression and its types

Simple Linear Regression vs Multiple Linear Regression


Regression analysis is a statistical method used to examine the relationship between two
or more variables. It is a powerful tool for predicting future outcomes based on past
data. There are two main types of regression analysis: simple linear regression and
multiple linear regression. In this article, we will explore the differences between these
two methods, using examples to illustrate the key concepts.
Simple Linear Regression
Simple linear regression is a statistical technique used to model the relationship
between two variables: a dependent variable and an independent variable. The
dependent variable is the variable we want to predict, while the independent variable is
the variable that we use to make the prediction. In simple linear regression, we assume
that there is a linear relationship between the two variables, which means that the
change in the independent variable is directly proportional to the change in the
dependent variable.
For example, let’s say we want to predict a person’s weight based on their height. In this
case, weight is the dependent variable, and height is the independent variable. We
would collect data on the heights and weights of a sample of individuals and use this
data to create a regression model. The model would allow us to predict a person’s weight
based on their height.
The equation for a simple linear regression model is:
Y = a + bX + e
where Y is the dependent variable, X is the independent variable, a is the intercept (the
value of Y when X = 0), b is the slope (the change in Y for a one-unit change in X), and e
is the error term (the difference between the predicted value of Y and the actual value of
Y).
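
The intercept a and slope b are usually estimated by least squares, which picks the line
with the smallest sum of squared errors. A minimal sketch using the closed-form formulas is
shown below; the height and weight values are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates of the intercept a and slope b in Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) /
         sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical sample: heights in cm and weights in kg.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 66, 75, 83]

a, b = fit_line(heights_cm, weights_kg)
predicted_weight = a + b * 175      # predicted weight for a person 175 cm tall
print(round(a, 2), round(b, 2), round(predicted_weight, 2))
```

Here the slope b is the expected change in weight (kg) for each additional centimeter of
height. The intercept a is the value at a height of 0, which has no physical meaning on its
own but positions the line.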

Multiple Linear Regression


Multiple linear regression is a statistical technique used to model the relationship
between two or more independent variables and a dependent variable. The idea behind
multiple linear regression is similar to simple linear regression, except that we now have
multiple independent variables that we use to make our prediction.
For example, let’s say we want to predict a person’s salary based on their age, education,
and years of experience. In this case, salary is the dependent variable, while age,
education, and years of experience are the independent variables. We would collect data
on these variables for a sample of individuals and use this data to create a regression
model. The model would allow us to predict a person’s salary based on their age,
education, and years of experience.
The equation for a multiple linear regression model is:
Y = a + b1X1 + b2X2 + b3X3 + … + bnXn + e
Where Y is the dependent variable, X1, X2, X3, … Xn are the independent variables, a is
the intercept, b1, b2, b3,... bn are the slopes (the change in Y for a one-unit change in
each independent variable), and e is the error term.

Differences between Simple Linear Regression and Multiple Linear Regression


The main difference between simple linear regression and multiple linear regression is
the number of independent variables used in the model. In simple linear regression, we
use one independent variable, while in multiple linear regression, we use two or more
independent variables.
Another difference is the complexity of the model. Simple linear regression models are
relatively simple and easy to interpret, as they involve only two variables. Multiple linear
regression models, on the other hand, are more complex and require more
computational power. They also require more careful interpretation, as the relationships
between the independent variables and the dependent variable can be more difficult to
understand.
Example
To illustrate the differences between simple linear regression and multiple linear
regression, let’s consider an example. Suppose we want to predict a person’s score on a
math test based on their study time and their IQ score. We collect data on study time (in
hours) and IQ scores (on a scale of 0 to 100) for a sample of 50 students, along with
their scores on a math test (out of 100). We can then use this data to create both a
simple linear regression model and a multiple linear regression model.
First, let’s create a simple linear regression model. We can plot the data on a scatter plot
to visualize the relationship between study time and math scores.

Simple linear regression model


From the scatter plot, we can see that there appears to be a positive linear relationship
between study time and math scores. We can then fit a linear regression line to the data
to estimate the relationship between the two variables.
The equation for the simple linear regression model is:
Math Score = 32.55 + 1.89 x Study Time
This means that for every one-hour increase in study time, we expect the student’s math
score to increase by 1.89 points, on average.
Now, let’s create a multiple linear regression model that includes both study time and IQ
score as independent variables. The equation for the multiple linear regression model is:
Math Score = 17.62 + 1.68 x Study Time + 0.26 x IQ Score
Multiple linear regression model
This means that for every one-hour increase in study time, we expect the student’s math
score to increase by 1.68 points, on average, holding the IQ score constant. Similarly, for
every one-point increase in IQ score, we expect the student’s math score to increase by
0.26 points, on average, holding study time constant.
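
To see how the two fitted equations are used, the sketch below plugs in one student. The
coefficients are the ones given above; the student's values (10 hours of study, IQ score
of 80) are hypothetical.

```python
def simple_model(study_time):
    """Math Score = 32.55 + 1.89 x Study Time"""
    return 32.55 + 1.89 * study_time

def multiple_model(study_time, iq_score):
    """Math Score = 17.62 + 1.68 x Study Time + 0.26 x IQ Score"""
    return 17.62 + 1.68 * study_time + 0.26 * iq_score

# Hypothetical student: 10 hours of study, IQ score of 80.
simple_prediction = simple_model(10)
multiple_prediction = multiple_model(10, 80)
print(round(simple_prediction, 2), round(multiple_prediction, 2))

# "Holding the IQ score constant": one more hour changes the prediction by the slope.
extra_hour_effect = multiple_model(11, 80) - multiple_model(10, 80)
print(round(extra_hour_effect, 2))
```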
Mean Squared error: Mean Squared Error (MSE) in linear regression measures the
average of the squared differences between actual and predicted values, quantifying how
well the regression line fits the data. A lower MSE indicates a better fit, with the
squaring of errors penalizing larger deviations and ensuring positive and negative errors
don't cancel each other out.

R-Squared Value: In linear regression, R-squared is a statistical measure of
"goodness of fit" that tells you how much of the variation in the dependent variable is
explained by your model's independent variables, ranging from 0% to 100%. A higher
R-squared value indicates that the data points are closer to the regression line, meaning
the model's predictions are a better fit for the actual data.
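
Both measures are short to compute by hand. In this sketch the actual and predicted values
are made up; R-squared is computed as 1 minus the ratio of the residual sum of squares to
the total sum of squares around the mean.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Share of the variation in the actual values that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)            # total variation
    return 1 - ss_res / ss_tot

# Hypothetical values.
actual = [3, 5, 7, 9]
predicted = [2.5, 5.5, 6.5, 9.5]

print(mean_squared_error(actual, predicted))     # every error is 0.5, so MSE is 0.25
print(round(r_squared(actual, predicted), 2))    # 1 - 1/20 = 0.95
```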

***For Logistic Regression, K-Means and Decision Tree, refer to the Chapter 2 notes.

Principal Component Analysis

What is PCA and how do we use it?


Every dataset is different, and PCA usually comes into the picture when you have
a dataset with many different variables!
The idea of PCA is simple — reduce the number of variables of a data set, while
preserving as much information as possible.
Let us take a look at this dataset about Dogs~!
Dataset of Dog’s Characteristics. Image by Author
This dataset shows you the different physical characteristics of various dog
breeds. There are many different variables like Body Length, Weight, Average
Life Span, Bark Loudness, etc. The list goes on…
So the question you might ask is, “There are too many variables to consider… Is
there a way I can just look at the most important variables?”
Translating that into technical terms, that would be “reducing the dimension
of your feature space.” Reducing the dimension of the feature space is also
called “dimensionality reduction”.
Then you might be wondering, “Can we just remove variables like that?”
Admittedly, there is a trade-off that we are making here. Reducing the number of
variables of a data set naturally comes at the expense of accuracy, but the trick in
dimensionality reduction is to trade a little accuracy for simplicity.
Just imagine you have to work with 10, 20 or even 50 different variables. You
cannot possibly be working around so many variables! Not all variables are as
relevant either!
Take note that it is important to apply business/common sense when removing
variables too!
What are the benefits of PCA?
1.​ Captures the most “Important” Variables/Features. By The
Law of Parsimony, or Occam’s razor, the simplest explanation of
an event or observation is the preferred explanation. By reducing
the dimension of your feature space, you have fewer relationships
between variables to consider and you are less likely to overfit your
model.
2.​ “De-noise” your data and reduce redundancy. PCA identifies
components that explain the greatest amount of variance, hence it can
capture the most significant signal in the data and omits the less
relevant variables that are noise.
3.​ Better Visualization of your Data. It would make your life so
much better visualizing a plot on a 2- or 3-dimensional plane.
4.​ Better Data Storage & Computational Time of your Data.
PCA is used to compress information to store and transmit data more
efficiently. Think beyond conventional data points! You can even use
PCA to compress images without losing too much quality, or in signal
processing.
The 2 Main Applications of PCA
1. Feature Elimination
This needs no further introduction. You eliminate variables/features that are not
as significant. The advantages of feature elimination methods include simplicity
and maintaining the interpretability of your variables.
2. Feature Extraction
This is the core of PCA itself. Take the above Dog Dataset, for example. Say we have
Body Length, Weight, Body Height, Body Width, Body Mass Index, etc. Some of these
variables may just be combinations of other attributes! Feature extraction builds a
smaller set of new variables (the principal components) from such overlapping ones.

Step 1 - Data normalization


Consider, for instance, the following information for a given client.
●​ Monthly expenses: $300
●​ Age: 27
●​ Rating: 4.5
This information has different scales and performing PCA using such data will lead to a
biased result. This is where data normalization comes in. It ensures that each attribute
has the same level of contribution, preventing one variable from dominating others. For
each variable, normalization is done by subtracting its mean and dividing by its standard
deviation.
Step 2 - Covariance matrix
As the name suggests, this step is about computing the covariance matrix from the
normalized data. This is a symmetric matrix, and each element (i, j) corresponds to the
covariance between variables i and j.
Step 3 - Eigenvectors and eigenvalues
Geometrically, an eigenvector represents a direction such as “vertical” or “90 degrees”.
An eigenvalue, on the other hand, is a number representing the amount of variance
present in the data for a given direction. Each eigenvector has its corresponding
eigenvalue.
Step 4 - Selection of principal components
There are as many pairs of eigenvectors and eigenvalues as the number of variables in
the data. In the data with only monthly expenses, age, and rating, there will be three
pairs. Not all the pairs are equally relevant, so they are ranked by eigenvalue: the
eigenvector with the highest eigenvalue corresponds to the first principal component, the
eigenvector with the second highest eigenvalue is the second principal component, and so
on. Only the top few components are kept.
Step 5 - Data transformation in new dimensional space
This step involves re-orienting the data onto a new subspace defined by the
principal components. This reorientation is done by multiplying the normalized data by
the previously selected eigenvectors.
It is important to remember that this transformation does not modify the original data
itself but instead provides a new perspective to better represent the data.
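
The five steps can be followed end to end on a tiny data set. This is a plain-Python sketch
under stated assumptions: the six clients are invented, only the first principal component
is computed, and it is found with power iteration (repeatedly multiplying a vector by the
covariance matrix) instead of a full eigen-decomposition, which a numerical library would
normally provide.

```python
import math

def standardize(rows):
    """Step 1: subtract each column's mean and divide by its standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def covariance_matrix(rows):
    """Step 2: element (i, j) is the covariance between variables i and j."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]

def first_component(cov, iterations=500):
    """Steps 3 and 4: power iteration converges to the eigenvector with the largest
    eigenvalue (provided the starting vector is not orthogonal to it)."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(len(v)))
                     for i in range(len(v)))
    return eigenvalue, v

# Hypothetical clients: (monthly expenses in $, age, rating).
clients = [(300, 27, 4.5), (450, 35, 3.8), (120, 22, 4.9),
           (800, 48, 3.1), (560, 41, 3.5), (220, 25, 4.7)]

normalized = standardize(clients)
cov = covariance_matrix(normalized)
eigenvalue, direction = first_component(cov)

# Step 5: project each normalized client onto the first principal component.
scores = [sum(x * w for x, w in zip(row, direction)) for row in normalized]

print("share of variance explained:", round(eigenvalue / len(cov), 3))
print("scores:", [round(s, 2) for s in scores])
```

After standardization each variable has variance 1, so the eigenvalues add up to the number
of variables, and dividing the first eigenvalue by 3 gives the share of the total variance
that the first component captures.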
