
Introduction to Data Science Course

The document outlines a course on Basic Data Science covering key topics such as descriptive and inferential statistics, hypothesis testing, regression analysis, classification, and clustering. It emphasizes the importance of data science in decision-making, efficiency, and innovation across various industries, including healthcare, finance, and retail. The document also includes previous year questions and answers to aid in understanding and applying data science concepts.

Uploaded by

RITESH MANNA
Copyright
© © All Rights Reserved

Basic Data Science (MCAN-E304F)

UNIT  COURSE CONTENT

1  Introduction to Data Science
Define Data Science, why data science, data science in business

2  Descriptive Statistics
Matrix, matrix operations, sample, population, descriptive statistics, central tendency, outlier detection

3  Inferential Statistics
Basics of probability, probability distribution, Central Limit Theorem

4  Hypothesis Testing
Null and Alternate Hypothesis, Making a Decision, Critical Value Method, p-Value Method and Types of Errors, Two-Sample Mean and Proportion Test

5  Regression Analysis
Fundamentals of regression analysis, assumptions of regression analysis, accuracy, validity, dealing with categorical data

6  Classification
Introduction, logistic regression, model building and evaluation

7  Clustering
Introduction to clustering, k-means clustering, hierarchical clustering

8  Decision Tree and kNN
Introduction to decision tree, regression tree, truncation & pruning, random forest, kNN for regression, classification, weighted kNN

Introduction to Data Science
What is Data Science?
Data science is the study of data that helps us derive useful insight for business decision
making. Data Science is all about using tools, techniques, and creativity to uncover insights
hidden within data. It combines math, computer science, and domain expertise to tackle
real-world challenges in a variety of fields.
Data science processes raw data to solve business problems and to predict future trends or requirements. For example, from the large volume of raw data a company holds, data science can help answer questions such as:
● What do customers want?
● How can we improve our services?
● What will be the upcoming trend in sales?
● How much stock will be needed for the upcoming festival?

Data science involves these key steps:


●​ Data Collection: Gathering raw data from various sources, such as databases,
sensors, or user interactions.
●​ Data Cleaning: Ensuring the data is accurate, complete, and ready for analysis.
●​ Data Analysis: Applying statistical and computational methods to identify patterns,
trends, or relationships.
●​ Data Visualization: Creating charts, graphs, and dashboards to present findings
clearly.
●​ Decision-Making: Using insights to inform strategies, create solutions, or predict
outcomes.

Why Is Data Science Important?


In a world flooded with data, data science is crucial for driving progress and innovation across industries. Here are some key reasons why it is so important:
●​ Helps Business in Decision-Making: By analyzing data, businesses can
understand trends and make informed choices that reduce risks and maximize
profits.
●​ Improves Efficiency: Organizations can use data science to identify areas where
they can save time and resources.
●​ Personalizes Experiences: Data science helps create customized
recommendations and offers that improve customer satisfaction.
●​ Predicts the Future: Businesses can use data to forecast trends, demand, and
other important factors.
●​ Drives Innovation: New ideas and products often come from insights discovered
through data science.
●​ Benefits Society: Data science improves public services like healthcare, education,
and transportation by helping allocate resources more effectively.

Industries Where Data Science Is Used


Data science is transforming every industry by unlocking the power of data. Here are some key
sectors where data science plays a vital role:
●​ Healthcare: Data science improves patient outcomes by using predictive analytics
to detect diseases early, creating personalized treatment plans and optimizing
hospital operations for efficiency.
●​ Finance: Data science helps detect fraudulent activities, assess and manage financial
risks, and provide tailored financial solutions to customers.
●​ Retail: Data science enhances customer experiences by delivering targeted
marketing campaigns, optimizing inventory management, and forecasting sales
trends accurately.
●​ Technology: Data science powers cutting-edge AI applications such as voice
assistants, intelligent search engines, and smart home devices.
●​ Transportation: Data science optimizes travel routes, manages vehicle fleets
effectively, and enhances traffic management systems for smoother journeys.
●​ Manufacturing: Data science predicts potential equipment failures, streamlines
supply chain processes, and improves production efficiency through data-driven
decisions.
●​ Energy: Data science forecasts energy demand, optimizes energy consumption, and
facilitates the integration of renewable energy resources.
●​ Agriculture: Data science drives precision farming practices by monitoring crop
health, managing resources efficiently, and boosting agricultural yields.

Data Science in Business


Data science is essential in modern business because it provides the tools and techniques to
transform raw data into actionable insights that drive better decision-making, efficiency, and
profitability.
●​ Decision Making: It moves business strategy from intuition to data-driven
decision-making.
●​ Customer Insights: It helps businesses understand their customers deeply (e.g., in
Retail, by delivering targeted marketing and enhancing customer experiences).
●​ Risk Management: It is crucial for assessing and mitigating risks (e.g., in Finance, by
helping to detect fraudulent activities and manage financial risks).
●​ Optimization: It is used to streamline operations and increase efficiency (e.g., in
Healthcare, by optimizing hospital operations; in Manufacturing, by streamlining supply
chain processes).
●​ Innovation: It powers the creation of new, intelligent products and services (e.g., in
Technology, by powering AI applications like voice assistants).
Previous year questions
Very Short Answer (VSA) Questions (1 Mark)
1.​ The independent variable is also called ______ variable.
○​ Answer: Predictor (or Explanatory) variable.
2.​ What kind of distance metric(s) are suitable for categorical variables to find the closest
neighbors?
○​ Answer: Hamming Distance (or Overlap Metric).
3.​ Which theorem of probability is used by Naive Bayes Algorithm?
○​ Answer: Bayes' Theorem (or Bayes' Rule).
4.​ A tool not for Statistical Data Analysis is _______.
○​ Answer: Microsoft Word (or any non-statistical software like Adobe Photoshop,
Notepad, etc.).
5.​ _______ type of analytics describes what happened in the past.
○​ Answer: Descriptive.

Short Answer (SA) Questions (5 Marks)


1. Discuss differences between Divisive and Agglomerative clustering.
These are the two main approaches in Hierarchical Clustering, which creates a
tree-like structure called a dendrogram.
Feature | Agglomerative Clustering (Bottom-Up) | Divisive Clustering (Top-Down)
Starting Point | Each data point is an individual cluster. | All data points belong to one single, large cluster.
Process | Merges the closest pairs of clusters sequentially. | Splits the largest cluster into smaller, dissimilar sub-clusters recursively.
Operation | Cohesive (Joining/Fusing). | Separative (Breaking up/Partitioning).
Complexity | Generally simpler to implement. Can be computationally expensive for very large datasets. | More complex to implement as it requires a method to determine the optimal split.
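
The bottom-up process can be illustrated with a minimal sketch. It assumes one-dimensional points, single linkage (distance between clusters is the smallest pairwise distance), and a stopping rule of "stop at k clusters"; the data and the function names are hypothetical and chosen only for illustration.

```python
# Bottom-up (agglomerative) clustering with single linkage on 1-D points.
# Data, linkage choice and stopping rule are illustrative assumptions.

def single_linkage(c1, c2):
    """Distance between two clusters = smallest pairwise point distance."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerative(points, k):
    # Start: every point is its own cluster.
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > k:
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [1, 2, 6, 7, 15]
three = agglomerative(points, 3)
two = agglomerative(points, 2)
print(three)
print(two)
```

A divisive method would run in the opposite direction, starting from the single cluster [1, 2, 6, 7, 15] and choosing where to split it.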

2. What are the functions of Data Warehouse tools and Utilities?


Data Warehouse tools and utilities primarily manage the ETL (Extract, Transform,
Load) process to populate and maintain the data warehouse. Their main functions are:
●​ Extraction: Reading and gathering data from various operational source systems
(databases, flat files, APIs, etc.).
●​ Cleansing/Scrubbing: Detecting and correcting errors, inconsistencies, or missing
values in the extracted data to ensure data quality and reliability.
●​ Transformation: Converting the raw data into the consistent, integrated, and
summarized format required by the data warehouse's dimensional model. This includes
tasks like data aggregation, calculating derived values, and resolving key differences.
●​ Loading and Refreshing: Physically inserting the transformed data into the data
warehouse tables. Refreshing is the ongoing process of updating the warehouse with new
or changed source data (incremental loading).
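
The four functions above can be sketched as a tiny pipeline. This is only an illustration using Python's standard library: the CSV content, the table name sales_by_region and the function names are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from operational systems.
RAW_CSV = """order_id,region,amount
1,East,100.50
2,west,
3,East,49.50
4,West,200.00
"""

def extract(text):
    """Extraction: read rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))

def cleanse(rows):
    """Cleansing: drop rows with a missing amount, normalise region names."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append({"region": row["region"].strip().title(),
                        "amount": float(row["amount"])})
    return cleaned

def transform(rows):
    """Transformation: aggregate total sales per region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals

def load(conn, totals):
    """Loading: insert (or refresh) the summarised rows in a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_region "
        "(region TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO sales_by_region VALUES (?, ?)",
        list(totals.items()),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(cleanse(extract(RAW_CSV))))
result = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(result)
```

Running load again with new totals would overwrite the existing rows, which is a simple form of refreshing.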

3. What Is 'naive' in the Naive Bayes Classifier?


The term "naive" refers to the classifier's fundamental, simplifying assumption of
conditional independence among the features (predictor variables) given the class label.
●​ The Assumption: The algorithm assumes that the presence or absence of one feature is
independent of the presence or absence of any other feature, given that the class label is
already known.
●​ Example: In text classification, it assumes the probability of the word "buy" appearing
is independent of the word "now" appearing, given that the email is classified as "Spam."
●​ The Trade-off: This assumption is often violated in real-world data but greatly
simplifies the complex probability calculations, making the Naive Bayes model highly
efficient, fast, and often effective, especially for problems with many features (like text
data).
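
A small numeric sketch shows where the "naive" step enters the calculation. The prior and per-word probabilities below are made-up values for illustration, not estimates from any real data set.

```python
# Hypothetical probabilities for a two-word spam filter (illustrative only).
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"buy": 0.5, "now": 0.4},
    "ham": {"buy": 0.1, "now": 0.2},
}

def posterior(words):
    scores = {}
    for label in prior:
        score = prior[label]
        # Naive step: multiply per-word likelihoods as if the words were
        # conditionally independent given the class.
        for w in words:
            score *= likelihood[label][w]
        scores[label] = score
    # Normalise so the class probabilities sum to 1 (Bayes' theorem).
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

post = posterior(["buy", "now"])
print(post)
```

Without the independence assumption, the model would need the joint probability of "buy" and "now" appearing together for each class, which grows quickly as the vocabulary grows.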

4. What is the goal of A/B Testing?


The goal of A/B Testing (or split testing) is to statistically determine which of two versions (A
and B) of a single variable performs better in achieving a specific business goal.
1.​ Comparison: To compare a control version (A) against a single variant (B) by showing
each version to a similar segment of users.
2.​ Hypothesis Validation: To validate a specific hypothesis (e.g., "Changing the button
color to green (B) will increase the click-through rate compared to the current red
button (A).")
3.​ Optimization: To use the collected data and statistical significance to make an
informed, data-driven decision on which version to implement permanently, thereby
optimizing a website, product, or campaign metric (e.g., conversion rate, revenue, or user
engagement).
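
One common way to judge statistical significance in an A/B test on conversion rates is a two-proportion z-test. The sketch below assumes that test is appropriate (large samples, independent users); the visitor and conversion counts are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both versions convert equally."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 1000 users per version.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print("z =", round(z, 3), "p-value =", round(p_value, 4))
```

If the p-value falls below the chosen significance level (often 0.05), the difference between A and B is unlikely to be due to chance alone.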

Long Answer (LA) Question (4 Marks)
How can the dependency of two variables be interpreted?
The dependency (or relationship) between two variables (X and Y) is interpreted by examining
three key aspects: Direction, Strength, and Form.
1. Direction of Dependency
This defines how Y changes when X changes.
●​ Positive Dependency: As X increases, Y also tends to increase (e.g., hours studied and
exam score).
●​ Negative Dependency: As X increases, Y tends to decrease (e.g., altitude and air
temperature).
●​ No Dependency: Changes in X have no systematic impact on Y.​

2. Strength of Dependency (Correlation)


The strength of a linear relationship is measured by the Correlation Coefficient (r), which
ranges from −1 to +1.
●​ |r| close to 1: Indicates a strong linear relationship (points cluster tightly around a line).
●​ |r| close to 0: Indicates a weak or no linear relationship.​
3. Form of Dependency (Regression)
This involves modeling the functional relationship between the variables:
●​ Linear Form: The relationship is best described by a straight line​
Ŷ = β₀ + β₁X​
Here, β₁ is interpreted as the change in Y for every one-unit change in X.
●​ Non-linear Form: The relationship follows a curve (e.g., quadratic, exponential).​
This indicates that the impact of X on Y is not constant across all values of X.​
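
For the linear form, β₀ and β₁ can be estimated by ordinary least squares. The sketch below assumes that method and uses hypothetical data (hours studied and exam scores); fit_line is an illustrative helper name.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx              # slope: change in y per one-unit change in x
    b0 = mean_y - b1 * mean_x   # intercept
    return b0, b1

hours = [1, 2, 3, 4, 5]          # hypothetical X values
scores = [52, 55, 61, 64, 68]    # hypothetical Y values
b0, b1 = fit_line(hours, scores)
print("intercept =", round(b0, 2), "slope =", round(b1, 2))
```

Here the positive slope indicates a positive dependency: each additional hour is associated with roughly 4 more marks in this made-up data.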
Descriptive Statistics
Descriptive statistics are the foundation of data science. They involve methods to summarize,
organize, and interpret data to understand its main features.​
They describe:
●​ The center of the data (typical value)
●​ The spread of the data (how much variation exists)
●​ The shape/distribution of the data
Descriptive statistics do not make predictions or inferences — they simply describe what the
data shows.
Types of Descriptive Statistics
Descriptive statistics are broadly classified into three categories:
1.​ Measures of Central Tendency – show where the data centers.
2.​ Measures of Variability (Dispersion) – show how data spreads.
3.​ Measures of Frequency Distribution – show how data values are distributed across
intervals or categories.

1. Measures of Central Tendency


These indicate the central or typical value in a dataset.
a) Mean (Arithmetic Average)
The mean is the sum of all observations divided by the total number of observations.
x̄ = Σx / n

Where:
●​ x = observations
●​ n = number of data points
Example (Data: 2, 3, 3, 5, 7)
Mean=(2+3+3+5+7) / 5 =4

Python Example:
import numpy as np
arr = [5, 6, 11]
mean = np.mean(arr)
print("Mean =", mean)
Output:
Mean = 7.333333333333333

b) Mode
The most frequently occurring value in a dataset.​
It’s particularly useful for categorical or discrete data.

Example (Data: 2, 3, 3, 5, 7)
The value 3 appears twice. Mode is 3

Python Example:
import scipy.stats as stats
arr = [1, 2, 2, 3]
mode = stats.mode(arr)
print("Mode =", mode)
Output (older SciPy versions; recent versions report the mode and count as scalars rather than arrays):
Mode = ModeResult(mode=array([2]), count=array([2]))

c) Median
The middle value of a sorted dataset.
●​ If the number of observations is odd, the median is the central value.
●​ If even, it’s the average of the two middle values.
Example (Data: 2, 3, 3, 5, 7)
Sorted: 2, 3, 3, 5, 7. Median is 3

Python Example:
import numpy as np
arr = [1, 2, 3, 4]
median = np.median(arr)
print("Median =", median)
Output:
Median = 2.5
Usefulness:​
Median is less sensitive to outliers and is often more representative in skewed data.
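
A quick check of this robustness, using a small hypothetical data set and Python's statistics module: adding one extreme value moves the mean a long way but leaves the median where it was.

```python
import statistics

typical = [10, 12, 13, 12, 11]    # hypothetical values
with_outlier = typical + [100]    # same data plus one extreme value

# How far each measure moves when the outlier is added.
mean_shift = statistics.mean(with_outlier) - statistics.mean(typical)
median_shift = statistics.median(with_outlier) - statistics.median(typical)

print("Mean shift   =", round(mean_shift, 2))
print("Median shift =", median_shift)
```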

2. Measures of Variability (Dispersion)


These show how much the data values differ from the central value (mean/median).
a) Range
The difference between the maximum and minimum values.
Range = Max value − Min value
Python Example:
arr = [1, 2, 3, 4, 5]
Range = max(arr) - min(arr)
print("Range =", Range)
Output:​
Range = 4

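b) Variance (σ²)
Variance measures the average squared deviation of each value from the mean.
Population variance: σ² = Σ(x − μ)² / N
Sample variance: s² = Σ(x − x̄)² / (n − 1)
Python Example (statistics.variance computes the sample variance, dividing by n − 1):
import statistics
arr = [1, 2, 3, 4, 5]
print("Variance =", statistics.variance(arr))
Output:
Variance = 2.5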
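
The difference between dividing by N and dividing by n − 1 can be seen side by side. This sketch uses the same five values and the standard library's statistics.pvariance (population) and statistics.variance (sample), plus a manual calculation for comparison.

```python
import statistics

arr = [1, 2, 3, 4, 5]

pop_var = statistics.pvariance(arr)    # divides by N
samp_var = statistics.variance(arr)    # divides by n - 1

# Manual calculation of the same two quantities.
mean = sum(arr) / len(arr)
ss = sum((x - mean) ** 2 for x in arr)   # sum of squared deviations
manual_pop = ss / len(arr)
manual_samp = ss / (len(arr) - 1)

print("Population variance =", pop_var)
print("Sample variance     =", samp_var)
```

The sample variance is always the larger of the two, and the gap shrinks as n grows.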
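c) Standard Deviation (σ)
It is the square root of the variance and expresses the typical deviation of values from the mean, in the same units as the data.
σ = √(σ²) for a population; s = √(s²) for a sample.
Python Example (statistics.stdev computes the sample standard deviation):
import statistics
arr = [1, 2, 3, 4, 5]
print("Standard Deviation =", statistics.stdev(arr))
Output:
Standard Deviation = 1.5811388300841898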
Interpretation:
●​ A low standard deviation indicates that values are close to the mean.
●​ A high standard deviation indicates large variation.

3. Measures of Frequency Distribution


These describe how data points are distributed across categories or intervals.​
They are often represented using tables or charts like histograms or pie charts.
Components of a Frequency Distribution Table:
●​ Data Intervals or Categories – grouping of data points
●​ Frequency Count – number of observations in each group
●​ Relative Frequency – percentage of total observations
●​ Cumulative Frequency – running total of frequencies
Usefulness:​
Helps identify patterns, outliers, and overall distribution shape before applying advanced
analysis.
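
The components of a frequency distribution table can be built in a few lines. The data below are hypothetical; each row of the resulting table holds (value, frequency, relative frequency, cumulative frequency).

```python
from collections import Counter

data = [2, 3, 3, 5, 7, 3, 2, 5]   # hypothetical observations
counts = Counter(data)
n = len(data)

table = []
cumulative = 0
for value in sorted(counts):
    freq = counts[value]
    cumulative += freq                          # running total of frequencies
    table.append((value, freq, freq / n, cumulative))

for row in table:
    print(row)
```

The last cumulative frequency always equals the total number of observations.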

1. Population (The Whole Group)


A Population is the complete set of all items or individuals under study.​
Its defining characteristics are called parameters.
Definition:
A parameter is a numerical value that describes a characteristic of the entire population, such as population mean (μ), population variance (σ²), or population proportion (P).

Mathematical Example of a Population Parameter


Imagine a small company with a population of 5 employees, whose salaries (in ₹ thousands) are:
Employee | Salary (₹ thousands)
A | 50
B | 60
C | 70
D | 80
E | 90
To find the Population Mean (μ), we use:
μ = Σx / N
Substitute values:
μ = (50 + 60 + 70 + 80 + 90) / 5 = 350 / 5 = 70
Hence, the Population Mean (Parameter) is: μ = 70

Python Verification
import numpy as np
population = [50, 60, 70, 80, 90]
mu = np.mean(population)
print("Population Mean (μ) =", mu)
Output:
Population Mean (μ) = 70.0

2. Sample (The Subset)


A Sample is a smaller, representative subset drawn from the population.​
Its defining characteristics are called statistics.
Definition:
A statistic is a numerical value that describes a characteristic of the sample, such as sample mean (x̄), sample variance (s²), or sample proportion (p).

Mathematical Example of a Sample Statistic


From the company's employee population above, select a sample of 3 employees with salaries: 50, 70, 90.
The Sample Mean (x̄) is given by:
x̄ = Σx / n
Substitute values (n = 3):
x̄ = (50 + 70 + 90) / 3 = 210 / 3 = 70
Hence, the Sample Mean (Statistic) is:
x̄ = 70
Python Verification
import numpy as np
sample = [50, 70, 90]
x_bar = np.mean(sample)
print("Sample Mean (x̄) =", x_bar)

Output:
Sample Mean (x̄) = 70.0
3. The Core Relationship: Inference
The goal of statistical inference is to use information from the sample to make conclusions or
predictions about the population.
Term | Symbol | Meaning
Population Mean | μ | True (unknown) average of the population
Sample Mean | x̄ | Estimated average based on the sample

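In the above example:
Population Parameter: μ = 70
Sample Statistic: x̄ = 70
Here the two happen to coincide; a different sample (for example 50, 60, 70) would give x̄ = 60. In a real-world situation the population parameter (μ) is unknown, so we rely on the sample statistic (x̄) to estimate it.
Thus, we infer:
μ ≈ x̄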
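
Why the sample mean is a reasonable estimate of μ can be illustrated on the five-salary population from this section. The sketch lists every possible sample of 3 employees (an assumption of simple random sampling without replacement) and averages the sample means; exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations

population = [50, 60, 70, 80, 90]
mu = Fraction(sum(population), len(population))

# Mean of every possible sample of size 3 (10 samples in total).
sample_means = [Fraction(sum(s), 3) for s in combinations(population, 3)]
mean_of_means = sum(sample_means) / len(sample_means)

print("Number of samples    =", len(sample_means))
print("Smallest sample mean =", min(sample_means))
print("Largest sample mean  =", max(sample_means))
print("Mean of sample means =", mean_of_means)
```

Individual sample means range from 60 to 80, but on average they equal the population mean of 70, which is what makes x̄ an unbiased estimator of μ.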

Matrix and Matrix Operations


1. Introduction to Matrices
A matrix is a rectangular arrangement of numbers, symbols, or expressions in rows and
columns that represents data or a system of equations. Matrices are widely used in
mathematics, computer science, physics, and engineering to simplify complex linear
relationships.

2. Matrix Representation
A matrix with m rows and n columns is called an m × n matrix. For example, a 3 × 3 matrix A is represented as:
a11 a12 a13
a21 a22 a23
a31 a32 a33
Here aij denotes the element in row i and column j.

3. Matrix Addition
Matrix addition is defined only for matrices of the same order. Each element of the resulting
matrix is obtained by adding corresponding elements of the given matrices.
Example:
A = [[1, 2], [3, 4]]​
B = [[5, 6], [7, 8]]​
A + B = [[6, 8], [10, 12]]
Python Implementation:
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A + B)

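4. Matrix Subtraction
Matrix subtraction is also defined only for matrices of the same order. Each element of the result is obtained by subtracting corresponding elements.
Example:
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
A - B = [[-4, -4], [-4, -4]]
Python:
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A - B)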

5. Scalar Multiplication
Each element of a matrix is multiplied by a scalar value.​
Example:​
2 × [[1, 2], [3, 4]] = [[2, 4], [6, 8]]
Python:
import numpy as np
A = np.array([[1, 2], [3, 4]])
print(2 * A)

6. Matrix Multiplication
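Matrix multiplication is performed by taking the dot product of the rows of the first matrix with the columns of the second. It is defined only when the number of columns of A equals the number of rows of B.
Example:
A = [[1, 2], [3, 4]]
B = [[2, 0], [1, 2]]
A × B = [[1×2 + 2×1, 1×0 + 2×2], [3×2 + 4×1, 3×0 + 4×2]] = [[4, 4], [10, 8]]
Python:
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[2, 0], [1, 2]])
print(np.dot(A, B))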

7. Transpose of a Matrix
The transpose of a matrix is obtained by interchanging its rows and columns.​
Example:​
A = [[1, 2, 3], [4, 5, 6]] → Aᵀ = [[1, 4], [2, 5], [3, 6]]
Python:
import numpy as np
A = np.array([[1, 2, 3], [4, 5, 6]])
print(A.T)

8. Determinant and Inverse


For a square matrix A = [[a, b], [c, d]]:​
Determinant |A| = ad - bc​
If |A| ≠ 0, then the inverse of A is given by (1/|A|) × [[d, -b], [-c, a]]
Python:
import numpy as np
A = np.array([[1, 2], [3, 4]])
print(np.linalg.det(A))  # determinant
print(np.linalg.inv(A))  # inverse
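
The 2 × 2 formulas can also be applied by hand in plain Python. This sketch applies |A| = ad − bc and the inverse formula directly, then checks that A multiplied by its inverse gives the identity matrix; det2, inv2 and matmul2 are illustrative helper names.

```python
def det2(m):
    """Determinant of a 2 x 2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    return a * d - b * c

def inv2(m):
    """Inverse of a 2 x 2 matrix via (1/|A|) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = m
    det = det2(m)
    if det == 0:
        raise ValueError("matrix is singular; no inverse exists")
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul2(x, y):
    """Product of two 2 x 2 matrices (rows of x dotted with columns of y)."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[1, 2], [3, 4]]
A_inv = inv2(A)
print("det =", det2(A))
print("inverse =", A_inv)
print("A x inverse =", matmul2(A, A_inv))
```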
Central Tendency and Outlier Detection
1. Measures of Central Tendency
Measures of Central Tendency are statistical values that describe the central position within a
dataset. The three main measures are Mean, Median, and Mode. They help summarize a dataset
with a single representative value.
a) Mean (Arithmetic Average)
The mean is obtained by dividing the sum of all data points by the number of observations.​
Formula: x̄ = Σx / n
Mathematical Example:​
Data = [5, 6, 11]​
Mean = (5 + 6 + 11)/3 = 7.33
Python:
import numpy as np
arr = [5, 6, 11]
mean = np.mean(arr)
print('Mean =', mean)

b) Median (Middle Value)


The median is the middle value when data is arranged in ascending or descending order.​
If the number of values is odd, it's the center value. If even, it's the average of two middle
values.
Mathematical Example:​
Data = [1, 2, 3, 4]​
Median = (2 + 3)/2 = 2.5
Python:
import numpy as np
arr = [1, 2, 3, 4]
median = np.median(arr)
print('Median =', median)
c) Mode (Most Frequent Value)
The mode is the most frequently occurring value in a dataset. It is useful for categorical or
discrete data.
Mathematical Example:​
Data = [1, 2, 2, 3]​
Mode = 2
Python:
import scipy.stats as stats
arr = [1, 2, 2, 3]
mode = stats.mode(arr)
print('Mode =', mode)

2. Outlier Detection
Outliers are data points that differ significantly from other observations. They can distort
statistical analyses and must be identified and treated carefully.
Common methods to detect outliers include:​
1. Using Standard Deviation​
2. Using Interquartile Range (IQR)
a) Using Standard Deviation
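A data point is considered an outlier if it lies more than 3 standard deviations away from the mean. This rule works best on larger samples: in a very small sample, a single extreme value inflates both the mean and the standard deviation and can hide itself.
Mathematical Example:
Data = [10, 12, 13, 12, 11, 100]
Mean = 26.33, Std Dev ≈ 32.96 (population formula, the np.std default)
|100 − 26.33| = 73.67 ≈ 2.24σ, which is below the 3σ threshold. The rule therefore does not flag 100 in this six-point sample, even though the value is clearly unusual.
Python:
import numpy as np
arr = np.array([10, 12, 13, 12, 11, 100])
mean = np.mean(arr)
std = np.std(arr)  # population standard deviation (ddof=0)
# Flag values more than 3 standard deviations from the mean
outliers = [x for x in arr if abs(x - mean) > 3 * std]
print('Outliers =', outliers)  # empty list for this small sample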

b) Using Interquartile Range (IQR)


The IQR method identifies outliers based on the spread of the middle 50% of data.​
Formula:​
IQR = Q3 - Q1​
Lower Bound = Q1 - 1.5 × IQR​
Upper Bound = Q3 + 1.5 × IQR
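Mathematical Example:
Data = [10, 12, 13, 12, 11, 100], sorted: 10, 11, 12, 12, 13, 100
Q1 = 11, Q3 = 13 (medians of the lower and upper halves), IQR = 2
Lower Bound = 11 - 1.5(2) = 8, Upper Bound = 13 + 1.5(2) = 16
Values outside [8, 16] are outliers → 100 is an outlier.
Python:
import numpy as np
arr = np.array([10, 12, 13, 12, 11, 100])
# np.percentile interpolates linearly by default, so here it returns
# Q1 = 11.25 and Q3 = 12.75 (bounds [9, 15]) rather than 11 and 13
Q1 = np.percentile(arr, 25)
Q3 = np.percentile(arr, 75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = [x for x in arr if x < lower or x > upper]
print('Outliers =', outliers)  # 100 is flagged under either quartile convention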
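
Quartiles can be computed under more than one convention, and the fences move slightly with the choice. The sketch below compares two of them using only the standard library: medians of the lower and upper halves, and linear interpolation via statistics.quantiles with method="inclusive". The helper name iqr_outliers is illustrative.

```python
import statistics

data = sorted([10, 12, 13, 12, 11, 100])

def iqr_outliers(q1, q3, values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Convention 1: medians of the lower and upper halves of the sorted data.
half = len(data) // 2
q1_halves = statistics.median(data[:half])
q3_halves = statistics.median(data[-half:])

# Convention 2: linear interpolation between order statistics.
q1_interp, _, q3_interp = statistics.quantiles(data, n=4, method="inclusive")

print("Halves:        Q1 =", q1_halves, "Q3 =", q3_halves,
      "outliers =", iqr_outliers(q1_halves, q3_halves, data))
print("Interpolation: Q1 =", q1_interp, "Q3 =", q3_interp,
      "outliers =", iqr_outliers(q1_interp, q3_interp, data))
```

For this data both conventions flag only the value 100, but on borderline data the choice can change which points are reported.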

Previous year solved


VSA (1 MARK QUESTIONS)
1. The age of a person is categorical. True/False? (2025)
Answer: False
Explanation: Age is a quantitative (numerical) variable. You can perform meaningful
mathematical operations on it, such as calculating an average. Categorical variables represent
groups or labels (e.g., gender, car brand).
2. A list of 5 pulse rates is: 70, 64, 80, 74, 92. What is the median for this list?
(2025, 2023)
Answer: 74
Explanation:
• First, arrange the data in ascending order: 64, 70, 74, 80, 92.
• The median is the middle value. For n=5 observations, the median is the value at the (5+1)/2 =
3rd position.
• The 3rd value in the ordered list is 74.

3. The range is a simple measure of ______. (2024)


Answer: Dispersion (or Variation/Spread)
Explanation: Measures of dispersion describe how spread out the data points are. The range,
calculated as (Maximum value - Minimum value), is the simplest measure of dispersion.

4. If the mean and the mode are given as 35 and 30. Find the Median. (2024)
Answer: 33.33 (or 100/3)
Explanation: We use the empirical relationship for a moderately skewed distribution: Mode ≈
3Median - 2Mean.
• 30 ≈ 3Median - 2(35)
• 30 ≈ 3Median - 70
• 3 Median ≈ 100
• Median ≈ 100 / 3 ≈ 33.33

5. Choose the correct keyword for the term: A graphical representation of a data
set. (2024)
Answer: Graph (or Chart/Plot)
Explanation: This is the general term for visual representations like histograms, bar charts, and
scatter plots.

6. The value of a correlation is reported to be r = -0.5. Which of the following statements is correct? (2025)
(A) The x-variable explains 25% of the variability in the y-variable.
(B) The x-variable explains -25% of the variability in the y-variable.
(C) The x-variable explains 50% of the variability in the y-variable.
(D) The x-variable explains -50% of the variability in the y-variable.
Answer: (A)
Explanation: The proportion of variance explained is given by the coefficient of determination,
r².
• r = -0.5
• r² = (-0.5)² = 0.25
• This means 25% of the variability in y is explained by its linear relationship with x. The
negative sign of r only indicates the direction of the relationship.

7. In a frequency distribution, the last cumulative frequency is 500. Q₃ (Third Quartile, must be in... (2023))
Answer: The value at the 375th observation.
Explanation: The third quartile (Q₃) is the value below which 75% of the data lies.
• Total number of observations, N = 500.
• Position of Q₃ = (75/100) * N = 0.75 * 500 = 375.
• Therefore, Q₃ corresponds to the value of the 375th observation.

8. What is the order of the following sampling schemes from best to worst? 1. stratified 2. simple random 3. cluster (2023)
Answer: 1, 2, 3 (Stratified > Simple Random > Cluster)
Explanation:
• Stratified Sampling: Divides the population into homogenous groups (strata) and samples
from each. This ensures representation from all groups and generally provides the most precise
estimates.
• Simple Random Sampling: Every sample has an equal chance of selection. It is unbiased
but can be less precise than stratified sampling.
• Cluster Sampling: Divides the population into clusters and randomly selects entire clusters.
It is cost-effective but can introduce more sampling error, making it the least precise of the
three.

SA (5 MARKS QUESTIONS)
1. Calculate the sample mean and standard deviation for a list of 20 students
marks. (2025)
Data: 29, 26, 13, 23, 23, 25, 17, 22, 17, 19, 12, 26, 30, 30, 18, 14, 12, 26, 17, 18
Solution:
Let the data points be xᵢ.
Step 1: Calculate the Sample Mean (x̄)
• Sum of all observations, Σxᵢ = 417
• Number of observations, n = 20
• Sample Mean, x̄ = Σxᵢ / n = 417 / 20 = 20.85
Step 2: Calculate the Sample Standard Deviation (s)
The formula is: s = √[ Σ(xᵢ - x̄)² / (n - 1) ]

We calculate the sum of squared differences using Σ(xᵢ - x̄)² = Σxᵢ² - (Σxᵢ)² / n:
• Σxᵢ² = 9365
• Σ(xᵢ - x̄)² = 9365 - 417² / 20 = 9365 - 8694.45 = 670.55
• Sample Standard Deviation, s = √[ 670.55 / (20 - 1) ] = √[ 670.55 / 19 ] = √35.2921 ≈ 5.94

Final Answer:
• Sample Mean = 20.85
• Sample Standard Deviation ≈ 5.94
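
The arithmetic for this question can be checked with Python's statistics module. The sketch recomputes the sum, the sample mean and the sample standard deviation (n − 1 in the denominator) directly from the 20 marks listed in the question.

```python
import statistics

marks = [29, 26, 13, 23, 23, 25, 17, 22, 17, 19,
         12, 26, 30, 30, 18, 14, 12, 26, 17, 18]

n = len(marks)
total = sum(marks)
mean = statistics.mean(marks)    # total / n
s = statistics.stdev(marks)      # sample standard deviation (divides by n - 1)

print("n =", n, "sum =", total)
print("Sample mean =", mean)
print("Sample standard deviation =", round(s, 2))
```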

2. Explain Mean, Median and Mode. (2024)


These are the three primary Measures of Central Tendency:
●​ Mean (Arithmetic Average):
○​ Calculation: Sum of all values divided by the number of values.
○​ Property: Uses all data points. It is heavily influenced by extreme values
(outliers).
●​ Median (Middle Value):
○​ Calculation: The value that splits the ordered dataset into two equal halves
(50th percentile).
○​ Property: Robust and not affected by outliers, making it better for skewed
distributions.
●​ Mode (Most Frequent Value):
○​ Calculation: The value that occurs with the highest frequency.
○​ Property: Can be used for categorical (non-numerical) data. A dataset can have
one, multiple, or no mode.

3. What are measures of central tendency & dispersion? (2023)


• Measures of Central Tendency: These are summary statistics that represent the center
point or typical value of a dataset. They provide a single value that describes the entire data. The
main measures are the Mean, Median, and Mode.
• Measures of Dispersion: These are summary statistics that quantify the spread, variability,
or scatter of the data points. They indicate how much the data varies. Common measures
include Range, Variance, Standard Deviation, and Interquartile Range (IQR).

4. Point out how the shape of distribution depends with respect to central
tendency. (2025)
The relative positions of the mean, median, and mode reveal the skewness (asymmetry) of a
distribution.
• Symmetric Distribution: The mean, median, and mode are all approximately equal. The
distribution forms a balanced, mirror-image shape on both sides of the center.
• Right-Skewed (Positively Skewed) Distribution: The tail extends to the right. The
mean is pulled in the direction of the tail and is the largest. The mode is the smallest, and the
median lies in between. The relationship is Mode < Median < Mean.
• Left-Skewed (Negatively Skewed) Distribution: The tail extends to the left. The mean is
pulled to the left and is the smallest. The mode is the largest, and the median lies in between.
The relationship is Mean < Median < Mode.
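
The ordering of mode, median and mean can be checked on a small data set. The values below are hypothetical: a right-skewed list with a long upper tail, and its mirror image as a left-skewed list.

```python
import statistics

right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]    # long tail to the right
left_skewed = [16 - x for x in right_skewed]       # mirror image: tail to the left

def summary(values):
    """Return (mode, median, mean) for a list of numbers."""
    return (statistics.mode(values),
            statistics.median(values),
            statistics.mean(values))

r_mode, r_median, r_mean = summary(right_skewed)
l_mode, l_median, l_mean = summary(left_skewed)

print("Right-skewed: mode, median, mean =", r_mode, r_median, r_mean)
print("Left-skewed:  mode, median, mean =", l_mode, l_median, l_mean)
```

These orderings are typical rather than guaranteed; unusual distributions can break them.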

LA (3 - 10 MARKS QUESTIONS)
1. & 2. Calculate the number of observations that are two or more sample standard
deviations from the sample mean. (3 Marks - 2025, 5 Marks - 2023)
For the data in SA Q1 (Σxᵢ = 417, n = 20): x̄ = 20.85 and s ≈ 5.94.

Step 1: Calculate the boundaries.
• Lower Boundary = x̄ - 2s = 20.85 - 2(5.94) = 20.85 - 11.88 = 8.97
• Upper Boundary = x̄ + 2s = 20.85 + 2(5.94) = 20.85 + 11.88 = 32.73

Step 2: Identify observations outside these boundaries.
We are looking for values less than 8.97 or greater than 32.73.
From the data: 12, 12, 13, 14, 17, 17, 17, 18, 18, 19, 22, 23, 23, 25, 26, 26, 26, 29, 30, 30.
• There are no values less than 8.97 (the minimum is 12).
• There are no values greater than 32.73 (the maximum is 30).

Final Answer: 0 observations are two or more sample standard deviations from the sample
mean.
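
The count can be verified with a short function. count_beyond is an illustrative helper that counts observations lying at least k sample standard deviations from the sample mean; the marks are those listed in SA Q1.

```python
import statistics

marks = [29, 26, 13, 23, 23, 25, 17, 22, 17, 19,
         12, 26, 30, 30, 18, 14, 12, 26, 17, 18]

def count_beyond(values, k=2):
    """Count observations at least k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return sum(1 for v in values if abs(v - mean) >= k * s)

two_sd = count_beyond(marks, k=2)
one_sd = count_beyond(marks, k=1)
print("At least 2 s from the mean:", two_sd)
print("At least 1 s from the mean:", one_sd)
```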

3. Calculate the sample mean and standard deviation for a list of 20 students' marks. (8 Marks) (2023)
This question is identical to SA Q1 (Σxᵢ = 417, n = 20, Σ(xᵢ - x̄)² = 670.55).
Answer:
• Sample Mean = 417 / 20 = 20.85
• Sample Standard Deviation = √(670.55 / 19) ≈ 5.94

4. Calculate the mean number of children per family for the sample from the given
table. (Also calculate the standard deviation). (10 Marks) (2023)
Data Table:
Number of children (x) | Number of families (f)
0 | 8
1 | 16
2 | 22
3 | 14
4 | 6
5 | 4
6 | 2
Solution:
We will treat this as sample data.

Step 1: Calculate the Sample Mean (x̄)


We need to find Σfᵢxᵢ and the total number of families (n).
x | f | f*x
0 | 8 | 0
1 | 16 | 16
2 | 22 | 44
3 | 14 | 42
4 | 6 | 24
5 | 4 | 20
6 | 2 | 12
Σ | n = 72 | Σfx = 158

Sample Mean, x̄ = Σfx / n = 158 / 72 ≈ 2.194

Step 2: Calculate the Sample Standard Deviation (s)


The formula is: s = √[ Σfᵢ(xᵢ - x̄)² / (n - 1) ]
x | f | (x - x̄) | (x - x̄)² | f*(x - x̄)²
0 | 8 | -2.194 | 4.8136 | 38.5088
1 | 16 | -1.194 | 1.4256 | 22.8096
2 | 22 | -0.194 | 0.0376 | 0.8272
3 | 14 | 0.806 | 0.6496 | 9.0944
4 | 6 | 1.806 | 3.2616 | 19.5696
5 | 4 | 2.806 | 7.8736 | 31.4944
6 | 2 | 3.806 | 14.4856 | 28.9712
Σ | 72 | | | 151.2752

● Sum of weighted squared differences, Σfᵢ(xᵢ - x̄)² = 151.2752
● Sample Standard Deviation, s = √[ 151.2752 / (72 - 1) ] = √[ 151.2752 / 71 ] = √2.1306 ≈ 1.460

Final Answer:
• Mean number of children per family ≈ 2.19
• Standard Deviation ≈ 1.46
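
The grouped-data calculation can be reproduced from the frequency table. The sketch stores the table as a dictionary mapping number of children (x) to number of families (f) and applies the same formulas, treating the data as a sample (n − 1 in the denominator).

```python
import math

# Frequency table from the question: number of children -> number of families.
freq = {0: 8, 1: 16, 2: 22, 3: 14, 4: 6, 5: 4, 6: 2}

n = sum(freq.values())                                   # total families
sum_fx = sum(x * f for x, f in freq.items())             # Σfx
mean = sum_fx / n
ss = sum(f * (x - mean) ** 2 for x, f in freq.items())   # Σf(x - x̄)²
s = math.sqrt(ss / (n - 1))

print("n =", n, "Σfx =", sum_fx)
print("Mean =", round(mean, 3))
print("Standard deviation =", round(s, 3))
```

Because the code keeps full precision for x̄, the sum of squares differs slightly in the later decimals from the hand calculation, which rounds x̄ to 2.194.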

5. Describe Pearson's correlation coefficient? (7 Marks) (2025)


Pearson's correlation coefficient (denoted by r) is a measure of the strength and
direction of the linear relationship between two quantitative variables.

• Purpose: It quantifies how closely two variables are related to each other in a straight-line fashion.
• Range and Interpretation: The value of r always lies between -1 and +1.
- r = +1: Perfect positive linear relationship.
- r > 0: Positive correlation (as one variable increases, the other tends to increase).
- r = 0: No linear correlation.
- r < 0: Negative correlation (as one variable increases, the other tends to decrease).
- r = -1: Perfect negative linear relationship.
- The closer |r| is to 1, the stronger the linear relationship.
• Key Properties:
1. It is a dimensionless quantity; it has no units.
2. It is symmetric: the correlation between x and y is the same as between y and x.
3. It measures only linear relationships. A value of 0 does not necessarily mean no relationship,
only no linear relationship.
4. Crucially, correlation does not imply causation. A strong correlation between two variables
does not mean that one causes the other.
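
The standard formula behind the description above is r = Σ(x - x̄)(y - ȳ) / √[ Σ(x - x̄)² · Σ(y - ȳ)² ]. The sketch below implements it directly with Python's standard library. The function name pearson_r and the small data sets are made up for illustration and are not taken from the exam question.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² · Σ(y - ȳ)²)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, made up for illustration
hours = [1, 2, 3, 4, 5]
marks_up = [2, 4, 6, 8, 10]               # rises exactly linearly with hours
marks_down = [10, 8, 6, 4, 2]             # falls exactly linearly with hours
squares = [(h - 3) ** 2 for h in hours]   # symmetric U shape around h = 3

r_up = pearson_r(hours, marks_up)
r_down = pearson_r(hours, marks_down)
r_u = pearson_r(hours, squares)
print(r_up, r_down, r_u)
```

The third data set illustrates property 3: the values follow an exact quadratic relationship, yet r is 0 because the relationship is not linear.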
Inferential Statistics
What is Inferential Statistics?
Inferential statistics is an important tool that allows us to make predictions and conclusions
about a population based on sample data. Unlike descriptive statistics, which only summarize
data, inferential statistics let us test hypotheses, make estimates, and measure the uncertainty
about our predictions. These tools are essential for evaluating models, testing assumptions, and
supporting data-driven decision-making.
For example, instead of surveying every voter in a country, we can survey a few thousand and
still make reliable conclusions about the entire population’s opinion. Inferential statistics
provides the tools to do this systematically and mathematically.

Why Do We Need Inferential Statistics?


In real-world scenarios, analyzing an entire population is often impossible. Instead, we collect
data from a sample and use inferential statistics to:
● Draw conclusions about the whole population.
●​ Test claims or hypotheses.
●​ Calculate confidence intervals and p-values to measure uncertainty.
●​ Make predictions with statistical models.

Techniques in Inferential Statistics


Inferential statistics offers several key methods for testing hypotheses, estimating population
parameters, and making predictions. Here are the major techniques:
1. Confidence Intervals: It gives us a range of values that likely includes the true population
parameter. It helps quantify the uncertainty of an estimate. The formula for calculating a
confidence interval for the mean is:
CI = x̄ ± Z_(α/2) × (σ / √n)
Where:
● x̄ is the sample mean
● Z_(α/2) is the Z-value from the standard normal distribution (e.g., 1.96 for a 95%
confidence interval)
● σ is the population standard deviation
● n is the sample size
For example, if we measure the average height of 100 people, a 95% confidence interval gives us
a range where the true population mean height is likely to fall. This helps gauge the precision of
our estimate and compare models (like in A/B testing).

2. Hypothesis Testing: Hypothesis testing is a formal procedure for testing claims or
assumptions about data. It involves the following steps:
●​ Null Hypothesis (H₀): The default assumption, such as “there’s no difference
between two models.”
●​ Alternative Hypothesis (H₁): The claim you aim to prove, such as “Model A
performs better than Model B.”
We collect data and compute a test statistic (such as Z for a Z-test or t for a t-test):
Z = (x̄ - μ₀) / (σ / √n)
Where:
● x̄ is the sample mean
● μ₀ is the hypothesized population mean
● σ is the population standard deviation
● n is the sample size
After calculating the test statistic, we compare it with a critical value or use a p-value to decide
whether to reject or fail to reject the null hypothesis. If the p-value is smaller than the
significance level α (usually 0.05), we reject the null hypothesis. For a two-sided test:
p-value = 2 · P(Z > |z_obs|)
Where z_obs is the observed test statistic. A small p-value suggests strong evidence against the
null hypothesis.

3. Central Limit Theorem: It states that the distribution of the sample mean will
approximate a normal distribution as the sample size increases, regardless of the original
population distribution. This is crucial because many statistical methods assume that data is
normally distributed. The CLT can be mathematically expressed as:
X̄ ~ N(μ, σ/√n)
Where:
● μ is the population mean
●​ σ is the population standard deviation
●​ n is the sample size
This theorem allows us to apply normal distribution-based methods even when the original data
is not normally distributed, such as in cases with skewed income or shopping behavior data.
Errors in Inferential Statistics
In hypothesis testing, Type I Error and Type II Error are key concepts:
●​ Type I Error occurs when we wrongly reject a true null hypothesis. The probability
of making a Type I error is denoted by α (the significance level).
● Type II Error occurs when we fail to reject a false null hypothesis. The probability
of making a Type II error is denoted by β, and the power of the test is given by 1 - β.

The goal is to minimize these errors by carefully selecting sample sizes and significance levels.

Parametric and Non-Parametric Tests


Statistical tests help decide if the data support a hypothesis. They calculate a test statistic that
shows how much the data differs from the assumption (null hypothesis). This is compared to a
critical value or p-value to decide whether to reject or fail to reject the null hypothesis.
1.​ Parametric Tests: These tests assume that the data follows a specific distribution
(often normal) and has consistent variance. They are typically used for continuous
data. Examples include the Z-test, T-test, and ANOVA. These tests are effective for
comparing models or measuring performance when the assumptions are met.
2.​ Non-Parametric Tests: Non-parametric tests do not assume a specific distribution
for the data, making them ideal for small samples or non-normal data, including
categorical or ranked data. Examples include the Chi-Square test, Mann-Whitney U
test, and Kruskal-Wallis test. They are useful when data is skewed or categorical, such
as customer ratings or behaviors.

Example: Evaluating a New Delivery Algorithm Using Inferential Statistics


A quick commerce company wants to check if a new delivery algorithm reduces delivery times
compared to the current system.
Experiment Setup:
●​ 100 orders split into two groups: 50 with the new algorithm, 50 with the current
system.
●​ Delivery times for both groups are recorded.
Steps
1. Hypotheses:
● Null (H0): The new algorithm does not reduce delivery time.
● Alternative (H1): The new algorithm reduces delivery time.
2. Significance Level:
Set at 0.05 (5% risk of wrongly rejecting H0).
● Type I error: Thinking the new system is better when it isn’t.
● Type II error: Missing a real improvement.
3. Test Statistic: Compare average delivery times between the two groups.
4. Analysis:
● Calculate means and differences.
● Check if the data is roughly normal.
● Perform a t-test or z-test.
If p-value < 0.05, reject H0 and conclude the new algorithm is better. Otherwise, there is no
clear evidence of an improvement.
Confidence Interval: For example, an interval of -5 to -2 minutes for the difference in mean
delivery time (new minus current) means deliveries are 2 to 5 minutes faster with the new system.
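
The steps of this example can be sketched in code. The delivery times below are simulated with made-up parameters (means of 30 and 27 minutes, standard deviation 5), so the numbers are purely illustrative. With 50 orders per group the sketch uses a normal (z) approximation for the test statistic; a t-test would be the more careful choice for small samples.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

# Simulated delivery times in minutes (hypothetical numbers, not real data)
rng = random.Random(42)
current = [rng.gauss(30, 5) for _ in range(50)]   # current system
new = [rng.gauss(27, 5) for _ in range(50)]       # new algorithm

# Difference in sample means (new - current); negative means faster deliveries
diff = mean(new) - mean(current)

# Standard error of the difference for two independent samples
se = sqrt(stdev(new) ** 2 / len(new) + stdev(current) ** 2 / len(current))

# Test statistic, treated as approximately standard normal under H0
z = diff / se

# One-sided p-value for H1: the new algorithm reduces delivery time
p_value = NormalDist().cdf(z)

# Two-sided 95% confidence interval for the difference in means
z_crit = NormalDist().inv_cdf(0.975)
ci = (diff - z_crit * se, diff + z_crit * se)

alpha = 0.05
print(f"difference = {diff:.2f} min, z = {z:.2f}, p = {p_value:.4f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

An interval that lies entirely below zero corresponds to the "2 to 5 minutes faster" reading in the example.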
Basics of probability
Probability is one of the foundational pillars of statistics and data science. It helps us quantify
uncertainty, make predictions, and even assess the validity of hypotheses. As data scientists, we
deal with probabilistic models and make decisions based on data-driven insights. In this section,
we’ll break down the essential concepts of basic probability and discuss how they are applied in
the world of data science.

1. What is Probability?
Probability is a numerical measure of the likelihood of an event happening. It’s expressed as a
number between 0 and 1, where:
●​ 0 means the event will never occur.
●​ 1 means the event will always occur.

Interactive Question:
If you roll a fair six-sided die, what is the probability of getting an even number?​
(Choices: 0, 1/6, 1/3, 1/2)
Answer: There are three even numbers (2, 4, 6) on a six-sided die. So the probability is:
P(even) = 3/6 = 1/2
Example: Flipping a Coin
Let’s consider a simple example:
Flipping a fair coin. The sample space is S={H,T}, where H stands for heads and T stands for
tails.
● The probability of flipping heads is P(H) = 0.5.
● Similarly, the probability of flipping tails is P(T) = 0.5.
You can visualize this by thinking of the coin as a 50/50 chance for either side. You may also use
this concept when modeling binary classification problems in machine learning.

2. Probability Space
To understand probability fully, we need to establish the probability space. This is a
mathematical framework for defining the probabilities of different outcomes and events. A
probability space consists of three components:

a) Sample Space (S)


The sample space is the set of all possible outcomes of an experiment. It contains every possible
result that can occur.
For instance, if you’re flipping a coin, the sample space is:
S = {H,T}

b) Event (E)
An event is any subset of the sample space. It represents a particular outcome or a set of
outcomes. For example:
●​ If you’re interested in the probability of getting a head on a coin flip, the event E is
{H}.
●​ If you’re interested in the coin landing either heads or tails, the event is the entire
sample space: E = S = {H,T}.
c) Probability Function (P):
The probability function P assigns a probability to each event in the sample space. This
probability must be between 0 and 1 for each event, and the probabilities of all individual
outcomes must sum to 1, so that:
P(S) = 1
For our coin example:
P(H) = 0.5, P(T) = 0.5,
P(S) = P(H) + P(T) = 0.5+0.5 = 1

3. Basic Probability Rules


Here we will explore two important probability rules that can help you solve more complex
problems: the Addition Rule and the Multiplication Rule.
a) Addition Rule
The Addition Rule is used when we want to find the probability of either one event or another
event occurring. In general, we add the individual probabilities and subtract the probability
that both occur, so that the overlap is not counted twice.
Addition Rule Formula:
P(A∪B) = P(A) + P(B) − P(A∩B)
Where:
● P(A∪B) is the probability that either event A or event B occurs.
● P(A∩B) is the probability that both events A and B occur at the same time.
If two events are mutually exclusive, P(A∩B) = 0, and the probability of either event happening
is simply the sum of their individual probabilities.
Example: Rolling a Die
If you roll a fair six-sided die and want to know the probability of getting a 3 or a 5, we have two
mutually exclusive events. Since rolling a 3 and rolling a 5 cannot happen at the same time:
P(3 or 5) = P(3) + P(5) = 1/6 + 1/6 = 2/6 =1/3

Interactive Question:
What is the probability of rolling a 4 or a 6 on a six-sided die?​
(Choices: 1/6, 1/3, 1/2, 2/3)
Answer: Since the events are mutually exclusive:
P(4 or 6) = P(4) + P(6) = 1/6 + 1/6 = 2/6 = 1/3
b) Multiplication Rule
The Multiplication Rule is used when we want to calculate the probability that two independent
events will both occur. If the events are independent (i.e., the occurrence of one event does not
affect the other), the probability of both events happening is the product of their individual
probabilities.
Multiplication Rule Formula:
P(A∩B) = P(A) × P(B)
Example: Flipping Two Coins
Let’s say you flip two fair coins.
The events A (getting heads on the first coin) and B (getting heads on the second coin) are
independent. To find the probability that both coins land heads, we use the multiplication rule:
P(A∩B) = P(Heads on 1st coin) × P(Heads on 2nd coin) = 1/2 × 1/2 = 1/4

Interactive Question: If you roll two fair dice, what is the probability that both dice show a 6?
(Choices: 1/36, 1/12, 1/6, 1/3)
Answer: The probability of rolling a 6 on the first die is 1/6, and the probability of rolling a 6
on the second die is also 1/6. So, the probability is:
P(6 on both dice) = 1/6 × 1/6 = 1/36
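
Both rules can be verified by listing every outcome and counting. The sketch below does this with exact fractions. The helper name prob and the event "greater than 3" are illustrative choices, not part of the original text.

```python
from fractions import Fraction
from itertools import product

def prob(event, sample_space):
    """Probability of an event when all outcomes are equally likely."""
    favourable = [o for o in sample_space if event(o)]
    return Fraction(len(favourable), len(sample_space))

# Addition rule on one die: A = "even", B = "greater than 3" (not mutually exclusive)
die = list(range(1, 7))
p_a = prob(lambda o: o % 2 == 0, die)
p_b = prob(lambda o: o > 3, die)
p_a_and_b = prob(lambda o: o % 2 == 0 and o > 3, die)
p_a_or_b = prob(lambda o: o % 2 == 0 or o > 3, die)
print(p_a_or_b, p_a + p_b - p_a_and_b)

# Mutually exclusive case from the text: rolling a 3 or a 5
p_3_or_5 = prob(lambda o: o in (3, 5), die)
print(p_3_or_5)

# Multiplication rule on two independent dice: both show a 6
two_dice = list(product(die, repeat=2))
p_double_six = prob(lambda o: o == (6, 6), two_dice)
print(p_double_six)
```

In the first case the overlap {4, 6} has to be subtracted, otherwise those outcomes would be counted twice.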
Probability Data Distributions
Understanding how data behaves is one of the first steps in data science. Before we dive into
building models or running analysis, we need to understand how the values in our dataset are
spread out and that’s where probability distributions come in.

Let us start with a simple example: If you roll a fair die, the chance of getting a 6 is 1 out of 6, or
16.67%. This is a basic example of a probability distribution, a way to describe the likelihood of
different outcomes.

When dealing with complex data like customer purchases, stock prices, or weather, probability
distributions help answer:
●​ What is most likely to happen?
●​ What are rare or unusual outcomes?
●​ Are values close together or spread out?
This helps us make better predictions and understand uncertainty.

Why Are Probability Distributions Important?


●​ Explain how data behaves (clustered or spread)
●​ Form the basis of machine learning models
●​ Used in statistical tests (e.g., p-value)
●​ Help identify outliers and make predictions
Before this, we need to understand random variables, which assign numbers to outcomes of
random events (e.g., rolling a die).
Random variables are:
Discrete: Only specific values (e.g., number of people)
Continuous: Any value in a range (e.g., height, temperature)

Key Components of Probability Distributions


Now that we understand random variables let's explore how we describe their probabilities
using three key concepts:
1. Probability Mass Function (PMF): Used for discrete variables (e.g., number of products
bought). It gives the probability of each exact value. For example, 25% of customers buy exactly
3 products.
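2. Probability Density Function (PDF): Used for continuous variables (e.g., amount
spent). It shows how probability is spread over a range of values. The probability of any single
exact value is zero, because a continuous variable has infinitely many possible values;
probabilities are found for intervals instead.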
3. Cumulative Distribution Function (CDF): Used for both types, it shows the probability
that a value is less than or equal to a certain number.
For example,
CDF(3) = 0.75 means 75% buy 3 or fewer products;
CDF($50) = 0.80 means 80% spend $50 or less.
For a continuous variable, the CDF is found by integrating the PDF:
F(x) = P(X ≤ x) = ∫[-∞,x] f(t)dt
Where F(x) is the CDF and f(t) is the PDF.

Types of Probability Distributions


Probability distributions can be divided into two main types based on the nature of the random
variables: discrete and continuous.
Discrete Data Distributions
A discrete distribution is used when the random variable can take on countable, specific values.
For example, when predicting the number of products a customer buys in a single order, the
possible outcomes are whole numbers like 0, 1, 2, 3, etc. You can't buy 2.5 products, so this is a
discrete random variable. Common discrete distributions include the following:
1. Binomial Distribution
The binomial distribution calculates the chance of getting a certain number of successes in a
fixed number of trials. For example, flipping a coin 10 times and counting heads.
●​ Number of trials: 10
●​ Two outcomes per trial: heads (success) or tails (failure)
●​ Probability of success (heads): 0.5
●​ Shows likelihood of getting 0 to 10 heads
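
The coin example can be worked out with the binomial formula P(X = k) = C(n, k) · p^k · (1 - p)^(n - k). This is a minimal sketch using the standard library; the function name binomial_pmf is an illustrative choice.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5  # 10 flips of a fair coin, as in the example
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Probability of each possible number of heads, 0 through 10
for k, prob_k in enumerate(pmf):
    print(k, round(prob_k, 4))
```

The eleven probabilities sum to 1, and 5 heads is the most likely count for a fair coin.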

2. Bernoulli Distribution
The Bernoulli distribution describes experiments with only one trial and two possible outcomes:
success or failure. It’s the simplest probability distribution. For example, flipping a coin once
and checking if it lands on heads.
●​ One trial only
●​ Two outcomes: heads (success) or tails (failure)
●​ Probability of success: 0.5
●​ Graph shows two bars representing success (1) and failure (0) with equal probabilities
3. Poisson Distribution
The Poisson distribution models the number of random events happening in a fixed time or
area. For example, counting how many customers enter a coffee shop per hour. It helps predict
the probability of seeing a specific number of events based on the average rate.
●​ Counts events in a fixed interval
●​ Average rate (e.g., 5 customers/hour) is known
●​ Calculates probability of exact counts (e.g., exactly 3 customers)
●​ Graph shows a curve centered around the average rate, tapering off for less likely
counts
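
For an average rate λ, the Poisson probability of exactly k events is P(X = k) = λ^k · e^(-λ) / k!. The sketch below applies it to the coffee shop figures in the bullets (λ = 5, k = 3); the function name poisson_pmf is illustrative.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam): lam^k * e^(-lam) / k!."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 5  # average of 5 customers per hour, as in the bullets

p_exactly_3 = poisson_pmf(3, lam)                         # exactly 3 customers
p_at_most_3 = sum(poisson_pmf(k, lam) for k in range(4))  # 0, 1, 2 or 3 customers
print(round(p_exactly_3, 4), round(p_at_most_3, 4))
```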

4. Geometric Distribution
The geometric distribution models the number of trials needed to get the first success in
repeated independent attempts. For example, how many emails you must send before a
customer makes a purchase. It helps predict the chance of success happening at each trial.
●​ Counts trials until first success
●​ Each trial is independent with fixed success probability
●​ Useful for questions like “How many emails until first purchase?”
● Graph shows a decreasing curve: fewer trials are more likely
Continuous Data Distributions
A continuous distribution is used when the random variable can take any value within a
specified range. For example, when we analyze how much money a customer spends in a store, the
amount can be any real number, including decimals like 25.75 or 50.23.
In continuous distributions the Probability Density Function (PDF) shows how the probabilities
are spread across the possible values. The area under the curve of this PDF represents the
probability of the random variable falling within a certain range. Now let's look at some types of
continuous probability distributions that are commonly used in data science:
1. Normal Distribution
The normal distribution, or bell curve, is one of the most common data distributions. Most
values cluster around the mean, with fewer values farther away, forming a symmetrical shape.
It’s perfect for modeling things like people’s heights.
●​ Mean is the center of the curve
●​ Symmetrical distribution (left and right sides mirror each other)
●​ Standard deviation shows how spread out the data is
●​ Smaller standard deviation means data is closer to the mean
2. Exponential Distribution
The exponential distribution models the time between events happening independently and
continuously. For example, the time between customer arrivals at a store. It helps predict how
long you might wait for the next event.
●​ Models waiting time between events
●​ Average time (e.g., 10 minutes between customers) defines the rate (λ)
●​ Events occur independently and continuously
●​ Useful for predicting time until next event
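
With an average of 10 minutes between customers, the rate is λ = 1/10 per minute and the waiting-time CDF is P(T ≤ t) = 1 - e^(-λt). The sketch below evaluates two waiting-time questions. The 5-minute and 20-minute thresholds are made-up values for illustration.

```python
from math import exp

mean_wait = 10.0      # average minutes between customers, from the example
lam = 1 / mean_wait   # rate λ in arrivals per minute

def exponential_cdf(t, lam):
    """P(T <= t) for T ~ Exponential(lam): 1 - e^(-lam * t)."""
    return 1 - exp(-lam * t)

p_within_5 = exponential_cdf(5, lam)             # next customer within 5 minutes
p_longer_than_20 = 1 - exponential_cdf(20, lam)  # wait exceeds 20 minutes
print(round(p_within_5, 4), round(p_longer_than_20, 4))
```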

While the exponential distribution focuses on waiting times sometimes we just need to model
situations where every outcome is equally likely. In that case we use the uniform distribution.

3. Uniform Distribution
The uniform distribution means every outcome in a range is equally likely. For example, rolling
a fair six-sided die or picking a random number between 0 and 1. It applies to both discrete and
continuous cases.
●​ All outcomes have equal probability
●​ Discrete example: rolling a die (1 to 6)
●​ Continuous example: random number between 0 and 1

4. Beta Distribution
In real-world problems, probabilities often change as we learn more. The Beta distribution
helps model this uncertainty and update beliefs with new data. For example, it can estimate the
chance a customer clicks an ad.
●​ Models changing probabilities between 0 and 1
●​ Parameters (α and β) control confidence and shape
●​ Commonly used in Bayesian stats and A/B testing
5. Gamma Distribution
The Gamma distribution models the total time needed for multiple independent events to
happen. It extends the exponential distribution to cover several tasks or events. For example,
estimating the total time to finish three project tasks with varying durations.
●​ Models total time for multiple events
●​ Shape parameter (κ) controls event count
●​ Scale parameter (θ) controls event duration

6. Chi-Square Distribution
The Chi-Square distribution is used in hypothesis testing to check relationships between
categorical variables. For example, testing if gender affects preference for coffee or tea. It helps
determine if observed differences are due to chance.
●​ Used for testing independence between categories
●​ Works with contingency tables
●​ Degrees of freedom depend on number of categories
7. Log-Normal Distribution
The Log-Normal distribution models data that grows multiplicatively over time, like stock
prices or income. If the logarithm of the data is normally distributed, the original data follows a
log-normal distribution. It only models positive values.
●​ Models multiplicative growth processes
●​ Data can’t be negative
●​ Commonly used for stock prices and incomes
Distribution | Key Features | Usage
Normal Distribution | Used to adjust data to make it easier to analyze and to find unusual values like errors or outliers. | Used for feature scaling, model assumptions and anomaly detection.
Exponential Distribution | Measures how long it takes for something to happen, like waiting for an event. | Helps to predict when a server might crash or how long it will take for customers to arrive at a store.
Uniform Distribution | Every possible outcome is equally likely; no outcome is more likely than another. | Used for picking random samples from a group.
Beta Distribution | Helps us to update our guesses about chances based on new information. | Useful for A/B testing (comparing two options) and figuring out how often people click on links.
Gamma Distribution | Measures the total time it takes for several events to happen one after another. | Helps to predict when systems might fail and assess risks in various situations.
Chi-Square Distribution | Checks if there is a relationship between different categories of data. | Helps in analyzing customer survey results to see if different groups have different opinions or behaviors.
Log-Normal Distribution | Shows how things grow over time, especially when growth happens in steps rather than all at once. | Used for predicting stock prices and understanding how income levels are distributed among people.
Binomial Distribution | Models the number of successes in multiple trials. | Useful for determining the probability of a certain number of successes in a fixed number of trials.
Bernoulli Distribution | Models a single trial with two outcomes (success/failure). | Mostly used in quality control to assess pass/fail situations.
Poisson Distribution | Finds the number of events occurring in a fixed interval of time or space. | Helps to predict the number of customer arrivals at a store during an hour.
Geometric Distribution | Helps to find the number of trials until the first success occurs. | Useful for understanding how many attempts it takes before achieving the first success, e.g., how many times you need to flip a coin before getting heads.
Central Limit Theorem (CLT)
The Central Limit Theorem (CLT) is a fundamental statistical principle that explains why the
normal distribution appears so frequently in nature and data analysis.
Core Idea
If you repeatedly take sufficiently large random samples from a population with finite variance
(regardless of its original shape) and calculate the mean of each sample, the distribution of
these sample means will be approximately normal (a bell curve).
Why is the CLT Important?
It allows statisticians to make inferences about a population using sample data, even when they
know nothing about the population's distribution. This is the foundation for confidence
intervals, hypothesis testing, and much of classical statistics.

Key Characteristics of the CLT


1.​ Mean of the Sample Means: The average of all the sample means will be equal to the
population mean.
○​ Formula: μ_X̄ = μ
2.​ Standard Error: The standard deviation of the sample means (called the "standard
error") gets smaller as the sample size increases. It is calculated as the population
standard deviation divided by the square root of the sample size.
○​ Formula: σ_X̄ = σ / √n
3.​ Shape: The distribution of sample means approximates a normal distribution as the
sample size (n) grows larger. This holds true even if the original population is not
normally distributed.

Central Limit Theorem Formula


Given a population with:
●​ Mean (μ)
●​ Standard Deviation (σ)
If you take samples of size n, the distribution of the sample means (X̄) will be approximately
normal:
X̄ ~ N(μ, σ/√n)
To work with this distribution, we convert a sample mean to a Z-score, which tells us how many
standard errors it is away from the population mean.
Z-Score Formula for Sample Means:​
Z = (X̄ - μ) / (σ / √n)
Where:
●​ X̄ is the sample mean.
●​ μ is the population mean.
●​ σ is the population standard deviation.
●​ n is the sample size.

Assumptions and Conditions


For the CLT to hold, the following conditions must be met:
●​ Random Sampling: The data must be collected using a random method.
●​ Independence: Sample observations must be independent of each other. (Often satisfied
if the sample size is less than 10% of the population).
●​ Sample Size: The sample size should be "sufficiently large." A common rule of thumb is n
> 30, but if the population is already normal, any sample size works. For highly skewed
populations, a larger n might be needed.
Solved Examples
Let's revisit a few examples to highlight the application of the formulas.
Example 1: Population weight: μ = 70 kg, σ = 15 kg. Sample size: n = 50.
● Mean of the Sample Means (μ_X̄): μ_X̄ = μ = 70 kg
● Standard Error (σ_X̄): σ_X̄ = σ / √n = 15 / √50 ≈ 2.12 kg
Example 2: Finding required sample size for a margin of error.
● Given: Confidence Level 95% (Z = 1.96), σ = 50, Margin of Error (E) = 5.
● Formula: n = ( (Z * σ) / E )²
● Calculation: n = ( (1.96 * 50) / 5 )² = (98 / 5)² = (19.6)² = 384.16
● Answer: Always round up for sample size. n = 385.
Example 3: Standard Error for Proportions.
●​ Given: p = 0.40, n = 100.
●​ Formula for Proportion SE: σ_p̂ = √[ p(1 - p) / n ]
●​ Calculation: σ_p̂ = √[ (0.40 * 0.60) / 100 ] = √(0.24 / 100) = √0.0024 ≈ 0.049
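
The three examples above reduce to a few lines of arithmetic. The sketch below reproduces them with the standard library; the variable names are illustrative.

```python
from math import ceil, sqrt

# Example 1: standard error of the mean, σ / √n
sigma, n = 15, 50
se_mean = sigma / sqrt(n)

# Example 2: sample size for a given margin of error, always rounded up
z, sigma_pop, margin = 1.96, 50, 5
n_required = ceil((z * sigma_pop / margin) ** 2)

# Example 3: standard error of a sample proportion, √[p(1 - p) / n]
p, n_prop = 0.40, 100
se_prop = sqrt(p * (1 - p) / n_prop)

print(round(se_mean, 2), n_required, round(se_prop, 3))
```

Using ceil rather than round matters in Example 2: rounding 384.16 down to 384 would give a margin of error slightly larger than the target.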

Applications in Computer Science (Summary)


● Performance Analysis: Average latency/response times from many requests converge to normal,
allowing use of confidence intervals to compare systems.
●​ A/B Testing: Conversion rates (which are proportions) become normally distributed
for large samples, validating Z-tests to determine if differences are significant.
●​ Monte Carlo Simulations: The average output of many random samples converges to
a normal distribution around the true value, providing reliable error bounds.
●​ Machine Learning:
○ Model Evaluation: Metrics like accuracy, averaged over large test sets, are approximately
normally distributed, enabling comparison via confidence intervals.
○​ Stochastic Gradient Descent (SGD): The noise in gradients calculated from
random mini-batches is approximately normal.
Problem 1: Z-Score Calculation
Given:
●​ Population Mean (μ) = 50
●​ Population Standard Deviation (σ) = 10
●​ Sample Mean (X̄) = 52
●​ Sample Size (n) = 25
Find: The Z-score for this sample mean.
Solution:
The Z-score formula for a sample mean according to the CLT is:
Z = (X̄ - μ) / (σ / √n)
Step 1: Calculate the Standard Error (σ_X̄).
σ_X̄ = σ / √n = 10 / √25 = 10 / 5 = 2
Step 2: Plug the values into the Z-score formula.
Z = (52 - 50) / 2 = 1
Final Answer: Z = 1

Interpretation: This sample mean of 52 is exactly 1 standard error above the population
mean of 50.

Problem 2: Standard Error


Given:
●​ Population Standard Deviation (σ) = 15
●​ Sample Size (n) = 50
Find: The Standard Error of the sample mean (σ_X̄).
Solution:
The formula for the standard error is:
σ_X̄ = σ / √n
Step 1: Plug the values into the formula.
σ_X̄ = 15 / √50
Step 2: Simplify and calculate.
σ_X̄ = 15 / 7.071 ≈ 2.12
Final Answer: σ_X̄ ≈ 2.12

Interpretation: The average amount that sample means (from samples of size 50) are
expected to vary from the true population mean is about 2.12 units.
Problem 3: 95% Confidence Interval
Given:
●​ Population Mean (μ) = 100
●​ Population Standard Deviation (σ) = 20
●​ Sample Size (n) = 36
●​ Z-score for 95% Confidence Level = 1.96
Find: The 95% confidence interval for the sample mean.
Solution:
The formula for a confidence interval is:
CI = μ ± Z × (σ / √n)
(Since we are given the population mean μ and are building the interval around the sample
mean's expected value, we use μ in the center).
Step 1: Calculate the Standard Error (σ_X̄).
σ_X̄ = σ / √n = 20 / √36 = 20 / 6 ≈ 3.333
Step 2: Calculate the Margin of Error (E).
E = Z × σ_X̄ = 1.96 × 3.333 ≈ 6.53
Step 3: Construct the Confidence Interval.
CI = 100 ± 6.53 = (93.47, 106.53)
Final Answer: (93.47, 106.53)

Interpretation: We are 95% confident that the mean of a random sample of size 36 taken
from this population will fall between 93.47 and 106.53.

Problem 4: Probability Using CLT


Given:
●​ Population Mean (μ) = 160 cm
●​ Population Standard Deviation (σ) = 10 cm
●​ Sample Size (n) = 25
●​ We are looking for: P(X̄ > 162)
Find: The probability that a random sample of 25 women has a mean height greater than 162
cm.
Solution:
Step 1: Find the Z-score for the sample mean of 162 cm.
First, calculate the Standard Error.
σ_X̄ = σ / √n = 10 / √25 = 10 / 5 = 2 cm
Now, calculate the Z-score.
Z = (X̄ - μ) / σ_X̄ = (162 - 160) / 2 = 1.0
Step 2: Understand what P(X̄ > 162) means in terms of Z.
P(X̄ > 162) = P(Z > 1.0)
Step 3: Find the probability using the Standard Normal Distribution (Z-table).
● The Z-table gives the area to the left of a Z-score.
● P(Z < 1.0) ≈ 0.8413
● Therefore, the area to the right is: P(Z > 1.0) = 1 - 0.8413 = 0.1587
Final Answer: P(X̄ > 162) ≈ 0.1587, or about 15.87%
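
Problems 1 to 4 can be checked with a few lines of Python. This is a minimal sketch that uses NormalDist from the standard library in place of a Z-table; the helper names standard_error and z_score are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def standard_error(sigma, n):
    """Standard error of the sample mean: σ / √n."""
    return sigma / sqrt(n)

def z_score(sample_mean, mu, sigma, n):
    """Z = (X̄ - μ) / (σ / √n)."""
    return (sample_mean - mu) / standard_error(sigma, n)

# Problem 1: Z-score for X̄ = 52 with μ = 50, σ = 10, n = 25
z1 = z_score(52, 50, 10, 25)

# Problem 2: standard error for σ = 15, n = 50
se2 = standard_error(15, 50)

# Problem 3: 95% interval centred on μ = 100 with σ = 20, n = 36
margin3 = 1.96 * standard_error(20, 36)
interval3 = (100 - margin3, 100 + margin3)

# Problem 4: P(X̄ > 162) with μ = 160, σ = 10, n = 25
z4 = z_score(162, 160, 10, 25)
p4 = 1 - NormalDist().cdf(z4)   # area to the right of z4

print(z1, round(se2, 2))
print(round(interval3[0], 2), round(interval3[1], 2))
print(round(p4, 4))
```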


Previous year solved:
VSA (1 Mark) Questions
1. Inferential statistics is a branch of statistics. True/False? (2023)​
Answer: True​
Explanation: Statistics is broadly divided into descriptive statistics (summarizing data) and
inferential statistics (making predictions/inferences about populations from samples).
2. What is the probability of getting a sum as 3 if two dice are thrown? (2024)​
Solution:​
Total outcomes = 6 × 6 = 36​
Favorable outcomes: (1,2), (2,1) = 2 outcomes​
Probability = 2/36 = 1/18​
Answer: 1/18
3. If two dice are thrown together, what is the probability of getting an even
number on one die and an odd number on the other die? (2024)​
Solution:​
Total outcomes = 36​
Case 1: First die even, second die odd​
Even numbers: 2,4,6 (3 options)​
Odd numbers: 1,3,5 (3 options)​
Combinations: 3 × 3 = 9
Case 2: First die odd, second die even​
Odd numbers: 1,3,5 (3 options)​
Even numbers: 2,4,6 (3 options)​
Combinations: 3 × 3 = 9
Total favorable = 9 + 9 = 18​
Probability = 18/36 = 1/2​
Answer: 1/2
4. The variance of binomial distribution is ______. (2024)​
Answer: npq or np(1-p)​
Explanation: For binomial distribution B(n,p), variance = np(1-p)
5. Normal distribution is continuous or discrete? (2024)​
Answer: Continuous
6. It is suitable to use binomial distribution only for ______. (2024)​
Answer: Discrete random variables
7. In binomial distribution, successive trials are ______. (2024)​
Answer: Independent
8. Classification of data according to location or areas is called ______. (2023)​
Answer: Geographical classification
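
Questions 2 and 3 above can be confirmed by listing all 36 outcomes of two dice and counting the favorable ones. The sketch below uses exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # all 36 ordered pairs (first, second)

# Q2: the two dice sum to 3
sum_is_3 = [o for o in outcomes if sum(o) == 3]
p_sum_3 = Fraction(len(sum_is_3), len(outcomes))

# Q3: one die shows an even number and the other an odd number
mixed = [(a, b) for a, b in outcomes if (a % 2) != (b % 2)]
p_mixed = Fraction(len(mixed), len(outcomes))

print(sum_is_3, p_sum_3)
print(len(mixed), p_mixed)
```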
SA (5 Marks) Questions - Detailed Solutions
9. What is p.m.f. and p.d.f.? Give one example for each. (2023)
Detailed Solution:
Probability Mass Function (PMF):
●​ Used for discrete random variables
●​ Gives the probability that a discrete random variable equals exactly some value
●​ Properties:
1.​ 0 ≤ P(X = x) ≤ 1 for all x
2.​ ΣP(X = x) = 1 (sum over all possible x)
Example of PMF:​
For a fair six-sided die:​
P(X = 1) = 1/6, P(X = 2) = 1/6, ..., P(X = 6) = 1/6​
This is a uniform discrete distribution.
Probability Density Function (PDF):
●​ Used for continuous random variables
●​ Doesn't give probability at a point (P(X = x) = 0)
●​ Gives probability over an interval: P(a ≤ X ≤ b) = ∫[a,b] f(x)dx
●​ Properties:
1.​ f(x) ≥ 0 for all x
2.​ ∫f(x)dx = 1 (over entire range)
Example of PDF:​
Standard normal distribution:​
f(x) = (1/√(2π))e^(-x²/2)​
P(0 ≤ X ≤ 1) = ∫[0,1] (1/√(2π))e^(-x²/2)dx ≈ 0.3413
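
The PDF example can be checked numerically. The sketch below integrates the standard normal density over [0, 1] with a simple midpoint rule and compares the result with the CDF from Python's statistics module. The helper integrate is an illustrative approximation, not a library function.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def std_normal_pdf(x):
    """f(x) = (1 / sqrt(2π)) * e^(-x² / 2)."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

area_numeric = integrate(std_normal_pdf, 0, 1)        # P(0 ≤ X ≤ 1) by integration
area_cdf = NormalDist().cdf(1) - NormalDist().cdf(0)  # same probability via the CDF
total_area = integrate(std_normal_pdf, -10, 10)       # tails beyond ±10 are negligible

print(round(area_numeric, 4), round(area_cdf, 4), round(total_area, 4))
```

The last value illustrates property 2: the total area under the density is 1.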

10. Difference between discrete & continuous random variables. (2023)


Aspect | Discrete Random Variables | Continuous Random Variables
Definition | Takes countable values | Takes uncountable values
Values | Specific, isolated points | Any value in an interval
Probability at a point | P(X = x) can be > 0 | P(X = x) = 0
Probability calculation | Summation: ΣP(X = x) | Integration: ∫f(x)dx
Distribution type | Probability Mass Function (PMF) | Probability Density Function (PDF)
Examples | Number of students, dice rolls | Height, weight, time
Graph | Probability histogram | Smooth curve
Cumulative distribution | F(x) = ΣP(X = t) over all t ≤ x | F(x) = ∫[-∞,x] f(t)dt
Mean | E[X] = Σx·P(X = x) | E[X] = ∫x·f(x)dx


Key Differences:
1.​ Countability: Discrete variables have countable outcomes; continuous variables have
uncountable outcomes.
2.​ Probability at points: For discrete variables, we can talk about probability at specific
points; for continuous variables, probability at any single point is zero.
3.​ Mathematical tools: Discrete uses summation; continuous uses integration.

11. How to check if a function is probability distribution or not? (2023)


Detailed Solution:
For Discrete Probability Distribution (PMF):​
A function P(x) is a valid probability distribution if:
1.​ Non-negativity: P(x) ≥ 0 for all possible values of x
2.​ Sum to 1: ΣP(x) = 1 (sum over all possible x values)
For Continuous Probability Distribution (PDF):​
A function f(x) is a valid probability distribution if:
1.​ Non-negativity: f(x) ≥ 0 for all x in its domain
2.​ Integral equals 1: ∫f(x)dx = 1 (over entire range)
Verification Process:​
Step 1: Check if all probabilities are non-negative​
Step 2: Check if sum/integral equals 1​
Step 3: If both conditions satisfied, it's a valid probability distribution
Example Verification:​
Check if P(x) = x/10 for x = 1, 2, 3, 4 is a valid distribution:
1.​ P(1) = 1/10 ≥ 0, P(2) = 2/10 ≥ 0, P(3) = 3/10 ≥ 0, P(4) = 4/10 ≥ 0 ✓
2.​ Sum = 1/10 + 2/10 + 3/10 + 4/10 = 10/10 = 1 ✓​
Therefore, it's a valid probability distribution.
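The two-step check can be sketched in a few lines. The function name `is_valid_pmf` and the counter-example are assumptions added for illustration. Exact fractions are used so that the "sum to 1" test is not affected by floating-point rounding.

```python
from fractions import Fraction

def is_valid_pmf(pmf: dict) -> bool:
    """Return True if every probability is non-negative and they sum to 1."""
    non_negative = all(p >= 0 for p in pmf.values())
    sums_to_one = sum(pmf.values()) == 1
    return non_negative and sums_to_one

# P(x) = x/10 for x = 1, 2, 3, 4 (the example above), stored as exact fractions.
example = {x: Fraction(x, 10) for x in range(1, 5)}
# A counter-example for contrast: P(x) = x/5 sums to 2, so it is not a PMF.
counter_example = {x: Fraction(x, 5) for x in range(1, 5)}

print(is_valid_pmf(example))
print(is_valid_pmf(counter_example))
```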

12. Find the probability that a leap year has 53 Sundays. (2024)
Detailed Solution:​
Step 1: Understand leap year structure​
A leap year has 366 days​
366 ÷ 7 = 52 weeks + 2 days​
So a leap year has 52 complete weeks plus 2 extra days.
Step 2: Identify the extra days combinations​
The 2 extra days can be any of these 7 combinations:
1.​ Sunday & Monday
2.​ Monday & Tuesday
3.​ Tuesday & Wednesday
4.​ Wednesday & Thursday
5.​ Thursday & Friday
6.​ Friday & Saturday
7.​ Saturday & Sunday
Step 3: Determine favorable cases for 53 Sundays​
For there to be 53 Sundays, at least one of the extra days must be Sunday.​
This happens in cases:
●​ Sunday & Monday (has Sunday)
●​ Saturday & Sunday (has Sunday)
So favorable cases = 2
Step 4: Calculate probability​
Total possible combinations of extra days = 7​
Favorable combinations = 2​
Probability = 2/7
Verification:​
We can think of it as: the extra days are equally likely to be any pair of consecutive days, and 2
out of 7 such pairs contain Sunday.
Answer: 2/7
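The counting argument can also be confirmed by brute force. This is a small sketch under the same assumption as the solution (each weekday is equally likely to be the first day of the year). The function name `sundays_in_year` is illustrative.

```python
from fractions import Fraction

DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

def sundays_in_year(first_day: int, n_days: int = 366) -> int:
    """Count Sundays in a year whose first day has index first_day (0 = Sunday)."""
    return sum(1 for d in range(n_days) if (first_day + d) % 7 == 0)

# Assumption: each of the 7 weekdays is equally likely to start the leap year.
favourable = [DAYS[s] for s in range(7) if sundays_in_year(s) == 53]
probability = Fraction(len(favourable), 7)

print(favourable)
print(probability)
```

A leap year starting on Sunday has the extra pair Sunday & Monday, and one starting on Saturday has Saturday & Sunday, which are the two favourable cases listed in Step 3.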

13. A fair coin is tossed n times... find the value of n. (2024)


Detailed Solution:​
Step 1: Understand the problem​
We have a fair coin tossed n times​
P(at least one head) = 63/64​
We need to find n.
Step 2: Use complementary probability​
P(at least one head) = 1 - P(no heads)​
P(no heads) = P(all tails)
Step 3: Calculate P(all tails)​
For a fair coin, P(tail) = 1/2​
P(all tails in n tosses) = (1/2)^n
Step 4: Set up equation​
1 - (1/2)^n = 63/64​
(1/2)^n = 1 - 63/64 = 1/64
Step 5: Solve for n​
(1/2)^n = 1/64​
(1/2)^n = (1/2)^6​
Therefore, n = 6
Verification:​
For n = 6:​
P(all tails) = (1/2)^6 = 1/64​
P(at least one head) = 1 - 1/64 = 63/64 ✓
Answer: n = 6
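As a quick check, the equation 1 - (1/2)^n = 63/64 can be solved by simple search. The loop below is only a sketch. It increases n until the probability of at least one head reaches the given value.

```python
from fractions import Fraction

target = Fraction(63, 64)  # given P(at least one head)

n = 1
# P(at least one head in n tosses) = 1 - (1/2)^n; increase n until it reaches the target.
while 1 - Fraction(1, 2) ** n < target:
    n += 1

print(n, 1 - Fraction(1, 2) ** n)
```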
LA (3 – 7 Marks) Questions - Detailed Solutions
14. A die is thrown twice. What is the probability that (i) 5 will not come up any
time? (ii) 5 will come up at least once? (5 Marks) (2024)
Detailed Solution:​
Step 1: Total possible outcomes​
When a die is thrown twice:​
Total outcomes = 6 × 6 = 36
(i) P(5 not coming any time)​
Step 2: Outcomes without 5​
If 5 cannot appear, each die has 5 possible outcomes (1,2,3,4,6)​
Favorable outcomes = 5 × 5 = 25
Step 3: Calculate probability​
P(no 5) = 25/36
(ii) P(5 comes at least once)​
Step 4: Use complementary probability​
P(at least one 5) = 1 - P(no 5) = 1 - 25/36 = 11/36
Alternative approach for (ii):​
Cases where 5 appears:
●​ 5 on first die only: 1 × 5 = 5 outcomes
●​ 5 on second die only: 5 × 1 = 5 outcomes
●​ 5 on both dice: 1 × 1 = 1 outcome​
Total = 5 + 5 + 1 = 11 outcomes​
P(at least one 5) = 11/36
Verification:​
25/36 + 11/36 = 36/36 = 1 ✓
Answer: (i) 25/36, (ii) 11/36
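Both parts can be verified by listing the full sample space. The sketch below enumerates the 36 ordered outcomes and counts them directly. The variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # all 36 ordered pairs

no_five = [o for o in outcomes if 5 not in o]
at_least_one_five = [o for o in outcomes if 5 in o]

p_no_five = Fraction(len(no_five), len(outcomes))
p_at_least_one_five = Fraction(len(at_least_one_five), len(outcomes))
print(p_no_five, p_at_least_one_five)
```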

15. A die is thrown once. What is the probability of getting a number less than 3?
(5 Marks) (2024)
Detailed Solution:​
Step 1: Total possible outcomes​
When a fair die is thrown once:​
Sample space = {1, 2, 3, 4, 5, 6}​
Total outcomes = 6
Step 2: Identify favorable outcomes​
Numbers less than 3: 1 and 2​
Favorable outcomes = 2
Step 3: Calculate probability​
P(number < 3) = 2/6 = 1/3
Alternative approach:​
We can also calculate using probability distribution:​
P(1) = 1/6, P(2) = 1/6​
P(1 or 2) = 1/6 + 1/6 = 2/6 = 1/3
Answer: 1/3
16. The probability of selecting a blue marble... find the total number of marbles in
the jar. (5 Marks) (2024)
Detailed Solution:​
Step 1: Set up the given information​
Let total marbles = n​
Number of blue marbles = 5​
P(selecting blue marble) = 1/3
Step 2: Use probability formula​
P(blue) = (Number of blue marbles) / (Total marbles)​
1/3 = 5/n
Step 3: Solve for n​
1/3 = 5/n​
n = 5 × 3 = 15
Step 4: Verify​
Total marbles = 15, blue marbles = 5​
P(blue) = 5/15 = 1/3 ✓
Answer: 15 marbles

17. Two coins are tossed 500 times, and we get: Two heads: 105, One head: 275, No
head: 120. Find the probability of each event. (5 Marks) (2024)
Detailed Solution:​
Step 1: Verify total trials​
105 + 275 + 120 = 500 ✓
Step 2: Calculate empirical probabilities​
P(two heads) = 105/500 = 0.21​
P(one head) = 275/500 = 0.55​
P(no head) = 120/500 = 0.24
Step 3: Compare with theoretical probabilities​
Theoretical probabilities for two fair coins:
●​ P(two heads) = 1/4 = 0.25
●​ P(one head) = 2/4 = 0.50
●​ P(no head) = 1/4 = 0.25
Step 4: Interpretation​
The empirical probabilities are close to theoretical probabilities, showing the law of large
numbers in action.
Answer: 0.21, 0.55, 0.24

18. The probability of Sanger winning... What is the probability that Rashmi will
win? (5 Marks) (2024)
Detailed Solution:​
Step 1: Define variables​
Let P(Sanger wins) = p​
Let P(Rashmi wins) = r
Step 2: Use given information​
"Sanger is twice as likely to win as Rashmi"​
This means: p = 2r
Step 3: Use total probability​
Since only two players: p + r = 1
Step 4: Solve system of equations​
Substitute p = 2r into p + r = 1:​
2r + r = 1​
3r = 1​
r = 1/3
Step 5: Find p​
p = 2r = 2 × (1/3) = 2/3
Step 6: Verify​
P(Sanger) + P(Rashmi) = 2/3 + 1/3 = 1 ✓​
P(Sanger) = 2 × P(Rashmi) ✓
Answer: P(Rashmi wins) = 1/3

19. In any 15-minute interval... What is the probability that you see at least one
shooting star in the period of an hour? (5 Marks) (2024)
Detailed Solution:​
Step 1: Understand the time intervals​
1 hour = 4 intervals of 15 minutes
Step 2: Need more information​
To solve this completely, we need:
●​ The probability of seeing at least one shooting star in a 15-minute interval, OR
●​ The average rate of shooting stars per 15 minutes
Step 3: General approach (assuming Poisson distribution)​
If λ = average number in 15 minutes, then for 1 hour: λ_hour = 4λ​
P(at least one in hour) = 1 - P(none in hour) = 1 - e^(-4λ)
Step 4: With specific numbers​
If given P(at least one in 15 min) = p, then:​
P(none in 15 min) = 1 - p​
P(none in hour) = (1 - p)^4​
P(at least one in hour) = 1 - (1 - p)^4
Answer: [Cannot be fully solved without the specific probability for 15-minute interval]
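The formula in Step 4 can be wrapped in a small function so that the answer follows as soon as the 15-minute probability is known. The value 0.2 below is a hypothetical input chosen purely for illustration. It is not taken from the question, and the four intervals are assumed to be independent.

```python
def p_at_least_one_in_hour(p_15min: float, intervals: int = 4) -> float:
    """P(at least one event in an hour), assuming independent 15-minute intervals."""
    return 1.0 - (1.0 - p_15min) ** intervals

# Hypothetical input for illustration only; the question's actual value is not given above.
illustrative_p = 0.2
print(round(p_at_least_one_in_hour(illustrative_p), 4))
```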

20. Determine the proportion of students with a 33 or higher (Normal Distribution). (7 Marks) (2025, 2023)
Detailed Solution:​
Step 1: Interpret the given information​
"8 jumps and 8 ytgums=65" likely means:​
μ = 65 (mean)​
σ = 8 (standard deviation)
Step 2: Standardize the value​
We want P(X ≥ 33)​
Z = (X - μ)/σ = (33 - 65)/8 = -32/8 = -4
Step 3: Find probability using standard normal​
P(X ≥ 33) = P(Z ≥ -4)​
From standard normal table:​
P(Z ≥ -4) ≈ 0.99997
Step 4: Interpretation​
This means approximately 99.997% of students scored 33 or higher.
Verification:​
In normal distribution, almost all values (99.99+%) lie within μ ± 4σ​
μ - 4σ = 65 - 32 = 33​
So P(X ≥ 33) should be very close to 1.
Answer: Approximately 0.99997 or 99.997%
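The table lookup can be reproduced with Python's standard library. This is a sketch that takes μ = 65 and σ = 8 exactly as interpreted in Step 1. If the original question intended different values, only the first line changes.

```python
from statistics import NormalDist

scores = NormalDist(mu=65, sigma=8)  # mean and standard deviation as interpreted above

z = (33 - scores.mean) / scores.stdev
p_33_or_higher = 1.0 - scores.cdf(33)

print(z)
print(round(p_33_or_higher, 5))
```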

21. What is the Central Limit theorem? (4 Marks) (2025)


Detailed Solution:
Central Limit Theorem Statement:​
Regardless of the shape of the population distribution, the sampling distribution of the sample
mean approaches a normal distribution as the sample size increases.
Key Points:
1.​ For large samples (n ≥ 30 typically), the distribution of sample means is approximately
normal
2. Works for any population distribution with finite variance - normal, skewed, uniform, etc.
3.​ Mean of sampling distribution equals population mean: μ_x̄ = μ
4.​ Standard error = σ/√n (standard deviation of sampling distribution)
Mathematical Form:​
If X₁, X₂, ..., Xₙ are independent, identically distributed random variables with mean μ and finite variance σ², then:
√n(X̄ - μ)/σ → N(0,1) as n → ∞
Importance:
●​ Foundation for inferential statistics
●​ Justifies use of normal distribution for confidence intervals and hypothesis tests
●​ Explains why normal distribution appears so frequently in nature
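A short simulation makes the theorem concrete. The sketch below is an illustration, not part of the syllabus. It draws repeated samples from a strongly skewed exponential population (an arbitrary choice) and checks that the sample means centre on μ with spread close to σ/√n. The sample size, repeat count, and seed are assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 40            # sample size
repeats = 5000    # number of samples drawn

# Exponential(1) is strongly right-skewed, with mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(repeats)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
standard_error = 1.0 / n ** 0.5  # sigma / sqrt(n)

print(round(mean_of_means, 3), round(sd_of_means, 3), round(standard_error, 3))
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the underlying population is far from normal.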

23. Probability Distribution Table & "at least 12 ships" probability (5 Marks)
(2023)
Detailed Solution:​
From the table:
Ships (X) | 10 | 11 | 12 | 13 | 14
P(X) | 0.4 | 0.2 | 0.2 | 0.1 | 0.1


Step 1: Verify it's a valid distribution​
Sum of probabilities = 0.4 + 0.2 + 0.2 + 0.1 + 0.1 = 1.0 ✓​
All probabilities ≥ 0 ✓
Step 2: Calculate P(X ≥ 12)​
P(X ≥ 12) = P(X = 12) + P(X = 13) + P(X = 14)​
= 0.2 + 0.1 + 0.1 = 0.4
Step 3: Additional calculations (if needed)​
Expected number of ships:​
E[X] = 10×0.4 + 11×0.2 + 12×0.2 + 13×0.1 + 14×0.1​
= 4 + 2.2 + 2.4 + 1.3 + 1.4 = 11.3
Answer: P(at least 12 ships) = 0.4
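The same three calculations can be written as one-line sums over the table. This is a minimal sketch. The dictionary name `ships_pmf` is illustrative, and results are rounded only to hide floating-point noise.

```python
# Probability table from the question: number of ships X and P(X).
ships_pmf = {10: 0.4, 11: 0.2, 12: 0.2, 13: 0.1, 14: 0.1}

total = sum(ships_pmf.values())
p_at_least_12 = sum(p for x, p in ships_pmf.items() if x >= 12)
expected_ships = sum(x * p for x, p in ships_pmf.items())

print(round(total, 10), round(p_at_least_12, 10), round(expected_ships, 10))
```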
24. Mention difference between Point estimation and Interval estimation. (5
Marks) (2023)
Detailed Solution:
Aspect | Point Estimation | Interval Estimation
Definition | Single value estimate of parameter | Range of values estimating parameter
Result | Specific number | Interval (L, U)
Precision | No measure of precision | Provides measure of precision
Reliability | Less reliable | More reliable
Example | x̄ = 25 as an estimate of μ | μ ∈ [23, 27] with 95% confidence
Usage | Quick estimates | Scientific research, detailed analysis
Formula | Sample mean x̄, sample proportion p̂ | x̄ ± Z_(α/2)×(σ/√n)
Interpretation | "The mean is about 25" | "We're 95% confident the mean is between 23 and 27"
Key Differences:
1.​ Specificity vs Range: Point estimates give one number; interval estimates give a range.
2.​ Precision Information: Interval estimates show how precise the estimate is through the
margin of error.
3.​ Confidence Level: Interval estimates include a confidence level; point estimates don't.
4.​ Applications: Point estimates for quick decisions; interval estimates for rigorous
scientific work.

25. What is the difference between Point Estimates and Confidence Interval? (5
Marks) (2024)
Detailed Solution:​
This is essentially the same as the previous question. The key differences are:
Point Estimate:
●​ Single value that estimates a population parameter
●​ Examples: Sample mean (x̄), sample proportion (p̂)
●​ No information about reliability or precision
●​ Simple to calculate and understand
Confidence Interval:
●​ Range of values that likely contains the population parameter
●​ Form: Point estimate ± Margin of error
●​ Includes confidence level (90%, 95%, 99%)
●​ Provides information about precision through the interval width
Example:​
If we sample 100 students' test scores:
●​ Point estimate: "The average score is 75"
●​ Confidence interval: "We're 95% confident the average score is between 72 and 78"
The confidence interval gives us both an estimate and information about how good that
estimate is.
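The relationship "point estimate ± margin of error" can be sketched as a small function. The sample mean 75 and n = 100 come from the example. The population standard deviation σ = 15 is an assumed value, chosen only so the numbers can be worked through, and the function name is illustrative.

```python
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Return (lower, upper) for x_bar ± z_(α/2) · σ/√n, with σ treated as known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / n ** 0.5
    return x_bar - margin, x_bar + margin

# x̄ = 75 and n = 100 come from the example; σ = 15 is an assumed value for illustration.
lower, upper = z_confidence_interval(x_bar=75, sigma=15, n=100)
print(round(lower, 2), round(upper, 2))
```

Under that assumption the interval is roughly 72 to 78. Raising the confidence level or lowering n widens it.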
Hypothesis testing
Definition and Purpose:
Hypothesis testing is a systematic statistical procedure used to make decisions about population
parameters based on sample data. It provides a structured framework for testing claims and
theories about real-world phenomena using empirical evidence.

Detailed Explanation:
●​ Statistical Inference: Hypothesis testing falls under statistical inference, where we
draw conclusions about populations from samples
●​ Decision Making: It helps researchers and analysts make objective decisions rather
than relying on intuition
●​ Scientific Method: Forms the backbone of the scientific method in experimental
research
●​ Risk Management: Allows quantification of decision risks through probability

Key Characteristics:
1. Evidence-Based: Relies on empirical data rather than assumptions
2. Probabilistic: Conclusions are stated in terms of probability, not certainty
3. Structured: Follows a well-defined step-by-step procedure
4. Objective: Provides standardized criteria for decision making

Real-World Applications:
- Pharmaceutical Industry: Testing drug efficacy and safety
- Manufacturing: Quality control and process improvement
- Marketing: Evaluating campaign effectiveness
- Healthcare: Medical treatment comparisons
- Social Sciences: Studying behavioral patterns
- Economics: Policy impact assessment

Example Scenario:
A company claims their battery lasts 100 hours. To verify this claim:
- We collect sample data from multiple batteries
- Use hypothesis testing to determine if evidence supports the claim
- Make a statistically sound decision about the claim's validity

Defining Hypotheses
1. Null Hypothesis (H₀)
Definition and Role:
The null hypothesis represents the default position or status quo that there is no effect, no
difference, or no relationship between variables. It serves as the starting assumption that we
test against.

Characteristics:
- Presumption of Innocence: Similar to "innocent until proven guilty" in legal systems
- Conservative Stance: Requires strong evidence to be rejected
- Equality Statement: Always contains equality operators (=, ≤, ≥)
- Testable: Must be specific and mathematically testable
Types of Null Hypotheses:
1. Simple Null Hypothesis: Specifies exact parameter value
- H₀: μ = 100 (population mean equals 100)
2. Composite Null Hypothesis: Specifies range of values
- H₀: μ ≤ 100 (population mean less than or equal to 100)
- H₀: μ ≥ 50 (population mean greater than or equal to 50)

Formulation Guidelines:
- Must be stated before data collection
- Should be clear and unambiguous
- Based on existing theory or previous research
- Must be falsifiable

Examples with Context:


1. Education: "New teaching method has no effect on test scores" (H₀: μ_new = μ_traditional)
2. Medicine: "Drug has no effect on blood pressure" (H₀: μ_drug = μ_placebo)
3. Business: "Price change has no effect on sales" (H₀: μ_sales_before = μ_sales_after)
4. Engineering: "New material has same strength as old material" (H₀: μ_strength_new =
μ_strength_old)

2. Alternative Hypothesis (H₁)


Definition and Purpose:
The alternative hypothesis represents the research hypothesis or the claim we want to prove. It
contradicts the null hypothesis and is accepted only when sufficient evidence exists against H₀.

Characteristics:
- Research Claim: Represents what the investigator wants to demonstrate
- Inequality Statement: Always contains inequality operators (≠, <, >)
- Complementary: Directly opposes the null hypothesis
- Evidence-Based: Requires statistical evidence for support

Types of Hypothesis Testing


One-Tailed Test - Comprehensive Analysis
A one-tailed test (directional test) examines whether a parameter is significantly greater than or
less than a hypothesized value, but not both.
When to Use:
●​ Theoretical predictions specify direction
●​ Only one direction is practically meaningful
●​ Previous research suggests specific direction

Types of One-Tailed Tests:


Left-Tailed Test:
Checks if the value is less than expected
●​ H₀: μ ≥ μ₀
●​ H₁: μ < μ₀
●​ Rejection Region: Left tail of distribution
●​ Critical Value: Negative value
Real-world Example - Quality Improvement:
●​ Context: Implementing new process to reduce manufacturing defects
●​ H₀: μ_defects ≥ 5% (No improvement or worse)
●​ H₁: μ_defects < 5% (Defects reduced)
●​ Rationale: Only reduction is beneficial

Right-Tailed Test:
Checks if the value is greater than expected.
●​ H₀: μ ≤ μ₀
●​ H₁: μ > μ₀
●​ Rejection Region: Right tail of distribution
●​ Critical Value: Positive value
Real-world Example - Sales Campaign:
●​ Context: New advertising campaign expected to increase sales
●​ H₀: μ_sales ≤ $100,000 (No improvement)
●​ H₁: μ_sales > $100,000 (Sales increased)
●​ Rationale: Only increase is desirable
Advantages:
●​ More powerful for detecting effects in specified direction
●​ Requires smaller sample size for same power
●​ Matches directional research questions
Disadvantages:
●​ Cannot detect effects in opposite direction
●​ Requires strong theoretical justification
●​ May miss important findings in unexpected directions

Two-Tailed Test
Definition and Purpose:​
A two-tailed test (non-directional test) examines whether a parameter differs from a
hypothesized value in either direction.
When to Use:
●​ No specific directional prediction
●​ Both directions are theoretically interesting
●​ Exploratory research
●​ Quality control applications
Test Structure:
●​ H₀: μ = μ₀
●​ H₁: μ ≠ μ₀
●​ Rejection Region: Both tails of distribution
●​ Critical Values: Both positive and negative values
Real-world Examples:
Example 1: Machine Calibration
●​ Context: Testing if production machine is properly calibrated
●​ H₀: μ_length = 10cm (Machine calibrated)
●​ H₁: μ_length ≠ 10cm (Machine needs adjustment)
●​ Rationale: Both over-size and under-size are problematic
Example 2: Drug Side Effects
●​ Context: Testing if drug affects blood pressure
●​ H₀: μ_bp_change = 0 (No effect)
●​ H₁: μ_bp_change ≠ 0 (Affects blood pressure)
●​ Rationale: Both increase and decrease are clinically important
Advantages:
●​ Detects effects in both directions
●​ More conservative approach
●​ Appropriate for exploratory research
Disadvantages:
●​ Less powerful for directional effects
●​ Requires larger sample size for same power
●​ May split evidence between two tails

Comparison Table: One-tailed vs Two-tailed Tests


Aspect | One-Tailed Test | Two-Tailed Test
Hypothesis | H₁: μ < μ₀ or μ > μ₀ | H₁: μ ≠ μ₀
Direction | Specific direction | Any direction
Power | Higher for specified direction | Lower for specific direction
Critical Region | One tail | Both tails
Sample Size | Smaller for same power | Larger for same power
Application | Confirmatory research | Exploratory research

Briefly point out errors in Hypothesis testing. (2025)


Introduction:​
Hypothesis testing involves making decisions about population parameters based on sample
data. Due to sampling variability, two types of errors can occur: Type I and Type II errors.
1. Type I Error (False Positive):​
Rejecting the null hypothesis when it is actually true.
Probability:
●​ Denoted by α (alpha)
●​ Equal to significance level of the test
●​ Typically set at 0.05, 0.01, or 0.10
Consequences:
●​ Concluding an effect exists when it doesn't
●​ Wasted resources on false findings
●​ Can lead to incorrect policy decisions
●​ Example: Approving ineffective drug, convicting innocent person
Control Methods:
●​ Set appropriate α level based on consequences
●​ Use multiple testing corrections
●​ Replicate findings in independent studies
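The claim that the Type I error probability equals α can be illustrated by simulation. In the sketch below, H₀ is true by construction, so every rejection is a false positive. The sample size, number of trials, and seed are arbitrary choices, and a two-tailed z-test with known σ = 1 is assumed.

```python
import random
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed so the run is repeatable

alpha = 0.05
n = 30
trials = 4000
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value

rejections = 0
for _ in range(trials):
    # H0 is true by construction: data come from N(0, 1) and we test mu = 0.
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    z = fmean(sample) / (1.0 / n ** 0.5)  # known sigma = 1
    if abs(z) > z_crit:
        rejections += 1

type_i_rate = rejections / trials
print(round(type_i_rate, 3))
```

The observed rejection rate should land near 0.05, with some run-to-run variation.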

2. Type II Error (False Negative):​


Failing to reject the null hypothesis when it is actually false.
Probability:
●​ Denoted by β (beta)
●​ Power of test = 1 - β
Consequences:
●​ Missing real effects or relationships
●​ Lost opportunities for discovery
●​ Can have serious implications in medical testing
●​ Example: Failing to detect effective treatment, missing security threats
Factors Affecting Type II Error:
●​ Sample size (larger n reduces β)
●​ Effect size (larger effects reduce β)
●​ Variability (less variability reduces β)
●​ Significance level (higher α reduces β)
3. Strategies to Minimize Errors:
For Type I Error:
●​ Use conservative significance levels
●​ Apply Bonferroni correction for multiple tests
●​ Pre-register study hypotheses
●​ Replicate findings
For Type II Error:
●​ Conduct power analysis for sample size
●​ Increase sample size
●​ Use more sensitive measurement tools
●​ Reduce variability through better design

4. Error Relationship and Trade-off:


Inverse Relationship:
●​ For fixed sample size, decreasing α increases β
●​ For fixed sample size, increasing α decreases β
●​ Both errors cannot be minimized simultaneously
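The inverse relationship can be made concrete for one simple case. The sketch below computes β for a right-tailed z-test with known σ. The function name and all numeric inputs (μ₀ = 100, true mean 103, σ = 10, n = 30) are hypothetical, chosen only to show the direction of the effect.

```python
from statistics import NormalDist

def type_ii_error(alpha, mu0, mu_true, sigma, n):
    """β for a right-tailed z-test of H0: μ ≤ μ0 when the true mean is mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    shift = (mu_true - mu0) / (sigma / n ** 0.5)
    # H0 is not rejected when Z < z_crit; under the true mean, Z is centred at `shift`.
    return std_normal.cdf(z_crit - shift)

# All numbers below are hypothetical, chosen only to show the direction of the trade-off.
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error(alpha, mu0=100, mu_true=103, sigma=10, n=30)
    print(alpha, round(beta, 3))
```

With n fixed, β rises as α falls. Calling the function with a larger n shows the one way to reduce both errors at once.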
Balancing Errors:
●​ High-stakes decisions: Use lower α (0.01)
●​ Exploratory research: May use higher α (0.10)
●​ Consider relative costs of each error type
5. Other Potential Errors:
Implementation Errors:
●​ Incorrect test selection
●​ Violation of assumptions
●​ Data collection errors
●​ Computational mistakes
Interpretation Errors:
●​ Confusing statistical and practical significance
●​ Overgeneralizing results
●​ Ignoring effect sizes
●​ Misunderstanding p-values

Hypothesis testing procedure:


The 5-Step Hypothesis Testing Procedure
Step 1: Define the Hypotheses
Null Hypothesis (H₀): State the "no effect" or "no difference" claim. (e.g., H₀: μ = 100).
Alternative Hypothesis (H₁): State the research hypothesis you want to prove. It can be
two-tailed (≠) or one-tailed (> or <).

Step 2: Choose the Significance Level (α)


●​ Select a threshold for rejecting H₀. The common choice is α = 0.05.
●​ This is the probability of a Type I error (falsely rejecting a true H₀).

Step 3: Collect Data and Calculate the Test Statistic


●​ Collect a sample and compute a sample statistic (e.g., mean).
● Calculate the relevant test statistic (e.g., t-score, z-score) using your data and the
formula for your test.

Step 4: Make a Decision


Use one of two methods:
p-value Method:
●​ If p-value ≤ α: Reject the null hypothesis (H₀).
●​ If p-value > α: Fail to reject the null hypothesis (H₀).
Critical Value Method:
●​ Compare your test statistic to the critical value from a statistical table.
●​ If the test statistic is in the "rejection region," reject H₀.

Step 5: State a Conclusion


Interpret the statistical decision in the context of the original research question.
Example: "There is significant evidence to conclude that the new teaching method improves
scores," or "There is not enough evidence to say the average battery life is different from 100
hours."

Making a Decision in Hypothesis Testing


The decision-making step in hypothesis testing involves comparing the calculated test statistic
or p-value with predetermined thresholds to determine whether to reject or fail to reject the null
hypothesis. This step transforms statistical calculations into actionable conclusions.
Decision Framework:
●​ Based on evidence from sample data
●​ Uses predetermined significance level (α)
●​ Considers both statistical and practical significance
●​ Leads to one of two possible conclusions: reject H₀ or fail to reject H₀
Key Decision Rules:
1.​ If test statistic falls in critical region → Reject H₀
2.​ If p-value ≤ α → Reject H₀
3.​ If test statistic does not fall in critical region → Fail to reject H₀
4.​ If p-value > α → Fail to reject H₀
Interpretation Guidelines:
●​ Reject H₀: Sufficient evidence against null hypothesis
●​ Fail to reject H₀: Insufficient evidence against null hypothesis
●​ Never "accept" H₀ - only fail to find evidence against it
●​ Conclusions are probabilistic, not absolute
Practical Considerations:
●​ Consider effect size alongside statistical significance
●​ Evaluate practical importance of findings
●​ Check assumptions were met
●​ Consider study limitations and power
Example Decision Statements:​
"At α=0.05, we reject the null hypothesis (t=2.45, p=0.018), concluding there is significant
evidence that the new method improves performance."
"At α=0.05, we fail to reject the null hypothesis (t=1.23, p=0.224), finding insufficient evidence
that the treatment has an effect."
Common Mistakes to Avoid:
●​ Confusing statistical significance with practical importance
●​ Claiming to "prove" the null hypothesis
●​ Ignoring effect size and confidence intervals
●​ Not considering study power and limitations

Critical Value Method


The critical value method involves comparing the calculated test statistic to predetermined
critical values that define the rejection region boundaries based on the chosen significance level
and sampling distribution.

Key Components:
1.​ Critical Value: Threshold value from statistical distribution
2.​ Rejection Region: Range of values leading to H₀ rejection
3. Non-rejection (Acceptance) Region: Range of values for which H₀ is not rejected
Procedure:
1.​ Determine significance level α
2.​ Identify appropriate sampling distribution
3.​ Find critical value(s) from statistical tables
4.​ Calculate test statistic from sample data
5.​ Compare test statistic to critical value(s)
6.​ Make decision based on comparison

Critical Values for Common Tests:


Z-test (Two-tailed, α=0.05):
●​ Critical values: ±1.96
●​ Rejection region: Z < -1.96 or Z > 1.96

t-test (Two-tailed, α=0.05, df=20):


●​ Critical values: ±2.086
●​ Rejection region: t < -2.086 or t > 2.086
One-tailed Tests:
●​ Left-tailed: Reject if test statistic < -critical value
●​ Right-tailed: Reject if test statistic > critical value
Advantages:
●​ Clear, predetermined decision boundaries
●​ Easy to implement with statistical tables
●​ Provides visual understanding of rejection regions
●​ Works well with standard significance levels
Disadvantages:
●​ Requires access to statistical tables
●​ Less precise than p-value method
●​ Difficult for non-standard α levels
●​ Doesn't provide strength of evidence measure
Example Application:​
Testing H₀: μ = 100 vs H₁: μ ≠ 100
●​ α = 0.05, two-tailed test
●​ Critical values: ±1.96
●​ Calculated Z = 2.15
●​ Decision: Since 2.15 > 1.96, reject H₀
Practical Implementation:
●​ Use statistical software for accurate critical values
●​ Consider degrees of freedom for t-tests
●​ Verify distributional assumptions are met
●​ Report both test statistic and critical value
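The procedure can be sketched for a z-test, with the critical value taken from the standard normal quantile function instead of a printed table. The function name `critical_value_decision` and its `tail` argument are illustrative choices, not a standard API.

```python
from statistics import NormalDist

def critical_value_decision(z_stat, alpha=0.05, tail="two"):
    """Return (critical value, reject?) for a z-test; tail is 'two', 'left' or 'right'."""
    std_normal = NormalDist()
    if tail == "two":
        crit = std_normal.inv_cdf(1 - alpha / 2)
        return crit, abs(z_stat) > crit
    crit = std_normal.inv_cdf(1 - alpha)
    if tail == "right":
        return crit, z_stat > crit
    return -crit, z_stat < -crit  # left-tailed: the critical value is negative

crit, reject = critical_value_decision(2.15, alpha=0.05, tail="two")
print(round(crit, 2), reject)
```

For a t-test the same comparison applies, but the critical value depends on the degrees of freedom and would need a t-distribution table or a statistics library.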

p-Value Method
The p-value method involves calculating the probability of obtaining test results at least as
extreme as the observed results, assuming the null hypothesis is true, and comparing this
probability to the significance level.
Interpretation:
●​ p-value: Probability of observed data (or more extreme) if H₀ true
●​ Small p-value: Strong evidence against H₀
●​ Large p-value: Weak evidence against H₀
Decision Rules:
●​ If p-value ≤ α → Reject H₀
●​ If p-value > α → Fail to reject H₀
p-Value Calculation:
●​ For two-tailed tests: p = 2 × P(Test statistic ≥ |observed|)
●​ For one-tailed tests: p = P(Test statistic ≥ observed) [or ≤ for left-tailed]
●​ Obtained from statistical software, tables, or calculations
Interpretation Guidelines:
●​ p > 0.10: Little or no evidence against H₀
●​ 0.05 < p ≤ 0.10: Weak evidence against H₀
●​ 0.01 < p ≤ 0.05: Evidence against H₀
●​ p ≤ 0.01: Strong evidence against H₀
●​ p ≤ 0.001: Very strong evidence against H₀
Advantages:
●​ Provides exact measure of evidence against H₀
●​ Allows comparison across different studies
●​ More informative than critical value method
●​ Works with any significance level
●​ Facilitates meta-analysis
Disadvantages:
●​ Often misinterpreted as probability H₀ is true
●​ Can lead to "p-hacking" behaviors
●​ Doesn't indicate effect size or practical importance
●​ Sensitive to sample size
Common Misinterpretations:
●​ p-value is NOT the probability that H₀ is true
●​ p-value is NOT the probability that H₁ is false
●​ p-value is NOT the effect size magnitude
●​ p-value is NOT the clinical importance
Example Application:​
Testing H₀: μ = 50 vs H₁: μ > 50
●​ α = 0.05
●​ Calculated p-value = 0.032
●​ Decision: Since 0.032 < 0.05, reject H₀
●​ Interpretation: There is significant evidence that μ > 50
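The p-value formulas can be sketched for a z statistic as follows. The function name is illustrative. The statistic Z = 1.85 is a hypothetical value, chosen because its right-tailed p-value is close to the 0.032 used in the example.

```python
from statistics import NormalDist

def z_p_value(z_stat, tail="two"):
    """p-value of a z statistic; tail is 'two', 'left' or 'right'."""
    cdf = NormalDist().cdf
    if tail == "two":
        return 2 * (1 - cdf(abs(z_stat)))
    if tail == "right":
        return 1 - cdf(z_stat)
    return cdf(z_stat)  # left-tailed

# Z = 1.85 is a hypothetical statistic, chosen because its right-tailed p-value is about 0.032.
p = z_p_value(1.85, tail="right")
print(round(p, 3), p <= 0.05)
```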

P-value in Hypothesis testing


● The table below summarises the four possible outcomes of a hypothesis test, including the two kinds of error that can occur when the decision is based on the p-value.

Truth / Decision | Fail to reject H₀ | Reject H₀
H₀ true | Correct decision (probability 1 - α) | Type I error (probability α)
H₀ false | Type II error (probability β) | Correct decision (probability 1 - β, the power)
Two-Sample Mean Test
A Two-Sample Mean Test is a statistical method used to compare the means of two independent
groups to determine whether there is a significant difference between them.
When to Use
• Comparing two distinct groups (e.g., male vs. female, control vs. treatment)​
• Groups are independent of each other​
• The variable of interest is continuous (e.g., weight, marks, temperature)​
• The goal is to test if the mean difference is statistically significant
Common Tests
1. Independent Samples t-test – for equal variances​
2. Welch’s t-test – for unequal variances​
3. Z-test – for large samples (n > 30)
Hypotheses
H₀: μ₁ = μ₂ → No difference between group means​
H₁: μ₁ ≠ μ₂, μ₁ > μ₂, or μ₁ < μ₂
Assumptions
• Independent observations​
• Normally distributed data in both groups​
• Equal population variances (for standard t-test)​
• Random sampling from populations
Test Statistic Formula
t = (x̄₁ - x̄₂) / √[s²p(1/n₁ + 1/n₂)]​
where s²p = [(n₁-1)s²₁ + (n₂-1)s²₂] / (n₁ + n₂ - 2)​

df = n₁ + n₂ - 2
Example Scenario
Comparing exam scores of students under two teaching methods.​

Traditional: n=30, x̄=75, s=8​
New Method: n=35, x̄=82, s=9​

H₀: μ₁ = μ₂​
H₁: μ₁ ≠ μ₂​

Calculation:​
s²p = [(29)(8²) + (34)(9²)] / 63 = 4610/63 ≈ 73.17
t = (75 - 82)/√[73.17(1/30 + 1/35)] ≈ -3.29
df = 30 + 35 - 2 = 63

Interpretation: Since |t| = 3.29 exceeds the two-tailed critical value (about 2.00 for df = 63 at α = 0.05), reject H₀. There is a significant difference between the two teaching methods.
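The pooled-variance formula can be evaluated directly from the summary statistics. The sketch below assumes equal population variances, as the standard t-test does. The function name `pooled_t` is illustrative.

```python
import math

def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """Return (pooled variance, t statistic, degrees of freedom) from summary statistics."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    t = (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return sp2, t, df

# Summary statistics from the example scenario (Traditional vs. New Method).
sp2, t, df = pooled_t(75, 8, 30, 82, 9, 35)
print(round(sp2, 2), round(t, 2), df)
```

The resulting |t| is then compared with the t critical value for the computed degrees of freedom.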
Effect Size Measures
Cohen’s d = (x̄₁ - x̄₂)/s_pooled​
Glass’s Δ = (x̄₁ - x̄₂)/s_control​

Example: If d = 0.8, it indicates a large effect size, meaning the difference is practically
meaningful.
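Both effect-size formulas can be applied to the teaching-method example. In this sketch the Traditional group is assumed to be the control group for Glass's Δ, which the notes do not state. The function names are illustrative.

```python
import math

def cohens_d(mean1, s1, n1, mean2, s2, n2):
    """Cohen's d using the pooled standard deviation of the two groups."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(sp2)

def glass_delta(mean1, mean2, s_control):
    """Glass's Δ: the mean difference scaled by the control group's standard deviation."""
    return (mean1 - mean2) / s_control

# Assumption: the Traditional group (s = 8) is treated as the control group.
d = cohens_d(75, 8, 30, 82, 9, 35)
delta = glass_delta(75, 82, s_control=8)
print(round(d, 2), round(delta, 3))
```

The sign only reflects which group is listed first. The magnitude, a little over 0.8 here, is what is read as a large effect.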
Practical Considerations
• Verify equal variance assumption (Levene’s Test)​
• Use Welch’s t-test if variances are unequal​
• Report confidence intervals for the mean difference​
• Discuss both statistical and practical significance

Two-Sample Proportion Test


A Two-Sample Proportion Test compares the proportions (percentages) of a categorical
outcome between two independent groups to determine if they differ significantly.
When to Use
• Comparing binary outcomes (success/failure, yes/no)​
• Groups are independent​
• Each sample is sufficiently large (np ≥ 5, n(1−p) ≥ 5)​
• Interest lies in proportion differences
Common Tests
1. Z-test for Two Proportions – for large samples​
2. Chi-square test of independence – for categorical data​
3. Fisher’s Exact Test – for small samples
Hypotheses
H₀: p₁ = p₂ → No difference in proportions​
H₁: p₁ ≠ p₂, p₁ > p₂, or p₁ < p₂
Assumptions
• Independent samples​
• Binary or categorical outcomes​
• Random sampling​
• Large enough sample sizes
Test Statistic Formula
Z = (p̂₁ - p̂₂) / √[p̂(1-p̂)(1/n₁ + 1/n₂)]​
where p̂ = (x₁ + x₂) / (n₁ + n₂)

Example Scenario
Comparing conversion rates between two website designs.​

Design A: 120 conversions out of 1000 visitors (p̂=0.12)​
Design B: 150 conversions out of 1000 visitors (p̂=0.15)​

p̂ = (120 + 150)/2000 = 0.135​
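Z = (0.12 - 0.15)/√[0.135(1-0.135)(1/1000 + 1/1000)] = -0.03/0.01528 ≈ -1.96 (more precisely, -1.963)

Interpretation: At α = 0.05 (two-tailed) the critical values are ±1.96. Here |Z| ≈ 1.963 only just exceeds 1.96 (p ≈ 0.0496), so H₀ is rejected, but the result is borderline and should be reported together with a confidence interval for the difference.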
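Because this example sits very close to the critical value, it is worth computing the statistic without intermediate rounding. The sketch below evaluates the pooled two-proportion z statistic and its two-tailed p-value from the raw counts. The function name is illustrative.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Return (z, two-tailed p-value) for H0: p1 = p2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Counts from the example: Design A 120/1000, Design B 150/1000.
z, p_value = two_proportion_z(120, 1000, 150, 1000)
print(round(z, 3), round(p_value, 3))
```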
Effect Size Measures
Risk Difference = p̂₁ - p̂₂​
Relative Risk = p̂₁ / p̂₂​
Odds Ratio = [p̂₁/(1-p̂₁)] / [p̂₂/(1-p̂₂)]​

Example: If Relative Risk = 0.8 → Group 1’s success rate is 80% of Group 2’s.
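The three measures can be computed together for the website-design example. This is a minimal sketch. The function name `effect_sizes` is illustrative.

```python
def effect_sizes(p1, p2):
    """Return risk difference, relative risk and odds ratio for two proportions."""
    risk_difference = p1 - p2
    relative_risk = p1 / p2
    odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))
    return risk_difference, relative_risk, odds_ratio

# Proportions from the website-design example: Design A = 0.12, Design B = 0.15.
rd, rr, odds = effect_sizes(0.12, 0.15)
print(round(rd, 3), round(rr, 3), round(odds, 3))
```

Here the relative risk works out to 0.8, so Design A converts at 80% of Design B's rate.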
Confidence Interval for Difference
CI = (p̂₁ - p̂₂) ± Zα/2 √[p̂₁(1-p̂₁)/n₁ + p̂₂(1-p̂₂)/n₂]​
Provides a range of plausible differences, giving more insight than p-value alone.
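The interval formula can be sketched as below, again using the website-design proportions. Note that the interval uses the unpooled standard error, unlike the test statistic. The function name is illustrative.

```python
import math
from statistics import NormalDist

def diff_proportion_ci(p1, n1, p2, n2, confidence=0.95):
    """Confidence interval for p1 - p2 using the unpooled standard error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Proportions from the website-design example.
low, high = diff_proportion_ci(0.12, 1000, 0.15, 1000)
print(round(low, 4), round(high, 4))
```

The upper end of this interval lies almost exactly at zero, which shows how marginal the evidence for a difference is in this example.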
Applications
• Marketing: Comparing conversion or click rates​
• Medicine: Comparing recovery rates between treatments​
• Quality Control: Comparing defect rates between machines​
• Social Sciences: Comparing response or agreement rates
Special Cases
• Small samples: Use Fisher’s Exact Test​
• Multiple categories: Use Chi-square test​
• Paired samples: Use McNemar’s Test
Reporting Results
• Sample sizes and proportions​
• Test statistic and p-value​
• Confidence intervals​
• Effect size measures​
• Any assumption violations or limitations

Previous Year Questions


VSA (1 Mark Questions)
1. The probability of Type 1 error is referred as ________. (2025)​
Answer: Significance level (α)
Explanation:
●​ Type I error occurs when we reject a true null hypothesis
●​ The probability of committing Type I error is denoted by α (alpha)
●​ Commonly set at 0.05, 0.01, or 0.10
●​ Represents the risk of false positive conclusions
2. Failing to reject the null hypothesis when it is false is called ________. (2023)​
Answer: Type II error (β)
Explanation:
●​ Type II error occurs when we fail to reject a false null hypothesis
●​ Denoted by β (beta)
●​ Represents the risk of false negative conclusions
●​ Power of test = 1 - β (probability of correctly rejecting false H₀)

SA (5 Marks Questions)
3. What is the significance of p-value? (2024)
Definition:​
The p-value is the probability of obtaining test results at least as extreme as the observed
results, assuming the null hypothesis is true.
Significance and Interpretation:
1. Measure of Evidence Against H₀:
●​ Small p-value indicates strong evidence against null hypothesis
●​ Large p-value indicates weak evidence against null hypothesis
●​ Provides quantitative measure of statistical evidence
2. Decision Making Tool:
●​ Compare p-value with significance level (α)
●​ If p ≤ α → Reject H₀ (statistically significant)
●​ If p > α → Fail to reject H₀ (not statistically significant)
3. Strength of Evidence Guidelines:
● p > 0.10: Little or no evidence against H₀
● 0.05 < p ≤ 0.10: Weak evidence against H₀
● 0.01 < p ≤ 0.05: Evidence against H₀
● 0.001 < p ≤ 0.01: Strong evidence against H₀
● p ≤ 0.001: Very strong evidence against H₀
4. Practical Applications:
●​ Helps researchers make objective decisions
●​ Allows comparison across different studies
●​ Provides basis for statistical inference
●​ Used in scientific research, quality control, medical trials
5. Important Notes:
●​ p-value ≠ Probability that H₀ is true
●​ p-value ≠ Effect size magnitude
●​ Should be interpreted with confidence intervals
●​ Consider practical significance alongside statistical significance
Regression Analysis

Fundamentals of Regression Analysis


Definition and Purpose:​
Regression analysis is a statistical method that examines the relationship between a dependent
variable and one or more independent variables. It models the connection between variables to
make predictions and understand relationships.
Key Components:
Dependent Variable (Y):
●​ The outcome or response variable being predicted
●​ Also called response variable, outcome variable
●​ Examples: Sales revenue, student test scores, patient recovery time
Independent Variable(s) (X):
●​ Predictor or explanatory variables
●​ Used to predict or explain the dependent variable
●​ Examples: Advertising budget, study hours, medication dosage

Types of Regression Analysis:


1. Simple Linear Regression:
●​ One independent variable predicting one dependent variable
●​ Model: Y = β₀ + β₁X + ε
●​ Example: Predicting house price based on square footage
2. Multiple Linear Regression:
● Two or more independent variables
● Model: Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
● Example: Predicting salary based on education, experience, and age
3. Logistic Regression:
●​ Used for binary outcomes (0/1, yes/no)
●​ Predicts probability of category membership
●​ Example: Predicting whether a customer will buy a product

The Regression Model:​


Y = β₀ + β₁X + ε
Where:
●​ Y = Dependent variable
●​ X = Independent variable
●​ β₀ = Intercept (value of Y when X=0)
●​ β₁ = Slope (change in Y for one-unit change in X)
●​ ε = Error term (random variability)

Parameter Interpretation:
Intercept (β₀):
●​ Expected value of Y when all X variables are zero
●​ May or may not have practical interpretation
●​ Example: Base salary when experience and education are zero
Slope Coefficient (β₁):
●​ Represents the change in Y for a one-unit change in X
●​ Holding other variables constant (in multiple regression)
●​ Example: Expected increase in sales for each $1000 increase in advertising

Goodness of Fit Measures:


R-squared (R²):
●​ Proportion of variance in Y explained by X variables
●​ Ranges from 0 to 1 (0% to 100%)
●​ Higher values indicate better fit
●​ Formula: R² = 1 - (SS_residual / SS_total)
Adjusted R-squared:
●​ Modified version that penalizes adding unnecessary variables
●​ More appropriate for multiple regression
●​ Prevents overfitting
Standard Error of Estimate:
●​ Average distance that observed values fall from regression line
●​ Smaller values indicate better predictions
Practical Applications:
●​ Economics: Predicting GDP growth
●​ Marketing: Forecasting sales
●​ Healthcare: Predicting patient outcomes
●​ Education: Estimating student performance
●​ Finance: Risk assessment and stock prediction

Assumptions of Regression Analysis


Introduction:​
Regression analysis relies on several key assumptions. Violation of these assumptions can lead
to biased, inefficient, or misleading results.
Core Assumptions:
1. Linearity
●​ Relationship between X and Y must be linear
●​ Check: Scatter plots, residual plots
●​ Fix: Variable transformations, polynomial terms
2. Independence of Errors
●​ Residuals should not be correlated with each other
●​ Check: Durbin-Watson test (ideal ≈ 2)
●​ Fix: Time series models, robust standard errors
3. Homoscedasticity
●​ Constant variance of errors across all X values
●​ Check: Residual vs fitted plots (no funnel shape)
●​ Fix: Variable transformations, weighted least squares
4. Normality of Errors
●​ Residuals should follow normal distribution
●​ Check: Q-Q plots, Shapiro-Wilk test
●​ Fix: Data transformations, non-parametric methods
5. No Multicollinearity
●​ Predictor variables should not be highly correlated
●​ Check: Variance Inflation Factor (VIF < 10)
●​ Fix: Remove correlated variables, PCA
6. No Endogeneity
●​ X variables should not correlate with error term
●​ Check: Theoretical analysis, Hausman test
●​ Fix: Instrumental variables, fixed effects models
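As one concrete diagnostic, the Durbin-Watson statistic for the independence assumption can be computed directly from the residuals as Σ(eₜ - eₜ₋₁)² / Σeₜ². The sketch below uses made-up residual sequences purely for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 suggests no first-order autocorrelation."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals that drift slowly (positive autocorrelation).
trending = [1, 1, 1, -1, -1, -1]
# Hypothetical residuals that flip sign every step (negative autocorrelation).
alternating = [1, -1, 1, -1]

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
print(dw_trending, dw_alternating)
```

Values well below 2 point toward positive autocorrelation and values well above 2 toward negative autocorrelation; formal decisions use tabulated bounds that depend on sample size and number of predictors.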

Practical Example - Housing Price Model:​


Assumptions check for: Price = β₀ + β₁×Size + β₂×Bedrooms + ε
1.​ Linearity: Scatter plots of Price vs Size, Price vs Bedrooms
2.​ Independence: Durbin-Watson test on residuals
3.​ Homoscedasticity: Residuals vs Predicted plot
4.​ Normality: Q-Q plot of residuals
5.​ Multicollinearity: VIF for Size and Bedrooms
Consequences of Violations:
●​ Biased coefficient estimates
●​ Incorrect standard errors
●​ Invalid hypothesis tests
●​ Poor predictions
●​ Misleading conclusions

3. Accuracy and Validity in Regression Analysis


Introduction:​
Ensuring accuracy and validity is crucial for drawing reliable conclusions from regression
analysis. This involves assessing model performance and verifying results.
Measures of Accuracy:
1. R-squared (Coefficient of Determination):
●​ Proportion of variance in Y explained by the model
●​ R² = 1 - (SS_residual / SS_total)
●​ Range: 0 to 1 (higher is better)
●​ Limitations: Increases with more variables, doesn't indicate causation
2. Adjusted R-squared:
●​ Adjusts for number of predictors in the model
●​ Penalizes adding irrelevant variables
●​ More reliable for model comparison
●​ Formula: 1 - [(1-R²)(n-1)/(n-k-1)]
3. Root Mean Square Error (RMSE):
●​ Standard deviation of residuals
●​ Measures average prediction error
●​ In same units as dependent variable
●​ Lower values indicate better fit
●​ Formula: √(Σ(y_i - ŷ_i)²/n)
4. Mean Absolute Error (MAE):
●​ Average absolute difference between observed and predicted
●​ Less sensitive to outliers than RMSE
●​ Easier to interpret
●​ Formula: Σ|y_i - ŷ_i|/n
5. Mean Absolute Percentage Error (MAPE):
●​ Average percentage error
●​ Useful for comparing across different scales
●​ Formula: (Σ|(y_i - ŷ_i)/y_i|/n) × 100%
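The five accuracy measures above translate directly into code. This is a minimal sketch with hypothetical observed and predicted values; `n_predictors` corresponds to k in the adjusted R² formula.

```python
import math

def regression_metrics(y_true, y_pred, n_predictors):
    """Compute R², adjusted R², RMSE, MAE and MAPE from the formulas above."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / n
    # MAPE is undefined when any observed value is zero.
    mape = 100 * sum(abs((y - yh) / y) for y, yh in zip(y_true, y_pred)) / n
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse, "mae": mae, "mape": mape}

# Hypothetical data for illustration only.
observed = [10, 20, 30, 40, 50]
predicted = [12, 18, 33, 37, 50]
metrics = regression_metrics(observed, predicted, n_predictors=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

RMSE is never smaller than MAE on the same data, and the gap between them widens when a few errors are much larger than the rest.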

Validation Techniques:
1. Train-Test Split:
● Split data into training and testing sets (e.g., 70-30 or 80-20)
● Build model on training data
● Evaluate on testing data
● Helps detect overfitting, because performance is measured on data the model has not seen
2. Cross-Validation:
●​ k-Fold Cross-Validation: Divide data into k subsets
●​ Use k-1 folds for training, 1 fold for testing
●​ Repeat k times and average results
●​ Common: 5-fold or 10-fold cross-validation
3. Leave-One-Out Cross-Validation (LOOCV):
●​ Special case where k = n
●​ Each observation serves as test set once
●​ Computationally expensive but comprehensive
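The fold mechanics of k-fold cross-validation can be sketched without any modelling library. The generator below only produces index splits; fitting and scoring a model on each split is left out. The function name and `seed` parameter are illustrative assumptions.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))

# Leave-one-out is the special case k = n.
loocv_splits = list(k_fold_indices(4, 4))
```

Each observation appears in exactly one test fold, so averaging the k test scores uses every data point for evaluation once.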

Threats to Validity:
1. Internal Validity:
●​ Omitted Variable Bias: Missing important predictors
●​ Measurement Error: Inaccurate variable measurement
●​ Sample Selection Bias: Non-random sample selection
●​ Simultaneity Bias: Two-way causation
2. External Validity:
●​ Generalizability: Results applicable to other populations
●​ Temporal Stability: Relationships hold over time
●​ Context Dependence: Results specific to study conditions
3. Statistical Validity:
●​ Power: Adequate sample size
●​ Specification Error: Incorrect model form
●​ Assumption Violations: Breaking regression assumptions

Improving Accuracy and Validity:


1. Data Quality:
●​ Ensure accurate measurement
●​ Handle missing data appropriately
●​ Check for outliers and influential points
2. Model Specification:
●​ Include theoretically relevant variables
●​ Test different functional forms
●​ Consider interaction effects
3. Diagnostic Checking:
●​ Regular residual analysis
●​ Check for multicollinearity
●​ Verify assumption compliance
4. Robustness Checks:
●​ Test model with different specifications
●​ Use alternative estimation methods
●​ Compare with simpler/complex models

Practical Example - Sales Prediction Model:


Model: Sales = β₀ + β₁×Advertising + β₂×Price + β₃×Competition + ε
Accuracy Measures:
●​ R² = 0.85 (85% of sales variance explained)
●​ Adjusted R² = 0.83
●​ RMSE = $12,500 (average prediction error)
●​ MAE = $9,800
Validation:
●​ 5-fold cross-validation: Average R² = 0.82
●​ Train-test split (80-20): Test R² = 0.84
Threats Addressed:
●​ Controlled for major competitors (omitted variable bias)
●​ Used reliable sales data (measurement error)
●​ Random sample of stores (selection bias)
Interpretation Guidelines:
●​ High R² doesn't guarantee good predictions
●​ Consider practical significance alongside statistical
●​ Validate on new, unseen data
●​ Report confidence intervals for predictions

4. Dealing with Categorical Data in Regression


Introduction:​
Categorical variables represent groups or categories rather than numerical values. Proper
handling is essential for meaningful regression analysis.

Types of Categorical Variables:


1. Nominal Variables:
●​ Categories with no inherent order
●​ Examples: Gender, Color, Country, Product Type
2. Ordinal Variables:
●​ Categories with meaningful order
●​ Examples: Education Level, Satisfaction Rating, Income Bracket
3. Binary/Dichotomous Variables:
●​ Only two categories
●​ Examples: Yes/No, Male/Female, Success/Failure

Encoding Categorical Variables:


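1. Dummy Coding (One-Hot Encoding with one level dropped):
● Create k-1 dummy variables for k categories (full one-hot encoding creates k columns; a regression with an intercept drops one)
● One category serves as reference group
● Most common method for nominal variables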
Example - Region with 4 categories:
●​ Reference: North
●​ Dummy1: South (1 if South, 0 otherwise)
●​ Dummy2: East (1 if East, 0 otherwise)
●​ Dummy3: West (1 if West, 0 otherwise)
2. Effect Coding:
●​ Similar to dummy coding but uses -1 for reference group
●​ Useful for comparing category means to overall mean
3. Treatment Coding:
● Another name for dummy coding: compares each level to a reference level
● Default in many statistical packages
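The Region example can be encoded by hand to make the k-1 rule concrete. This is a minimal sketch; the helper `dummy_code` and the sample observations are hypothetical.

```python
def dummy_code(values, reference):
    """Return (levels, rows) with one 0/1 column per non-reference level."""
    levels = sorted(set(values) - {reference})
    rows = [{level: int(value == level) for level in levels} for value in values]
    return levels, rows

# Hypothetical observations of the 4-category Region variable.
regions = ["North", "South", "East", "West", "South"]
levels, rows = dummy_code(regions, reference="North")

print(levels)
for region, row in zip(regions, rows):
    print(region, row)
```

The reference category (North here) is represented by all dummies being 0, so its effect is absorbed into the intercept.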

Practical Example - Salary Prediction:


Variables:
●​ Dependent: Salary (continuous)
●​ Independent1: Experience (continuous)
●​ Independent2: Education (categorical: HS, Bachelor, Master, PhD)
●​ Independent3: Department (categorical: Sales, Marketing, Engineering)
Model Specification:​
Salary = β₀ + β₁×Experience + β₂×Bachelor + β₃×Master + β₄×PhD + β₅×Marketing +
β₆×Engineering + ε
Reference Categories:
●​ Education: High School
●​ Department: Sales
Interpretation:
●​ β₂: Expected salary difference between Bachelor's and High School
●​ β₃: Expected salary difference between Master's and High School
●​ β₄: Expected salary difference between PhD and High School
●​ β₅: Expected salary difference between Marketing and Sales
●​ β₆: Expected salary difference between Engineering and Sales
Model with Interactions:​
Salary = β₀ + β₁×Experience + β₂×Master + β₃×Engineering + β₄×(Experience×Master) +
β₅×(Experience×Engineering) + ε
Interpretation:
●​ β₄: Additional effect of experience for Master's vs others
●​ β₅: Additional effect of experience for Engineering vs others
Best Practices:
1. Reference Category Selection:
●​ Choose meaningful reference group
●​ Often largest group or control group
●​ Consider theoretical importance
2. Avoiding Dummy Variable Trap:
●​ Always use k-1 dummies for k categories
●​ Including all k dummies causes perfect multicollinearity
3. Checking Category Effects:
●​ Test if categorical variable significantly improves model
●​ Use F-test for nested models
●​ Consider overall category significance
4. Handling Many Categories:
●​ Consider grouping similar categories
●​ Use regularization techniques
●​ Be cautious with hierarchical models

Advanced Techniques:
1. Analysis of Covariance (ANCOVA):
●​ Combines ANOVA and regression
●​ Tests group differences while controlling for covariates
2. Mixed Effects Models:
●​ Handles nested categorical data
●​ Useful for repeated measures or hierarchical data
3. Regularization Methods:
●​ Ridge/Lasso regression for high-dimensional categorical data
●​ Helps prevent overfitting with many categories

Diagnostic Considerations:
1. Homogeneity of Variance:
●​ Check if error variance is equal across groups
●​ Use Levene's test or plot residuals by category
2. Homogeneity of Regression Slopes:
●​ In ANCOVA, check if slopes are equal across groups
●​ Test interaction between covariate and categorical variable
3. Sample Size per Category:
●​ Ensure adequate observations in each category
●​ Small categories may lead to unstable estimates

Previous Year Questions & Solutions


VSA (1 Mark Questions)
1. ________ analysis estimates the relationship between single dependent
variable and single independent variable. (2025)​
Answer: Simple linear regression
2. Regression line is used ________. (2023)​
Answer: to predict the value of dependent variable based on independent variable

SA (5 Marks Questions)
3. Discuss interaction effects in regression with examples. (2025)
Definition: Interaction effects occur when the effect of one independent variable on the
dependent variable depends on the value of another independent variable.
Mathematical Representation:​
Y = β₀ + β₁X₁ + β₂X₂ + β₃(X₁ × X₂) + ε
Interpretation:
●​ β₃ represents the interaction effect
●​ If β₃ is significant, the relationship between X₁ and Y changes depending on X₂
Example 1: Marketing Study
●​ Y = Sales
●​ X₁ = Advertising Budget
●​ X₂ = Season (0 = Off-season, 1 = Peak-season)
●​ Interaction: Advertising × Season
●​ Interpretation: Effect of advertising on sales is different during peak vs off-season
Example 2: Education Research
●​ Y = Test Scores
●​ X₁ = Study Hours
●​ X₂ = Teaching Method (0 = Traditional, 1 = Innovative)
●​ Interaction: Study Hours × Teaching Method
●​ Interpretation: The benefit of additional study hours depends on teaching method
Testing Interaction Effects:
1.​ Include product term in regression
2.​ Test significance of interaction coefficient
3.​ Plot interaction effects for visualization
4.​ Conduct simple slopes analysis
Practical Importance:
●​ Reveals complex relationships
●​ Provides more accurate predictions
●​ Helps in targeted interventions
●​ Avoids misleading conclusions
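In the model Y = β₀ + β₁X₁ + β₂X₂ + β₃(X₁ × X₂) + ε, the slope of X₁ is β₁ + β₃X₂, which is what "the effect depends on X₂" means numerically. The sketch below applies this to the marketing example; the coefficient values are invented for illustration and are not estimates from any data.

```python
def predict(x1, x2, b0, b1, b2, b3):
    """Prediction from a two-variable model with an interaction term."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

def slope_of_x1(x2, b1, b3):
    """Change in Y per one-unit change in X1 at a given value of X2."""
    return b1 + b3 * x2

# Hypothetical coefficients: sales vs advertising budget and season.
coef = {"b0": 50.0, "b1": 2.0, "b2": 10.0, "b3": 1.5}

off_season_slope = slope_of_x1(0, coef["b1"], coef["b3"])
peak_season_slope = slope_of_x1(1, coef["b1"], coef["b3"])
print("Off-season advertising effect:", off_season_slope)
print("Peak-season advertising effect:", peak_season_slope)

# The same slope shows up as the difference between adjacent predictions.
gain = predict(11, 1, **coef) - predict(10, 1, **coef)
```

With a binary X₂, β₃ is simply the difference between the two group-specific slopes, which is what a simple slopes analysis reports.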

LA (4 - 15 Marks Questions)
4. What is residual? (4 Marks) (2025)
Definition: A residual is the difference between an observed value and the value predicted by
the regression model.
Mathematical Formula:​
e_i = y_i - ŷ_i​
where:
●​ e_i = residual for i-th observation
●​ y_i = actual observed value
●​ ŷ_i = predicted value from regression model
Properties of Residuals:
● Sum of residuals is zero: Σe_i = 0 (for a least-squares fit that includes an intercept)
● Residuals are uncorrelated with predicted values
● Used to check regression assumptions
Importance and Uses:
1.​ Model Diagnostics: Check if regression assumptions are violated
2.​ Outlier Detection: Identify unusual observations
3.​ Homoscedasticity Check: Examine constant variance assumption
4.​ Model Improvement: Identify patterns for better specification
Residual Analysis:
●​ Plot residuals vs predicted values
●​ Plot residuals vs independent variables
●​ Check for patterns, trends, or heteroscedasticity
●​ Normal probability plot for normality check
Example:​
If actual sales = 100 and predicted sales = 95, then residual = 100 - 95 = 5
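The zero-sum property can be checked numerically. The sketch below fits a least-squares line to a small hypothetical dataset and computes e_i = y_i - ŷ_i for each point; the data values and the helper name `fit_line` are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print("residuals:", [round(e, 3) for e in residuals])
print("sum of residuals:", round(sum(residuals), 10))
```

Up to floating-point rounding, the residuals sum to zero and are also orthogonal to x, which is why patterns in a residual plot signal a problem with the model rather than with the arithmetic.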

5. Given the following data pairs (x, y), find the regression equation. (7 Marks)
(2023)​
Data: (1, 1.24), (2, 5.23), (3, 7.24), (4, 7.60), (5, 9.97), (6, 14.31), (7, 13.99), (8,
14.88), (9, 18.04), (10, 20.70)
Solution:
Step 1: Calculate necessary sums​
n = 10​
Σx = 1+2+3+4+5+6+7+8+9+10 = 55​
Σy = 1.24+5.23+7.24+7.60+9.97+14.31+13.99+14.88+18.04+20.70 = 113.2​
Σxy=(1×1.24)+(2×5.23)+(3×7.24)+(4×7.60)+(5×9.97)+(6×14.31)+(7×13.99)+(8×14.88)+(9×18
.04)+(10×20.70)​
= 1.24 + 10.46 + 21.72 + 30.40 + 49.85 + 85.86 + 97.93 + 119.04 + 162.36 + 207.00
= 785.86​
Σx² = 1²+2²+3²+4²+5²+6²+7²+8²+9²+10²
= 1+4+9+16+25+36+49+64+81+100
= 385​
Σy² = 1.24²+5.23²+7.24²+7.60²+9.97²+14.31²+13.99²+14.88²+18.04²+20.70²​
= 1.5376 + 27.3529 + 52.4176 + 57.76 + 99.4009 + 204.7761 + 195.7201 + 221.4144 +
325.4416 + 428.49
= 1614.3112
Step 2: Calculate means​
x̄ = Σx/n = 55/10 = 5.5​
ȳ = Σy/n = 113.2/10 = 11.32
Step 3: Calculate slope (b₁)​
b₁ = [nΣxy - ΣxΣy] / [nΣx² - (Σx)²]​
= [10×785.86 - 55×113.2] / [10×385 - 55²]​
= [7858.6 - 6226] / [3850 - 3025]​
= 1632.6 / 825​
= 1.979
Step 4: Calculate intercept (b₀)​
b₀ = ȳ - b₁x̄​
= 11.32 - 1.979×5.5​
= 11.32 - 10.8845​
= 0.4355
Step 5: Write regression equation​
ŷ = 0.4355 + 1.979x
Answer: The regression equation is ŷ = 0.436 + 1.979x
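The hand calculation can be cross-checked by following the same steps in code. This sketch uses the ten data pairs from the question and the sum-based formulas from Steps 1 to 4; variable names are illustrative.

```python
data = [
    (1, 1.24), (2, 5.23), (3, 7.24), (4, 7.60), (5, 9.97),
    (6, 14.31), (7, 13.99), (8, 14.88), (9, 18.04), (10, 20.70),
]

# Step 1: necessary sums.
n = len(data)
sum_x = sum(x for x, _ in data)
sum_y = sum(y for _, y in data)
sum_xy = sum(x * y for x, y in data)
sum_x2 = sum(x * x for x, _ in data)

# Steps 3 and 4: slope and intercept.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

print(f"sum_x = {sum_x}, sum_y = {sum_y:.2f}, sum_xy = {sum_xy:.2f}, sum_x2 = {sum_x2}")
print(f"y_hat = {b0:.3f} + {b1:.3f} x")
```

Carrying the unrounded slope through Step 4 gives an intercept of about 0.436, which matches the final answer; the intermediate value 0.4355 arises from rounding the slope to 1.979 first.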

6. Calculate the correlation coefficient from given dataset. (8 Marks) (2023)​


Using the same data as the previous question.
Solution:
Step 1: Use previously calculated values​
n = 10, Σx = 55, Σy = 113.2, Σxy = 785.86, Σx² = 385, Σy² = 1614.3112
Step 2: Calculate correlation coefficient (r)​
r = [nΣxy - ΣxΣy] / √{[nΣx² - (Σx)²] × [nΣy² - (Σy)²]}​
= [10×785.86 - 55×113.2] / √{[10×385 - 55²] × [10×1614.3112 - 113.2²]}
Step 3: Calculate numerator​
Numerator = 7858.6 - 6226 = 1632.6
Step 4: Calculate denominator components​
[nΣx² - (Σx)²] = 3850 - 3025 = 825​
[nΣy² - (Σy)²] = 16143.112 - 12814.24 = 3328.872
Step 5: Calculate denominator
Denominator = √[825 × 3328.872] = √2746319.4 ≈ 1657.20
Step 6: Calculate r
r = 1632.6 / 1657.20 ≈ 0.985
Step 7: Interpretation​
r = 0.985 indicates a very strong positive linear relationship between x and y.
Answer: Correlation coefficient r = 0.985
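The correlation coefficient can be verified with the same sum-based formula. This is a minimal sketch on the question's data; it recomputes every sum so that it runs on its own.

```python
import math

data = [
    (1, 1.24), (2, 5.23), (3, 7.24), (4, 7.60), (5, 9.97),
    (6, 14.31), (7, 13.99), (8, 14.88), (9, 18.04), (10, 20.70),
]

n = len(data)
sum_x = sum(x for x, _ in data)
sum_y = sum(y for _, y in data)
sum_xy = sum(x * y for x, y in data)
sum_x2 = sum(x * x for x, _ in data)
sum_y2 = sum(y * y for _, y in data)

numerator = n * sum_xy - sum_x * sum_y
x_part = n * sum_x2 - sum_x ** 2
y_part = n * sum_y2 - sum_y ** 2
r = numerator / math.sqrt(x_part * y_part)

print(f"numerator = {numerator:.1f}, x_part = {x_part}, y_part = {y_part:.3f}")
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```

For simple linear regression r² equals the model's R², so roughly 97% of the variance in y is explained by x here.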

7. For the following dataset, obtain a prediction for x = 4.5. (5 Marks) (2023)​
Using the regression equation from question 5.
Solution:
Step 1: Use regression equation​
ŷ = 0.4355 + 1.979x
Step 2: Substitute x = 4.5​
ŷ = 0.4355 + 1.979×4.5​
= 0.4355 + 8.9055​
= 9.341
Step 3: Interpretation​
When x = 4.5, the predicted value of y is 9.341
Answer: The predicted value for x = 4.5 is 9.341

8. Define linear Regression. Why is it so important? Write down the assumptions for success with linear-regression analysis. (15 Marks) (2024)
Definition of Linear Regression:​
Linear regression is a statistical method that models the relationship between a dependent
variable and one or more independent variables by fitting a linear equation to observed data. It
estimates how the dependent variable changes when any independent variable is varied while
others are held constant.
Types of Linear Regression:
1.​ Simple Linear Regression: One independent variable
○​ Model: Y = β₀ + β₁X + ε
2. Multiple Linear Regression: Two or more independent variables
○ Model: Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
Importance of Linear Regression:
1. Prediction and Forecasting:
●​ Predict future values of dependent variable
●​ Used in sales forecasting, stock price prediction
●​ Example: Predicting house prices based on features
2. Relationship Analysis:
●​ Quantifies relationships between variables
●​ Identifies significant predictors
●​ Measures strength and direction of relationships
3. Control and Optimization:
●​ Helps in process optimization
●​ Identifies key factors affecting outcomes
●​ Supports decision-making in business and research
4. Scientific Research:
●​ Tests theoretical relationships
●​ Provides empirical evidence for hypotheses
●​ Used across various disciplines
5. Risk Assessment:
●​ Evaluates impact of different factors
●​ Supports risk management decisions
●​ Used in finance, healthcare, and engineering

Assumptions for Successful Linear Regression Analysis:


1. Linearity:
●​ Relationship between variables is linear
●​ Check: Scatter plots, residual plots
●​ Violation: Curvilinear patterns
●​ Solution: Transform variables, add polynomial terms
2. Independence of Errors:
●​ Residuals are independent of each other
●​ No autocorrelation
●​ Check: Durbin-Watson test (ideal ≈ 2)
●​ Solution: Time series models, robust standard errors
3. Homoscedasticity:
●​ Constant variance of errors
●​ Residuals form consistent band around zero
●​ Check: Residual vs fitted plot
●​ Violation: Funnel-shaped pattern
●​ Solution: Transformations, weighted least squares
4. Normality of Errors:
●​ Residuals follow normal distribution
●​ Important for inference and confidence intervals
●​ Check: Q-Q plot, Shapiro-Wilk test
●​ Solution: Data transformations, robust methods
5. No Perfect Multicollinearity:
●​ Independent variables not perfectly correlated
●​ Check: Variance Inflation Factor (VIF < 10)
●​ Solution: Remove correlated variables, use PCA
6. No Endogeneity:
●​ Independent variables uncorrelated with error term
●​ Violation: Omitted variable bias, measurement error
●​ Check: Theoretical reasoning, Hausman test
●​ Solution: Instrumental variables, fixed effects
Additional Considerations:
Sample Size:
●​ Adequate observations for reliable estimates
●​ Minimum 10-15 observations per predictor
●​ Larger samples provide more precise estimates
Outlier Treatment:
●​ Identify and handle influential points
●​ Use Cook's distance, leverage plots
●​ Consider robust regression methods
Model Specification:
●​ Include all relevant variables
●​ Correct functional form
●​ Test for interaction effects
Practical Applications:
●​ Economics: Demand forecasting
●​ Medicine: Drug dosage effects
●​ Marketing: Customer behavior analysis
●​ Engineering: Quality control
●​ Social Sciences: Policy impact assessment
Classification
What is Classification in Machine Learning? Explain with a suitable example.
Discuss the key differences between Classification and Regression.
Answer:
Classification is a type of supervised machine learning where the goal is to predict a discrete
categorical label for a given input data point. The model learns from labeled training data to
assign new, unseen instances to one of several predefined categories.
Key Characteristics:
●​ Supervised Learning: The model is trained on a labeled dataset, meaning each
training example is paired with the correct output.
●​ Categorical Output: The target variable is not continuous but belongs to a specific
category or group (e.g., "spam" or "not spam").
●​ Decision Boundary: The model essentially learns a boundary that separates the
different classes in the feature space.

Types of Classification:
1.​ Binary Classification: The simplest form, where the target variable has only two
possible outcomes.
○​ Examples:
■​ Email Spam Filtering: Classifying an email as "Spam" or "Not Spam".
■​ Medical Diagnosis: Predicting whether a tumor is "Malignant" or
"Benign".
■​ Loan Approval: Predicting if a loan application will "Default" or "Not
Default".
2.​ Multi-Class Classification: The target variable has more than two classes. The model
must decide on one class from three or more possibilities.
○​ Examples:
■​ Handwritten Digit Recognition: Classifying an image of a handwritten
digit into one of ten classes (0 through 9).
■​ Product Categorization: Classifying a product on an e-commerce site
into categories like "Electronics", "Clothing", or "Books".
■​ Animal Species Identification: Identifying a species from an image,
e.g., "Cat", "Dog", "Rabbit".
3.​ Multi-Label Classification: Each instance can be assigned to multiple labels
simultaneously.
○​ Examples:
■​ Movie Genre Tagging: A movie can be labeled as "Action", "Comedy",
and "Sci-Fi" all at once.
■​ Photo Tagging: A single image can contain multiple objects, like
"Person", "Car", and "Tree".
Difference between Classification and Regression:
| Basis | Classification | Regression |
|---|---|---|
| Output Variable | Categorical class labels | Numerical continuous value |
| Purpose | Predict a category or class | Predict a quantity or amount |
| Example Problems | Email Spam Detection, Image Recognition, Loan Default Prediction | House Price Prediction, Stock Market Forecasting, Temperature Prediction |
| Algorithms | Logistic Regression, Decision Trees, k-NN, SVM | Linear Regression, Polynomial Regression, Decision Trees for regression |
| Evaluation Metrics | Accuracy, Precision, Recall, F1-Score, ROC Curve | Mean Absolute Error, Mean Squared Error, R-squared |

Explain Logistic Regression in detail. Why is it called "Logistic" despite being used
for classification? Illustrate the Sigmoid Function.
Logistic Regression is a fundamental classification algorithm used to predict the probability
that a given data point belongs to a particular category. It is primarily used for binary
classification problems involving two classes.

Why is it called "Logistic"?​


Despite its name, it is a classification algorithm. The term "Regression" is used because the
method applies a linear regression-like approach to a transformed version of the target variable.
The core of this transformation is the Logistic Function, also known as the Sigmoid Function,
which is where the name "Logistic" originates.

The Sigmoid Function:​


The sigmoid function is an S-shaped curve that maps any real-valued number into a value
between 0 and 1. This output is interpreted as a probability.
Formula: σ(z) = 1 / (1 + e^(-z))
Where z is the linear combination of inputs and weights: z = β₀ + β₁X₁ + β₂X₂ + ...
How it works:
1.​ Linear Combination: First, it calculates the weighted sum of the input features,
similar to linear regression: z = β₀ + β₁X₁ + β₂X₂ + ...
2.​ Sigmoid Transformation: This linear output z is fed into the sigmoid function.
P(Y=1) = 1 / (1 + e^(-z)). This gives the probability that the data point belongs to class 1.
3.​ Decision Boundary: A threshold, typically 0.5, is applied to this probability to make a
final class prediction.
○​ If P(Y=1) >= 0.5, predict Class 1.
○​ If P(Y=1) < 0.5, predict Class 0.
Example:​
Predicting if a student will pass (1) or fail (0) an exam based on their hours of study.
●​ Input Feature: Hours studied.
●​ Output: Pass (1) or Fail (0).
●​ The model calculates the probability P(Pass | Hours Studied). If this probability is 0.7 for
a student who studied 5 hours, and we use a 0.5 threshold, the student is classified as
"Pass."

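The three steps (linear combination, sigmoid transformation, threshold) can be sketched for the study-hours example. The coefficients below are hypothetical, chosen only so that 5 hours of study gives a probability near 0.7; a real model would estimate them from data.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_pass(hours, b0, b1, threshold=0.5):
    """Return (probability of passing, predicted class) for one student."""
    z = b0 + b1 * hours              # Step 1: linear combination
    probability = sigmoid(z)         # Step 2: sigmoid transformation
    label = int(probability >= threshold)  # Step 3: decision boundary
    return probability, label

# Hypothetical coefficients, not fitted to any dataset.
B0, B1 = -1.65, 0.5

probability, label = predict_pass(5, B0, B1)
print(f"P(Pass | 5 hours) = {probability:.3f}, predicted class = {label}")
```

Because σ(0) = 0.5, a 0.5 threshold is equivalent to checking whether z is non-negative, so the decision boundary is linear in the input features.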
Describe the key steps involved in building a classification model, from data
preparation to model training.
Building a robust classification model is a structured process involving several key steps:
1.​ Data Collection:
○​ Gather the relevant dataset from sources like databases, APIs, or files.
○​ Example: Collecting data for credit card fraud detection, including transaction
amount, merchant, location, time, and a label for fraud.
2.​ Data Preprocessing and Exploration:
○​ Handling Missing Values: Impute or remove missing data.
○​ Exploratory Data Analysis: Understand data distributions, correlations, and
class balance using visualizations.
○​ Example: A boxplot might reveal that fraudulent transactions have a higher
average amount.
3.​ Data Cleaning and Feature Engineering:
○​ Encoding Categorical Variables: Convert text categories into numbers using
One-Hot Encoding or Label Encoding.
○​ Feature Scaling/Normalization: Standardize or normalize numerical features
so no single feature dominates the model.
○​ Creating New Features: Derive new, more informative features from existing
ones.
4.​ Splitting the Data:
○​ Divide the dataset into two subsets:
■​ Training Set: Used to train the model, typically 70-80% of the data.
■​ Testing Set: Used to evaluate the final model's performance on unseen
data, typically 20-30%.
5.​ Model Selection and Training:
○​ Choose one or more classification algorithms, such as Logistic Regression,
Decision Tree, or Random Forest.
○​ Train the Model: Feed the training data to the algorithm so it can learn the
relationship between features and the target variable.
○​ Example: The Logistic Regression algorithm learns the coefficients for each
feature that best separate the "Fraud" and "Not Fraud" classes.
6.​ Model Evaluation:
○​ Use the held-out testing set to make predictions.
○​ Evaluate the model's performance using metrics like Accuracy, Precision, Recall,
F1-Score, and the ROC-AUC curve.
7.​ Model Tuning (Hyperparameter Optimization):
○​ Algorithms have hyperparameters, like the regularization strength in Logistic
Regression or the tree depth in a Decision Tree.
○​ Use techniques like Grid Search or Random Search with cross-validation to find
the best hyperparameters.
8.​ Deployment and Monitoring:
○​ Once satisfied, deploy the model to a production environment.
○​ Continuously monitor its performance, as data patterns can change over time, a
concept known as model drift.
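Steps 3 and 4 (feature scaling and data splitting) can be illustrated without a machine-learning library. The sketch below uses hypothetical transaction amounts; the helper names are illustrative, and scaling statistics are taken from the training set only so that no information from the test set leaks into training.

```python
import random
import statistics

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def standardize(train_values, values):
    """Z-score values using the mean and standard deviation of the training set."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)
    return [(v - mean) / std for v in values]

# Hypothetical transaction amounts for illustration only.
amounts = [12.0, 250.0, 33.5, 980.0, 41.0, 18.0, 75.0, 640.0, 22.0, 55.0]

train, test = train_test_split(amounts, test_fraction=0.2)
train_scaled = standardize(train, train)
test_scaled = standardize(train, test)

print("train size:", len(train), "test size:", len(test))
print("scaled test values:", [round(v, 3) for v in test_scaled])
```

For imbalanced problems such as fraud detection, a stratified split that preserves the class ratio in both subsets is usually preferable to the plain shuffle shown here.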

Discuss the Confusion Matrix and the various evaluation metrics derived from it
for a classification model. Provide examples.
Answer:
A Confusion Matrix is a table that describes the performance of a classification model on a set of
test data for which the true values are known. It provides a detailed breakdown of correct and
incorrect predictions.
Structure of a Confusion Matrix (for Binary Classification):
| | Predicted: NO | Predicted: YES |
|---|---|---|
| Actual: NO | True Negatives (TN) | False Positives (FP) |
| Actual: YES | False Negatives (FN) | True Positives (TP) |

●​ True Positive (TP): The model correctly predicted the positive class. Example:
Correctly predicted "Fraud".
●​ True Negative (TN): The model correctly predicted the negative class. Example:
Correctly predicted "Not Fraud".
●​ False Positive (FP): The model incorrectly predicted the positive class. Also known as
a Type I error. Example: Predicting "Fraud" for a legitimate transaction.
●​ False Negative (FN): The model incorrectly predicted the negative class. Also known
as a Type II error. Example: Predicting "Not Fraud" for an actual fraudulent transaction.

Key Metrics Derived from the Confusion Matrix:


1.​ Accuracy:
○​ Overall, how often is the classifier correct?
○​ Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
○​ When to use: When the classes are balanced. It can be misleading for imbalanced
datasets.
2.​ Precision:
○​ When the model predicts "YES," how often is it correct?
○​ Formula: Precision = TP / (TP + FP)
○​ Focus: Minimizing False Positives.
○​ Example (Spam Detection): High Precision means that if an email is classified
as spam, it is very likely to be spam. The focus is on not putting good emails in the
spam folder.
3.​ Recall (Sensitivity or True Positive Rate):
○​ What proportion of actual "YES" did the model correctly catch?
○​ Formula: Recall = TP / (TP + FN)
○​ Focus: Minimizing False Negatives.
○​ Example (Disease Prediction): High Recall means that most sick people are
correctly identified. The focus is on not missing anyone who is sick.
4.​ F1-Score:
○​ The harmonic mean of Precision and Recall. It provides a single score that
balances both concerns.
○​ Formula: F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
○​ When to use: When you need a balance between Precision and Recall, especially
with an imbalanced dataset.

Example Scenario:​
A model to detect a rare disease from 1000 patients, where 990 are healthy and 10
are sick.
●​ If a model predicts "Healthy" for everyone:
○​ Accuracy = 990/1000 = 99%. This looks great but is useless.
○ Precision = 0/0, which is undefined because there are no positive predictions; by convention it is reported as 0.
○​ Recall = 0 (because it didn't catch any sick person).
●​ This clearly shows why Accuracy alone is insufficient and why Precision and Recall are
critical.

Example: Medical Diagnosis Test


●​ Actual Positive: Patients with the disease.
●​ Actual Negative: Healthy patients.
●​ Scenario: A model makes the following predictions on 100 patients.
○​ TN = 80, FP = 5, FN = 10, TP = 5.
●​ Calculations:
○​ Accuracy = (80 + 5) / 100 = 0.85 (85%)
○​ Precision = 5 / (5 + 5) = 0.5 (50%) -> Half of the predicted sick patients were
actually sick.
○​ Recall = 5 / (5 + 10) = 0.33 (33%) -> The model only caught 33% of the actually
sick patients.
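The four formulas above can be checked with a few lines of Python. This is a minimal sketch: the function name `classification_metrics` is an assumption made for illustration, and returning 0.0 when a denominator is zero is a common convention rather than part of the definitions.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision is 0/0 when nothing is predicted positive; report 0.0 by convention.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Rare-disease scenario: the model predicts "Healthy" for all 1000 patients.
all_healthy = classification_metrics(tp=0, fp=0, fn=10, tn=990)

# Medical diagnosis test: TN = 80, FP = 5, FN = 10, TP = 5.
diagnosis = classification_metrics(tp=5, fp=5, fn=10, tn=80)

print(all_healthy)
print(diagnosis)
```

For the diagnosis example the F1-Score works out to 0.4, which sits between the Precision (0.5) and the Recall (0.33) and is pulled toward the lower of the two.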

Previous year solved:


1. Calculate Sensitivity from the data given in the following table on bowel
cancer testing. (3 Marks)
Sensitivity is the same as Recall. It measures the proportion of actual positive cases that
are correctly identified by the test.
Step 1: Identify the relevant values from the confusion matrix.
The confusion matrix is structured as follows:
| Blood Test | Actual: Yes (Bowel Cancer) | Actual: No (No Bowel Cancer) |
|---|---|---|
| Positive | True Positives (TP) = 2 | False Positives (FP) = 18 |
| Negative | False Negatives (FN) = 1 | True Negatives (TN) = 132 |

From the table:


●​ True Positives (TP): Patients with cancer who tested positive = 2
●​ False Negatives (FN): Patients with cancer who tested negative = 1
Step 2: Apply the formula for Sensitivity (Recall).
The formula is:​
Sensitivity = TP / (TP + FN)
Step 3: Perform the calculation.​
Sensitivity = 2 / (2 + 1) = 2 / 3 ≈ 0.6667
Step 4: State the final answer.​
The sensitivity of the blood test is approximately 0.6667 or 66.67%.

2. Calculate false positive rate from given table. (3 Marks)

The False Positive Rate (FPR) measures the proportion of actual negative cases that are
incorrectly identified as positive by the test.
Step 1: Identify the relevant values from the confusion matrix.
●​ False Positives (FP): Patients without cancer who tested positive = 18
●​ True Negatives (TN): Patients without cancer who tested negative = 132
Step 2: Apply the formula for False Positive Rate.​
The formula is:​
False Positive Rate (FPR) = FP / (FP + TN)
Step 3: Perform the calculation.​
FPR = 18 / (18 + 132) = 18 / 150 = 0.12
Step 4: State the final answer.​
The False Positive Rate of the blood test is 0.12 or 12%.

3. Why do we calculate recall and precision? (4 Marks)


Answer:
We calculate Recall and Precision to get a more nuanced understanding of a classification
model's performance than what accuracy alone can provide, especially when dealing with
imbalanced datasets or when the costs of different types of errors are not equal.
1. Recall (Sensitivity) is calculated to answer the question: "Of all the actual
positive cases, how many did we correctly find?"
○​ Why it's important: A high Recall is critical when the goal is to miss as
few positive cases as possible. The cost of a False Negative (missing a
positive case) is very high.
○​ Example: In the context of the bowel cancer test, a high Recall means the
test is good at catching most people who have cancer. A low Recall (high
FN) would mean many sick patients are told they are healthy, which is a
dangerous outcome.
2. Precision is calculated to answer the question: "Of all the cases we predicted as
positive, how many are actually positive?"
●​ Why it's important: A high precision is critical when the goal is to ensure that
when we make a positive prediction, we are very confident about it. The cost of a
False Positive (incorrectly alarming a negative case) is high.
●​ Example: In spam email detection, a high Precision means that when an email is
sent to the spam folder, it is almost certainly spam. A low Precision (high FP)
would mean many important emails are incorrectly classified as spam, which is
highly undesirable.

4. Calculate the probability of having bowel cancer given a positive blood test
from a given table. (5 Marks)
Answer:
This question is asking for Precision. It is the probability that a patient actually has the
disease, given that their test result is positive.
Step 1: Identify the relevant values from the confusion matrix.
●​ True Positives (TP): Patients with cancer who tested positive = 2
●​ False Positives (FP): Patients without cancer who tested positive = 18
Step 2: Apply the formula for Precision (which is the conditional probability P(Cancer |
Positive Test)).​
The formula is:​
Precision = TP / (TP + FP)
Step 3: Perform the calculation.​
Precision = 2 / (2 + 18) = 2 / 20 = 0.10
Step 4: State the final answer.​
The probability of having bowel cancer given a positive blood test is 0.10 or 10%.
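The three bowel cancer answers (Sensitivity, False Positive Rate and Precision) can be reproduced from the same four counts. The sketch below uses exact fractions and also checks, as an extra step not required by the question, that Bayes' theorem gives the same value for P(Cancer | Positive Test). The variable names are chosen here for illustration.

```python
from fractions import Fraction

# Counts from the bowel cancer table.
TP, FP, FN, TN = 2, 18, 1, 132
total = TP + FP + FN + TN

sensitivity = Fraction(TP, TP + FN)          # P(Positive | Cancer)
false_positive_rate = Fraction(FP, FP + TN)  # P(Positive | No Cancer)
precision = Fraction(TP, TP + FP)            # P(Cancer | Positive)

# Bayes' theorem:
# P(Cancer | Positive) = P(Positive | Cancer) * P(Cancer) / P(Positive)
p_cancer = Fraction(TP + FN, total)
p_positive = Fraction(TP + FP, total)
bayes_posterior = sensitivity * p_cancer / p_positive

print(float(sensitivity), float(false_positive_rate), float(precision))
print(bayes_posterior == precision)
```

Both routes give 1/10, so reading Precision directly from the table and applying Bayes' theorem are two views of the same conditional probability.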

5. What is Bias and Variance in a Machine Learning Model? (5 Marks)


Answer:
Bias and Variance are two fundamental sources of error in machine learning models that
contribute to the model's inability to make perfect predictions.
Bias:
●​ Definition: Bias is the error that arises from the model's overly simplistic
assumptions about the underlying patterns in the training data. A high-bias model
is too rigid and fails to capture the complex relationships between the features and
the target variable.
●​ Analogy: It is like consistently aiming too far to the left when shooting at a target.
You are systematically wrong.
●​ Consequence: This leads to underfitting. The model performs poorly on the
training data and also generalizes poorly to new, unseen data. It is not flexible
enough to learn the data.
Variance:
●​ Definition: Variance is the error that arises from the model's excessive sensitivity
to small fluctuations in the training data. A high-variance model learns the noise
and random fluctuations in the training set as if they were important concepts.
●​ Analogy: It is like being very inconsistent with your aim, scattering shots all
around the target. You are unpredictable.
●​ Consequence: This leads to overfitting. The model performs exceptionally well
on the training data but fails to generalize to new data because it has learned the
training data "by heart," including its irrelevant details.
6. What is the trade-off between Bias and Variance? (5 Marks)
Answer:
The Bias-Variance Trade-off describes the inverse relationship between these two types of
error. In general, it is impossible to simultaneously reduce both bias and variance to zero.
Decreasing one will typically increase the other. The goal of model training is to find the
optimal balance that minimizes the total error.
●​ High Bias, Low Variance: The model is too simple (underfit). It makes consistent
but systematically inaccurate predictions. It fails on both training and test data.
●​ Low Bias, High Variance: The model is too complex (overfit). It learns the training
data perfectly, including its noise, so its predictions are inconsistent and unreliable
on new data.

The Trade-off:​
As we increase a model's complexity (e.g., by adding more features or using a more
powerful algorithm), the model's bias decreases (it can capture more complex patterns),
but its variance increases (it becomes more sensitive to the specific training data).
Conversely, if we simplify the model, its variance decreases (it becomes more stable), but
its bias increases (it may miss important patterns).
The optimal model is found at the point where the total error (squared bias plus variance, plus
irreducible noise that no model can remove) is minimized, leading to the best generalization
performance on unseen data. Finding this balance is a fundamental challenge in machine learning.
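The trade-off can be made concrete with a small simulation. This is only a sketch under stated assumptions: the "true" relationship y = x^2, the noise level, and the two models (one that always predicts the mean of y, one that copies the nearest training point) are all hypothetical choices made for illustration.

```python
import random
import statistics

def true_f(x):
    return x ** 2  # hypothetical ground-truth relationship

def simulate(n_trials=500, noise_sd=0.3, x0=0.9, seed=0):
    """Estimate squared bias and variance of two models at the point x0."""
    rng = random.Random(seed)
    xs = [i / 10 for i in range(11)]  # fixed inputs 0.0, 0.1, ..., 1.0
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    simple_preds, flexible_preds = [], []
    for _ in range(n_trials):
        # Each trial draws a fresh noisy training set.
        ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
        # Very simple model: ignore x and always predict the mean of y.
        simple_preds.append(statistics.fmean(ys))
        # Very flexible model: copy the y of the single nearest training point.
        flexible_preds.append(ys[nearest])
    results = {}
    for name, preds in (("simple", simple_preds), ("flexible", flexible_preds)):
        bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
        variance = statistics.pvariance(preds)
        results[name] = (bias_sq, variance)
    return results

results = simulate()
for name, (bias_sq, variance) in results.items():
    print(f"{name:9s} bias^2 = {bias_sq:.4f}  variance = {variance:.4f}")
```

The simple model should show the larger squared bias and the flexible model the larger variance, which is the pattern described above.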
Clustering
What is Clustering? Explain its key characteristics, common applications,
and how it differs from Classification.
Answer:
Clustering is an unsupervised machine learning technique used to group a set of objects in such
a way that objects in the same group (called a cluster) are more similar to each other than to
those in other groups. The goal is to discover the inherent, natural grouping within the data
without any prior knowledge or labels.
Key Characteristics of Clustering:
1.​ Unsupervised Learning: It does not use labeled output data. The algorithm interprets
the input data alone and finds patterns based on the features provided.
2.​ Exploratory Data Analysis: It is primarily used for discovering hidden patterns and
structures in data, making it a key tool for data mining.
3.​ Intra-cluster Similarity: Objects within a cluster are as similar as possible.
4.​ Inter-cluster Dissimilarity: Objects from different clusters are as different as
possible.

Common Applications of Clustering:


●​ Customer Segmentation: Grouping customers based on purchasing behavior,
demographics, or interests for targeted marketing. For example, an e-commerce
company might cluster users into "bargain hunters," "premium shoppers," and
"occasional buyers."
●​ Document Clustering: Grouping news articles or web documents by topic without
knowing the topics in advance.
●​ Image Segmentation: Grouping pixels in an image into regions to identify objects or
boundaries.
●​ Anomaly Detection: Identifying unusual data points that do not fit well into any
cluster, which can be used for fraud detection in credit card transactions or network
security.
●​ Biology: Grouping genes with similar expression patterns to understand their functions.

Difference between Clustering and Classification:


| Points | Clustering | Classification |
|---|---|---|
| Type of Learning | Unsupervised | Supervised |
| Use of Labels | Does not use labeled data. The algorithm finds the labels. | Requires a pre-labeled training dataset to learn from. |
| Objective | To find inherent structures and group similar data points. | To learn a mapping function from inputs to a pre-defined output label. |
| Nature | Exploratory | Predictive |
| Example | Grouping customers into segments without knowing the segments beforehand. | Predicting if an email is "spam" or "not spam" based on past examples. |

Explain the k-means clustering algorithm in detail, including its steps, advantages, and disadvantages.
Answer:
K-means clustering is one of the most popular and simple centroid-based clustering algorithms.
The goal is to partition 'n' data points into 'k' clusters, where each data point belongs to the
cluster with the nearest mean (centroid).
Key Assumption: The number of clusters, 'k', is known beforehand.

The K-means Algorithm Steps:


1.​ Initialization: Randomly select 'k' data points from the dataset to serve as the initial
centroids (cluster centers).
2.​ Assignment Step: For each data point in the dataset:
○​ Calculate the distance (usually Euclidean distance) between the data point and
each of the 'k' centroids.
○​ Assign the data point to the cluster whose centroid is the closest (most similar).
3.​ Update Step: For each of the 'k' clusters:
○​ Calculate the new centroid (mean) of the data points currently assigned to that
cluster. This mean becomes the new cluster center.
4.​ Iteration: Repeat steps 2 and 3 until a stopping condition is met. Common stopping
conditions are:
○​ The centroids no longer change significantly (the algorithm has converged).
○​ A predetermined number of iterations has been reached.
○​ The data points do not change clusters.
Example:​
Imagine we have the heights and weights of a group of people and we want to group them into 3
clusters (k=3) for a sports event.
1.​ Initialize: We randomly pick 3 people's (height, weight) as our initial centroids.
2.​ Assign: We look at every other person and assign them to the closest of these 3
centroids based on their own height and weight. This forms 3 initial, rough groups.
3.​ Update: For each of the 3 groups, we calculate the average height and average weight.
These averages become our new centroids.
4.​ Iterate: We repeat the assignment step. Now, people are assigned to these new, more
accurate centroids. Some people might change groups. We then update the centroids
again. We continue this until the groups stop changing.
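The four steps can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the function name `kmeans`, the fixed random seed, and the small 2-D dataset with k=2 are assumptions made here, and an empty cluster simply keeps its previous centroid.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of coordinate tuples into k groups (basic k-means)."""
    rng = random.Random(seed)
    # Step 1 (Initialization): pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2 (Assignment): each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4 (Stopping): no point changed cluster, so the result is stable.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3 (Update): move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, labels

# Hypothetical data: two well-separated groups of 2-D points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Which cluster receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.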
Advantages of K-means:
●​ Simple and Easy to Implement: The algorithm is conceptually straightforward and easy
to understand.
●​ Efficient and Fast: It is computationally efficient and works well with large datasets.
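● Guarantees Convergence: The algorithm always converges in a finite number of iterations, although only to a local optimum, which is not necessarily the best possible clustering.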
Disadvantages of K-means:
●​ Pre-specification of 'k': The user must specify the number of clusters 'k' in advance,
which is often not known.
●​ Sensitive to Initialization: The final clusters can be highly dependent on the initial
random centroids, leading to sub-optimal results.
●​ Sensitive to Outliers: Since it uses the mean, outliers can significantly distort the
centroid's position.
●​ Assumes Spherical Clusters: It works best when clusters are spherical (circular in 2D)
and of roughly similar size. It performs poorly with complex, non-spherical cluster
shapes.

Describe Hierarchical Clustering. Explain its two main types with a suitable
example.
Answer:
Hierarchical Clustering is an algorithm that builds a hierarchy of clusters. The output is
typically a tree-like diagram called a dendrogram that shows the nested grouping of data points
and the similarity levels at which groupings change. Unlike K-means, it does not require the
number of clusters 'k' to be specified in advance.
Main Approaches:
There are two main types of Hierarchical Clustering:
1.​ Agglomerative (Bottom-Up) Approach:​
This is the most common approach. It starts by treating each data point as a single
cluster. Then, it repeatedly merges the closest pairs of clusters until all data points are in
a single cluster.​
Steps for Agglomerative Hierarchical Clustering:
○​ Step 1: Start by considering each data point as an individual cluster. So, if you
have 'n' data points, you have 'n' clusters.
○​ Step 2: Compute the proximity matrix (a matrix of distances between every pair
of clusters).
○​ Step 3: Find the two closest clusters and merge them into a single cluster.
○​ Step 4: Update the proximity matrix to reflect the new distances between the new
cluster and the original clusters.
○​ Step 5: Repeat steps 3 and 4 until only one single cluster remains.
2.​ Divisive (Top-Down) Approach:​
This is the reverse of the Agglomerative approach. It starts with all data points in one
single cluster and then recursively splits the clusters until each data point is in its own
singleton cluster.

Key Concept: Linkage Criteria​


When merging clusters, how do we define the "distance" between two clusters that contain
multiple points? This is defined by the linkage criterion.
●​ Single Linkage: Distance between two clusters is the distance between their two
closest members. Can produce long, "chain-like" clusters.
●​ Complete Linkage: Distance between two clusters is the distance between their two
farthest members. Tends to find compact, spherical clusters.
●​ Average Linkage: Distance between two clusters is the average distance between every
pair of members in the two clusters. A balanced approach.
The Dendrogram:​
The dendrogram is the key output. The vertical axis represents the distance (or dissimilarity) at
which clusters are merged. By drawing a horizontal line across the dendrogram, one can get the
clusters for any desired number of clusters.
Example:​
Imagine an online bookstore wants to understand its customers.
●​ Agglomerative Approach:
1.​ Start with each customer as their own cluster.
2.​ Find the two most similar customers (e.g., both who love sci-fi and mystery) and
merge them.
3.​ Then, find the next two most similar customers or clusters (e.g., a new customer
who loves romance is similar to the existing romance-book-buyer cluster) and
merge them.
4.​ Continue this process, building a hierarchy. The final dendrogram might show a
large branch for "Fiction Readers" which splits into "Sci-Fi/Fantasy" and
"Mystery/Thriller," and another large branch for "Non-Fiction Readers."
●​ Using the Dendrogram: The bookstore can decide how many segments they want.
Cutting the dendrogram at a high distance might give 2 clusters (Fiction/Non-Fiction).
Cutting it at a lower distance might give 5 more specific clusters.
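The agglomerative steps and the three linkage criteria can be sketched as below. This is a simple illustration under stated assumptions: the function names and the toy points are hypothetical, cluster distances are recomputed on every pass instead of updating a proximity matrix, and stopping at `n_clusters` plays the role of cutting the dendrogram.

```python
import math
from itertools import combinations

def cluster_distance(a, b, linkage):
    """Distance between two clusters under the chosen linkage criterion."""
    pairwise = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerative(points, n_clusters, linkage="single"):
    clusters = [[p] for p in points]  # Step 1: every point is its own cluster
    merge_distances = []              # heights at which merges happen
    while len(clusters) > n_clusters:
        # Steps 2-3: find the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(
                clusters[pair[0]], clusters[pair[1]], linkage),
        )
        merge_distances.append(
            cluster_distance(clusters[i], clusters[j], linkage))
        merged = clusters[i] + clusters[j]
        # Step 4: replace the two clusters with their union.
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return clusters, merge_distances

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]  # hypothetical data
clusters, heights = agglomerative(points, n_clusters=2, linkage="single")
print(clusters)
print(heights)
```

With these points the isolated point (20, 20) stays alone while the other four are merged, and the recorded merge distances correspond to the heights that a dendrogram would show.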

Previous year solved:


VSA (1 Mark Questions)
1. Merge of cluster is used in ________ clustering. (2025)
Answer: Agglomerative Hierarchical
Explanation:
Agglomerative Hierarchical Clustering is a bottom-up approach. It begins by treating each data
point as its own individual cluster. Then, it repeatedly identifies the two closest clusters and
merges them together. This process of merging continues until all data points have been
combined into a single, encompassing cluster.

2. In presence of outlier, SAE (Sum Absolute Error) for centroid calculation is good for cluster demerit. (2023)
Answer: False
Explanation:
The statement is incorrect. The key merit of using Sum Absolute Error (SAE) is that it is robust
to outliers. In clustering, the centroid in K-means is calculated using the mean, which
minimizes the Sum of Squared Errors (SSE). The mean is highly sensitive to outliers. If we
instead used a metric like SAE (Sum of Absolute Errors), the optimal center would be the
median, which is known to be resistant to outliers. Therefore, using SAE would actually be a
good strategy (a merit) in the presence of outliers, not a demerit.
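The claim that the mean minimizes SSE while the median minimizes SAE can be checked numerically. The values below are hypothetical: a 1-D cluster with a single outlier.

```python
import statistics

values = [1, 2, 3, 4, 100]  # hypothetical 1-D cluster with one outlier

def sse(center):
    """Sum of Squared Errors around a candidate center."""
    return sum((v - center) ** 2 for v in values)

def sae(center):
    """Sum of Absolute Errors around a candidate center."""
    return sum(abs(v - center) for v in values)

mean = statistics.mean(values)
median = statistics.median(values)

print("mean:", mean, "SSE:", sse(mean), "SAE:", sae(mean))
print("median:", median, "SSE:", sse(median), "SAE:", sae(median))
```

The outlier drags the mean far from the bulk of the data, while the median stays among the typical values and gives the smaller SAE.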

3. Which one of the following is better proximity measure in clustering. (2023)


Answer: (The specific options were not provided, but the most robust and commonly
recommended measure is Manhattan Distance for high-dimensional or outlier-prone data,
while Euclidean Distance is most common for low-dimensional, dense data.)
Explanation:
The better proximity measure depends on the data and the problem.
-Euclidean Distance: This is the most common (as the crow flies) distance. It works well for
low-dimensional, dense data where the geometry is isotropic (same in all directions). However,
it is highly sensitive to outliers.
-Manhattan Distance: This is the sum of absolute differences along each axis. It is more
robust to outliers than Euclidean distance and is often better for high-dimensional data.
-Cosine Similarity: This measures the angle between two vectors, ignoring their magnitude.
It is better for text data or when the direction is more important than the magnitude.

Without specific options, a general principle is that Manhattan Distance is often a better
proximity measure when robustness to outliers is a concern.
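The difference between a distance and an angle-based similarity shows up when two vectors point in the same direction but have different magnitudes. The sketch below uses hypothetical vectors and helper names chosen for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors: b is simply a scaled by 2 (same direction).
a = (1, 2, 3)
b = (2, 4, 6)

print("Euclidean:", euclidean(a, b))
print("Manhattan:", manhattan(a, b))
print("Cosine similarity:", cosine_similarity(a, b))
```

Euclidean and Manhattan distance both report the two vectors as clearly apart, while cosine similarity treats them as essentially identical because only the direction is compared.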

SA (5 Marks Question)
4. Point out difference between K-means and k-medoid algorithm. (2025)
Answer:
K-means and K-medoids are both partition-based clustering algorithms, but they differ
fundamentally in how they define the cluster center and how they handle outliers.

| Feature | K-means Algorithm | K-medoids Algorithm |
|---|---|---|
| Cluster Center | The center is the mean (average) of all data points in the cluster. This is a calculated, virtual point that may not exist in the dataset. | The center is the medoid, which is the most centrally located actual data point in the cluster. |
| Objective Function | Minimizes the Sum of Squared Errors (SSE), i.e., the sum of squared distances between points and their cluster centroid. | Minimizes the Sum of Absolute Errors (SAE) or the sum of dissimilarities between points and their cluster medoid. |
| Handling Outliers | Highly sensitive to outliers. Since the mean is influenced by extreme values, a single outlier can drastically shift the centroid's position. | Robust to outliers. Because the medoid is an actual data point, it is not disproportionately influenced by a single extreme value. |
| Algorithm Complexity | Generally faster and computationally less expensive. | Generally slower because it requires more comparisons to find the actual data point that minimizes distance. |
| Common Algorithm | The standard Lloyd's algorithm. | PAM (Partitioning Around Medoids) is the most common K-medoids algorithm. |
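
The difference in cluster centers can be seen on a single cluster that contains an outlier. This sketch computes only the two kinds of center; it is not a full K-medoids (PAM) implementation, and the points and function names are hypothetical.

```python
import math

def centroid(points):
    """K-means style center: the coordinate-wise mean (may not be a data point)."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def medoid(points):
    """K-medoids style center: the data point with the smallest total distance
    to all other points in the cluster."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Hypothetical cluster with one extreme value.
cluster = [(1, 1), (2, 2), (3, 3), (4, 4), (100, 100)]

print("centroid:", centroid(cluster))
print("medoid:  ", medoid(cluster))
```

The centroid lands at a position where no data point exists and is pulled toward the outlier, whereas the medoid remains one of the typical points.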
LA (5 Marks Question)
5. What is the Minkowski distance metric? (2023)
Answer:
The Minkowski distance is a generalized metric for measuring the distance between two points
in a normed vector space. It is a generalization of other common distance measures like the
Manhattan and Euclidean distances.

Mathematical Formulation:​
For two points P = (x1, x2, ..., xn) and Q = (y1, y2, ..., yn) in an n-dimensional space, the
Minkowski distance of order p (where p is a real number with p ≥ 1) is defined as:
D(P, Q) = ( Σ from i=1 to n |xi - yi|^p )^(1/p)
Here, p is a parameter called the order or exponent.

Special Cases Derived from Minkowski Distance:​


The power of the Minkowski distance is that it can represent several other distance metrics by
simply changing the value of p.
●​ When p = 1:​
The formula becomes D(P, Q) = Σ from i=1 to n |xi - yi|.​
This is the Manhattan Distance (or City Block distance). It is the sum of the absolute
differences along each dimension.
●​ When p = 2:​
The formula becomes D(P, Q) = √( Σ from i=1 to n (xi - yi)^2 ).​
This is the familiar Euclidean Distance. It represents the shortest straight-line distance
between two points.
●​ When p approaches infinity:​
The Minkowski distance approaches the Chebyshev Distance. It is the maximum of the
absolute differences along any single dimension: D(P, Q) = max_i (|xi - yi|).

Interpretation and Use:


●​ The parameter p controls the sensitivity of the distance to differences along individual
dimensions.
●​ A lower value of p (like 1) makes the distance less sensitive to a single large difference in
one dimension (more robust to outliers).
●​ A higher value of p (like 2 or higher) makes the distance more sensitive to large
differences in any single dimension.
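The special cases can be verified with a direct implementation of the formula. This is a minimal sketch; the function name `minkowski` and the example points are assumptions, and passing `math.inf` as the order is handled separately as the Chebyshev limit.

```python
import math

def minkowski(point_a, point_b, p):
    """Minkowski distance of order p between two equal-length points."""
    diffs = [abs(x - y) for x, y in zip(point_a, point_b)]
    if math.isinf(p):
        return max(diffs)  # Chebyshev distance (limit as p grows)
    return sum(d ** p for d in diffs) ** (1 / p)

P = (0, 0)
Q = (3, 4)

print("p = 1   (Manhattan):", minkowski(P, Q, 1))
print("p = 2   (Euclidean):", minkowski(P, Q, 2))
print("p = 10             :", minkowski(P, Q, 10))
print("p = inf (Chebyshev):", minkowski(P, Q, math.inf))
```

As p increases, the result moves from the sum of the differences (7) toward the single largest difference (4), which is the sense in which larger p emphasizes the dominant dimension.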
Decision tree and kNN
Explain the Decision Tree algorithm for classification. Describe its building
process and key terminology with a suitable example.
Answer:
A Decision Tree is a supervised machine learning algorithm used for both classification and
regression tasks. For classification, it builds a tree-like model to predict the class of a data point
by learning simple decision rules inferred from the data features.
Key Terminology:
●​ Root Node: The topmost node that represents the entire dataset. It is the first feature
used for splitting.
●​ Internal Node: A node that represents a decision point on a feature, splitting the data
further.
●​ Leaf Node (Terminal Node): The final node that provides the predicted class label.
●​ Splitting: The process of dividing a node into two or more sub-nodes based on a
condition on a feature.
●​ Branch/Sub-Tree: A section of the entire tree.
●​ Pruning: The process of removing sub-nodes to reduce the complexity of the tree and
prevent overfitting.

The Building Process (Tree Induction):


The tree is built in a top-down, greedy manner through recursive partitioning.
1.​ Start at the Root Node: Begin with the entire training dataset.
2.​ Feature Selection: Choose the "best" feature to split the data. This is done using a
metric that measures the impurity or purity of the resulting nodes. Common metrics are:
○ Gini Impurity: Measures the probability of incorrectly classifying a randomly
chosen element if it were labeled at random according to the class distribution in the node. A Gini impurity of 0 denotes perfect purity.
○​ Information Gain: Based on the concept of Entropy from information theory.
Entropy measures the disorder or uncertainty in a node. The split that results in
the largest decrease in entropy (highest information gain) is chosen.
3.​ Split the Dataset: The dataset at the node is split into subsets based on the possible
values of the selected feature (e.g., for "Age", splits could be "Age < 30" and "Age >=
30").
4.​ Generate Child Nodes: For each subset created by the split, a new child node is
created.
5.​ Recurse: Repeat steps 2-4 for each child node, using only the data that has reached that
node. This process continues until a stopping condition is met.
Stopping Conditions:
●​ All samples in a node belong to the same class.
●​ A predefined maximum tree depth is reached.
●​ The number of samples in a node falls below a minimum threshold.

Example: Predicting if a person will buy a computer.


Imagine a dataset with features like Age, Income, Student, and Credit Rating.
The target is "Buys Computer" (Yes or No).
1.​ The algorithm might find that the feature "Age" gives the highest information gain. The
root node splits the data into "Age <= 30" and "Age > 30".
2.​ For the "Age <= 30" branch, it might then find that "Student" is the next best feature,
splitting into "Student=Yes" and "Student=No".
3.​ The "Student=Yes" node might now contain only data points where the class is "Buys
Computer=Yes". This becomes a leaf node predicting "Yes".
4.​ Similarly, the other branches are split until all data is classified into pure leaf nodes or a
stopping condition is met.
The final tree provides a set of interpretable rules, such as: "IF Age <= 30 AND Student = Yes
THEN Buys Computer = Yes".
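The splitting metrics can be computed directly from the class labels in a node. The sketch below is illustrative only: the six "Buys Computer" labels and the two candidate splits are hypothetical, not taken from a real dataset.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with 3 "Yes" and 3 "No" buyers.
parent = ["Yes", "Yes", "Yes", "No", "No", "No"]

# Candidate split A separates the classes perfectly; split B barely helps.
split_a = [["Yes", "Yes", "Yes"], ["No", "No", "No"]]
split_b = [["Yes", "No", "Yes"], ["No", "Yes", "No"]]

print("Gini(parent)    =", gini(parent))
print("Entropy(parent) =", entropy(parent))
print("Gain(split A)   =", information_gain(parent, split_a))
print("Gain(split B)   =", information_gain(parent, split_b))
```

The tree-building procedure would choose split A here, because it removes all of the uncertainty in the parent node.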

What are Regression Trees? How do they differ from Classification Trees? Explain
the concept of tree pruning.
Answer:
Regression Trees:
While classification trees predict a categorical class label, regression trees predict a continuous
value. The fundamental building process is similar, but the criteria for splitting nodes and the
prediction at the leaf nodes are different.
●​ Splitting Criterion: Instead of using Gini Impurity or Information Gain, regression
trees typically use metrics like Mean Squared Error (MSE) or Mean Absolute Error
(MAE). The algorithm chooses the split that minimizes the variance (or error) in the
child nodes.
●​ Prediction at Leaf Node: In a classification tree, the leaf node predicts the mode
(most common) class. In a regression tree, the leaf node predicts the mean (or median)
value of the target variable for all training data points that reach that leaf.

Difference between Classification and Regression Trees:


| Feature | Classification Tree | Regression Tree |
|---|---|---|
| Prediction Type | Categorical Class (e.g., Yes/No) | Continuous Value (e.g., Price, Salary) |
| Splitting Criterion | Gini Impurity, Information Gain | Variance Reduction, MSE, MAE |
| Leaf Node Output | Mode (Most frequent class) of the data points in the leaf. | Mean (or Median) value of the data points in the leaf. |

Example of a Regression Tree:​


Predicting the price of a house.
●​ A split could be: "IF Square Footage > 2000".
●​ The data is divided. The left child node (Square Footage <= 2000) might have an average
house price of $300,000.
●​ The right child node (Square Footage > 2000) might have an average price of $500,000.
●​ A new house with 1500 sq. ft. would be directed to the left node and be predicted to have
a price of $300,000.
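A single regression-tree split on one feature can be sketched as follows. The house sizes and prices (in thousands of dollars) are hypothetical numbers chosen to match the example, and the function names are assumptions; a real tree would repeat this search over every feature and every node.

```python
def sse(values):
    """Sum of squared deviations from the mean (n times the variance)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing total squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold,
                    sum(left) / len(left), sum(right) / len(right))
    return best[1:]

# Hypothetical training data: square footage and price in thousands of dollars.
square_feet = [1200, 1500, 1800, 2200, 2600, 3000]
price_k = [290, 300, 310, 490, 500, 510]

threshold, left_mean, right_mean = best_split(square_feet, price_k)

def predict(sqft):
    # Each leaf predicts the mean price of the training houses it contains.
    return left_mean if sqft <= threshold else right_mean

print(threshold, left_mean, right_mean)
print(predict(1500))
```

The chosen threshold falls between the two price groups, and a 1500 sq. ft. house is routed to the left leaf and receives that leaf's mean price.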
Tree Pruning:
Pruning is a technique used to reduce the size of a decision tree to prevent overfitting. An
overfit tree learns the training data, including its noise and outliers, too well, leading to poor
performance on new, unseen data.
●​ What is it? Pruning involves removing parts of the tree that provide little power in
classifying or predicting instances. It simplifies the model.
●​ Why is it needed? A very deep, complex tree will have low bias but high variance
(overfitting). Pruning helps to find a tree with an optimal balance between bias and
variance.
●​ How it works (Cost Complexity Pruning):
1.​ Grow the tree to its maximum depth, allowing it to overfit.
2.​ A complexity parameter (alpha) is introduced that penalizes the tree for having
too many leaves.
3.​ For each possible value of alpha, sub-trees are pruned (replaced with a leaf node)
if the overall cost (error + complexity penalty) is reduced.
4.​ The optimally pruned tree is the one that minimizes this cost-complexity measure,
typically found via cross-validation.
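The effect of the complexity parameter can be illustrated by comparing two candidate trees under the cost "error + alpha × number of leaves". The error rates and leaf counts below are hypothetical numbers chosen for illustration, and the helper names are assumptions.

```python
def cost_complexity(error, n_leaves, alpha):
    """Penalized cost: training error plus alpha times the number of leaves."""
    return error + alpha * n_leaves

# Hypothetical candidates: a large tree and a pruned version of it.
candidates = {
    "full tree": {"error": 0.05, "n_leaves": 10},
    "pruned tree": {"error": 0.10, "n_leaves": 3},
}

def preferred(alpha):
    """Name of the candidate with the lowest cost-complexity for this alpha."""
    return min(
        candidates,
        key=lambda name: cost_complexity(
            candidates[name]["error"], candidates[name]["n_leaves"], alpha),
    )

for alpha in (0.001, 0.01):
    print(f"alpha = {alpha}: prefer the {preferred(alpha)}")
```

With a small alpha the extra leaves are cheap and the full tree wins; with a larger alpha the penalty outweighs the gain in accuracy and the pruned tree is preferred.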

Describe the Random Forest algorithm. Why is it called an "ensemble" method, and how does it improve upon a single Decision Tree?
Answer:
Random Forest is a powerful ensemble learning method primarily used for classification and
regression. It operates by constructing a multitude of decision trees at training time and
outputting the mode of the classes (for classification) or mean prediction (for regression) of the
individual trees.
Why is it an "Ensemble" Method?
It is called an ensemble method because it combines multiple models (in this case, multiple
decision trees) to create a single, more robust and accurate model. The core idea is that a group
of "weak learners" (individual trees) can come together to form a "strong learner."

The Algorithm Steps:


1.​ Bagging (Bootstrap Aggregating):
○​ From the original training dataset of size N, multiple new training sets are created
by randomly sampling N instances with replacement. This means some instances
may be repeated, and others may be left out in each sample. These are called
bootstrap samples.
2.​ Random Feature Selection:
○​ For each bootstrap sample, a decision tree is grown. However, when splitting a
node, the algorithm is not allowed to choose from all features. Instead, it
randomly selects a subset of features (e.g., the square root of the total number of
features) and finds the best split only within this random subset.
3.​ Voting/Averaging:
○​ Each tree in the forest makes its own prediction.
○​ For a classification task, the final prediction is the class that gets the majority vote
from all the trees.
○​ For a regression task, the final prediction is the average of the predictions from all
the trees.
How Random Forest Improves upon a Single Decision Tree:
1.​ Reduces Overfitting (Lowers Variance): A single decision tree is prone to
overfitting the training data. By averaging the results of many trees, each trained on a
slightly different dataset and using different features, Random Forest cancels out the
noise and reduces variance, leading to better generalization.
2.​ Increases Robustness and Accuracy: The ensemble is less sensitive to the
specificities of the training data. While one tree might make an error, it is unlikely that a
majority of the trees will make the same error on the same data point.
3. Less Dependent on Any Single Data Point: Because of bagging, no tree is trained on
every data point, so noisy or unusual records affect only some of the trees. Handling of actual missing values still depends on the implementation.
Example:​
Imagine building a Random Forest to diagnose a disease.
●​ One tree might be trained on a sample where it focuses on symptoms like fever and
cough.
●​ Another tree might be trained on a different sample and focus on age and blood pressure.
●​ A new patient's data is run through all trees. Some trees may predict "Healthy," others
"Diseased." The final diagnosis is the majority vote. This collective decision is typically
more accurate and stable than relying on a single doctor's (or a single tree's) opinion.
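The three ingredients of the algorithm (bootstrap sampling, random feature selection and voting) can be sketched separately. This is not a complete Random Forest: no trees are actually trained, the votes are made-up placeholders, and the feature names and helper functions are hypothetical.

```python
import math
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement (a bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(features, rng):
    """Pick about sqrt(number of features) candidate features for one split."""
    k = max(1, int(math.sqrt(len(features))))
    return rng.sample(features, k)

def majority_vote(predictions):
    """Final classification: the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
rows = list(range(10))  # stand-ins for 10 training records
features = ["fever", "cough", "age", "blood_pressure"]

sample = bootstrap_sample(rows, rng)
out_of_bag = sorted(set(rows) - set(sample))  # records this tree never sees
subset = random_feature_subset(features, rng)

# Placeholder predictions from five hypothetical trees for one patient.
tree_votes = ["Diseased", "Healthy", "Diseased", "Diseased", "Healthy"]

print("bootstrap sample:", sample)
print("out-of-bag rows: ", out_of_bag)
print("feature subset:  ", subset)
print("final diagnosis: ", majority_vote(tree_votes))
```

Because sampling is done with replacement, some records appear more than once in a sample and others are left out, which is what makes each tree see a slightly different dataset.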

Explain the k-Nearest Neighbors (kNN) algorithm for both classification and
regression. (10 Marks)
Answer:
k-Nearest Neighbors (kNN) is a simple, instance-based supervised learning algorithm used for
both classification and regression. It is a non-parametric method, meaning it does not make
strong assumptions about the underlying data distribution.
Core Principle: The value or class of a data point is determined by the data points closest to it in
the feature space. The "k" in kNN is a user-defined constant representing the number of nearest
neighbors to consider.
The Algorithm Steps:
1.​ Choose the value of k: Select the number of neighbors (k). A small k (e.g., 1, 3) is
sensitive to noise, while a large k can smooth over decision boundaries.
2. Calculate Distance: For a new, unlabeled data point, calculate the distance between
this point and every point in the training dataset. Common distance metrics are
Euclidean distance and Manhattan distance.
3.​ Find Nearest Neighbors: Identify the 'k' training data points that are closest to the
new point.
4.​ Make a Prediction:
○​ For Classification: The predicted class for the new point is the most frequent
class (the mode) among its k-nearest neighbors.
○​ For Regression: The predicted value for the new point is the average (or
median) of the target values of its k-nearest neighbors.

Example for Classification:​


Predicting the type of fruit (Apple or Orange) based on its weight and color.
●​ We have a training set of known fruits with their weights and colors.
●​ A new, unknown fruit arrives. The algorithm calculates the distance from this new fruit
to all fruits in the training set.
●​ Let k=3. It finds the 3 closest fruits.
●​ If 2 of them are Apples and 1 is an Orange, the algorithm classifies the new fruit as an
Apple.
Example for Regression:​
Predicting the price of a house based on its size and number of bedrooms.
●​ The training data has houses with known sizes, bedrooms, and prices.
●​ For a new house, the algorithm finds the k=5 most similar houses (in terms of size and
bedrooms) from the training data.
●​ The predicted price for the new house is the average price of those 5 nearest neighbor
houses.

Explain the concept of Weighted k-Nearest Neighbors (kNN). How does it differ
from standard kNN, and what are its advantages? Provide a detailed example. (10
Marks)
Weighted k-Nearest Neighbors (Weighted kNN) is a sophisticated variant of the standard kNN
algorithm that addresses a key limitation: the assumption that all neighbors contribute equally
to the prediction. In reality, a closer neighbor is likely to be more similar to the query point than
a farther one. Weighted kNN incorporates this intuition by assigning a weight to each of the k
neighbors, where the weight is a function of the distance.
Core Principle:​
The fundamental idea is that not all votes are equal. The influence of a neighbor on the final
prediction is proportional to its proximity to the data point being classified or for which a value
is being predicted. Closer neighbors have a stronger say in the outcome than more distant ones.

How it Works:​
The process is identical to standard kNN for the first three steps, diverging only at the final
prediction step.
1.​ Select k: Choose a value for k (the number of neighbors to consider).
2.​ Calculate Distance: For a new query point, calculate the distance to every point in the
training set using a metric like Euclidean or Manhattan distance.
3.​ Identify k-Nearest Neighbors: Find the k training points with the smallest distances
to the query point.
4.​ Assign Weights and Predict: This is the crucial difference.
○​ Each of the k neighbors is assigned a weight. The weight is inversely proportional
to its distance from the query point. A common weighting function is the inverse
distance: weight_i = 1 / (distance_i). To avoid division by zero, it is often modified
to weight_i = 1 / (distance_i + ε), where ε is a very small constant.
○ Other common weighting functions include the inverse squared distance
(weight_i = 1 / (distance_i)^2), which penalizes farther points more heavily, and a
Gaussian function of the distance.

Making the Prediction:


●​ For Classification (Weighted Voting):​
Instead of a simple majority vote, a weighted vote is conducted. The predicted class is the
one with the highest total weight from the k neighbors.​
Predicted Class = argmax over classes c of ( Σ weight_i over the neighbors i that belong to class c )
●​ For Regression (Weighted Average):​
Instead of a simple average, a weighted average is calculated. The predicted value is the
sum of the target values of the k neighbors, each multiplied by their weight, divided by
the sum of all weights.​
Predicted Value = (Σ (weight_i * target_value_i)) / (Σ weight_i)
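As a worked example of the two formulas, the sketch below uses inverse-distance weights with a small ε. The one-dimensional data, the function names, and the value of `EPS` are assumptions chosen so the arithmetic is easy to follow by hand.

```python
import math
from collections import defaultdict

EPS = 1e-9  # the small constant ε that avoids division by zero

def weighted_neighbors(train, query, k):
    """Return (weight, target) pairs for the k nearest points, weight = 1 / (d + EPS)."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return [(1.0 / (math.dist(x, query) + EPS), y) for x, y in nearest]

def weighted_knn_classify(train, query, k):
    """Weighted vote: the class with the largest total weight wins."""
    totals = defaultdict(float)
    for weight, label in weighted_neighbors(train, query, k):
        totals[label] += weight
    return max(totals, key=totals.get)

def weighted_knn_regress(train, query, k):
    """Weighted average of the neighbors' target values."""
    pairs = weighted_neighbors(train, query, k)
    return sum(w * y for w, y in pairs) / sum(w for w, _ in pairs)

query = (2.0,)

# One very close "A" (distance 1) and two farther "B" points (distances 2 and 3)
points = [((1.0,), "A"), ((4.0,), "B"), ((5.0,), "B")]
predicted_label = weighted_knn_classify(points, query, k=3)

# Same positions with numeric targets
values = [((1.0,), 10.0), ((4.0,), 20.0), ((5.0,), 30.0)]
predicted_value = weighted_knn_regress(values, query, k=3)

print(predicted_label, round(predicted_value, 2))
```

With k=3, standard kNN would predict "B" by a 2-to-1 vote. The weights are about 1, 0.5 and 0.33, so "A" has total weight 1 against about 0.83 for "B", and weighted kNN predicts "A". For regression the plain average is 20, while the weighted average is pulled toward the closest neighbor's value of 10.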

Advantages of Weighted kNN over Standard kNN:


1.​ Higher Accuracy: By giving more importance to closer neighbors, it often makes more
precise and accurate predictions that better reflect the local data structure.
2.​ Handles Crowding Effect: In dense regions of the feature space, a query point might
have many somewhat-distant neighbors. Standard kNN would let them all vote equally,
potentially drowning out the signal from the one or two very close neighbors. Weighted
kNN mitigates this.
3.​ Robustness to Noise: If the k neighbors include an outlier that is slightly farther away,
its lower weight will minimize its negative impact on the prediction. In standard kNN,
that outlier's vote counts as much as a very close, reliable neighbor.

Disadvantage:
●​ Computational Overhead: It requires the calculation of weights, adding a small but
non-zero computational cost compared to the standard algorithm.

Previous year solved:


VSA (1 Mark Questions)
1. k-NN is a ___ algorithm. (2025)
Answer: lazy learning
Explanation:
k-NN is known as a lazy learner (or instance-based learner) because it does not undergo a
traditional training phase to build a model. It simply stores the entire training dataset. All the
computation is deferred until the time of prediction, when it searches for the nearest neighbors
to the query point.

2. What will be the Euclidean distance between two data points A(2,5) and B(2,7)? (2025)
Answer: 2
Explanation:
The Euclidean distance between two points A(x1, y1) and B(x2, y2) is calculated as:
Distance = √[(x2 - x1)² + (y2 - y1)²]
For points A(2, 5) and B(2, 7):
Distance = √[(2 - 2)² + (7 - 5)²] = √[0 + 4] = √4 = 2
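The same arithmetic can be checked with a short Python snippet. The variable names are illustrative; `math.dist` is the standard-library function for Euclidean distance (Python 3.8 and later).

```python
import math

a = (2, 5)
b = (2, 7)

# Formula written out term by term
distance = math.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

# Same result from the standard library
distance_builtin = math.dist(a, b)

print(distance, distance_builtin)
```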

3. Which of the following will be true about k in k-NN in terms of Bias? (2025)
Answer: (The specific options were not provided, but the general principle is:)
As the value of k increases, the bias of the model increases.
Explanation:
In k-NN, the parameter k controls the complexity of the model.
- A small k (e.g., k=1) creates a complex model that closely fits the training data, resulting in low
bias but high variance.
- A large k makes the model smoother and simpler. The prediction is based on an average of
more points, which may miss finer patterns in the data. This results in higher bias but lower
variance.
4. What is not the reason for overfitting? (2025)
(A) noise and outliers
(B) too little training data
(C) local maxima
(D) None of these
Answer: (C) local maxima
Explanation:
- (A) noise and outliers: Models can learn noise and outliers as if they were true patterns,
leading to overfitting.
- (B) too little training data: With insufficient data, the model cannot learn the general underlying
distribution and may memorize the few examples it has.
- (C) local maxima: This is a problem associated with optimization algorithms (like in training
Neural Networks), not a direct cause of overfitting. A model getting stuck in a local maximum
might not even reach a good fit, let alone overfit. Therefore, it is not a standard reason for
overfitting.

5. In a decision tree, an internal node has ______. (2023)
Answer: branches (or splits)
Explanation:
In a decision tree, an internal node (or non-leaf node) represents a test on a specific feature.
Based on the outcome of this test, the data is split and passed down to its child nodes.
Therefore, an internal node must have branches (typically two or more) leading to its children.

SA (5 Marks Questions)
6. What is the role of noise in model underfitting and overfitting? (2023)
Noise refers to irrelevant information, random fluctuations, or errors within the dataset. Its role
differs in underfitting and overfitting:

1. Noise in Overfitting:
- Complex models (e.g., very deep decision trees, k-NN with a very small k) have a high
capacity to learn intricate details from the training data.
- If the training data contains noise, these complex models can learn this noise as if it were a
genuine pattern. They essentially "memorize" the random errors.
- This leads to excellent performance on the training data but poor performance on new,
unseen test data because the learned noise does not generalize. This is the hallmark of
overfitting.

2. Noise in Underfitting:
- Underfitting occurs when a model is too simple to capture the underlying trend in the data,
including the true signal.
- While a simple model is generally less likely to learn noise, the presence of noise can make
it even harder for the model to identify the true signal amidst the confusion.
- The model fails to learn adequately, performing poorly on both training and test data.
7. Briefly describe procedure of Decision tree construction. (2025)
The procedure for constructing a decision tree is a top-down, greedy, recursive partitioning
algorithm.
1. Start at the Root: Begin with the entire training dataset at the root node.
2. Select the Best Split: For all features, calculate a metric (like Information Gain or Gini Gain)
that quantifies how well each feature splits the data into pure subgroups. The feature that
provides the best (highest) gain is selected to split the node.
3. Split the Node: Partition the dataset at the current node into subsets based on the possible
values of the selected feature. This creates new child nodes.
4. Recurse: Repeat steps 2 and 3 for each newly created child node, using only the subset of
the data that has reached that node.
5. Stop when a Stopping Criterion is met: The recursion stops when one of the following
occurs:
- All instances in a node belong to the same class.
- A predefined maximum depth of the tree is reached.
- The number of instances in a node falls below a minimum threshold.
- The split does not lead to a significant information gain.
6. Assign Leaf Nodes: When a stopping criterion is met, the node is declared a leaf node. The
predicted label for this leaf is the majority class (for classification) or the mean value (for
regression) of the training instances in that node.
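The procedure can be sketched as a short recursive function. This is a simplified illustration under stated assumptions: features are categorical, Information Gain is the split metric, only three of the stopping criteria are implemented (pure node, depth limit, no positive gain), and the toy weather data and all names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Parent entropy minus the weighted entropy of the subsets after the split."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(label)
    after = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - after

def build_tree(rows, labels, features, depth=0, max_depth=3):
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: pure node, no features left, or depth limit reached
    if len(set(labels)) == 1 or not features or depth == max_depth:
        return majority  # leaf node labeled with the majority class
    # Greedy choice: the feature with the highest information gain
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    if information_gain(rows, labels, best) <= 0:
        return majority  # no split improves purity
    node = {"feature": best, "children": {}, "default": majority}
    remaining = [f for f in features if f != best]
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            remaining, depth + 1, max_depth)
    return node

def predict(tree, row):
    # Walk down from the root until a leaf (a plain label) is reached
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree

# Hypothetical toy data
rows = [{"outlook": "sunny", "windy": False}, {"outlook": "sunny", "windy": True},
        {"outlook": "overcast", "windy": False}, {"outlook": "rainy", "windy": False},
        {"outlook": "rainy", "windy": True}]
labels = ["no", "no", "yes", "yes", "no"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree["feature"], predict(tree, {"outlook": "rainy", "windy": True}))
```

On this data the root splits on "outlook", because it gives a higher gain than "windy". Only the "rainy" branch needs a second split.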

LA (5 - 10 Marks Questions)
8. Why is the k-NN algorithm known as the Lazy Learner algorithm? (5 Marks) (2025)
Answer:
1. No Explicit Training Phase: k-NN does not build a generalized model (like a tree or a
hyperplane) during the training phase. It performs little or no computation when
given the training data.
2.​ Model is the Data: The entire training dataset itself serves as the "model." It simply
stores all instances and their labels in memory.
3.​ Deferred Computation: All the heavy computation (calculating distances, searching for
neighbors) is deferred until a new, unknown data point needs to be classified or
predicted.
4.​ Local and On-Demand Learning: Learning is performed locally for every new query.
The algorithm only constructs a decision boundary or prediction at the moment a
prediction is requested.
5.​ Instance-Based Learning: It is an instance-based method, meaning it relies on
comparing a new instance directly to existing stored instances (neighbors) rather than
applying a pre-learned, generalized rule set.

9. Compute Information gain from the given table. Also compute the entropy for the
attribute Outlook. (5 Marks) (2025)
Answer:
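The dataset, summarized by the class counts for each Outlook value (the counts used in the calculation below):
- Sunny: 2 ‘+’, 3 ‘-’ (5 instances)
- Overcast: 4 ‘+’, 0 ‘-’ (4 instances)
- Rainy: 3 ‘+’, 2 ‘-’ (5 instances)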

Step 1: Calculate the Entropy of the Parent Node (the entire dataset).
- Total instances = 5 + 4 + 5 = 14
- Proportion of ‘+’ class, p(+) = (2+4+3)/14 = 9/14
- Proportion of ‘-’ class, p(-) = (3+0+2)/14 = 5/14
- Entropy(D) = - [p(+) * log₂(p(+)) + p(-) * log₂(p(-))]
Entropy(D) = - [(9/14) * log₂(9/14) + (5/14) * log₂(5/14)]
Entropy(D) = - [(0.643 * -0.637) + (0.357 * -1.485)]
Entropy(D) = - [(-0.410) + (-0.530)] = - [-0.940] ≈ 0.940

Step 2: Calculate the Weighted Average Entropy for the Attribute 'Outlook'.
This is also known as the Entropy after the split.

- Entropy(Sunny): p(+) = 2/5, p(-) = 3/5
Entropy(Sunny) = - [(2/5)log₂(2/5) + (3/5)log₂(3/5)]
Entropy(Sunny) = - [(0.4 * -1.322) + (0.6 * -0.737)] = - [(-0.529) + (-0.442)] = - [-0.971] = 0.971

- Entropy(Overcast): p(+) = 4/4 = 1, p(-) = 0/4 = 0
Entropy(Overcast) = - [1 * log₂(1) + 0 * log₂(0)] = - [1 * 0 + 0] = 0
(Note: 0 * log₂(0) is taken as 0.)

- Entropy(Rainy): p(+) = 3/5, p(-) = 2/5
Entropy(Rainy) = - [(3/5)log₂(3/5) + (2/5)log₂(2/5)] = 0.971 (same as Sunny)

- Weighted Average Entropy(Outlook):
= [ (5/14) * Entropy(Sunny) ] + [ (4/14) * Entropy(Overcast) ] + [ (5/14) * Entropy(Rainy) ]
= [ (5/14) * 0.971 ] + [ (4/14) * 0 ] + [ (5/14) * 0.971 ]
= [0.347] + [0] + [0.347] = 0.694

Step 3: Calculate the Information Gain for 'Outlook'.
Information Gain(Outlook) = Entropy(D) - Weighted Average Entropy(Outlook)
Information Gain(Outlook) = 0.940 - 0.694 = 0.246 (≈ 0.247 if intermediate values are not rounded)
Final Answers:
- Entropy for the attribute Outlook (Weighted Average Entropy) = 0.694
- Information Gain from the attribute Outlook ≈ 0.246
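The three steps can be reproduced with a short Python check. This is a sketch that takes only the class counts per Outlook value as input (Sunny 2/3, Overcast 4/0, Rainy 3/2); the function and variable names are illustrative. Small differences in the third decimal place compared with a hand calculation come from rounding intermediate values.

```python
import math

def entropy(counts):
    """Entropy in bits of a class-count tuple; zero counts contribute 0."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# (number of '+', number of '-') for each value of Outlook
outlook = {"Sunny": (2, 3), "Overcast": (4, 0), "Rainy": (3, 2)}

total = sum(sum(c) for c in outlook.values())
positives = sum(c[0] for c in outlook.values())
negatives = sum(c[1] for c in outlook.values())

parent_entropy = entropy((positives, negatives))                     # Step 1
weighted_entropy = sum(sum(c) / total * entropy(c)
                       for c in outlook.values())                   # Step 2
information_gain = parent_entropy - weighted_entropy                 # Step 3

print(round(parent_entropy, 3), round(weighted_entropy, 3), round(information_gain, 3))
```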

10. Describe k-NN algorithm with an example. (10 Marks) (2023)


Answer:
The k-Nearest Neighbors (k-NN) algorithm is a simple, instance-based supervised learning
algorithm used for both classification and regression.

Algorithm Description:
1. Input:
- Training dataset with features and labels.
- A new, unlabeled data point (query point).
- A positive integer 'k' (number of neighbors to consider).
- A distance metric (e.g., Euclidean distance).

2. Procedure:
Step 1: Compute Distance. Calculate the distance between the query point and every single
point in the training dataset.
Step 2: Find Nearest Neighbors. Identify the 'k' data points from the training set that have the
smallest distance to the query point.
Step 3: Aggregate Neighbors.
- For Classification: Take the majority vote (the most frequent class) among the k neighbors.
- For Regression: Take the average of the target values of the k neighbors.

Example 1: Weather and Activity


Task: Decide if a person will go "Swimming" or "Hiking" based on Temperature
(°C) and Humidity (%).
Training Data:
●​ (30°C, 80%) -> Swimming
●​ (28°C, 85%) -> Swimming
●​ (15°C, 40%) -> Hiking
●​ (12°C, 35%) -> Hiking
New Day: (26°C, 75%)
k=3 Calculation:
● Distances (Euclidean, in training-data order): ≈6.40, ≈10.20, ≈36.69, ≈42.38
● 3 Nearest Neighbors: (30,80), (28,85), (15,40)
●​ Votes: 2 Swimming, 1 Hiking
Prediction: Swimming
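The distances and the vote in Example 1 can be checked with a few lines of Python. This is a sketch using the training data listed above and Euclidean distance on the raw (unscaled) features; the variable names are illustrative.

```python
import math
from collections import Counter

train = [((30, 80), "Swimming"), ((28, 85), "Swimming"),
         ((15, 40), "Hiking"), ((12, 35), "Hiking")]
new_day = (26, 75)
k = 3

# Rank training points from nearest to farthest
ranked = sorted(train, key=lambda row: math.dist(row[0], new_day))
for point, label in ranked:
    print(point, label, round(math.dist(point, new_day), 2))

# Majority vote among the k nearest neighbors
votes = Counter(label for _, label in ranked[:k])
prediction = votes.most_common(1)[0][0]
print(prediction)
```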

Example 2: Customer Behavior


Task: Classify a customer as "High Value" or "Low Value" based on Purchase
Frequency (per month) and Average Spend ($).
Training Data:
●​ (8 visits, $200) -> High
●​ (10 visits, $250) -> High
●​ (2 visits, $30) -> Low
●​ (1 visit, $20) -> Low
New Customer: (7 visits, $180)
k=3 Calculation:
● Distances (Euclidean, in training-data order): ≈20.02, ≈70.06, ≈150.08, ≈160.11
●​ 3 Nearest Neighbors: (8,200), (10,250), (2,30)
●​ Votes: 2 High, 1 Low
Prediction: High Value

Example 3: Medical Diagnosis


Task: Predict if a patient has "Flu" or "Cold" based on Fever Temperature (°F) and Symptom
Duration (days).
Training Data:
●​ (102°F, 5 days) -> Flu
●​ (101°F, 6 days) -> Flu
●​ (99°F, 2 days) -> Cold
●​ (98°F, 3 days) -> Cold
New Patient: (100°F, 4 days)
k=3 Calculation:
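● Distances (Euclidean, in training-data order): ≈2.24, ≈2.24, ≈2.24, ≈2.24
● All four training points are equidistant from the new patient, so the 3 nearest neighbors depend on how ties are broken. Taking them in training-data order gives (102,5), (101,6), (99,2).
● Votes: 2 Flu, 1 Cold
Prediction: Flu (under this tie-breaking rule; a different choice among the tied points could give Cold)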

Example 4: House Price Category


Task: Categorize houses as "Expensive" or "Affordable" based on Size (sq. ft.) and Number of
Bedrooms.
Training Data:
● (2000 sq. ft., 4 BR) -> Expensive
● (1800 sq. ft., 3 BR) -> Expensive
● (1000 sq. ft., 2 BR) -> Affordable
● (1200 sq. ft., 2 BR) -> Affordable
New House: (1700 sq. ft., 3 BR)
k=3 Calculation:
● Distances (Euclidean, in training-data order): ≈300.00, 100, ≈700.00, ≈500.00
●​ 3 Nearest Neighbors: (1800,3), (2000,4), (1200,2)
●​ Votes: 2 Expensive, 1 Affordable
Prediction: Expensive

Example 5: Email Classification


Task: Classify emails as "Spam" or "Not Spam" based on Number of Links and Number of
"Free" word occurrences.
Training Data:
●​ (10 links, 5 "Free") -> Spam
●​ (8 links, 6 "Free") -> Spam
●​ (1 link, 0 "Free") -> Not Spam
●​ (2 links, 1 "Free") -> Not Spam
New Email: (7 links, 4 "Free")
k=3 Calculation:
● Distances (Euclidean, in training-data order): ≈3.16, ≈2.24, ≈7.21, ≈5.83
●​ 3 Nearest Neighbors: (8,6), (10,5), (2,1)
●​ Votes: 2 Spam, 1 Not Spam
Prediction: Spam
