UNIT-II
Descriptive Statistics:
Descriptive statistics summarize and organize features of a dataset using numbers, charts, and
graphs.
Examples:
• Mean: Average value
• Median: Middle value
• Mode: Most frequent value
• Range: Difference between max and min
• Standard deviation: Spread of the data
1. Measures of Central Tendency
These values represent the center or typical value in a dataset.
Mean (Average):
Formula: Mean=Sum of all values/Number of values
• Example:
Dataset = [10, 20, 30, 40]
Mean = (10+20+30+40)/4 = 25
Median:
• The middle value when data is ordered.
• If even number of values: take average of two middle values.
• Example:
Dataset = [10, 20, 30, 40, 50]
Median = 30
Dataset = [10, 20, 30, 40]
Median = (20+30)/2 = 25
Mode:
• The most frequent value in the dataset.
• Example:
Dataset = [10, 20, 20, 30, 40]
Mode = 20
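The three measures above can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative and not part of the notes.

```python
import statistics

# Datasets taken from the examples above
mean_data = [10, 20, 30, 40]
odd_data = [10, 20, 30, 40, 50]
mode_data = [10, 20, 20, 30, 40]

mean_value = statistics.mean(mean_data)      # sum of values / number of values
median_odd = statistics.median(odd_data)     # middle value of an odd-length list
median_even = statistics.median(mean_data)   # average of the two middle values
mode_value = statistics.mode(mode_data)      # most frequent value

print(mean_value, median_odd, median_even, mode_value)
```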
2. Measure of Dispersion (Spread)
These values show how spread out or scattered the data is.
Range:
• Formula:
Range=Maximum−Minimum
• Example:
Dataset = [10, 20, 30, 40]
Range = 40 - 10 = 30
Variance:
• Measures the average squared deviation from the mean.
Standard Deviation:
• The square root of variance.
• Tells us how much the values deviate from the mean.
• Example:
Dataset = [10, 20, 30, 40]
Mean = 25
Deviations = [-15, -5, 5, 15]
Squared = [225, 25, 25, 225]
Variance = (225 + 25 + 25 + 225) / 4 = 500 / 4 = 125 (population formula, dividing by the number of values)
Std. Deviation = √125 ≈ 11.18
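The same calculation can be written out step by step. The sketch below follows the worked example, so it divides by the number of values (population variance); dividing by n - 1 instead would give the sample variance.

```python
import math

data = [10, 20, 30, 40]

data_range = max(data) - min(data)
mean = sum(data) / len(data)
squared_deviations = [(x - mean) ** 2 for x in data]

# Population variance: average of the squared deviations
variance = sum(squared_deviations) / len(data)
std_dev = math.sqrt(variance)

print(data_range, variance, round(std_dev, 2))
```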
Real-Life Example: Sales Data
Let’s say you’re analyzing daily sales (in ₹) for a store over one week:
[800, 1000, 950, 1100, 1050, 990, 1200]
• Mean = (800 + 1000 + 950 + 1100 + 1050 + 990 + 1200)/7 = 7090/7 ≈ 1012.86
• Median = 1000 (after sorting)
• Mode = No mode (no repeated values)
• Range = 1200 - 800 = 400
• Standard Deviation ≈ 115.8 (population formula, dividing by 7)
This tells the store owner:
• Typical sales are ~₹1000
• Sales typically vary by around ₹116 from the mean
• ₹800 was the lowest, ₹1200 was the highest
Summary Table
Measure | Meaning | Use Case
Mean | Average of values | Average salary, marks, temperature
Median | Middle value | Income analysis (less affected by outliers)
Mode | Most frequent value | Product popularity, survey choices
Range | Spread from min to max | Knowing variation in data
Std Deviation | Spread around the mean | Consistency in test scores, prices, etc.
Data Preparation:
Data Preparation is the essential step in the data science workflow that comes before analysis
or modeling. Raw data is often incomplete, inconsistent, or messy, and cannot be used directly
for insights or machine learning.
Goal:
Transform raw data into a clean, structured, and machine-readable format.
Steps in Data Preparation (with Examples)
1. Handling Missing Values
Real-world datasets often have missing entries, like blank cells or NaN.
Techniques:
• Remove rows/columns with too many missing values
• Impute missing values with:
• Mean/Median (for numerical values)
• Mode (for categorical values)
• Forward-fill or backward-fill (time series)
Example:
Customer_Age = [25, 28, None, 30, 27]
→ Impute missing age with mean = (25+28+30+27)/4 = 27.5
→ Updated: [25, 28, 27.5, 30, 27]
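A minimal sketch of mean imputation for the example above, using only the standard library. Swapping `statistics.mean` for `statistics.median` gives median imputation.

```python
import statistics

customer_age = [25, 28, None, 30, 27]

# Compute the mean from the observed (non-missing) values only
observed = [age for age in customer_age if age is not None]
fill_value = statistics.mean(observed)

# Replace each missing entry with that mean
imputed = [fill_value if age is None else age for age in customer_age]

print(imputed)
```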
2. Removing Duplicates
Duplicate entries can skew analysis or lead to data leakage in machine learning models.
Example:
Name Email
John john@[Link]
John john@[Link]
Remove duplicate row → Keep only one.
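One way to drop repeated rows while keeping the first occurrence is sketched below. The records and e-mail addresses are hypothetical placeholders, and the choice of which columns define a duplicate is an assumption that depends on the dataset.

```python
# Hypothetical records; the e-mail addresses are placeholders
records = [
    {"name": "John", "email": "john@example.com"},
    {"name": "John", "email": "john@example.com"},
    {"name": "Asha", "email": "asha@example.com"},
]

seen = set()
unique_records = []
for record in records:
    key = (record["name"], record["email"])  # columns that define a duplicate
    if key not in seen:
        seen.add(key)
        unique_records.append(record)        # keep the first occurrence only

print(unique_records)
```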
3. Converting Data Types
Ensure each column has the correct type: integers, floats, strings, dates, etc.
Example:
• "2025-07-24" should be a Date, not a String
• "Age" column with '25', '30' as text → Convert to Integer
This is important for:
• Calculations (age differences, totals)
• Sorting (dates)
• Machine learning models (which need numerical input)
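The two conversions in the example can be sketched as follows. The dictionary keys are hypothetical column names.

```python
from datetime import date

raw_row = {"Age": "25", "Signup": "2025-07-24"}  # hypothetical column names

age = int(raw_row["Age"])                        # text -> integer
signup = date.fromisoformat(raw_row["Signup"])   # text -> date

# Typed values now support arithmetic and correct ordering
print(age + 5, signup.year, signup < date(2026, 1, 1))
```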
4. Normalization or Standardization
When features (columns) have different scales, we need to scale them so that no single feature
dominates.
Normalization:
Scales values between 0 and 1
x_normalized = (x - min) / (max - min)
Standardization:
Converts values to z-scores (mean = 0, std = 1)
z = (x - mean) / std
Example:
• Income ranges from ₹30,000 to ₹2,00,000
• Age ranges from 18 to 70
→ Income will dominate unless scaled
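Both scaling formulas above can be sketched in a few lines. The income and age values are hypothetical, and the standardization here uses the population standard deviation, which is one common convention.

```python
import statistics

def min_max_normalize(values):
    """Rescale values to the range [0, 1]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Convert values to z-scores using the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]

incomes = [30000, 75000, 120000, 200000]   # hypothetical values
ages = [18, 35, 52, 70]                    # hypothetical values

print(min_max_normalize(incomes))
print(standardize(ages))
```

After either transformation the two features are on comparable scales, so income no longer dominates purely because of its units.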
5. Encoding Categorical Variables
Convert text labels into numerical codes for ML models.
Example:
Gender column: ['Male', 'Female', 'Female', 'Male']
→ Label Encoding:
Male = 1, Female = 0
→ [1, 0, 0, 1]
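Label encoding is just a lookup from category to number. A minimal sketch using the mapping from the example:

```python
gender = ["Male", "Female", "Female", "Male"]

# Explicit mapping, matching the example above
mapping = {"Female": 0, "Male": 1}
encoded = [mapping[g] for g in gender]

print(encoded)
```

Note that integer codes imply an order. For categories with more than two unordered values, one-hot encoding is often preferred.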
Real-Time Use Case: Customer Data for ML
Let’s say you're building a churn prediction model for a telecom company. You receive the
following raw data:
Customer ID | Age | Gender | Plan | Monthly Charges | Last Login
101 | 25 | Male | A | 1200 | 2025-06-01
102 | (missing) | Female | A | (missing) | 2025-06-02
101 | 25 | Male | A | 1200 | 2025-06-01
103 | 30 | Female | B | 950 | not recorded
After Data Preparation:
• Missing Age → filled using median or mean
• Missing Monthly Charges → filled with average charges for Plan A
• "not recorded" → replaced with NaT (Not a Time) or a default date
• Duplicate customer ID 101 → removed
• "Gender" and "Plan" → encoded to numeric format
• "Monthly Charges" → normalized
Result: Clean, ready-to-use data for training your model to predict churn accurately.
Summary Table
Step | Purpose | Real-life Example
Handle Missing Values | Fill or remove blanks | Replace blank salary with average in employee data
Remove Duplicates | Avoid redundant information | Delete repeated user registration entries
Convert Data Types | Enable correct calculations | Change age from text to integer
Normalize/Standardize | Prepare features for modeling | Scale income and expenses for credit scoring
Encode Categorical Data | Convert text to numbers | Change "Plan A" to 1, "Plan B" to 2, etc.
Exploratory Data Analysis (EDA):
Exploratory Data Analysis (EDA) is a crucial step in data science and analytics. It's a process of
visually and statistically summarizing the main characteristics of a dataset to uncover patterns,
find anomalies, and guide further analysis.
Why EDA is Important
EDA is foundational to any data-driven project for several reasons:
• Understanding the data: It gives you a first look at the dataset's structure, data types,
and value distribution. This is essential for a complete understanding of the information
you have.
• Pattern recognition: It helps you find hidden patterns and relationships between
different variables that might not be obvious in raw data.
• Outlier detection: You can easily spot errors or unusual data points (outliers) that could
skew your analysis or negatively affect a model's performance.
• Feature engineering and selection: The insights from EDA help you identify the most
important features for a model and guide you on how to transform them for better
performance.
• Informed modeling: By understanding your data's characteristics, you can make
informed decisions about which machine learning models or statistical tests are most
suitable.
Types of EDA
EDA is often categorized by the number of variables being analyzed at once.
1. Univariate Analysis
This focuses on analyzing a single variable to understand its characteristics.
• Goal: Describe the data and find patterns within a single feature.
• Techniques:
• Histograms: Show the distribution of a numerical variable.
• Box Plots: Visualize the spread and detect outliers.
• Bar Charts: Used for categorical data to show frequencies.
• Summary Statistics: Measures like mean, median, mode, and standard
deviation describe central tendency and spread.
2. Bivariate Analysis
This examines the relationship between two variables.
• Goal: Find connections, correlations, and dependencies between two variables.
• Techniques:
• Scatter Plots: Visualize the relationship between two continuous variables.
• Correlation Coefficient: A number that measures the strength and direction of a
relationship (e.g., Pearson's correlation for linear relationships).
• Cross-tabulation (Contingency Tables): Shows the frequency distribution of
two categorical variables.
• Line Graphs: Useful for showing the relationship between a continuous variable
and a time-based variable.
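To make the correlation coefficient concrete, here is a small sketch that computes Pearson's r from its definition. The data (advertising spend and sales) are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: advertising spend vs. sales
ad_spend = [1, 2, 3, 4, 5]
sales = [12, 15, 19, 22, 28]

print(round(pearson_r(ad_spend, sales), 3))
```

A value near +1 indicates a strong increasing linear relationship, a value near -1 a strong decreasing one, and a value near 0 little linear relationship.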
3. Multivariate Analysis
This explores the relationships among three or more variables.
• Goal: Understand how multiple variables interact with each other.
• Techniques:
• Pair Plots: A grid of scatter plots showing the relationships between multiple
variables at once.
• Principal Component Analysis (PCA): A dimensionality reduction technique
that simplifies complex datasets while retaining key information.
• Spatial Analysis: Uses maps to visualize the geographical distribution of data.
• Time Series Analysis: Focuses on patterns and trends in time-based data.
Key Steps for Performing EDA
• Understand the Problem and the Data: Before you start, you need to have a clear
understanding of the business or research question you are trying to solve. You should
also familiarize yourself with the dataset's variables, data types, and any potential
limitations.
• Import and Inspect the Data: Load the data into your analysis environment (e.g.,
Python with Pandas). Inspect its size (rows and columns), check for missing values, and
identify data types for each variable.
• Handle Missing Data: Decide how to manage missing values. You can either remove the
data points or impute (fill in) the values using a suitable method like the mean or median.
• Explore Data Characteristics: Calculate summary statistics (mean, median, standard
deviation, etc.) for numerical variables and create frequency tables for categorical
variables. This provides a clear overview of your data's properties.
• Visualize Data Relationships: Use plots like histograms, box plots, scatter plots, and
correlation matrices to visually explore the data. This is where you'll find most of the
patterns and insights.
• Handle Outliers: Identify and manage outliers, which are data points that are
significantly different from the rest. Outliers can be detected using methods like the
Interquartile Range (IQR) or Z-scores. You can then decide whether to remove, adjust,
or keep them, depending on the context.
• Perform Data Transformation: If necessary, transform your data to prepare it for
modeling. This could involve scaling numerical variables, encoding categorical
variables, or applying mathematical functions to fix skewness.
• Communicate Findings and Insights: The final step is to summarize and present your
discoveries in a clear and compelling way. Use visualizations to support your findings
and highlight key insights, limitations, and suggestions for the next steps.
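The outlier-handling step mentions the IQR rule. A minimal sketch is shown below; the multiplier 1.5 is the usual convention rather than something fixed by these notes, and quartile values can differ slightly between tools because several interpolation methods exist.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

scores = [45, 50, 52, 53, 55, 60, 95]   # example scores with one unusually high value
print(iqr_outliers(scores))
```

Whether a flagged value is removed, adjusted, or kept remains a judgment call that depends on the context.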
Data Summarization:
Data summarization is the process of condensing large and complex datasets into smaller, more
meaningful pieces of information without losing the essence of the data. It’s like reading the
highlights of a long book instead of reading every page.
2. Purpose of Data Summarization
We summarize data to:
• Quickly understand the data’s main characteristics.
• Spot patterns and trends without reading every data point.
• Prepare data for deeper analysis like hypothesis testing or modeling.
• Communicate results to decision-makers in a clear way.
3. Types of Data Summarization
A. Numerical Summarization (Statistical Measures)
Used when dealing with quantitative (numeric) data.
1. Central Tendency → Shows where most values lie:
- Mean (average)
- Median (middle value)
- Mode (most frequent value)
2. Spread / Variability → Shows how spread out values are:
- Range (max – min)
- Variance
- Standard Deviation
3. Shape of Data:
- Skewness (asymmetry)
- Kurtosis (heaviness of the tails, often loosely described as peakedness)
Example:
Dataset: [10, 20, 30, 40, 50]
- Mean = 30
- Median = 30
- Range = 50 – 10 = 40
- Std. Dev. ≈ 15.8 (sample formula, dividing by n − 1; the population formula gives ≈ 14.1)
B. Categorical Summarization
Used when data is qualitative (categories, labels).
• Frequency Table: Counts of each category.
• Percentage/Proportion: Share of each category.
C. Graphical Summarization
Visual representation to help interpret quickly:
• Histogram → Distribution of numeric data.
• Bar Chart → Comparison of categories.
• Box Plot → Spread + outliers.
• Pie Chart → Proportions of categories.
• Heatmap → Relationship between multiple variables.
4. Example
Raw Dataset:
CustomerID Age Region Purchase
1 25 East 2000
2 45 West 3000
3 35 East 1500
4 28 North 4000
Summarized Data:
- Numerical Summary: Mean Age = 33.25, Mean Purchase = ₹2,625, Max Purchase = ₹4,000
- Categorical Summary: Region Counts → East: 2, West: 1, North: 1
- Graphical Summary: Bar chart showing purchase per region.
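The numerical and categorical summaries above can be reproduced as follows. The dictionary keys are illustrative names for the table columns, and the per-region totals are the numbers a bar chart of purchases by region would plot.

```python
from collections import Counter
import statistics

customers = [
    {"id": 1, "age": 25, "region": "East", "purchase": 2000},
    {"id": 2, "age": 45, "region": "West", "purchase": 3000},
    {"id": 3, "age": 35, "region": "East", "purchase": 1500},
    {"id": 4, "age": 28, "region": "North", "purchase": 4000},
]

# Numerical summary
mean_age = statistics.mean([c["age"] for c in customers])
mean_purchase = statistics.mean([c["purchase"] for c in customers])
max_purchase = max(c["purchase"] for c in customers)

# Categorical summary: how many customers per region
region_counts = Counter(c["region"] for c in customers)

# Data behind a "purchase per region" bar chart
purchase_by_region = Counter()
for c in customers:
    purchase_by_region[c["region"]] += c["purchase"]

print(mean_age, mean_purchase, max_purchase)
print(dict(region_counts), dict(purchase_by_region))
```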
Data Distribution :
Data distribution refers to the way values in a dataset are spread or arranged across possible values. It
describes the frequency or probability of occurrence of each value (or range of values) and is fundamental
in understanding data characteristics.
Importance in Data Science
- Guides Statistical Analysis: Many statistical models assume specific data distributions (e.g., normal
distribution in parametric tests).
- Detects Anomalies: Outliers and unusual patterns can be identified by observing the spread.
- Data Preprocessing: Skewed or non-normal distributions may require transformations.
- Model Selection: Certain machine learning algorithms perform better with specific data distributions.
Types of Data Distributions
A. Based on Shape
1. Normal (Gaussian) Distribution
- Symmetrical bell-shaped curve.
- Mean = Median = Mode.
- Many natural phenomena follow this distribution (e.g., height, weight).
2. Uniform Distribution
- Equal probability for all values in the range.
- Example: Rolling a fair die.
3. Skewed Distribution
- Positively Skewed (Right Skew): Long tail on the right; mean > median.
- Negatively Skewed (Left Skew): Long tail on the left; mean < median.
4. Bimodal and Multimodal Distributions
- Two or more peaks in the data.
- Example: Test scores of two different student groups.
B. Based on Probability Type
1. Discrete Distributions (data takes specific, separate values)
- Binomial Distribution: Number of successes in a fixed number of trials.
- Poisson Distribution: Number of events occurring in a fixed time/space.
2. Continuous Distributions (data can take any value within a range)
- Normal Distribution: Common in natural and social phenomena.
- Exponential Distribution: Time until an event occurs.
Methods to Represent Data Distribution
- Histogram: Displays frequency counts for grouped intervals (bins).
- Boxplot: Shows median, quartiles, and outliers.
- Density Plot: Smooth curve showing probability density.
- Violin Plot: Combines boxplot with a mirrored density plot.
Example
Exam Scores Data:
- Mean = 72, Median = 74
- Slightly left-skewed (negative skew) → Most students scored high, but a few low scores reduced the
mean.
Summary Table
Distribution Type | Shape | Example | Applications
Normal | Symmetrical bell curve | Human height | Parametric tests, regression
Uniform | Flat, equal probability | Dice rolls | Random sampling
Positive Skew | Long tail right | Income levels | Wealth distribution analysis
Negative Skew | Long tail left | Age at retirement | Demographic studies
Bimodal | Two peaks | Test scores from two batches | Population segmentation
Poisson | Skewed, discrete | Number of emails/day | Event counting
Exponential | Continuous, skewed | Time to service completion | Reliability analysis
Measuring Asymmetry
In data science, asymmetry (or skewness) refers to the degree to which the distribution of data
deviates from perfect symmetry around its central value (mean or median). A symmetric
distribution has equal spread on both sides, while an asymmetric distribution shows more
concentration of values on one side.
Importance in Data Science
Measuring asymmetry is important for:
• Understanding Data Shape – Identifying whether data is symmetric or skewed.
• Selecting Appropriate Models – Many statistical tests and some models assume approximately
normally distributed data or errors.
• Feature Engineering – Skewed variables may require transformation before model training.
• Business Insights – Detecting skew can reveal unusual patterns, such as extreme spending or
unusual customer behavior.
Types of Asymmetry
a) Positive Skew (Right Skewed)
• Tail extends more towards the right side of the distribution.
• Typically Mean > Median > Mode.
• Example: Distribution of income levels in a country.
b) Negative Skew (Left Skewed)
• Tail extends more towards the left side of the distribution.
• Typically Mean < Median < Mode.
• Example: Age at retirement for a specific population.
Measures of Asymmetry
Moment Coefficient of Skewness
Formula:
Skewness = [ Σ(xi - x̄ )³ ] / [ n * s³ ]
Where:
• xi = individual data values
• x̄ = mean of data
• s = standard deviation
• n = number of observations
Pearson’s Coefficients of Skewness
1. First Coefficient: ( x̄ – Mode ) / s
2. Second Coefficient: 3( x̄ – Median ) / s
Bowley’s (Quartile) Coefficient of Skewness
Formula:
Skewness = ( Q3 + Q1 – 2Q2 ) / ( Q3 – Q1 )
Where:
• Q1 = First Quartile
• Q2 = Median
• Q3 = Third Quartile
Interpretation of Skewness Values
Skewness Value | Interpretation
0 | Perfectly symmetric
-0.5 to 0.5 | Approximately symmetric
0.5 to 1 | Moderately positively skewed
-1 to -0.5 | Moderately negatively skewed
> 1 | Highly positively skewed
< -1 | Highly negatively skewed
Detecting Asymmetry
• Histogram – Shows shape and tail direction.
• Boxplot – Longer whisker indicates skew direction.
• Density Plot – Reveals deviation from symmetry.
Handling Skewness in Data Science
• Log Transformation – Commonly used for right-skewed data.
• Square Root Transformation – Useful for moderate skew.
• Box-Cox Transformation – General transformation method.
• Use of Robust Models – Decision Trees and Random Forests handle skew naturally.
Example
Dataset: Exam scores = {45, 50, 52, 53, 55, 60, 95}
• Mean = 58.57
• Median = 53
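• Skewness ≈ 1.37 using the moment formula above with the sample standard deviation (s ≈ 16.70); with the population standard deviation the value is ≈ 1.73. Either way → Positive Skew.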
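The skewness value depends on which standard deviation is plugged into the formula, so software packages can report different numbers for the same data. The sketch below evaluates the moment formula from this section both ways and also computes Pearson's second coefficient. The function names are illustrative.

```python
import statistics

def moment_skewness(values, sample_std=True):
    """Sum of cubed deviations divided by n * s**3 (s = sample or population std)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values) if sample_std else statistics.pstdev(values)
    return sum((x - mean) ** 3 for x in values) / (n * s ** 3)

def pearson_second(values):
    """Pearson's second coefficient: 3 * (mean - median) / s."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)

scores = [45, 50, 52, 53, 55, 60, 95]
print(round(moment_skewness(scores), 2))                    # sample std in the denominator
print(round(moment_skewness(scores, sample_std=False), 2))  # population std in the denominator
print(round(pearson_second(scores), 2))
```

All three measures are positive for this dataset, which matches the long right tail caused by the single score of 95.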
Sample Mean and Estimated Mean
Sample Mean
The sample mean is the arithmetic average of values from a sample, not the entire population.
Formula:
x̄ = Σ(xi) / n
Where:
• x̄ = sample mean
• xi = each value in the sample
• n = number of observations in the sample
Properties
• Serves as an unbiased estimator of the population mean (μ).
• Sensitive to extreme values (outliers).
• Simple to compute and widely used.
Example
Sample data: {10, 12, 15, 18, 20}
x̄ = (10 + 12 + 15 + 18 + 20) / 5 = 75 / 5 = 15
The sample mean is 15.
Estimated Mean
The estimated mean refers to the value obtained by using the sample mean to approximate the
unknown population mean (μ).
Since we cannot compute the exact population mean without having all data points, we estimate
it using the sample mean:
μ̂ ≈ x̄
Where:
• μ̂ = estimated population mean
• x̄ = sample mean
Relationship
• Sample mean is a statistic (calculated from data).
• Estimated mean is an estimation of a parameter (population mean) based on that statistic.
Example in Data Science Context
Suppose we want the average monthly spending of all customers in a city:
• Population size: 100,000 customers (unknown μ).
• We take a sample of 200 customers and find the sample mean = ₹ 5,200.
Estimated mean:
μ̂ ≈ 5,200
This is our best guess for the true population mean.
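A small simulation can illustrate the idea. Everything about the population below (normal shape, mean 5,200, standard deviation 1,500) is a hypothetical assumption made only so that the true mean is available for comparison; in practice it is unknown.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: monthly spending (in ₹) of 100,000 customers
population = [random.gauss(5200, 1500) for _ in range(100_000)]
population_mean = statistics.fmean(population)   # normally unknown in practice

# What we can actually observe: a random sample of 200 customers
sample = random.sample(population, 200)
estimated_mean = statistics.fmean(sample)        # the sample mean serves as the estimate

print(round(population_mean), round(estimated_mean))
```

The two printed values are close but not identical, which is exactly the gap between an estimate and the parameter it estimates.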
Importance in Data Science
• Used in descriptive statistics to summarize data.
• Provides a basis for inference: many statistical procedures start from an estimate of the
population mean.
• Critical in hypothesis testing and confidence interval calculation.
Difference Between Sample Mean and Estimated Mean
Aspect | Sample Mean (x̄) | Estimated Mean (μ̂)
Definition | Arithmetic mean of sample data | Approximation of the population mean using sample data
Data Basis | Calculated directly from the sample | Based on the sample mean
Purpose | Describes the sample | Predicts the population mean
Symbol | x̄ | μ̂
• Sample mean is computed from observed data.
• Estimated mean uses the sample mean to infer the population mean.
Variance and Standard Score
In data science and statistics, variance and standard score (also called z-score) are essential
concepts used to measure the spread of data and standardize data points for comparison. Variance
indicates how much the data points differ from the mean, while the standard score tells us how
far a specific value is from the mean in terms of standard deviations.
Variance
Variance is a measure of the dispersion of a set of values. It calculates the average of the squared
differences between each value and the mean.
A high variance indicates that the data points are spread out widely from the mean, while a low
variance means they are closer to the mean.
Formula for population variance: σ² = Σ (xᵢ - μ)² / N
Formula for sample variance: s² = Σ (xᵢ - x̄ )² / (n - 1)
Example:
Consider the dataset: 5, 7, 3
Mean = (5 + 7 + 3) / 3 = 5
Population variance = [(5-5)² + (7-5)² + (3-5)²] / 3 = (0 + 4 + 4) / 3 ≈ 2.67
Sample variance = (0 + 4 + 4) / (3 - 1) = 4
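The worked example divides by 3, so it uses the population formula. The standard `statistics` module provides both versions, which makes the difference easy to see:

```python
import statistics

data = [5, 7, 3]

population_variance = statistics.pvariance(data)  # divides by N
sample_variance = statistics.variance(data)       # divides by n - 1

print(round(population_variance, 2), sample_variance)
```

The sample version is larger because dividing by n - 1 corrects for the fact that deviations are measured from the sample mean rather than the true population mean.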
Standard Score (Z-score)
The standard score (z-score) is a statistical measure that describes a value's position relative to
the mean of a group of values, measured in terms of standard deviations. It is especially useful
for comparing values from different datasets or distributions.
Formula: z = (x - μ) / σ
Example:
If a student scored 85 on a test where the mean score was 75 and the standard deviation was 5:
z = (85 - 75) / 5 = 2
Interpretation: The student scored 2 standard deviations above the mean.
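The z-score formula translates directly into a one-line function. The second test in the sketch (mean 60, standard deviation 4) is a hypothetical addition to show how z-scores allow comparison across different tests.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std

student_z = z_score(85, mean=75, std=5)
print(student_z)

# Hypothetical comparison: a score of 70 on a harder test (mean 60, std 4)
other_z = z_score(70, mean=60, std=4)
print(other_z > student_z)
```

Although 70 is a lower raw score than 85, it sits further above its own test's mean in standard-deviation terms.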
Applications in Data Science
1. Variance is used in statistical modeling to understand variability and detect features with high
or low variability.
2. Z-scores are used in anomaly detection, standardizing data for machine learning models, and
in hypothesis testing.
Statistical Inference (Frequency Approach)
Definition:
Statistical inference is the process of drawing conclusions about a population based on
information from a sample.
The frequentist (frequency) approach defines probability as the long-run frequency of an
event occurring after repeated trials.
• Example: If you flip a fair coin many times, the probability of heads = 0.5 means that in
the long run, 50% of flips will show heads.
Key Idea:
• Population → The entire group (e.g., all students in a college).
• Sample → A small part of the population (e.g., 50 students chosen randomly).
• Inference → Using the sample to estimate or test something about the population (like the
average marks of all students).
Example:
Suppose you want to know the average height of college students.
• You cannot measure all 5,000 students.
• Instead, you take a sample of 100 students.
• Using their heights, you estimate the average for the whole college.
This process = statistical inference
Applications:
• Predicting election results by surveying a small group.
• Estimating the failure rate of machines in a factory.
• Medical trials – testing a drug on a small group before general use.
Variability of Estimates:
When we take different random samples from the same population, the estimates (like sample
mean, variance, or proportion) will not be exactly the same. This variation is called sampling
variability or variability of estimates.
Example:
Imagine you want to estimate the average mark of students in a class of 500.
• Sample 1 (50 students): Mean = 68
• Sample 2 (50 students): Mean = 72
• Sample 3 (50 students): Mean = 70
Each sample gives a slightly different mean → This is variability of estimates.
Why it matters: we need to know how reliable our sample estimate is.
• If variability is small → The estimate is stable and reliable.
• If variability is large → The estimate is uncertain.
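Sampling variability can be observed directly by drawing many samples and looking at how much their means differ. The population below is hypothetical (500 marks drawn around 70), and the sample sizes of 10 and 50 are chosen only for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical marks for a class of 500 students
population = [random.gauss(70, 10) for _ in range(500)]

def sample_means(sample_size, repeats=500):
    """Means of `repeats` random samples drawn from the population."""
    return [statistics.fmean(random.sample(population, sample_size))
            for _ in range(repeats)]

spread_small = statistics.stdev(sample_means(10))   # samples of 10 students
spread_large = statistics.stdev(sample_means(50))   # samples of 50 students

print(round(spread_small, 2), round(spread_large, 2))
```

Larger samples give means that cluster more tightly, which is why a bigger sample produces a more stable estimate.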
Applications:
• In quality control, measuring consistency in production.
• In finance, understanding how different samples of stock data may affect predictions.
• In medicine, testing how reliable treatment outcomes are when repeated with different
groups.
Hypothesis Testing using Confidence Intervals:
A confidence interval (CI) gives a range of values within which the true population
parameter is likely to lie with a certain level of confidence (usually 95%).
Instead of just giving a single estimate (like a sample mean), CI provides a range.
Example:
Suppose a sample of 100 students has an average height = 160 cm, with a 95% confidence
interval of [158 cm, 162 cm].
• This means we are 95% confident that the true average height of all students lies
between 158 and 162 cm.
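An interval like this is commonly computed as the sample mean plus or minus a critical value times the standard error. The sketch below uses the normal approximation; the sample standard deviation of 10.2 cm is an assumed value, chosen because the example above does not state one.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_std, n, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sample_std / math.sqrt(n)               # critical value * standard error
    return sample_mean - margin, sample_mean + margin

# sample_std=10.2 is an assumed value; the example above does not state it
lower, upper = mean_confidence_interval(160, 10.2, 100)
print(round(lower, 1), round(upper, 1))
```

For small samples, a t-distribution critical value is normally used in place of the normal one, which widens the interval slightly.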
Hypothesis Testing with CI:
Hypothesis testing is a method used to check whether a claim about a population is supported
by sample data.
Think of it like checking if a statement is correct.
Example
A mobile company says:
“Our phone battery lasts 10 hours.”
You test 25 phones and find the average battery life is 9.2 hours.
Now the question is:
* Is the company telling the truth?
or
* Is the battery life actually different?
To answer this, we use Hypothesis Testing.
Two Types of Hypotheses
When we test a claim, we create two statements.
1. Null Hypothesis (H₀)
This means the claim is true.
Example
H₀: Average battery life = 10 hours
2. Alternative Hypothesis (H₁)
This means the claim is not true.
Example
H₁: Average battery life ≠ 10 hours
Significance Level (α)
Before testing, we decide how strict we want to be.
Usually we choose:
α=0.05
This means we accept a 5% chance of rejecting H₀ when it is actually true.
Two Ways to Test the Hypothesis
There are two common methods:
1. Confidence Interval Method
2. p-value Method
For a two-sided test, a 95% confidence interval and a p-value compared with α = 0.05 lead to the same decision.
Confidence Interval Method
A confidence interval is simply a range of possible values where the true answer may lie.
Think of it like:
“The real value is probably somewhere between these two numbers.”
Example
Suppose after testing phones you get:
Average battery life = 9.2 hours
95% Confidence Interval:
(8.8, 9.6)
This means:
The real battery life is likely between 8.8 and 9.6 hours.
Now check the company claim
Company claim = 10 hours
Confidence Interval = 8.8 to 9.6
Question:
Does 10 lie inside the interval (8.8, 9.6)?
No. 10 is outside the range.
Decision
Since 10 is outside the interval, we reject the company's claim (H₀).
Conclusion:
At the 5% significance level, the sample does not support an average battery life of 10 hours.
p-value Method
The p-value tells us:
“How likely is a result at least this far from the claim if the company claim is true?”
Interpretation
p-value | Meaning
p-value < 0.05 | Evidence against H₀ (reject H₀)
p-value ≥ 0.05 | Not enough evidence (do not reject H₀)
Example
Suppose after calculation we get:
p-value = 0.02
Significance level
α=0.05
Compare:
0.02 < 0.05
Decision
Reject H₀
Conclusion:
The battery life is different from 10 hours.
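A p-value of this kind can be obtained from a two-sided test on the sample mean. The sketch below uses the normal approximation so that it needs only the standard library; with just 25 phones a t-test would be the more usual choice. The sample standard deviation of 1.72 hours is a hypothetical value, chosen so that the result lands near the 0.02 used above.

```python
import math
from statistics import NormalDist

def two_sided_p_value(sample_mean, claimed_mean, sample_std, n):
    """Two-sided p-value for H0: mean == claimed_mean (normal approximation)."""
    standard_error = sample_std / math.sqrt(n)
    z = (sample_mean - claimed_mean) / standard_error
    return 2 * NormalDist().cdf(-abs(z))

# sample_std=1.72 is a hypothetical value; the example above does not state it
p_value = two_sided_p_value(9.2, 10, 1.72, 25)
alpha = 0.05
print(round(p_value, 3), "reject H0" if p_value < alpha else "do not reject H0")
```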
Simple Real-Life Example
Teacher says:
“Average marks of students = 70”
You test some students and find average = 65.
Now you check using hypothesis testing.
Confidence Interval Example
Confidence interval = (63, 67)
Teacher claim = 70
70 is outside the interval
Reject the teacher's claim.
p-value Example
p-value = 0.01
0.01 < 0.05
Reject the teacher's claim.
Easy Way to Remember
Confidence Interval: check the range. If the claim is outside the range → reject.
p-value: compare the p-value with 0.05. If p-value < 0.05 → reject.
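The two rules can be run side by side to see that they agree for a two-sided test. The sample summary used here (mean 65, standard deviation 10.2, n = 100) is hypothetical, chosen so that the 95% interval is roughly (63, 67) as in the marks example, and the normal approximation is assumed.

```python
import math
from statistics import NormalDist

def decisions(sample_mean, sample_std, n, claimed_mean, alpha=0.05):
    """Return (reject_by_ci, reject_by_p) for a two-sided z-test."""
    normal = NormalDist()
    se = sample_std / math.sqrt(n)
    z_crit = normal.inv_cdf(1 - alpha / 2)
    lower, upper = sample_mean - z_crit * se, sample_mean + z_crit * se
    p_value = 2 * normal.cdf(-abs((sample_mean - claimed_mean) / se))
    return not (lower <= claimed_mean <= upper), p_value < alpha

# Hypothetical marks sample: mean 65, std 10.2, n = 100
for claim in (70, 68, 66, 65, 62):
    by_ci, by_p = decisions(65, 10.2, 100, claim)
    print(claim, by_ci, by_p)
```

Claims inside the interval are kept by both rules and claims outside it are rejected by both.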
Errors in Hypothesis Testing
When we perform hypothesis testing, sometimes the decision we make can be wrong.
These mistakes are called Errors in Hypothesis Testing.
There are two types of errors:
• Type I Error
• Type II Error
1. Type I Error (False Positive)
Definition
Type I Error occurs when we reject the Null Hypothesis (H₀) even though it is actually true.
Simple Meaning
We say the claim is wrong, but in reality the claim is correct.
Example (Battery Life Example)
Company claim:
Average battery life = 10 hours
Actual truth:
Battery life really is 10 hours
But our test result says:
Reject H₀
This means we say the company claim is false, even though it is true.
This is called Type I Error.
Symbol
Type I Error is represented by:
α (alpha)
Usually
α = 0.05
This means a 5% chance of making a Type I Error when H₀ is actually true.
2. Type II Error (False Negative)
Definition
Type II Error occurs when we fail to reject the Null Hypothesis even though it is false.
Simple Meaning
We say the claim is correct, but actually it is wrong.
Example (Battery Life Example)
Company claim:
Average battery life = 10 hours
Actual truth:
Battery life is 8 hours
But our test result says:
Do not reject H₀
This means we say the company claim is correct, even though it is wrong.
This is called Type II Error.
Symbol
Type II Error is represented by:
β (beta)
Simple Comparison
Error Type | Meaning | Example
Type I Error | Rejecting a true H₀ | Saying battery life ≠ 10 when it is actually 10
Type II Error | Failing to reject a false H₀ | Keeping battery life = 10 when it is actually 8
Easy Real-Life Example
Imagine a fire alarm system.
Type I Error
Alarm rings but there is no fire.
(False alarm)
Type II Error
There is fire, but alarm does not ring.
(Missed detection)
Quick Memory Trick
Type I Error → False Alarm
Type II Error → Missed Detection
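Both error rates can be estimated by simulation. All settings in this sketch are hypothetical: battery life is assumed normal with a known standard deviation of 1 hour, each test uses 25 phones, and the "false claim" scenario uses a true mean of 9.8 hours (a small gap, so that missed detections actually occur).

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed so the sketch is repeatable

def rejects_h0(true_mean, claimed_mean=10, std=1.0, n=25, alpha=0.05):
    """Draw one sample and run a two-sided z-test of H0: mean == claimed_mean."""
    sample = [random.gauss(true_mean, std) for _ in range(n)]
    z = (fmean(sample) - claimed_mean) / (std / math.sqrt(n))
    return 2 * NormalDist().cdf(-abs(z)) < alpha

trials = 4000
# H0 true (real mean is 10): every rejection is a Type I error
type_1_rate = sum(rejects_h0(10) for _ in range(trials)) / trials
# H0 false (real mean is 9.8): every non-rejection is a Type II error
type_2_rate = sum(not rejects_h0(9.8) for _ in range(trials)) / trials

print(type_1_rate, type_2_rate)
```

The Type I rate should come out close to α, while the Type II rate depends on how far the truth is from the claim and on the sample size. A larger gap or more phones would reduce it.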