Unit 2 DataScience

The document covers descriptive statistics, data preparation, exploratory data analysis (EDA), data summarization, and data distribution. It explains key statistical measures such as mean, median, mode, range, and standard deviation, along with the importance of cleaning and preparing data for analysis. Additionally, it discusses various types of data distributions and their significance in statistical modeling and anomaly detection.

Uploaded by

nandinipechetti

UNIT-II

Descriptive Statistics:
Descriptive statistics summarize and organize features of a dataset using numbers, charts, and
graphs.

Examples:
• Mean: Average value
• Median: Middle value
• Mode: Most frequent value
• Range: Difference between max and min
• Standard deviation: Spread of the data
1. Measure of Central Tendency
These values represent the center or typical value in a dataset.

Mean (Average):
Formula: Mean=Sum of all values/Number of values
• Example:
Dataset = [10, 20, 30, 40]
Mean = (10+20+30+40)/4 = 25
Median:
• The middle value when data is ordered.
• If even number of values: take average of two middle values.
• Example:
Dataset = [10, 20, 30, 40, 50]
Median = 30
Dataset = [10, 20, 30, 40]
Median = (20+30)/2 = 25
Mode:
• The most frequent value in the dataset.
• Example:
Dataset = [10, 20, 20, 30, 40]
Mode = 20
2. Measure of Dispersion (Spread)
These values show how spread out or scattered the data is.
Range:
• Formula:
Range=Maximum−Minimum
• Example:
Dataset = [10, 20, 30, 40]
Range = 40 - 10 = 30
Variance:
• Measures the average squared deviation from the mean.
Standard Deviation:
• The square root of variance.
• Tells us how much the values deviate from the mean.
• Example:
Dataset = [10, 20, 30, 40]
Mean = 25
Deviations = [-15, -5, 5, 15]
Squared = [225, 25, 25, 225]
Variance = 500 / 4 = 125
Std. Deviation = √125 ≈ 11.18
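The measures above can be checked with Python's standard statistics module. This is a minimal sketch: the variable names are illustrative, and pvariance/pstdev are chosen because the worked example divides by the number of values (the population formula).

```python
import statistics

data = [10, 20, 30, 40]

mean = statistics.mean(data)                   # 25
median = statistics.median(data)               # average of 20 and 30
mode = statistics.mode([10, 20, 20, 30, 40])   # 20
data_range = max(data) - min(data)             # 40 - 10
variance = statistics.pvariance(data)          # population variance: 500 / 4
std_dev = statistics.pstdev(data)              # square root of the variance

print(mean, median, mode, data_range, variance, round(std_dev, 2))
```

Using statistics.variance and statistics.stdev instead would divide by n - 1 (the sample formula) and give slightly larger values.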
Real-Life Example: Sales Data
Let’s say you’re analyzing daily sales (in ₹) for a store over one week:
[800, 1000, 950, 1100, 1050, 990, 1200]
• Mean = (800 + 1000 + 950 + 1100 + 1050 + 990 + 1200)/7 = 7090/7 ≈ 1012.86
• Median = 1000 (after sorting)
• Mode = No mode (no repeated values)
• Range = 1200 - 800 = 400
• Standard Deviation ≈ 115.8 (population formula, dividing by n; the sample formula, dividing by n - 1, gives ≈ 125.1)
This tells the store owner:
• Typical sales are ~₹1000
• Sales vary by roughly ₹115 to ₹125 around the mean
• ₹800 was the lowest, ₹1200 was the highest
Summary Table

Measure          Meaning                   Use Case
Mean             Average of values         Average salary, marks, temperature
Median           Middle value              Income analysis (less affected by outliers)
Mode             Most frequent value       Product popularity, survey choices
Range            Spread from min to max    Knowing variation in data
Std Deviation    Spread around the mean    Consistency in test scores, prices, etc.

Data Preparation:
Data Preparation is the essential step in the data science workflow that comes before analysis
or modeling. Raw data is often incomplete, inconsistent, or messy, and cannot be used directly
for insights or machine learning.

Goal:
Transform raw data into a clean, structured, and machine-readable format.
Steps in Data Preparation (with Examples)
1. Handling Missing Values
Real-world datasets often have missing entries, like blank cells or NaN.
Techniques:
• Remove rows/columns with too many missing values
• Impute missing values with:
• Mean/Median (for numerical values)
• Mode (for categorical values)
• Forward-fill or backward-fill (time series)
Example:
Customer_Age = [25, 28, None, 30, 27]
→ Impute missing age with mean = (25+28+30+27)/4 = 27.5
→ Updated: [25, 28, 27.5, 30, 27]
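The imputation example above can be sketched in plain Python. The list name mirrors the example; the approach (mean of the observed values) is the one described, and the code is only an illustration, not a full missing-data strategy.

```python
import statistics

customer_age = [25, 28, None, 30, 27]

# Mean of the observed (non-missing) values only
observed = [age for age in customer_age if age is not None]
fill_value = statistics.mean(observed)

# Replace each missing entry with the computed mean
imputed = [fill_value if age is None else age for age in customer_age]
print(imputed)  # [25, 28, 27.5, 30, 27]
```

With pandas, the same idea is usually written with Series.fillna and Series.mean; swapping in the median is safer when the column contains outliers.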
2. Removing Duplicates
Duplicate entries can skew analysis or lead to data leakage in machine learning models.
Example:

Name Email

John john@[Link]

Remove duplicate row → Keep only one.

3. Converting Data Types
Ensure each column has the correct type: integers, floats, strings, dates, etc.
Example:
• "2025-07-24" should be a Date, not a String
• "Age" column with '25', '30' as text → Convert to Integer
This is important for:
• Calculations (age differences, totals)
• Sorting (dates)
• Machine learning models (which need numerical input)
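A small sketch of the type conversions described above, using only the standard library. The record raw_row and the reference date are hypothetical values chosen for illustration.

```python
from datetime import date

# Hypothetical raw record where every field arrived as text
raw_row = {"Age": "25", "Last_Login": "2025-07-24"}

clean_row = {
    "Age": int(raw_row["Age"]),                               # text -> integer
    "Last_Login": date.fromisoformat(raw_row["Last_Login"]),  # text -> date
}

# Typed values now support arithmetic and correct ordering
age_next_year = clean_row["Age"] + 1
days_since_login = (date(2025, 8, 1) - clean_row["Last_Login"]).days
print(age_next_year, days_since_login)
```

Both int() and date.fromisoformat() raise ValueError on malformed input, which is a useful early signal that a column needs cleaning first.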
4. Normalization or Standardization
When features (columns) have different scales, we need to scale them so that no single feature
dominates.
Normalization:
Scales values between 0 and 1
x_normalized = (x - min) / (max - min)
Standardization:
Converts values to z-scores (mean = 0, std = 1)
z = (x - mean) / std
Example:
• Income ranges from ₹30,000 to ₹2,00,000
• Age ranges from 18 to 70
→ Income will dominate unless scaled
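The two scaling formulas above can be sketched as small functions. The income and age values are hypothetical, picked to match the ranges in the example; the standardization uses the population standard deviation, which is an assumption since the notes do not say which one is meant.

```python
import statistics

def min_max_normalize(values):
    """Rescale values to [0, 1]. Assumes max(values) != min(values)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Convert values to z-scores using the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]

income = [30_000, 75_000, 120_000, 200_000]  # hypothetical incomes in rupees
age = [18, 35, 52, 70]                       # hypothetical ages

income_scaled = min_max_normalize(income)
age_scaled = min_max_normalize(age)
income_z = standardize(income)

print(income_scaled)
print(age_scaled)
```

After scaling, both features lie on the same 0 to 1 range, so neither dominates distance-based calculations.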
5. Encoding Categorical Variables
Convert text labels into numerical codes for ML models.
Example:
Gender column: ['Male', 'Female', 'Female', 'Male']
→ Label Encoding:
Male = 1, Female = 0
→ [1, 0, 0, 1]
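The label encoding above amounts to a dictionary lookup. The sketch below reproduces it and also shows one-hot encoding, which is not part of the notes but is a common alternative when categories have no natural order.

```python
gender = ["Male", "Female", "Female", "Male"]

# Explicit mapping taken from the example above
mapping = {"Male": 1, "Female": 0}
encoded = [mapping[g] for g in gender]
print(encoded)  # [1, 0, 0, 1]

# One-hot encoding (an addition for comparison): one 0/1 column per category
categories = sorted(set(gender))  # ['Female', 'Male']
one_hot = [[int(g == c) for c in categories] for g in gender]
print(one_hot)
```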
Real-Time Use Case: Customer Data for ML
Let’s say you're building a churn prediction model for a telecom company. You receive the
following raw data:

Customer ID   Age         Gender   Plan   Monthly Charges   Last Login
101           25          Male     A      1200              2025-06-01
102           (missing)   Female   A      (missing)         2025-06-02
101           25          Male     A      1200              2025-06-01
103           30          Female   B      950               not recorded

After Data Preparation:

• Missing Age → filled using median or mean
• Missing Monthly Charges → filled with average charges for Plan A
• "not recorded" → replaced with NaT (Not a Time) or a default date
• Duplicate customer ID 101 → removed
• "Gender" and "Plan" → encoded to numeric format
• "Monthly Charges" → normalized
Result: Clean, ready-to-use data for training your model to predict churn accurately.
Summary Table

Step                      Purpose                          Real-life Example
Handle Missing Values     Fill or remove blanks            Replace blank salary with average in employee data
Remove Duplicates         Avoid redundant information      Delete repeated user registration entries
Convert Data Types        Enable correct calculations      Change age from text to integer
Normalize/Standardize     Prepare features for modeling    Scale income and expenses for credit scoring
Encode Categorical Data   Convert text to numbers          Change "Plan A" to 1, "Plan B" to 2, etc.

Exploratory Data Analysis (EDA):

Exploratory Data Analysis (EDA) is a crucial step in data science and analytics. It's a process of
visually and statistically summarizing the main characteristics of a dataset to uncover patterns,
find anomalies, and guide further analysis.
Why EDA is Important
EDA is foundational to any data-driven project for several reasons:
• Understanding the data: It gives you a first look at the dataset's structure, data types,
and value distribution. This is essential for a complete understanding of the information
you have.
• Pattern recognition: It helps you find hidden patterns and relationships between
different variables that might not be obvious in raw data.
• Outlier detection: You can easily spot errors or unusual data points (outliers) that could
skew your analysis or negatively affect a model's performance.
• Feature engineering and selection: The insights from EDA help you identify the most
important features for a model and guide you on how to transform them for better
performance.
• Informed modeling: By understanding your data's characteristics, you can make
informed decisions about which machine learning models or statistical tests are most
suitable.
Types of EDA
EDA is often categorized by the number of variables being analyzed at once.
1. Univariate Analysis
This focuses on analyzing a single variable to understand its characteristics.
• Goal: Describe the data and find patterns within a single feature.
• Techniques:
• Histograms: Show the distribution of a numerical variable.
• Box Plots: Visualize the spread and detect outliers.
• Bar Charts: Used for categorical data to show frequencies.
• Summary Statistics: Measures like mean, median, mode, and standard
deviation describe central tendency and spread.

2. Bivariate Analysis
This examines the relationship between two variables.
• Goal: Find connections, correlations, and dependencies between two variables.
• Techniques:
• Scatter Plots: Visualize the relationship between two continuous variables.
• Correlation Coefficient: A number that measures the strength and direction of a
relationship (e.g., Pearson's correlation for linear relationships).
• Cross-tabulation (Contingency Tables): Shows the frequency distribution of
two categorical variables.
• Line Graphs: Useful for showing the relationship between a continuous variable
and a time-based variable.
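As an illustration of the correlation coefficient mentioned above, here is a hand-written Pearson correlation. The function name and the two data lists are hypothetical; the formula is the standard covariance divided by the product of the standard deviations.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours_studied = [1, 2, 3, 4, 5]     # hypothetical data
exam_score = [52, 58, 61, 70, 74]   # hypothetical data

r = pearson_r(hours_studied, exam_score)
print(round(r, 3))  # close to +1: strong positive linear relationship
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 only rules out a linear one; a scatter plot is still needed to spot curved patterns.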

3. Multivariate Analysis
This explores the relationships among three or more variables.
• Goal: Understand how multiple variables interact with each other.
• Techniques:
• Pair Plots: A grid of scatter plots showing the relationships between multiple
variables at once.
• Principal Component Analysis (PCA): A dimensionality reduction technique
that simplifies complex datasets while retaining key information.
• Spatial Analysis: Uses maps to visualize the geographical distribution of data.
• Time Series Analysis: Focuses on patterns and trends in time-based data.
Key Steps for Performing EDA
• Understand the Problem and the Data: Before you start, you need to have a clear
understanding of the business or research question you are trying to solve. You should
also familiarize yourself with the dataset's variables, data types, and any potential
limitations.
• Import and Inspect the Data: Load the data into your analysis environment (e.g.,
Python with Pandas). Inspect its size (rows and columns), check for missing values, and
identify data types for each variable.
• Handle Missing Data: Decide how to manage missing values. You can either remove the
data points or impute (fill in) the values using a suitable method like the mean or median.
• Explore Data Characteristics: Calculate summary statistics (mean, median, standard
deviation, etc.) for numerical variables and create frequency tables for categorical
variables. This provides a clear overview of your data's properties.
• Visualize Data Relationships: Use plots like histograms, box plots, scatter plots, and
correlation matrices to visually explore the data. This is where you'll find most of the
patterns and insights.
• Handle Outliers: Identify and manage outliers, which are data points that are
significantly different from the rest. Outliers can be detected using methods like the
Interquartile Range (IQR) or Z-scores. You can then decide whether to remove, adjust,
or keep them, depending on the context.
• Perform Data Transformation: If necessary, transform your data to prepare it for
modeling. This could involve scaling numerical variables, encoding categorical
variables, or applying mathematical functions to fix skewness.
• Communicate Findings and Insights: The final step is to summarize and present your
discoveries in a clear and compelling way. Use visualizations to support your findings
and highlight key insights, limitations, and suggestions for the next steps.
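The outlier-handling step above names the IQR and Z-score methods. A minimal sketch of both follows; the function names, the data, and the thresholds (1.5 × IQR, |z| > 3) are common conventions assumed here, not values given in the notes.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [x for x in values if abs(x - mean) / std > threshold]

data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 60]  # hypothetical, 60 looks unusual

print(iqr_outliers(data))                    # flags 60
print(zscore_outliers(data))                 # flags nothing at threshold 3
print(zscore_outliers(data, threshold=2.5))  # flags 60
```

The Z-score method misses 60 at the usual threshold because, in a small sample, the outlier itself inflates the standard deviation. The IQR rule is based on quartiles and is less affected by that.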
Data Summarization:
Data summarization is the process of condensing large and complex datasets into smaller, more
meaningful pieces of information without losing the essence of the data. It’s like reading the
highlights of a long book instead of reading every page.
2. Purpose of Data Summarization
We summarize data to:
• Quickly understand the data’s main characteristics.
• Spot patterns and trends without reading every data point.
• Prepare data for deeper analysis like hypothesis testing or modeling.
• Communicate results to decision-makers in a clear way.
3. Types of Data Summarization
A. Numerical Summarization (Statistical Measures)
Used when dealing with quantitative (numeric) data.
1. Central Tendency → Shows where most values lie:
- Mean (average)
- Median (middle value)
- Mode (most frequent value)
2. Spread / Variability → Shows how spread out values are:
- Range (max – min)
- Variance
- Standard Deviation
3. Shape of Data:
- Skewness (asymmetry)
- Kurtosis (tail heaviness, often loosely described as peakedness)
Example:
Dataset: [10, 20, 30, 40, 50]
- Mean = 30
- Median = 30
- Range = 50 – 10 = 40
- Std. Dev. ≈ 15.8 (sample formula, dividing by n - 1; the population formula gives ≈ 14.1)
B. Categorical Summarization
Used when data is qualitative (categories, labels).
• Frequency Table: Counts of each category.
• Percentage/Proportion: Share of each category.
C. Graphical Summarization
Visual representation to help interpret quickly:
• Histogram → Distribution of numeric data.
• Bar Chart → Comparison of categories.
• Box Plot → Spread + outliers.
• Pie Chart → Proportions of categories.
• Heatmap → Relationship between multiple variables.
4. Example
Raw Dataset:

CustomerID Age Region Purchase

1 25 East 2000

2 45 West 3000

3 35 East 1500

4 28 North 4000

Summarized Data:
- Numerical Summary: Mean Age = 33.25, Mean Purchase = ₹2,625, Max Purchase = ₹4,000
- Categorical Summary: Region Counts → East: 2, West: 1, North: 1
- Graphical Summary: Bar chart showing purchase per region.
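The summary above can be reproduced with a few lines of standard-library Python. The list-of-dictionaries layout is an assumption made for illustration; the values are the ones in the raw dataset.

```python
import statistics
from collections import Counter

customers = [
    {"CustomerID": 1, "Age": 25, "Region": "East", "Purchase": 2000},
    {"CustomerID": 2, "Age": 45, "Region": "West", "Purchase": 3000},
    {"CustomerID": 3, "Age": 35, "Region": "East", "Purchase": 1500},
    {"CustomerID": 4, "Age": 28, "Region": "North", "Purchase": 4000},
]

# Numerical summary
mean_age = statistics.mean([c["Age"] for c in customers])
mean_purchase = statistics.mean([c["Purchase"] for c in customers])
max_purchase = max(c["Purchase"] for c in customers)

# Categorical summary: how many customers per region
region_counts = Counter(c["Region"] for c in customers)

# Data behind the bar chart: total purchase per region
purchase_by_region = {}
for c in customers:
    purchase_by_region[c["Region"]] = purchase_by_region.get(c["Region"], 0) + c["Purchase"]

print(mean_age, mean_purchase, max_purchase)
print(dict(region_counts))
print(purchase_by_region)
```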

Data Distribution :
Data distribution refers to the way values in a dataset are spread or arranged across possible values. It
describes the frequency or probability of occurrence of each value (or range of values) and is fundamental
in understanding data characteristics.

Importance in Data Science

- Guides Statistical Analysis: Many statistical models assume specific data distributions (e.g., normal
distribution in parametric tests).
- Detects Anomalies: Outliers and unusual patterns can be identified by observing the spread.
- Data Preprocessing: Skewed or non-normal distributions may require transformations.
- Model Selection: Certain machine learning algorithms perform better with specific data distributions.
Types of Data Distributions
A. Based on Shape
1. Normal (Gaussian) Distribution
- Symmetrical bell-shaped curve.
- Mean = Median = Mode.
- Many natural phenomena follow this distribution (e.g., height, weight).

2. Uniform Distribution
- Equal probability for all values in the range.
- Example: Rolling a fair die.

3. Skewed Distribution
- Positively Skewed (Right Skew): Long tail on the right; mean > median.
- Negatively Skewed (Left Skew): Long tail on the left; mean < median.

4. Bimodal and Multimodal Distributions

- Two or more peaks in the data.
- Example: Test scores of two different student groups.

B. Based on Probability Type

1. Discrete Distributions (data takes specific, separate values)
- Binomial Distribution: Number of successes in a fixed number of trials.
- Poisson Distribution: Number of events occurring in a fixed time/space.

2. Continuous Distributions (data can take any value within a range)

- Normal Distribution: Common in natural and social phenomena.
- Exponential Distribution: Time until an event occurs.
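To make the shapes above concrete, the sketch below draws random samples from a normal, a uniform, and an exponential distribution and compares mean and median. The parameters (mean 50, range 0 to 100, rate 1/50) and the seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable
n = 10_000

samples = {
    "normal": [random.gauss(50, 10) for _ in range(n)],
    "uniform": [random.uniform(0, 100) for _ in range(n)],
    "exponential": [random.expovariate(1 / 50) for _ in range(n)],
}

summary = {}
for name, values in samples.items():
    summary[name] = (statistics.mean(values), statistics.median(values))
    mean, median = summary[name]
    print(f"{name:12s} mean={mean:6.2f} median={median:6.2f}")
```

For the symmetric normal and uniform samples the mean and median nearly coincide, while the right-skewed exponential sample shows a mean clearly above the median.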

Methods to Represent Data Distribution

- Histogram: Displays frequency counts for grouped intervals (bins).
- Boxplot: Shows median, quartiles, and outliers.
- Density Plot: Smooth curve showing probability density.
- Violin Plot: Combines boxplot with a mirrored density plot.

Example
Exam Scores Data:
- Mean = 72, Median = 74
- Slightly left-skewed (negative skew) → Most students scored high, but a few low scores reduced the
mean.
Summary Table
Distribution Type   Shape                      Example                          Applications
Normal              Symmetrical bell curve     Human height                     Parametric tests, regression
Uniform             Flat, equal probability    Dice rolls                       Random sampling
Positive Skew       Long tail right            Income levels                    Wealth distribution analysis
Negative Skew       Long tail left             Age at retirement                Demographic studies
Bimodal             Two peaks                  Test scores from two batches     Population segmentation
Poisson             Skewed, discrete           Number of emails/day             Event counting
Exponential         Continuous, skewed         Time to service completion       Reliability analysis

Measuring Asymmetry
In data science, asymmetry (or skewness) refers to the degree to which the distribution of data
deviates from perfect symmetry around its central value (mean or median). A symmetric
distribution has equal spread on both sides, while an asymmetric distribution shows more
concentration of values on one side.
Importance in Data Science
Measuring asymmetry is important for:
• Understanding Data Shape – Identifying whether data is symmetric or skewed.
• Selecting Appropriate Models – Many machine learning algorithms assume normally
distributed data.
• Feature Engineering – Skewed variables may require transformation before model training.
• Business Insights – Detecting skew can reveal unusual patterns, such as extreme spending or
unusual customer behavior.
Types of Asymmetry
a) Positive Skew (Right Skewed)
• Tail extends more towards the right side of the distribution.
• Mean > Median > Mode.
• Example: Distribution of income levels in a country.

b) Negative Skew (Left Skewed)

• Tail extends more towards the left side of the distribution.
• Mean < Median < Mode.
• Example: Age at retirement for a specific population.
Measures of Asymmetry
Moment Coefficient of Skewness
Formula:
Skewness = [ Σ(xi - x̄ )³ ] / [ n * s³ ]
Where:
• xi = individual data values
• x̄ = mean of data
• s = standard deviation
• n = number of observations
Pearson’s Coefficients of Skewness
1. First Coefficient: ( x̄ – Mode ) / s
2. Second Coefficient: 3( x̄ – Median ) / s
Bowley’s (Quartile) Coefficient of Skewness
Formula:
Skewness = ( Q3 + Q1 – 2Q2 ) / ( Q3 – Q1 )
Where:
• Q1 = First Quartile
• Q2 = Median
• Q3 = Third Quartile
Interpretation of Skewness Values

Skewness Value Interpretation

0 Perfectly symmetric

0 to 0.5 or -0.5 to 0 Approximately symmetric

0.5 to 1 Moderately positively skewed

-1 to -0.5 Moderately negatively skewed

>1 Highly positively skewed

< -1 Highly negatively skewed

Detecting Asymmetry
• Histogram – Shows shape and tail direction.
• Boxplot – Longer whisker indicates skew direction.
• Density Plot – Reveals deviation from symmetry.
Handling Skewness in Data Science
• Log Transformation – Commonly used for right-skewed data.
• Square Root Transformation – Useful for moderate skew.
• Box-Cox Transformation – General transformation method.
• Use of Robust Models – Decision Trees and Random Forests handle skew naturally.
Example
Dataset: Exam scores = {45, 50, 52, 53, 55, 60, 95}
• Mean = 58.57
• Median = 53
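• Skewness ≈ 1.37 using the moment formula with s as the sample standard deviation (≈ 1.73 if s is the population standard deviation) → Positive Skew.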
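The three skewness measures defined earlier can be applied to this exam-score dataset as follows. This is a sketch under two assumptions that the notes leave open: s is taken as the sample standard deviation, and the quartiles use the "inclusive" interpolation method of statistics.quantiles. Other conventions give somewhat different numbers but the same sign.

```python
import statistics

scores = [45, 50, 52, 53, 55, 60, 95]
n = len(scores)
mean = statistics.mean(scores)
median = statistics.median(scores)
s = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)

# Moment coefficient: sum of cubed deviations divided by n * s^3
moment_skew = sum((x - mean) ** 3 for x in scores) / (n * s ** 3)

# Pearson's second coefficient: 3 * (mean - median) / s
pearson_second = 3 * (mean - median) / s

# Bowley's coefficient from the quartiles
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
bowley = (q3 + q1 - 2 * q2) / (q3 - q1)

print(round(moment_skew, 2), round(pearson_second, 2), round(bowley, 2))
```

All three measures come out positive, which agrees with the long right tail created by the score of 95.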
Sample Mean and Estimated Mean
Sample Mean
The sample mean is the arithmetic average of values from a sample, not the entire population.

Formula:
x̄ = Σ(xi) / n
Where:
• x̄ = sample mean
• xi = each value in the sample
• n = number of observations in the sample
Properties
• Serves as an unbiased estimator of the population mean (μ).
• Sensitive to extreme values (outliers).
• Simple to compute and widely used.
Example
Sample data: {10, 12, 15, 18, 20}
x̄ = (10 + 12 + 15 + 18 + 20) / 5 = 75 / 5 = 15
The sample mean is 15.
Estimated Mean
The estimated mean refers to the value obtained by using the sample mean to approximate the
unknown population mean (μ).

Since we cannot compute the exact population mean without having all data points, we estimate
it using the sample mean:
μ̂ ≈ x̄
Where:
• μ̂ = estimated population mean
• x̄ = sample mean
Relationship
• Sample mean is a statistic (calculated from data).
• Estimated mean is an estimation of a parameter (population mean) based on that statistic.
Example in Data Science Context
Suppose we want the average monthly spending of all customers in a city:
• Population size: 100,000 customers (unknown μ).
• We take a sample of 200 customers and find the sample mean = ₹ 5,200.

Estimated mean:
μ̂ ≈ 5,200
This is our best guess for the true population mean.
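The idea can be tried on synthetic data. In the sketch below the population of 100,000 spending values is simulated (normally distributed around ₹5,200 with a spread of ₹1,000, both assumptions), so the true mean is known and can be compared with the estimate from a sample of 200.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Synthetic population of monthly spending values (assumed, for illustration)
population = [random.gauss(5200, 1000) for _ in range(100_000)]
mu = statistics.mean(population)   # unknown in a real study

sample = random.sample(population, 200)
x_bar = statistics.mean(sample)    # sample mean, used as the estimate of mu

print(f"population mean = {mu:.1f}, estimated mean = {x_bar:.1f}")
```

The two numbers are close but not identical; the gap is sampling error, and it shrinks on average as the sample size grows.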
Importance in Data Science
• Used in descriptive statistics to summarize data.
• Provides a basis for inference — many statistical models assume the mean is known or
estimated.
• Critical in hypothesis testing and confidence interval calculation.
Difference Between Sample Mean and Estimated Mean

Aspect       Sample Mean (x̄)                       Estimated Mean (μ̂)
Definition   Arithmetic mean of sample data         Approximation of the population mean using sample data
Data Basis   Calculated directly from the sample    Based on the sample mean
Purpose      Describes the sample                   Predicts the population mean
Symbol       x̄                                      μ̂

• Sample mean is computed from observed data.

• Estimated mean uses the sample mean to infer the population mean.
Variance and Standard Score
In data science and statistics, variance and standard score (also called z-score) are essential
concepts used to measure the spread of data and standardize data points for comparison. Variance
indicates how much the data points differ from the mean, while the standard score tells us how
far a specific value is from the mean in terms of standard deviations.
Variance
Variance is a measure of the dispersion of a set of values. It calculates the average of the squared
differences between each value and the mean.
A high variance indicates that the data points are spread out widely from the mean, while a low
variance means they are closer to the mean.
Formula for population variance: σ² = Σ (xᵢ - μ)² / N
Formula for sample variance: s² = Σ (xᵢ - x̄ )² / (n - 1)
Example:
Consider the dataset: 5, 7, 3
Mean = (5 + 7 + 3) / 3 = 5
Population variance = [(5-5)² + (7-5)² + (3-5)²] / 3 = (0 + 4 + 4) / 3 ≈ 2.67
Sample variance = (0 + 4 + 4) / (3 - 1) = 4
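Both variance formulas are available in Python's statistics module, which makes the difference between dividing by N and by n - 1 easy to see on this dataset. A minimal sketch:

```python
import statistics

data = [5, 7, 3]

population_variance = statistics.pvariance(data)  # divides by N = 3
sample_variance = statistics.variance(data)       # divides by n - 1 = 2

print(round(population_variance, 2), sample_variance)
```

The sample formula gives the larger value because dividing by n - 1 compensates for estimating the mean from the same data.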
Standard Score (Z-score)
The standard score (z-score) is a statistical measure that describes a value's position relative to
the mean of a group of values, measured in terms of standard deviations. It is especially useful
for comparing values from different datasets or distributions.
Formula: z = (x - μ) / σ
Example:
If a student scored 85 on a test where the mean score was 75 and the standard deviation was 5:
z = (85 - 75) / 5 = 2
Interpretation: The student scored 2 standard deviations above the mean.
Applications in Data Science
1. Variance is used in statistical modeling to understand variability and detect features with high
or low variability.
2. Z-scores are used in anomaly detection, standardizing data for machine learning models, and
in hypothesis testing.
Statistical Inference (Frequency Approach)
Definition:
Statistical inference is the process of drawing conclusions about a population based on
information from a sample.
The frequentist (frequency) approach defines probability as the long-run frequency of an
event occurring after repeated trials.

• Example: If you flip a fair coin many times, the probability of heads = 0.5 means that in
the long run, 50% of flips will show heads.
Key Idea:
• Population → The entire group (e.g., all students in a college).
• Sample → A small part of the population (e.g., 50 students chosen randomly).
• Inference → Using the sample to estimate or test something about the population (like the
average marks of all students).
Example:
Suppose you want to know the average height of college students.
• You cannot measure all 5,000 students.
• Instead, you take a sample of 100 students.
• Using their heights, you estimate the average for the whole college.
This process = statistical inference
Applications:
• Predicting election results by surveying a small group.
• Estimating the failure rate of machines in a factory.
• Medical trials – testing a drug on a small group before general use.

Variability of Estimates:

When we take different random samples from the same population, the estimates (like sample
mean, variance, or proportion) will not be exactly the same. This variation is called sampling
variability or variability of estimates.
Example:
Imagine you want to estimate the average mark of students in a class of 500.
• Sample 1 (50 students): Mean = 68
• Sample 2 (50 students): Mean = 72
• Sample 3 (50 students): Mean = 70
Each sample gives a slightly different mean → This is variability of estimates.
Why it matters: we need to know how reliable our sample estimate is.
• If variability is small → The estimate is stable and reliable.
• If variability is large → The estimate is uncertain.
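Sampling variability can be observed directly by drawing many samples from the same population. In this sketch the class of 500 marks is simulated (mean 70, spread 12, both assumed), and the helper spread_of_sample_means is a hypothetical name introduced for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Synthetic marks for a class of 500 students (assumed distribution)
population = [random.gauss(70, 12) for _ in range(500)]

def spread_of_sample_means(sample_size, repeats=1000):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [statistics.mean(random.sample(population, sample_size))
             for _ in range(repeats)]
    return statistics.stdev(means)

spread_50 = spread_of_sample_means(50)
spread_200 = spread_of_sample_means(200)

print(f"n=50: {spread_50:.2f}   n=200: {spread_200:.2f}")
```

Larger samples give sample means that cluster more tightly, which is why bigger samples produce more reliable estimates.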
Applications:
• In quality control, measuring consistency in production.
• In finance, understanding how different samples of stock data may affect predictions.
• In medicine, testing how reliable treatment outcomes are when repeated with different
groups.
Hypothesis Testing using Confidence Intervals:
A confidence interval (CI) gives a range of values within which the true population
parameter is likely to lie with a certain level of confidence (usually 95%).
Instead of just giving a single estimate (like a sample mean), CI provides a range.
Example:
Suppose a sample of 100 students has an average height = 160 cm, with a 95% confidence
interval of [158 cm, 162 cm].
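• This is commonly read as: we are 95% confident that the true average height of all students
lies between 158 and 162 cm. More precisely, intervals built this way capture the true average
in about 95% of repeated samples.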
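A confidence interval like the one above can be computed from the sample mean, the sample standard deviation, and the sample size. The sketch below uses the normal approximation (mean ± z × s/√n); the standard deviation of 10.2 cm is an assumed value, chosen because it is consistent with the interval quoted in the example, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_std, n, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sample_std / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# sample_std = 10.2 is an assumption, not a value given in the notes
low, high = mean_confidence_interval(160, 10.2, 100)
print(f"95% CI: ({low:.1f}, {high:.1f})")
```

For small samples a t-distribution critical value is normally used in place of z, which widens the interval slightly.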
Hypothesis Testing with CI:

Hypothesis testing is a method used to check whether a claim about data is true or not using
sample data.
Think of it like checking if a statement is correct.
Example
A mobile company says:
“Our phone battery lasts 10 hours.”
You test 25 phones and find the average battery life is 9.2 hours.
Now the question is:
* Is the company telling the truth?
or
* Is the battery life actually different?
To answer this, we use Hypothesis Testing.
Two Types of Hypotheses
When we test a claim, we create two statements.
1. Null Hypothesis (H₀)
This means the claim is true.
Example
H₀: Average battery life = 10 hours
2. Alternative Hypothesis (H₁)
This means the claim is not true.
Example
H₁: Average battery life ≠ 10 hours
Significance Level (α)
Before testing, we decide how strict we want to be.
Usually we choose:
α=0.05
This means we accept a 5% chance of rejecting H₀ when it is actually true.
Two Ways to Test the Hypothesis
There are two common methods:
1. Confidence Interval Method
2. p-value Method
For a two-sided test at the same α (a 95% interval with α = 0.05), both methods lead to the same decision.
Confidence Interval Method
A confidence interval is simply a range of possible values where the true answer may lie.
Think of it like:
“The real value is probably somewhere between these two numbers.”
Example
Suppose after testing phones you get:
Average battery life = 9.2 hours
95% Confidence Interval:
(8.8, 9.6)
This means:
The real battery life is likely between 8.8 and 9.6 hours.
Now check the company claim
Company claim = 10 hours
Confidence Interval = 8.8 to 9.6
Question:
Does 10 lie inside the interval 8.8 to 9.6?
No, 10 is outside the range.
Decision
Since 10 is outside the interval
We reject the company claim
Conclusion:
The battery life is not 10 hours.
p-value Method
The p-value tells us:
“If the company claim is true, how likely is a result at least as extreme as the one we observed?”
Interpretation
p-value Meaning
p-value < 0.05 Strong evidence against H₀
p-value ≥ 0.05 Not enough evidence
Example
Suppose after calculation we get:
p-value = 0.02
Significance level
α=0.05
Compare:
0.02 < 0.05
Decision
Reject H₀

Conclusion:
The battery life is different from 10 hours.
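Both decision rules can be run side by side. The sketch below is a two-sided z-test under a normal approximation; the sample standard deviation of 1.0 hour is an assumption (the notes do not state it), the function name is hypothetical, and the printed p-value therefore does not reproduce the illustrative 0.02 used above.

```python
import math
from statistics import NormalDist

def two_sided_z_test(sample_mean, sample_std, n, claimed_mean, alpha=0.05):
    """Return (p_value, confidence_interval, reject_h0) under a normal approximation."""
    std_error = sample_std / math.sqrt(n)
    z = (sample_mean - claimed_mean) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (sample_mean - z_crit * std_error, sample_mean + z_crit * std_error)
    return p_value, ci, p_value < alpha

# sample_std = 1.0 is assumed; n = 25 phones, sample mean 9.2 h, claim 10 h
p_value, ci, reject = two_sided_z_test(9.2, 1.0, 25, claimed_mean=10)
claim_inside_ci = ci[0] <= 10 <= ci[1]

print(f"p-value = {p_value:.5f}, CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print("reject H0:", reject, "| claim inside CI:", claim_inside_ci)
```

The claim falls outside the interval exactly when the p-value is below α, so the two methods agree. With only 25 phones, a t-test would be the more careful choice than this z-approximation.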
Simple Real-Life Example
Teacher says:
“Average marks of students = 70”
You test some students and find average = 65.
Now you check using hypothesis testing.
Confidence Interval Example
Confidence interval = (63, 67)
Teacher claim = 70
70 is outside the interval
Teacher claim is wrong.
p-value Example
p-value = 0.01
0.01 < 0.05
👉 Reject teacher claim.
Easy Way to Remember
Confidence Interval
Check range
If claim outside range → reject
p-value
Compare p-value with 0.05
If p-value < 0.05 → reject

Errors in Hypothesis Testing

When we perform hypothesis testing, sometimes the decision we make can be wrong.

These mistakes are called Errors in Hypothesis Testing.

There are two types of errors:

• Type I Error

• Type II Error

1. Type I Error (False Positive)

Definition

Type I Error occurs when we reject the Null Hypothesis (H₀) even though it is actually true.

Simple Meaning

We say the claim is wrong, but in reality the claim is correct.

Example (Battery Life Example)

Company claim:

Average battery life = 10 hours

Actual truth:

Battery life really is 10 hours

But our test result says:

Reject H₀

This means we say the company claim is false, even though it is true.

This is called Type I Error.

Symbol

Type I Error is represented by:

α (alpha)

Usually

α = 0.05

This means that when H₀ is true, there is a 5% chance of making a Type I Error.

2. Type II Error (False Negative)

Definition

Type II Error occurs when we fail to reject the Null Hypothesis even though it is false.

Simple Meaning

We say the claim is correct, but actually it is wrong.

Example (Battery Life Example)

Company claim:

Average battery life = 10 hours

Actual truth:

Battery life is 8 hours

But our test result says:

Do not reject H₀

This means we say the company claim is correct, even though it is wrong.

This is called Type II Error.

Symbol

Type II Error is represented by:

β (beta)

Simple Comparison

Error Type      Meaning                        Example
Type I Error    Rejecting a true H₀            Saying battery life ≠ 10 when it is actually 10
Type II Error   Failing to reject a false H₀   Saying battery life = 10 when it is actually 8

Easy Real-Life Example

Imagine a fire alarm system.

Type I Error

Alarm rings but there is no fire.

(False alarm)

Type II Error

There is fire, but alarm does not ring.

(Missed detection)

Quick Memory Trick

Type I Error → False Alarm

Type II Error → Missed Detection
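Both error rates can be estimated by simulation. The sketch below repeats the battery test many times with a z-test at α = 0.05; the known standard deviation of 1 hour, the sample size of 25, and the alternative true mean of 9.6 hours are all assumptions chosen so that both kinds of error actually occur.

```python
import math
import random
from statistics import NormalDist

random.seed(7)  # fixed seed so the sketch is repeatable
Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided test at alpha = 0.05

def rejects_h0(true_mean, claimed_mean=10.0, sigma=1.0, n=25):
    """Draw one sample and run a z-test of H0: mean == claimed_mean (sigma known)."""
    sample_mean = sum(random.gauss(true_mean, sigma) for _ in range(n)) / n
    z = (sample_mean - claimed_mean) / (sigma / math.sqrt(n))
    return abs(z) > Z_CRIT

trials = 2000

# H0 is true (mean really is 10): every rejection is a Type I error
type1_rate = sum(rejects_h0(10.0) for _ in range(trials)) / trials

# H0 is false (mean is 9.6): every failure to reject is a Type II error
type2_rate = sum(not rejects_h0(9.6) for _ in range(trials)) / trials

print(f"Type I rate  ≈ {type1_rate:.3f}")
print(f"Type II rate ≈ {type2_rate:.3f}")
```

The Type I rate stays near α by construction, while the Type II rate depends on how far the truth is from the claim and on the sample size; a larger sample or a bigger true difference lowers it.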

Understanding Exploratory Data Analysis
41 pages
ML Design Module2 2025-26
No ratings yet
ML Design Module2 2025-26
23 pages
Zapotec Mortuary Practices: Adults vs. Children
No ratings yet
Zapotec Mortuary Practices: Adults vs. Children
7 pages
Critique of Janov's Primal Therapy
No ratings yet
Critique of Janov's Primal Therapy
13 pages
Grade 12 Bartending Lesson Plan
No ratings yet
Grade 12 Bartending Lesson Plan
4 pages
Effective Conflict Management Strategies
No ratings yet
Effective Conflict Management Strategies
24 pages
TESOL Observation Report: EFL Class
No ratings yet
TESOL Observation Report: EFL Class
7 pages
MOA for Reading Center in Isabela School
No ratings yet
MOA for Reading Center in Isabela School
3 pages
4th Grade Math Lesson Plan: Perimeter
No ratings yet
4th Grade Math Lesson Plan: Perimeter
3 pages
Teacher-Student Relationships and Achievement
No ratings yet
Teacher-Student Relationships and Achievement
8 pages
Philosophy and the Human Person Overview
No ratings yet
Philosophy and the Human Person Overview
4 pages
Effective Lecture Strategies for Students
No ratings yet
Effective Lecture Strategies for Students
4 pages
Nietzsche's "Twilight of the Idols" Insights
No ratings yet
Nietzsche's "Twilight of the Idols" Insights
67 pages
Understanding School Curricula Types
No ratings yet
Understanding School Curricula Types
73 pages
Rural Leadership and Community Development
No ratings yet
Rural Leadership and Community Development
30 pages
Enhancing Education with Nyakarimojong
No ratings yet
Enhancing Education with Nyakarimojong
9 pages
Raja Mahendra Pratap University Results 2023-24
No ratings yet
Raja Mahendra Pratap University Results 2023-24
2 pages
JEE Advanced 2025: Chemical Bonding PYQs
No ratings yet
JEE Advanced 2025: Chemical Bonding PYQs
3 pages
IELTS Advanced Listening Practice Guide
No ratings yet
IELTS Advanced Listening Practice Guide
8 pages
Why Choose FEUP in Portugal?
No ratings yet
Why Choose FEUP in Portugal?
2 pages
Homunculus and Applied Philosophy
No ratings yet
Homunculus and Applied Philosophy
11 pages
Bipolar Disorder Management Strategies
No ratings yet
Bipolar Disorder Management Strategies
4 pages
Grade 3 Science Lesson Plan Overview
No ratings yet
Grade 3 Science Lesson Plan Overview
19 pages
Enhancing Engagement in Multicultural Teams
No ratings yet
Enhancing Engagement in Multicultural Teams
13 pages
Earth Science For Stem: Quarter 1 - Module 3: Physical and Chemical Properties of Minerals
No ratings yet
Earth Science For Stem: Quarter 1 - Module 3: Physical and Chemical Properties of Minerals
27 pages
Dark Impacts of AI and Robotics
No ratings yet
Dark Impacts of AI and Robotics
3 pages
MCQs On "Human Reproduction & Embryonic Development".
No ratings yet
MCQs On "Human Reproduction & Embryonic Development".
27 pages
PPST Indicator 2025-2026 Cot 1 and Cot 2
No ratings yet
PPST Indicator 2025-2026 Cot 1 and Cot 2
11 pages
Psychiatry MCQs for Exam Preparation
100% (1)
Psychiatry MCQs for Exam Preparation
17 pages
Big Data Technologies: Hadoop Overview
No ratings yet
Big Data Technologies: Hadoop Overview
30 pages
LET English Specialization Reviewer
75% (4)
LET English Specialization Reviewer
141 pages
SESB: Bilingual Education in Berlin
No ratings yet
SESB: Bilingual Education in Berlin
2 pages

Unit 2 DataScience

Uploaded by

Unit 2 DataScience

Uploaded by


Measure | Meaning | Use Case
Mean | Average of values | Average salary, marks, temperature
Median | Middle value | Income analysis (less affected by outliers)
Mode | Most frequent value | Product popularity, survey choices
Range | Spread from min to max | Knowing variation in data
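
As a minimal sketch, the four measures in the table can be computed with Python's standard library. The dataset is the small example list used for the mode earlier in these notes; the variable names are illustrative only.

```python
import statistics

# Example dataset (same values as the mode example in these notes)
data = [10, 20, 20, 30, 40]

mean_value = statistics.mean(data)      # sum of values / number of values
median_value = statistics.median(data)  # middle value of the sorted data
mode_value = statistics.mode(data)      # most frequent value
range_value = max(data) - min(data)     # maximum minus minimum

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
print("Range:", range_value)
```

For this list the mean is 24, the median and mode are both 20, and the range is 30.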

Remove duplicate row → Keep only one.

Customer ID | Age | Gender | Plan | Monthly Charges | Last Login
101 | 25 | Male | A | 1200 | 2025-06-01
102 | (missing) | Female | A | (missing) | 2025-06-02
101 | 25 | Male | A | 1200 | 2025-06-01
103 | 30 | Female | B | 950 | not recorded
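
The cleaning of a table like this one can be sketched with pandas. This is an illustration under stated assumptions, not the exact procedure from the notes: the choice to fill the missing age with the median and the missing charge with the mean is an assumption, as is treating "not recorded" as a missing date.

```python
import pandas as pd

# Raw table as shown above; None marks the empty cells
raw = pd.DataFrame({
    "Customer ID": [101, 102, 101, 103],
    "Age": [25, None, 25, 30],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Plan": ["A", "A", "A", "B"],
    "Monthly Charges": [1200, None, 1200, 950],
    "Last Login": ["2025-06-01", "2025-06-02", "2025-06-01", "not recorded"],
})

# Step 1: remove the duplicate row for customer 101 (keep only one copy)
clean = raw.drop_duplicates().copy()

# Step 2: convert dates; unparseable text such as "not recorded" becomes NaT
clean["Last Login"] = pd.to_datetime(clean["Last Login"], errors="coerce")

# Step 3 (assumed strategy): fill missing numeric values
clean["Age"] = clean["Age"].fillna(clean["Age"].median())
clean["Monthly Charges"] = clean["Monthly Charges"].fillna(
    clean["Monthly Charges"].mean()
)

print(clean)
```

Other strategies, such as dropping incomplete rows, are equally valid. Which one fits depends on how much data is missing and why.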

After Data Preparation:

Step Purpose Real-life Example

Prepare features for Scale income and expenses for credit

Exploratory Data Analysis (EDA):

CustomerID Age Region Purchase

Importance in Data Science

4. Bimodal and Multimodal Distributions

B. Based on Probability Type

2. Continuous Distributions (data can take any value within a range)

Methods to Represent Data Distribution

b) Negative Skew (Left Skewed)

Skewness Value | Interpretation
0 to 0.5 or -0.5 to 0 | Approximately symmetric
0.5 to 1 | Moderately positively skewed
-1 to -0.5 | Moderately negatively skewed
> 1 | Highly positively skewed
< -1 | Highly negatively skewed
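
The interpretation table can be applied in code. The sketch below computes the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second central moment) and maps it to the labels above. The function names are hypothetical, and assigning the boundary values exactly 0.5 and 1 to the lower category is an assumption, since the table does not say which side they belong to.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (population moments).

    Raises ZeroDivisionError if all values are identical.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


def interpret_skewness(g):
    """Map a skewness value to the categories in the table."""
    if g > 1:
        return "Highly positively skewed"
    if g > 0.5:
        return "Moderately positively skewed"
    if g >= -0.5:
        return "Approximately symmetric"
    if g >= -1:
        return "Moderately negatively skewed"
    return "Highly negatively skewed"


symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 1, 1, 1, 10]  # one large value pulls the tail to the right

print(interpret_skewness(skewness(symmetric)))
print(interpret_skewness(skewness(right_tail)))
```

Libraries may apply a bias correction to this coefficient, so their values can differ slightly from the one computed here on small samples.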

Aspect | Sample Mean (x̄) | Estimated Mean (μ̂)
Definition | Arithmetic mean of sample | Approximation of the
Data Basis | Calculated directly from | Based on the sample mean
Purpose | Describes the sample | Predicts the population

• Sample mean is computed from observed data.
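
A small sketch of the distinction: the sample mean is computed directly from the observations, and the same number is then used as the estimate μ̂ of the unknown population mean. The measurements below are hypothetical, and the standard error is included only as a common way to show how uncertain that estimate is.

```python
import statistics

# Hypothetical sample of five measurements
sample = [9.8, 10.1, 10.4, 9.7, 10.0]

# Sample mean: describes the observed data
sample_mean = statistics.mean(sample)

# Estimated mean: the sample mean used as a point estimate of the population mean
mu_hat = sample_mean

# Standard error of the mean: sample standard deviation / sqrt(n)
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print("Sample mean:", sample_mean)
print("Estimated population mean:", mu_hat)
print("Standard error:", standard_error)
```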

Errors in Hypothesis Testing

These mistakes are called Errors in Hypothesis Testing.

There are two types of errors:

1. Type I Error (False Positive)

We reject the claim (the null hypothesis, H₀) even though the claim is actually true.

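Example (Battery Life Example)

Claim (H₀): Average battery life = 10 hours

Reality: Battery life really is 10 hours

Test result: we reject the claim and conclude that battery life is not 10 hours.

This is called Type I Error.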

Type I Error is represented by: α (alpha), the significance level.

For example, α = 0.05 means a 5% chance of making a Type I Error when the claim is actually true.

2. Type II Error (False Negative)

We accept the claim (fail to reject the null hypothesis, H₀) even though the claim is actually false.

Example (Battery Life Example)

Claim (H₀): Average battery life = 10 hours

Reality: Battery life is 8 hours

But our test result says: battery life is 10 hours (we fail to reject the claim).

This is called Type II Error.

Type II Error is represented by: β (beta)

Error Type | Meaning | Example
Type I Error | Rejecting true H₀ | Saying battery life is not 10 when it actually is 10
Type II Error | Accepting false H₀ | Saying battery life = 10 when it is actually 8

Easy Real-Life Example

Imagine a fire alarm system.

Type I Error: Alarm rings but there is no fire.

Type II Error: There is a fire, but the alarm does not ring.

Quick Memory Trick

Type I Error → False Alarm

Type II Error → Missed Detection
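
Both error types can be seen in a simulation of the battery example. The sketch below uses a two-sided z-test at α = 0.05. The standard deviation (4 hours), sample size (16), number of trials, and function names are assumptions chosen for illustration, not values from the notes.

```python
import random
from statistics import NormalDist, mean


def z_test_rejects(sample, claimed_mean, sigma, alpha=0.05):
    """Two-sided z-test with known sigma: True if H0 (mean == claimed_mean) is rejected."""
    n = len(sample)
    z = (mean(sample) - claimed_mean) / (sigma / n ** 0.5)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return abs(z) > z_crit


def rejection_rate(true_mean, claimed_mean=10.0, sigma=4.0, n=16,
                   trials=2000, seed=0):
    """Fraction of simulated samples in which the test rejects H0."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        if z_test_rejects(sample, claimed_mean, sigma):
            rejections += 1
    return rejections / trials


# H0 is true (battery life really is 10 hours): every rejection is a Type I error
type_1_rate = rejection_rate(true_mean=10.0)

# H0 is false (battery life is 8 hours): every non-rejection is a Type II error
type_2_rate = 1 - rejection_rate(true_mean=8.0)

print("Estimated Type I error rate (alpha):", type_1_rate)
print("Estimated Type II error rate (beta):", type_2_rate)
```

Under these assumptions the Type I rate should come out close to the chosen α of 0.05, while the Type II rate depends on the sample size and on how far the true mean is from the claimed one. A larger sample lowers β without changing α.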
