
Statistics For DS

The document provides an overview of statistics in data science, covering its definition, importance, and types, including descriptive and inferential statistics. It explains key concepts such as measures of central tendency, measures of dispersion, and the significance of sampling in data collection. Additionally, it outlines the steps for effective data collection and the classification of data types.

Uploaded by

ap6981549
Copyright
© All Rights Reserved

📘 Statistics for Data Science

Topic 1: What is Statistics?


🔹 Definition
• Statistics is the branch of mathematics that deals with collecting, organizing, analyzing, and interpreting data to
make informed decisions.
• In Data Science, statistics provides the tools and techniques to understand patterns, relationships, and trends
hidden in data.

🔹 Why Statistics in Data Science?


1. Helps summarize large datasets into meaningful numbers.
2. Allows us to make predictions (e.g., sales forecast, weather prediction).
3. Helps in decision-making under uncertainty.
4. Provides the foundation for Machine Learning algorithms.

🔹 Two Main Types of Statistics


1. Descriptive Statistics
a. Summarizes and describes data.
b. Example: Mean, Median, Mode, Standard Deviation, Graphs (histograms, pie charts).
c. E.g.: “The average score of the class is 75.”
2. Inferential Statistics
a. Uses a sample to make conclusions about a population.
b. Based on probability.
c. Example: Hypothesis testing, Confidence intervals, Regression.
d. E.g.: “Based on a survey of 1,000 voters, we predict candidate A will win the election.”

🔹 Example in Data Science


Suppose we have data of 1,000 customers’ purchases:
• Descriptive: Find the average amount spent and make a bar chart of top products.
• Inferential: Take a sample of 100 customers, analyze them, and predict buying patterns for all 1,000 customers.

🔹 Key Points / Summary


• Statistics = Science of data analysis.
• Two types → Descriptive (summarize data) & Inferential (predict/conclude).
• Core of Data Science, AI, and ML.

Topic 2: Descriptive Statistics


🔹 Definition
Descriptive Statistics is the branch of statistics that summarizes and describes the main features of a dataset.
• It does not make predictions or conclusions beyond the given data.
• Focuses on what the data shows.

🔹 Why Descriptive Statistics?


• To simplify large datasets into meaningful information.
• To identify patterns, trends, and variations in data.
• To prepare data before applying Machine Learning or Inferential Statistics.

🔹 Key Measures in Descriptive Statistics


1. Measures of Central Tendency (where data is centered)
a. Mean: Sum of all values divided by the number of values.
Example: If test scores are 60, 70, 80 → Mean = (60+70+80)/3 = 70
b. Median: Middle value when data is arranged in order.
Example: [10, 15, 20] → Median = 15
c. Mode: Most frequently occurring value.
Example: [2, 3, 3, 5, 7] → Mode = 3

2. Measures of Dispersion (Spread of Data)


a. Range = Max − Min
Example: [5, 7, 9, 15] → Range = 15−5 = 10
b. Variance (σ²): Average of squared deviations from the mean.
c. Standard Deviation (σ): Square root of variance.
Small σ → Data is close to mean; Large σ → Data is spread out

3. Measures of Shape
a. Skewness: Tells if data is symmetric or skewed.
i. Positive skew → Tail on right (e.g., income distribution).
ii. Negative skew → Tail on left.
b. Kurtosis: Tells how heavy the tails are (often described as peaked vs. flat) compared to a normal distribution.

🔹 Visual Representation
• Graphs/Charts used in Descriptive Statistics:
o Histogram
o Bar chart
o Pie chart
o Box plot

🔹 Example in Data Science


Suppose you analyze daily sales of an online store (10 days):
[120, 150, 130, 160, 140, 200, 180, 190, 170, 210]
• Mean = 165
• Median = 165
• Mode = None
• Range = 210 − 120 = 90
• Standard Deviation ≈ 28.7 (population formula, dividing by N)
From this, you know the average daily sales and how much they vary.
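
The summary above can be reproduced with a short sketch using Python's standard `statistics` module. The variable names are illustrative, and the "no mode" rule (every value tied) is an assumption about how to report a dataset in which all values occur once.

```python
import statistics

# Daily sales from the example above (10 days).
sales = [120, 150, 130, 160, 140, 200, 180, 190, 170, 210]

mean = statistics.mean(sales)
median = statistics.median(sales)
data_range = max(sales) - min(sales)

# pstdev divides by N (population formula); statistics.stdev divides by N - 1.
std_dev = statistics.pstdev(sales)

# multimode returns every value tied for the highest count. If all distinct
# values are tied, we report "no mode" (an assumed convention).
modes = statistics.multimode(sales)
no_mode = len(modes) > 1 and len(modes) == len(set(sales))
mode = None if no_mode else modes

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("range:", data_range)
print("std dev:", round(std_dev, 1))
```

Using `statistics.stdev` instead would give the sample standard deviation, which is slightly larger because it divides by N − 1.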

🔹 Summary / Key Points


• Descriptive statistics = summarizing data.
• Key tools: Central tendency, Dispersion, Shape.
• Helps understand patterns before deeper analysis.

Topic 3: Inferential Statistics


🔹 Definition
• Inferential Statistics is the branch of statistics that allows us to make predictions, generalizations, or decisions
about a population based on data collected from a sample.
• Unlike descriptive statistics (which just summarizes data), inferential statistics uses probability theory to test
hypotheses and draw conclusions.

🔹 Why Inferential Statistics?


• It is often impossible to study an entire population (too large, costly, or time-consuming).
• Instead, we take a sample and use it to infer about the population.
• Helps in decision-making under uncertainty.

🔹 Key Concepts
1. Population vs. Sample
a. Population: Entire group (e.g., all customers of Amazon).
b. Sample: Subset of the population (e.g., 1,000 customers surveyed).
2. Parameter vs. Statistic
a. Parameter: A true value describing the population (e.g., actual average income of all Indians).
b. Statistic: A value calculated from a sample (e.g., average income of 1,000 surveyed Indians).
3. Probability
a. Foundation of inferential statistics.
b. Helps estimate how likely our conclusion is correct.
4. Confidence Interval (CI)
a. Range of values within which the true population parameter is expected to lie.
b. Example: “The average height of students is 160–170 cm with 95% confidence.”
5. Hypothesis Testing
a. Process to test claims about the population.
b. Steps:
i. State null hypothesis (H₀) and alternative hypothesis (H₁).
ii. Collect sample data.
iii. Use test statistics (z-test, t-test, chi-square, ANOVA).
iv. Decide whether to reject H₀.
6. Regression Analysis
a. Used to study relationships between variables.
b. Example: Predicting sales based on advertising spend.

🔹 Example in Data Science


Suppose a company wants to know the average time customers spend on their app.
• Population = 1 million customers
• Sample = 1,000 customers surveyed
• Descriptive Statistics → Average time (sample) = 35 minutes
• Inferential Statistics → Predicts true population average lies between 33 and 37 minutes (95% CI)
This helps the company understand overall usage without surveying all 1 million users.

🔹 Summary / Key Points


• Descriptive → What data shows
• Inferential → What we can conclude/predict
• Uses probability, hypothesis testing, confidence intervals, regression.
• Essential for data-driven decision-making.

Topic 4: Parameter vs. Statistic

🔹 Parameter
• A parameter is a numerical value that describes a characteristic of the entire population.
• It is fixed but usually unknown, because studying the whole population is often impossible.
• Examples of parameters:
o Population mean (μ)
o Population standard deviation (σ)
o Population proportion (P)
Example:
If we want the average height of all students in India, that average is a parameter.
But since measuring all students is impractical, we usually estimate it using a sample.

🔹 Statistic
• A statistic is a numerical value that describes a characteristic of a sample.
• It is calculated from data collected from a part of the population.
• Statistics are used to estimate population parameters.
• Examples of statistics:
o Sample mean (x̄)
o Sample standard deviation (s)
o Sample proportion (p̂)
Example:
If we take a sample of 500 students from India and calculate their average height, that average is a statistic.

🔹 Key Difference
Aspect | Parameter (Population) | Statistic (Sample)
Definition | A value describing the entire population | A value describing a sample
Symbol | μ, σ, P | x̄, s, p̂
Value | Fixed (constant) | Varies from sample to sample
Known/Unknown | Usually unknown | Known (calculated)
Example | Average salary of all data analysts in India | Average salary of 500 sampled data analysts

🔹 Example in Data Science


Case: An e-commerce company wants to know the average daily spending of its 1 million customers.
• Parameter (μ): True average spending of all 1 million customers (unknown).
• Statistic (x̄): Average spending of a sample of 1,000 customers (calculated).
• Using inferential statistics, we use the statistic to estimate the parameter.

🔹 Summary / Key Points


• Parameter = Population → Fixed but unknown
• Statistic = Sample → Calculated and used to estimate parameter
• In Data Science: We work with statistics (from sample data) to infer parameters (about population).

Topic 5: Data Types


Data in statistics is classified into different types based on nature and measurement scale.

🔹 1. Qualitative (Categorical Data)


• Represents categories/labels (not numbers with meaning).
Types:
• Nominal → Categories without order.
o Example: Gender (Male/Female), City (Delhi, Mumbai).
• Ordinal → Categories with order/rank.
o Example: Education Level (High School < Graduate < Postgraduate).
🔹 2. Quantitative (Numerical Data)
• Represents numeric values (can be measured/calculated).
Types:
• Discrete → Countable values (whole numbers).
o Example: Number of students in a class (25, 30).
• Continuous → Any value within a range (decimals possible).
o Example: Height (170.5 cm), Weight (65.8 kg).

🔹 Quick Table
Type | Subtype | Example
Qualitative | Nominal | Blood Group (A, B, AB, O)
Qualitative | Ordinal | Movie Rating (★, ★★, ★★★)
Quantitative | Discrete | Number of Cars (1, 2, 3)
Quantitative | Continuous | Temperature (36.5°C)

Shortcut to remember:
• Qualitative → Categories
• Quantitative → Numbers
• Discrete = Counting | Continuous = Measuring

Topic 6: Steps to Collect Data


Collecting data is the first and most important step in statistics & data science because the quality of data directly affects
the quality of results.

🔹 Steps in Data Collection


1. Define the Objective
a. Clearly state what you want to find out.
b. Example: “Find the average monthly spending of college students on food.”
2. Identify the Population
a. Decide the target group from which data will be collected.
b. Example: All college students in Mumbai.
3. Choose a Sample (if needed)
a. If the population is too large, select a sample that represents the population.
b. Example: Select 500 students from different colleges.
4. Select Data Collection Method
a. Primary Data (collected first-hand): Surveys, Interviews, Experiments, Observations.
b. Secondary Data (already available): Databases, Reports, Government data, Company records.
5. Design the Data Collection Tool
a. Prepare questionnaires, forms, or sensors depending on the study.
b. Example: A Google Form with questions on monthly food spending.
6. Collect the Data
a. Execute the plan and gather responses/measurements.
b. Ensure accuracy and honesty in responses.
7. Organize & Store the Data
a. Arrange data in a structured format (tables, Excel, databases).
b. Example: Save student spending responses in an Excel sheet.

🔹 Example in Data Science


Case: A company wants to know customer satisfaction with their mobile app.
• Objective → Measure satisfaction level.
• Population → All app users.
• Sample → 2,000 randomly selected users.
• Method → Online survey form.
• Data Collected → Ratings from 1–5.
• Organized → Stored in a database for analysis.

🔹 Summary / Key Points


• Plan before collecting (Objective → Population → Sample).
• Choose correct method (Primary or Secondary).
• Organize data properly for further analysis.

Topic 7: Sampling

Sampling is important in Statistics for Data Science because:


• It saves time and cost (analyzing full population is impractical).
• Makes analysis practical when full data isn’t available.
• A well-chosen random sample provides reliable estimates of population parameters.
• Is the basis of inferential statistics (hypothesis testing, confidence intervals).
• Helps handle big data efficiently (used in model training & validation).
In short: Sampling makes statistics practical, efficient, and reliable in Data Science.

Probability Sampling (Every member has a known chance of being selected)

1. Simple Random Sampling


a. Each member has an equal chance of selection.
b. Example: Randomly picking 50 students from a college of 1,000.
2. Stratified Sampling
a. Population divided into strata (groups), then sample taken from each group.
b. Example: Divide students by branch (Science, Commerce, Arts), then select proportionally from each.
3. Systematic Sampling
a. Select every kth element from the population.
b. Example: From a list of 1,000 customers, select every 10th customer.
4. Cluster Sampling
a. Population divided into clusters (groups), then entire clusters are randomly selected.
b. Example: Choose 3 random schools in Mumbai and survey all students in those schools.
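
The four probability sampling methods above can be sketched with Python's `random` module. The population of 1,000 students, the `branch` and `school` fields, and all function names are hypothetical and exist only for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 students, each with a branch and a school.
branches = ["Science", "Commerce", "Arts"]
population = [
    {"id": i, "branch": branches[i % 3], "school": i % 20}
    for i in range(1000)
]

def simple_random(pop, n):
    """Every member has an equal chance of selection."""
    return rng.sample(pop, n)

def stratified(pop, key, fraction):
    """Sample the same fraction from each stratum (proportional allocation)."""
    strata = {}
    for member in pop:
        strata.setdefault(member[key], []).append(member)
    picked = []
    for group in strata.values():
        picked.extend(rng.sample(group, round(len(group) * fraction)))
    return picked

def systematic(pop, k):
    """Pick a random start among the first k members, then every k-th member."""
    start = rng.randrange(k)
    return pop[start::k]

def cluster(pop, key, n_clusters):
    """Randomly choose whole clusters and keep every member of them."""
    cluster_ids = sorted({member[key] for member in pop})
    chosen = set(rng.sample(cluster_ids, n_clusters))
    return [member for member in pop if member[key] in chosen]

srs = simple_random(population, 50)
strat = stratified(population, "branch", 0.05)
syst = systematic(population, 10)
clus = cluster(population, "school", 3)
print(len(srs), len(strat), len(syst), len(clus))
```

Note that stratified sampling draws from every group, while cluster sampling keeps only the members of the few groups that were chosen.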

Topic 8: Measures of Central Tendency

🔹 Definition
• Central Tendency refers to the single value that represents the center or average of a dataset.
• It tells us where most of the data points tend to cluster.
The three main measures are: Mean, Median, Mode.

🔹 1. Mean (Average)
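• Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sum of all values divided by the number of values)
• Example: Scores = [60, 70, 80] → Mean = (60 + 70 + 80)/3 = 70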

• Use: Works best when data has no extreme outliers.

🔹 2. Median (Middle Value)


• Arrange data in order → pick the middle value.
• If even number of values → take average of the two middle values.
• Example: [10, 15, 20] → Median = 15
• Example: [5, 10, 15, 20] → Median = (10+15)/2 = 12.5
• Use: Best when data has outliers (skewed distribution).

🔹 3. Mode (Most Frequent Value)


• Value that occurs most often in the dataset.
• Example: [2, 3, 3, 5, 7] → Mode = 3
• A dataset can have:
o One mode → Unimodal
o Two modes → Bimodal
o Many modes → Multimodal
• Use: Works well with categorical data (e.g., most popular product).
🔹 Example in Data Science
Dataset: Daily sales (in ₹): [120, 150, 130, 160, 140, 200, 180, 190, 170, 210]
• Mean = (Sum of all values ÷ 10) = 165
• Median = (5th + 6th value)/2 = (160 + 170)/2 = 165
• Mode = None (all values occur once)
Interpretation: Daily sales are centered around ₹165.
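
As a small illustration of the mode and median rules in this topic, the sketch below uses Python's `statistics` module. The `bimodal` and `products` lists are made-up examples, not data from these notes.

```python
import statistics

unimodal = [2, 3, 3, 5, 7]
bimodal = [1, 2, 2, 3, 4, 4]                                 # hypothetical data
products = ["phone", "laptop", "phone", "tablet", "phone"]   # hypothetical categories

one_mode = statistics.multimode(unimodal)     # a single most frequent value
two_modes = statistics.multimode(bimodal)     # two values tied for most frequent
top_product = statistics.mode(products)       # mode also works on categories

# With an even number of values, the median averages the two middle values.
even_median = statistics.median([5, 10, 15, 20])

print(one_mode, two_modes, top_product, even_median)
```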

🔹 Summary / Key Points


• Mean = Arithmetic average → Affected by outliers.
• Median = Middle value → Best when outliers exist.
• Mode = Most frequent value → Useful for categories.
In Data Science → Always check all three to understand data distribution.
Topic 9: Measures of Dispersion

🔹 Definition
• Dispersion means the spread or variability of data.
• While Central Tendency tells us the “center” of data (mean, median, mode),
Dispersion tells us how far the data is spread out around the center.
In short: How consistent or variable the data is.

🔹 Importance in Data Science


• Low dispersion → Data is consistent (values close to each other).
• High dispersion → Data is spread out (greater variability).
• Helps detect outliers, understand risk, and improve predictive models.
🔹 Types of Measures of Dispersion
1. Range
a. Difference between maximum and minimum values.
b. Formula:
Range=Max−Min
c. Example: [10, 15, 20, 25] → Range = 25 – 10 = 15
d. Limitation: Affected by extreme values.

2. Mean Deviation (MD)


a. Average of absolute deviations from the mean.
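b. Formula: $MD = \frac{1}{N}\sum_{i=1}^{N} |x_i - \mu|$

3. Variance (σ²)
a. Average of squared deviations from the mean.
b. Formula (Population): $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$

4. Standard Deviation (σ)


a. Square root of variance.
b. Formula: $\sigma = \sqrt{\sigma^2}$
c. Example: If the variance is 2.67, then σ = √2.67 ≈ 1.63.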
d. Widely used in Data Science & ML to measure variability.

🔹 Example in Data Science


Suppose daily sales (in ₹) of a store are:
[120, 130, 140, 200, 210]
• Mean = 160
• Range = 210 – 120 = 90
• Variance = 1400 (population formula)
• Standard Deviation ≈ 37.4
Interpretation: Although average sales are ₹160/day, daily sales typically deviate from this average by about ₹37.4.
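
The four measures of dispersion can be computed directly from their definitions. This is a minimal sketch using the population formulas (dividing by N); the variable names are illustrative.

```python
import math

sales = [120, 130, 140, 200, 210]
n = len(sales)
mean = sum(sales) / n

data_range = max(sales) - min(sales)

# Mean deviation: average absolute distance from the mean.
mean_deviation = sum(abs(x - mean) for x in sales) / n

# Population variance: average squared distance from the mean.
variance = sum((x - mean) ** 2 for x in sales) / n

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(mean, data_range, mean_deviation, variance, round(std_dev, 1))
```

Dividing by n − 1 instead of n would give the sample variance, which is the usual choice when the data is a sample rather than the whole population.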

🔹 Summary / Key Points


• Range → Simple, shows extreme spread.
• MD → Uses average deviation.
• Variance & SD → Most widely used, including in ML models.
Topic 10: Interquartile Range (IQR)

🔹 Definition
• The Interquartile Range (IQR) is a measure of spread of the middle 50% of the data.
• It is the difference between the third quartile (Q3) and the first quartile (Q1).
IQR=Q3−Q1

🔹 Quartiles Recap
• Q1 (First Quartile) → Value below which 25% of data lies.
• Q2 (Median) → Value below which 50% of data lies.
• Q3 (Third Quartile) → Value below which 75% of data lies.
So, IQR = Range of the middle 50% data.

🔹 Example
Data: [5, 7, 8, 12, 13, 14, 18, 21, 23, 27]
• Q1 = 8 (median of the lower half: 5, 7, 8, 12, 13)
• Q2 = 13.5 (median)
• Q3 = 21 (median of the upper half: 14, 18, 21, 23, 27)
IQR = 21 − 8 = 13
Interpretation: The middle 50% of values lie within a spread of 13 units.
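
The quartiles in this example follow the "median of each half" convention. The sketch below implements that convention; the `quartiles` helper is hypothetical, and other tools use interpolation rules that can return slightly different Q1 and Q3 values for the same data.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    Needs at least two values. For an odd number of values, the middle
    value is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]
    return (
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
    )

data = [5, 7, 8, 12, 13, 14, 18, 21, 23, 27]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```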

🔹 Importance of IQR
• Much less affected by extreme values (outliers) → More robust than Range.
• Used in Boxplots to show spread and detect outliers.
• Helpful in Data Science for outlier removal: values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as outliers.
🔹 Quick Comparison
Measure | Focus | Sensitive to Outliers?
Range | Max – Min | Yes
IQR | Q3 – Q1 | No

Shortcut to remember:
• Range → full spread.
• IQR → middle 50% spread.

Topic 11: Outliers

🔹 Definition
• An outlier is a data point that is significantly different (too small or too large) compared to the rest of the dataset.
• Outliers can occur due to errors, unusual conditions, or natural variability.
Outliers affect mean, variance, standard deviation, correlation, and ML models.

🔹 Causes of Outliers
1. Data entry errors → Typing ₹10000 instead of ₹1000.
2. Measurement errors → Faulty sensors, misrecorded values.
3. Natural variation → A very tall person in a group.
4. Sampling errors → Mixing two different populations.

🔹 Detection Methods
1. Using IQR (Interquartile Range Method)
a. Formula:
i. Lower Bound = Q1 – 1.5 × IQR
ii. Upper Bound = Q3 + 1.5 × IQR
b. Any data point outside this range is an outlier.
Example:
Data = [10, 12, 13, 15, 18, 95]
c. Q1 = 12, Q3 = 18, IQR = 6
d. Lower Bound = 12 – 1.5(6) = 3
e. Upper Bound = 18 + 1.5(6) = 27
f. Outlier = 95

2. Using Z-Score (Standard Score)


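a. Formula: $Z = \frac{x - \mu}{\sigma}$, where μ is the mean and σ is the standard deviation.
b. If |Z| > 3, the value is usually treated as an outlier.


Example:
If mean = 50, SD = 5, data point = 70 → $Z = \frac{70 - 50}{5} = 4$. Since |Z| > 3, the value is an outlier.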

3. Visualization
a. Boxplot → Shows outliers as points outside whiskers.
b. Scatterplot → Outliers stand apart from clusters.
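
The two numeric detection methods can be sketched as follows. The helper names are hypothetical, and the quartiles use the median-of-halves convention, which is an assumption chosen to match the Q1 and Q3 values in the worked example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (outliers, (lower_bound, upper_bound)) using the IQR rule."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])    # median of the lower half
    q3 = statistics.median(data[-half:])   # median of the upper half
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [x for x in values if x < lower or x > upper]
    return flagged, (lower, upper)

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

data = [10, 12, 13, 15, 18, 95]
outliers, bounds = iqr_outliers(data)
z = z_score(70, 50, 5)

# Z-score of 95 inside its own small dataset (population SD).
z95 = z_score(95, statistics.mean(data), statistics.pstdev(data))

print(outliers, bounds, z, round(z95, 2))
```

One design point worth noting: in the six-value dataset, 95 has a Z-score of only about 2.2 because the outlier itself inflates the mean and standard deviation. The |Z| > 3 rule would miss it, while the IQR rule flags it, so the IQR method tends to be the safer choice for small samples.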

🔹 Handling Outliers
• Check cause first → Is it an error or a real observation?
• If error → Correct or remove.
• If genuine →
o Use robust statistics (Median, IQR instead of Mean & SD).
o Apply log transformation to reduce impact.
o Use modeling techniques resistant to outliers (like tree-based models).

🔹 Example in Data Science


Suppose daily online sales are:
[120, 130, 135, 125, 140, 2000]
• Mean ≈ 441.7 → Misleading due to outlier (2000).
• Median = 132.5 → Better representation.
Outlier handling is crucial before ML modeling, otherwise predictions become biased.

Summary:
• Outlier = extreme value.
• Detection → IQR Method, Z-Score, Visualization.
• Handling → Remove, Correct, or Use robust methods.

What is an Outlier?
• An outlier is a data point that is significantly different from the rest of the data.
• It lies far away from the majority of values.
Example:
Heights of students = [150, 152, 155, 157, 160, 162, 210]
Here 210 cm is an outlier.

🔹 Why Outliers Matter in Data Science?


• They can skew the mean and increase variance.
• Can mislead ML models and cause poor predictions.
• Sometimes they are errors, but sometimes they represent rare but important cases (e.g., fraud detection, network
attacks).

How to Handle Outliers?


1. Remove Outliers
a. If caused by data entry/measurement error.
b. Example: Employee age recorded as 450 years → Remove.
2. Cap or Floor (Winsorizing)
a. Replace extreme values with nearest acceptable value.
b. Example: If income > 99th percentile → replace with 99th percentile value.
3. Transformation
a. Apply log, square root, or Box-Cox transformation to reduce the effect of outliers.
4. Treat Separately
a. Keep outliers as separate cases if they are meaningful.
b. Example: Fraud transactions in banking → should not be removed.
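
Capping and transformation can be illustrated with a minimal sketch. The `winsorize` helper and the limits of 100 and 200 are assumptions chosen for illustration; in practice the limits might come from percentiles or from the IQR bounds.

```python
import math
import statistics

sales = [120, 130, 135, 125, 140, 2000]

def winsorize(values, lower, upper):
    """Cap every value to the closed interval [lower, upper]."""
    return [min(max(x, lower), upper) for x in values]

# Assumed limits, for illustration only.
capped = winsorize(sales, lower=100, upper=200)

# Log transformation compresses large values; it needs strictly positive data.
logged = [math.log(x) for x in sales]

mean_raw = statistics.mean(sales)
mean_capped = statistics.mean(capped)
median_raw = statistics.median(sales)

print(round(mean_raw, 1), round(mean_capped, 1), median_raw)
```

After capping, the mean moves close to the median, which shows how much a single extreme value was pulling the raw mean.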

Topic 12: Distribution

🔹 What is a Distribution?
• A distribution shows how data values are spread out (frequency) across different ranges.
• It tells us which values are common and which are rare.
Example: If you plot marks of 100 students, some ranges (like 40–60) will have more students than others.

🔹 Types of Distributions
1. Normal Distribution (Bell Curve)
• Most common distribution in statistics & ML.
• Symmetrical around the mean.
• Properties:
o Mean = Median = Mode
o 68% data within 1σ, 95% within 2σ, 99.7% within 3σ (Empirical Rule).
Example: Heights, weights, exam marks (if class is large).
🔹 Standard Normal Distribution
• A special case of the Normal Distribution.
• It has:
o Mean (μ) = 0
o Standard Deviation (σ) = 1
• Any normal distribution can be converted into standard normal using the Z-score: $Z = \frac{X - \mu}{\sigma}$

This process is called Standardization.

🔹 Why Standard Normal?


• Makes it easier to compare different datasets.
• Probability tables (Z-tables) are available for standard normal → used in hypothesis testing & confidence intervals.

🔹 Example
Suppose exam scores are normally distributed with:
• Mean (μ) = 70,
• Standard Deviation (σ) = 10.
A student scored 85.
$Z = \frac{85 - 70}{10} = 1.5$
This means the student’s score is 1.5 standard deviations above the mean.
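
The standardization step can be checked with `statistics.NormalDist` from the standard library. This sketch assumes, as the example does, that scores are exactly normal; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, score = 70, 10, 85

# Standardize: how many standard deviations above the mean is the score?
z = (score - mu) / sigma

# Share of students expected to score below 85, computed two equivalent ways.
share_below = NormalDist().cdf(z)                 # standard normal (mean 0, SD 1)
same_share = NormalDist(mu, sigma).cdf(score)     # original scale

print(z, round(share_below, 4))
```

Under this assumption, a Z-score of 1.5 places the student above roughly 93% of the class.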

🔹 Properties of Standard Normal


1. Symmetrical bell curve centered at 0.
2. Area under the curve = 1 (represents probability).
3. Probability interpretation (Empirical Rule still holds):
a. Z between –1 and +1 → 68% data.
b. Z between –2 and +2 → 95% data.
c. Z between –3 and +3 → 99.7% data.

Empirical Rule (68–95–99.7 Rule)


Definition: The Empirical Rule states that for a dataset following a Normal Distribution:
• About 68% of data lies within 1 standard deviation (σ) of the mean (μ).
• About 95% of data lies within 2σ of the mean.
• About 99.7% of data lies within 3σ of the mean.
👉 This is why it’s also called the 68–95–99.7 Rule.
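
The three percentages can be verified from the standard normal cumulative distribution function. This is a small sketch; `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} sigma: {share:.4f}")
```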

🔹 Data Science Use Cases


• Feature scaling in ML (standardization).
• Anomaly detection (check Z-scores).
• Statistical inference (Z-tests, confidence intervals).
Shortcut:
• Normal distribution → general bell curve.
• Standard Normal → bell curve with μ=0, σ=1.
• Z-score tells how many σ away from mean a value lies.

2. Uniform Distribution
• All values equally likely (flat shape).
• Example: Rolling a fair die (1–6 all have probability 1/6).

3. Skewed Distribution
• Not symmetrical (tail longer on one side).
o Right (Positive) Skew → Tail on right. (Example: Income, sales revenue)
o Left (Negative) Skew → Tail on left. (Example: Retirement age, exam where most score high).

Topic 14: Kurtosis

🔹 Definition
• Kurtosis measures the “tailedness” (heaviness of tails) of a distribution compared to a normal distribution.
• In simple words: It tells us how much data is in the tails vs center.

🔹 Formula
$\text{Kurtosis} = \frac{E[(X - \mu)^4]}{\sigma^4}$
• Uses the 4th power of deviations → focuses on extreme values (outliers).


• Often we use Excess Kurtosis = Kurtosis – 3
o Because Normal Distribution has Kurtosis = 3.
🔹 Types of Kurtosis
1. Mesokurtic (Normal, K=3)
a. Standard bell curve (Normal distribution).
b. Balanced tails.
2. Leptokurtic (K > 3)
a. Taller peak, fatter tails (more outliers).
b. Example: Stock market returns.
3. Platykurtic (K < 3)
a. Flatter peak, thinner tails (fewer outliers).
b. Example: Uniform distribution.

🔹 Example
• Dataset A: Normal exam scores → Mesokurtic.
• Dataset B: Daily stock returns (many outliers) → Leptokurtic.
• Dataset C: Dice roll outcomes (even spread) → Platykurtic.
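
Kurtosis can be computed from the second and fourth central moments. The sketch below uses the population (non-excess) definition, so a normal distribution would score 3; the `heavy_tailed` list is a made-up dataset for contrast.

```python
def kurtosis(values):
    """Population kurtosis: fourth central moment divided by squared variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # variance
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    return m4 / m2 ** 2

# Each face of a fair die once: an even spread, so we expect K < 3.
die = [1, 2, 3, 4, 5, 6]
k_die = kurtosis(die)

# Hypothetical data: mostly zeros with two extreme values, so we expect K > 3.
heavy_tailed = [0] * 18 + [-10, 10]
k_heavy = kurtosis(heavy_tailed)

print(round(k_die, 2), round(k_heavy, 2))
```

Subtracting 3 from either result gives the excess kurtosis.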

🔹 Quick Summary Table


Type | Kurtosis Value | Shape | Example
Mesokurtic | K=3 | Normal | Heights, marks
Leptokurtic | K>3 | Peaked, fat tails | Stock returns
Platykurtic | K<3 | Flat, thin tails | Dice outcomes

Shortcut to remember:
• Lepto = Lots of outliers
• Platy = Plain/Flat
• Meso = Medium/Normal

Topic 15: Central Limit Theorem (CLT)

🔹 Definition
The Central Limit Theorem (CLT) states:
When we take repeated random samples of a population and calculate their means, the distribution of those sample
means will approach a Normal Distribution, regardless of the shape of the original population, provided the population has
a finite variance and the sample size is large enough (a common rule of thumb is n ≥ 30).

🔹 Key Points
1. Works even if population is not normal.
2. As a rule of thumb, n ≥ 30 is usually enough for CLT to hold; strongly skewed populations may need larger samples.
3. Mean of sample means ≈ Population mean (μ).
4. Standard deviation of sample means = Standard Error (SE): $SE = \frac{\sigma}{\sqrt{n}}$

🔹 Example
Population: Marks of students in a college (not normally distributed, skewed).
• If we take one sample of 30 students, its mean may not look like the population.
• But if we take many samples of 30 students and plot their means → The shape will look normal (bell curve).
• That’s why CLT allows us to use normal probability methods (Z-scores, confidence intervals, hypothesis tests)
even if data isn’t normal.
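
A quick simulation can make this concrete. The sketch below assumes a skewed population modeled by an exponential distribution with mean 1 and standard deviation 1; that choice, the seed, and the number of samples are illustrative assumptions rather than the marks data described above.

```python
import math
import random
import statistics

rng = random.Random(0)      # fixed seed for reproducibility
n = 30                      # size of each sample
num_samples = 2000          # number of repeated samples

# Skewed population: exponential with mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
expected_se = 1.0 / math.sqrt(n)    # sigma / sqrt(n)

# For a normal shape, about 68% of sample means fall within one SE of the mean.
share_within_1se = sum(
    abs(m - mean_of_means) <= sd_of_means for m in sample_means
) / num_samples

print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_se, 3))
print(round(share_within_1se, 3))
```

The mean of the sample means should land near the population mean, and their spread should land near σ/√n, even though the population itself is far from normal.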

🔹 Why Important in Data Science?


• Basis for hypothesis testing & confidence intervals.
• Used in A/B Testing (the average metric of each user group is approximately normal when groups are large).
• Explains why normal approximations work well for averages and other aggregated quantities.
• Lets us draw conclusions from samples instead of the full population.

🔹 Visual Intuition
• Population distribution → may be skewed or uniform.
• Distribution of sample means → tends to Normal (bell curve).

🔹 Shortcut to Remember
• Large samples → sample means ~ Normal.
• Mean of sample means → Population mean.
• Spread reduces with sample size (σ/√n).

Topic 16: Covariance & Correlation


🔹 1. Covariance
Definition
• Covariance measures the direction of relationship between two variables.
• It tells us whether two variables move together (positive) or opposite (negative).
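Formula
$\mathrm{Cov}(X, Y) = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)$ (for a sample, divide by n − 1 instead of N)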

Interpretation
• Positive → When X increases, Y tends to increase.
• Negative → When X increases, Y tends to decrease.
• Zero → No linear relationship.
Example:
• Height & Weight → Positive covariance.
• Price & Demand → Negative covariance.
Problem: Covariance has no fixed range, so hard to compare.

🔹 2. Correlation
Definition
• Correlation measures the strength & direction of relationship between two variables.
• It is the standardized version of covariance.
Formula (Pearson’s Correlation Coefficient)
$r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$
Range: $-1 \le r \le +1$
• +1 → Perfect positive linear relation.
• –1 → Perfect negative linear relation.
• 0 → No linear relation.
Example
• Height & Weight → r ≈ +0.8 (strong positive).
• Price & Demand → r ≈ –0.7 (strong negative).
• Shoe size & Intelligence → r ≈ 0 (no relation).
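
The relationship between covariance and correlation can be sketched in a few lines. The height and weight values are hypothetical, and the functions use the population formulas (dividing by N).

```python
import math

def covariance(xs, ys):
    """Population covariance of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson's r: covariance divided by the product of standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))   # Cov(X, X) is the variance of X
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

# Hypothetical data: heights in cm and weights in kg.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 58, 66, 71, 80]

cov_cm = covariance(heights_cm, weights_kg)
r_cm = correlation(heights_cm, weights_kg)

# Changing units rescales the covariance but leaves the correlation unchanged.
heights_m = [h / 100 for h in heights_cm]
cov_m = covariance(heights_m, weights_kg)
r_m = correlation(heights_m, weights_kg)

print(cov_cm, round(cov_m, 2), round(r_cm, 3), round(r_m, 3))
```

This is why covariance values are hard to compare across datasets while correlation values are not.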

🔹 Difference Between Covariance & Correlation


Aspect | Covariance | Correlation
Meaning | Direction of relationship | Direction + Strength of relationship
Scale | Unbounded (depends on units) | Always between –1 and +1
Comparability | Hard to compare | Easy to compare
Example | 200 or –150 etc. | +0.8, –0.6 etc.

🔹 Why Important in Data Science?


• Covariance → Basis for PCA (Principal Component Analysis).
• Correlation → Feature selection in ML (remove highly correlated variables to avoid multicollinearity).
• Both used in finance to check stock movement relationships.

Shortcut to remember:
• Covariance = Just “direction”.
• Correlation = “direction + strength” (scaled version).

Topic 17: Euclidean Distance & Manhattan Distance

🔹 1. Euclidean Distance
Definition
• It is the straight-line distance between two points in space (like a ruler).
• Based on Pythagoras theorem.
Formula (for 2D points)
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$ for points $(x_1, y_1)$ and $(x_2, y_2)$

Used in: k-NN, clustering, image recognition.

🔹 2. Manhattan Distance
Definition
• Also called Taxicab distance or L1 distance.
• It is the sum of absolute differences of coordinates.
• Like moving in a grid/city blocks, not diagonally.
Formula (for n-dimensions)
$d = \sum_{i=1}^{n} |x_i - y_i|$

Used in: text mining, recommender systems, when data is sparse.

🔹 Visual Intuition
• Euclidean → "as the crow flies" (shortest path).
• Manhattan → "like a taxi in city blocks" (right-angled moves).
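
Both distances are easy to compute for points of any dimension. In this sketch the function names and the two sample points are illustrative.

```python
import math

def euclidean(p, q):
    """Straight-line (L2) distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

a, b = (1, 2), (4, 6)
d_euclid = euclidean(a, b)     # sqrt(3^2 + 4^2)
d_manhattan = manhattan(a, b)  # |3| + |4|

print(d_euclid, d_manhattan)
```

The Manhattan distance is never smaller than the Euclidean distance between the same two points, because the straight line is the shortest path.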

🔹 Comparison
Feature | Euclidean (L2) | Manhattan (L1)
Path | Straight line | Grid blocks
Formula | √Σ(x–y)² | Σ|x–y|
Best for | Dense data, geometric distance | High-dimensional, sparse data
Example | GPS distance | City distance

🔹 Importance in Data Science


• Distance metrics affect clustering & nearest neighbor algorithms.
• Euclidean works well when features are continuous & normalized.
• Manhattan works better with sparse or high-dimensional data (e.g., NLP).

Shortcut to remember:
• Euclidean = Straight line.
• Manhattan = City blocks.

Topic 18 : Qualitative vs Quantitative

🔹 1. Definition
• Qualitative Data / Analysis
o Non-numerical information.
o Describes qualities, categories, characteristics, or meaning.
o Example: Colors, Gender, Feedback (“good,” “bad,” “average”).
• Quantitative Data / Analysis
o Numerical information.
o Measures quantities, amounts, or frequencies.
o Example: Age, Salary, Marks, Height, Ratings (out of 10).

🔹 2. Characteristics
Feature | Qualitative | Quantitative
Data type | Words, symbols, categories | Numbers, measurable values
Nature | Descriptive, subjective | Objective, measurable
Goal | Understand meaning/patterns | Test hypotheses, find relationships
Scale of measurement | Nominal, Ordinal | Interval, Ratio
Sample size | Small (in-depth) | Large (generalizable)
Collection method | Interviews, focus groups, surveys | Surveys, experiments, sensors
Analysis method | Content analysis, coding, NLP | Statistical tests, graphs, models
Example | “Student is happy” | “Student scored 85 marks”

🔹 3. Examples
• Qualitative Examples:
o Blood group: A, B, O, AB
o Customer feedback: “Excellent service”
o Car brand: Toyota, BMW, Tesla
• Quantitative Examples:
o Salary = ₹45,000
o Weight = 70 kg
o Number of students = 60

🔹 4. In Data Science
• Qualitative Data often transformed into quantitative for analysis:
o Example: Sentiment (Positive = 1, Neutral = 0, Negative = -1).
• Quantitative Data used for:
o Building ML models, statistical analysis, forecasting.
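
Turning qualitative data into numbers might look like the sketch below. The sentiment codes come from the example above; the `one_hot` helper and the sample lists are hypothetical. One-hot encoding is shown as a common option for nominal categories, where numeric codes would wrongly imply an order.

```python
# Ordinal-style mapping from the example above.
sentiment_codes = {"Positive": 1, "Neutral": 0, "Negative": -1}
feedback = ["Positive", "Negative", "Neutral", "Positive"]
encoded = [sentiment_codes[label] for label in feedback]

def one_hot(labels):
    """Return (categories, rows), with one 0/1 column per category."""
    categories = sorted(set(labels))
    rows = [[int(label == c) for c in categories] for label in labels]
    return categories, rows

# Nominal data: car brands have no natural order.
brands = ["Toyota", "BMW", "Tesla", "BMW"]
categories, rows = one_hot(brands)

print(encoded)
print(categories)
print(rows)
```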
Shortcut to remember:
• Qualitative = Quality (Words, Categories).
• Quantitative = Quantity (Numbers, Measurement).

Estimating Population Parameters

🔹 What are Population Parameters?


• A parameter is a numerical value that describes a characteristic of the population (e.g., population mean μ,
population proportion p, population standard deviation σ).
• Since collecting data from the entire population is often impossible, we estimate parameters using sample
statistics.

🔹 Methods of Estimating Population Parameters


1. Point Estimation
• Uses a single value from the sample (called a statistic) to estimate the population parameter.
• Example:
o Sample mean (x̄) estimates population mean (μ).
o Sample proportion (p̂) estimates population proportion (p).
Example:
If 100 students are sampled and their average height is 165 cm, then x̄ = 165 cm is a point estimate of
the population mean μ.

2. Interval Estimation (Confidence Intervals)


• Instead of giving a single value, it provides a range of values that is likely to contain the population parameter.
• Formula for a confidence interval of the mean (when σ is known): $\bar{x} \pm Z \cdot \frac{\sigma}{\sqrt{n}}$
• Where:
o x̄ = sample mean
o σ = population standard deviation
o n = sample size
o Z = Z-score corresponding to confidence level (e.g., 1.96 for 95% CI)
Example:
• Sample mean height = 165 cm
• σ = 10 cm, n = 100
• 95% CI = 165 ± 1.96 × (10/√100) = 165 ± 1.96 = (163.04, 166.96)

We are 95% confident that the population mean lies between 163.04 and 166.96 cm.
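
The interval calculation can be sketched in Python. The `mean_ci` helper is hypothetical; it assumes σ is known and looks up the Z value with `statistics.NormalDist` instead of a printed Z-table.

```python
import math
from statistics import NormalDist

def mean_ci(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a population mean when sigma is known."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = mean_ci(sample_mean=165, sigma=10, n=100)
print(round(low, 2), round(high, 2))
```

Raising the confidence level widens the interval, while a larger sample size narrows it.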

3. Standard Error (SE)


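• Measures the precision of the sample statistic as an estimate of the population parameter; a smaller SE means a more reliable estimate.
• Formula (for the sample mean): $SE = \frac{\sigma}{\sqrt{n}}$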
🔹 Real-Life Example
A company wants to know the average daily sales across all its stores (population).
• It samples 50 stores and finds average sales = ₹45,000 (sample mean).
• Using point estimation → population mean ≈ ₹45,000.
• Using interval estimation → 95% CI (₹42,000, ₹48,000).
This means the company is 95% confident that the true population mean daily sales lie between ₹42k and ₹48k.

In short:
• Point Estimation → one best guess (sample statistic).
• Interval Estimation → a range with a confidence level.
• SE → tells how reliable the estimate is.

📘 Assignment – Estimating Population Parameters (Beginner Level)


Q1. Estimating Mean (Large Sample)
A sample of 64 light bulbs has an average lifetime = 1,200 hours with a standard deviation = 100 hours.
Find the 95% confidence interval for the population mean lifetime of light bulbs.
Use Z = 1.96 for 95%.

Q2. Estimating Proportion


Out of 500 customers surveyed, 350 said they are satisfied with the service.
Find the 95% confidence interval for the true proportion of satisfied customers.

Q3. Application Question (Company Claim)


A company claims that the average daily production = 5,000 units.
A sample of 100 days shows an average = 4,900 units with σ = 200.
Construct a 99% confidence interval and check if the company’s claim of 5,000 units is reasonable.
Hint: Use Z = 2.58 for 99%.

Hypothesis Testing
1. What is it?
Hypothesis Testing is a statistical method to make decisions about a population parameter based on sample data.
It helps us check whether the sample data provide enough evidence to reject a claim (hypothesis) about the population.

Hypothesis testing is a foundational concept in statistics, helping researchers and analysts make
data-driven decisions.

It is about assessing evidence: Do the sample data provide enough evidence to support a claim
about the population?
● Examples of Real-World Applications:
1) Medicine: Testing if a new drug is more effective than an existing one.
2) Business: Determining if a new marketing strategy increases sales.
3) Education: Evaluating if a new teaching method improves student performance.
4) Manufacturing: Ensuring a product meets quality standards.

Steps of Hypothesis Testing

1. State the Hypotheses


2. Set the Significance Level (α)
3. Choose the Appropriate Test
4. Calculate the Test Statistic
5. Find the P-value
6. Make a Decision
7. Draw a Conclusion

State the Hypotheses


1) Null Hypothesis (H0):
Defined as the hypothesis that there is no effect or no difference.
2) Alternative Hypothesis (H1):
Represents a new effect, difference, or relationship that you want to test.
● Example:
H0: The average test score is 75.
H1: The average test score is not 75.

Set the Significance Level (α)


1. The level of significance is the boundary between the confidence region and the critical region under the
sampling distribution curve (a normal curve for a Z-test).
2. The confidence region is the area under the curve in which the null hypothesis is accepted (more precisely, not
rejected). It is also called the acceptance region.
3. The region in which the null hypothesis is rejected is called the critical region or rejection region.
4. With α = 0.05, 95% of the area under the normal curve is the confidence region and the remaining 5% is the
critical region.
- α = 0.05 (5%): This means there is a 5% chance of rejecting the null hypothesis when it is actually true.
- α = 0.01 (1%): This means there is a 1% chance of incorrectly rejecting the null hypothesis.
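To make the boundary between the two regions concrete, the following sketch computes the critical Z values for a given α. It assumes a standard normal test statistic; `critical_z` is a hypothetical helper name.

```python
from statistics import NormalDist

def critical_z(alpha, two_tailed=True):
    """Z value that separates the acceptance region from the rejection region."""
    # A two-tailed test splits alpha equally between both tails.
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

for alpha in (0.05, 0.01):
    print(f"alpha={alpha}: two-tailed z = {critical_z(alpha):.3f}, "
          f"one-tailed z = {critical_z(alpha, two_tailed=False):.3f}")
```

A smaller α pushes the critical value further from zero, so stronger evidence is needed before the null hypothesis is rejected.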

The Core Idea in Simple Terms

Imagine you have a claim. For example, a pizza place says, "Our delivery time is 30 minutes or less!" You're skeptical. So, you
decide to test this.
How? You don't check every single delivery (that's impossible). Instead, you take a sample—say, 100 recent deliveries—and
calculate the average delivery time for that sample. Let's say your sample average is 35 minutes.
Now, the big question: Is this enough evidence to prove the pizza place wrong?
This is what hypothesis testing does. It answers: "Is the effect I see in my sample data strong enough to be considered
real, or could it just be a fluke caused by random chance?"
In our case: "Is the 35-minute average strong evidence that the true average is over 30 minutes, or did I just randomly get 100
slow deliveries?"

The Step-by-Step Process

Hypothesis testing follows a strict, logical procedure. Here are the key steps:
Step 1: Formulate the Hypotheses

You state two opposing hypotheses.


• Null Hypothesis (H₀): The "status quo" or "no effect" hypothesis. It's what you're testing against. It always contains
an equality (=, ≤, ≥).
o In our example: H₀: Mean delivery time ≤ 30 minutes. (The pizza place's claim is true.)
• Alternative Hypothesis (H₁ or Ha): The hypothesis that contradicts the null. This is what you hope to find evidence
for. It contains an inequality (>, <, ≠).
o In our example: H₁: Mean delivery time > 30 minutes. (The pizza place's claim is false; they are slower.)
You assume H₀ is true until the data provides strong evidence against it.
Step 2: Choose the Significance Level (α)

This is the threshold for what you consider "strong enough" evidence. It's the probability of rejecting the null hypothesis when
it is actually true (a false alarm, also called a Type I Error).
The most common choice is α = 0.05 (5%). This means you are willing to accept a 5% risk of concluding the pizza place is slow
when it actually isn't.
Other common levels are α = 0.01 (1%) for stricter tests (e.g., in medicine) and α = 0.10 (10%) for more relaxed ones.
Step 3: Collect Data and Calculate a Test Statistic

You collect your sample data. Then, you plug your data into a formula that produces a Test Statistic. This number (often a t-
score or z-score) measures how far your sample result is from the null hypothesis value, in terms of standard errors.
• In our example: You calculate a t-statistic based on:
o Sample mean (35 min)
o Hypothesized population mean (30 min)
o Sample standard deviation and size (100)
A test statistic of +2.5 would mean your sample average is 2.5 standard errors above 30.
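A small sketch of this calculation follows. The text does not give the sample standard deviation, so `s = 20` is an assumed value, chosen so that the result matches the +2.5 used in the text. `compute_test_statistic` is a hypothetical helper name.

```python
from math import sqrt

def compute_test_statistic(sample_mean, mu0, s, n):
    """Distance of the sample mean from mu0, measured in standard errors."""
    standard_error = s / sqrt(n)
    return (sample_mean - mu0) / standard_error

# s = 20 is an assumption; the text gives only the mean (35), mu0 (30) and n (100).
t = compute_test_statistic(sample_mean=35, mu0=30, s=20, n=100)
print(t)
```

With a larger standard deviation the same 5-minute gap would give a smaller statistic, that is, weaker evidence against H₀.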
Step 4: Calculate the P-value

This is the most crucial concept.


The P-value is the probability of obtaining your observed results (or more extreme results) if the null hypothesis is true.
• In our example: The p-value answers the question: "If the true delivery time really is 30 minutes, what is the
probability of randomly getting a sample of 100 deliveries with an average time of 35 minutes or more?"
• A low p-value (typically ≤ α): Means your observed data is very unlikely under the null hypothesis. This is evidence
against the null hypothesis.
• A high p-value (> α): Means your observed data is fairly likely under the null hypothesis. You don't have strong
evidence to reject it.
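As a rough sketch, an upper-tail p-value can be computed from a test statistic using the standard normal distribution. Using the normal in place of the t-distribution is an assumption; it is reasonable here only because the sample is large (n = 100). The value 2.5 is the illustrative test statistic from Step 3.

```python
from statistics import NormalDist

def upper_tail_p_value(z):
    """P(Z >= z) under the standard normal distribution."""
    return 1 - NormalDist().cdf(z)

p = upper_tail_p_value(2.5)
print(f"p-value = {p:.4f}")
```

The result is well below 0.05, so a sample this extreme would be unusual if the true mean really were 30 minutes.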
Step 5: Make a Decision

Compare the p-value to your significance level (α).


• If p-value ≤ α: Reject the Null Hypothesis (H₀). You have statistically significant evidence to support the alternative
hypothesis (H₁).
• If p-value > α: Fail to Reject the Null Hypothesis (H₀). You do not have strong enough evidence to support the
alternative hypothesis.
Crucial Note: "Fail to reject" is not the same as "accepting" the null. It means we didn't find proof against it. The case is
inconclusive.
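The decision rule is a single comparison, sketched below. The function name `decide` and the example p-values are hypothetical.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when p_value <= alpha."""
    if p_value <= alpha:
        return "Reject H0"
    return "Fail to reject H0"

print(decide(0.01))   # p-value below alpha
print(decide(0.20))   # p-value above alpha
```

The function never returns "Accept H0", which mirrors the note above: a high p-value leaves the question open.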

Let's Apply it to the Pizza Example

1. H₀: μ ≤ 30 min
H₁: μ > 30 min
2. α: We choose 0.05.
3. Data: From our sample of 100 orders, we calculate a test statistic (let's say, t = 2.5).
4. P-value: We find the p-value associated with t = 2.5 is approximately 0.01.
5. Decision: Is 0.01 ≤ 0.05? Yes! Therefore, we Reject the Null Hypothesis.
Conclusion: There is statistically significant evidence at the 5% level to conclude that the mean delivery time is greater than
30 minutes.

The Courtroom Analogy

This is a fantastic way to remember the process:


• Null Hypothesis (H₀): The defendant is innocent. (The default assumption.)
• Alternative Hypothesis (H₁): The defendant is guilty.
• Significance Level (α): The standard of proof "beyond a reasonable doubt."
• Data Collection: The trial evidence and testimony.
• P-value: The probability of seeing the evidence presented if the defendant was truly innocent.
• Decision:
o Low p-value: It's very unlikely this evidence would appear if they were innocent. Verdict: Guilty (Reject H₀).
o High p-value: The evidence isn't strong enough to rule out innocence. Verdict: Not Guilty (Fail to Reject H₀).
Notice the verdict is not "Innocent."
Potential for Errors (Mistakes)

No test is perfect. There are two types of errors:

| | Reject H₀ (We say "Guilty") | Fail to Reject H₀ (We say "Not Guilty") |
| H₀ is TRUE (They are Innocent) | Type I Error (False Positive): jailing an innocent person. Probability = α (our significance level). | Correct Decision |
| H₀ is FALSE (They are Guilty) | Correct Decision | Type II Error (False Negative): letting a guilty person go free. Probability = β. |
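The claim that the Type I error probability equals α can be checked with a small simulation. This is a sketch under stated assumptions: the data are drawn from a normal population where H₀ is true by construction, σ is treated as known, and the sample size, trial count and seed are arbitrary illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def type_i_error_rate(alpha=0.05, n=30, trials=4000, seed=0):
    """Fraction of two-tailed z-tests that reject H0 when H0 is true."""
    rng = random.Random(seed)
    mu0, sigma = 0.0, 1.0                      # H0 is true by construction
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (mean(sample) - mu0) / (sigma / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1                    # a Type I error
    return rejections / trials

rate = type_i_error_rate()
print(f"Observed Type I error rate: {rate:.3f}")
```

The observed rate should land close to 0.05, with some run-to-run variation because it is itself an estimate from a finite number of trials.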

In summary, hypothesis testing is a formal, statistical framework for making decisions about a population based on
sample data, while quantifying the risk of being wrong.

Example 1: One-Sample t-test

Scenario: A coffee shop claims their lattes contain an average of 8 oz of milk. You're skeptical and think they might be putting
in less. You randomly sample 16 lattes and measure their milk content.
Data: Sample mean = 7.6 oz, Sample standard deviation = 0.5 oz, Sample size = 16
Step 1: State Hypotheses

• H₀: μ = 8 oz (The claim is true)


• H₁: μ < 8 oz (They're putting in less milk)
Step 2: Set Significance Level

• α = 0.05 (5% significance level)


Step 3: Calculate Test Statistic

We use the t-statistic formula:
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
Where:
• x̄ = 7.6 (sample mean)
• μ = 8 (hypothesized population mean)
• s = 0.5 (sample standard deviation)
• n = 16 (sample size)
$$t = \frac{7.6 - 8}{0.5/\sqrt{16}} = \frac{-0.4}{0.5/4} = \frac{-0.4}{0.125} = -3.2$$
Test statistic: t = -3.2
Step 4: Find P-value

Degrees of freedom = n - 1 = 15
For a one-tailed test with t = -3.2 and df = 15:
• P-value ≈ 0.003
Step 5: Make Decision

• P-value (0.003) < α (0.05)


• Reject H₀
Conclusion: There is statistically significant evidence at the 0.05 level to conclude that the average milk content in lattes is
less than 8 oz.
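The latte example can be checked with the standard library alone. The sketch below integrates the Student's t density numerically with Simpson's rule instead of using a statistics package; the helper names `t_pdf` and `t_lower_tail` are hypothetical, and the step count is an arbitrary choice.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def t_lower_tail(t, df, steps=10_000):
    """P(T <= t) for t <= 0, via Simpson's rule on [0, |t|] and symmetry."""
    b = abs(t)
    h = b / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_b = total * h / 3
    return 0.5 - area_0_to_b           # the density is symmetric about 0

x_bar, mu0, s, n = 7.6, 8.0, 0.5, 16
t_stat = (x_bar - mu0) / (s / sqrt(n))
p_value = t_lower_tail(t_stat, df=n - 1)
print(f"t = {t_stat:.2f}, one-tailed p = {p_value:.4f}")
```

In practice a statistics library would usually do this step; the manual integration is only meant to show where the p-value comes from.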

Example 2: One-Sample Z-test

Scenario: A cereal company claims their boxes contain an average of 500 grams. You know from long-term quality control
that the population standard deviation is 15 grams. You sample 50 boxes and find an average weight of 495 grams.
Data: Sample mean = 495g, Population standard deviation = 15g, Sample size = 50
Step 1: State Hypotheses

• H₀: μ = 500g (The claim is true)


• H₁: μ ≠ 500g (Two-tailed test - could be over or underfilling)
Step 2: Set Significance Level

• α = 0.05
Step 3: Calculate Test Statistic

We use the Z-statistic formula:


$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
Where:
• x̄ = 495
• μ = 500
• σ = 15
• n = 50
$$Z = \frac{495 - 500}{15/\sqrt{50}} = \frac{-5}{15/7.071} = \frac{-5}{2.121} = -2.357$$
Test statistic: Z = -2.357
Step 4: Find P-value

For a two-tailed test with Z = -2.357:


• P-value = 2 × P(Z < -2.357) ≈ 2 × 0.0092 = 0.0184
Step 5: Make Decision

• P-value (0.0184) < α (0.05)


• Reject H₀
Conclusion: There is statistically significant evidence at the 0.05 level to conclude that the average cereal box weight differs
from 500 grams.
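The cereal example can be reproduced in a few lines. This sketch uses the numbers from the scenario and takes the p-value from the standard normal distribution; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu0, sigma, n = 495, 500, 15, 50

z = (x_bar - mu0) / (sigma / sqrt(n))
# Two-tailed p-value: probability of a |Z| at least this large under H0.
p_value = 2 * NormalDist().cdf(-abs(z))

print(f"Z = {z:.3f}, two-tailed p = {p_value:.4f}")
```

Doubling the one-tail area is what makes the test two-tailed: a sample mean of 505 g would give the same p-value as 495 g.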

Key Differences Between t-test and Z-test

| Feature | t-test | Z-test |
| When to use | Population standard deviation unknown | Population standard deviation known |
| Standard Deviation | Uses sample standard deviation (s) | Uses population standard deviation (σ) |
| Distribution | t-distribution (heavier tails) | Normal distribution |
| Sample Size | Works well with small samples (<30) | Typically used with larger samples |
| Test Statistic | $t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$ | $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ |
