Chapter 7: Basic Statistics
1. Statistics – Meaning (Detailed Explanation)
Statistics is a branch of mathematics that deals with data, but it is not only about numbers. It
is about understanding what numbers are telling us.
Statistics involves four main steps:
1. Collecting data – gathering information
2. Organizing data – arranging data in tables or charts
3. Analyzing data – finding averages, spread, relationships
4. Interpreting data – drawing conclusions and making decisions
In simple words:
Statistics helps us convert raw data into useful information.
Real-life example:
A teacher collects marks of students (data), calculates average and pass percentage (analysis),
and decides whether students understood the subject (interpretation).
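The teacher example can be sketched in a few lines of Python. This is only an illustration: the marks and the pass mark of 40 are made-up assumptions, not values from these notes.

```python
from statistics import mean

# Hypothetical marks for a small class (illustrative data only)
marks = [35, 48, 62, 71, 90]
pass_mark = 40  # assumed pass threshold

# Analysis: average mark and pass percentage
average = mean(marks)
passed = sum(1 for m in marks if m >= pass_mark)
pass_percentage = 100 * passed / len(marks)

print(f"Average mark: {average}")
print(f"Pass percentage: {pass_percentage}%")
```

Interpretation is still the teacher's job: the numbers only summarize the marks.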
2. Importance of Statistics in Data Science (Detailed Explanation)
In data science, we work with huge amounts of data. Statistics helps us manage and
understand this data.
Statistics is important because it:
Reduces large data into simple numbers (mean, percentage)
Helps compare groups (Class A vs Class B)
Identifies patterns and trends
Helps in prediction and decision-making
Example:
Netflix uses statistics to recommend movies based on user behavior.
Statistics is one of the foundations of data science.
3. Types of Data Collection (Detailed Explanation)
Data collection is the first step in statistics. Data can be collected in two major ways
depending on how much control we have.
(a) Observational Data
In observational data:
We only observe what is happening
We do not interfere or control
Examples:
Conducting surveys
Census data
Observing customer purchases
👉 Used when experiments are not possible.
(b) Experimental Data
In experimental data:
We conduct experiments
We control variables
Examples:
Giving different medicines to two groups
Testing two different teaching methods
Because the researcher controls the conditions, experiments give stronger evidence of cause and effect.
4. Population and Sample (Detailed Explanation)
In statistics, studying the entire population is often difficult, so we usually study a sample instead.
Population
Complete group under study
Often very large in size
Example:
All voters in a country
Sample
Small part selected from population
Used to represent population
Example:
1,000 voters selected for a survey
A representative sample gives reliable estimates about the population.
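A small simulation can show how a sample stands in for a population. This is a sketch under stated assumptions: the population of voter ages is randomly generated (hypothetical data), and the sample of 1,000 is drawn with Python's random module.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: ages of 100,000 voters (made-up data)
population = [random.randint(18, 90) for _ in range(100_000)]

# Sample: 1,000 voters chosen at random, without replacement
sample = random.sample(population, 1000)

population_mean = mean(population)
sample_mean = mean(sample)

print(f"Population mean age: {population_mean:.2f}")
print(f"Sample mean age:     {sample_mean:.2f}")
```

The two means should come out close to each other, even though the sample is only 1% of the population.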
5. Sampling Methods (Detailed Explanation)
Sampling is the method of selecting individuals from a population.
Random Sampling
Every individual has an equal chance of being selected
Avoids selection bias
Example:
Lottery method
Unequal Probability Sampling
Some individuals have a higher chance of being selected
Used when groups are unequal
Example:
Selecting more people from cities than villages
Proper sampling gives reliable results.
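Both sampling methods can be sketched with Python's random module. The list of people and the 3-to-1 weights are assumptions made for illustration. Note that random.choices draws with replacement, so the same person can appear more than once in the weighted sample.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population of 10 people:
# the first five live in cities, the last five in villages
people = [f"person_{i}" for i in range(10)]

# Random sampling: every person is equally likely, no repeats
simple_sample = random.sample(people, 3)

# Unequal probability sampling: city residents are assumed
# to be three times as likely to be picked as village residents
weights = [3] * 5 + [1] * 5
weighted_sample = random.choices(people, weights=weights, k=3)

print("Random sample:  ", simple_sample)
print("Weighted sample:", weighted_sample)
```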
6. Measures of Central Tendency (Detailed Explanation)
Measures of central tendency help us find a single value that represents the whole data.
They help answer:
👉 What is the typical value?
Three measures are used:
Mean
Median
Mode
7. Mean (Average) – Detailed Explanation
Mean is the most commonly used average.
Formula:
Mean = Sum of all values / Number of values
Example:
Marks = 60, 70, 80
Mean = (60+70+80)/3 = 70
Advantage:
Easy to calculate
Disadvantage:
Affected by extreme values
Example:
Salaries = 10k, 15k, 20k, 100k → Mean = 36.25k, which is higher than three of the four salaries, so the mean is misleading
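The two examples above can be checked with Python's statistics module. Salaries are written in thousands, reading the last salary as 100k.

```python
from statistics import mean

marks = [60, 70, 80]
salaries_k = [10, 15, 20, 100]  # salaries in thousands

mean_marks = mean(marks)
mean_salary = mean(salaries_k)
mean_without_extreme = mean(salaries_k[:3])  # drop the 100k salary

print(mean_marks)            # 70
print(mean_salary)           # 36.25
print(mean_without_extreme)  # 15
```

One extreme salary more than doubles the mean, from 15k to 36.25k.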
8. Median – Detailed Explanation
Median is the middle value when data is arranged in order.
Steps:
1. Arrange data in ascending order
2. Find the middle value (if the number of values is even, take the average of the two middle values)
Example:
Marks = 50, 60, 90
Median = 60
Advantage:
Much less affected by extreme values than the mean
Well suited to skewed data such as income or salary
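A short check with Python's statistics.median, which sorts the data itself. The even-count list and the salary list (in thousands) are illustrative assumptions.

```python
from statistics import median

odd_marks = [90, 50, 60]        # unsorted on purpose
even_marks = [50, 60, 70, 90]   # even number of values
salaries_k = [10, 15, 20, 100]  # salaries in thousands, one extreme value

median_odd = median(odd_marks)
median_even = median(even_marks)
median_salary = median(salaries_k)

print(median_odd)     # 60
print(median_even)    # 65.0 (average of 60 and 70)
print(median_salary)  # 17.5 (average of 15 and 20)
```

The salary median of 17.5k stays close to the typical salaries, unlike a mean pulled up by the 100k value.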
9. Mode – Detailed Explanation
Mode is the value that occurs most frequently.
Example:
Marks = 60, 70, 70, 80
Mode = 70
Useful when data is categorical
Example:
Most preferred mobile brand
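The mode can be found with Python's statistics module. The brand names below are hypothetical. The last list is an added illustration of a dataset with two modes, which statistics.multimode reports as a list.

```python
from statistics import mode, multimode

marks = [60, 70, 70, 80]
brands = ["BrandA", "BrandB", "BrandA", "BrandC", "BrandA"]  # hypothetical survey answers
two_modes = [1, 1, 2, 2, 3]

mode_marks = mode(marks)
mode_brand = mode(brands)
all_modes = multimode(two_modes)

print(mode_marks)  # 70
print(mode_brand)  # BrandA
print(all_modes)   # [1, 2]
```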
10. Measures of Variation (Detailed Explanation)
Measures of variation tell us how data values differ from each other.
They help answer:
Are values close together or spread out?
Main measures:
Range (largest value minus smallest value)
Variance
Standard Deviation
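The range is the simplest of these: largest value minus smallest value. A minimal sketch with two hypothetical classes that have the same mean but very different spread:

```python
from statistics import mean

# Hypothetical marks for two classes (illustrative data only)
class_a = [68, 70, 72]
class_b = [40, 70, 100]

range_a = max(class_a) - min(class_a)
range_b = max(class_b) - min(class_b)

print(mean(class_a), range_a)  # 70 4
print(mean(class_b), range_b)  # 70 60
```

The mean alone cannot tell the two classes apart; the range can.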
11. Variance and Standard Deviation (Detailed Explanation)
Variance
Variance measures the average squared distance from the mean.
Higher variance means data is more spread out.
Standard Deviation (SD)
Standard deviation is the square root of variance.
Why SD is important:
Same unit as data
Easy to interpret
Example:
Low SD → consistent marks
High SD → inconsistent marks
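The definitions above can be followed step by step in code. This sketch uses the "average squared distance" definition (dividing by the number of values, the population variance), and the two sets of marks are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance

def variance_by_hand(values):
    """Average squared distance from the mean (population variance)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

consistent = [68, 70, 72]     # hypothetical marks, close together
inconsistent = [40, 70, 100]  # hypothetical marks, spread out

for marks in (consistent, inconsistent):
    var = variance_by_hand(marks)
    sd = sqrt(var)  # standard deviation = square root of variance
    print(marks, "variance:", round(var, 2), "SD:", round(sd, 2))

# Cross-check against the standard library
print(pvariance(inconsistent), pstdev(inconsistent))
```

When the data is a sample rather than the whole population, statistics.variance and statistics.stdev divide by n - 1 instead of n.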
12. Correlation (Detailed Explanation)
Correlation measures the strength and direction of the relationship between two variables. It is usually reported as a coefficient between -1 and +1.
Types:
Positive correlation: both increase together
Negative correlation: one increases while the other decreases
No correlation: no clear relationship
Example:
Temperature ↑ → Ice cream sales ↑ (positive)
Correlation does NOT mean causation.
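One common way to measure correlation is the Pearson correlation coefficient, which ranges from -1 to +1. The sketch below computes it by hand. The function name pearson_r and all the data are illustrative assumptions, not part of these notes.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    covariation = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = sqrt(sum((y - my) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

# Hypothetical daily observations
temperature = [20, 25, 30, 35]
ice_cream_sales = [110, 140, 210, 240]  # rises with temperature
heater_sales = [240, 190, 160, 100]     # falls with temperature

r_positive = pearson_r(temperature, ice_cream_sales)
r_negative = pearson_r(temperature, heater_sales)

print(round(r_positive, 2))  # close to +1
print(round(r_negative, 2))  # close to -1
```

A value near +1 or -1 only says the two variables move together; it does not say that one causes the other.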
13. Percentiles (Detailed Explanation)
Percentiles show the relative position of a value in a dataset.
Data is divided into 100 equal parts
Used to compare performance
Example:
If a student is in the 90th percentile, it means the student scored better than 90% of students.
Used in competitive exams, rankings, and performance analysis.
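A percentile rank can be sketched as "the percentage of scores below a given score". Exact definitions vary between textbooks and software, so treat this as one simple convention. The function name percentile_rank and the scores are hypothetical.

```python
def percentile_rank(scores, value):
    """Percentage of scores strictly below the given value."""
    below = sum(1 for s in scores if s < value)
    return 100 * below / len(scores)

# Hypothetical exam scores for 10 students
scores = [35, 42, 50, 58, 61, 67, 73, 80, 88, 95]

top_rank = percentile_rank(scores, 95)
middle_rank = percentile_rank(scores, 67)

print(top_rank)     # 90.0
print(middle_rank)  # 50.0
```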
14. Quartiles (Detailed Explanation)
Quartiles divide data into four equal parts.
Q1 (25%) – Lower quartile
Q2 (50%) – Median
Q3 (75%) – Upper quartile
Use:
Helps understand data distribution and detect outliers.
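Quartiles can be computed with statistics.quantiles from Python's standard library. To flag outliers, the sketch below uses a common convention that these notes do not state: a value is suspicious if it lies more than 1.5 times the interquartile range (Q3 - Q1) below Q1 or above Q3. The data is made up.

```python
from statistics import median, quantiles

# Hypothetical data with one unusually large value
data = [10, 12, 14, 15, 16, 18, 20, 22, 24, 60]

# n=4 splits the data into four parts and returns the three cut points
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1  # interquartile range

# Assumed rule of thumb: 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1, Q2, Q3:", q1, q2, q3)
print("Outliers:", outliers)
```

Different tools use slightly different quartile formulas, so Q1 and Q3 may differ a little from a hand calculation.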
15. Normal Distribution & Empirical Rule (Detailed Explanation)
Normal Distribution
Bell-shaped curve
Most values are near the mean
Empirical Rule (68–95–99.7 Rule)
About 68% of data lies within 1 SD of the mean
About 95% of data lies within 2 SD of the mean
About 99.7% of data lies within 3 SD of the mean
Used to understand how data is spread around the mean.
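The three percentages can be checked with statistics.NormalDist from Python's standard library. This sketch assumes a standard normal distribution (mean 0, SD 1); the rule holds for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

shares = {}
for k in (1, 2, 3):
    # Share of values between -k SD and +k SD around the mean
    shares[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"Within {k} SD: {100 * shares[k]:.1f}%")
```

The printed values are approximately 68%, 95%, and 99.7%, which is where the name of the rule comes from.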