Unit1 Mynotes

The document provides an overview of random variables, including discrete and continuous types, their probability distributions, and covariance. It explains key concepts such as the Central Limit Theorem, Chebyshev’s Inequality, and measures of central tendency and dispersion. Additionally, it discusses common discrete and continuous distributions, their characteristics, and applications in statistics and data analysis.

Uploaded by

snehacode128
Copyright
© All Rights Reserved

Random Variable

 A random variable is a numerical outcome of a random experiment.

 It maps outcomes of a probabilistic process (sample space) into numerical values.

 Two types: Discrete Random Variable and Continuous Random Variable.

 A random variable is usually denoted by a capital letter (e.g., X).

 The probability distribution of a random variable X tells what the possible values of X are and how
probabilities are assigned to those values.

1. Discrete Random Variable

 Takes only finite or countably infinite values.

 Values are usually whole numbers (0, 1, 2, …).

 Probability distribution is described using a Probability Mass Function (PMF).

 Example:

o Tossing a coin → X takes values in {0, 1} (0 = tails, 1 = heads).

o Rolling a die → X takes values in {1, 2, 3, 4, 5, 6}.

 Key Point: Each possible value has a probability between 0 and 1, and the probabilities of all possible values sum to 1.

2. Continuous Random Variable

 Takes any value within an interval (uncountably infinite).

 Values are usually real numbers.

 Probability distribution is described using a Probability Density Function (PDF).

 Probability of any single value is zero; probability is assigned only to intervals (as the area under the PDF over that interval).

 Example:

o Height of students in a class (say 150 cm – 190 cm).

o Time taken by a system to respond (continuous real values).

 Key Point: Probabilities are computed over a range by integrating the PDF.

| Point | Discrete Random Variable | Continuous Random Variable |
|---|---|---|
| 1. Definition | Takes finite or countably infinite values. | Takes uncountably infinite values within an interval. |
| 2. Nature of Values | Values are usually integers / whole numbers. | Values are real numbers (fractions, decimals possible). |
| 3. Function Used | Described by Probability Mass Function (PMF). | Described by Probability Density Function (PDF). |
| 4. Probability of Single Value | Can be non-zero. | Always zero (probability only over intervals). |
| 5. Graphical Representation | Bar graph / spikes. | Smooth curve / continuous line. |
| 6. Examples | Coin toss, dice roll, number of students. | Height, weight, time, temperature. |
| 7. Summation vs Integration | Probabilities calculated using summation. | Probabilities calculated using integration. |
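As a rough illustration of the summation vs integration point, the Python sketch below computes a discrete probability by summing PMF values and a continuous probability by numerically integrating a PDF. The fair die and the Uniform(150, 190) height model are assumptions chosen only for illustration, and `integrate` is a hypothetical helper based on the midpoint rule.

```python
from fractions import Fraction

# Discrete case: a fair six-sided die (assumed), so the PMF gives 1/6 to each face.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# P(X <= 2) is a SUM of PMF values.
p_at_most_two = sum(p for face, p in die_pmf.items() if face <= 2)


# Continuous case: height modelled as Uniform(150, 190) cm (assumed).
def height_pdf(x):
    return 1 / 40 if 150 <= x <= 190 else 0.0


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# P(160 <= X <= 170) is an INTEGRAL of the PDF over the interval.
p_160_to_170 = integrate(height_pdf, 160, 170)

# A single point is an interval of zero width, so its probability is zero.
p_exactly_165 = integrate(height_pdf, 165, 165)

print(p_at_most_two)           # 1/3
print(round(p_160_to_170, 6))  # 0.25
print(p_exactly_165)           # 0.0
```

The PDF value 1/40 is a density, not a probability; it only becomes a probability once it is multiplied by (integrated over) an interval width.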

Covariance

Definition

 Covariance is a statistical measure that tells us the direction of the linear relationship between two random
variables.

 It checks how much two variables vary together.

Formally:

Cov(X, Y) = E[(X - E[X]) (Y - E[Y])]

where

 E[X]= Mean of X,

 E[Y] = Mean of Y.

Interpretation

 Cov(X, Y) > 0 → Positive covariance → as X increases, Y also tends to increase.

 Cov(X, Y) < 0 → Negative covariance → as X increases, Y tends to decrease.

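 Cov(X, Y) = 0 → No linear relationship. Independent variables always have zero covariance, but zero covariance alone does not prove independence (the variables may still be related non-linearly).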

 Positive Covariance: When two variables tend to increase or decrease together. For example, height and
weight often have a positive covariance; as height increases, weight tends to increase as well.
 Negative Covariance: When one variable tends to increase as the other decreases, and vice versa. For
instance, the number of hours spent studying and the number of incorrect answers on a test might have a
negative covariance; more study time usually leads to fewer incorrect answers.
 Zero Covariance: When there is no clear linear relationship between the variables. Changes in one
variable do not correspond to predictable changes in the other. For example, there might be no consistent
relationship between an adult's shoe size and their vocabulary level, resulting in a covariance close to zero.
Simplified Formula (for n data points)

$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

where

 $x_i, y_i$ = data points,

 $\bar{x}, \bar{y}$ = means of X and Y.

Example

Suppose we have:

X = [1, 2, 3], Y = [2, 4, 6]

 Mean of X = 2, Mean of Y = 4.

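Cov(X, Y) = [(1−2)(2−4) + (2−2)(4−4) + (3−2)(6−4)] / 3

= [(−1)(−2) + (0)(0) + (1)(2)] / 3

= (2 + 0 + 2) / 3 = 4/3 > 0

→ Positive covariance: X and Y tend to increase together. The size of the covariance depends on the units of X and Y, so by itself it does not show how strong the relationship is.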
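The worked example can be checked with a few lines of Python. This is a minimal sketch of the divide-by-n formula used in these notes; `covariance` is a hypothetical helper name, not a library function.

```python
def covariance(xs, ys):
    """Population covariance: average product of deviations (divides by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


X = [1, 2, 3]
Y = [2, 4, 6]

cov_xy = covariance(X, Y)
print(cov_xy)  # 1.3333333333333333, i.e. 4/3

# Reversing Y makes it fall as X rises, so the sign flips.
print(covariance(X, [6, 4, 2]) < 0)  # True
```

Many statistics libraries divide by n − 1 by default (the sample covariance), which would give 2 for this data, so it is worth checking which convention a tool uses before comparing results.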

Relation with Correlation

 Correlation is the normalised version of covariance.

$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$

 Correlation is bounded between -1 and +1, whereas covariance is unbounded.
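To see why normalising matters, the sketch below measures the same heights in metres and in centimetres. The height and weight values are made up for illustration, and `covariance` and `correlation` are hypothetical helpers that follow the divide-by-n formulas in these notes.

```python
from math import sqrt


def covariance(xs, ys):
    """Population covariance (divides by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Corr(X, Y) = Cov(X, Y) / (sigma_X * sigma_Y)."""
    # Cov(X, X) is the variance of X, so its square root is sigma_X.
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


heights_m = [1.5, 1.6, 1.7, 1.8]   # hypothetical data
heights_cm = [150, 160, 170, 180]  # same heights, different unit
weights_kg = [50, 58, 63, 72]      # hypothetical data

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)

print(cov_m, cov_cm)    # covariance changes by a factor of 100 with the unit
print(corr_m, corr_cm)  # correlation is the same in both units
```

Because the standard deviations in the denominator carry the same units as the covariance, the units cancel, which is what keeps correlation between -1 and +1.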

Applications
1. Feature Selection → Helps identify relationships between attributes.

2. Portfolio Analysis (Finance) → To see how two stocks move together.

3. PCA (Principal Component Analysis) → Uses covariance matrix to find directions of maximum variance.

4. Multivariate Data Modelling → Shows dependencies across variables.

1. Discrete Distributions

A discrete probability distribution describes the probability of occurrence of each value of a discrete random
variable.

Key Characteristics

 Values are countable (finite or infinite).

 Represented using Probability Mass Function (PMF).

 Probability of each value ≥ 0, and total probability = 1.

 Often visualised as a bar graph.

Common Discrete Distributions

1. Bernoulli Distribution

o Represents a single trial with 2 outcomes (success/failure).

o Example: Tossing a coin (Head = 1, Tail = 0).

2. Binomial Distribution

o Repeated independent Bernoulli trials (n trials, probability p).

o Example: Number of heads in 10 coin tosses.

3. Poisson Distribution

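o Models the number of events occurring in a fixed time/space interval, when events happen independently at a constant average rate (often used for rare events).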

o Example: Number of phone calls received in 1 minute.

4. Geometric Distribution

o Models number of trials until first success.

o Example: Number of dice rolls needed to get a “6”.

2. Continuous Distributions

A continuous probability distribution describes how probability is spread over the values of a continuous random variable.

Key Characteristics

 Values are uncountably infinite (real numbers).

 Represented using Probability Density Function (PDF).

 Probability of a single point = 0; only intervals have probabilities.

 Visualised as a smooth curve (area under curve = 1).


Common Continuous Distributions

1. Uniform Distribution (Continuous)

o Equal probability across a range [a, b].

o Example: Random number generator between 0 and 1.

2. Normal (Gaussian) Distribution

o Symmetrical bell curve with mean μ and variance σ².

o Example: Human height, exam scores.

3. Exponential Distribution

o Models time between events in a Poisson process.

o Example: Time between arrivals of buses.

4. Chi-Square Distribution

o Used in hypothesis testing, variance estimation.

Central Limit Theorem


 If the population from which the sample has been drawn is normal, then the mean of the sample means equals the
population mean and the sampling distribution of the mean is normal (bell curve) for any sample size. When the
population is skewed, the sampling distribution of the mean still moves closer to the normal distribution,
provided the sample is large (as a rule of thumb, greater than 30).

 According to the Central Limit Theorem, for sufficiently large samples (as a rule of thumb, size greater than 30), the
shape of the sampling distribution of the mean becomes more and more like a normal distribution, irrespective of
the shape of the parent population, as long as the population has a finite variance.

 This theorem explains the relationship between the population distribution and the sampling distribution. It
highlights the fact that if the sample size is large enough, the sampling distribution of the mean
approaches the normal distribution.

 The significance of the central limit theorem lies in the fact that it permits us to use sample statistics to make
inferences about population parameters without knowing anything about the shape of the frequency
distribution of that population other than what we can get from the sample.

Key Points
1. Population may not be normal → CLT still works.
2. Sample size matters → Usually, n ≥ 30 is considered large enough.
3. Sampling distribution of mean →
o Mean of sample means = Population mean (μ).
o Variance of sample means = σ²/n, where σ² is the population variance.

Mathematical Statement
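If X1, X2, …, Xn are independent and identically distributed observations from a population with mean μ and finite
variance σ², then the sample mean:
$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$

follows approximately:
$\bar{X} \sim N(\mu, \sigma^2 / n)$ for large n

And in standardised form:
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1)$ (approximately, for large n)
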
Example (Simple)
Suppose we want to study the average height of students in a college.
 The population distribution may be skewed (some very tall, some short).
 If we take many samples of size 40 and calculate the mean height for each sample →
The distribution of these means will be approximately bell-shaped (Normal), even though the original population
was not normal.
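The example can be imitated with a small simulation. This is only a sketch under stated assumptions: the population is taken to be Exponential with rate 1 (a strongly skewed distribution with μ = 1 and σ² = 1), not real height data, and the seed, sample size and number of samples are arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 40     # n, as in the example above
NUM_SAMPLES = 5_000  # how many separate samples are drawn

POP_MEAN = 1.0  # mean of Exponential(rate=1)
POP_VAR = 1.0   # variance of Exponential(rate=1)


def draw_sample_mean():
    """Draw one sample of size n from the skewed population and return its mean."""
    sample = [rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    return statistics.fmean(sample)


sample_means = [draw_sample_mean() for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

# If the sample means are roughly normal, about 95% of them should fall
# within 1.96 standard errors of the population mean.
std_error = (POP_VAR / SAMPLE_SIZE) ** 0.5
share_within = sum(
    abs(m - POP_MEAN) <= 1.96 * std_error for m in sample_means
) / NUM_SAMPLES

print(f"mean of sample means:     {mean_of_means:.3f} (theory: {POP_MEAN:.3f})")
print(f"variance of sample means: {var_of_means:.4f} (theory: {POP_VAR / SAMPLE_SIZE:.4f})")
print(f"share within 1.96 SE:     {share_within:.3f} (normal theory: about 0.95)")
```

The simulated values should land close to μ and σ²/n, which matches the key points listed earlier: the mean of the sample means tracks the population mean, and their variance shrinks as n grows.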

Why is CLT Important?

1. It allows us to use the Normal Distribution for inference about means even when the underlying data is not normal.
2. Forms the basis for Hypothesis Testing (Z-test, t-test).
3. Helps build Confidence Intervals for population parameters.
4. Widely used in Data Science & Machine Learning when dealing with averages.

Chebyshev’s Inequality
Chebyshev’s inequality is a rule in probability theory that tells us how data is spread around the mean (average).
It gives a guarantee that, for many different probability distributions (not just normal), only a certain fraction of
values can be very far away from the mean.
Chebyshev’s inequality is a statistical theorem that gives a lower bound on the probability that the value of a random
variable lies within a certain number of standard deviations of the mean. It is applicable to any probability
distribution, regardless of its shape, as long as the distribution has a finite mean and variance.
 In other words, Chebyshev’s inequality guarantees that most values lie close to the mean, and
only a limited fraction can lie very far away.

Formula
$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$, or equivalently $P(|X - \mu| < k\sigma) \geq 1 - \frac{1}{k^2}$ (the bound is informative only for k > 1)

Explanation
 k = number of standard deviations away from the mean.
 σ (sigma) = standard deviation (spread of data).
 μ (mu) = mean (average).
 So, if we choose k = 2, then:
o At most 1/4 = 25% of data will lie 2 or more standard deviations away from the mean.
o At least 1 − 1/4 = 75% of data will lie within 2 standard deviations.

Examples
1. For k = 2: At least 75% of data is within ±2σ from the mean.
2. For k = 3: At least 88.9% of data is within ±3σ from the mean.
3. For k = 4: At least 93.75% of data is within ±4σ from the mean.
Example of Chebyshev’s Inequality
Suppose the mean (μ) of marks in a class test is 50, and the standard deviation (σ) is 10.
We want to know: At least what proportion of students’ marks will lie within 2 standard deviations of the mean?

Step 1: Recall the formula

Chebyshev’s inequality says:
$P(|X - \mu| < k\sigma) \geq 1 - \frac{1}{k^2}$

Step 2: Put values

Here, k = 2.
So,
$P(|X - \mu| < 2\sigma) \geq 1 - \frac{1}{2^2} = 1 - \frac{1}{4} = 0.75$

Step 3: Interpret the result


 This means at least 75% of students’ marks will lie within 2 standard deviations from the mean.
 Here, 2σ = 2 × 10 = 20.
 So, the interval is: 50 – 20 to 50 + 20 → [30, 70].
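The guarantee can also be checked numerically. In this sketch the marks are randomly generated from a skewed distribution purely for illustration (they are not real class data), and `chebyshev_lower_bound` is a hypothetical helper name.

```python
import random
import statistics


def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations: 1 - 1/k^2."""
    return 1 - 1 / k**2


rng = random.Random(0)
# Skewed, clearly non-normal data with mean around 50 (assumed for illustration).
marks = [rng.expovariate(1 / 50) for _ in range(1000)]

mu = statistics.fmean(marks)
sigma = statistics.pstdev(marks)  # population standard deviation of the data

observed = {}
for k in (2, 3, 4):
    within = sum(abs(x - mu) < k * sigma for x in marks) / len(marks)
    observed[k] = within
    print(f"k={k}: guaranteed >= {chebyshev_lower_bound(k):.4f}, observed {within:.4f}")
```

The observed fractions are usually well above the bound. Chebyshev's inequality gives a worst-case guarantee for any distribution, so it tends to be conservative for any particular dataset.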

Applications
1. Outlier detection → If a value lies more than kσ from the mean for a large k, it is unusual.
2. Works for all distributions → unlike Empirical Rule which only works for Normal distribution.
3. Guarantees minimum coverage → Useful when the population shape is unknown.
4. Risk analysis → Gives an upper bound on the probability of extreme deviations from the mean in finance, ML, and statistics.

Measures of Central Tendency
Measures of central tendency are numerical values that represent the centre or average of a dataset.
They describe where most of the data points lie and help summarise the dataset into a single representative value.

1. Mean (Arithmetic Average)


 Formula:
$\text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}}$
 Example: Data = {2, 4, 6, 8} → Mean = (2+4+6+8)/4 = 5
 Pros: Easy to calculate, uses all values.
 Cons: Affected by extreme values (outliers).

2. Median
 Definition: The middle value when data is arranged in ascending or descending order.
 If odd number of values → median = middle value.
 If even number of values → median = average of two middle values.
 Example: Data = {3, 5, 7, 9, 11} → Median = 7
 Pros: Not affected by outliers.
 Cons: Ignores the exact values of data except the middle.

3. Mode
 Definition: The value that occurs most frequently in a dataset.
 Example: Data = {2, 4, 4, 6, 8} → Mode = 4
 A dataset can have:
o No mode (if no value repeats)
o One mode (unimodal)
o More than one mode (bimodal, multimodal)
 Pros: Useful for categorical data.
 Cons: May not be unique, sometimes not representative.
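The three measures can be reproduced with Python's standard `statistics` module. The first datasets are the ones used in the examples above; the salary list is a made-up dataset added to show the effect of an outlier.

```python
import statistics

print(statistics.mean([2, 4, 6, 8]))          # 5
print(statistics.median([3, 5, 7, 9, 11]))    # 7 (odd count: middle value)
print(statistics.median([3, 5, 7, 9]))        # 6.0 (even count: average of 5 and 7)
print(statistics.mode([2, 4, 4, 6, 8]))       # 4
print(statistics.multimode([1, 1, 2, 2, 3]))  # [1, 2] (bimodal)

# Outlier effect (hypothetical salaries, in thousands): one extreme value
# pulls the mean far from the bulk of the data, while the median barely moves.
salaries = [30, 32, 35, 38, 400]
mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)
print(mean_salary, median_salary)             # 107 35
```

This is why the median is often preferred for skewed data such as incomes, while the mean is preferred when every value should contribute to the summary.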

Measures of Dispersion
Definition
 Measures of dispersion tell us how spread out or scattered the data is around its central value (like mean or
median).
 While measures of central tendency (mean, median, mode) describe the centre of the data, dispersion shows
the variability or consistency of the data.
 Example: Two classes may have the same average marks (mean = 60), but in one class most students scored
around 60, while in the other scores varied widely from 30 to 90.
→ Here, dispersion differentiates the two datasets.
Types of Measures of Dispersion
1. Range
 Simplest measure of dispersion.
 Formula:
Range = Maximum Value − Minimum Value
 Example: Marks vary between 20 and 95 → Range = 75.
 Limitation: Only depends on extreme values, ignores rest of data.

2. Mean Deviation
 Average of the absolute differences between each value and the mean (or median).
 Formula:
$MD = \frac{\sum |X - \bar{X}|}{N}$
 Tells us the average deviation from the centre.
 More informative than the range, since it uses every value, but less widely used than standard deviation.

3. Variance
 Measures average of the squared differences from the mean.
 Formula:
$\sigma^2 = \frac{\sum (X - \bar{X})^2}{N}$
 Useful in probability and statistics.
 Unit problem: variance is in squared units of data, so harder to interpret directly.

4. Standard Deviation (SD)


 Square root of variance.
 Formula:
$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$
 Most widely used measure.
 Tells us how much values deviate, on average, from the mean.
 Example: Smaller SD = data is consistent; Larger SD = data is spread out.

5. Coefficient of Variation (CV)


 Relative measure of dispersion (percentage form).
 Formula:
$CV = \frac{\sigma}{\bar{X}} \times 100$
 Used to compare variability of two datasets with different units or means.
 Example: Compare sales variability of two companies with different average sales.
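All five measures can be computed on one small dataset to see how they relate. The data below is hypothetical, chosen so that the results come out as round numbers; the divide-by-N (population) formulas from these notes are used throughout.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

mean = statistics.fmean(data)                                 # 5.0
data_range = max(data) - min(data)                            # 9 - 2 = 7
mean_deviation = sum(abs(x - mean) for x in data) / len(data) # 12 / 8 = 1.5
variance = statistics.pvariance(data)                         # 32 / 8 = 4
std_dev = statistics.pstdev(data)                             # sqrt(4) = 2.0
coeff_of_variation = std_dev / mean * 100                     # 40.0 (%)

print(data_range, mean_deviation, variance, std_dev, coeff_of_variation)
```

`pvariance` and `pstdev` divide by N, matching the formulas above. The functions `statistics.variance` and `statistics.stdev` divide by N − 1 instead, which is the usual choice when the data is a sample from a larger population.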

Graphical Statistics
Definition
 Graphical statistics means representing data using diagrams, charts, or graphs instead of just numbers and
tables.
 It helps us see patterns, trends, and comparisons quickly.
 "A picture is worth a thousand words" → Graphs make data easy to understand.

Types of Graphical Statistics


1. Bar Graph
 Uses rectangular bars to show data.
 Height/length of bar = value.
 Types:
o Simple Bar Graph → one variable.
o Multiple Bar Graph → compare two or more variables.
o Stacked Bar Graph → parts of a whole shown in one bar.
 Example: Comparing sales of a company in 4 quarters.

2. Histogram
 Special type of bar graph for continuous data.
 Bars are joined together (no gaps).
 Shows frequency distribution (how many values fall in different ranges).
 Example: Distribution of students’ marks in intervals (0–10, 10–20, etc.).

3. Pie Chart
 Circular chart divided into slices.
 Each slice = proportion of the whole.
 Useful for showing percentage or share.
 Example: Market share of different companies.

4. Line Graph
 Shows data points connected by lines.
 Best for time series data (changes over time).
 Example: Stock prices over a month.

5. Frequency Polygon
 Similar to histogram but uses lines instead of bars.
 Plots the midpoint of each class interval against its frequency and joins the points with straight lines.
 Example: Marks distribution in a class.

6. Ogive (Cumulative Frequency Curve)


 Graph of cumulative frequencies.
 Used to find median, quartiles, percentiles.
 Example: To see how many students scored less than 50 marks.
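The numbers behind a histogram and an ogive are a frequency table and its running total. The sketch below builds both in plain Python and draws a rough text histogram; the marks and the class width of 10 are assumptions for illustration.

```python
from itertools import accumulate

marks = [5, 12, 18, 23, 27, 31, 35, 38, 42, 47, 55, 58]  # hypothetical marks
CLASS_WIDTH = 10
NUM_CLASSES = 6  # classes 0-10, 10-20, ..., 50-60 (upper limit excluded)

# Frequency distribution: count how many marks fall in each class.
frequencies = [0] * NUM_CLASSES
for mark in marks:
    frequencies[mark // CLASS_WIDTH] += 1

# Cumulative frequencies: the values an ogive would plot at each upper class limit.
cumulative = list(accumulate(frequencies))

for i, (freq, cum) in enumerate(zip(frequencies, cumulative)):
    lower, upper = i * CLASS_WIDTH, (i + 1) * CLASS_WIDTH
    print(f"{lower:>2}-{upper:<2} | {'#' * freq:<4} | cumulative: {cum}")

# Reading the ogive at 50: number of students who scored less than 50.
below_50 = cumulative[50 // CLASS_WIDTH - 1]
print(below_50)  # 10
```

A plotting library would draw the `frequencies` list as histogram bars and the `cumulative` list as the ogive curve; the text output here is only a stand-in for those charts.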

7. Scatter Plot (Scatter Diagram)


 Uses dots/points to show relationship between two variables.
 Helps identify correlation (positive, negative, or none).
 Example: Study hours vs. marks scored.

Why Use Graphical Statistics?


 Easy to understand large data.
 Shows patterns and trends clearly.
 Useful for decision-making in business and research.

1. Method of Moments (MoM)


Definition
The Method of Moments is a technique of estimating the unknown parameters of a population distribution by
equating the sample moments (calculated from data) to the theoretical moments (defined by the probability
distribution).

Explanation
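 A moment is a summary quantity of a distribution: the k-th moment is E[X^k]. The mean is the first moment, and the variance and skewness are based on the second and third moments about the mean.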
 In MoM, we use the fact that the theoretical distribution has certain moments in terms of its parameters.
 We calculate the same moments from sample data.
 Then, we set them equal to the theoretical moments and solve for the parameters.

Steps
1. Write theoretical moments (e.g., population mean, variance) in terms of parameters.
2. Calculate sample moments from data (e.g., sample mean, sample variance).
3. Equate sample moments to theoretical moments.
4. Solve the resulting equations to estimate parameters.

Example
Suppose X ∼ Exponential(λ).
 Theoretical mean = 1/λ.
 Sample mean = $\bar{X}$.

Using MoM:
$\bar{X} = \frac{1}{\lambda} \Rightarrow \hat{\lambda} = \frac{1}{\bar{X}}$
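In code, the exponential example reduces to one line: compute the sample mean and invert it. The waiting times below are hypothetical, and `mom_exponential_rate` is an illustrative helper name.

```python
import statistics


def mom_exponential_rate(sample):
    """Method of Moments for Exponential(lambda): set sample mean = 1/lambda."""
    return 1 / statistics.fmean(sample)


waiting_times = [0.5, 1.5, 2.0, 4.0]  # hypothetical data, sample mean = 2.0

lambda_hat = mom_exponential_rate(waiting_times)
print(lambda_hat)  # 0.5
```

For a distribution with two unknown parameters, two equations would be needed: the first two sample moments would be equated to the first two theoretical moments and the pair solved together.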

Advantages
 Simple and easy to compute.
 Provides quick estimates.
Disadvantages
 Not always efficient (estimates may not be very accurate).
 Sometimes gives estimates outside the valid parameter range, and the moment equations can be difficult to solve.

2. Maximum Likelihood Estimation (MLE)


The Maximum Likelihood Estimation method finds the parameter values that maximise the likelihood function, i.e.,
the probability of observing the given sample data under the assumed distribution.

Explanation
 The likelihood function is the joint probability of sample data, treated as a function of the parameters.
 MLE chooses parameter estimates that make the observed data most probable.

Steps
1. Write the probability density/mass function (PDF/PMF) of the distribution.
2. Write the likelihood function $L(\theta)$ = product of probabilities (or densities) of the sample observations.
3. Take the log-likelihood function (log simplifies calculations).
4. Differentiate the log-likelihood with respect to the parameter(s) $\theta$.
5. Solve the equation $\frac{\partial}{\partial \theta} \ln L(\theta) = 0$ to get the estimate.

Process (Steps in Simple Words)


1. Write probability distribution for the data with unknown parameter(s).
2. Form likelihood function (L) → multiply probabilities of all observed data points.
3. Take log (log-likelihood) to simplify calculations.
4. Differentiate log-likelihood with respect to parameter(s).
5. Solve equation = 0 → this gives the MLE value of the parameter.
6. Check maximum (second derivative or logic).
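The steps can be traced for an Exponential(λ) model, chosen here as an assumption because its log-likelihood is easy to write down: ln L(λ) = n ln λ − λ Σx. Setting the derivative n/λ − Σx to zero gives λ = n/Σx. The sketch below computes that closed form on hypothetical data and cross-checks it with a simple grid search.

```python
import math


def exponential_log_likelihood(rate, sample):
    """ln L(lambda) = n*ln(lambda) - lambda*sum(x) for Exponential(lambda) data."""
    return len(sample) * math.log(rate) - rate * sum(sample)


sample = [0.5, 1.5, 2.0, 4.0]  # hypothetical observations

# Steps 4-5: d/d(lambda) ln L = n/lambda - sum(x) = 0  =>  lambda = n / sum(x)
mle_rate = len(sample) / sum(sample)

# Step 6: the second derivative, -n/lambda^2, is negative, so this is a maximum.
second_derivative = -len(sample) / mle_rate**2

# Numerical cross-check: try rates 0.001, 0.002, ..., 3.000 and keep the best.
grid = [i / 1000 for i in range(1, 3001)]
best_on_grid = max(grid, key=lambda r: exponential_log_likelihood(r, sample))

print(mle_rate)           # 0.5
print(second_derivative)  # -16.0
print(best_on_grid)       # 0.5
```

When no closed-form solution exists, a numerical search of this kind (normally with a proper optimiser rather than a grid) is how the maximum of the log-likelihood is found in practice.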

Advantages
 Estimates are often efficient and consistent.
 Widely used in statistical modelling and machine learning.
Disadvantages
 Sometimes mathematically complex.
 Requires differentiability of functions.

1. Geometric Distribution
The Geometric Distribution is a probability distribution that models the number of trials required to get the first
success in repeated independent Bernoulli trials (yes/no experiments).
 Each trial has only two outcomes: success (probability p) or failure (probability q = 1 − p).
 The trials are independent (outcome of one trial does not affect the other).
 It tells us: "What is the probability that the first success happens on the kth trial?"

Probability Mass Function (PMF)


$P(X = k) = (1 - p)^{k-1} \, p, \quad k = 1, 2, 3, \ldots$
Here k = trial number when the first success occurs.
p = probability of success.

Mean (Expected value)


$E[X] = \frac{1}{p}$
Variance
$\mathrm{Var}(X) = \frac{1 - p}{p^2}$

Example
Suppose probability of getting a head on a coin toss = 0.5.
 What is the probability that the first head appears on the 3rd toss?
$P(X = 3) = (1 - 0.5)^{2} (0.5) = (0.5)^2 \times 0.5 = 0.125$
✅ So, there is a 12.5% chance that first head appears exactly on the 3rd toss.
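The PMF, mean and variance formulas translate directly into Python. This is a minimal sketch; `geometric_pmf` is an illustrative helper name.

```python
def geometric_pmf(k, p):
    """P(first success on trial k) = (1 - p)^(k - 1) * p, for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p


p = 0.5  # probability of a head, as in the example above

print(geometric_pmf(3, p))  # 0.125

expected_tosses = 1 / p          # E[X] = 1/p = 2.0
variance_tosses = (1 - p) / p**2 # Var(X) = (1 - p)/p^2 = 2.0
print(expected_tosses, variance_tosses)

# The probabilities over k = 1, 2, 3, ... add up to 1; the first 50 terms
# already cover almost all of it.
partial_total = sum(geometric_pmf(k, p) for k in range(1, 51))
print(partial_total)
```

The value E[X] = 2 means that, on average, two tosses are needed to see the first head.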

2. Binomial Distribution
The Binomial Distribution gives the probability of getting exactly k successes in n independent trials, where each
trial has:
 Two outcomes (success/failure).
 Probability of success = p, probability of failure = q = 1 − p.
 Each trial is independent.

Probability Mass Function (PMF)


$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \ldots, n$
 $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ (number of ways to choose k successes from n trials).
 $p^k$ = probability of k successes.
 $(1-p)^{n-k}$ = probability of the remaining n − k failures.

Mean (Expected value)


E[X]=np
Variance
Var(X)=npq = np(1−p)

Example
A coin is tossed 5 times (n = 5), with p = 0.5 for a head.
 Probability of getting exactly 2 heads:
$P(X = 2) = \binom{5}{2} (0.5)^2 (0.5)^3 = \frac{5!}{2!\,3!} (0.25)(0.125) = 10 \times 0.03125 = 0.3125$
✅ So, there is a 31.25% chance of getting exactly 2 heads in 5 tosses.
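The same calculation can be done with `math.comb` from the Python standard library. This is a minimal sketch; `binomial_pmf` is an illustrative helper name.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(exactly k successes in n trials) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 5, 0.5  # five tosses of a fair coin, as in the example above

print(binomial_pmf(2, n, p))  # 0.3125

# The full distribution over k = 0..n sums to 1.
distribution = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(sum(distribution))      # 1.0

# Mean and variance computed from the PMF agree with np and np(1 - p).
mean = sum(k * prob for k, prob in enumerate(distribution))
variance = sum((k - mean) ** 2 * prob for k, prob in enumerate(distribution))
print(mean, variance)         # 2.5 1.25
```

Computing the mean and variance directly from the PMF is a useful way to confirm the shortcut formulas E[X] = np and Var(X) = np(1 − p).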

Independent Variable
An independent variable is the variable that a researcher manipulates or controls to study its effect on another
variable (called the dependent variable).
 It is the cause or input in an experiment.
 The dependent variable is the effect or outcome.
 In data modelling or statistics, independent variables are also called predictors, features, or explanatory
variables.
📌 Example:
If we are studying “Effect of study hours on exam marks”:
 Independent variable = number of study hours.
 Dependent variable = marks obtained.

Types of Independent Variables


1. Manipulated Independent Variable
o The researcher deliberately changes or controls it.
o Used in experiments.
o Example: Giving different groups of patients different drug doses.
2. Subject / Participant Independent Variable
o Based on characteristics of subjects that cannot be controlled.
o Example: Age, gender, intelligence level.
o Researcher only observes, doesn’t manipulate.
3. Situational Independent Variable
o Refers to the environment or setting that affects outcomes.
o Example: Testing memory in noisy vs. quiet rooms.
4. Controlled Variable (Control Variable)
o Variables kept constant to ensure they don’t influence results. Strictly, these are not independent variables, because they are deliberately not varied.
o Example: Keeping room temperature constant while testing effect of light on plant growth.
5. Categorical Independent Variable
o Takes discrete categories.
o Example: Type of diet (veg, non-veg, vegan).
6. Continuous Independent Variable
o Can take any numerical value within a range.
o Example: Time spent studying (2.5 hrs, 3 hrs, etc.).
1. Subtypes and Supertypes
Concept
 In data modelling, a supertype is a general entity, while subtypes are specialised versions of that entity.
 It is like a parent-child relationship.
 Subtypes inherit the attributes of the supertype but can also have their own unique attributes.
📌 Example:
 Supertype: Employee (common attributes: EmpID, Name, Salary).
 Subtypes:
o Manager (extra attribute: Bonus).
o Engineer (extra attribute: ProgrammingLanguage).
Why useful?
 Helps avoid repetition of attributes.
 Makes the model more organised and flexible.
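One way to picture the supertype/subtype idea is class inheritance. The sketch below mirrors the Employee example with Python dataclasses; the field names follow the attributes listed above, while the sample values are invented for illustration. In a relational database the same idea is usually modelled with linked tables rather than classes.

```python
from dataclasses import dataclass


@dataclass
class Employee:  # supertype: attributes shared by every employee
    emp_id: int
    name: str
    salary: float


@dataclass
class Manager(Employee):  # subtype: inherits emp_id, name, salary
    bonus: float


@dataclass
class Engineer(Employee):  # subtype with its own extra attribute
    programming_language: str


manager = Manager(emp_id=1, name="Asha", salary=90000.0, bonus=10000.0)
engineer = Engineer(emp_id=2, name="Ravi", salary=70000.0, programming_language="Python")

# Both subtypes can be treated as Employees (shared attributes are defined once).
for person in (manager, engineer):
    print(person.emp_id, person.name, person.salary)

print(manager.bonus)                  # only Managers have a bonus
print(engineer.programming_language)  # only Engineers have this attribute
```

The common attributes are declared once in `Employee`, which is the "avoid repetition" benefit described above.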

2. Hierarchical Data
Concept
 Hierarchical data is organised in a tree structure (parent–child).
 One record (parent) can have multiple child records, but each child has only one parent.
📌 Example:
 Company structure: CEO → Managers → Employees.
 File system: Folder → Subfolders → Files.
Why useful?
 Easy to represent organisational structures.
 Natural way to store data that has "levels".

3. Recursive Relationships
Concept
 A recursive relationship happens when an entity has a relationship with itself.
 It is used when records of the same entity are related in a hierarchical or network way.
📌 Example:
 Employee table:
o An employee can be a manager of another employee.
o Relationship: Employee → manages → Employee.
Why useful?
 Allows modelling of self-referencing data like family trees, organisational charts, or bill-of-materials.
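A recursive relationship is commonly stored as a self-referencing key: each employee record holds the ID of another employee who is their manager. The sketch below uses a plain dictionary in place of a table; the names, IDs and the helper functions `chain_of_command` and `direct_reports` are hypothetical.

```python
# Each record points to another record in the SAME collection via manager_id.
employees = {
    1: {"name": "Asha", "manager_id": None},  # top of the hierarchy
    2: {"name": "Ravi", "manager_id": 1},
    3: {"name": "Meena", "manager_id": 2},
    4: {"name": "John", "manager_id": 2},
}


def chain_of_command(emp_id):
    """Follow manager_id links upwards and return the managers' names in order."""
    chain = []
    manager_id = employees[emp_id]["manager_id"]
    while manager_id is not None:
        chain.append(employees[manager_id]["name"])
        manager_id = employees[manager_id]["manager_id"]
    return chain


def direct_reports(manager_id):
    """Return the names of employees whose manager is the given employee."""
    return [e["name"] for e in employees.values() if e["manager_id"] == manager_id]


print(chain_of_command(3))  # ['Ravi', 'Asha']
print(direct_reports(2))    # ['Meena', 'John']
```

Because every employee has at most one `manager_id`, this particular data also forms a hierarchy (tree) of the kind described in the previous section.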

4. Historical Data
Concept
 Historical data means keeping track of past changes in data instead of just storing the latest value.
 Useful for analysis, auditing, and understanding trends over time.
📌 Example:
 Customer Address: Instead of only storing the current address, we keep all past addresses with "effective
dates".
 Salary history: Employee’s salary record stored year by year.
Why useful?
 Helps in time-based analysis (e.g., sales growth).
 Important in industries like banking, healthcare, and retail where changes must be recorded for compliance.
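The "effective dates" idea can be sketched as a list of (date, value) pairs plus a lookup that answers "what was the value on a given day?". The addresses, dates and the helper `address_as_of` below are hypothetical and only illustrate the pattern.

```python
from datetime import date

# Address history for one customer, sorted by the date each address took effect.
address_history = [
    (date(2019, 1, 1), "12 Park Street"),
    (date(2021, 6, 15), "48 Lake Road"),
    (date(2023, 3, 1), "7 Hill View"),
]


def address_as_of(history, on_date):
    """Return the address in effect on on_date, or None if none existed yet."""
    current = None
    for effective_from, address in history:
        if effective_from <= on_date:
            current = address  # this address had already taken effect
        else:
            break  # later entries start after on_date
    return current


print(address_as_of(address_history, date(2020, 5, 1)))  # 12 Park Street
print(address_as_of(address_history, date(2024, 1, 1)))  # 7 Hill View
print(address_as_of(address_history, date(2018, 1, 1)))  # None
```

Nothing is overwritten when the address changes; a new row is appended, which is what makes auditing and time-based analysis possible.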
