Understanding Random Variables and Distributions
Bernoulli requirements :
Random variable:
- A variable is considered random if the numerical values it can take occur as the outcomes of a probability experiment.
- X = a statistical variable: a characteristic of a population or sample that can vary, e.g. $x = 1, 2, 3, 4$
Probability function $P(X = x)$:
- A function made of probabilities
· $P(X = x) = f(x)$ for $x$ in the domain, and $0$ otherwise
· $\sum P(X = x) = 1$
· $E[X] = \mu = \sum x\,f(x)$
· $\mathrm{Var}[X] = \sum (x - \mu)^2 f(x)$ or $E[X^2] - E[X]^2$
· $E[aX + b] = aE[X] + b$
· $\mathrm{Var}[aX + b] = a^2\,\mathrm{Var}[X]$
· $\mathrm{SD}[aX + b] = |a|\,\mathrm{SD}[X]$
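The rules for $E[X]$, $\mathrm{Var}[X]$ and the linear change $aX + b$ can be checked by hand or with a short script. The sketch below is only an illustration: the distribution (values 1 to 4, each with probability 0.25), the constants `a` and `b`, and the function names are assumptions, not taken from the notes.

```python
# Hypothetical distribution (not from the notes): X takes 1..4 with equal probability.
values = [1, 2, 3, 4]
probs = [0.25, 0.25, 0.25, 0.25]


def expected_value(xs, ps):
    """E[X] = sum of x * P(X = x)."""
    return sum(x * p for x, p in zip(xs, ps))


def variance(xs, ps):
    """Var[X] = sum of (x - mu)^2 * P(X = x)."""
    mu = expected_value(xs, ps)
    return sum((x - mu) ** 2 * p for x, p in zip(xs, ps))


mean = expected_value(values, probs)
var = variance(values, probs)
sd = var ** 0.5

# Linear change of scale and origin: Y = aX + b keeps the same probabilities.
a, b = -3, 2
y_values = [a * x + b for x in values]
mean_y = expected_value(y_values, probs)
var_y = variance(y_values, probs)

print(mean, var, sd)
print(mean_y, a * mean + b)   # E[aX + b] next to aE[X] + b
print(var_y, a ** 2 * var)    # Var[aX + b] next to a^2 Var[X]
```

A negative `a` is used on purpose: it shows why the SD rule needs $|a|$ rather than $a$.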
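· Solve E[X] on Main: enter the values as a list, e.g. {1, 2, 3, 4}, multiply by the list of their probabilities, then sum to get E[X]
· Solve all on Stats: enter the values and probabilities as lists, then Calc, One-Var, with the probability list as the Freq list; $\bar{x} = E[X]$ and $\sigma_x = \mathrm{SD}[X]$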
· Solve Bin distribution: Stats, graph, mode, table
· mode = the outcome $x$ for which $P(X = x)$ is the highest probability
· Range = max - min outcomes
· $X \sim \mathrm{Bin}(n, p)$: $P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$, where $\binom{n}{x}$ accounts for the different combinations in order
· $E[X] = np$
3) the peak of the histogram is the expected value
4) the histogram is symmetrical on both sides of the expected value
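As a rough check of the binomial formula, the probabilities can be tabulated and the mode and $E[X] = np$ read off. This is a minimal sketch; the parameters `n = 10`, `p = 0.3` and the name `binomial_pmf` are assumptions chosen for illustration.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


# Hypothetical parameters chosen for illustration only.
n, p = 10, 0.3
table = {x: binomial_pmf(x, n, p) for x in range(n + 1)}

total = sum(table.values())                        # probabilities should sum to 1
expected = sum(x * px for x, px in table.items())  # should equal n * p
mode = max(table, key=table.get)                   # outcome with the highest probability

print(total, expected, mode)
```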
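Continuous random variables (CRV)
1) $f(x) \ge 0$ for $a \le x \le b$, e.g. $40 \le x \le 45$
2) $\int_a^b f(x)\,dx = 1$
3) $P(X = x) = 0$ for a CRV, since the fundamental theorem of calculus says $\int_x^x f(t)\,dt = 0$
· Probabilities are areas under $f(x)$, e.g. $P(42 \le X \le 48)$
· $f(x)$ can be built from data, e.g. using the frequency of the 40-45 interval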
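Mean: $E[X] = \int_a^b x\,f(x)\,dx$
Standard deviation: $\mathrm{SD}[X] = \sqrt{\mathrm{Var}[X]}$
Percentile: we are identifying the score $k$ for which a given % of the data lies below, e.g. finding the 99th percentile: $P(X \le k) = 0.99$, so $\int_a^k f(x)\,dx = 0.99$
Variance: $\mathrm{Var}[X] = \int_a^b (x - \mu)^2 f(x)\,dx$ or $\int_a^b x^2 f(x)\,dx - \left[\int_a^b x\,f(x)\,dx\right]^2 = E[X^2] - E[X]^2$
Median: the value $m$ of $X$ where there is 50% of the data on either side of the score: $P(a \le X \le m) = P(m \le X \le b) = 0.5$, or using the CDF, $F(m) = P(X \le m) = 0.5$, so $\int_a^m f(x)\,dx = 0.5$
· $\mathrm{Var}[aX + b] = a^2\,\mathrm{Var}[X]$
· $\mathrm{SD}[aX + b] = |a|\,\mathrm{SD}[X]$
Uniform distribution: $k = \frac{1}{b - a}$, so $f(x) = k$ for $a \le x \le b$, and $0$ otherwise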
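The integrals for the mean, variance, median and percentiles can be approximated numerically when a calculator is not at hand. The sketch below assumes a made-up pdf, $f(x) = 2x$ on $0 \le x \le 1$, and uses a simple midpoint rule with bisection; the helper names `integrate` and `percentile` are hypothetical.

```python
def integrate(f, lo, hi, steps=2000):
    """Midpoint-rule approximation of the integral of f from lo to hi."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


def percentile(f, a, b, prob):
    """Find k in [a, b] so that the integral of f from a to k equals prob (bisection)."""
    lo, hi = a, b
    for _ in range(40):
        mid = (lo + hi) / 2
        if integrate(f, a, mid) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical pdf, not from the notes: f(x) = 2x on 0 <= x <= 1.
a, b = 0.0, 1.0
f = lambda x: 2 * x

area = integrate(f, a, b)                                   # must be 1 for a valid pdf
mean = integrate(lambda x: x * f(x), a, b)                  # E[X]
var = integrate(lambda x: x ** 2 * f(x), a, b) - mean ** 2  # E[X^2] - E[X]^2
median = percentile(f, a, b, 0.5)                           # F(m) = 0.5
p99 = percentile(f, a, b, 0.99)                             # 99th percentile

print(area, mean, var, median, p99)
```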
Expected value, variance and median of a uniform CRV:
· $E[X] = \frac{a + b}{2}$ = median
· $\mathrm{Var}[X] = \frac{(b - a)^2}{12}$
Triangular distribution: $f(x)$, $P(T \le x) = \dots$
Normal distribution
Def: we say that a continuous random variable $X$ is normally distributed with two parameters: the mean $\mu$ and the variance $\sigma^2$. A normally distributed random variable can be denoted using the notation $X \sim N(\mu, \sigma^2)$.
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$ for $-\infty < x < \infty$
· $\sigma$ is going to change the spread, that is, a greater standard deviation will create a greater spread.
3) There are two points of inflection, either side of the mean. It can be shown using calculus techniques that the points of inflection occur at $\mu - \sigma$ and $\mu + \sigma$.
4) The total area under the curve is equal to 1, making it a valid probability density function.
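Standard score
· $z = \frac{x - \mu}{\sigma}$: the z-score is how many SD away from the mean a score is
· 68% - 95% - 99.7% rule: $P(-1 \le z \le 1) = 0.68$, $P(-2 \le z \le 2) = 0.95$, $P(-3 \le z \le 3) = 0.997$
· $Z \sim N(0, 1)$: $f(z) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{z^2}{2}}$ for $-\infty < z < \infty$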
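The standard score and the 68% - 95% - 99.7% rule can be confirmed with Python's built-in `statistics.NormalDist`. This is a small sketch only; the mean of 100, SD of 15 and score of 130 are assumed values for illustration.

```python
from statistics import NormalDist

# Hypothetical test scores: mean 100, SD 15 (illustrative values only).
mu, sigma = 100, 15
x = 130

z = (x - mu) / sigma  # how many SDs x is from the mean

Z = NormalDist(0, 1)  # standard normal, Z ~ N(0, 1)
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

print(z)
for k, prob in within.items():
    print(f"P(-{k} <= Z <= {k}) = {prob:.4f}")
```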
Properties:
1) the peak of the bell curve is at the mean value of 0, which is also the median and mode.
2) the curve is asymptotic to the z-axis and continues infinitely in both directions
Inv Norm CDF: identify the tail (left $P(X < k)$, centre $P(-k < X < k)$ or right $P(X > k)$), insert the probability, insert SD, insert mean
· e.g. standardise each score and compare which is the greatest: find the z-score of each test
· the $k$th percentile: $P(X \le x) = c$, solving for $x$ using $z = \frac{x - \mu}{\sigma}$
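The inverse normal steps (identify the tail, insert the probability, SD and mean) can be mirrored in Python. The function `inv_norm` below is a hypothetical helper written for these notes, not a calculator or library command; it relies on `NormalDist.inv_cdf`, and the N(100, 15^2) example is an assumed distribution.

```python
from statistics import NormalDist


def inv_norm(tail, prob, sd, mean):
    """Return the boundary value(s) for the given tail and probability.

    tail is 'left' for P(X < k), 'right' for P(X > k),
    or 'centre' for P(lower < X < upper) symmetric about the mean.
    """
    dist = NormalDist(mean, sd)
    if tail == "left":
        return dist.inv_cdf(prob)
    if tail == "right":
        return dist.inv_cdf(1 - prob)
    if tail == "centre":
        lower = dist.inv_cdf((1 - prob) / 2)
        upper = dist.inv_cdf((1 + prob) / 2)
        return lower, upper
    raise ValueError("tail must be 'left', 'right' or 'centre'")


# 99th percentile of a hypothetical X ~ N(100, 15^2): P(X < k) = 0.99
k = inv_norm("left", 0.99, 15, 100)
# Equivalent route through the z-score: x = mu + z * sigma
z = inv_norm("left", 0.99, 1, 0)
print(k, 100 + z * 15)
```

Both routes should print the same score, which is the point of standardising.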
Random Sampling
Population: the set of all eligible members of a group which statisticians intend to study.
Parameter: a numerical measure of a population, such as the mean, median, mode or standard deviation. It can only be calculated using data from a census.
However, an entire population can be very large or difficult to access, so it would be very impractical to conduct a census every time we wish to collect data.
Survey: can be used to obtain the same information from each member of our sample, which is much quicker and cheaper than dealing with the whole population.
Statistic: a numerical measure of a sample of a target population, e.g. mode, median, mean, range, max, min. These sample statistics can be used to make inferences about the population parameters.
Representative sample: one that reflects the make-up of the population; as a result we would expect a fair and representative sample to produce sample statistics close to the population parameters.
Bias: a biased sample will not be representative of the population, as it will favour some section of the population.
Sampling bias
· Spatial bias (location bias)
· Under-coverage bias: includes an under- or over-representation of part of the population
· Non-response bias: due to an unwillingness or an inability to respond
· Self-selection bias: as a result of an opt-in process
Response bias
· A certain response is favoured
Probability (random) sampling
1) Simple random sampling
· Every member of the population has an equal chance of being included in the sample
· Can be done using RandList(n, a, b): n integers generated from the values a to b
2) Systematic sampling
· The population is sorted in an order and every nth member of the population is selected
· With this method it is important that there is no hidden bias in the list order
· RandList(1, a, b) gives a random starting number; every nth member from that random number is then taken (interval rounded down)
3) Stratified sampling
· The population is divided into strata based on common characteristics (S = strata size)
· Then relevant proportions are taken from each subgroup
4) Cluster sampling
· The population is divided into clusters, with each ...
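The first three probability sampling methods can be sketched with Python's `random` module in place of RandList. Everything here is illustrative: the population of labels 1 to 100, the strata sizes 60/30/10 and the function names are assumptions, not part of the notes.

```python
import random

rng = random.Random(0)            # seeded so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population labelled 1..100


def simple_random_sample(pop, n):
    """Every member has an equal chance of selection."""
    return rng.sample(pop, n)


def systematic_sample(pop, n):
    """Random start, then every k-th member, with k = len(pop) // n (rounded down)."""
    k = len(pop) // n
    start = rng.randrange(k)
    return pop[start::k][:n]


def stratified_sample(strata, n):
    """Take from each stratum in proportion to its size."""
    total = sum(len(members) for members in strata.values())
    return {
        name: rng.sample(members, round(n * len(members) / total))
        for name, members in strata.items()
    }


strata = {"A": population[:60], "B": population[60:90], "C": population[90:]}
print(simple_random_sample(population, 10))
print(systematic_sample(population, 10))
print(stratified_sample(strata, 10))
```

With less tidy strata sizes the rounded counts may not add up to exactly n, so the totals would need adjusting by hand.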
Non-probability sampling
1) Convenience sampling
· Easily accessible sampling, e.g. for data on primary schools, a local school is selected
2) Quota sampling
· Data is collected until the quota is filled
Simulating samples
Variability of samples: each of the different samples taken from a population is likely to produce different statistics, simply due to the randomness of sampling.
We can observe variation in samples through a process of simulation.
Simulation: an equivalent situation used to model the events of a random probability experiment, without actually conducting the experiment.
Parent distribution: the assumed distribution from which the sample is being taken.
· a + (b - a) × RandList(m): gives m values between 0 and 1 and scales them onto the interval from a to b, from which the sample is selected
· As n -> N, the sample statistics should tend towards the population parameters, but they may still vary due to the random nature of sampling
· RandBin(1, p, m): represents the distribution of m Bernoulli trials
· RandBin(n, p, m): creates a set of m values where each is the number of successes
Normal approximations to Binomial
· Sample proportion: denoted $\hat{p} = \frac{x}{n}$, where $x$ = number of successes and $n$ = sample size
· Note: $p$ is a constant true value, however $\hat{p}$ is a random variable that will vary from sample to sample
· $X \sim \mathrm{Bin}(n, p)$, $E[X] = np$, $\mathrm{Var}[X] = np(1 - p)$
· This distribution is used as the sample is interviewed via Bernoulli trials (independent, answer = yes/no)
· $E[\hat{p}] = p$ and $\mathrm{Var}[\hat{p}] = \frac{np(1 - p)}{n^2} = \frac{p(1 - p)}{n}$
· $\hat{p}$ is used as a point estimate for $p$
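A simulation makes the behaviour of the sample proportion visible. The sketch below imitates RandBin(n, p, m) with a hypothetical Python helper `rand_bin`, then compares the simulated mean and variance of the sample proportions with $p$ and $\frac{p(1 - p)}{n}$. The values `p = 0.4`, `n = 100`, `m = 2000` and the seed are assumptions.

```python
import random
from statistics import mean, pvariance

rng = random.Random(1)  # seeded so the sketch is repeatable


def rand_bin(n, p, m):
    """Return m simulated values of X ~ Bin(n, p), each a count of successes."""
    return [sum(rng.random() < p for _ in range(n)) for _ in range(m)]


# Hypothetical parent distribution: true proportion p = 0.4, samples of size n = 100.
n, p, m = 100, 0.4, 2000
p_hats = [x / n for x in rand_bin(n, p, m)]

print(mean(p_hats), p)                     # simulated mean next to E[p-hat] = p
print(pvariance(p_hats), p * (1 - p) / n)  # simulated variance next to p(1 - p)/n
```

The simulated figures should sit close to the theoretical ones without matching them exactly, which is the sample-to-sample variation described above.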
Sampling distribution of sample proportions: Distributions created by repeated samples of the population
Central limit theorem: given that p is approximately 0.5 (meaning symmetry) and n is sufficiently large, or given that p is not 0.5 but $np > 10$ and $n(1 - p) > 10$, the sampling distribution of sample proportions is approximately normally distributed. So as $n \to \infty$ the distribution becomes more normal.
Note: if these conditions aren't met, $X \sim \mathrm{Bin}(n, p)$ must be used instead.
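· How many SD away from the true p: $z = \frac{\hat{p} - p}{\sqrt{\frac{p(1 - p)}{n}}}$
· $\hat{p} \sim N\!\left(p, \frac{p(1 - p)}{n}\right)$
Probability interval
· $P\!\left(p - z\sqrt{\frac{p(1 - p)}{n}} \le \hat{p} \le p + z\sqrt{\frac{p(1 - p)}{n}}\right)$
· The area between 2 points on a $\hat{p}$ curve
· e.g. (3 marks) do an inverse normal on the right tail, $P(z \ge k) = 0.01$, then use the known SD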
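A probability interval for the sample proportion can be worked through with `statistics.NormalDist`. This is a hedged sketch with assumed values (`p = 0.4`, `n = 100`, and an observed sample proportion of 0.5); it first checks the conditions for the normal approximation.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical values: true proportion p = 0.4 and sample size n = 100.
p, n = 0.4, 100
assert n * p > 10 and n * (1 - p) > 10  # conditions for the normal approximation

sd = sqrt(p * (1 - p) / n)
p_hat_dist = NormalDist(p, sd)

# Interval that captures the central 95% of sample proportions.
z = NormalDist().inv_cdf(0.975)
lower, upper = p - z * sd, p + z * sd

# Area between two points on the p-hat curve, e.g. P(0.35 <= p-hat <= 0.45).
area = p_hat_dist.cdf(0.45) - p_hat_dist.cdf(0.35)

# How many SDs an observed p-hat of 0.5 sits from the true p.
z_observed = (0.5 - p) / sd

print(lower, upper, area, z_observed)
```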
Confidence Interval
Provides a range of values within which the population proportion p could lie, based on a sample proportion distribution
· z values: 1.645 (90%), 1.960 (95%), 2.576 (99%)
· Represented via $\hat{p} \pm z\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
Finding values
· Find $\hat{p}$ given a CI $[a, b]$: $\hat{p} = \frac{a + b}{2}$
· Find E given a CI $[a, b]$: width $w = b - a$, $E = \frac{w}{2}$, with $E = b - \hat{p}$ and $E = \hat{p} - a$
· Find min n given a max value of E: $n = \frac{z^2\,p(1 - p)}{E^2}$, where z and p are constants; use historical data for p
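The confidence interval formulas can be put together in a few lines. The sketch below is an illustration only: the function names are hypothetical, the sample (50 successes from 100) is made up, and the default `p=0.5` in `min_sample_size` is an assumption for when no historical data is available.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_value(level):
    """z for a two-sided confidence level, e.g. 0.95 -> about 1.960."""
    return NormalDist().inv_cdf((1 + level) / 2)


def confidence_interval(p_hat, n, level):
    """p-hat plus or minus z * sqrt(p-hat(1 - p-hat)/n)."""
    margin = z_value(level) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def min_sample_size(max_error, level, p=0.5):
    """Smallest n with margin of error <= max_error (p ideally from historical data)."""
    return ceil(z_value(level) ** 2 * p * (1 - p) / max_error ** 2)


# Hypothetical sample: 100 people, 50 successes.
a, b = confidence_interval(0.5, 100, 0.95)
p_hat_back = (a + b) / 2  # recover p-hat from the interval
error = (b - a) / 2       # recover the margin of error E from the width

print(a, b, p_hat_back, error)
print(min_sample_size(0.05, 0.95))
```

The sample size is rounded up, since rounding down would give a margin of error slightly larger than the maximum allowed.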
CI interpretations (do in context):
Comment on the likelihood that the true value of p lies within a single CI: likelihood cannot be inferred from a single observed sample, as the CI either contains or doesn't contain the true value, but we never know for certain due to the nature of random sampling.
If n samples were taken and a c% confidence interval built from each, how many are expected to contain the true value: n × c%
Precision: the precision of a confidence interval is a qualitative measure of how close the estimate is to the true value of the parameter. To obtain a better interval estimate, the width of the CI should be decreased whilst preserving the confidence level.
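The "how many intervals are expected to contain the true value" idea can be tried out by simulation. This is a rough sketch under assumed values (true `p = 0.4`, samples of size 200, 1000 repeated samples, 95% level); with real data the true p is unknown, which is exactly why a single interval cannot be checked this way.

```python
import random
from math import sqrt

rng = random.Random(2)  # seeded so the sketch is repeatable

# Hypothetical setup: true p = 0.4, samples of size 200, 1000 repeated samples.
p, n, samples = 0.4, 200, 1000
z = 1.960  # 95% confidence level

contains_p = 0
for _ in range(samples):
    successes = sum(rng.random() < p for _ in range(n))
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    if p_hat - margin <= p <= p_hat + margin:
        contains_p += 1

print(contains_p, "of", samples, "intervals contain p; expected about", round(samples * 0.95))
```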
Comparing 2 samples (do in context)
When given 2 samples you may be asked to comment on whether something has affected the sample proportions.
· If one or both of the sample proportions are contained in the other sample's confidence interval, then there is insufficient evidence to suggest that a changed condition had an impact on the population.
· If there is no overlap at all, then there is sufficient evidence to suggest that a changed condition may have had an impact on the population.
· If there is partial overlap between the 2 confidence intervals such that neither sample proportion is contained, then there is insufficient evidence to conclude anything definitive about the 2 samples.
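The three cases for comparing two confidence intervals can be written as a small decision function. The name `compare_samples`, the wording of the returned strings and the example intervals are all hypothetical; in an exam answer the conclusion still has to be written in context.

```python
def compare_samples(ci_1, p_hat_1, ci_2, p_hat_2):
    """Classify two confidence intervals, each given as a (lower, upper) tuple."""
    (a1, b1), (a2, b2) = ci_1, ci_2
    if b1 < a2 or b2 < a1:
        return "no overlap: sufficient evidence of a possible impact"
    if a2 <= p_hat_1 <= b2 or a1 <= p_hat_2 <= b1:
        return "proportion contained: insufficient evidence of an impact"
    return "partial overlap: nothing definitive can be concluded"


# Hypothetical intervals, for illustration only.
print(compare_samples((0.30, 0.40), 0.35, (0.45, 0.55), 0.50))
print(compare_samples((0.30, 0.40), 0.35, (0.33, 0.43), 0.38))
print(compare_samples((0.30, 0.40), 0.35, (0.38, 0.48), 0.43))
```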
Example