Probability
and
Sampling Distributions
R as a set of statistical tables
• The R suite of programs provides a simple way for statistical tables of just about
any probability distribution of interest and also allows for easy plotting of the form
of these distribution
General syntax for distribution functions
• There are four basic R commands that apply to the various distributions defined in R.
• Letting DIST denotes the particular distribution and parameters the parameters to
specify that particular distribution
• d DIST(x, parameters)--- probability density of DIST evaluated at x
• p DIST(x, parameters)- returns Pr(DIST(parameters)<=x)
• q DIST(p, parameters)-- returns x satisfying Pr(DIST(parameters)<=x)=p
• rDIST(n, parameters)- generates a random variables from DIST(parameters)
Distribution Parameters R Function
Binomial size, prob binom
Poison lambda pois
Geometric prob geom
Negative binomial size, prob nbinom
Normal mean , sd norm
Student t df t
Chi- square df chisq
F df1,df2 f
Exponential rate exp
Gamma Shape, rate gamma
Uniform min , max unif
Probability density function
• dnorm(2) # returns probability density of standard normal distribution
evaluated at z=2
## let z be a sequence of random normal deviates ranging b/n -3.5 and 3.5, increasing by 0.1
• z<-seq(from=-3.5,to=3.5,by=0.1) # defines z-values for the pdf
• f1<-dnorm(z,0,1)
• par(mfrow=c(2,2))
• plot(z,f1,type="l") # type=“l” will plot the lines
• title("Normal distribution")
• y<-seq(0,5,length=100)
• f2<-dchisq(y,1) # computes the probability of y for χ2 with
df=1
• plot(y,f2,type="l")
• title("Chi-square distribution")
Normal distribution Chi-square distribution
0.0 0.1 0.2 0.3 0.4
0.0 0.5 1.0 1.5
f1
f2
-3 -2 -1 0 1 2 3 0 1 2 3 4 5
z y
Probability distributions
>z=seq(-4,4,by=.1)
>y=dnorm(z)
>plot(z,y,type="l", main="Standard Normal and t (df=3) Densities")
>y2=dt(z,df=3)
>lines(z,y2,type="l",col="red")
• One can see that Student's t density is very similar to standard Normal
density except that the t density has an additional parameter called degrees
of freedom (df).
• Each new choice of df will produce a new t density.
• If df=100 or larger, t density is almost the same as standard Normal.
Exercise: Change df=3 to various values (e.g. df=30) and see the two curves
Note: t density in red color has fatter tails
Probability mass function
• For discrete distributions, where variables can take on only distinct values, it is preferable to
draw a pin diagram,
• Example1 the binomial distribution with n = 50 and p = 0.33
• par(mfrow=c(2,1))
> x <- 0:50 # The Binomial distribution takes values 0,...,n
> plot(x,dbinom(x,size=50,prob=.33),type="h",main = "Binomial mass")
> plot(x, pbinom(x,size=50,prob=.33),type = "s", main = "Binomial distribution")
• The distribution drawn corresponds to, for example, the number of 5s or 6s in 50 throws of a
symmetrical die.
Example 2 the Poisson distribution with lambda 0.2
> x<-0:10
> y<-dpois(0:10,0.2)
>[Link]("Prob"=y,[Link]=x)
> plot(0:10, dpois(0:10,0.2), type='h', xlab="Sequence Errors", ylab="Probability" )
Cumulative distribution & Quantiles
• R has useful mechanism for determining p-values instead of searching
through statistical tables and they can be easily achieved using the p(dist)
and q(dist) functions. Some examples are shown below.
> pnorm(1.96, 0,1) # the probability that Z is less than or equal to 1.96
[1] 0.9750021
>2*pnorm(-1.96) # 2-sided p-value for normal distribution
[1] 0.0249979
> qnorm(0.975)
[1] 1.959964
> 2*pt(-2.43,df=13) # 2-sided p-value for t distribution
[1] 0.0303309
To find the probability of getting t=1.50 (or greater) when df=15.
• Use either of the methods
• Method 1
• > pt(1.50, df=15, [Link] = FALSE)
• [1] 0.07718333
• Method 2
• > 1- (pt (1.50, df = 15))
• [1] 0.07718333
what’s the probability of getting 12.1 or greater for a chi-square distribution with 8 degrees of
freedom?
• #Method 1
• > pchisq(12.1, df =8, [Link]= FALSE)
• [1] 0.1467976
• #Method2
• > 1 - pchisq(12.1, df =8)
• qt() calculates the quantile for a given prob-value and degrees of freedom
separately.
• qt(p, df, [Link] = TRUE)
• The default argument [Link] = TRUE is used for two-sided and one-sided
seceding tests (X <=x). It has to be set on FALSE for a one-sided acceding test.
>qt(0.03391, 13, [Link] = FALSE)
qt(0.025,df=13) # to calculate the quantile for probability
0.025
[1] -2.160369
> qchisq(0.975,1)
[1] 5.023886
>qf(0.01, 2, 7, [Link] = FALSE) ## upper 1% point for an F(2, 7)
Random number generators in R
• E.g. random numbers from the normal distribution can be drawn using
rnorm() function
• rnorm(10) # draws 10 random numbers from a standard
normal distribution
• rnorm(10,5,2) #draws 10 random samples from a normal
distribution with mean 5 and standard deviation of 2
• rnorm(1000,5,2) #draws 1000 random samples from a normal
distribution with mean 5 and standard deviation of 2
• runif(10)
• rbinom(20,size=5,prob=.2)
• rpois(20,6) # to generate a random sample of size=20
Continued…
• x <- rnorm(100, 50, 10)
• hist(x, probability = TRUE)
• mean(x)
• sd(x)
• xnew <- seq(min(x), max(x), length=100)
• lines(xnew, dnorm(xnew, mean(x), sd(x)), col="red")
Simulating the Sample Distribution of the Mean
• The CLT is perhaps the most important concept in statistics
>data<-rnorm(25,100,15)
mean(data)
sd(data)
• We know that when the population is normal, the sample mean has a
normal distribution with mean 100 and a standard deviation of 3.
• Let’s verify that with a statistical simulation
>mean(rnorm(25,100,15))
>replicate(10, mean(rnorm(25,100,15))) #replicate 10 times
>data<- replicate(100000, mean(rnorm(25,100,15))) # replicate
100000 times and save in data
>mean(data)
>sd(data)
• These results are very close to our theoretical expectation
Simulating the Sample Distribution of the Mean
• Let’s look at histogram of our means
>hist(data,breaks=100, main=“Histogram of Sample Means”,
col="blue“, xlab = "Sample Means")
>[Link](density(data)) # it certainly looks normal
Example2
>x <- rnorm(1000,5,2)
>hist(x, probability = TRUE)
>mean(x)
>lines(density(x), col="blue")
>xnew <- seq(min(x), max(x), length=100)
>lines(xnew, dnorm(xnew, mean(x), sd(x)), col="red"
Examining the distribution of a set of data
• Given a (univariate) set of data we can examine its distribution in a large number of
ways. The simplest is to examine the numbers.
>attach(faithful)
> summary(eruptions)
• Min. 1st Qu. Median Mean 3 rd Qu. Max.
1.600 2.163 4.000 3.488 4.454 5.100
Graphical examination of normality
• R has a function hist to plot histograms
>hist(eruptions) # It has an unusual
distribution.
>hist(eruptions, seq(1.6, 5.2, 0.2),prob=TRUE)
>lines (density(eruptions), bw=0.1)
>rug(eruptions) # show the actual data
points
##Q-Q plots can help us examine this more carefully
>qqnorm(eruptions)
>qqline(eruptions)
Cont’d…..
• We might want a more formal test of agreement with normality (or not).
• R provides the Shapiro-Wilk test
> [Link](eruptions),
Shapiro-Wilk normality test
data: eruptions),
W = 0.8459, p-value = 9.036e-16
• Since the p-value is very small, so we reject H0 and conclude that the sample are unlikely to be from a
normal distribution
Kolmogorov-Smirnov test
> long <- eruptions[eruptions > 3]
> [Link](long, "pnorm", mean = mean(long), sd = sqrt(var(long)))
• One-sample Kolmogorov-Smirnov test
data: long
D = 0.0661, p-value = 0.4284 alternative hypothesis: two-sided
• (Note that the distribution theory is not valid here as we have estimated the parameters of the normal
distribution from the same sample.)
Hypothesis Tests in R
Recall the basic structure of hypothesis tests:
• An overall model and related assumptions are made. (The most common
being observations following a normal distribution.)
• The null (H0) and alternative (H1 or HA) hypothesis are specified. Usually
the null specifies a particular value of a parameter.
• With given data, the value of the test statistic is calculated.
• Under the general assumptions, as well as assuming the null hypothesis is
true, the distribution of the test statistic is known.
• Given the distribution and value of the test statistic, as well as the form of
the alternative hypothesis, we can calculate a p-value of the test.
• Based on the p-value and pre-specified level of significance, we make a
decision. One of:
--Fail to reject the null hypothesis.
One and two sample tests
• Consider the following data on the latent heat of fusion of ice
(cal/gm) from rice (1995, p490).
>Method_1<-c(79.98, 80.04, 80.02, 80.04, 80.03, 80.03, 80.04, 79.97,
80.05, 80.03, 80.02, 80.00, 80.02)
>Method_2<-c(80.02, 79.94, 79.98, 79.97, 79.97, 80.03, 79.95, 79.97)
One Sample t-test
Suppose we want to test if the mean of Method _1 is 80:
We specify µ0 with mu=80.
>[Link](Method _1,mu=80) or
> [Link](Method_1,mu=80, alternative="[Link]", [Link] = 0.95)
Two Sample t-test
For the previous sample data
a) Test for the equality of means of the two samples
(b) Test for equality of the variances of the two samples.
#Box plots provide a simple graphical comparison of the two samples.
>boxplot(Method _1, Method _2) #which indicates that the first group tends to give
higher results than the second.
• To test for the equality of the means of the two examples, we can use an unpaired t-
test by
>[Link](Method _1, Method _2, alternative = c("[Link]"))
## which does indicate a significant difference (p < 0.05)., assuming normality.
Comparison of variances
• By default the R function does not assume equality of variances in the
two samples. We can however use the F-test to test for the equality of
variances in the two samples provided that the two samples are from
normal populations.
• Checking homogeneity (approximate equality) of variances is, on the one
hand, a necessary precondition for a number of methods (for example
comparison of mean values) and on the other hand the heart of a number
of more sophisticated methods (such as analysis of variance).
• F-test for variance equality
• This test depends on the ratio of variances. The null hypothesis asserts
the ratio to be one.
• Given the above data sets, the R code for this test where the level of confidence is
0.95 will be:
##which shows no evidence of a significant difference, and so we can use the classical
t-test that assumes equality of the variances.
[Link](Method _1, Method _2,[Link]=T, alternative = c("[Link]"))
Paired t-Test
• Paired tests are used when there are two measurements on the same
experimental unit.
• The theory is essentially based on taking differences and thus
reducing the problem to that of a one-sample test.
• First generate the following data
• x<-sample(Method_1,7,replace=FALSE)
• y<-sample(Method_2,7,replace=FALSE)
• [Link](x,y,paired=TRUE)
• All the tests seen so far assume normality of the two samples. The
two-sample Wilcoxon (or Mann Whitney) test only assumes a
common continuous distribution under the null hypothesis.
> [Link](A, B)
Wilcoxon rank sum test with continuity correction
data: A and B
W = 89, p-value = 0.007497
alternative hypothesis: true location shift is not equal to 0
• Package exactRankTests is required when there is ties in the data to
conduct a better test.
Nonparametric Tests of Group Differences
R provides functions for carrying out Mann-Whitney U, Wilcoxon
Signed Rank, Kruskal Wallis, and Friedman tests.# independent 2-
group Mann-Whitney U Test
[Link](y~A)
# where y is numeric and A is A binary factor
• # independent 2-group Mann-Whitney U Test
[Link](y,x) # where y and x are numeric
• # dependent 2-group Wilcoxon Signed Rank Test
[Link](y1,y2,paired=TRUE) # where y1 and y2 are numeric
• # Kruskal Wallis Test One Way Anova by Ranks
[Link](y~A) # where y1 is numeric and A is a factor
• # Randomized Block Design - Friedman Test
[Link](y~A|B)
# where y are the data values, A is a grouping factor
# and B is a blocking factor
χ2 test for I × J contingency table
Consider the following two categorical variables:
x<-
[Link](c("Milk","Milk","Milk","Milk","Tea","Tea","Tea","T
ea"))
y<-
[Link](c("Milk","Milk","Milk","Tea","Milk","Tea","Tea","T
ea"))
These vectors of categorical variables are converted into contingency
tables by the R as:
table(x,y)
y
x Milk Tea
Milk 3 1
Tea 1 3
For making a test, we use the following R code:
[Link](x,y)
R output:
Pearson's Chi-squared test with Yates' continuity correction #read more
on Yate’s continuity correction,..
data: x and y
X-squared = 0.5, df = 1, p-value = 0.4795
• There is no significant association between the two variables.
Correlation coefficients for continuous
variables
• x <- c(1,2,3,5,7,9)
• y <- c(3,2,5,6,8,11)
>[Link](x, y, method="pearson")
• If the linearity of a relationship or the normality of the residuals is
doubtful, a rank correlation test can be carried out. Mostly,
Spearman’s rank correlation coefficient is used:
>[Link](x, y, method="spearman")
Exercises
5.1 Do the values of the react data set (notice that this is a single
vector, not a data frame) look reasonably normally distributed? Does
the mean differ significantly from zero according to a t test?
• 5.2 In the data set vitcap, use a t test to compare the vital capacity
for the two groups. Calculate a 99% confidence interval for the
difference. The result of this comparison may be misleading. Why?
• 5.3 Perform the analyses of the react and vitcap data using
nonparametric techniques.
• 5.4 Perform graphical checks of the assumptions for a paired t test