Introduction to R Statistical Package

Uploaded by

Joel Tan Yi Jie

A Quick Introduction to R

This document very briefly reviews the main features of the R statistical package. For full
details, please refer to the various manuals and help files that come packaged with R, or the
extra material referred to at the end of this document.

This document is divided into five sections:

1. Introduction: How to download and install, overview of capabilities.

2. R Basics: Covers using R as a calculator, entering data, types of variables, random
number generation, and getting help.

3. Basic Statistics: Covers t-tests with confidence intervals for means, chi-square tests
with confidence intervals for proportions, non-parametrics, simple regression, and
generalized linear models including logistic regression.

4. Graphics: scatter plots, histograms, 3-D plots, boxplots

5. Creating your own functions: We will see here how to create our own functions,
to extend the capabilities of R.

1 Introduction

1.1 What is R?

Quoted directly from the R WWW page ([Link]):

“R is a language and environment for statistical computing and graphics. It is
a GNU project which is similar to the S language and environment which was
developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by
John Chambers and colleagues. R can be considered as a different implementation
of S. There are some important differences, but much code written for S
runs unaltered under R.
“R provides a wide variety of statistical (linear and nonlinear modelling, classical
statistical tests, time-series analysis, classification, clustering, ...) and graphical
techniques, and is highly extensible. The S language is often the vehicle of choice
for research in statistical methodology, and R provides an Open Source route to
participation in that activity.
“One of R’s strengths is the ease with which well-designed publication-quality
plots can be produced, including mathematical symbols and formulae where
needed. Great care has been taken over the defaults for the minor design choices
in graphics, but the user retains full control.
“R is available as Free Software under the terms of the Free Software Foundation’s
GNU General Public License in source code form. It compiles and runs on a wide
variety of UNIX platforms and similar systems (including FreeBSD and Linux),
Windows and MacOS.”

In short: R is “free Splus”, a very sophisticated statistics and graphics package, used by
most statisticians these days for modern analysis and statistical research.

2 Installation

Installation follows a few trivial steps:

1. Go to the R web page, at [Link]
2. In the left-hand menu on the screen, click on “CRAN”, which is under the “Download”
item.
3. Pick a country site from which to download (for example, University of Toronto under
Canada, but really you can pick any; all this affects is download speed).
4. Near the top of the page (e.g., [Link]), inside the box labelled “Precompiled
Binary Distributions”, pick the right file to download, depending on your
operating system (e.g., click on “Windows (95 and later)” if you have any Windows
system).
5. This brings you to a page where you select the part of R you need. While you may later
want to download the set of user contributed functions, for now just click on “base”,
which gets you the basic R program.
6. At the next screen, click on “[Link]” (or similar). At this point you should
be asked (via a prompt box) where you want to save the file. Pick a place on your
computer to save this (e.g., c:\temp).
7. After the file has downloaded, in your explorer (in Windows; use the equivalent if you
have another operating system), find where you have saved it, and click on “[Link]”
(or similar).
8. Follow the directions on screen until the program is installed. In general, you can just
leave all options unchanged.
9. After installation is completed, click on the “R” icon on your desktop to begin the
program. At this point, your screen should have an R prompt on it, and you are ready
to begin using the program.

3 Adding Packages

You can easily extend R’s basic capabilities by adding in functions that others have created
and made freely available through the web. Rather than leaving you to surf the web trying to
find them, R makes adding these capabilities trivially easy through the “Packages” menu items.

For example, to add a new package not previously downloaded onto your computer, click on
“Packages −→ Install Package(s)”, pick a site from the list (usually a Canadian site) and
choose the package you want to install from the long list (currently, about 900 packages are
available). To activate this package, you then need to click on it in the “Packages −→ Load
Package” menu item. You are then ready to use it, and the help files for the package will
also be automatically added to your installation. See, for example, the “Help −→ HTML
help” menu item, and click on “Packages” in the browser window that opens.
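
The same two menu steps can also be done by typing at the R prompt. The sketch below is only an illustration: the package name "MASS" is an assumption chosen for the example (it normally ships with R), not a package this document relies on.

```r
# Sketch: install a package only if it is missing, then load it.
# "MASS" is an illustrative choice (an assumption of this example).
pkg <- "MASS"

if (!requireNamespace(pkg, quietly = TRUE)) {
  # Console equivalent of "Packages -> Install Package(s)";
  # needs network access to a CRAN mirror.
  install.packages(pkg)
}

# Console equivalent of "Packages -> Load Package".
library(pkg, character.only = TRUE)

# The package now appears on the search path.
"package:MASS" %in% search()
```

Installing is needed only once per computer, while loading with library() has to be repeated in each new R session.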

4 R Basics

R can be used in many ways:

Use R as a Calculator:

> 2+3
[1] 5
> pi
[1] 3.141593
> 2+3*pi
[1] 11.42478
> log(2+3*pi)
[1] 2.435785
> exp(2.435785)
[1] 11.42478
> 6/7
[1] 0.8571429
> cos(pi)
[1] -1

Entering and Manipulating Data in R:

Scalar variables:

> a <- 3
> a
[1] 3
> b <- 5
> b
[1] 5
> b-a
[1] 2
> b/a
[1] 1.666667

Vector variables:

> x<-c(2,3,1,5,4,6,5,7,6,8)
> x
[1] 2 3 1 5 4 6 5 7 6 8
> y <- c(10, 12, 14, 13, 34, 23, 12, 34, 25, 43)
> y
[1] 10 12 14 13 34 23 12 34 25 43
> x[4]
[1] 5
> y[3]
[1] 14
> x[1:10]
[1] 2 3 1 5 4 6 5 7 6 8
> x[6:8]
[1] 6 5 7
> x[x > 5]
[1] 6 7 6 8

Functions on vectors:

> length(x)
[1] 10
> length(y)
[1] 10
> sum(x)
[1] 47
> sum(y)
[1] 220
> sum(x^2)
[1] 265
> mean(x)
[1] 4.7
> mean(y)
[1] 22
> var(x)
[1] 4.9
> var(y)
[1] 136.4444
> sqrt(var(x))
[1] 2.213594
> sqrt(var(y))
[1] 11.68094
> sum((x-mean(x))^2)
[1] 44.1
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 3.25 5.00 4.70 6.00 8.00
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.00 12.25 18.50 22.00 31.75 43.00
> summary(x^2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 10.75 25.00 26.50 36.00 64.00
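
Several of the numbers above are linked by the sample variance formula. The following sketch recomputes var(x) by hand from the sum of squared deviations; the names ss, ss_shortcut and var_by_hand are made up for this illustration.

```r
# The vector x as entered earlier in this section.
x <- c(2, 3, 1, 5, 4, 6, 5, 7, 6, 8)
n <- length(x)

# Sum of squared deviations from the mean (the 44.1 shown above).
ss <- sum((x - mean(x))^2)

# The same quantity from the raw sums: sum(x^2) - n * mean(x)^2.
ss_shortcut <- sum(x^2) - n * mean(x)^2

# var() divides by n - 1 (not n), which gives the 4.9 shown above.
var_by_hand <- ss / (n - 1)

all.equal(var_by_hand, var(x))
all.equal(sqrt(var_by_hand), sd(x))   # sd() is a shortcut for sqrt(var())
```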

Matrices:

> z<- matrix(c(1,2,3,4,5,6,7,8,9), nrow=3, byrow=T)
> z
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> [Link]<- matrix(c(1,2,3,4,5,6,7,8,9), nrow=3, byrow=F)
> [Link]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> z[2,3]
[1] 6
> z[,1]
[1] 1 4 7
> z[1,]
[1] 1 2 3
> z[3,]
[1] 7 8 9

You can also have character variables:

> sex <- c("male", "female", "female", "male", "male", "female", "female")
> sex
[1] "male" "female" "female" "male" "male" "female" "female"
> smoking <- c("yes", "no", "no", "yes", "no", "yes", "no", "no", "yes", "no")
> smoking
[1] "yes" "no" "no" "yes" "no" "yes" "no" "no" "yes" "no"

You can declare these variables to be factor variables for regression analyses:

> [Link]<- [Link](sex)
> [Link]
[1] male female female male male female female
Levels: female male
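
A short sketch of what a factor holds may help here. It uses the base R function factor() and the made-up names sex_factor and sex_male_ref, which are assumptions of this example rather than the names used in the transcript above.

```r
sex <- c("male", "female", "female", "male", "male", "female", "female")

# Convert the character vector to a factor.
sex_factor <- factor(sex)

levels(sex_factor)       # levels are sorted alphabetically by default
table(sex_factor)        # counts per level
as.integer(sex_factor)   # the integer codes stored underneath

# In a regression the first level is the reference category;
# relevel() changes which level comes first.
sex_male_ref <- relevel(sex_factor, ref = "male")
levels(sex_male_ref)
```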

It is trivially easy to generate random numbers in R, useful for all kinds of
statistical analyses and simulations:

> x<-rnorm(10, mean=3, sd=2)
> x
[1] 2.135011 7.573118 3.840814 5.544555 3.393277 2.731562
[7] 1.780588 4.576343 3.250077 3.908082
> y<-rbinom(10, size=1, prob=0.4)
> y
[1] 0 1 1 1 0 0 1 0 1 0
> z<- 5 + 2*x + 3*y +rnorm(10, mean=0, sd=0.5)
> z
[1] 9.44732 23.30763 15.34688 19.62479 11.30383 10.89258
[7] 11.22488 13.33018 15.11767 12.81894
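
Because these numbers are random, rerunning the commands gives different values each time. If you need the same draws again, one common approach is to fix the seed first, as in this sketch (the seed value 607 is arbitrary).

```r
# Fixing the seed makes the "random" draws repeatable.
set.seed(607)
first <- rnorm(10, mean = 3, sd = 2)

# Resetting the same seed reproduces exactly the same numbers.
set.seed(607)
second <- rnorm(10, mean = 3, sd = 2)

identical(first, second)
```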

Data Frames:

Most data sets are entered into R as data frames. To create a data frame, you
can enter:

> [Link]<-[Link](x,y,z)
> [Link]
x y z
1 2.135011 0 9.44732
2 7.573118 1 23.30763
3 3.840814 1 15.34688
4 5.544555 1 19.62479
5 3.393277 0 11.30383
6 2.731562 0 10.89258
7 1.780588 1 11.22488
8 4.576343 0 13.33018
9 3.250077 1 15.11767
10 3.908082 0 12.81894
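
Once the data are in a data frame, columns and rows can be picked out in a few standard ways. The sketch below rebuilds the data frame from the values printed above; the name dat is a hypothetical choice for this example.

```r
# Values copied from the printout above.
x <- c(2.135011, 7.573118, 3.840814, 5.544555, 3.393277,
       2.731562, 1.780588, 4.576343, 3.250077, 3.908082)
y <- c(0, 1, 1, 1, 0, 0, 1, 0, 1, 0)
z <- c(9.44732, 23.30763, 15.34688, 19.62479, 11.30383,
       10.89258, 11.22488, 13.33018, 15.11767, 12.81894)

dat <- data.frame(x, y, z)   # "dat" is a made-up name

dim(dat)                     # number of rows and columns
names(dat)                   # column names
dat$z[1:3]                   # one column, first three values
with_y1 <- dat[dat$y == 1, ] # only the rows where y equals 1
mean(dat$x[dat$y == 1])      # mean of x within that subgroup
```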

To get help on any R function, simply type

> help(sum)

and a help window will pop up, in this case for the function “sum”. R also comes with
extensive manuals, html help files, a nice introduction, etc. Extremely Useful Suggestion for
this Course: Look at the help files for [Link] to see how to enter data from a text file on
your computer.
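
As a rough illustration of reading data from a text file, the sketch below uses the base R function read.table. The file contents, the column names age and outcome, and the name study are all invented for this example; a temporary file stands in for a file on your own computer.

```r
# Write a tiny whitespace-separated file with a header row.
path <- tempfile(fileext = ".txt")
writeLines(c("age outcome",
             "34 1.2",
             "51 2.8",
             "47 2.1"), path)

# header = TRUE tells read.table that the first line holds column names.
study <- read.table(path, header = TRUE)

str(study)        # structure: one row per line, one column per field
mean(study$age)
```

With a real data set you would replace path by the location of your own file, for example a file saved under c:\temp.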

5 Basic Statistical Routines

You can use R to do any of the standard statistical routines you will typically need. R
includes everything you will learn in 607/621, and many other routines, with more added
regularly. R has many more modern (frequentist) routines available compared to other
packages such as SAS or STATA, and whatever is not in the base package is likely to be in
the contributed section. In addition, routines that are not available from either of these
sources can easily be added on your own, as we will soon see.

T-Tests

> group1 <- x[y==1]
> group0 <- x[y==0]
> group1
[1] 7.573118 3.840814 5.544555 1.780588 3.250077
> group0
[1] 2.135011 3.393277 2.731562 4.576343 3.908082
> [Link](group1, group0)

        Welch Two Sample t-test

data:  group1 and group0
t = 0.9667, df = 5.431, p-value = 0.3748
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.675244  3.773195
sample estimates:
mean of x mean of y
 4.397831  3.348855
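
The heading "Welch Two Sample t-test" is what the base R function t.test prints when the two variances are not assumed equal. As a sketch, the result can also be stored and its pieces extracted by name; the object names res and pooled are assumptions of this example.

```r
# The two groups as printed above.
group1 <- c(7.573118, 3.840814, 5.544555, 1.780588, 3.250077)
group0 <- c(2.135011, 3.393277, 2.731562, 4.576343, 3.908082)

# Welch test (unequal variances) is the default.
res <- t.test(group1, group0)

res$statistic   # t value
res$parameter   # degrees of freedom
res$p.value
res$conf.int    # 95% confidence interval for the difference in means

# Classical pooled-variance version, for comparison.
pooled <- t.test(group1, group0, var.equal = TRUE)
pooled$parameter
```

The pooled version uses n1 + n2 - 2 degrees of freedom, whereas the Welch version estimates a (usually fractional) value from the two sample variances.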

Chi-square tests

Suppose you have a two-by-two table of data, such as:

               Patient Independent   Patient Dependent
Stroke Unit            67                   34
Medical Unit           46                   45

You can simply enter:

> [Link](c(67,46), c(67+34, 46+45) )

        2-sample test for equality of proportions with continuity correction

data:  c(67, 46) out of c(67 + 34, 46 + 45)
X-squared = 4.2965, df = 1, p-value = 0.03819
alternative hypothesis: [Link]
95 percent confidence interval:
 0.009420794 0.306322868
sample estimates:
   prop 1    prop 2
0.6633663 0.5054945

More generally:

> smokers <- rbinom(100, 1, 0.2)
> [Link]<-rbinom(200, 1, 0.12)
> [Link](c(sum(smokers), sum([Link])), c(100,200), correct=F)

        2-sample test for equality of proportions without continuity correction

data:  c(sum(smokers), sum([Link])) out of c(100, 200)
X-squared = 2.1319, df = 1, p-value = 0.1443
alternative hypothesis: [Link]
95 percent confidence interval:
 -0.02659294  0.15659294
sample estimates:
prop 1 prop 2
 0.200  0.135
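
The output heading "2-sample test for equality of proportions" is what the base R function prop.test prints. For a two-by-two table it should give the same X-squared statistic as chisq.test applied to the table of counts. The sketch below checks this for the stroke unit data; the names pt, tab and ct are made up for this example.

```r
# Two-sample test of proportions: successes, then group sizes.
pt <- prop.test(c(67, 46), c(67 + 34, 46 + 45))

# The same data as a 2 x 2 table of counts (filled column by column).
tab <- matrix(c(67, 46, 34, 45), nrow = 2,
              dimnames = list(unit = c("Stroke Unit", "Medical Unit"),
                              outcome = c("Independent", "Dependent")))

# Chi-square test; the continuity correction is on by default for 2 x 2 tables.
ct <- chisq.test(tab)

pt$statistic
ct$statistic
pt$conf.int   # prop.test also gives a CI for the difference in proportions
```

The practical difference is that prop.test also reports the two estimated proportions and a confidence interval for their difference.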

Linear Models

Recall our data frame:

> [Link]
x y z
1 2.135011 0 9.44732
2 7.573118 1 23.30763
3 3.840814 1 15.34688
4 5.544555 1 19.62479
5 3.393277 0 11.30383
6 2.731562 0 10.89258
7 1.780588 1 11.22488
8 4.576343 0 13.33018
9 3.250077 1 15.11767
10 3.908082 0 12.81894

Suppose we want a linear regression model of z on x and y.

> [Link]<-lm(z ~ x + y)
> summary([Link])

Call:
lm(formula = z ~ x + y)

Residuals:
Min 1Q Median 3Q Max
-0.6821 -0.4339 0.0893 0.3849 0.5679

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.8644 0.4245 11.460 8.66e-06 ***
x 1.9989 0.1064 18.792 3.00e-07 ***
y 3.2690 0.3450 9.475 3.05e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5162 on 7 degrees of freedom
Multiple R-Squared: 0.9889, Adjusted R-squared: 0.9857
F-statistic: 311.6 on 2 and 7 DF, p-value: 1.444e-07

Recall that z was generated by

> z<- 5 + 2*x + 3*y +rnorm(10, mean=0, sd=0.5)

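so our estimates are close to the true values: 4.86 for the intercept (true value 5), 2.00 for
x (true value 2), 3.27 for y (true value 3), and a residual standard error of 0.52 (true sd 0.5).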
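
As a sketch of how to work with a fitted model beyond printing its summary, the code below refits the regression from the values printed in the data frame and pulls out the coefficients, their confidence intervals and the residual standard error. The object name fit is an assumption of this example.

```r
# Values copied from the data frame printed above.
x <- c(2.135011, 7.573118, 3.840814, 5.544555, 3.393277,
       2.731562, 1.780588, 4.576343, 3.250077, 3.908082)
y <- c(0, 1, 1, 1, 0, 0, 1, 0, 1, 0)
z <- c(9.44732, 23.30763, 15.34688, 19.62479, 11.30383,
       10.89258, 11.22488, 13.33018, 15.11767, 12.81894)

fit <- lm(z ~ x + y)

coef(fit)             # intercept and the two slopes
confint(fit)          # 95% confidence intervals for each coefficient
summary(fit)$sigma    # residual standard error
head(residuals(fit))  # observed minus fitted values
```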

Other types of regression equations are available through generalized linear models:

For logistic regression, type (assuming z is dichotomous)

> glm(z ~ x + y, family=binomial(link="logit"))

For Poisson regression, type (assuming z is a count variable)

> glm(z ~ x + y, family=poisson(link = "log"))

and so on.
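
The z generated earlier is continuous, so the two glm calls above cannot be run on it as is. The following sketch simulates a dichotomous and a count outcome instead; the seed, sample size and true coefficients are all invented for this illustration.

```r
set.seed(621)   # arbitrary seed so the example is repeatable
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = 0.4)

# Dichotomous outcome: log-odds are linear in x and y.
z_binary <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.0 * x + 0.8 * y))
logit_fit <- glm(z_binary ~ x + y, family = binomial(link = "logit"))
exp(coef(logit_fit))   # coefficients as odds ratios

# Count outcome: log of the mean is linear in x and y.
z_count <- rpois(n, lambda = exp(0.2 + 0.5 * x + 0.3 * y))
pois_fit <- glm(z_count ~ x + y, family = poisson(link = "log"))
exp(coef(pois_fit))    # coefficients as rate ratios
```

As with lm, summary() applied to either fitted object gives standard errors and p-values.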

6 Graphics

R is capable of producing very high quality graphics in a variety of formats (wmf, ps, jpg,
pdf, png, bmp).

For a scatter plot, simply type:

> plot(x,z)

Figure 1: A simple scatter plot

Adding titles and labels for each axis is easy:

> plot(x,z, main="Scatter plot of age versus outcome", xlab="age", ylab="outcome")

Figure 2: A simple scatter plot with labels and title

Adding the best fitting regression line to the plot as well:

> plot(x,z, main="Scatter plot of age versus outcome", xlab="age", ylab="outcome")
> lsfit(x,z)$coef
Intercept         X
 5.236471  2.324865
> [Link] <-seq(2,7,by=0.01)
> [Link] <- 5.236471 + 2.324865*[Link]
> points([Link], [Link], type="l")

Figure 3: Scatter plot with best fitting line added

You can plot more than one curve on a single plot, and label them via a legend:

> range <- seq(-10, 10, by = 0.001)
> norm1 <- dnorm(range, mean=0, sd=1)
> norm2 <- dnorm(range, mean=1, sd=2)
> plot(range, norm1, type="l", lty=1, main="Two Normal Distributions",
       xlab="Range", ylab="Probability Density")
> points(range, norm2, type="l", lty=2)
> legend(x=-10, y=0.4, legend= c("N(0,1)", "N(1,2)"), lty=c(1,2))

Figure 4: Distributions with legend

13
You can also do histograms . . .

> hist(rnorm(1000,mean=0, sd=1), main = "Sample From N(0,1)


Distribution", xlab = "Range", ylab="Frequency")

Figure 5: Histogram

. . . and boxplots.

> boxplot(rnorm(1000, mean=0, sd=1), rnorm(1000, mean=1, sd=2),


names=c("N(0,1)", "N(1,2)"))

14
Figure 6: Boxplot

15
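To get a plot into one of the file formats listed at the start of this section, one option is to open a graphics device, draw the plot, and close the device. The sketch below writes the scatter plot to a PDF in a temporary folder; the file name scatter.pdf is made up, and abline() is used as a shorter way of drawing the fitted line than computing the points by hand.

```r
# x and z as printed in the data frame earlier.
x <- c(2.135011, 7.573118, 3.840814, 5.544555, 3.393277,
       2.731562, 1.780588, 4.576343, 3.250077, 3.908082)
z <- c(9.44732, 23.30763, 15.34688, 19.62479, 11.30383,
       10.89258, 11.22488, 13.33018, 15.11767, 12.81894)

out <- file.path(tempdir(), "scatter.pdf")   # hypothetical file name

pdf(out, width = 6, height = 5)   # open a PDF device (size in inches)
plot(x, z, main = "Scatter plot of age versus outcome",
     xlab = "age", ylab = "outcome")
abline(lm(z ~ x))                 # add the least squares line
dev.off()                         # close the device so the file is written

file.exists(out)
```

Replacing pdf() by png() or jpeg() works the same way, although those devices take their width and height in pixels by default.
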
7 Writing your Own Functions

It is very easy to write your own functions in R. Once these are written, they are available
in your R session just like any other function (and in later sessions too, if you save your
workspace or keep the definitions in a script file). Here I
present two examples, one easy, one (slightly) more complex.

First the easy example, creating a function to add two numbers. Of course, this is a silly
example, since R already does this easily, but its simplicity will allow us to see how functions
are constructed in general.

> [Link] <- function(a,b)
{
sum = a+b
return(sum)
}
> [Link](3,4)
[1] 7
> pi
[1] 3.141593
> exp(1)
[1] 2.718282
> [Link](pi, exp(1))
[1] 5.859874

That is all one needs to do to make a function!
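
For reference, here is a self-contained version of the same idea that can be pasted into R as is. The function name add_two and the local name total are hypothetical choices for this sketch; total is used so that the built-in function sum is not hidden inside the function.

```r
# A function is created with function(), assigned to a name,
# and hands back its result through return().
add_two <- function(a, b) {
  total <- a + b
  return(total)
}

add_two(3, 4)
add_two(pi, exp(1))

# Arguments can also be matched by name, in any order.
add_two(b = 10, a = 1)
```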

Now for something more complex, we will create a function to carry out Bayesian inferences
for a single normal mean, when the variance is assumed known.

The theory is the following (to remind you of what we covered last class):

Suppose that your data set, $X = (x_1, x_2, \ldots, x_n)$, with sample size $n$ is known to arise from
a Normal distribution, $x_i \sim N(\mu, \sigma^2)$, $i = 1, 2, \ldots, n$, i.e., the likelihood function for the data
is normal. Suppose also that the variance of the data, $\sigma^2$, is a known constant, and so the
only unknown parameter to be estimated is the mean. Suppose that the prior distribution
for the unknown mean $\mu$ is $N(\theta, \tau^2)$.

If we combine the information in the prior distribution with the information in the data set
via Bayes Theorem, we derive the posterior distribution, which summarizes current information
about the mean $\mu$. It happens that the posterior is also normal, and the exact formula
is:

$$
\mu_{\text{posterior}} \sim N\left( \frac{\dfrac{\theta}{\tau^2} + \dfrac{n\bar{x}}{\sigma^2}}{\dfrac{1}{\tau^2} + \dfrac{n}{\sigma^2}},\ \left[ \frac{1}{\tau^2} + \frac{n}{\sigma^2} \right]^{-1} \right),
$$

where $\bar{x}$ is the mean of the observed data, and $[\cdot]^{-1}$ indicates matrix inversion.

A function that implements this formula in R is:

> [Link] <- function(x, [Link], [Link], [Link])
{
####################################################################
# R function for Bayesian analysis of normal mean, variance known #
# Parameters included are: #
# #
# Inputs: #
# #
# x = vector of data #
# [Link] = prior mean #
# [Link] = prior variance #
# [Link] = assumed known variance of data #
# #
# Outputs: #
# #
# [Link] = posterior mean #
# [Link] = posterior variance #
# #
####################################################################

n<- length(x)
[Link] <- mean(x)
[Link] <- [Link]/[Link] + n*[Link]/[Link]
[Link] <- 1/[Link] + n/[Link]
[Link] <- [Link]/[Link]
[Link] <- (1/(1/[Link] + n/[Link]))

a <- "Post mean = "
b <- "Post Var = "
cat(a, [Link], ",", b, [Link], "\n")
}
> [Link](1:10, 0, 1000, 1)
Post mean = 5.49945 , Post Var = 0.09999
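
The printed values can be checked by hand. With the data 1, 2, ..., 10 we have $n = 10$ and $\bar{x} = 5.5$, and the call above sets $\theta = 0$, $\tau^2 = 1000$ and $\sigma^2 = 1$. Substituting into the posterior formula gives:

```latex
\text{posterior mean}
  = \frac{\dfrac{0}{1000} + \dfrac{10 \times 5.5}{1}}{\dfrac{1}{1000} + \dfrac{10}{1}}
  = \frac{55}{10.001} \approx 5.49945,
\qquad
\text{posterior variance}
  = \left[ \frac{1}{1000} + \frac{10}{1} \right]^{-1}
  = \frac{1}{10.001} \approx 0.09999 .
```

Because the prior variance is so large, the prior contributes almost nothing: the posterior mean is very close to the sample mean 5.5, and the posterior variance is very close to $\sigma^2/n = 0.1$.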

Another useful way to create output in functions is illustrated below. This format is especially
useful if you want to use the program’s results later on in your R session.

> [Link] <- function(x, [Link], [Link], [Link])
{
####################################################################
# R function for Bayesian analysis of normal mean, variance known #
# Parameters included are: #
# #
# Inputs: #
# #
# x = vector of data #
# [Link] = prior mean #
# [Link] = prior variance #
# [Link] = assumed known variance of data #
# #
# Outputs: #
# #
# [Link] = posterior mean #
# [Link] = posterior variance #
# #
####################################################################

n<- length(x)
[Link] <- mean(x)
[Link] <- [Link]/[Link] + n*[Link]/[Link]
[Link] <- 1/[Link] + n/[Link]
[Link] <- [Link]/[Link]
[Link] <- (1/(1/[Link] + n/[Link]))

[Link] <- list([Link]= [Link], [Link] = [Link])
return([Link])
}
> [Link](1:10, 0, 1000, 1)
$[Link]
[1] 5.49945

$[Link]
[1] 0.09999
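
To show why returning a list is convenient, here is a self-contained sketch of the same calculation whose results are reused afterwards. All names in it (normal_posterior, prior_mean, prior_var, data_var, post_mean, post_var) are hypothetical choices for this example and are not claimed to be the names used in the listings above; the input check and the credible interval are additions made for illustration.

```r
# Posterior for a normal mean with known data variance,
# following the formula given in the theory above.
normal_posterior <- function(x, prior_mean, prior_var, data_var) {
  stopifnot(length(x) > 0, prior_var > 0, data_var > 0)
  n <- length(x)
  precision <- 1 / prior_var + n / data_var      # 1 / posterior variance
  post_mean <- (prior_mean / prior_var + n * mean(x) / data_var) / precision
  list(post_mean = post_mean, post_var = 1 / precision)
}

res <- normal_posterior(1:10, prior_mean = 0, prior_var = 1000, data_var = 1)

# Elements of the returned list are picked out with $ and reused,
# here for a 95% posterior (credible) interval for the mean.
ci <- qnorm(c(0.025, 0.975), mean = res$post_mean, sd = sqrt(res$post_var))
ci
```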

A useful function relating to functions is args, which reminds you of the required arguments
of the function. For example:

> args([Link])
function (x, [Link], [Link], [Link])

8 Conclusion

R is a very powerful package, and we have just scratched the surface of what is available. Its
interactive nature makes it more pleasant to use than “batch” programs like SAS, and
its development environment makes it the first choice of many, if not the majority, of
statisticians today.

Some other resources include:

• A useful book titled “Analysis of Epidemiological Data Using R and Epicalc,” available
at

[Link]

• A series of useful tips are available at

[Link]

I would be happy to answer any questions you have about R; just email me.
