Applied Statistics with R

David Dalpiaz
Contents

1 Introduction
  1.1 About This Book
  1.2 Conventions
  1.3 Acknowledgements
  1.4 License

2 Introduction to R
  2.1 Getting Started
  2.2 Basic Calculations
  2.3 Getting Help
  2.4 Installing Packages

3 Data and Programming
  3.1 Data Types
  3.2 Data Structures
    3.2.1 Vectors
    3.2.2 Vectorization
    3.2.3 Logical Operators
    3.2.4 More Vectorization
    3.2.5 Matrices
    3.2.6 Lists
    3.2.7 Data Frames
  3.3 Programming Basics
    3.3.1 Control Flow
    3.3.2 Functions

4 Summarizing Data
  4.1 Summary Statistics
  4.2 Plotting
    4.2.1 Histograms
    4.2.2 Barplots
    4.2.3 Boxplots
    4.2.4 Scatterplots

5 Probability and Statistics in R
  5.1 Probability in R
    5.1.1 Distributions
  5.2 Hypothesis Tests in R
    5.2.1 One Sample t-Test: Review
    5.2.2 One Sample t-Test: Example
    5.2.3 Two Sample t-Test: Review
    5.2.4 Two Sample t-Test: Example
  5.3 Simulation
    5.3.1 Paired Differences
    5.3.2 Distribution of a Sample Mean

6 R Resources
  6.1 Beginner Tutorials and References
  6.2 Intermediate References
  6.3 Advanced References
  6.4 Quick Comparisons to Other Languages
  6.5 RStudio and RMarkdown Videos
  6.6 RMarkdown Template

7 Simple Linear Regression
  7.1 Modeling
    7.1.1 Simple Linear Regression Model
  7.2 Least Squares Approach
    7.2.1 Making Predictions
    7.2.2 Residuals
    7.2.3 Variance Estimation
  7.3 Decomposition of Variation
    7.3.1 Coefficient of Determination
  7.4 The lm Function
  7.5 Maximum Likelihood Estimation (MLE) Approach
  7.6 Simulating SLR
  7.7 History
  7.8 R Markdown

8 Inference for Simple Linear Regression
  8.1 Gauss–Markov Theorem
  8.2 Sampling Distributions
    8.2.1 Simulating Sampling Distributions
  8.3 Standard Errors
  8.4 Confidence Intervals for Slope and Intercept
  8.5 Hypothesis Tests
  8.6 cars Example
    8.6.1 Tests in R
    8.6.2 Significance of Regression, t-Test
    8.6.3 Confidence Intervals in R
  8.7 Confidence Interval for Mean Response
  8.8 Prediction Interval for New Observations
  8.9 Confidence and Prediction Bands
  8.10 Significance of Regression, F-Test
  8.11 R Markdown

9 Multiple Linear Regression
  9.1 Matrix Approach to Regression
  9.2 Sampling Distribution
    9.2.1 Single Parameter Tests
    9.2.2 Confidence Intervals
    9.2.3 Confidence Intervals for Mean Response
    9.2.4 Prediction Intervals
  9.3 Significance of Regression
  9.4 Nested Models
  9.5 Simulation
  9.6 R Markdown

10 Model Building
  10.1 Family, Form, and Fit
    10.1.1 Fit
    10.1.2 Form
    10.1.3 Family
    10.1.4 Assumed Model, Fitted Model
  10.2 Explanation versus Prediction
    10.2.1 Explanation
    10.2.2 Prediction
  10.3 Summary
  10.4 R Markdown

11 Categorical Predictors and Interactions
  11.1 Dummy Variables
  11.2 Interactions
  11.3 Factor Variables
    11.3.1 Factors with More Than Two Levels
  11.4 Parameterization
  11.5 Building Larger Models
  11.6 R Markdown

12 Analysis of Variance
  12.1 Experiments
  12.2 Two-Sample t-Test
  12.3 One-Way ANOVA
    12.3.1 Factor Variables
    12.3.2 Some Simulation
    12.3.3 Power
  12.4 Post Hoc Testing
  12.5 Two-Way ANOVA
  12.6 R Markdown

13 Model Diagnostics
  13.1 Model Assumptions
  13.2 Checking Assumptions
    13.2.1 Fitted versus Residuals Plot
    13.2.2 Breusch-Pagan Test
    13.2.3 Histograms
    13.2.4 Q-Q Plots
    13.2.5 Shapiro-Wilk Test
  13.3 Unusual Observations
    13.3.1 Leverage
    13.3.2 Outliers
    13.3.3 Influence
  13.4 Data Analysis Examples
    13.4.1 Good Diagnostics
    13.4.2 Suspect Diagnostics
  13.5 R Markdown

14 Transformations
  14.1 Response Transformation
    14.1.1 Variance Stabilizing Transformations
    14.1.2 Box-Cox Transformations
  14.2 Predictor Transformation
    14.2.1 Polynomials
    14.2.2 A Quadratic Model
    14.2.3 Overfitting and Extrapolation
    14.2.4 Comparing Polynomial Models
    14.2.5 poly() Function and Orthogonal Polynomials
    14.2.6 Inhibit Function
    14.2.7 Data Example
  14.3 R Markdown

15 Collinearity
  15.1 Exact Collinearity
  15.2 Collinearity
    15.2.1 Variance Inflation Factor
  15.3 Simulation
  15.4 R Markdown

16 Variable Selection and Model Building
  16.1 Quality Criterion
    16.1.1 Akaike Information Criterion
    16.1.2 Bayesian Information Criterion
    16.1.3 Adjusted R-Squared
    16.1.4 Cross-Validated RMSE
  16.2 Selection Procedures
    16.2.1 Backward Search
    16.2.2 Forward Search
    16.2.3 Stepwise Search
    16.2.4 Exhaustive Search
  16.3 Higher Order Terms
  16.4 Explanation versus Prediction
    16.4.1 Explanation
    16.4.2 Prediction
  16.5 R Markdown

17 Logistic Regression
  17.1 Generalized Linear Models
  17.2 Binary Response
    17.2.1 Fitting Logistic Regression
    17.2.2 Fitting Issues
    17.2.3 Simulation Examples
  17.3 Working with Logistic Regression
    17.3.1 Testing with GLMs
    17.3.2 Wald Test
    17.3.3 Likelihood-Ratio Test
    17.3.4 SAheart Example
    17.3.5 Confidence Intervals
    17.3.6 Confidence Intervals for Mean Response
    17.3.7 Formula Syntax
    17.3.8 Deviance
  17.4 Classification
    17.4.1 spam Example
    17.4.2 Evaluating Classifiers
  17.5 R Markdown

18 Beyond
  18.1 What’s Next
  18.2 RStudio
  18.3 Tidy Data
  18.4 Visualization
  18.5 Web Applications
  18.6 Experimental Design
  18.7 Machine Learning
    18.7.1 Deep Learning
  18.8 Time Series
  18.9 Bayesianism
  18.10 High Performance Computing
  18.11 Further R Resources

19 Appendix
Chapter 1

Introduction

Welcome to Applied Statistics with R!

1.1 About This Book

This book was originally (and currently) designed for use with STAT 420,
Methods of Applied Statistics, at the University of Illinois Urbana-Champaign.
It may certainly be used elsewhere, but any references to “this course” in this
book specifically refer to STAT 420.

This book is under active development. When possible, it is best to access the
text online to be sure you are using the most up-to-date version. The html
version also provides additional features such as changing text size, font, and
colors. If you need a local copy, a pdf version is continuously maintained.
However, because a pdf uses pages, the formatting may not be as functional. (In
other words, the author needs to go back and spend some time working on the pdf
formatting.)

Since this book is under active development, you may encounter errors ranging
from typos, to broken code, to poorly explained topics. If you do, please let us
know! Simply send an email (dalpiaz2 AT illinois DOT edu) and we will make the
changes as soon as possible. Or, if you know RMarkdown and are familiar with
GitHub, make a pull request and fix an issue yourself! This process is
partially automated by the edit button in the top-left corner of the html version.
If your suggestion or fix becomes part of the book, you will be added to the list
at the end of this chapter. We’ll also link to your GitHub account or personal
website upon request.

This text uses MathJax to render mathematical notation for the web.
Occasionally, but rarely, a JavaScript error will prevent MathJax from rendering
correctly. In this case, you will see the “code” instead of the expected
mathematical equations. From experience, this is almost always fixed by simply
refreshing the page. You’ll also notice that if you right-click any equation you
can obtain the MathML Code (for copying into Microsoft Word) or the TeX
command used to generate the equation.

$$a^2 + b^2 = c^2$$

1.2 Conventions
R code will be typeset using a monospace font which is syntax highlighted.

a = 3
b = 4
sqrt(a^2 + b^2)

R output lines, which would appear in the console, will begin with ##. They will
generally not be syntax highlighted.

## [1] 5

We use the quantity $p$ to refer to the number of $\beta$ parameters in a linear model,
not the number of predictors. Don’t worry if you don’t know what this means
yet!

1.3 Acknowledgements
Material in this book was heavily influenced by:

• Alex Stepanov
  – Longtime instructor of STAT 420 at the University of Illinois at
    Urbana-Champaign. The author of this book actually took Alex’s
    STAT 420 class many years ago! Alex provided or inspired many of
    the examples in the text.
• David Unger
  – Another STAT 420 instructor at the University of Illinois at
    Urbana-Champaign. Co-taught with the author during the summer of 2016
    while this book was first being developed. Provided endless hours of
    copy editing and countless suggestions.
• James Balamuta
  – Current graduate student at the University of Illinois at
    Urbana-Champaign. Provided the initial push to write this book by
    introducing the author to the bookdown package in R. Also a frequent
    contributor via GitHub.

Your name could be here! Suggest an edit! Correct a typo! If you submit a
correction and would like to be listed below, please provide your name as you
would like it to appear, as well as a link to a GitHub, LinkedIn, or personal
website.

• Daniel McQuillan
• Mason Rubenstein
• Yuhang Wang
• Zhao Liu
• Jinfeng Xiao
• Somu Palaniappan
• Michael Hung-Yiu Chan
• Eloise Rosen
• Kiomars Nassiri
• Jeff Gerlach
• Brandon Ching
• Ray Fix
• Tyler Kim
• Yeongho Kim
• Elmar Langholz
• Thai Duy Cuong Nguyen
• Junyoung Kim
• Sezgin Kucukcoban
• Tony Ma
• Radu Manolescu
• Dileep Pasumarthi
• Sihun Wang
• Joseph Wilson
• Yingkui Lin
• Andy Siddall
• Nishant Balepur
• Durga Krovi
• Raj Krishnan
• Ed Pureza
• Siddharth Singh
• Schillaci Mcinnis
• Ivan Valdes Castillo
• Tony Mu
• Salman Yousaf
• Yutaro Nishiyama
• Regina Sahani Goonetilleke
• Paul Zuradzki
• Will Tsai
• Ellen Veomett
• David Newman
• Yixing Zheng
• Ossama Sybesma
• Shaneal Findley

1.4 License

Figure 1.1: This work is licensed under a Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International License.
Chapter 2

Introduction to R

2.1 Getting Started

R is both a programming language and software environment for statistical
computing, which is free and open-source. To get started, you will need to install
two pieces of software:

• R, the actual programming language.
  – Choose your operating system, and select the most recent version,
    4.4.2.
• RStudio, an excellent IDE for working with R.
  – Note, you must have R installed to use RStudio. RStudio is simply
    an interface used to interact with R.

The popularity of R is on the rise, and every day it becomes a better tool for
statistical analysis. It even generated this book! (A skill you will learn in this
course.) There are many good resources for learning R.

The following few chapters will serve as a whirlwind introduction to R. They are
by no means meant to be a complete reference for the R language, but simply an
introduction to the basics that we will need along the way. Several of the more
important topics will be re-stressed as they are actually needed for analyses.

These introductory R chapters may feel like an overwhelming amount of
information. You are not expected to pick up everything the first time through. You
should try all of the code from these chapters, then return to them a number of
times as you revisit the concepts when performing analyses.

R is used both for software development and data analysis. We will operate in a
grey area, somewhere between these two tasks. Our main goal will be to analyze
data, but we will also perform programming exercises that help illustrate certain
concepts.

RStudio has a large number of useful keyboard shortcuts. A list of these can be
found using a keyboard shortcut – the keyboard shortcut to rule them all:

• On Windows: Alt + Shift + K

• On Mac: Option + Shift + K

The RStudio team has developed a number of “cheatsheets” for working with
both R and RStudio. This particular cheatsheet for “Base” R will summarize
many of the concepts in this document. (“Base” R is a name used to differentiate
the practice of using built-in R functions, as opposed to using functions from
outside packages, in particular, those from the tidyverse. More on this later.)

When programming, it is often a good practice to follow a style guide. (Where do
spaces go? Tabs or spaces? Underscores or CamelCase when naming variables?)
No style guide is “correct,” but it helps to be aware of what others do. The
more important thing is to be consistent within your own code.

• Hadley Wickham Style Guide from Advanced R

• Google Style Guide

For this course, our main deviation from these two guides is the use of = in place
of <-. (More on that later.)

2.2 Basic Calculations

To get started, we’ll use R like a simple calculator.

Addition, Subtraction, Multiplication, and Division

Math            R        Result
$3 + 2$         3 + 2    5
$3 - 2$         3 - 2    1
$3 \cdot 2$     3 * 2    6
$3 / 2$         3 / 2    1.5

Exponents

Math             R                Result
$3^2$            3 ^ 2            9
$2^{(-3)}$       2 ^ (-3)         0.125
$100^{1/2}$      100 ^ (1 / 2)    10
$\sqrt{100}$     sqrt(100)        10

Mathematical Constants

Math      R         Result
$\pi$     pi        3.1415927
$e$       exp(1)    2.7182818

Logarithms

Note that we will use ln and log interchangeably to mean the natural logarithm.
There is no ln() in R; instead, R uses log() to mean the natural logarithm.

Math                  R                    Result
$\log(e)$             log(exp(1))          1
$\log_{10}(1000)$     log10(1000)          3
$\log_{2}(8)$         log2(8)              3
$\log_{4}(16)$        log(16, base = 4)    2
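
As a quick check of the logarithm table, the short sketch below evaluates each entry and confirms that the base argument of log() agrees with the change-of-base formula. The variable name change_of_base and the use of all.equal() for the comparison are our own choices for illustration, not something the text prescribes.

```r
# Natural logarithm: log() with no base argument
log(exp(1))

# Dedicated functions for base 10 and base 2
log10(1000)
log2(8)

# Any other base is supplied through the base argument
log(16, base = 4)

# Change of base: log_4(16) = log(16) / log(4)
change_of_base = log(16) / log(4)

# Compare with a tolerance instead of ==, since these are floating point values
all.equal(log(16, base = 4), change_of_base)
```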

Trigonometry

Math             R              Result
$\sin(\pi/2)$    sin(pi / 2)    1
$\cos(0)$        cos(0)         1

2.3 Getting Help

In using R as a calculator, we have seen a number of functions: sqrt(), exp(),
log() and sin(). To get documentation about a function in R, simply put
a question mark in front of the function name and RStudio will display the
documentation, for example:

?log
?sin
?paste
?lm
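
A few related base R helpers can complement the ? shortcut. The sketch below is only a sample of what is available, and the search terms used are arbitrary examples chosen for illustration.

```r
# ?log is shorthand for help("log")
help("log")

# Search the installed documentation when the exact function name is unknown
help.search("logarithm")   # the same search as ??logarithm

# List objects on the search path whose names start with "log"
log_functions = apropos("^log")
log_functions

# Show only the arguments of a function, without the full help page
log_args = args(log)
log_args
```

Here args() is handy when you only need a reminder of argument names and defaults, for example that log() takes x and base.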

Frequently one of the most difficult things to do when learning R is asking for
help. First, you need to decide to ask for help, then you need to know how
to ask for help. Your very first line of defense should be to Google your error
message or a short description of your issue. (The ability to solve problems
using this method is quickly becoming an extremely valuable skill.) If that fails,
and it eventually will, you should ask for help. There are a number of things
you should include when emailing an instructor, or posting to a help website
such as Stack Exchange.

• Describe what you expect the code to do.

• State the end goal you are trying to achieve. (Sometimes what you expect
the code to do is not what you actually want to do.)
• Provide the full text of any errors you have received.
• Provide enough code to recreate the error. Often for the purpose of this
course, you could simply email your entire .R or .Rmd file.
• Sometimes it is also helpful to include a screenshot of your entire RStudio
window when the error occurs.
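
The checklist above could be put into practice with a short, self-contained script like the one below. This is a hypothetical example: the variable names and the deliberately planted bug are ours, and the point is only to show a question that states the goal, the expectation, the full message, and enough code to reproduce the problem.

```r
# End goal: compute the average of four measurements
# Expectation: mean() should return 4
measurements = c(1, 3, 5, "7")   # the quoted "7" is the planted bug

# Capture the full text of the warning instead of paraphrasing it
warning_text = tryCatch(
  mean(measurements),
  warning = function(w) conditionMessage(w)
)
warning_text

# Inspecting the object often reveals the cause: the vector is not numeric
class(measurements)

# Version information that people helping you may ask for
R.version.string
```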

If you follow these steps, you will get your issue resolved much more quickly, and
possibly learn more in the process. Do not be discouraged by running into
errors and difficulties when learning R (or any technical skill). It is simply part
of the learning process.

2.4 Installing Packages

R comes with a number of built-in functions and datasets, but one of the main
strengths of R as an open-source project is its package system. Packages add
additional functions and data. Frequently if you want to do something in R,
and it is not available by default, there is a good chance that there is a package
that will fulfill your needs.
To install a package, use the install.packages() function. Think of this as
buying a recipe book from the store, bringing it home, and putting it on your
shelf.

install.packages("ggplot2")

Once a package is installed, it must be loaded into your current R session before
being used. Think of this as taking the book off of the shelf and opening it up
to read.

library(ggplot2)

Once you close R, all the packages are closed and put back on the imaginary
shelf. The next time you open R, you do not have to install the package again,
but you do have to load any packages you intend to use by invoking library().
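
Because installing happens once and loading happens every session, it can be convenient to check whether a package is already on the shelf before installing it. The sketch below assumes a hypothetical helper named is_installed(), which is not part of R or of this book; it simply wraps the base function requireNamespace(). The install line is left commented out because it needs internet access.

```r
# Hypothetical helper: TRUE if the package is already installed
is_installed = function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# The stats package ships with R, so this should be TRUE
is_installed("stats")

# Install only when missing, then load for the current session
if (!is_installed("ggplot2")) {
  # install.packages("ggplot2")   # uncomment to install
  message("ggplot2 is not installed yet")
} else {
  library(ggplot2)
}
```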
Chapter 3

Data and Programming

3.1 Data Types

R has a number of basic data types.

• Numeric
– Also known as Double. The default type when dealing with numbers.
– Examples: 1, 1.0, 42.5
• Integer
– Examples: 1L, 2L, 42L
• Complex
– Example: 4 + 2i
• Logical
– Two possible values: TRUE and FALSE
– You can also use T and F, but this is not recommended.
– NA is also considered logical.
• Character
– Examples: "a", "Statistics", "1 plus 2."
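
The types listed above can be inspected directly with typeof() and class(). The following sketch uses values taken from the examples in the list; the demonstration of overwriting T is our own illustration of why T and F are not recommended.

```r
typeof(42.5)           # "double"
typeof(1L)             # "integer"
typeof(4 + 2i)         # "complex"
typeof(TRUE)           # "logical"
typeof(NA)             # "logical"
typeof("Statistics")   # "character"

# class() reports "numeric" for doubles
class(42.5)            # "numeric"

# T and F are ordinary variables that can be overwritten; TRUE and FALSE cannot
T = 0
isTRUE(T)              # FALSE, because T now holds 0
rm(T)                  # remove our variable so T refers to TRUE again
```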

3.2 Data Structures

R also has a number of basic data structures. A data structure is either
homogeneous (all elements are of the same data type) or heterogeneous (elements can
be of more than one data type).


Dimension    Homogeneous    Heterogeneous
1            Vector         List
2            Matrix         Data Frame
3+           Array

3.2.1 Vectors

Many operations in R make heavy use of vectors. Vectors in R are indexed
starting at 1. That is what the [1] in the output indicates: the first
element of the row being displayed is the first element of the vector. Larger
vectors will start additional rows with [*] where * is the index of the first
element of the row.
Possibly the most common way to create a vector in R is using the c() function,
which is short for “combine.” As the name suggests, it combines a list of elements
separated by commas.

c(1, 3, 5, 7, 8, 9)

## [1] 1 3 5 7 8 9

Here R simply outputs this vector. If we would like to store this vector in a
variable, we can do so with the assignment operator =. In this case the
variable x now holds the vector we just created, and we can access the vector
by typing x.

x = c(1, 3, 5, 7, 8, 9)
x

## [1] 1 3 5 7 8 9

As an aside, there is a long history of the assignment operator in R, partially
due to the keys available on the keyboards of the creators of the S language
(which preceded R). For simplicity we will use =, but know that often you will
see <- as the assignment operator.

The pros and cons of these two are well beyond the scope of this book, but
know that for our purposes you will have no issue if you simply use =. If you
are interested in the weird cases where the difference matters, check out The R
Inferno.

If you wish to use <-, you will still need to use =, however only for argument
passing. Some users like to keep assignment (<-) and argument passing (=)
separate. No matter what you choose, the more important thing is that you
stay consistent. Also, if working on a larger collaborative project, you should
use whatever style is already in place.
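For readers curious about the two styles, the following is a small sketch rather than a recommendation. It shows that both operators create the same vector at the top level, and it shows one case where they differ inside a function call. The names v, w, s1, s2, and first are made up for this illustration.

```r
v <- c(1, 3, 5)    # assignment with <-
w = c(1, 3, 5)     # assignment with =
identical(v, w)    # TRUE: both create the same vector

# Inside a call, = names an argument and creates no variable.
s1 = seq(from = 1, to = 3)

# Inside a call, <- assigns a variable in the calling environment
# and then passes its value as an unnamed argument.
s2 = seq(first <- 1, to = 3)
first              # 1
identical(s1, s2)  # TRUE
```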
Because vectors must contain elements that are all the same type, R will
automatically coerce to a single type when attempting to create a vector that
combines multiple types.

c(42, "Statistics", TRUE)

## [1] "42" "Statistics" "TRUE"

c(42, TRUE)

## [1] 42 1
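One way to see which type R settled on is to ask for the class of the result. The sketch below suggests that coercion moves toward the most flexible type present (logical, then integer, then numeric, then character). The variable names are illustrative.

```r
mixed_chr = c(42, "Statistics", TRUE)  # a character value forces character
mixed_num = c(42, TRUE)                # numeric and logical become numeric
mixed_int = c(42L, TRUE)               # integer and logical become integer

class(mixed_chr)   # "character"
class(mixed_num)   # "numeric"
class(mixed_int)   # "integer"
```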

Frequently you may wish to create a vector based on a sequence of numbers.
The quickest and easiest way to do this is with the : operator, which creates a
sequence of integers between two specified integers.

(y = 1:100)

## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100

Here we see R labeling the rows after the first since this is a large vector. Also,
we see that by putting parentheses around the assignment, R both stores the
vector in a variable called y and automatically outputs y to the console.

Note that scalars do not exist in R. They are simply vectors of length 1.

2

## [1] 2

To create a sequence that is not limited to consecutive integers, use the seq()
function to define a sequence by its start, end, and increment.

seq(from = 1.5, to = 4.2, by = 0.1)

## [1] 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3
## [20] 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2

We will discuss functions in detail later, but note here that the input labels
from, to, and by are optional.

seq(1.5, 4.2, 0.1)

## [1] 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3
## [20] 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2
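Besides by, seq() also accepts a length.out argument, which fixes the number of elements instead of the increment. The sketch below is a small illustration; the variable names s and down are not from the text.

```r
# Five evenly spaced values from 0 to 1; R works out the increment
s = seq(from = 0, to = 1, length.out = 5)
s        # 0.00 0.25 0.50 0.75 1.00

# A negative increment produces a decreasing sequence
down = seq(10, 1, by = -3)
down     # 10 7 4 1
```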

Another common operation to create a vector is rep(), which can repeat a
single value a number of times.

rep("A", times = 10)

## [1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"

The rep() function can also be used to repeat a vector some number of times.

rep(x, times = 3)

## [1] 1 3 5 7 8 9 1 3 5 7 8 9 1 3 5 7 8 9
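The rep() function also has an each argument, and times may itself be a vector. The following sketch contrasts the three forms on a small illustrative vector.

```r
ab = c("A", "B")

r_times = rep(ab, times = 2)        # repeat the whole vector
r_each  = rep(ab, each = 2)         # repeat each element in turn
r_vec   = rep(ab, times = c(3, 1))  # a separate count for each element

r_times   # "A" "B" "A" "B"
r_each    # "A" "A" "B" "B"
r_vec     # "A" "A" "A" "B"
```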

We have now seen four different ways to create vectors:

• c()
• :
• seq()
• rep()

So far we have mostly used them in isolation, but they are often used together.

c(x, rep(seq(1, 9, 2), 3), c(1, 2, 3), 42, 2:4)

## [1] 1 3 5 7 8 9 1 3 5 7 9 1 3 5 7 9 1 3 5 7 9 1 2 3 42
## [26] 2 3 4

The length of a vector can be obtained with the length() function.

length(x)

## [1] 6

length(y)

## [1] 100

3.2.1.1 Subsetting

To subset a vector, we use square brackets, [].

x

## [1] 1 3 5 7 8 9

x[1]

## [1] 1

x[3]

## [1] 5

We see that x[1] returns the first element, and x[3] returns the third element.

x[-2]

## [1] 1 5 7 8 9

We can also exclude certain indexes, in this case the second element.

x[1:3]

## [1] 1 3 5

x[c(1,3,4)]

## [1] 1 5 7

Lastly we see that we can subset based on a vector of indices.

All of the above are subsetting a vector using a vector of indexes. (Remember a
single number is still a vector.) We could instead use a vector of logical values.

z = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)
z

## [1] TRUE TRUE FALSE TRUE TRUE FALSE

x[z]

## [1] 1 3 7 8
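The subsetting styles above can be combined and compared. The sketch below redefines x and z so that it runs on its own. It shows excluding several elements at once, converting a logical vector to positions with which(), negating a logical vector, and what happens when an index is out of range.

```r
x = c(1, 3, 5, 7, 8, 9)
z = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)

x[-c(1, 2)]   # drop the first two elements: 5 7 8 9
x[which(z)]   # positions of the TRUE values, same as x[z]: 1 3 7 8
x[!z]         # elements where z is FALSE: 5 9
x[10]         # an index past the end returns NA rather than an error
```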

3.2.2 Vectorization

One of the biggest strengths of R is its use of vectorized operations. (Frequently
the lack of understanding of this concept leads to a belief that R is slow. R is
not the fastest language, but it has a reputation for being slower than it really
is.)

x = 1:10
x + 1

## [1] 2 3 4 5 6 7 8 9 10 11

2 * x

## [1] 2 4 6 8 10 12 14 16 18 20

2 ^ x

## [1] 2 4 8 16 32 64 128 256 512 1024

sqrt(x)

## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
## [9] 3.000000 3.162278

log(x)

## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
## [8] 2.0794415 2.1972246 2.3025851

We see that when a function like log() is called on a vector x, a vector is
returned which has applied the function to each element of the vector x.
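To see what vectorization saves us from writing, the sketch below computes the same logarithms twice: once with an explicit loop, as one might in a language without vectorized functions, and once with a single vectorized call. The names log_loop and log_vec are illustrative.

```r
x = 1:10

# Explicit loop: allocate the result, then fill it one element at a time
log_loop = numeric(length(x))
for (i in seq_along(x)) {
  log_loop[i] = log(x[i])
}

# Vectorized call: log() is applied to every element in one step
log_vec = log(x)

all.equal(log_loop, log_vec)   # TRUE
```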

3.2.3 Logical Operators

Operator   Summary                         Example                Result

x < y      x less than y                   3 < 42                 TRUE
x > y      x greater than y                3 > 42                 FALSE
x <= y     x less than or equal to y       3 <= 42                TRUE
x >= y     x greater than or equal to y    3 >= 42                FALSE
x == y     x equal to y                    3 == 42                FALSE
x != y     x not equal to y                3 != 42                TRUE
!x         not x                           !(3 > 42)              TRUE
x | y      x or y                          (3 > 42) | TRUE        TRUE
x & y      x and y                         (3 < 4) & (42 > 13)    TRUE

In R, logical operators are vectorized.

x = c(1, 3, 5, 7, 8, 9)

x > 3

## [1] FALSE FALSE TRUE TRUE TRUE TRUE

x < 3

## [1] TRUE FALSE FALSE FALSE FALSE FALSE

x == 3

## [1] FALSE TRUE FALSE FALSE FALSE FALSE

x != 3

## [1] TRUE FALSE TRUE TRUE TRUE TRUE

x == 3 & x != 3

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

x == 3 | x != 3

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

This is extremely useful for subsetting.

x[x > 3]

## [1] 5 7 8 9

x[x != 3]

## [1] 1 5 7 8 9
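Because & and | are also vectorized, several conditions can be combined inside the brackets. The sketch below redefines x so that it runs on its own; the variable names for the results are illustrative.

```r
x = c(1, 3, 5, 7, 8, 9)

middle  = x[x > 3 & x < 9]   # both conditions must hold: 5 7 8
extreme = x[x < 3 | x > 8]   # either condition may hold: 1 9
small   = x[!(x > 3)]        # negate a condition: 1 3
```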

sum(x > 3)

## [1] 4

as.numeric(x > 3)

## [1] 0 0 1 1 1 1

Here we see that applying the sum() function to the vector of logical TRUE and
FALSE values produced by x > 3 gives a numeric result. R first automatically
coerces the logical values to numeric, where TRUE is 1 and FALSE is 0. This
coercion from logical to numeric happens for most mathematical operations. If
you are interested in more detail, check out Advanced R.
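The same coercion makes it easy to compute proportions. As a small sketch, taking the mean of a logical vector gives the fraction of elements that are TRUE. The names n_above and p_above are illustrative.

```r
x = c(1, 3, 5, 7, 8, 9)

n_above = sum(x > 3)    # count of elements greater than 3: 4
p_above = mean(x > 3)   # proportion of elements greater than 3: 4 / 6

TRUE + TRUE             # 2, since each TRUE is coerced to 1
```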

which(x > 3)

## [1] 3 4 5 6

x[which(x > 3)]

## [1] 5 7 8 9

max(x)

## [1] 9

which(x == max(x))

## [1] 6

which.max(x)

## [1] 6
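The two approaches above agree here because the maximum of x is unique. With ties they differ: which.max() reports only the first position of the maximum, while which(v == max(v)) reports every position. The vector v below is made up for this illustration.

```r
v = c(2, 9, 4, 9)

first_max = which.max(v)         # position of the first maximum: 2
all_max   = which(v == max(v))   # positions of all maxima: 2 4
first_min = which.min(v)         # position of the first minimum: 1
```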

3.2.4 More Vectorization

x = c(1, 3, 5, 7, 8, 9)
y = 1:100

x + 2

## [1] 3 5 7 9 10 11

x + rep(2, 6)

## [1] 3 5 7 9 10 11

x > 3

## [1] FALSE FALSE TRUE TRUE TRUE TRUE

x > rep(3, 6)

## [1] FALSE FALSE TRUE TRUE TRUE TRUE

x + y

## Warning in x + y: longer object length is not a multiple of shorter object
## length

## [1] 2 5 8 11 13 15 8 11 14 17 19 21 14 17 20 23 25 27
## [19] 20 23 26 29 31 33 26 29 32 35 37 39 32 35 38 41 43 45
## [37] 38 41 44 47 49 51 44 47 50 53 55 57 50 53 56 59 61 63
## [55] 56 59 62 65 67 69 62 65 68 71 73 75 68 71 74 77 79 81
## [73] 74 77 80 83 85 87 80 83 86 89 91 93 86 89 92 95 97 99
## [91] 92 95 98 101 103 105 98 101 104 107

length(x)

## [1] 6

length(y)

## [1] 100

length(y) / length(x)

## [1] 16.66667

(x + y) - y

## Warning in x + y: longer object length is not a multiple of shorter object
## length

## [1] 1 3 5 7 8 9 1 3 5 7 8 9 1 3 5 7 8 9 1 3 5 7 8 9 1 3 5 7 8 9 1 3 5 7 8 9 1
## [38] 3 5 7 8 9 1 3 5 7 8 9 1 3 5 7 8 9 1 3 5 7 8 9 1 3 5 7 8 9 1 3 5 7 8 9 1 3
## [75] 5 7 8 9 1 3 5 7 8 9 1 3 5 7 8 9 1 3 5 7 8 9 1 3 5 7

y = 1:60
x + y

## [1] 2 5 8 11 13 15 8 11 14 17 19 21 14 17 20 23 25 27 20 23 26 29 31 33 26
## [26] 29 32 35 37 39 32 35 38 41 43 45 38 41 44 47 49 51 44 47 50 53 55 57 50 53
## [51] 56 59 61 63 56 59 62 65 67 69

length(y) / length(x)

## [1] 10

rep(x, 10) + y

## [1] 2 5 8 11 13 15 8 11 14 17 19 21 14 17 20 23 25 27 20 23 26 29 31 33 26
## [26] 29 32 35 37 39 32 35 38 41 43 45 38 41 44 47 49 51 44 47 50 53 55 57 50 53
## [51] 56 59 61 63 56 59 62 65 67 69

all(x + y == rep(x, 10) + y)

## [1] TRUE

identical(x + y, rep(x, 10) + y)

## [1] TRUE
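The behavior shown above is usually called recycling: when two vectors differ in length, R repeats the shorter one until it matches the longer one, and warns if the repetition does not fit evenly. As a sketch of what happens in the uneven case, the base function rep_len() can reproduce the recycled vector by hand. The names recycled, manual, and auto are illustrative.

```r
x = c(1, 3, 5, 7, 8, 9)
y = 1:100

# Repeat x and cut it off at the length of y (16 full copies plus 4 elements)
recycled = rep_len(x, length(y))

manual = recycled + y              # addition with the recycling done by hand
auto   = suppressWarnings(x + y)   # addition with R's automatic recycling

identical(manual, auto)            # TRUE
```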

# ?any
# ?all.equal

x = c(1, 3, 5)
y = c(1, 2, 4)
x == y

## [1] TRUE FALSE FALSE

all(x == y)

## [1] FALSE

any(x == y)

## [1] TRUE

While all returns TRUE only when all of its arguments are TRUE, any returns
TRUE when at least one of its arguments is TRUE.

x = c(10 ^ (-8))
y = c(10 ^ (-9))
all(x == y)

## [1] FALSE

all.equal(x, y)

## [1] TRUE

The all.equal function tests “near equality” with a default tolerance value
around 1.5e-8 and returns TRUE if all of its arguments have differences smaller
than the tolerance.
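A common place where near equality matters is floating point arithmetic. The sketch below uses made-up values a and b. Note that when the values are not nearly equal, all.equal() returns a character description of the difference rather than FALSE, so it is typically wrapped in isTRUE() when a single logical value is needed.

```r
a = 0.1 + 0.2
b = 0.3

a == b                      # FALSE, because of floating point rounding
all.equal(a, b)             # TRUE, the difference is within the tolerance

near = isTRUE(all.equal(a, b))     # TRUE
far  = isTRUE(all.equal(1, 1.1))   # FALSE; all.equal() returned a message
```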
