
Correlation and Simple Linear Regression Guide

This document provides an overview of correlation and simple linear regression analyses. It defines correlation analysis as measuring the strength of the linear or nonlinear relationship between two continuous variables. Simple linear regression focuses on evaluating the impact of a predictor variable on an outcome variable. The document outlines the Pearson correlation coefficient and Spearman's rank correlation coefficient, and explains how to interpret their values. It also describes the key aspects and assumptions of simple linear regression models, including the least squares method for estimating regression parameters.

Correlation and Simple Linear Regression Analyses

Objectives
At the end of this module, the students will be able to:
1. Describe the correlation and simple linear regression analyses;
2. Appreciate the importance of correlation and simple linear regression analyses; and
3. Apply the concepts of correlation and simple linear regression analyses.

A. Introduction to Correlation and Simple Linear Regression Analysis


(Lind et al., 2006; Mann, 2004; Freund and Simon, 1997)

Analyses between two variables may focus on:


(a) Any association between the variables;
(b) The value of one variable in predicting the other; and
(c) The magnitude or strength of the relationship.

For example:
• Family income and expenditure on luxury items.
• Sales revenue and expenses incurred on advertising.
• Yield of a crop and quantity of fertilizer applied.

Aspects to consider in examining the statistical relationship between two or more variables:
• Is there an association between two or more variables? If yes, what is the form and degree of that relationship?
• Is the relationship strong or significant enough to arrive at a desirable conclusion?
• Can the relationship be used for predictive purposes, that is, to predict the most likely value of a dependent
variable corresponding to a given value of the independent variable or variables?

The objective of correlation analysis is to gain insight into the strength of the relationship whereas regression
analysis focuses on the form of the relationship between variables. These two techniques are used to investigate
relationships between continuous variables. Correlation analysis is often conducted in a retrospective or
observational study. In contrast, simple regression analysis is preferred when the aim is to evaluate the relative
impact of the predictor variable on a particular outcome.

Although correlation and regression analyses are mathematically similar, their purposes are different.
Correlation analysis is generally overused. It is often interpreted incorrectly (to establish “causation”) and should be
reserved for generating hypotheses rather than for testing them. On the other hand, regression modeling is a more
useful statistical technique that allows us to assess the strength of the relationships in the data and the uncertainty
in the model by using confidence intervals.

B. Correlation Analysis

Correlation analysis measures and interprets the strength of a linear or nonlinear (eg, exponential,
polynomial, and logistic) relationship between two continuous variables. When conducting correlation analysis, the
term association is used to mean “linear association”.
For this course, the focus is on the Pearson r and Spearman ρ correlation coefficients. Both correlation
coefficients take on values between -1 and +1, ranging from being negatively correlated (-1) to uncorrelated (0) to
positively correlated (+1). The sign of the correlation coefficient (ie, positive or negative) defines the direction of the
relationship. The absolute value indicates the strength of the correlation.

Specifically, the topics covered herein include two commonly used correlation coefficients, the Pearson correlation
coefficient and the Spearman ρ, for measuring linear and nonlinear (monotonic) relationships, respectively, between
two continuous variables.

TABLE 1. Interpretation of Correlation Coefficient

Correlation Coefficient Value    Direction and Strength of Correlation
-1.0                             Perfectly negative
-0.8                             Strongly negative
-0.5                             Moderately negative
-0.2                             Weakly negative
 0.0                             No association
+0.2                             Weakly positive
+0.5                             Moderately positive
+0.8                             Strongly positive
+1.0                             Perfectly positive

Note.—The sign of the correlation coefficient (ie, positive or negative) defines the direction of the
relationship. The absolute value indicates the strength of the correlation.

1. Linear Correlation

The Pearson correlation coefficient is also known as the sample correlation coefficient (r), product-
moment correlation coefficient, or coefficient of correlation. It was introduced by Galton in 1877 and developed
later by Pearson. It measures the linear relationship between two random variables.
For example, when the value of the predictor is manipulated (increased or decreased) by a fixed amount, the
outcome variable changes proportionally (linearly). A linear correlation coefficient can be computed by means of the
data and their sample means. When a scientific study is planned, the required sample size may be computed on the
basis of a certain hypothesized value with the desired statistical power at a specified level of significance.
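
As a minimal sketch of how a linear correlation coefficient can be computed from the data and their sample means, the Python function below applies the standard product-moment formula: the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The function name pearson_r and the fertilizer and crop-yield numbers are illustrative assumptions, not data from this module.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the sample means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: quantity of fertilizer applied (x) and crop yield (y)
fertilizer = [1, 2, 3, 4, 5]
crop_yield = [2, 4, 5, 4, 5]
r = pearson_r(fertilizer, crop_yield)
print(round(r, 4))  # 0.7746
```

For these made-up values, r is about 0.77, which Table 1 would place close to "strongly positive".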

2. Rank Correlation

The Spearman ρ is the sample correlation coefficient (rs or rho) of the ranks (the relative order) based on
continuous data. It was first introduced by Spearman in 1904. The Spearman ρ is used to measure the monotonic
relationship between two variables (ie, whether one variable tends to take a larger or a smaller value, though not
necessarily linearly, as the value of the other variable increases).

Linear vs Rank Correlation Coefficients

The Pearson correlation coefficient necessitates use of interval or continuous measurement scales of the
measured outcome in the study population. In contrast, rank correlations also work well with ordinal rating data, and
continuous data are reduced to their ranks. The rank procedure will also be illustrated briefly with our example data.
The smallest value in the sample has rank 1, and the largest has the highest rank. In general, rank correlations are not
easily influenced by the presence of skewed data or data that are highly variable.
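
The rank procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not this module's own worked example: tied values are given the average of their ranks (a common convention that the text does not specify), the Spearman coefficient is computed as the Pearson coefficient of the ranks, and the function names and data are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1 (smallest) to n (largest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman rank correlation: the Pearson coefficient of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical monotonic but nonlinear data: y is the cube of x
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(average_ranks([10, 30, 20, 20]))  # [1.0, 4.0, 2.5, 2.5]
print(spearman_rho(x, y))               # 1.0
print(pearson_r(x, y) < 1.0)            # True
```

Because y always increases with x, the ranks agree perfectly and the rank correlation is 1, while the Pearson coefficient stays below 1 because the relationship is not a straight line.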

Limitations and Precautions

It is worth noting that even if two variables (eg, cigarette smoking and lung cancer) are highly correlated, it
is not sufficient proof of causation. One variable may cause the other or vice versa, or a third factor is involved, or a
rare event may have occurred. To conclude causation, the causal variable must precede the variable it causes, and
several conditions must be met (eg, reversibility, strength, and exposure response on the basis of the Bradford-Hill
criteria or the Rubin causal model).

C. Simple Linear Regression


Linear Regression:
When the dependence of one variable on another is represented by a straight line, it is called linear regression;
otherwise it is said to be nonlinear or curvilinear regression.
For example, if ‘X’ is the independent variable and ‘Y’ is the dependent variable, then the relation Y = a + bX is
a linear regression.

The purpose of simple regression analysis is to evaluate the relative impact of a predictor variable on a
particular outcome. This is different from a correlation analysis where the purpose is to examine the strength and
direction of the relationship between two random variables.

Linear regression attempts to find a straight line that best “fits” the data, where the variation of the data
above and below the line is minimized. For this course, only the linear regression of one continuous variable on
another continuous variable with no gaps on each measurement scale is dealt with as an introductory topic. Higher
levels of regression analysis are dealt with at the postgraduate level. There are other types of regression (eg, multiple
linear, logistic, and ordinal) analyses, which are beyond the scope of this course in inferential statistics.

A simple regression model contains only one independent (explanatory) variable, Xi, for i = 1, . . ., n subjects,
and is linear with respect to both the regression parameters and the dependent variable. The corresponding dependent
(outcome) variable is labeled Yi. The model is expressed as

Yi = a + bXi + ei        (Eq. 1)

where the regression parameter a is the intercept (on the y axis), and the regression parameter b is the slope of the
regression line (Fig 1). The random error term ei is assumed to be uncorrelated, with a mean of 0 and constant
variance.
For convenience in inference and improved efficiency in estimation, analyses often incur an additional
assumption that the errors are distributed normally. Transformation of the data to achieve normality may be applied.
Thus, the acronym LINE (linear, independent, normal, equal variance) summarizes these requirements.
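
To make these assumptions concrete, the Python sketch below simulates data from the model Y = a + bX + e with independent, normally distributed errors that have mean 0 and constant variance. The intercept a = 1.0 and slope b = 2.0 are the values quoted in the Figure 1 caption; the error standard deviation, the range of X, the sample size, and the function name are assumptions made only for illustration.

```python
import random

def simulate_simple_regression(n, a=1.0, b=2.0, sigma=0.5, seed=0):
    """Draw n (x, y) pairs from Y = a + b*X + e with independent normal errors."""
    rng = random.Random(seed)
    # Assumed range for the independent variable: uniform on [0, 10]
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # Errors: independent, mean 0, constant standard deviation sigma
    errors = [rng.gauss(0.0, sigma) for _ in range(n)]
    ys = [a + b * x + e for x, e in zip(xs, errors)]
    return xs, ys, errors

xs, ys, errors = simulate_simple_regression(n=10_000)
mean_error = sum(errors) / len(errors)
print(f"mean of simulated errors: {mean_error:.4f}")
```

With a large simulated sample, the average error should be close to 0, in line with the zero-mean assumption.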

Figure 1. Simple linear regression model shows that the expectation of the dependent variable Y is linear in the
independent variable X, with an intercept a = 1.0 and a slope b = 2.0.

Typical steps for regression model analysis are the following:
(a) determine if the assumptions underlying a normal relationship are met in the data,
(b) obtain the equation that best fits the data,
(c) evaluate the equation to determine the strength of the relationship for prediction and estimation, and
(d) assess whether the data fit these criteria before the equation is applied for prediction and estimation.

Least Squares Method

The main goal of linear regression is to fit a straight line through the data that predicts Y based on X. To
estimate the intercept and slope regression parameters that determine this line, the least squares method is commonly
used. It is not necessary for the errors to have a normal distribution, although the regression analysis is more efficient
with this assumption. With this regression method, a set of regression parameters is found such that the sum of
squared residuals (ie, the differences between the observed values of the outcome variable and the fitted values) is
minimized. The fitted y value is then computed as a function of the given x value and the estimated intercept and slope
regression parameters. For example, in Eq. 1, once the estimates of a and b are obtained from the regression analysis,
the predicted y value at any given x value is calculated as a + bx.
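
A minimal Python sketch of the least squares method follows. It uses the standard closed-form solution for simple linear regression, in which the slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the intercept is the mean of y minus the slope times the mean of x. The function names and the data are hypothetical and serve only to illustrate the calculation.

```python
def fit_least_squares(x, y):
    """Return (a, b), the intercept and slope minimizing the sum of squared residuals."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    b = sxy / sxx              # estimated slope
    a = mean_y - b * mean_x    # estimated intercept
    return a, b

def predict(a, b, x_new):
    """Fitted y value at x_new, computed as a + b*x."""
    return a + b * x_new

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = fit_least_squares(x, y)
print(round(a, 2), round(b, 2))  # 2.2 0.6
```

For these values the fitted line is y = 2.2 + 0.6x, so the predicted y at x = 3 is about 4.0.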

Coefficient of Determination, R²

It is meaningful to interpret the value of the Pearson correlation coefficient r by squaring it; hence, the term
R-square (R²) or coefficient of determination. This measure (with a range of 0–1) is the fraction of the variability in Y
that can be explained by the variability in X through their linear relationship, or vice versa. That is, R² =
SSregression/SStotal, where SS stands for the sum of squares. Note that R² is calculated only on the basis of the
Pearson correlation coefficient in the linear regression analysis. Thus, it is not appropriate to compute R² on the basis
of rank correlation coefficients such as the Spearman ρ.
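
The relationship between the coefficient of determination and the Pearson coefficient can be checked numerically. The sketch below fits the least squares line, computes SSregression/SStotal, and compares the result with the square of r; the function name and the data are hypothetical.

```python
import math

def r_squared_and_r(x, y):
    """Return (SSregression / SStotal, Pearson r) for a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Least squares estimates of the slope and intercept
    b = sxy / sxx
    a = mean_y - b * mean_x
    fitted = [a + b * xi for xi in x]
    # Variability of the fitted values around the mean of y
    ss_regression = sum((f - mean_y) ** 2 for f in fitted)
    # Total variability of the observed y values around their mean
    ss_total = syy
    r = sxy / math.sqrt(sxx * syy)
    return ss_regression / ss_total, r

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r2, r = r_squared_and_r(x, y)
print(round(r2, 4), round(r ** 2, 4))  # 0.6 0.6
```

Here 60% of the variability in y is explained by its linear relationship with x, and the same value is obtained by squaring r.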

Limitations and Precautions

The following points should be kept in mind when regression analysis is performed.


(a) To understand whether the assumptions have been met, determine the magnitude of the gap between the data
and the assumptions of the model.
(b) No matter how strong a relationship is demonstrated with regression analysis, it should not be interpreted as
causation (as in the correlation analysis).
(c) The regression should not be used to predict or estimate outside the range of values of the independent
variable of the sample (eg, extrapolation of radiation cancer risk from the Hiroshima data to that of diagnostic
radiologic tests).
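
Precaution (c) can be enforced in code by refusing to predict outside the observed range of the independent variable. The following Python sketch shows one possible guard; the function name, the fitted values a = 2.2 and b = 0.6, and the sample are hypothetical.

```python
def predict_in_range(a, b, x_new, x_sample):
    """Return a + b*x_new only if x_new lies within the range of the sampled x values."""
    lo, hi = min(x_sample), max(x_sample)
    if not lo <= x_new <= hi:
        raise ValueError(
            f"x = {x_new} is outside the sampled range [{lo}, {hi}]; "
            "prediction would be an extrapolation"
        )
    return a + b * x_new

# Hypothetical fitted line y = 2.2 + 0.6x, estimated from x values between 1 and 5
x_sample = [1, 2, 3, 4, 5]
print(round(predict_in_range(2.2, 0.6, 3, x_sample), 2))  # 4.0

try:
    predict_in_range(2.2, 0.6, 10, x_sample)
except ValueError as exc:
    print("refused:", exc)
```

Raising an error rather than silently returning a value is a design choice in this sketch; a warning could be used instead where extrapolation is sometimes acceptable.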

Summary

When correlation analysis is conducted to measure the association between two random variables, either the
Pearson linear correlation coefficient or the Spearman rank correlation coefficient ρ may be adopted. The former
coefficient is used to measure the linear relationship but is not recommended for use with skewed data or data with
extremely large or small values (often called outliers). In contrast, the latter coefficient is used to measure a
general association, and it is recommended for use with data that are skewed or that have outliers.

When simple regression analysis is conducted to assess the linear relationship of a dependent variable as a
function of the independent variable, caution must be used when determining which of the two variables is viewed as
the independent variable that makes sense clinically. A useful graphical aid is a scatterplot.
Figure 1. Scatterplots of four sets of data generated by means of the following Pearson correlation coefficients
(from left to right): r = 0 (uncorrelated data), r = 0.8 (strongly positively correlated), r = 1.0 (perfectly positively
correlated), and r = -1 (perfectly negatively correlated).

Once the regression line is obtained, caution should also be used to avoid prediction of a y value for any value
of x that is outside the range of the data. Finally, correlation and regression analyses do not infer causality, and
more rigorous analyses are required if causal inference is to be made.
