CHAPTER 7
CORRELATION AND REGRESSION ANALYSIS
1. CORRELATION ANALYSIS
Correlation Analysis — a statistical method used to evaluate the strength and direction of
the linear relationship between two quantitative variables. It measures how closely two
variables move together.
Dependent Variable — the variable whose value is being predicted or explained. It is the
outcome or response variable in the analysis, often denoted as Y.
Independent Variable — the variable used to predict or explain changes in the dependent
variable. It is the predictor or explanatory variable, often denoted as X.
Scatter Diagram (Scatter Plot) — a graphical representation that shows the relationship
between two variables. Each point on the graph represents a pair of values (X, Y). It is used
to visually assess whether a relationship exists between two variables.
Positive Correlation — a relationship between two variables where both variables move in
the same direction. When one variable increases, the other also increases.
Negative Correlation — a relationship between two variables where the variables move in
opposite directions. When one variable increases, the other decreases.
Zero Correlation (No Correlation) — a situation where there is no linear relationship
between two variables. The variables may still be related in a nonlinear way, so a
correlation of zero does not by itself show that they are unrelated.
Perfect Positive Correlation — a relationship where the correlation coefficient equals
exactly +1, meaning all data points fall exactly on a straight line with a positive slope.
Perfect Negative Correlation — a relationship where the correlation coefficient equals
exactly -1, meaning all data points fall exactly on a straight line with a negative slope.
2. THE CORRELATION COEFFICIENT
Correlation Coefficient — a numerical measure that quantifies the strength and direction of
the linear relationship between two variables. It ranges from -1 to +1.
• A value close to +1 indicates a strong positive relationship.
• A value close to -1 indicates a strong negative relationship.
• A value close to 0 indicates little to no linear relationship.
a. Pearson's Correlation Coefficient
Pearson's Correlation Coefficient (r) — also called the Pearson Product-Moment
Correlation Coefficient, it is a measure of the linear relationship between two continuous,
quantitative variables. It assumes that the relationship between the variables is linear, and
significance tests on r further assume that the variables are approximately normally
distributed. It is the most commonly used correlation coefficient.
The value of r always falls between -1 and +1:
• r = +1 → perfect positive linear relationship
• r = -1 → perfect negative linear relationship
• r = 0 → no linear relationship
Strength Interpretation of Pearson's r (a common rule of thumb, applied to the size of r):
• ±0.00 to ±0.19 — Very Weak Correlation
• ±0.20 to ±0.39 — Weak Correlation
• ±0.40 to ±0.59 — Moderate Correlation
• ±0.60 to ±0.79 — Strong Correlation
• ±0.80 to ±1.00 — Very Strong Correlation
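As a minimal sketch of how r and the strength bands above might be computed, the following
Python snippet works on a small hypothetical data set. The function names pearson_r and
strength_label are illustrative and not part of any library, and the band cut-offs simply
follow the rule of thumb listed above.

```python
import math

def pearson_r(x, y):
    """Pearson's r from paired samples x and y (equal length, neither constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def strength_label(r):
    """Map the size of r to the rule-of-thumb bands listed above."""
    size = abs(r)
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    return "very strong"

# Hypothetical data, chosen only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4), strength_label(r))
```

The sign of r gives the direction of the relationship, while the label depends only on its size.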
b. Spearman's Rank Correlation Coefficient
Spearman's Rank Correlation Coefficient (rₛ or ρ) — a nonparametric measure of the
monotonic relationship between two variables. Instead of using the actual data values, it
uses the ranks of the data. It is used when the data are ordinal, or when the assumptions of
Pearson's correlation (normality, linearity) are not met.
Rank — the position of a data value when all values are arranged in ascending or
descending order.
Monotonic Relationship — a relationship where as one variable increases, the other
variable either consistently increases or consistently decreases, but not necessarily at a
constant rate.
Tied Ranks — when two or more data values are equal, they are assigned the average of
the ranks they would have occupied.
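The ranking rules above can be sketched in Python as follows. This is an illustrative
sketch, not a library implementation: average_ranks, pearson_r, and spearman_rho are
hypothetical helper names, the data are made up, and Spearman's coefficient is computed
here as Pearson's r applied to the ranks, which also handles tied ranks.

```python
import math

def average_ranks(values):
    """Rank values in ascending order; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def pearson_r(x, y):
    """Pearson's r from paired samples x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's coefficient computed as Pearson's r on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: y = x**3 is monotonic but not linear
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(average_ranks([10, 20, 20, 30]))       # the two 20s share rank 2.5
print(spearman_rho(x, y), pearson_r(x, y))   # Spearman is 1; Pearson is below 1
```

Because the relationship is perfectly monotonic, the ranks of x and y agree exactly, while
Pearson's r on the raw values falls short of 1 because the relationship is not linear.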
3. CORRELATION AND CAUSATION
Correlation and Causation — the principle that a correlation between two variables does
not necessarily mean that one variable causes the other. Just because two variables are
correlated does not imply a cause-and-effect relationship.
Causation — implies that a change in one variable directly causes a change in another
variable.
Spurious Correlation — a correlation between two variables that appears to be meaningful
but is actually caused by a third variable (confounding variable) or is purely coincidental.
Confounding Variable — a variable that is not included in the analysis but influences both
the dependent and independent variables, creating a misleading correlation between them.
4. REGRESSION ANALYSIS
Regression Analysis — a statistical technique used to model and analyze the relationship
between a dependent variable and one or more independent variables. It is used to predict
the value of the dependent variable based on the values of the independent variables.
Least Squares Principle — the method used in regression analysis to determine the best-
fitting line by minimizing the sum of the squared differences (residuals) between the
observed values and the values predicted by the regression line. It produces the line that
results in the smallest possible total squared error.
Regression Line — the line that best fits the data in a scatter plot, determined using the
least squares principle. It represents the predicted relationship between the dependent and
independent variables.
Residual (Error Term) — the difference between the observed value of the dependent
variable and the value predicted by the fitted regression model (observed Y minus
predicted Y). Strictly, the residual is the sample estimate of the unobservable error term ε.
It represents the portion of the dependent variable that is not explained by the
independent variable(s).
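To illustrate the least squares principle, the sketch below compares the sum of squared
residuals for three candidate lines on a small hypothetical data set. The function name
sum_squared_residuals is illustrative. The first line (intercept 2.2, slope 0.6) is the
least squares line for these particular data; the other two are arbitrary alternatives.

```python
def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared residuals (observed minus predicted) for a candidate line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, chosen only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

best = sum_squared_residuals(x, y, 2.2, 0.6)   # least squares line for these data
alt_1 = sum_squared_residuals(x, y, 2.0, 0.7)  # arbitrary alternative line
alt_2 = sum_squared_residuals(x, y, 2.5, 0.5)  # arbitrary alternative line
print(best, alt_1, alt_2)
```

Any other straight line through these points gives a larger total squared error than the
least squares line.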
5. SIMPLE LINEAR REGRESSION
Simple Linear Regression — a regression model that examines the linear relationship
between one dependent variable (Y) and one independent variable (X). The model is
expressed as:
Y = β₀ + β₁X + ε
Where:
• Y = dependent variable
• X = independent variable
• β₀ = y-intercept (the value of Y when X = 0)
• β₁ = slope (the change in Y for every one-unit change in X)
• ε = error term
Y-Intercept (β₀ or b₀) — the value of the dependent variable Y when the independent
variable X is equal to zero. It is the point where the regression line crosses the Y-axis.
Slope (β₁ or b₁) — the change in the dependent variable Y for every one-unit increase in
the independent variable X. It represents the rate of change in the relationship between X
and Y.
a. Assumptions of Simple Linear Regression
Linearity — the relationship between the dependent variable and the independent variable
must be linear.
Independence — the observations (residuals) must be independent of each other. There
should be no pattern or relationship between the residuals.
Homoscedasticity — the variance of the residuals (errors) must be constant across all
levels of the independent variable. The spread of the residuals should be roughly equal
throughout.
Normality — the residuals (error terms) must be normally distributed.
No Multicollinearity — not applicable to simple linear regression, because the model has
only one independent variable and multicollinearity concerns correlation among two or
more independent variables. It becomes an assumption in multiple regression.
b. Estimating the Coefficients of the Simple Linear Regression Model
Estimated Regression Equation — the sample-based equation used to predict the
dependent variable, written as:
Ŷ = b₀ + b₁X
Where:
• Ŷ (Y-hat) = predicted value of Y
• b₀ = estimated y-intercept
• b₁ = estimated slope
b₁ (Estimated Slope) — computed using the sample data to estimate the true population
slope β₁. It tells us how much Y changes for every one-unit change in X.
b₀ (Estimated Y-Intercept) — the estimated value of Y when X equals zero, based on
sample data.
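A minimal sketch of the usual least squares formulas, assuming the standard closed-form
estimates b₁ = Sxy / Sxx and b₀ = (mean of Y) - b₁ × (mean of X). The function name
fit_simple_regression and the data are hypothetical.

```python
def fit_simple_regression(x, y):
    """Return (b0, b1) for the least squares line Y-hat = b0 + b1 * X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sxy: sum of cross-products; Sxx: sum of squared deviations of X
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx               # estimated slope
    b0 = mean_y - b1 * mean_x    # estimated y-intercept
    return b0, b1

# Hypothetical data, chosen only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_simple_regression(x, y)
y_hat_at_4 = b0 + b1 * 4   # predicted Y when X = 4
print(b0, b1, y_hat_at_4)
```

For these data the fitted line has a slope of about 0.6, so each one-unit increase in X is
associated with a predicted increase of about 0.6 in Y.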
c. Estimating the Standard Error
Standard Error of the Estimate (Sₑ) — also called the standard error of regression, it
measures the typical distance that the observed values fall from the regression line. In
simple linear regression it is the square root of the sum of squared residuals divided by
n - 2. It indicates the accuracy of the predictions made by the regression model. A smaller
standard error indicates a better-fitting model.
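As a rough sketch, the standard error of the estimate for a simple linear regression can
be computed as the square root of the sum of squared residuals divided by n - 2. The
function name and the data are hypothetical; the coefficients 2.2 and 0.6 are the least
squares estimates for these particular data.

```python
import math

def standard_error_of_estimate(x, y, b0, b1):
    """S_e = sqrt(SSE / (n - 2)) for a fitted simple linear regression."""
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (len(x) - 2))

# Hypothetical data and its least squares coefficients
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
s_e = standard_error_of_estimate(x, y, 2.2, 0.6)
print(round(s_e, 4))
```

The divisor is n - 2 rather than n because two coefficients (b₀ and b₁) are estimated from
the same sample.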
d. Constructing the Confidence and Prediction Intervals
Confidence Interval for the Mean Response — an interval estimate of the mean (average)
value of the dependent variable Y for a given value of X. At the chosen confidence level, it
gives a range expected to contain the true mean of Y at that X value.
Prediction Interval — an interval estimate for the value of an individual observation of the
dependent variable Y for a given value of X. It is wider than the confidence interval because
it accounts for additional uncertainty in predicting a single value rather than a mean.
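The difference between the two intervals can be sketched with the standard textbook
formulas for simple linear regression. Everything here is illustrative: the function name
interval_half_widths, the data, and the coefficients are hypothetical, and the critical
value 3.182 is assumed to be read from a t-table (two-sided 95%, n - 2 = 3 degrees of
freedom) rather than computed.

```python
import math

def interval_half_widths(x, x0, s_e, t_crit):
    """Half-widths of the confidence and prediction intervals at X = x0."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    # Grows as x0 moves away from the mean of X
    distance_term = 1 / n + (x0 - mean_x) ** 2 / sxx
    ci_half = t_crit * s_e * math.sqrt(distance_term)       # mean response
    pi_half = t_crit * s_e * math.sqrt(1 + distance_term)   # individual value
    return ci_half, pi_half

# Hypothetical inputs: data, least squares coefficients, and standard error
x = [1, 2, 3, 4, 5]
b0, b1 = 2.2, 0.6
s_e = math.sqrt(0.8)
t_crit = 3.182   # assumed table value: t, two-sided 95%, 3 degrees of freedom
x0 = 4

y_hat = b0 + b1 * x0
ci_half, pi_half = interval_half_widths(x, x0, s_e, t_crit)
print("Confidence interval:", (y_hat - ci_half, y_hat + ci_half))
print("Prediction interval:", (y_hat - pi_half, y_hat + pi_half))
```

Both intervals are centered on the same predicted value; the extra 1 under the square root
is what makes the prediction interval wider.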
e. Coefficient of Determination
Coefficient of Determination (R²) — a measure that indicates what proportion or
percentage of the total variation in the dependent variable Y is explained by the independent
variable(s) in the regression model. It ranges from 0 to 1 (or 0% to 100%).
• R² = 1 (or 100%) means the model perfectly explains all the variation in Y.
• R² = 0 means the model explains none of the variation in Y.
Total Sum of Squares (SST) — the total variation in the dependent variable Y around its
mean. It measures how much Y values deviate from the mean of Y.
Regression Sum of Squares (SSR) — the portion of the total variation in Y that is
explained by the regression model (i.e., by the independent variable X).
Error Sum of Squares (SSE) — the portion of the total variation in Y that is NOT explained
by the regression model. It represents the unexplained variation or residual variation.
Relationship: SST = SSR + SSE
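The decomposition can be checked numerically. The sketch below uses hypothetical data and
its least squares coefficients (2.2 and 0.6); the function name sums_of_squares is
illustrative. Note that SST = SSR + SSE holds for a least squares line that includes an
intercept, not for an arbitrary line.

```python
def sums_of_squares(x, y, b0, b1):
    """Return (SST, SSR, SSE) for the fitted line Y-hat = b0 + b1 * X."""
    mean_y = sum(y) / len(y)
    predicted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - mean_y) ** 2 for yi in y)                   # total variation
    ssr = sum((pi - mean_y) ** 2 for pi in predicted)           # explained variation
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))   # unexplained variation
    return sst, ssr, sse

# Hypothetical data and its least squares coefficients
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
sst, ssr, sse = sums_of_squares(x, y, 2.2, 0.6)
r_squared = ssr / sst
print(sst, ssr, sse, r_squared)
```

Here the model explains about 60% of the variation in Y, and the explained and unexplained
parts add up to the total.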
f. Relationship of Correlation Coefficient, Standard Error of Estimate, and Coefficient of Determination
• R² = r² — the coefficient of determination is the square of the Pearson correlation
coefficient in simple linear regression.
• A larger |r| (r closer to +1 or -1) results in a higher R² (more variation explained) and a
lower standard error (better predictions).
• A smaller |r| (r closer to 0) results in a lower R² and a higher standard error (less
accurate predictions).
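These relationships can be written compactly for simple linear regression. The identities
below follow from SST = SSR + SSE and from the usual definition of the standard error of
the estimate with n - 2 degrees of freedom, which is assumed here:

```latex
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = r^2,
\qquad
S_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{(1 - r^2)\,SST}{n-2}}
```

For a fixed amount of total variation SST, the standard error shrinks toward zero as r
approaches +1 or -1.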
6. MULTIPLE LINEAR REGRESSION
Multiple Linear Regression — a regression model that examines the linear relationship
between one dependent variable (Y) and two or more independent variables (X₁, X₂, ...,
Xₖ). The model is expressed as:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
a. Assumptions of Multiple Linear Regression
Linearity — the relationship between the dependent variable and each independent
variable must be linear.
Independence of Errors — the residuals must be independent of each other.
Homoscedasticity — the variance of the residuals must be constant across all levels of the
independent variables.
Normality of Errors — the residuals must be normally distributed.
No Multicollinearity — the independent variables must not be highly correlated with each
other. High multicollinearity makes it difficult to determine the individual effect of each
independent variable on Y.
Multicollinearity — a condition in multiple regression where two or more independent
variables are highly correlated with each other, making it difficult to isolate the individual
effect of each predictor on the dependent variable.
b. Estimating the Coefficients of the Multiple Linear Regression Model
Partial Regression Coefficient — the coefficient (b₁, b₂, etc.) associated with each
independent variable in the multiple regression model. It represents the change in Y for a
one-unit change in that specific independent variable, holding all other independent
variables constant.
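One way to estimate the coefficients is to solve the normal equations (XᵀX)b = XᵀY. The
sketch below does this in plain Python with Gaussian elimination. It is an illustration
under stated assumptions, not a production routine: the helper names are hypothetical, and
the data are constructed so that Y = 1 + 2X₁ + 3X₂ exactly, which lets the estimates be
checked by eye.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        # Swap in the row with the largest entry in this column
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                     # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_multiple_regression(columns, y):
    """Return [b0, b1, ..., bk] by solving the normal equations."""
    n = len(y)
    # Design matrix: a leading 1 for the intercept, then one entry per predictor
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(design[i][j] * design[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    xty = [sum(design[i][j] * y[i] for i in range(n)) for j in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical data built so that Y = 1 + 2*X1 + 3*X2 exactly
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
coefficients = fit_multiple_regression([x1, x2], y)
print([round(c, 6) for c in coefficients])
```

The recovered slope for X₁ is the change in Y per one-unit change in X₁ with X₂ held fixed,
which is the partial regression coefficient described above.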
Adjusted R² (Adjusted Coefficient of Determination) — a modified version of R² that
accounts for the number of independent variables in the model. Unlike R², adjusted R²
penalizes the addition of variables that do not significantly improve the model. It is more
appropriate for evaluating multiple regression models.
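A short sketch of the usual adjustment formula, 1 - (1 - R²)(n - 1) / (n - k - 1), where n
is the sample size and k is the number of independent variables. The function name and the
R² values are hypothetical and chosen only to show the penalty at work.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical case: a third predictor raises R-squared only slightly
two_predictors = adjusted_r_squared(0.600, n=30, k=2)
three_predictors = adjusted_r_squared(0.605, n=30, k=3)
print(round(two_predictors, 4), round(three_predictors, 4))
```

In this made-up case R² rises from 0.600 to 0.605, yet adjusted R² falls, suggesting the
extra variable does not earn its place in the model.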
c. Interpreting the Multiple Linear Regression
Holding Other Variables Constant (Ceteris Paribus) — when interpreting a partial
regression coefficient, the effect of one independent variable on Y is interpreted while
assuming all other independent variables remain unchanged.
Overall F-Test — a hypothesis test used to determine whether the multiple regression
model as a whole is statistically significant, i.e., whether at least one independent variable
significantly predicts Y.
Individual t-Test — a hypothesis test used to determine whether each individual regression
coefficient (βᵢ) is statistically significant, i.e., whether a specific independent variable
significantly contributes to predicting Y when all other variables are in the model.
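The two test statistics can be sketched as follows, assuming the standard forms
F = (SSR / k) / (SSE / (n - k - 1)) and t = coefficient / standard error. The function
names and numbers are hypothetical. To keep the arithmetic checkable, the inputs come from
a one-predictor fit, where the standard error of the slope is Sₑ / √Sxx; with several
predictors the standard errors would come from software output.

```python
import math

def overall_f_statistic(ssr, sse, n, k):
    """F = (SSR / k) / (SSE / (n - k - 1)) for a model with k independent variables."""
    return (ssr / k) / (sse / (n - k - 1))

def t_statistic(coefficient, standard_error):
    """t = estimated coefficient divided by its standard error."""
    return coefficient / standard_error

# Hypothetical one-predictor fit: n = 5, SSR = 3.6, SSE = 2.4, slope = 0.6, Sxx = 10
n, k = 5, 1
f_value = overall_f_statistic(ssr=3.6, sse=2.4, n=n, k=k)
s_e = math.sqrt(2.4 / (n - k - 1))
slope_standard_error = s_e / math.sqrt(10)
t_value = t_statistic(0.6, slope_standard_error)
print(f_value, t_value, t_value ** 2)
```

Each statistic is then compared with a critical value, or converted to a p-value, from the
F or t distribution. With a single predictor the two tests agree, since t² equals F.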
7. REGRESSION ANALYSIS WITH DUMMY VARIABLES
Dummy Variable — also called an indicator variable or binary variable, it is an artificial
variable created to represent a categorical variable in a regression model. It takes the value
of 1 if a condition is met and 0 if it is not.
Reference Category (Base Category) — the category of a categorical variable that is
excluded when creating dummy variables. All other categories are compared against this
reference category.
Dummy Variable Trap — a situation in regression where too many dummy variables are
created, leading to perfect multicollinearity. To avoid this, if a categorical variable has k
categories, only k - 1 dummy variables should be created.
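A minimal sketch of the k - 1 rule in Python. The function name make_dummies and the region
labels are hypothetical; the reference category is simply left out, so it is represented
by all dummy variables being 0.

```python
def make_dummies(values, reference):
    """Create one 0/1 dummy per category, excluding the reference category."""
    categories = sorted(set(values))
    kept = [c for c in categories if c != reference]
    return {c: [1 if v == c else 0 for v in values] for c in kept}

# Hypothetical categorical variable with k = 3 categories
regions = ["north", "south", "west", "south"]
dummies = make_dummies(regions, reference="north")
for name, column in dummies.items():
    print(name, column)
```

Three categories yield two dummy variables. A row with 0 in both columns belongs to the
reference category, and each dummy's coefficient is read as a difference from it.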
8. POSITIVE MONOTONIC TRANSFORMATION OF DATA
Positive Monotonic Transformation — a mathematical transformation applied to data that
preserves the order (ranking) of the data values. As the original value increases, the
transformed value also increases.
Logarithmic Transformation (Log Transformation) — a transformation that applies the
logarithm function to the data values, which must be positive. Used to reduce right
skewness, handle non-linearity, or stabilize variance.
Square Root Transformation — a transformation that takes the square root of each data
value, which must be non-negative. Used to reduce the effect of large values and make
right-skewed data more symmetric.
Data Transformation — the process of applying a mathematical function to data values to
make the data meet the assumptions of regression (e.g., linearity, normality,
homoscedasticity).
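The sketch below illustrates two properties of the log transformation on hypothetical data
that grow exponentially: the ranking of the values is unchanged, and the curved pattern
becomes a straight line on the log scale. The natural logarithm is used here as an
assumption; any base behaves the same way up to a constant factor.

```python
import math

# Hypothetical exponential growth: y triples with each one-unit increase in x
x = [1, 2, 3, 4, 5]
y = [2 * 3 ** xi for xi in x]        # 6, 18, 54, 162, 486
log_y = [math.log(v) for v in y]

# Order is preserved: sorting by y and by log(y) gives the same ranking
order_before = sorted(range(len(y)), key=lambda i: y[i])
order_after = sorted(range(len(y)), key=lambda i: log_y[i])

# Step sizes: growing on the raw scale, constant (log 3) on the log scale
steps_raw = [y[i + 1] - y[i] for i in range(len(y) - 1)]
steps_log = [log_y[i + 1] - log_y[i] for i in range(len(y) - 1)]
print(order_before == order_after)
print(steps_raw)
print([round(s, 4) for s in steps_log])
```

Equal steps on the log scale mean that log(Y) is a linear function of X, so a linear
regression of log(Y) on X would satisfy the linearity assumption for these data.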
9. MODELING RELATIONSHIPS OF MULTIPLE VARIABLES WITH LINEAR REGRESSION
Model Building — the process of selecting the most appropriate set of independent
variables to include in a regression model to best predict the dependent variable.
Stepwise Regression — a method of model selection where independent variables are
added or removed from the regression model one at a time based on statistical criteria (such
as p-values or F-statistics), until no further addition or removal meets the chosen
criterion. The resulting model is not guaranteed to be the best possible one.
Forward Selection — a stepwise method where variables are added one at a time to the
model, starting with the most significant variable, until no remaining variable significantly
improves the model.
Backward Elimination — a stepwise method where all variables are initially included in the
model, and variables are removed one at a time, starting with the least significant, until all
remaining variables are statistically significant.
Overfitting — a problem that occurs when a regression model is too complex and fits the
sample data too closely, including random noise, resulting in poor predictive performance on
new data.
10. ADVANCED REGRESSION MODELS
Polynomial Regression — a form of regression where the relationship between the
dependent variable and the independent variable is modeled as an nth-degree polynomial.
Used when the relationship between X and Y is curvilinear rather than strictly linear.
Interaction Effect — occurs when the effect of one independent variable on the dependent
variable depends on the value of another independent variable. Modeled by including an
interaction term (the product of two independent variables) in the regression equation.
Interaction Term — a new variable created by multiplying two independent variables
together, used to capture the interaction effect between them in a regression model.
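A small sketch of what an interaction term does to interpretation, assuming a model of the
form Y = b₀ + b₁X₁ + b₂X₂ + b₃(X₁X₂). The coefficient values and function names are
hypothetical. With the product term included, the effect of a one-unit change in X₁ is
b₁ + b₃X₂ rather than b₁ alone.

```python
def predict(x1, x2, b0, b1, b2, b3):
    """Predicted Y from a model with an interaction term x1 * x2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * (x1 * x2)

def effect_of_x1(x2, b1, b3):
    """Change in predicted Y per one-unit change in x1, at a given x2."""
    return b1 + b3 * x2

# Hypothetical coefficients, chosen only for illustration
b0, b1, b2, b3 = 1.0, 2.0, 0.5, 1.5

effect_at_0 = effect_of_x1(0, b1, b3)
effect_at_2 = effect_of_x1(2, b1, b3)
# The same effect, obtained by differencing two predictions at x2 = 2
difference = predict(1, 2, b0, b1, b2, b3) - predict(0, 2, b0, b1, b2, b3)
print(effect_at_0, effect_at_2, difference)
```

The effect of X₁ is larger when X₂ = 2 than when X₂ = 0, which is exactly what an
interaction effect means.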
Logistic Regression — a regression model used when the dependent variable is
categorical (e.g., binary: yes/no, 0/1). It predicts the probability that an observation belongs
to a particular category.
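For a binary dependent variable, the predicted probability in logistic regression is
commonly written as p = 1 / (1 + e^-(b₀ + b₁X)). The sketch below evaluates that formula
with hypothetical coefficients; the function name predicted_probability is illustrative,
and no model is actually fitted here.

```python
import math

def predicted_probability(x, b0, b1):
    """Logistic function: probability that the outcome equals 1 at a given x."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

# Hypothetical coefficients, chosen only for illustration
b0, b1 = -4.0, 1.0
probabilities = [predicted_probability(x, b0, b1) for x in (2, 4, 6)]
print([round(p, 4) for p in probabilities])
```

The output always stays between 0 and 1, and with a positive slope it rises as X
increases. It equals 0.5 where b₀ + b₁X = 0, which is X = 4 for these coefficients.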
Nonlinear Regression — a regression model in which the dependent variable is a
nonlinear function of the model parameters, so the relationship cannot be made linear
through transformation of the variables.