
Understanding Goodness of Fit in Regression

The document discusses the concept of goodness of fit in regression analysis, focusing on the coefficient of determination (R²) as a measure of how well a linear model explains the variation in a response variable. It provides examples of R² calculations for skin cancer mortality rates versus latitude and height versus GPA, highlighting the importance of interpreting R² values correctly and cautioning against inferring causation from correlation. Additionally, it explains the relationship between R² and the Pearson correlation coefficient (rxy).

Uploaded by

promptmba24

Goodness of Fit


In the following plot, we note that

$$Y_i - \bar{Y} = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})$$

[Figure: Skin Cancer Mortality versus Latitude. Vertical axis: Mortality (Deaths per 10 million); horizontal axis: Latitude (at center of state). The plot marks an observed value $Y_i$, its fitted value $\hat{Y}_i$, and the mean $\bar{Y}$.]

Some Notations

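1 Regression sum of squares

$$\mathrm{SSR} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$

2 Error sum of squares

$$\mathrm{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

3 Total sum of squares

$$\mathrm{SSTO} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

It can be shown that

$$\mathrm{SSTO} = \mathrm{SSE} + \mathrm{SSR}$$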
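
The decomposition of the total sum of squares can be checked numerically. The sketch below fits a least-squares line in plain Python and computes the three sums of squares. The helper names `fit_line` and `sums_of_squares` and the five data points are made up for illustration; they are not the skin cancer data from the slides.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sums_of_squares(x, y):
    """Return (SSR, SSE, SSTO) for the least-squares line through (x, y)."""
    b0, b1 = fit_line(x, y)
    y_bar = sum(y) / len(y)
    y_hat = [b0 + b1 * xi for xi in x]
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    return ssr, sse, ssto


# Hypothetical illustrative data (an assumption, not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

ssr, sse, ssto = sums_of_squares(x, y)
print("SSR =", ssr, "SSE =", sse, "SSTO =", ssto)
# The identity holds up to floating-point rounding.
print("SSTO == SSE + SSR:", abs(ssto - (sse + ssr)) < 1e-9)
```

The identity relies on the line being the least-squares fit with an intercept; for an arbitrary line the cross term does not vanish and the three sums need not add up.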

We have a random sample of n = 35 students, and we are interested in how strong the
linear relationship is between the height (x) of a student and his or her GPA (Y).

[Figure: GPA versus Height, with legend entries $\hat{Y}$ and $\bar{Y}$; plot title $R^2 = 0.0028$.]

In this example, SSR = 0.0276, SSE = 9.7055, and SSTO = 9.7331. Note that
SSR/SSTO = 0.0028.

Now let’s look at the example of skin cancer mortality rates (Y ) vs latitude (x)
again.

[Figure: Mortality (Deaths per 10 million) versus Latitude (at center of state), with legend entries $\hat{Y}$ and $\bar{Y}$; plot title $R^2 = 0.6798$.]

In this example, SSR = 36464.2, SSE = 17173.07, and SSTO = 53637.27. Note that
SSR/SSTO = 0.6798.
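
As a quick arithmetic check, the ratios quoted for the two examples can be recomputed from the reported sums of squares. The snippet below uses only the numbers given on the slides; the dictionary layout and variable names are illustrative choices.

```python
# Sums of squares exactly as reported in the two examples.
examples = {
    "height vs GPA": {"SSR": 0.0276, "SSE": 9.7055, "SSTO": 9.7331},
    "latitude vs mortality": {"SSR": 36464.2, "SSE": 17173.07, "SSTO": 53637.27},
}

ratios = {}
for name, s in examples.items():
    # Proportion of total variation attributed to the regression.
    ratios[name] = s["SSR"] / s["SSTO"]
    # The reported values should also satisfy SSTO = SSE + SSR.
    adds_up = abs(s["SSR"] + s["SSE"] - s["SSTO"]) < 1e-6
    print(f"{name}: SSR/SSTO = {ratios[name]:.4f}, SSE + SSR = SSTO: {adds_up}")
```

Rounded to four decimal places, the ratios come out to 0.0028 and 0.6798, matching the values shown in the plot titles.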
Coefficient of Determination: $R^2$

1 One measure of goodness of fit is the coefficient of determination, $R^2$:

$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SSTO}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SSTO}}$$

It shows the proportion of the variation in the response $Y$ explained by the
linear regression model.

2 $R^2$ is a number between 0 and 1.

3 If $R^2 = 1$, all of the data points fall on the regression line. The predictor $x$
accounts for all of the variation in $Y$.

4 If $R^2 = 0$, the estimated regression line is perfectly horizontal. The predictor $x$
accounts for none of the variation in $Y$.
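
The two boundary cases can be illustrated with a small sketch. The function name `r_squared` and both data sets are hypothetical, chosen so that one set lies exactly on a line and the other has a least-squares slope of exactly zero. The function assumes the responses are not all equal, since SSTO would then be zero and the ratio undefined.

```python
def r_squared(x, y):
    """R^2 = 1 - SSE/SSTO for the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # estimated slope
    b0 = y_bar - b1 * x_bar        # estimated intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / ssto


x = [1.0, 2.0, 3.0, 4.0, 5.0]

# Every point lies on the line y = 3 + 2x, so SSE = 0.
on_a_line = [3.0 + 2.0 * xi for xi in x]

# Symmetric pattern around x = 3: the fitted slope is 0, so SSR = 0.
no_linear_trend = [1.0, 3.0, 2.0, 3.0, 1.0]

r2_line = r_squared(x, on_a_line)
r2_flat = r_squared(x, no_linear_trend)
print("points on a line:", r2_line)
print("horizontal fitted line:", r2_flat)
```

In the second data set the responses still vary, but none of that variation is captured by a straight line in $x$.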

Interpretation of $R^2$ when $0 < R^2 < 1$

When $R^2$ is some number between 0 and 1, like 0.6 or 0.3, we say either

1 $R^2 \times 100$ percent of the variation in $Y$ is reduced by taking into account predictor $x$

or

2 $R^2 \times 100$ percent of the variation in $Y$ is explained by the variation in predictor $x$

Some $R^2$ Cautions
Caution #1 When $R^2$ is close to zero, it doesn't necessarily mean that $x$ and $Y$
are not related.

[Figure: plot of $Y$ versus $x$; plot title $R^2 = 0.0003$.]

The appropriate relationship between x and Y is “quadratic”, not “linear”.
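
A minimal sketch of this caution, using synthetic noise-free data (an assumption; the data behind the slide's plot are not given): $Y = -3x^2$ on a grid symmetric around zero. A straight line in $x$ explains none of the variation, while a straight line in the transformed predictor $z = x^2$ explains all of it. The function name `r_squared` is illustrative.

```python
def r_squared(x, y):
    """R^2 = 1 - SSE/SSTO for the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / ssto


# Synthetic data: an exact quadratic relationship, symmetric in x.
x = [-1.0, -0.5, 0.0, 0.5, 1.0]
y = [-3.0 * xi ** 2 for xi in x]

# Linear model in x: the fitted slope is zero, so R^2 is zero.
r2_linear = r_squared(x, y)

# Linear model in z = x^2, which is a quadratic model in x.
z = [xi ** 2 for xi in x]
r2_quadratic = r_squared(z, y)

print("R^2 with predictor x:  ", r2_linear)
print("R^2 with predictor x^2:", r2_quadratic)
```

Here $Y$ is completely determined by $x$, yet the linear fit reports no explained variation, which is why a scatterplot should be examined before a low $R^2$ is read as "no relationship".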

[Figure: plot of $Y$ versus $x$ on the same axes as the previous figure.]

Caution #2 Correlation does not imply causation.
1 In the following example, the relationship between wine consumption and death due to heart
disease is examined. Each data point represents one country.

[Figure: Heart disease deaths (per 100,000 people) versus Wine consumption (Liters of wine per person per year); plot title $R^2 = 0.692$.]

2 For this data set, $R^2 = 0.692$. A person might be tempted to conclude that he or she should
drink more wine, since it reduces the risk of heart disease.
3 However, there may be other differences in the behavior of the people in the various
countries that really explain the differences in the heart disease death rates, such as diet,
exercise level, stress level, social support structure and so on.

$R^2$ and $r_{xy}$ ((Pearson) correlation coefficient)

1 The correlation coefficient $r_{xy}$ is directly related to the coefficient of
determination $R^2$:

$$r_{xy} = \pm\sqrt{R^2}$$

The sign of $r_{xy}$ is the sign of the estimated slope of the regression line.

2 For example, if $R^2 = 0.87$, then $r_{xy}$ could be $\pm 0.93$.

3 Because $R^2$ is always a number between 0 and 1, the correlation coefficient
$r_{xy}$ is always a number between -1 and 1.
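
The relationship between the two quantities can be verified on a small example. The sketch below computes the Pearson correlation directly from its definition and compares its square with $R^2$ from the least-squares fit. The function names and the decreasing data set are hypothetical, chosen only to show the negative-sign case.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def slope_and_r_squared(x, y):
    """Least-squares slope and R^2 = 1 - SSE/SSTO for y regressed on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    return b1, 1 - sse / ssto


# Hypothetical data with a decreasing trend (an assumption for illustration).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.8, 8.1, 6.2, 3.9, 2.0]

r = pearson_r(x, y)
slope, r2 = slope_and_r_squared(x, y)

print("r and slope are both negative:", r < 0 and slope < 0)
print("r squared equals R^2:", abs(r ** 2 - r2) < 1e-9)
# Recover r from R^2 by attaching the sign of the slope to the square root.
r_from_r2 = math.copysign(math.sqrt(r2), slope)
print("r recovered from R^2 and slope sign:", abs(r_from_r2 - r) < 1e-9)
```

$R^2$ alone does not determine the direction of the association; the sign has to come from the slope or from the scatterplot.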

In the example on the strength of the linear relationship between the height (x) of a
student and his or her GPA (Y), $R^2 = 0.0028$ and $r_{xy} = -0.053$.

[Figure: GPA versus Height, with legend entries $\hat{Y}$ and $\bar{Y}$; plot title $R^2 = 0.0028$.]
