Correlation and Regression Analysis Guide

The document discusses correlation and regression analysis, detailing Pearson's correlation coefficient, its properties, and calculation steps, as well as Spearman's rank correlation for non-parametric measures. It explains simple and multiple linear regression, including assumptions, model building, and checking for normality and homogeneity of variance. Additionally, it covers model selection methods and the importance of predictive capability in regression models.

Uploaded by

rabiaalrabea

Correlation & Regression

- Pearson’s Correlation Coefficient:

- The direction is indicated by the sign of r, +ve or -ve.
- The strength is indicated by the absolute value of r: the closer |r| is to 1
(whether +ve or -ve), the closer the dots lie to the line.

* We use a scatterplot to identify:

1- Form: Linear, Curved, Cluster or No pattern.
2- Direction: Positive, Negative, No direction.
3- Strength: How closely it fits the line: Weak, Moderate or Strong.

* Properties of r:
1- r ranges from -1 to +1.
2- Dimensionless (because it measures only strength and direction).
3- If X and Y are independent then r=0 (the opposite isn’t necessarily true).

* Cautions with Correlation:

1- Correlation doesn’t imply Causation.
2- A high Correlation Coefficient does not necessarily mean there is a real (statistically significant) correlation.
3- Significance depends on Sample size & Size of Correlation Coefficient.

-How to calculate it?

First Step: Mean of X & Y:

$X_m = \dfrac{\text{sum of the X values}}{\text{number of values } (n)}$ (and likewise $Y_m$ for Y)

Second Step: Deviation from the mean:

| Xd | Yd |
|---------|---------|
| X1 - Xm | Y1 - Ym |
| X2 - Xm | Y2 - Ym |
| X3 - Xm | Y3 - Ym |

Third Step: Covariance:

$\text{Covariance} = \dfrac{(Xd_1 \times Yd_1) + (Xd_2 \times Yd_2) + \dots}{n - 1}$

Fourth Step: Variance and Standard Deviation of X and Y:

$\text{Variance}_X = \dfrac{Xd_1^2 + Xd_2^2 + \dots}{n - 1}, \qquad SD_x = \sqrt{\text{Variance}_X}$

$\text{Variance}_Y = \dfrac{Yd_1^2 + Yd_2^2 + \dots}{n - 1}, \qquad SD_y = \sqrt{\text{Variance}_Y}$

Fifth Step: Correlation Coefficient:

$r = \dfrac{\text{Covariance}}{SD_x \times SD_y}$

Sixth Step: Standard Error:

• Degrees of Freedom = n-2

$S.E. = \sqrt{\dfrac{1 - r^2}{n - 2}}$

While,

$t = \dfrac{r}{S.E.}$
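The six steps above can be sketched in a few lines of Python. This is a minimal illustration on made-up data; the function name `pearson_steps` and the sample values are assumptions, not part of the notes.

```python
import math

def pearson_steps(x, y):
    """Follow the six steps: means, deviations, covariance, SDs, r, S.E. and t."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n                 # Step 1: means
    xd = [xi - xm for xi in x]                      # Step 2: deviations from the mean
    yd = [yi - ym for yi in y]
    cov = sum(a * b for a, b in zip(xd, yd)) / (n - 1)   # Step 3: covariance
    sdx = math.sqrt(sum(a * a for a in xd) / (n - 1))    # Step 4: SDs
    sdy = math.sqrt(sum(b * b for b in yd) / (n - 1))
    r = cov / (sdx * sdy)                           # Step 5: correlation coefficient
    se = math.sqrt((1 - r * r) / (n - 2))           # Step 6: S.E. (undefined if |r| = 1)
    t = r / se                                      # t statistic with n - 2 df
    return {"cov": cov, "sdx": sdx, "sdy": sdy, "r": r, "se": se, "t": t}

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
result = pearson_steps(x, y)
print(result)
```

For these five pairs the covariance is 1.5 and r is about 0.77; the t value is then compared against a t distribution with n - 2 = 3 degrees of freedom.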

Later On: (Simple Linear Regression):

SLR = Intercept + (Slope x X)

While,

$\text{Slope} = \dfrac{\text{Covariance}}{\text{Variance}_X}$ or $r \times \dfrac{SD_y}{SD_x}$

And,

$\text{Intercept} = M_y - (\text{Slope} \times M_x)$

Also, Confidence Interval using Fisher’s Z-Transformation (r is known) :

Step one: Get Fisher’s Z:

$Z = \dfrac{1}{2} \ln\left(\dfrac{1 + r}{1 - r}\right)$

Step two: S.E. of Z:

$SE_z = \dfrac{1}{\sqrt{n - 3}}$

Step three: Determine Z-score for CI (1.96 for CI 95%):

$Z \pm 1.96 \times SE_z = (\text{lower bound}, \text{upper bound})$

Step four: Convert CI for Z back to r:

$r = \dfrac{e^{2Z} - 1}{e^{2Z} + 1}$ (do it twice: for the lower and for the upper bounds)
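The four Fisher steps can be sketched as a small function. The name `fisher_ci` and the example values (r = 0.5, n = 28) are assumptions chosen only for illustration.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Confidence interval for r via Fisher's Z-transformation."""
    z = 0.5 * math.log((1 + r) / (1 - r))       # Step one: Fisher's Z
    se_z = 1 / math.sqrt(n - 3)                 # Step two: S.E. of Z
    lo_z = z - z_crit * se_z                    # Step three: CI on the Z scale
    hi_z = z + z_crit * se_z

    def back_to_r(v):                           # Step four: convert back to r
        return (math.exp(2 * v) - 1) / (math.exp(2 * v) + 1)

    return back_to_r(lo_z), back_to_r(hi_z)

# Hypothetical example: r = 0.5 observed in n = 28 pairs.
lower, upper = fisher_ci(r=0.5, n=28)
print(round(lower, 3), round(upper, 3))
```

Note that the interval is not symmetric around r, because the back-transformation is non-linear.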

- Spearman’s Rank Correlation (p):

- It’s a non-parametric measure of correlation that assesses the strength
and the direction of the monotonic relation between two variables.
- Used for:
- Ordinally scaled variables.
- Non-normally distributed variables.
- Non-linear monotonic relationships.

- Procedure:
- Rank both variables from lowest to highest. Working with ranks
makes the measure less sensitive to outliers & non-normal distributions.
- Calculate a Pearson’s correlation coefficient on the ranks.

- Equation:

$P = 1 - \dfrac{6 \sum d_i^2}{n(n^2 - 1)}$
($d_i$ = difference between the X rank and the Y rank of pair i, n = number of pairs)

- Interpreting:
- P = +1 → Perfect positive monotonic relation (X increase, Y Increase)
- P = -1 → Perfect negative monotonic relation (X increase, Y Decrease)
- P = 0 → No monotonic relationship (No consistent relation)
- 0 < P < 1 → Positive monotonic relation (X increase, Y Increase) but not
perfectly.
- -1 < P < 0 → Negative monotonic relation (X increase, Y Decrease) but not
perfectly.
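A minimal sketch of the rank-based calculation, assuming no tied values (ties would need average ranks, which this sketch does not handle). The helper names and data are hypothetical.

```python
def ranks(values):
    """Rank from lowest (1) to highest (n); assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman's coefficient from squared rank differences."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Made-up data: y is a non-linear but monotonic function of x.
x = [10, 20, 30, 40, 50]
y = [1, 8, 27, 64, 125]
rho_up = spearman(x, y)                    # perfectly increasing
rho_down = spearman(x, [-v for v in y])    # perfectly decreasing
print(rho_up, rho_down)
```

The relation here is curved (cubic), yet the coefficient is +1 because only the ordering matters.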

- Regression:

* Types:
- Dependent Variable/Outcome/Y (Response Variable):
1- Continuous: Linear Regression. Ex. HR, BP…. etc
2- Binary: Logistic Regression. Ex. y/n, Disease/No Disease….etc
3- Time to Event: Cox Regression. Ex. Time to death, recurrence.
- Independent Variable/Predictor/X (Explanatory):
1- One Predictor: Simple Regression.
2- Multiple Predictors: Multiple Regression.

* Uses:
1- As a prediction tool.
2- to control for confounders.

** Simple Linear Regression:

SLR = Intercept + (Slope x X) + e (Residuals)
While,
$\text{Slope} = \dfrac{\text{Covariance}}{\text{Variance}_X}$ or $r \times \dfrac{SD_y}{SD_x}$
And,
$\text{Intercept} = M_y - (\text{Slope} \times M_x)$
e = Observed Value – Predicted Value
The Coefficient of determination = r² (a measure of how successful the
regression was in explaining the response)
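As a rough sketch of these formulas, the slope, intercept, residuals and r² can be computed directly. The function name and the data are made up for illustration.

```python
def simple_linear_regression(x, y):
    """Slope = Cov / Var(X); Intercept = My - Slope * Mx; e = observed - predicted."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - mx) ** 2 for a in x) / (n - 1)
    slope = cov / var_x
    intercept = my - slope * mx
    predicted = [intercept + slope * a for a in x]
    residuals = [obs - p for obs, p in zip(y, predicted)]
    ss_tot = sum((obs - my) ** 2 for obs in y)
    ss_res = sum(e ** 2 for e in residuals)
    r2 = 1 - ss_res / ss_tot            # coefficient of determination
    return slope, intercept, residuals, r2

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, intercept, residuals, r2 = simple_linear_regression(x, y)
print(slope, intercept, r2)
```

With an intercept in the model the residuals sum to zero, which is a quick sanity check on the fit.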

* Assumptions:
1- The residuals are normally distributed.
2- Homogeneity of Variance.
3- Independence of the measurement.
4- Linearity of the model.

* What to do in case of bad fit?

1- Non-linear transformation of X or Y. (e.g. log(x))
2- Non-linear Regression.
3- Weighted Regression.
4- Multiple Regression.

* How can we check the Normality of the residuals?

1- First determine the residuals (e).
2- Ways to check Normality (most common way: Normal Probability Plot):
- Histogram/Boxplot
- Tests: Shapiro-Wilk and Kolmogorov-Smirnov.

* How can we check the Homogeneity of the Variance?
- Make a residual plot (scatterplot of the residuals vs. predicted values):
- Graph shows a horizontal band of equal height = Homogeneity.
- Graph shows a band of equal height that is not horizontal = No.

* Linear Regression & ANOVA:

- The regression analysis can also be viewed as a one-way ANOVA with a
specific kind of post-hoc test.

- Equations: ( A = B + C )

1. Deviation between measurement & overall mean: (A)
$Y_i - \bar{Y}$
2. Deviation between measurement & regression line: (B)
$Y_i - \hat{Y}_i$ ($\hat{Y}_i$ = regression line)
3. Deviation between regression line & overall mean: (C)
$\hat{Y}_i - \bar{Y}$
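* Table (ANOVA) for Regression: SE = Square Root of MSE

| Source | SS | df | MS | F | Sig. |
|------------|---------|-----|----------|---------|------|
| Regression | SSR | K-1 | SSR / df | MSR/MSE | |
| Residuals | SSE | N-K | SSE / df | | |
| Total | SSR+SSE | N-1 | Var(y) | | |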
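- Other Equations:
$R^2 = \dfrac{SS_{Reg}}{SS_{Tot}}$, $Var(y) = \dfrac{SS_{Tot}}{N - 1}$, $Adjus.\ R^2 = 1 - \dfrac{(n - 1)}{(n - k - 1)} \times (1 - R^2)$
(here k = number of predictors, so n - k - 1 equals the residual df)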
- H0: 𝐵1 = 0
- H1: 𝐵1 ≠ 0
- F follows F distribution with dfReg & dfRes degrees of freedom.

* Proportion of Explained Variance R²:

- R² is the proportion of variance in Y explained by the regression on X.
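The ANOVA view of regression can be illustrated with a short sketch that computes the sums of squares, F, R² and adjusted R². The function name, the data, and the fitted line y = 2.2 + 0.6x (the least-squares line for this made-up data) are assumptions for the example.

```python
def regression_anova(y, fitted, n_params):
    """Regression ANOVA quantities; n_params is K (intercept included)."""
    n = len(y)
    mean_y = sum(y) / n
    ss_reg = sum((f - mean_y) ** 2 for f in fitted)             # C: line vs. overall mean
    ss_res = sum((obs - f) ** 2 for obs, f in zip(y, fitted))   # B: measurement vs. line
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)              # A: measurement vs. mean
    df_reg, df_res = n_params - 1, n - n_params
    ms_reg, ms_res = ss_reg / df_reg, ss_res / df_res
    r2 = ss_reg / ss_tot
    adj_r2 = 1 - (n - 1) / df_res * (1 - r2)
    return {"SSR": ss_reg, "SSE": ss_res, "SST": ss_tot,
            "F": ms_reg / ms_res, "R2": r2, "adj_R2": adj_r2}

# Made-up data and its least-squares fitted values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fitted = [2.2 + 0.6 * xi for xi in x]
anova = regression_anova(y, fitted, n_params=2)
print(anova)
```

For a least-squares fit the total sum of squares equals SSR + SSE, and with a single predictor F equals the square of the t statistic for the slope.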

** Multiple Linear Regression:

MLR = Intercept + (Slope1 x X1) + (Slope2 x X2) + ….. + e (Residuals)
* Assumptions:
1- The residuals are normally distributed.
2- Homogeneity of Variance.
3- Independence of the measurement.
4- Linearity of the model.
5- For every X the variance of Y should be equal.
6- No Multicollinearity: Multicollinearity means that two or more of the
Independent variables are highly correlated with each other.

* How can we check for these assumptions:

1- Normal distribution: Histogram/Boxplot & Tests.
2- Homogeneity: error plot with residuals (e > Y-axis, Predictions > X-axis)

* With 2 explanatory variables, there are four possible models:

1- Empty model: y=b0+e (lowest prediction percentage (0), V. Cheap).
2- Model with X1: y=b0+bX1*X1+e (explains 30%, Not V. Cheap).
3- Model with X2: y=b0+bX2*X2+e (explains 60%, Not V. Cheap).
4- Model with both: y=b0+bX1*X1+bX2*X2+e (explains 80%, Exp.).
Note: If 2) and 3) are P<0.05, 4) is Statistically better.
* When and Why Multiple Regression?
- When? Research Question > Study Design > [Analysis].
- Why?
A) Observational Studies:
- Etiological: - Usually Examine effect of one determinant on outcome.
- Observational studies have problem of confounders.
- Goal: effect of determinant on outcome while controlling
for potential confounders.
- Prediction: - Goal: Which factors contribute independently to the
Outcome.
- Selection of best variable, while controlling overlaps.
B) Experimental Studies:
- Effect of treatment on outcome.
- Randomization should take care of confounding.
- Taking into account pre-specified variables related to
Outcome may decrease uncertainty & increase power.

* Checking for assumptions:

1. Independence of observations (residuals)
• Observations are independent from one another.
• Knowing the value of one case tells nothing about the value of other
cases.
• To check: Durbin-Watson statistic (a value between 1.5 and 2.5 is acceptable ✓).
2. Normality of residuals
• We calculate standardized residuals.
• Check using:
o Histogram
o Shapiro-Wilk test
o PP Plot

3. Homoscedasticity (Same Scatter)

• Refers to whether the residuals are equally spread across the predicted
values (if not: heteroscedasticity).
• Plot of residuals vs. Predicted values.
4. Linearity
• Means predictors have a straight-line relation with the outcome.
• If residuals are evenly scattered about the regression line, don't worry
about linearity.
• Use scatterplot.
5. Multicollinearity (No)
• Means two or more variables of predictors are highly correlated with
each other (ex. weight & BMI).
• Check Correlation Coefficient between all predictors (Above 0.8) → ×
• Variance Inflation Factor (VIF) values (below 10, better below 5)
6. No influential outliers
• Use Case-wise diagnostics & Cook's distance
• Cook's should be below (4/n). Ex. 4/100 = 0.04
Different classifications of unusual points:
1. Outlier - Observation with a high residual (outcome).
2. Leverage - Observation with an extreme value on a predictor variable.
3. Influential points - Both outlier & leverage. Removing the observation
changes the estimate of correlation.

* Model building:
A) Various criteria can be used to determine which model is best:
- Model Selection Methods:
1. Highest R².
2. Highest adjusted R².
3. Cross-validation.
4. Stepwise selection (forward/backward):
. Forward: Start with no predictors. Add one by one. Stop when adding
predictors no longer significantly improves the model.
. Backward: Start with all predictors. Remove one by one (starting with the
highest P-value). Stop when removing worsens the model.

B) Some potentially explanatory variables can be removed before building
the model. (Researcher's choice).
C) Best model depends on the used dataset. Different dataset = another best
model.

* Model Building & Prediction:

- Model building is a compromise between:
1. Maximum % of explained variation (R²).
2. Minimum of explanatory variables.
N.B.: Extra explanatory variable = extra % of explained variation.
→ Regardless of whether the extra variable is TRULY associated with the
dependent variable.

* Predictive model?
- Best model:
1. Highest Predictive Capability (adjusted R²).
2. Lowest number of predictors.
- How?
1. Automatic selections:
- Forward.
- Backward.
- Stepwise (mix - forward + backward).

2. Manual selections: Enter variables.

* Model building for non-predictive purposes?

- We have the following options:
1. To work as the previous way, to include only stat sig. variables
(Automatic) with cautions.
2. Include All variables with low p-value (<0.2) in simple regression.
3. Include All studied variables or Clinically Imp. Variables.
4. Mixture of the above methods.

* Model Building Slides:

1. Multicollinearity: occurs when explanatory variables are highly
correlated, which distorts individual coefficient estimates,
leading to inconsistent results when variables are added or
removed.
- Indicators: Symptoms of multicollinearity include:
- High correlation among explanatory variables.
- Non-significant individual variables despite high overall
model fit (R²).
- Large coefficient changes when adding/removing variables.
- High variance inflation factor (VIF should be below 10, better below 5)
or low tolerance values. VIF = 1/(1 − R²), where R² comes from
regressing that explanatory variable on the others (Tolerance = 1/VIF).
- Tolerance close to 0 = explanatory variable is highly
correlated to the others in the model
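For the special case of exactly two predictors, the R² in the VIF formula is just the squared correlation between them, so VIF and tolerance can be sketched without a full regression. The function names and data below are hypothetical; with more than two predictors, each predictor would need to be regressed on all the others.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two lists of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF and tolerance when the model has exactly two predictors."""
    r2 = pearson_r(x1, x2) ** 2     # R² of regressing one predictor on the other
    tolerance = 1 - r2
    return 1 / tolerance, tolerance

# Made-up predictor values, for illustration only.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]
vif, tolerance = vif_two_predictors(x1, x2)
print(vif, tolerance)
```

Here the two predictors share 60% of their variance, giving a VIF of 2.5, which is under the usual thresholds of 5 and 10.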

2. Choice of Explanatory Variables.
3. Dummy Variables: created for categorical data with multiple
levels to allow for separate effects by level.
4. Sample Size: To ensure a stable model, a rule of thumb is
recommended—10-15 observations per predictor.
5. Confounders: Stepwise/forward/backward.
6. Missing Values: can lead to selection bias and loss of statistical
power, as cases with any missing data are excluded in
“complete case” analyses.
- Low Missing Rates (<5-10%): May be ignored without
major issues.
- Higher Missing Rates: Options include removing
variables with high missingness (suboptimal) or using advanced
techniques like multiple imputation to replace missing values.
** Single Logistic Regression: (Similar to SLR)
- Dependent Variable/Y/Outcome: Binary.
- Independent Variable/X/Predictor: Numeric, Ordinal, Categorical.

- Equation:
$\ln\left(\dfrac{p}{1 - p}\right) = b_0 + b_1 \times X$, $P = \dfrac{e^{b_0 + b_1 X}}{1 + e^{b_0 + b_1 X}}$, $OR = e^{b_1}$ (b1 = slope)
To calculate the opposite OR = 1/OR (e.g. males to females and vice versa)

Note: the exponential of the coefficient (which is the OR) is an indicator of
the change in odds resulting from a unit change in the predictor.
If there is no association between outcome and predictor, the coefficient will
be ZERO while exp(b) will be ONE.

- Examples:

- Predictor continuous variable:

If we study the association between waist circumference (continuous

variable) and having diabetes (binary variable), The interpretation of the
OR value is as follows:

- For each unit increase in waist circumference (1 cm), the odds of being
diabetic are multiplied by 1.04.
Note: if the waist circumference increases by 3 units (3 cm), the odds of
being diabetic are multiplied by 1.04 x 1.04 x 1.04 ≈ 1.12. (They are not
multiplied by 1.04 x 3).

- If the OR value is greater than 1: as the predictor increases, the odds of

the outcome occurring increase.
- If the OR value is less than 1: as the predictor increases, the odds of the
outcome occurring decrease.
- If the OR value is 1: no change (no association).
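The multiplicative behaviour of the OR can be checked with a few lines of arithmetic. The OR of 1.04 comes from the example above; the function `predicted_probability` and the intercept used in the last line are hypothetical values for illustration only.

```python
import math

or_per_cm = 1.04                 # OR for a 1 cm increase (from the example above)
b1 = math.log(or_per_cm)         # the regression coefficient is ln(OR)

# A 3 cm increase multiplies the odds by exp(3 * b1) = 1.04 ** 3, not 1.04 * 3.
or_3cm = math.exp(3 * b1)

def predicted_probability(b0, b1, x):
    """P = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    log_odds = b0 + b1 * x
    return math.exp(log_odds) / (1 + math.exp(log_odds))

print(round(or_3cm, 4))
# Hypothetical intercept b0 = -5, waist circumference 100 cm.
print(round(predicted_probability(-5.0, b1, 100), 3))
```

A coefficient of zero gives exp(0) = 1, which is why an OR of 1 means no association.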

- Predictor binary variable:

If we study the association between hypertension (binary variable) and
having diabetes (binary variable).
- For patients with hypertension, the odds of having diabetes is 2.3 times
the odds of having diabetes among patients who don’t have hypertension.

- If the OR value is greater than 1: the odds of the outcome occurring are
higher in the higher coded group (coded as 1).
- If the OR value is less than 1: the odds of the outcome occurring are
lower in the higher coded group (coded as 1).
- If the OR value is 1: no association.
- Predictor categorical variable:

If we study the association between smoking status (categorical variable)

and having bladder cancer (binary variable).
Note: for the smoking variable, the “never smokers” group is considered the
reference category. All other smoking categories are compared to this group.

- For occasional smokers, the odds of having bladder cancer is 1.5 times
the odds of having bladder cancer among never smokers.
- For former smokers, the odds of having bladder cancer is 2.3 times the
odds of having bladder cancer among never smokers.
- For current smokers, the odds of having bladder cancer is 5.2 times the
odds of having bladder cancer among never smokers.
There is an association between being a former smoker or a current smoker
and having bladder cancer.

Note: The OR compares the odds of occurrence of the outcome in each

category compared to the reference category
- If the OR value is greater than 1: the odds of the outcome occurring are
higher in this category as compared to the reference category.
- If the OR value is less than 1: the odds of the outcome occurring are
lower in this category as compared to the reference category.
- If the OR value is 1: no difference from the reference category.

Note: If the confidence interval contains 1, there is no statistically
significant association.

Note2: The OR compares the odds of occurrence of the outcome in the

higher coded group (1) to the odds of the lower coded group (0).
- It is always important to recognize the reference category. If the sex is
coded 0 for males and 1 for females and the resulting OR = 1.5. This means
that the odds of having the outcome in females is 1.5 times that of males.
- If the coding is reversed, 0 for females and 1 for males, the resulting OR is
0.67. This means that the odds of having the outcome in males is 0.67 that
of the females. The two results are the same, the difference is only which
one is used as a reference group.

- Interpretation of 95% CI:

- In linear regression, if the 95% CI of the coefficient crosses 0, the
result is statistically non-significant (true value may be 0, indicating no
relationship).
- In logistic regression, if the 95% CI of the OR crosses 1, the result is
statistically non-significant (true value may be 1, indicating no
difference/change).
** Multiple Logistic Regression:

$\ln\left(\dfrac{p}{1 - p}\right) = b_0 + b_1 \times X_1 + b_2 \times X_2$

- Crude and Adjusted Odds Ratios:

• Crude Odds Ratios:

o Result from simple logistic regression
o Measure the association between two variables without adjusting for other
variables
• Adjusted Odds Ratios:
o Result from multiple logistic regression
o Measure the association between two variables while adjusting (controlling)
for other variables in the model
• Reporting:
o Sometimes, both crude and adjusted odds ratios are presented to show how
estimates change after adjustment.
o 95% confidence intervals are usually reported for adjusted odds ratios, but not
always for crude odds ratios.

Additional Notes:

• Large changes in odds ratios after adjustment might indicate the presence of
confounding factors or effect modifiers.
• The interpretation of odds ratios in multiple logistic regression is similar to that in
simple logistic regression.
• OR = $e^{b}$

- What to report from the regression output:

1- The ORs (Unadjusted).

2- The ORs (Adjusted).

3- The 95% CI of the Adjusted OR (not significant if contains 1).

4- The p-value (for significance of association).

** Dummy Variables:
- Normally, Categorical variables have 2 categories (Y/N, M/F).
However, in some cases we might have a Categorical variable with
more than 2 categories (Small/Medium/Large). In this case
we use Dummy Variables:

1- Create Dummy Variables (K-1, K = number of categories).

- Each one has two values: (0/1, Y/N).

| Category | Medium | Large |
|-------------------|--------|-------|
| Medium | 1 | 0 |
| Large | 0 | 1 |
| Small (reference) | 0 | 0 |
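A minimal sketch of K-1 dummy coding, with "Small" as the reference category. The function name `dummy_code` and the `is_` column prefix are made up for this example.

```python
def dummy_code(values, reference):
    """Create K-1 dummy (0/1) variables; the reference category gets all zeros."""
    levels = sorted(set(values) - {reference})
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]

# Hypothetical observations of a three-category variable.
sizes = ["Small", "Medium", "Large", "Medium"]
coded = dummy_code(sizes, reference="Small")
for size, row in zip(sizes, coded):
    print(size, row)
```

Each dummy coefficient in the regression is then interpreted relative to the reference category.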

** Pearson’s Chi-Square test: (non-parametric test)

- The chi-square test for independence (also called Pearson's

chi-square test or the chi-square test of association) is used
to study if there is a relationship between two categorical
variables. (dependent and independent)

Usage: Used to study if there is a relationship/association

between two categorical variables.

How expected values are calculated:

1. We make a contingency table (e.g. 2x2) for the 2 variables (sex, preferred
drink) with the observed (actual) values.
2. We add the totals to the rows and columns.
3. We calculate the "Expected Value" for each cell. This is done
by multiplying that cell's row total by its column total and
dividing that by the overall total.
4. We then get the expected values (if there is no association
between the two variables).
5. The chi-square test works by comparing the observed values
(actual data) to the expected values (if there is no
association). (Assumption: less than 20% of the cells have
expected count less than 5).
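The expected-value steps above can be sketched as follows. The counts are made up (they are not the sex/preferred-drink data from the notes), and the function names are hypothetical.

```python
def expected_counts(observed):
    """Expected cell count = row total * column total / overall total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

# Made-up 2x2 table of observed counts.
observed = [[20, 30],
            [30, 20]]
expected = expected_counts(observed)
statistic = chi_square(observed)
print(expected, statistic)
```

For a 2x2 table the statistic has 1 degree of freedom; a value above 3.841 corresponds to p < 0.05.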

Note: Assumption not met?

- Fisher’s exact (for 2x2 table)

- Exact test (for more than 2x2 table)

- Merge categories where possible.

Interpretation: If p<0.05, there is a significant relationship between
the two variables.

Reporting: chi-square test was conducted between sex and

preferred drink. There was a statistically significant association
between sex and the preferred drink, p = 0.043. A higher
percentage of females (48.8%) prefer coffee as compared to males
(42.3%).

** Agresti-Coull Confidence interval:

- r = number of successes.
- n = sample size in each group.
- Z = 1.96 if 95% CI

- For each Category (x and y):

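1- $\tilde{r} = r + \dfrac{Z^2}{2}$

2- $\tilde{n} = n + Z^2$

3- $\tilde{p} = \dfrac{\tilde{r}}{\tilde{n}}$

4- $Diff.\ \tilde{p} = \tilde{p}_x - \tilde{p}_y$

5- $S.E.(diff) = \sqrt{\dfrac{\tilde{p}_x (1 - \tilde{p}_x)}{\tilde{n}_x} + \dfrac{\tilde{p}_y (1 - \tilde{p}_y)}{\tilde{n}_y}}$

6- $Diff.\ \tilde{p} \pm (Z \times S.E.(diff)) = (\text{lower bound}, \text{upper bound})$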
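The six steps can be sketched as below, following the notes' recipe of adjusting each group separately and then combining. The function names and the counts in the example are assumptions.

```python
import math

def adjusted_proportion(r, n, z=1.96):
    """Steps 1-3: add Z^2/2 successes and Z^2 observations, then take the ratio."""
    r_adj = r + z ** 2 / 2
    n_adj = n + z ** 2
    return r_adj / n_adj, n_adj

def difference_ci(rx, nx, ry, ny, z=1.96):
    """Steps 4-6: CI for the difference between two adjusted proportions."""
    px, nx_adj = adjusted_proportion(rx, nx, z)
    py, ny_adj = adjusted_proportion(ry, ny, z)
    diff = px - py
    se = math.sqrt(px * (1 - px) / nx_adj + py * (1 - py) / ny_adj)
    return diff - z * se, diff + z * se

# Hypothetical counts: 30/50 successes in group x, 20/50 in group y.
low, high = difference_ci(30, 50, 20, 50)
print(round(low, 3), round(high, 3))
```

If the resulting interval contains 0, the difference between the two proportions is not statistically significant at the chosen level.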

* Another equation for calculating the CI:

$mean \pm Z \left(\dfrac{SD}{\sqrt{n}}\right)$

Multiple regression with dummy: $b \pm Z \times S.E. = C \rightarrow e^{C}$

* Z-Values for CIs:

- 80% = 1.282 - 85% = 1.440 - 90% = 1.645 - 95% = 1.96

- 99% = 2.576 - 99.5% = 2.807 - 99.9% = 3.291

Survival Analysis

- Survival analysis is concerned with the time until an event

occurs (time to event). This event is usually death, as survival
after breast cancer, but can be any other event.
- Time from operation to death
- Time from response till the recurrence of a tumor
- Time from operation to discharge from the hospital

- Survival time:
- Survival times are calculated from baseline time (Start) to
the endpoint (Event).
- Objectives of survival analysis:
- To estimate the time to event for a group of
individuals, such as the time until the second heart attack
for a group of myocardial infarction patients.
- To compare the time to event between two or more
groups, such as comparing time to second heart attack
between
- To calculate the survival probability at a certain time,
the probability that patients will survive for 1 or 5 years
(after diagnosis of lung cancer).

- Characteristics of survival data:

- Individuals do not enter the study at the same time.
- When the study ends, some individuals still haven't had
the event yet.
- Other individuals drop out or get lost in the middle of the
study, and all we know about them is the last time they
were still "free" of the event.

- Survival analysis terms:

- Time to event: The time from entry into a study until a
subject has a particular event (outcome).
- Censoring: subjects lost to follow up/drop out or the
study ends before they die.

Note: t-test or one way ANOVA could be used to examine the

influence of the explanatory variables on the survival time.
Note: Chi-Square test could be used to examine the influence of
the explanatory variables on the probability of dying within 4
years after start of the therapy OR of being dead at the end of the
study.

However, the drawbacks of these analyses?

- The actual survival time of those who are still alive is larger than the
observed survival time.
- Individuals who drop out or are lost to follow-up are ignored.

Also, we can use the Cumulative incidence = dead/cases, or the
Incidence rate = dead/sum of time. But the cumulative incidence doesn’t take
into account that some subjects are only partly observed because of censoring,
and the incidence rate assumes that all subjects have the same survival time
distribution.

As a result, Survival data can’t be analyzed by standard

methods:
1. Survival time is not normally distributed. → Kaplan-Meier.
2. Survival time may often be censored.

In survival analysis, three key concepts are used to describe the

time until an event occurs (such as failure or death): the
cumulative distribution function (CDF), the survival
function, and the hazard rate.

1. Cumulative Distribution Function (CDF):

- represents the probability that the event of interest (like

death or failure) has occurred by a certain time.
- Range: The CDF starts at 0 indicating no one has yet
experienced the event and approaches 1 if eventually
everyone will experience the event.
Given the Mean and Time unit (exponential model):

$F(t) = 1 - e^{-\lambda t}$, while $\lambda = \dfrac{1}{Mean}$
- Ex: (Mean = 200 hours, so λ = 0.005) F(100) = 1 – e^(-0.005 x 100) = 0.3935
- There’s a 39.35% prob. that X will fail within 100 Hours.
2. Survival Function:

- represents the probability that the event has not
occurred by time t, i.e. the probability that a subject survives
(or avoids the event) at least until time t.
- Range: The survival function starts at 1 (indicating 100%
survival at the beginning) and decreases to 0 (assuming
all subjects will eventually experience the event).

𝑺(𝒕) = 1 − 𝐹(𝑡)

- Ex: S (100) = 1 – 0.3935 = 0.6065

- there’s a 60.65% prob. that X will survive more than 100
Hours.
3. Hazard Rate:

- represents the instantaneous risk of the event occurring

at a particular time, given that the individual has
survived up to that time
- Range: The hazard rate is typically a non-negative value,
and it can vary over time. It is not a probability but a
rate, and it can be greater than 1.

$\lambda = \dfrac{1}{Mean}$
- Ex: h(t) = 1 / 200 = 0.005
- Hazard rate is 0.005 failures per hour, meaning there is roughly a
0.5% chance per hour that X will fail.
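The three quantities can be tied together in a short sketch, assuming an exponential lifetime with a mean of 200 hours (the value implied by the hazard example above). The function names are made up.

```python
import math

mean_lifetime = 200.0          # hours, assumed from the hazard example
lam = 1 / mean_lifetime        # constant hazard rate

def cdf(t):
    """F(t): probability that the event has occurred by time t."""
    return 1 - math.exp(-lam * t)

def survival(t):
    """S(t) = 1 - F(t): probability of surviving beyond time t."""
    return 1 - cdf(t)

print(lam, round(cdf(100), 4), round(survival(100), 4))
```

Under this model the hazard is the same at every time point, which is a strong assumption; real survival data often have a hazard that changes over time.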

* Kaplan-Meier:

- The Kaplan-Meier method is commonly used in medical

research to estimate survival rates and compare survival
between different groups (e.g., treatment vs. control) by
using statistical tests like the log-rank test.

- Kaplan-Meier Survival Function:

- The Kaplan-Meier survival function estimates the
probability of surviving past a certain time (t). This is done
by calculating the probability of surviving at each event time
and then multiplying these probabilities together.

- Kaplan-Meier Equation:

$S(t) = \prod_{t_i \le t} \left(1 - \dfrac{d_i}{n_i}\right)$ or $\prod_{t_i \le t} \left(\dfrac{n_i - d_i}{n_i}\right)$

- where:
- t_i is the time of each observed event (e.g., death,
failure).
- d_i is the number of events (deaths) that occurred at
time t_i.
- n_i is the number of individuals "at risk" (i.e., who
have not yet had the event and have not been
censored) just before time t_i.

- Example of Kaplan-Meier Calculation:

| Patient | Time (days) | Event (Death) |

|---------|-------------|---------------|
|1 | 10 | Yes |
|2 | 15 | No (Censored) |
|3 | 20 | Yes |
|4 | 25 | Yes |
|5 | 30 | No (Censored) |

- Step-by-Step Calculation

1. At (t = 10):
- Number at risk (n1 = 5) (all 5 patients are still in the study).
- Number of events (d1 = 1) (Patient 1 died at t = 10).
- Survival probability at (t = 10):
- S (10) = 1 – (1/5) = 0.8

2. At (t = 20):
- Number at risk (n2 = 3) (Patient 1 died at t = 10 and Patient 2 was censored at t = 15).
- Number of events (d2 = 1) (Patient 3 died at t = 20).
- Conditional survival probability at (t = 20):
- 1 – (1/3) ≈ 0.667
- Cumulative survival probability up to (t = 20): S (20) = 0.8 x 0.667 ≈ 0.533

and so on….

- Interpreting the Kaplan-Meier Curve:

- Survival Probability At (t = 25): continuing the example, n3 = 2 and d3 = 1,
so the survival probability is 0.533 x (1 – 1/2) ≈ 0.267, meaning there’s
about a 27% chance that a randomly chosen patient from this group will survive
beyond 25 days.
- Censoring: The curve remains constant during censored
periods, indicating that no event has been observed, but we
still account for individuals who are "at risk."
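A minimal sketch of the Kaplan-Meier calculation on the five-patient table above. The function name is hypothetical. Note that a patient censored before an event time (Patient 2, censored at day 15) is no longer counted in the risk set at later event times.

```python
def kaplan_meier(times, events):
    """Return [(event_time, n_at_risk, deaths, cumulative_survival)]."""
    data = list(zip(times, events))
    surv = 1.0
    table = []
    for t in sorted({ti for ti, died in data if died}):
        n_at_risk = sum(1 for ti, _ in data if ti >= t)            # still under observation
        deaths = sum(1 for ti, died in data if ti == t and died)   # events exactly at t
        surv *= 1 - deaths / n_at_risk
        table.append((t, n_at_risk, deaths, surv))
    return table

# The five patients from the example: True = death, False = censored.
times = [10, 15, 20, 25, 30]
events = [True, False, True, True, False]
km_table = kaplan_meier(times, events)
for row in km_table:
    print(row)
```

Because the largest time (day 30) is censored rather than an event, the estimated curve stays above zero at the end of follow-up.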

- Comments on Kaplan-Meier curve:

• Plus signs (+) indicate censored observations.
• At times where the curve drops, an event has occurred.
• Deaths can only occur at observed death times, so the
curve remains horizontal between those time points.
• If the largest survival time is associated with an event,
cumulative survival will drop to 0, otherwise (as here) it will
remain at a positive value.

- Ways of estimation of the “average” survival time:

1. Take the mean (or median) of the survival times of all
subjects, disregarding censoring.
2. Take the mean (or median) of only those individuals who
died.
3. Use the Kaplan-Meier curve to estimate the median
survival time as the time at which the curve drops below
0.5.

Note: that the underlying distribution of the survival times

is often positively skewed. So, the median survival time is a
more appropriate measure of centrality.
Note: that the first two methods can give severely biased
estimates
- The mean and median are underestimated with methods 1
and 2:
• Disregarding censoring means that you assume that
subjects died at the given time point, while censoring
indicates that the individual survived up to and probably
beyond this time.
• Leaving out censored observations would only be OK if the
(unobserved) survival distribution of the censored
observations has the same mean (median) as that of the
events.
• Usually, we have more censoring in the higher survival
times. Leaving the censored observations out leads to
underestimation.
- Estimation of the standard error of the Kaplan-Meier
survival function:

$SE\big(S(t)\big) = S(t) \times \sqrt{\sum \dfrac{d_i}{n_i (n_i - d_i)}}$

- Step-by-Step Example Calculation:

| Time | Events | At Risk |
|------|--------|---------|
| 10 | 1 | 5 |
| 20 | 1 | 4 |
| 30 | 1 | 3 |

- Let's assume we want to calculate the Kaplan-Meier

survival estimate and the standard error up to time
(t=30).

1. Calculate the Kaplan-Meier Survival Estimates:

- At t = 10:
S (10) = 1 x ( 1 – 1/5) = 0.8

- At t = 20:
S (20) = 0.8 x (1 - ¼) = 0.6

- At t = 30:
S (30) = 0.6 x (1 – 1/3) = 0.4

So, the Kaplan-Meier survival estimates are:

- S (10) = 0.8
- S (20) = 0.6
- S (30) = 0.4
2. Calculate the Standard Error:

We will calculate the standard error at t = 30, so we
sum up the contributions $\dfrac{d_i}{n_i (n_i - d_i)}$ at each event time up to t = 30:
- For t = 10: $\dfrac{1}{5(5 - 1)} = 0.05$.
- For t = 20: $\dfrac{1}{4(4 - 1)} = 0.0833$.
- For t = 30: $\dfrac{1}{3(3 - 1)} = 0.1667$.

Summing these values gives:

0.05 + 0.0833 + 0.1667 = 0.3

S(30) x √0.3 = 0.2191

So, the standard error of the Kaplan-Meier survival

estimates at t = 30 is approximately 0.2191.
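The same calculation can be sketched in code, using the Time / Events / At Risk rows from the table above. The function name is made up, and the 95% interval at the end uses the plain S(t) ± Z x S.E. form.

```python
import math

def km_with_se(rows):
    """rows: (time, events, at_risk). Returns (time, S(t), SE of S(t)) per row."""
    surv, running_sum = 1.0, 0.0
    out = []
    for t, d, n in rows:
        surv *= 1 - d / n
        running_sum += d / (n * (n - d))    # undefined if everyone at risk dies (n == d)
        out.append((t, surv, surv * math.sqrt(running_sum)))
    return out

rows = [(10, 1, 5), (20, 1, 4), (30, 1, 3)]
results = km_with_se(rows)
t30, s30, se30 = results[-1]
ci_95 = (s30 - 1.96 * se30, s30 + 1.96 * se30)
print(round(s30, 4), round(se30, 4), ci_95)
```

With so few patients the lower bound of the plain interval falls slightly below 0, which shows that this simple form can be unreliable for small samples.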

- Confidence Intervals for the Survival Function:

𝑆(𝑡) ± 𝑍 𝑥 𝑆. 𝐸(𝑆(𝑡))

* Log-Rank Test:

- The log-rank test compares two or more survival

functions.

- If the two groups have different average survival times, are the
observed differences in the survival times between the
two groups likely to be due to chance variation?

- Various non-parametric tests exist to answer this
question; one of them is the Log-Rank test:
• Suppose there are r times where events occur in
one or both groups.
• Let nij be the number of patients in group i = 1 or
2 just before time tj.
• Let dij be the number of events in group i at tj.
• Then: nj = n1j + n2j and dj = d1j + d2j

- Under the null hypothesis of no difference between the

survival functions of the two groups, the expected
number of events at time tj in group i is given by:

$e_{ij} = n_{ij} \times \left(\dfrac{d_j}{n_j}\right)$

- Discrepancies between the observed number of events

dij and the expected number of events eij provide
evidence against the null hypothesis.

$e_i = \sum_j e_{ij}$, $d_i = \sum_j d_{ij}$ for i = 1 and 2

- Then the test statistic has a chi- squared distribution

with one degree of freedom.

$T = \dfrac{(d_1 - e_1)^2}{e_1} + \dfrac{(d_2 - e_2)^2}{e_2}$
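A minimal sketch of the log-rank statistic for two groups, following the observed-versus-expected form above. The function name and the two tiny made-up groups are assumptions; the sketch does not handle the edge case where a group's expected count is zero.

```python
def log_rank(times1, events1, times2, events2):
    """Return (T, d1, e1, d2, e2) for two groups of (time, event) data."""
    all_pairs = list(zip(times1, events1)) + list(zip(times2, events2))
    event_times = sorted({t for t, died in all_pairs if died})
    d1 = e1 = d2 = e2 = 0.0
    for t in event_times:
        n1j = sum(1 for ti in times1 if ti >= t)      # at risk in group 1 just before t
        n2j = sum(1 for ti in times2 if ti >= t)      # at risk in group 2 just before t
        d1j = sum(1 for ti, died in zip(times1, events1) if ti == t and died)
        d2j = sum(1 for ti, died in zip(times2, events2) if ti == t and died)
        dj, nj = d1j + d2j, n1j + n2j
        e1 += n1j * dj / nj                           # expected events under H0
        e2 += n2j * dj / nj
        d1 += d1j
        d2 += d2j
    stat = (d1 - e1) ** 2 / e1 + (d2 - e2) ** 2 / e2
    return stat, d1, e1, d2, e2

# Made-up data: group 1 fails early, group 2 fails late, no censoring.
stat, d1, e1, d2, e2 = log_rank([1, 2, 3], [True] * 3, [4, 5, 6], [True] * 3)
print(round(stat, 4), d1, round(e1, 2), d2, round(e2, 2))
```

The total of the expected events always equals the total of the observed events, which is a useful check. Here T is compared with the chi-squared critical value for 1 degree of freedom (3.841 at the 5% level).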
* Sample Size for Log-Rank Test:

- The power of the log-rank test depends on the total

number of events, rather than the total number of
individuals.
- Therefore, the number of events and patients required
are calculated in three steps:

- Step One: Specify the type I error (α), the power (1-β)
and the relevant treatment effect.

$\delta_0 = \dfrac{\log(P_E)}{\log(P_C)} = HR$

- where P_E and P_C are the estimated
survival probabilities in the experimental
and control groups after a certain follow-up
time T.

- Step Two: estimate d, number of events.

$d = \left(\dfrac{1 + \delta_0}{1 - \delta_0}\right)^2 (Z_\alpha + Z_\beta)^2$

- Step Three: use d to estimate N.

$N \geq \dfrac{2d}{2 - P_C - P_E}$

Note: Z for power: from Normal table

- 75% = 0.674 - 80% = 0.84 - 85% = 1.036
- 90% = 1.28 - 95% = 1.645
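The three steps can be sketched as below. The function name, the survival probabilities (60% experimental, 40% control), and the choice to round the number of events up before step three are assumptions for illustration; Z values are taken from the tables above (1.96 for a 95% CI and 0.84 for 80% power).

```python
import math

def log_rank_sample_size(p_e, p_c, z_alpha=1.96, z_beta=0.84):
    """Return (delta0, events needed, patients needed), rounded up."""
    delta0 = math.log(p_e) / math.log(p_c)                              # Step one: HR
    d = ((1 + delta0) / (1 - delta0)) ** 2 * (z_alpha + z_beta) ** 2    # Step two
    d_events = math.ceil(d)
    n_patients = math.ceil(2 * d_events / (2 - p_c - p_e))             # Step three
    return delta0, d_events, n_patients

# Hypothetical survival probabilities at the end of follow-up.
delta0, d_events, n_patients = log_rank_sample_size(p_e=0.6, p_c=0.4)
print(round(delta0, 4), d_events, n_patients)
```

The number of patients is larger than the number of events because not every patient is expected to have the event during follow-up.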

