
Correlation and Regression Analysis Guide

The document discusses correlation and regression analysis, detailing Pearson's correlation coefficient, its properties, and calculation steps, as well as Spearman's rank correlation for non-parametric measures. It explains simple and multiple linear regression, including assumptions, model building, and checking for normality and homogeneity of variance. Additionally, it covers model selection methods and the importance of predictive capability in regression models.

Uploaded by

rabiaalrabea
Copyright
© All Rights Reserved

Correlation & Regression

- Pearson’s Correlation Coefficient:


- The direction is indicated by the sign of r (+ve or -ve).
- The strength is indicated by the absolute value of r: the closer |r| is
to 1, the closer the dots lie to the line.

* We use scatterplot to identify:


1- Form: Linear, Curved, Cluster or No pattern.
2- Direction: Positive, Negative, No direction.
3- Strength: How closely it fits the line: Weak, Moderate or Strong.

* Properties of r:
1- r ranges from -1 to +1.
2- Dimensionless (the units of X and Y cancel out when the covariance is divided by SDx x SDy).
3- If X and Y are independent then r=0 (the opposite isn’t correct: r=0 doesn’t imply independence).

* Cautions with Correlation:


1- Correlation doesn’t imply Causation.
2- A high correlation coefficient doesn’t necessarily mean there is a real (statistically significant) relationship.
3- Significance depends on Sample size & Size of Correlation Coefficient.

-How to calculate it?


First Step: Mean of X & Y:

$X_m = \frac{\text{Sum of values}}{\text{Number of values}}$ (and likewise for $Y_m$)

Second Step: Deviations from the mean:

| Xd | Yd |
|----|----|
| X1 - Xm | Y1 - Ym |
| X2 - Xm | Y2 - Ym |
| X3 - Xm | Y3 - Ym |

Third Step: Covariance:

$\text{Covariance} = \frac{(Xd_1 \times Yd_1) + (Xd_2 \times Yd_2) + \dots}{n - 1}$

Fourth Step: Variance and Standard Deviation of X and Y:

$\text{Variance}_X = \frac{Xd_1^2 + Xd_2^2 + \dots}{n - 1}, \quad SD_x = \sqrt{\text{Variance}_X}$

$\text{Variance}_Y = \frac{Yd_1^2 + Yd_2^2 + \dots}{n - 1}, \quad SD_y = \sqrt{\text{Variance}_Y}$

Fifth Step: Correlation Coefficient:

$r = \frac{\text{Covariance}}{SD_x \times SD_y}$

Sixth Step: Standard Error:

• Degree of Freedom = n-2

$S.E. = \sqrt{\frac{1 - r^2}{n - 2}}$

While,

$t = \frac{r}{S.E.}$
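The six steps can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a statistics package: the function name `pearson_summary` and the data are made up for the example, and the standard error is undefined when r is exactly +1 or -1.

```python
import math

def pearson_summary(x, y):
    """Follow the six steps: means, deviations, covariance, SDs, r, S.E. and t."""
    n = len(x)
    mean_x = sum(x) / n                      # Step 1: means
    mean_y = sum(y) / n
    xd = [v - mean_x for v in x]             # Step 2: deviations from the mean
    yd = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(xd, yd)) / (n - 1)   # Step 3: covariance
    sd_x = math.sqrt(sum(a * a for a in xd) / (n - 1))   # Step 4: SDs
    sd_y = math.sqrt(sum(b * b for b in yd) / (n - 1))
    r = cov / (sd_x * sd_y)                  # Step 5: correlation coefficient
    se = math.sqrt((1 - r ** 2) / (n - 2))   # Step 6: standard error (df = n - 2)
    t = r / se
    return {"cov": cov, "sd_x": sd_x, "sd_y": sd_y, "r": r, "se": se, "t": t, "df": n - 2}

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
result = pearson_summary(x, y)
print(result)
```

The t value is then compared with the t distribution on n-2 degrees of freedom to judge significance.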

Later On: (Simple Linear Regression):


SLR = Intercept + (Slope x X)

While,

$\text{Slope} = \frac{\text{Covariance}}{\text{Variance}_X}$ or $r \times \frac{SD_y}{SD_x}$

And,

$\text{Intercept} = M_y - (\text{Slope} \times M_x)$

Also, Confidence Interval using Fisher’s Z-Transformation (r is known) :

Step one: Get Fisher’s Z:

$Z = \frac{1}{2} \ln\left(\frac{1 + r}{1 - r}\right)$

Step two: S.E. of Z:

$SE_z = \frac{1}{\sqrt{n - 3}}$

Step three: Determine Z-score for CI (1.96 for CI 95%):

$Z \pm 1.96 \times SE_z$ = (lower bound, upper bound)

Step four: Convert CI for Z back to r:

$r = \frac{e^{2Z} - 1}{e^{2Z} + 1}$ (do it twice, for the lower and for the upper bounds)
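The four Fisher steps can be chained into one small function. This is only a sketch: `fisher_ci` is a made-up name, and r = 0.6 with n = 28 are hypothetical inputs chosen so that the S.E. of Z comes out as 1/5.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Confidence interval for r via Fisher's Z-transformation."""
    z = 0.5 * math.log((1 + r) / (1 - r))    # Step 1: Fisher's Z
    se_z = 1 / math.sqrt(n - 3)              # Step 2: S.E. of Z
    lower_z = z - z_crit * se_z              # Step 3: CI on the Z scale
    upper_z = z + z_crit * se_z

    def back_to_r(value):                    # Step 4: convert Z back to r
        return (math.exp(2 * value) - 1) / (math.exp(2 * value) + 1)

    return back_to_r(lower_z), back_to_r(upper_z)

# Hypothetical inputs, for illustration only.
lower, upper = fisher_ci(r=0.6, n=28)
print(round(lower, 3), round(upper, 3))
```

Note that the interval is not symmetric around r, because the back-transformation is non-linear.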

- Spearman’s Rank Correlation (p):

- It’s a non-parametric measure of correlation that assesses the strength
and the direction of the monotonic relation between two variables.
- Used for:
- Ordinally scaled variables.
- Non-normally distributed variables.
- Non-linear monotonic relationships.

- Procedure:
- Rank both variables from lowest to highest. Ranking makes the measure
less sensitive to outliers & non-normal distributions.
- Calculate a Pearson’s correlation coefficient on the ranks.

- Equation:

$P = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$
($d_i$ = difference between the two ranks of pair i, n = number of pairs)
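A short sketch of the rank-difference formula follows. It assumes there are no tied values (with ties, Pearson's r on the ranks should be used instead); the helper names and the data are hypothetical.

```python
def ranks(values):
    """Rank from lowest (1) to highest; this sketch assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """1 - 6 * sum(d^2) / (n * (n^2 - 1)), with d = difference in ranks."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

# Hypothetical data: monotonic but clearly non-linear.
x = [10, 20, 30, 40, 50]
y = [1, 4, 9, 16, 26]
coef_up = spearman(x, y)                    # perfect positive monotonic relation
coef_down = spearman(x, list(reversed(y)))  # perfect negative monotonic relation
print(coef_up, coef_down)
```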

- Interpreting:
- P = +1 → Perfect positive monotonic relation (X increase, Y Increase)
- P = -1 → Perfect negative monotonic relation (X increase, Y Decrease)
- P = 0 → No monotonic relationship (No consistent relation)
- 0 < P < 1 → Positive monotonic relation (X increase, Y Increase) but not
perfectly.
- -1 < P < 0 → Negative monotonic relation (X increase, Y Decrease) but not
perfectly.

- Regression:

* Types:
- Dependent Variable/Outcome/Y (Response Variable):
1- Continuous: Linear Regression. Ex. HR, BP…. etc
2- Binary: Logistic Regression. Ex. y/n, Disease/No Disease….etc
3- Time to Event: Cox Regression. Ex. Time to death, recurrence.
- Independent Variable/Predictor/X (Explanatory):
1- One Predictor: Simple Regression.
2- Multiple Predictors: Multiple Regression.

* Uses:
1- As a prediction tool.
2- to control for confounders.

** Simple Linear Regression:


SLR = Intercept + (Slope x X) + e (Residuals)

While,

$\text{Slope} = \frac{\text{Covariance}}{\text{Variance}_X}$ or $r \times \frac{SD_y}{SD_x}$

And,

$\text{Intercept} = M_y - (\text{Slope} \times M_x)$

e = Observed Value – Predicted Value

The Coefficient of determination = $r^2$ (a measure of how successful the
regression was in explaining the response)
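As a rough sketch, the slope, intercept, residuals and coefficient of determination can be computed directly from these definitions. The function name and the data are hypothetical, and a real analysis would normally use a statistics package.

```python
def simple_linear_regression(x, y):
    """Slope = covariance / variance of X; intercept = My - slope * Mx."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    # Residual e = observed value - predicted value
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    r_squared = cov ** 2 / (var_x * var_y)   # coefficient of determination
    return slope, intercept, residuals, r_squared

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, intercept, residuals, r_squared = simple_linear_regression(x, y)
print(slope, intercept, r_squared)
```

With least squares and an intercept in the model, the residuals always sum to zero, which is a quick sanity check on the fit.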

* Assumptions:
1- The residuals are normally distributed.
2- Homogeneity of Variance.
3- Independence of the measurement.
4- Linearity of the model.

* What to do in case of bad fit?


1- Non-linear transformation of X or Y. (e.g. log(x))
2- Non-linear Regression.
3- Weighted Regression.
4- Multiple Regression.

* How can we check the Normality of the residuals?


1- First determine the residuals (e).
2- Ways to check Normality: Most common way (Normal Probability Plot)
- Histogram/Boxplot*
- Tests: Shapiro-Wilk and Kolmogorov-Smirnov.

* How can we check the Homogeneity of the Variance?
- Make a residual plot (scatter plot of the residuals vs. the predicted values):
- Graph shows horizontal band of equal heights = Homogeneity.
- Graph shows a band of equal heights but not horizontal = No.

* Linear Regression & ANOVA:

- The regression analysis can also be viewed as a one-way ANOVA with a
specific kind of post-hoc test.

- Equations: ( A = B + C )

1. Deviation between measurement & overall mean: (A)
$Y_i - \bar{Y}$
2. Deviation between measurement & regression line: (B)
$Y_i - \hat{Y}_i$ ($\hat{Y}_i$ = value on the regression line)
3. Deviation between regression line & overall mean: (C)
$\hat{Y}_i - \bar{Y}$
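* Table (ANOVA) for Regression: SE = Square Root of MSE

| Source | SS | df | MS | F | Sig. |
|--------|----|----|----|---|------|
| Regression | SSR | K-1 | MSR = SSR/df | MSR/MSE | |
| Residuals | SSE | N-K | MSE = SSE/df | | |
| Total | SSR+SSE | N-1 | Var(y) | | |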
- Other Equations:
$R^2 = \frac{SS_{Reg}}{SS_{Tot}}$, $Var(y) = \frac{SS_{Tot}}{N - 1}$, $\text{Adjusted } R^2 = 1 - \frac{(n - 1)}{(n - k - 1)} \times (1 - R^2)$
- H0: 𝐵1 = 0
- H1: 𝐵1 ≠ 0
- F follows F distribution with dfReg & dfRes degrees of freedom.
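The ANOVA quantities can be sketched as below. The assumptions are mine: k is the number of predictors (so K = k + 1 parameters including the intercept, giving df = k and n - k - 1), the data are hypothetical, and the fitted values come from the least-squares line y = 2.2 + 0.6x for those data.

```python
def regression_anova(y, fitted, k):
    """Sums of squares, R^2, adjusted R^2 and F for a model with k predictors."""
    n = len(y)
    mean_y = sum(y) / n
    ssr = sum((f - mean_y) ** 2 for f in fitted)            # regression SS
    sse = sum((obs - f) ** 2 for obs, f in zip(y, fitted))  # residual SS
    sst = sum((obs - mean_y) ** 2 for obs in y)             # total SS
    msr = ssr / k                    # df regression = K - 1 = k
    mse = sse / (n - k - 1)          # df residual = N - K = n - k - 1
    r2 = ssr / sst
    adj_r2 = 1 - (n - 1) / (n - k - 1) * (1 - r2)
    return {"SSR": ssr, "SSE": sse, "SST": sst, "F": msr / mse,
            "R2": r2, "adj_R2": adj_r2, "Var_y": sst / (n - 1)}

# Hypothetical data and their least-squares fitted values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fitted = [2.2 + 0.6 * v for v in x]
table = regression_anova(y, fitted, k=1)
print(table)
```

In simple linear regression the F value equals the square of the t value for the slope, so both tests give the same p-value.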

* Proportion of Explained Variance $R^2$:

- is the proportion of variance in Y explained by the regression on X.

** Multiple Linear Regression:


MLR = Intercept + (Slope1 x X1) + (Slope2 x X2) + ….. + e (Residuals)
* Assumptions:
1- The residuals are normally distributed.
2- Homogeneity of Variance.
3- Independence of the measurement.
4- Linearity of the model.
5- For every X the variance of Y should be equal.
6- No Multicollinearity: Multicollinearity means that two or more of the
Independent variables are highly correlated with each other.

* How can we check for these assumptions:


1- Normal distribution: Histogram/Boxplot & Tests.
2- Homogeneity: error plot with residuals (e > Y-axis, Predictions > X-axis)

* With 2 explanatory variables, there are four possible models:

1- Empty model: y=b0+e (lowest prediction percentage (0), V. Cheap).
2- Model with X1: y=b0+bX1*X1+e (explains 30%, Not V. Cheap).
3- Model with X2: y=b0+bX2*X2+e (explains 60%, Not V. Cheap).
4- Model with both: y=b0+bX1*X1+bX2*X2+e (explains 80%, Exp.).
Note: If 2) and 3) are P<0.05, 4) is Statistically better.
* When and Why Multiple Regression?
- When? Research Question > Study Design > [Analysis].
- Why?
A) Observational Studies:
- Etiological: - Usually Examine effect of one determinant on outcome.
- Observational studies have problem of confounders.
- Goal: effect of determinant on outcome while controlling
for potential confounders.
- Prediction: - Goal: Which factors contribute independently to the
Outcome.
- Selection of best variables, while controlling overlaps.
B) Experimental Studies:
- Effect of treatment on outcome.
- Randomization should take care of confounding.
- Taking into account pre-specified variables related to
Outcome may decrease uncertainty & increase power.

* Checking for assumptions:


1. Independence of observations (residuals)
• Observations are independent from one another.
• Knowing the value of one case tells nothing about the value of other
cases.
• To check: Durbin-Watson statistic (values between 1.5 and 2.5 are acceptable).
2. Normality of residuals
• We calculate standardized residuals.
• Check using:
o Histogram
o Shapiro-Wilk test
o PP Plot

3. Homoscedasticity (Same Scatter)
• Refers to whether the residuals are equally spread across the predicted
values or not (heteroscedasticity).
• Check with a plot of residuals vs. predicted values.
4. Linearity
• Means predictors have a straight-line relation with the outcome.
• If residuals are evenly scattered about the regression line, don't worry
about linearity.
• Use scatterplot.
5. Multicollinearity (No)
• Means two or more variables of predictors are highly correlated with
each other (ex. weight & BMI).
• Check Correlation Coefficient between all predictors (above 0.8 indicates a problem).
• Variance Inflation Factor (VIF) values (below 10, better below 5)
6. No influential outliers
• Use Case-wise diagnostics & Cook's distance
• Cook's should be below (4/n). Ex. 4/100 = 0.04
Different classifications of unusual points:
1. Outlier - Observation with a high residual (outcome).
2. Leverage - Observation with an extreme value on a predictor variable.
3. Influential points - Both outlier & leverage. Removing the observation
changes the estimate of correlation.

* Model building:
A) Various criteria can be used to determine which model is best:
- Model Selection Methods:
1. Highest R².
2. Highest adjusted R².
3. Cross-validation.
4. Stepwise selection (forward/backward):
. Forward: Start with no predictors. Add one by one. Stop when adding
predictors no longer significantly improves the model.
. Backward: Start with all predictors. Remove one by one (starting with the
highest P-value). Stop when removing worsens the model.

B) Some potentially explanatory variables can be removed before building


the model. (Researcher's choice).
C) Best model depends on the used dataset. Different dataset = another best
model.

* Model Building & Prediction:


- Model building is a compromise between:
1. Maximum % of explained variation (R²).
2. Minimum of explanatory variables.
N.B.: Extra explanatory variable = extra % of explained variation.
→ Regardless of whether the extra variable is TRULY associated with the
dependent variable.

* Predictive model?
- Best model:
1. Highest Predictive Capability (adjusted R²).
2. Lowest number of predictors.
- How?
1. Automatic selections:
- Forward.
- Backward.
- Stepwise (mix - forward + backward).

2. Manual selections: Enter variables.

* Model building for non-predictive purposes?


- We have the following options:
1. To work as the previous way, to include only stat sig. variables
(Automatic) with cautions.
2. Include All variables with low p-value (<0.2) in simple regression.
3. Include All studied variables or Clinically Imp. Variables.
4. Mixture of the above methods.

* Model Building Slides:

1. Multicollinearity: occurs when explanatory variables are highly
correlated, which distorts individual coefficient estimates,
leading to inconsistent results when variables are added or
removed.
- Indicators: Symptoms of multicollinearity include:
- High correlation among explanatory variables.
- Non-significant individual variables despite high overall
model fit ($R^2$).
- Large coefficient changes when adding/removing variables.
- High variance inflation factor (VIF should be below 10, better
below 5) or low tolerance values. $VIF = \frac{1}{1 - R^2}$
- Tolerance close to 0 = explanatory variable is highly
correlated to the others in the model.


2. Choice of Explanatory Variables.
3. Dummy Variables: created for categorical data with multiple
levels to allow for separate effects by level.
4. Sample Size: To ensure a stable model, a rule of thumb is
recommended—10-15 observations per predictor.
5. Confounders: Stepwise/forward/backward.
6. Missing Values: can lead to selection bias and loss of statistical
power, as cases with any missing data are excluded in
“complete case” analyses.
- Low Missing Rates (<5-10%): May be ignored without
major issues.
- Higher Missing Rates: Options include removing
variables with high missingness (suboptimal) or using advanced
techniques like multiple imputation to replace missing values.
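For the multicollinearity item above, the VIF formula is easy to try out. In this sketch R² is taken to be the R² obtained when one explanatory variable is regressed on the other explanatory variables, and tolerance is taken as 1 - R² (the reciprocal of the VIF); the function names are made up for the example.

```python
def tolerance(r2):
    """Tolerance = 1 - R^2; values close to 0 signal multicollinearity."""
    return 1 - r2

def vif(r2):
    """Variance inflation factor = 1 / (1 - R^2) = 1 / tolerance."""
    return 1 / tolerance(r2)

# Hypothetical R^2 values of one predictor regressed on the others.
for r2 in (0.0, 0.5, 0.8, 0.9):
    print(f"R2={r2:.1f}  tolerance={tolerance(r2):.2f}  VIF={vif(r2):.1f}")
```

So the VIF thresholds of 5 and 10 correspond to an R² of about 0.8 and 0.9 between a predictor and the remaining predictors.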
** Single Logistic Regression: (Similar to SLR)
- Dependent Variable/Y/Outcome: Binary.
- Independent Variable/X/Predictor: Numeric, Ordinal, Categorical.

- Equation:
$\ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 \times X$, $P = \frac{e^{(b_0 + b_1 X)}}{1 + e^{(b_0 + b_1 X)}}$, $OR = e^{b_1}$ ($b_1$ = slope)
To calculate the OR in the opposite direction, take 1/OR (e.g. males to females and vice versa).

Note: the exponential of the coefficient (which is the OR) is an indicator of
the change in odds resulting from a unit change in the predictor.
If there is no association between outcome and predictor, the coefficient will
be ZERO while the exp(b) will be ONE.

- Examples:

- Predictor continuous variable:

If we study the association between waist circumference (continuous
variable) and having diabetes (binary variable), the interpretation of the
OR value is as follows:

- For each unit increase in waist circumference (1 cm), the odds of being
diabetic are multiplied by 1.04.
Note: if the waist circumference increases by 3 units (3 cm), the odds of
being diabetic are multiplied by 1.04 x 1.04 x 1.04 (not by 1.04 x 3).

- If the OR value is greater than 1: as the predictor increases, the odds of
the outcome occurring increase.
- If the OR value is less than 1: as the predictor increases, the odds of the
outcome occurring decrease.
- If the OR value is 1: no change (no association).
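The multiplicative behaviour of the OR can be checked numerically. This is a sketch only: the coefficient is back-calculated from the OR of 1.04 in the waist circumference example, and the function names are hypothetical.

```python
import math

def odds_ratio(b, units=1):
    """OR for a change of `units` in the predictor: exp(b * units)."""
    return math.exp(b * units)

def probability(b0, b1, x):
    """P = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

b1 = math.log(1.04)            # coefficient implied by an OR of 1.04 per cm
or_1cm = odds_ratio(b1)        # back to 1.04
or_3cm = odds_ratio(b1, 3)     # equals 1.04 x 1.04 x 1.04, not 1.04 x 3
or_none = odds_ratio(0.0)      # coefficient of zero gives an OR of one
print(or_1cm, or_3cm, or_none)
```

A 3 cm increase therefore multiplies the odds by about 1.125, that is roughly 12.5% higher odds.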

- Predictor binary variable:


If we study the association between hypertension (binary variable) and
having diabetes (binary variable).
- For patients with hypertension, the odds of having diabetes is 2.3 times
the odds of having diabetes among patients who don’t have hypertension.

- If the OR value is greater than 1: the odds of the outcome occurring are
higher in the higher coded group (coded as 1).
- If the OR value is less than 1: the odds of the outcome occurring are
lower in the higher coded group (coded as 1).
- If the OR value is 1: no association.
- Predictor categorical variable:

If we study the association between smoking status (categorical variable)
and having bladder cancer (binary variable).
Note: for the smoking variable, the “never smokers” group is considered the
reference category. All other smoking categories are compared to this group.

- For occasional smokers, the odds of having bladder cancer is 1.5 times
the odds of having bladder cancer among never smokers.
- For former smokers, the odds of having bladder cancer is 2.3 times the
odds of having bladder cancer among never smokers.
- For current smokers, the odds of having bladder cancer is 5.2 times the
odds of having bladder cancer among never smokers.
There is an association between being a former smoker or a current smoker
and having bladder cancer.

Note: The OR compares the odds of occurrence of the outcome in each
category compared to the reference category.
- If the OR value is greater than 1: the odds of the outcome occurring are
higher in this category as compared to the reference category.
- If the OR value is less than 1: the odds of the outcome occurring are
lower in this category as compared to the reference category.
- If the OR value is 1: no difference from the reference category.

Note: If the confidence interval contains 1, there is no statistically
significant association.

Note2: The OR compares the odds of occurrence of the outcome in the
higher coded group (1) to the odds of the lower coded group (0).
- It is always important to recognize the reference category. If the sex is
coded 0 for males and 1 for females and the resulting OR = 1.5. This means
that the odds of having the outcome in females is 1.5 times that of males.
- If the coding is reversed, 0 for females and 1 for males, the resulting OR is
0.67. This means that the odds of having the outcome in males is 0.67 that
of the females. The two results are the same, the difference is only which
one is used as a reference group.

- Interpretation of 95% CI:


- In linear regression, if the 95% CI of the coefficient crosses 0, the
result is statistically non-significant (true value may be 0, indicating no
relationship).
- In logistic regression, if the 95% CI of the OR crosses 1, the result is
statistically non-significant (true value may be 1, indicating no
difference/change).
** Multiple Logistic Regression:

$\ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 \times X_1 + b_2 \times X_2$

- Crude and Adjusted Odds Ratios:

• Crude Odds Ratios:


o Result from simple logistic regression
o Measure the association between two variables without adjusting for other
variables
• Adjusted Odds Ratios:
o Result from multiple logistic regression
o Measure the association between two variables while adjusting (controlling)
for other variables in the model
• Reporting:
o Sometimes, both crude and adjusted odds ratios are presented to show how
estimates change after adjustment.
o 95% confidence intervals are usually reported for adjusted odds ratios, but not
always for crude odds ratios.

Additional Notes:

• Large changes in odds ratios after adjustment might indicate the presence of
confounding factors or effect modifiers.
• The interpretation of odds ratios in multiple logistic regression is similar to that in
simple logistic regression.
• $OR = e^{b}$

- What to report from the regression output:

1- The ORs (Unadjusted).

2- The ORs (Adjusted).

3- The 95% CI of the Adjusted OR (not significant if contains 1).

4- The p-value (for significance of association).

** Dummy Variables:
- Normally, categorical variables have 2 categories (Y/N, M/F).
However, in some cases we might have a categorical variable with
more than 2 categories (Small/Medium/Large). In this case
we use Dummy Variables:

1- Create Dummy Variables (K-1, K = number of categories).

- Each one has two values: (0/1, Y/N).

| Category | Medium | Large |
|----------|--------|-------|
| Medium | 1 | 0 |
| Large | 0 | 1 |
| Small (reference) | 0 | 0 |
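A possible way to build K-1 dummy variables in plain Python is sketched below. It assumes "Small" is the reference category (the row coded 0/0); the function name `dummy_code` and the sample values are hypothetical.

```python
def dummy_code(values, reference):
    """Create K-1 dummy variables (0/1); the reference category gets all zeros."""
    levels = [level for level in sorted(set(values)) if level != reference]
    return [{level: int(value == level) for level in levels} for value in values]

# Hypothetical sample with K = 3 categories, so 2 dummy variables.
sizes = ["Small", "Medium", "Large", "Medium"]
coded = dummy_code(sizes, reference="Small")
for size, row in zip(sizes, coded):
    print(size, row)
```

Each regression coefficient for a dummy variable then compares that category with the reference category.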

** Pearson’s Chi-Square test: (non-parametric test)

- The chi-square test for independence (also called Pearson's
chi-square test or the chi-square test of association) is used
to study if there is a relationship between two categorical
variables. (dependent and independent)

Usage: Used to study if there is a relationship/association
between two categorical variables.

How expected values are calculated:

1. We make a 2x2 table for the 2 variables (sex, preferred
drink) with the observed (actual) values.
2. We add the totals to the rows and columns.
3. We calculate the "Expected Value" for each cell. This is done
by multiplying the cell's row total by its column total and
dividing that by the overall total.
4. We then get the expected values (if there is no association
between the two variables).
5. The chi-square test works by comparing the observed values
(actual data) to the expected values (if there is no
association). (Assumption: less than 20% of the cells have
expected count less than 5).
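The expected-value steps can be sketched as follows. The counts are hypothetical (they are not the sex/preferred drink data), and the statistic used is the usual sum of (observed - expected)² / expected without continuity correction, which these notes do not spell out.

```python
def expected_counts(table):
    """Expected value per cell = row total x column total / overall total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square(table):
    """Sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

# Hypothetical 2x2 table of observed counts.
observed = [[30, 20],
            [20, 30]]
expected = expected_counts(observed)
statistic = chi_square(observed)
# Share of cells with an expected count below 5 (should be under 20%).
cells = [e for row in expected for e in row]
share_small = sum(e < 5 for e in cells) / len(cells)
print(expected, statistic, share_small)
```

For a 2x2 table the statistic is compared with the chi-square distribution on 1 degree of freedom.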

Note: Assumption not met?

- Fisher’s exact (for 2x2 table)

- Exact test (for more than 2x2 table)

- Merge categories where possible.

Interpretation: If p<0.05, there is a significant relationship between
the two variables.

Reporting: A chi-square test was conducted between sex and
preferred drink. There was a statistically significant association
between sex and the preferred drink, p = 0.043. A higher
percentage of females (48.8%) prefer coffee as compared to males
(42.3%).

** Agresti-Coull Confidence interval:

- r = number of successes.
- n = sample size in each group.
- Z = 1.96 if 95% CI

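- For each Category (x and y):

1- $\tilde{r} = r + \frac{Z^2}{2}$

2- $\tilde{n} = n + Z^2$

3- $\tilde{p} = \frac{\tilde{r}}{\tilde{n}}$

4- $\text{Diff. } \tilde{p} = \tilde{p}_x - \tilde{p}_y$

5- $S.E.(\text{diff}) = \sqrt{\frac{\tilde{p}_x (1 - \tilde{p}_x)}{\tilde{n}_x} + \frac{\tilde{p}_y (1 - \tilde{p}_y)}{\tilde{n}_y}}$

6- $\text{Diff. } \tilde{p} \pm (Z \times S.E.(\text{diff}))$ = (lower bound, upper bound)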
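The six steps translate into a short function. This is a sketch under the adjustment described above (add Z²/2 to the successes and Z² to the sample size in each group); the function names and the counts are hypothetical.

```python
import math

def adjusted_proportion(r, n, z=1.96):
    """Steps 1-3: adjusted successes, adjusted sample size, adjusted proportion."""
    r_adj = r + z ** 2 / 2
    n_adj = n + z ** 2
    return r_adj / n_adj, n_adj

def difference_ci(r_x, n_x, r_y, n_y, z=1.96):
    """Steps 4-6: difference of adjusted proportions with its confidence interval."""
    p_x, nx = adjusted_proportion(r_x, n_x, z)
    p_y, ny = adjusted_proportion(r_y, n_y, z)
    diff = p_x - p_y
    se = math.sqrt(p_x * (1 - p_x) / nx + p_y * (1 - p_y) / ny)
    return diff, diff - z * se, diff + z * se

# Hypothetical counts: 40/100 successes in group x, 30/100 in group y.
diff, lower, upper = difference_ci(40, 100, 30, 100)
print(round(diff, 4), round(lower, 4), round(upper, 4))
```

If the interval contains 0, the difference between the two proportions is not statistically significant at the chosen level.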

* Another equation for calculating the CI:

$\text{mean} \pm Z \left(\frac{SD}{\sqrt{n}}\right)$

Multiple regression with dummy: $b \pm Z \times S.E. = C \rightarrow e^{C}$

* Z-Values for CIs:

- 80% = 1.282 - 85% = 1.440 - 90% = 1.645 - 95% = 1.96

- 99% = 2.576 - 99.5% = 2.807 - 99.9% = 3.291

Survival Analysis

- Survival analysis is concerned with the time until an event
occurs (time to event). This event is usually death, such as survival
after breast cancer, but can be any other event.
- Time from operation to death
- Time from response till the recurrence of a tumor
- Time from operation to discharge from the hospital

- Survival time:
- Survival times are calculated from baseline time (Start) to
the endpoint (Event).
- Objectives of survival analysis:
- To estimate the time to event for a group of
individuals, such as the time until the second heart attack
for a group of myocardial infarction patients.
- To compare the time to event between two or more
groups, such as comparing time to second heart attack
between
- To calculate the survival probability at a certain time,
the probability that patients will survive for 1 or 5 years
(after diagnosis of lung cancer).

- Characteristics of survival data:


- Individuals do not enter the study at the same time.
- When the study ends, some individuals still haven't had
the event yet.
- Other individuals drop out or get lost in the middle of the
study, and all we know about them is the last time they
were still "free" of the event.

- Survival analysis terms:


- Time to event: The time from entry into a study until a
subject has a particular event (outcome).
- Censoring: subjects lost to follow up/drop out or the
study ends before they die.

Note: t-test or one way ANOVA could be used to examine the
influence of the explanatory variables on the survival time.
Note: Chi-Square test could be used to examine the influence of
the explanatory variables on the probability of dying within 4
years after start of the therapy OR of being dead at the end of the
study.

However, the drawbacks of these analyses?

- The actual survival time of those who are still alive is larger than the
observed survival time.
- Individuals who drop out or are lost to follow up are ignored.

Also, we can use the Cumulative incidence = dead/cases, or the
Incidence rate = dead/sum of time. But the cumulative incidence doesn’t
take into account that some subjects are not at risk for the whole period
because of censoring, and the incidence rate assumes that all subjects have
the same survival time distribution.

As a result, Survival data can’t be analyzed by standard
methods:
1. Survival time is not normally distributed. → Kaplan-Meier.
2. Survival time may often be censored.

In survival analysis, three key concepts are used to describe the
time until an event occurs (such as failure or death): the
cumulative distribution function (CDF), the survival
function, and the hazard rate.

1. Cumulative Distribution Function (CDF):

- represents the probability that the event of interest (like
death or failure) has occurred by a certain time.
- Range: The CDF starts at 0 indicating no one has yet
experienced the event and approaches 1 if eventually
everyone will experience the event.
Given the Mean and Time unit (assuming exponentially distributed survival times):

$F(t) = 1 - e^{-\lambda t}$, while $\lambda = \frac{1}{\text{Mean}}$
- Ex: with Mean = 200 hours, λ = 0.005 and F(100) = 1 – e^(-0.005 x 100) = 0.3935
- there’s a 39.35% prob. that X will fail within 100 Hours.
2. Survival Function:

- represents the probability that the event has not
occurred by time t: the probability that a subject survives
(or avoids the event) at least until time t.
- Range: The survival function starts at 1 (indicating 100%
survival at the beginning) and decreases to 0 (assuming
all subjects will eventually experience the event).

$S(t) = 1 - F(t)$

- Ex: S (100) = 1 – 0.3935 = 0.6065
- there’s a 60.65% prob. that X will survive more than 100
Hours.
3. Hazard Rate:

- represents the instantaneous risk of the event occurring
at a particular time, given that the individual has
survived up to that time.
- Range: The hazard rate is typically a non-negative value,
and it can vary over time. It is not a probability but a
rate, and it can be greater than 1.

$\lambda = \frac{1}{\text{Mean}}$
- Ex: h(t) = 1 / 200 = 0.005
- Hazard rate is 0.005 failures per hour, meaning there is roughly a
0.5% chance per hour that X will fail.
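The three worked examples can be reproduced in a few lines. This sketch assumes exponentially distributed survival times (constant hazard), which is what the formula F(t) = 1 - e^(-λt) implies, with a mean of 200 hours as in the hazard rate example.

```python
import math

mean_lifetime = 200.0               # hours, from the hazard rate example
hazard = 1 / mean_lifetime          # lambda = 1 / Mean = 0.005 per hour

def cdf(t):
    """F(t): probability that the event has occurred by time t."""
    return 1 - math.exp(-hazard * t)

def survival(t):
    """S(t) = 1 - F(t): probability of surviving beyond time t."""
    return 1 - cdf(t)

f_100 = cdf(100)
s_100 = survival(100)
print(round(hazard, 4), round(f_100, 4), round(s_100, 4))
```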

* Kaplan-Meier:

- The Kaplan-Meier method is commonly used in medical
research to estimate survival rates and compare survival
between different groups (e.g., treatment vs. control) by
using statistical tests like the log-rank test.

- Kaplan-Meier Survival Function:

- The Kaplan-Meier survival function estimates the
probability of surviving past a certain time (t). This is done
by calculating the probability of surviving at each event time
and then multiplying these probabilities together.

- Kaplan-Meier Equation:

$S(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$ or $\prod_{t_i \le t} \left(\frac{n_i - d_i}{n_i}\right)$

- where:
- $t_i$ is the time of each observed event (e.g., death,
failure).
- $d_i$ is the number of events (deaths) that occurred at
time $t_i$.
- $n_i$ is the number of individuals "at risk" (i.e., who
have not yet had the event and have not been
censored) just before time $t_i$.

- Example of Kaplan-Meier Calculation:

| Patient | Time (days) | Event (Death) |
|---------|-------------|---------------|
| 1 | 10 | Yes |
| 2 | 15 | No (Censored) |
| 3 | 20 | Yes |
| 4 | 25 | Yes |
| 5 | 30 | No (Censored) |

- Step-by-Step Calculation

1. At (t = 10):
- Number at risk (n1 = 5) (all 5 patients are still in the study).
- Number of events (d1 = 1) (Patient 1 died at t = 10).
- Survival probability at (t = 10):
- S (10) = 1 – (1/5) = 0.8

2. At (t = 20):
- Number at risk (n2 = 3) (Patient 1 died and Patient 2 was censored at t = 15).
- Number of events (d2 = 1) (Patient 3 died at t = 20).
- Probability of surviving past (t = 20), given survival up to then:
- 1 – (1/3) = 0.667
- Cumulative survival probability up to (t=20) = 0.8 x 0.667 = 0.533

and so on….

- Interpreting the Kaplan-Meier Curve:

- Survival Probability At (t = 25): here n3 = 2 and d3 = 1, so the
survival probability is 0.533 x 0.5 = 0.267, meaning there’s about a 27%
chance that a randomly chosen patient from this group will survive
beyond 25 days.
- Censoring: The curve remains constant during censored
periods, indicating that no event has been observed, but we
still account for individuals who are "at risk."

- Comments on Kaplan-Meier curve:


• Plus signs (+) indicate censored observations.
• At times where the curve drops, an event has occurred.
• Deaths can only occur at observed death times, so the
curve remains horizontal between those time points.
• If the largest survival time is associated with an event,
cumulative survival will drop to 0, otherwise (as here) it will
remain at a positive value.

- Ways of estimation of the “average” survival time:


1. Take the mean (or median) of the survival times of all
subjects, disregarding censoring.
2. Take the mean (or median) of only those individuals who
died.
3. Use the Kaplan-Meier curve to estimate the median
survival time as the time at which the curve drops below
0.5.

Note: that the underlying distribution of the survival times


is often positively skewed. So, the median survival time is a
more appropriate measure of centrality.
Note: that the first two methods can give severely biased
estimates
- The mean and median are underestimated with methods 1
and 2:
• Disregarding censoring means that you assume that
subjects died at the given time point, while censoring
indicates that the individual survived up to and probably
beyond this time.
• Leaving out censored observations would only be OK if the
(unobserved) survival distribution of the censored
observations has the same mean (median) as that of the
events.
• Usually, we have more censoring in the higher survival
times. Leaving the censored observations out leads to
underestimation.
- Estimation of the standard error of the Kaplan-Meier
survival function:

$SE(S(t)) = S(t) \times \sqrt{\sum \frac{d_i}{n_i (n_i - d_i)}}$

- Step-by-Step Example Calculation:

| Time | Events | At Risk |
|------|--------|---------|
| 10 | 1 | 5 |
| 20 | 1 | 4 |
| 30 | 1 | 3 |

- Let's assume we want to calculate the Kaplan-Meier
survival estimate and the standard error up to time
(t=30).

1. Calculate the Kaplan-Meier Survival Estimates:

- At t = 10:
S (10) = 1 x ( 1 – 1/5) = 0.8

- At t = 20:
S (20) = 0.8 x (1 - ¼) = 0.6

- At t = 30:
S (30) = 0.6 x (1 – 1/3) = 0.4

So, the Kaplan-Meier survival estimates are:


- S (10) = 0.8
- S (20) = 0.6
- S (30) = 0.4
2. Calculate the Standard Error:

We will calculate the standard error at t = 30, so we
sum up contributions at each event time up to t = 30:
- For t = 10: $\frac{1}{5(5 - 1)} = 0.05$
- For t = 20: $\frac{1}{4(4 - 1)} = 0.0833$
- For t = 30: $\frac{1}{3(3 - 1)} = 0.1667$

Summing these values gives:


0.05 + 0.0833 + 0.1667 = 0.3

S(30) x √0.3 = 0.2191

So, the standard error of the Kaplan-Meier survival
estimate at t = 30 is approximately 0.2191.
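The same calculation can be sketched in Python, using the (time, events, at risk) rows from the example table. The function name is made up for the illustration, and the sketch assumes that not everyone at risk dies at a given time (n > d), otherwise the standard error term is undefined.

```python
import math

def kaplan_meier(rows):
    """rows: (time, events, at_risk) for each event time, in time order.

    Returns (time, survival estimate, standard error) for each event time.
    """
    survival = 1.0
    running_sum = 0.0      # sum of d / (n * (n - d)) up to the current time
    results = []
    for time, d, n in rows:
        survival *= 1 - d / n
        running_sum += d / (n * (n - d))
        results.append((time, survival, survival * math.sqrt(running_sum)))
    return results

rows = [(10, 1, 5), (20, 1, 4), (30, 1, 3)]
estimates = kaplan_meier(rows)
for time, s, se in estimates:
    print(time, round(s, 4), round(se, 4))
```

The last row reproduces S(30) = 0.4 with a standard error of about 0.2191.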

- Confidence Intervals for the Survival Function:

𝑆(𝑡) ± 𝑍 𝑥 𝑆. 𝐸(𝑆(𝑡))

* Log-Rank Test:

- The log-rank test compares two or more survival
functions.

- If the two groups show different survival times, are the
observed differences in the survival times between the
two groups likely to be due to chance variation?

- Various non-parametric tests exist to answer this
question; one of them is the Log-Rank test:
• Suppose there are r times where events occur in
one or both groups.
• Let nij be the number of patients in group i = 1 or
2 just before time tj.
• Let dij be the number of events in group i at tj.
• Then: nj = n1j + n2j and dj = d1j + d2j

- Under the null hypothesis of no difference between the
survival functions of the two groups, the expected
number of events at time tj in group i is given by:

$e_{ij} = n_{ij} \times \left(\frac{d_j}{n_j}\right)$

- Discrepancies between the observed number of events
dij and the expected number of events eij provide
evidence against the null hypothesis.

$e_i = \sum_j e_{ij}, \quad d_i = \sum_j d_{ij}$ for i = 1 and 2

- Then the test statistic has a chi-squared distribution
with one degree of freedom.

$T = \frac{(d_1 - e_1)^2}{e_1} + \frac{(d_2 - e_2)^2}{e_2}$
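A small sketch of this test statistic follows. The input format (one row per event time with the numbers at risk and the events in each group) and the data are hypothetical and only serve to show the bookkeeping.

```python
def log_rank(rows):
    """rows: (n1j, d1j, n2j, d2j) at each event time tj.

    Returns (T, e1, e2) with T = (d1 - e1)^2 / e1 + (d2 - e2)^2 / e2.
    """
    d1 = d2 = e1 = e2 = 0.0
    for n1j, d1j, n2j, d2j in rows:
        nj = n1j + n2j            # total at risk just before tj
        dj = d1j + d2j            # total events at tj
        e1 += n1j * dj / nj       # expected events in group 1 under H0
        e2 += n2j * dj / nj       # expected events in group 2 under H0
        d1 += d1j
        d2 += d2j
    statistic = (d1 - e1) ** 2 / e1 + (d2 - e2) ** 2 / e2
    return statistic, e1, e2

# Hypothetical data: two groups of 5 patients, four event times.
rows = [(5, 1, 5, 0), (4, 1, 5, 0), (3, 0, 5, 1), (3, 1, 4, 0)]
statistic, e1, e2 = log_rank(rows)
print(round(statistic, 4), round(e1, 4), round(e2, 4))
```

The statistic is compared with the chi-squared distribution on one degree of freedom, whose 5% critical value is 3.84, so this small hypothetical data set would not show a significant difference.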
* Sample Size for Log-Rank Test:

- The power of the log-rank test depends on the total
number of events, rather than the total number of
individuals.
- Therefore, the number of events and patients required
are calculated in three steps:

- Step One: Specify the type I error (α), the power (1-β)
and the relevant treatment effect.

$\delta_0 = \frac{\log(P_E)}{\log(P_C)} = HR$

- where $P_E$ and $P_C$ are the estimated
survival probabilities in the experimental
and control groups after a certain follow-up
time T.

- Step Two: estimate d, number of events.

$d = \left(\frac{1 + \delta_0}{1 - \delta_0}\right)^2 (Z_\alpha + Z_\beta)^2$

- Step Three: use d to estimate N.

$N \ge \frac{2d}{2 - P_C - P_E}$
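The three steps can be sketched as a single function. The survival probabilities below are hypothetical, Z for α is taken as 1.96 and Z for 80% power as 0.84, and rounding the number of events up before computing N is my own choice for the example.

```python
import math

def log_rank_sample_size(p_e, p_c, z_alpha=1.96, z_beta=0.84):
    """Return (hazard ratio, required events d, required patients N)."""
    # Step 1: treatment effect as a hazard ratio
    delta = math.log(p_e) / math.log(p_c)
    # Step 2: required number of events, rounded up
    d = math.ceil(((1 + delta) / (1 - delta)) ** 2 * (z_alpha + z_beta) ** 2)
    # Step 3: required number of patients, rounded up
    n = math.ceil(2 * d / (2 - p_c - p_e))
    return delta, d, n

# Hypothetical survival probabilities after follow-up time T.
hazard_ratio, events, patients = log_rank_sample_size(p_e=0.6, p_c=0.4)
print(round(hazard_ratio, 4), events, patients)
```

With these inputs about 98 events are needed, which translates into at least 196 patients in total.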

Note: Z for power: from Normal table

- 75% = 0.674 - 80% = 0.84 - 85% = 1.036 - 90% = 1.28 - 95% = 1.645
