
Correlation vs. Regression Analysis Explained

The document provides information about correlation, regression analysis, and examples of applying statistical tests like the F-test and t-test. It defines correlation as measuring the strength of association between paired variables while regression finds the linear relationship between a dependent and independent variable. An example is given of a spurious correlation between hair length and test scores where the real factor is gender. Reasons for regression analysis include explaining and predicting relationships between variables. Descriptive statistics and their interpretation are also discussed.
Copyright
© Attribution Non-Commercial (BY-NC)

2011

Final Project

Raiha, Maheen, Fabiha Mahnoor, Zara


1/1/2011

Q1 Differentiate between Correlation and Regression Analysis.

CORRELATION:

Correlation is a statistical measurement of the relationship between two variables. Possible correlations range from -1 to +1.
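As a minimal sketch of how such a coefficient can be computed, the snippet below implements the Pearson correlation coefficient r by hand. The function name `pearson_r` and the small data lists are hypothetical illustrations, not part of the project data.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations (not taken from the project data).
x = [1, 2, 3, 4, 5]
up = [2, 4, 6, 8, 10]    # rises exactly in step with x
down = [10, 8, 6, 4, 2]  # falls exactly in step with x
flat = [3, 1, 4, 1, 3]   # no linear trend

r_up = pearson_r(x, up)      # perfect positive correlation
r_down = pearson_r(x, down)  # perfect negative correlation
r_flat = pearson_r(x, flat)  # zero linear correlation
```

The three cases correspond to the two extremes and the middle of the range described above.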

A zero correlation indicates that there is no linear relationship between the variables. A correlation of -1 indicates a perfect negative correlation, meaning that as one variable goes up, the other goes down. A correlation of +1 indicates a perfect positive correlation, meaning that both variables move in the same direction together.

In other words, correlation computes the value of the Pearson correlation coefficient, r, whose value ranges from -1 to +1. The correlation measures the STRENGTH of linear association between paired variables, say X and Y.

Correlation is calculated whenever:
* both X and Y are measured in each subject and we want to quantify how much they are linearly associated.
* in particular, Pearson's product-moment correlation coefficient is used when the assumption that both X and Y are sampled from normally distributed populations is satisfied,
* or Spearman's rank-order correlation coefficient is used if the assumption of normality is not satisfied.
* correlation is not used when the variables are manipulated, for example, in experiments.

If you interchange the variables X and Y in the calculation of the correlation coefficient, you will get the same value.

REGRESSION:

Regression is the technique of fitting a simple equation to real data points. The most typical type of regression is linear regression (meaning you use the equation for a straight line, rather than some other type of curve), constructed using the least-squares method (the line you choose is the one that minimizes the sum of the squares of the distances between the line and the data points). It's customary to use "a" or "alpha" for the intercept of the line, and "b" or "beta" for the slope, so linear regression gives you a formula of the form:

y = bx + a

Linear regression quantifies goodness of fit with r^2, sometimes shown in uppercase as R^2.
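The least-squares line and its r^2 can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `fit_line` and `r_squared` and the data points are hypothetical, and the closed-form slope and intercept formulas are the standard ones for simple linear regression.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b

def r_squared(xs, ys, a, b):
    """Share of the variation in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

# Hypothetical data points lying close to a straight line.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
r2 = r_squared(x, y, a, b)
y_at_6 = a + b * 6  # prediction for a new x value
```

Once a and b are known, predicting y for a new x is a single evaluation of the fitted formula, as in the last line.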

The "best" linear regression model is obtained by selecting the variables (X's) that have at least a strong correlation with Y, i.e. r >= 0.80 or r <= -0.80. The same underlying distribution is assumed for all variables in linear regression. Thus, linear regression will underestimate the correlation between the independent and dependent variables when they (the X's and Y) come from different underlying distributions.

The regression tells us the FORM of linear association that best predicts Y from the values of X.

Linear regression is used whenever:
* at least one of the independent variables (Xi's) is used to predict the dependent variable Y. Some of the Xi's may be dummy variables, i.e. Xi = 0 or 1, which are used to code some nominal variables.
* one manipulates the X variable, e.g. in an experiment.

Linear regression is not symmetric in terms of X and Y. That is, interchanging X and Y gives a different regression model (X in terms of Y) from the original one (Y in terms of X).
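The contrast between the symmetry of correlation and the asymmetry of regression can be checked numerically. The sketch below uses hypothetical data and hypothetical helper names (`slope_intercept`, `pearson_r`). It fits Y on X and X on Y, then compares the two slopes with r.

```python
import math

def slope_intercept(xs, ys):
    """Least-squares fit of ys on xs; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired data.
x = [1, 2, 3, 4, 5]
y = [2, 1, 5, 4, 8]

a_yx, b_yx = slope_intercept(x, y)  # Y in terms of X: y = a_yx + b_yx * x
a_xy, b_xy = slope_intercept(y, x)  # X in terms of Y: x = a_xy + b_xy * y

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)  # same value: correlation is symmetric
```

With these numbers the Y-on-X slope is 1.5 and the X-on-Y slope is 0.5, so the two fitted lines differ, while r is the same in both directions. The product of the two slopes equals r^2.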

Q2. Give Any Example of Spurious Correlation between any two REAL WORLD variables & highlight the hidden factor.

"Spurious Relation (or Correlation): A situation in which measures of two or more variables are statistically related (they covary) but are not in fact causally linked, usually because the statistical relation is caused by a third variable. When the effects of the third variable are removed, they are said to have been partialed out. A spurious correlation, as defined in definition a, is sometimes called an "illusory correlation."

Lurking Variable: A third variable that causes a correlation between two others; sometimes, like the troll under the bridge, an unpleasant surprise when discovered. A lurking variable is a source of a spurious correlation. For example, if researchers found a correlation between individuals' college grades and their income later in life, they might wonder whether doing well in school increased income. It might; but good grades and high income could both be caused by a third (lurking or hidden) variable such as a tendency to work hard."

"For example, if the students in a psychology class who had long hair got higher scores on the midterm than those who had short hair, there would be a correlation between hair length and test scores. Not many people, however, would believe that there was a causal link and that, for example, students who wished to improve their grades should let their hair grow. The real cause might be gender: that is, women (who usually have longer hair) did better on the test. Or that might be a spurious relationship too. The real cause might be class rank: Seniors did better on the test than sophomores and juniors, and, in this class, the women (who also had longer hair) were mostly seniors, whereas the men (with shorter hair) were mostly sophomores and juniors."

Here in the example, long hair is one variable and grades are the other. The cause of the higher grades might be gender, but the example stresses long hair and grades, which only indicates that in this class the women achieved high grades. This illustrates the FALLACY of reading a causal link into a correlation between two variables.
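The class-rank explanation can be illustrated with a small made-up dataset. In the sketch below, all numbers and the helper name `pearson_r` are hypothetical. Hair length and test score are correlated across the whole class, but the correlation vanishes once each class rank is looked at separately, which is what partialing out the lurking variable amounts to here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical class: seniors have longer hair and higher scores on average.
rank = ["senior"] * 4 + ["junior"] * 4
hair_cm = [30, 40, 30, 40, 10, 20, 10, 20]
score = [80, 80, 90, 90, 60, 60, 70, 70]

# Correlation over the whole class, ignoring class rank.
r_all = pearson_r(hair_cm, score)

def r_within(group):
    """Correlation between hair length and score inside one class rank."""
    idx = [i for i, g in enumerate(rank) if g == group]
    return pearson_r([hair_cm[i] for i in idx], [score[i] for i in idx])

r_seniors = r_within("senior")
r_juniors = r_within("junior")
```

In this toy class the pooled correlation is 0.8, while inside each rank it is 0: hair length tells us nothing about scores once class rank is known.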

Q 3 : Write down a paragraph on why we do regression analysis.

Regression Analysis: We do regression analysis for the following reasons:

a. Explaining the relationship between Y and X variables with a model
b. Estimating and testing the intensity of their relationship
c. Given a fixed x value, we can predict y value.

Applications of regression analysis exist in almost every field. In economics, the dependent variable might be a family's consumption expenditure and the independent variables might be the family's income, the number of children in the family, and other factors that would affect the family's consumption patterns. In political science, the dependent variable might be a state's level of welfare spending and the independent variables might be measures of public opinion and institutional variables that would cause the state to have higher or lower levels of welfare spending. In sociology, the dependent variable might be a measure of the social status of various occupations and the independent variables might be characteristics of the occupations (pay, qualifications, etc.). In psychology, the dependent variable might be an individual's racial tolerance as measured on a standard scale, with indicators of social background as independent variables. In education, the dependent variable might be a student's score on an achievement test and the independent variables might be characteristics of the student's family, teachers, or school.

MATHEMATICAL EQUATION:
B1 = 0.976
B0 = 250.553
Y^ = b0 + b1 X1

Y^ = 395.5817809 + 0.976 X1
Therefore, for every 1-unit change in X1, there will be a 0.976 change in Y^, as 0.976 is the gradient of the function. The Y-intercept of the function is 395.5817809.

SSR = Sum (Y^ - ȳ)^2 = 1.23

SST = Sum (Y - ȳ)^2 = 1.35

R^2 = SSR / SST = 1.23 / 1.347 = 0.912927

The value of r^2 is quite high (91.29%), meaning that about 91% of the variation in Y is explained by X, i.e. there is a strong relationship between X and Y.

Adjusted r^2 = 1 - [(1 - r^2)(n - 1) / (n - k - 1)]

= 0.910439

After adjusting for the number of explanatory variables and the sample size, the adjusted R^2 is also very high (91.04%), which means there is strong dependence between X and Y.
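The adjustment can be re-checked with a short calculation. The sketch below assumes n = 37 observations and k = 1 explanatory variable, the values used in the F-test and t-test sections of this report. The function name `adjusted_r2` is a hypothetical label.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# r^2 from the report; n and k as used in the test sections (assumed here).
adj = adjusted_r2(0.912927, n=37, k=1)
```

With these inputs the result rounds to 0.910439, matching the figure above.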

(v)

SCATTER PLOT

[Scatter plot: Series1 with a 2-period moving average (2 per. Mov. Avg. (Series1)); horizontal axis 0 to 40, vertical axis 0 to 20000]

The scatter plot also supports our analysis that, on average, both variables show the same movements and trend.

Q 4. Give Descriptive Statistics and interpret the data.

DESCRIPTIVE STATISTICS: Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data. There are three major characteristics of a single variable that we tend to look at:

* the distribution
* the central tendency
* the dispersion

The Distribution. The distribution is a summary of the frequency of individual values or ranges of values for a variable. The simplest distribution would list every value of a variable and the number of persons who had each value.

Central Tendency. The central tendency of a distribution is an estimate of the "center" of a distribution of values. There are three major types of estimates of central tendency:

* Mean
* Median
* Mode

Dispersion. Dispersion refers to the spread of the values around the central tendency. There are three common measures of dispersion: the range, the standard deviation and the variance.
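As a rough sketch, these measures can be computed with Python's standard `statistics` module. The list `values` is a hypothetical sample, not the project data, and the sample (n - 1) versions of the variance and standard deviation are used.

```python
import statistics

# Hypothetical sample of observations.
values = [2, 4, 4, 5, 7, 9, 11]

# Central tendency
mean = statistics.mean(values)      # arithmetic average
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value

# Dispersion
value_range = max(values) - min(values)
variance = statistics.variance(values)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(values)      # square root of the sample variance
```

For this sample the mean is 6, the median 5, the mode 4, the range 9 and the sample variance 10.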

DESCRIPTIVE STATS

APPLICATION OF F- TEST.

H0 : B1 = B2 = 0
Ha : at least 1 B is not = 0
It is a one-tailed test at the 5% significance level.

SSR = Sum (Y^ - ȳ)^2 = 7.83525
d.f for SSR = k = 1
MSR = SSR / k = 7.83525

SSE = Sum (Y - Y^)^2 = 1346829407
d.f for SSE = n - k - 1 = 35
MSE = SSE / (n - k - 1) = 37411927.98

F = MSR / MSE = 2.03614
F c.v = 0.323985026

2.09432 > 0.323985026
Since F > F c.v, reject H0 : B1 = B2 = 0. Therefore, Ha : at least 1 B is not = 0.
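The mechanics of the F-test can be sketched on a small hypothetical dataset (not the project data). The helper name `f_test_simple` is illustrative, and the critical value 10.13 for (1, 3) degrees of freedom at the 5% level is an assumption taken from a standard F table rather than computed here.

```python
def f_test_simple(xs, ys):
    """F statistic for a simple linear regression of ys on xs (k = 1)."""
    n, k = len(xs), 1
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    fitted = [a + b * x for x in xs]
    ssr = sum((f - mean_y) ** 2 for f in fitted)       # explained variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
    msr = ssr / k
    mse = sse / (n - k - 1)
    return ssr, sse, msr / mse

# Hypothetical observations.
x = [1, 2, 3, 4, 5]
y = [2, 1, 5, 4, 8]

ssr, sse, F = f_test_simple(x, y)

F_CRIT = 10.13  # assumed table value for F(1, 3) at the 5% level
reject_h0 = F > F_CRIT
```

Here SSR = 22.5, SSE = 7.5 and F = 9, which is below the assumed critical value, so H0 would not be rejected for this toy sample. The decision rule is the same one applied above: reject H0 only when F exceeds F c.v.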

                 Intercept    X Variable 1
Coefficients     498.3621     0.978816
Standard Error   440.8294     0.051096
t Stat           1.13051      19.15623
P-value          0.265949     3.92E-20
Lower 95%        -396.569     0.875085
Upper 95%        1393.293     1.082547
Lower 95.0%      -396.569     0.875085
Upper 95.0%      1393.293     1.082547
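The 95% bounds in this output can be reproduced from the coefficients and standard errors. The sketch below assumes a two-tailed critical t value of 2.0301 for 35 degrees of freedom, taken from a t table rather than from the report. The names `conf_interval` and `T_CRIT` are hypothetical.

```python
T_CRIT = 2.0301  # assumed two-tailed 5% critical t value for d.f = 35

def conf_interval(coef, std_err, t_crit=T_CRIT):
    """Confidence interval: coefficient +/- t_crit * standard error."""
    margin = t_crit * std_err
    return coef - margin, coef + margin

# Coefficients and standard errors from the regression output above.
intercept_lo, intercept_hi = conf_interval(498.3621, 440.8294)
slope_lo, slope_hi = conf_interval(0.978816, 0.051096)

# An interval that excludes 0 points to a coefficient different from 0.
slope_excludes_zero = not (slope_lo <= 0 <= slope_hi)
intercept_excludes_zero = not (intercept_lo <= 0 <= intercept_hi)
```

The slope interval lies entirely above 0, while the intercept interval contains 0, in line with the two P-values in the output.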

APPLICATION OF t TEST
HYPOTHESIS
Ho : B1 = 0
Ha : B1 is not = 0

It is a 2-tailed test with d.f = 35 and a 0.05 significance level.

t = ( b1 - B1) / S(b1)

d.f = n-2 = 37 -2 = 35

Standard error of model = [{Sum (Y^ - Y)^2} / 35]^0.5 = 3364957
S(b1) = standard error of model / [Sum (X - x̄)^2]^0.5 = 3364957 / 1285049376 = 93.86846

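t = 19.15623 (the t Stat for X Variable 1 in the regression output)

t c.v = 2.030 (two-tailed, 0.05 significance level, d.f = 35)

As t > t c.v

19.15623 > 2.030

So reject Ho : B1 = 0. Therefore, as the gradient is not equal to 0, there is a linear relationship between Y and X and they are dependent on each other.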
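The test statistic can be recomputed directly from the formula t = (b1 - B1) / S(b1). The sketch below uses the slope estimate and standard error reported for X Variable 1 in the regression output. The critical value 2.0301 for d.f = 35 is an assumption taken from a t table, and `t_statistic` is a hypothetical helper name.

```python
def t_statistic(b1, hypothesized_b1, se_b1):
    """t = (estimate - hypothesized value) / standard error of the estimate."""
    return (b1 - hypothesized_b1) / se_b1

# Slope and its standard error from the regression output; Ho states B1 = 0.
t = t_statistic(0.978816, 0.0, 0.051096)

T_CRIT = 2.0301  # assumed two-tailed 5% critical t value for d.f = 35
reject_h0 = abs(t) > T_CRIT
```

The result is about 19.16, far beyond the critical value, so Ho : B1 = 0 is rejected.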

P VALUE ANALYSIS :

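calculated p value = 0.265949 (the P-value of the intercept in the regression output), alpha = 0.05

As calculated p value (0.265949) > alpha (0.05)

Thus, Ho cannot be rejected for the intercept, i.e. the intercept is not significantly different from 0. For the slope B1, the regression output gives a p value of 3.92E-20 < alpha (0.05), so Ho : B1 = 0 is rejected.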

      Mean       Variance (V)    Df
X     6350.59    72784603.35     35
Y     6549.84    78646765.38     36

Grand mean =(37*6350.59)+(36*6549.84) =6429.39

ANOVA
Source     SS                          Df    MS             F
Between    536810.8654                 1     536810.8654    0.0071
Within     5378744671                  71    75756967.2
Total      5378744671 + 536810.8654    72

Is calculated F > critical F? 0.007 > 4.0012 is false, so Ho is accepted (not rejected).
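The within-group figures and the F ratio in this ANOVA can be re-derived from the summary statistics. The sketch below treats V as the sample variance of each variable, which is an assumption about the notation, and takes the between-group sum of squares and the critical F as given in the report. The helper names are hypothetical.

```python
def within_ss(variances, dfs):
    """Within-group sum of squares: each sample variance times its d.f."""
    return sum(v * df for v, df in zip(variances, dfs))

def f_ratio(ss_between, df_between, ss_within, df_within):
    """F = mean square between groups / mean square within groups."""
    return (ss_between / df_between) / (ss_within / df_within)

# Summary statistics quoted in the report (V assumed to be the variance).
variances = [72784603.35, 78646765.38]
dfs = [35, 36]
ss_between = 536810.8654  # taken as given from the ANOVA figures

ss_w = within_ss(variances, dfs)
ms_w = ss_w / sum(dfs)
F = f_ratio(ss_between, 1, ss_w, sum(dfs))

F_CRIT = 4.0012  # critical value quoted in the report
reject_h0 = F > F_CRIT
```

These inputs give a within-group sum of squares of about 5378744671, a within-group mean square of about 75756967.2, and F of about 0.0071, so Ho is not rejected.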

[Residual plot: Series1; horizontal axis 0 to 20000, vertical axis -5000 to 6000]

As the residuals are evenly scattered around their mean, i.e. 0, the errors appear to be independent and random, so they can be treated as IID.
