Engineering Data Analysis

The document discusses the importance of data analysis, specifically focusing on Analysis of Variance (ANOVA), Regression, and Correlation as key statistical techniques for exploring relationships within data. It outlines the methodologies, assumptions, and applications of these techniques, providing examples and explanations for one-way and two-way ANOVA, regression analysis, and correlation analysis. The content serves as an educational resource for understanding how to analyze data effectively in various fields.

College of Engineering and Architecture

Civil Engineering Department


Brgy. Bajumpandan, Dumaguete City

ENM 241 – ENGINEERING DATA ANALYSIS


SEC A (TTH 1:00 – 2:30 p.m.)

ANALYSIS OF VARIANCE, REGRESSION, AND CORRELATION

Project Members
Saniel, Jomin Clart
Tubat, James Aeron A.
Venida, Ira Gaines

Dr. Rosario Abrasaldo


Instructor

June 2024
Table of Contents

I. Introduction
II. Discussion
A. Analysis of Variance
B. Regression Analysis
C. Correlation Analysis
III. Assessment
IV. References

I. INTRODUCTION

Data analysis plays a pivotal role in today’s data-driven world. It helps
many sectors harness the power of data, enabling them to make informed
decisions, optimize processes, and gain a competitive edge. By turning raw
data into meaningful insights, data analysis allows different fields to
identify opportunities, mitigate risks, and enhance their overall
performance.

Analysis of Variance (ANOVA), Regression, and Correlation are
fundamental statistical techniques used to explore relationships and
patterns within data. These methods provide powerful tools for
understanding how variables interact and influence each other in various
contexts.

Analysis of Variance (ANOVA) is a statistical method used to analyze
the differences among group means and assess whether these differences
are statistically significant. It is a powerful tool in research and
decision-making processes across various fields, including social sciences,
medicine, economics, and engineering.

Regression analysis is a fundamental statistical method used to
examine relationships between variables. Its uses in data analysis
(e.g., hypothesis testing, prediction and forecasting, and understanding
relationships) span various fields, from scientific research to business
analytics.

Correlation analysis is also a crucial statistical method, used to
measure the strength and direction of the linear relationship between two
or more variables. Unlike regression, it does not model one variable as a
function of the others; it determines how closely the movements of the
variables are related to each other, without implying causation.

Together, these techniques form the cornerstone of statistical
analysis, guiding researchers and practitioners in making informed
decisions across diverse fields from social sciences to engineering and
beyond. This introduction sets the stage for a deeper exploration into how
these methods are applied, interpreted, and leveraged to extract meaningful
insights from data.

II. DISCUSSION
A. Analysis of Variance (ANOVA)
Analysis of Variance (ANOVA) is a statistical test that examines the
differences in means among multiple groups. It compares the variation
between group means to the variation within the groups. If the variation
between group means is significantly larger than the variation within
groups, it suggests a significant difference between the means of the
groups.
ANOVA calculates an F-statistic by comparing between-group
variability to within-group variability. If the F-statistic exceeds a critical
value, it indicates significant differences between group means.
ANOVA is used to compare treatments, analyze a factor’s impact on a
variable, or compare means across multiple groups. There are different
types of ANOVA, including one-way ANOVA, which compares means across
the groups or treatments defined by a single factor, and two-way ANOVA,
which considers the effects of two independent variables on the dependent
variable.
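
The F-statistic described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular software package: the function name `one_way_f` and the three small groups of scores are hypothetical and are not taken from the tables in this report.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)          # number of groups
    n = len(all_values)      # total number of observations

    # Between-group variability: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variability: how far each value is from its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between = k - 1
    df_within = n - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within, df_between, df_within


# Hypothetical scores for three groups (illustration only).
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_f(groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

For these made-up numbers the between-group mean square is 13 and the within-group mean square is 1, so F = 13. A standard F table gives a critical value of about 5.14 for (2, 6) degrees of freedom at the 0.05 level, so in this toy case the group means would be judged significantly different.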

One-Way Analysis of Variance (ANOVA)


One-way ANOVA (Analysis of Variance) is a statistical test used to
compare the means of three or more samples to determine if there are
significant differences among them. It is based on the assumption that the
samples are drawn from normally distributed populations with equal
variances.

One-Way ANOVA: Assumptions


For the results of a one-way ANOVA to be valid, the following assumptions
should be met:
1. Normality – Each sample was drawn from a normally distributed
population.
2. Equal Variances – The variances of the populations that the samples come
from are equal.
3. Independence – The observations in each group are independent of each
other and the observations within groups were obtained by a random
sample.

One-Way ANOVA: The Process


A one-way ANOVA uses the following null and alternative hypotheses:
• H0 (null hypothesis): μ1 = μ2 = μ3 = … = μk (all the population
means are equal)
• H1 (alternative hypothesis): at least one population mean is different
from the rest.

A one-way ANOVA is often performed using statistical software (such
as R, Excel, Stata, SPSS, etc.) because it is time-consuming to do by hand.
Regardless of which software you use, you will get the following table as
output:

Table 1

Example:
Suppose we want to know whether or not three different exam prep
programs lead to different mean scores on a certain exam. To test this, we
recruit 30 students to participate in a study and split them into three
groups. The students in each group are randomly assigned to use one of the
three exam prep programs for the next three weeks to prepare for an exam.
At the end of the three weeks, all of the students take the same exam.

Table 2

To perform a one-way ANOVA on this data, we will use the Statology
One-Way ANOVA Calculator.

From the output table we see that the F test statistic is 2.358 and the
corresponding p-value is 0.11385. Since this p-value is not less than 0.05,
we fail to reject the null hypothesis. This means we don’t have sufficient
evidence to say that there is a statistically significant difference between
the mean exam scores of the three groups.

Table 3

Two-Way Analysis of Variance (ANOVA)

A two-way ANOVA is used to determine whether or not there is a
statistically significant difference between the means of three or more
independent groups that have been split on two variables (sometimes called
“factors”).

Two-Way ANOVA: Assumptions


For the results of a two-way ANOVA to be valid, the following assumptions
should be met:
1. Normality – The response variable is approximately normally distributed
for each group.
2. Equal Variances – The variances for each group should be roughly equal.
3. Independence – The observations in each group are independent of each
other and the observations within groups were obtained by a random
sample.

Two-Way ANOVA Example:

A botanist wants to know whether or not plant growth is influenced by
sunlight exposure and watering frequency. She plants 40 seeds and lets
them grow for two months under different conditions for sunlight
exposure and watering frequency. After two months, she records the height
of each plant. The results are shown:

Table 4

In the table, we see that there were five plants grown under each
combination of conditions. For example, there were five plants grown with
daily watering and no sunlight, and their heights after two months were
4.8 inches, 4.4 inches, 3.2 inches, 3.9 inches, and 4.4 inches.

The table shows the result of the two-way ANOVA. We can observe the
following:
• The p-value for the interaction between watering frequency and
sunlight exposure was 0.310898. This is not statistically significant at
alpha level 0.05.
• The p-value for watering frequency was 0.975975. This is not
statistically significant at alpha level 0.05.
• The p-value for sunlight exposure was 3.9E-8 (0.000000039). This is
statistically significant at alpha level 0.05.
These results indicate that sunlight exposure is the only factor that
has a statistically significant effect on plant height. And because there is no
interaction effect, the effect of sunlight exposure is consistent across each
level of watering frequency. That is, whether a plant is watered daily or
weekly has no impact on how sunlight exposure affects a plant.

Table 5
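
The decision rule applied to each p-value can be written out explicitly. The short sketch below only restates the comparison against alpha = 0.05 using the three p-values reported in the text; the dictionary keys and variable names are chosen here for illustration.

```python
ALPHA = 0.05

# p-values as reported for the two-way ANOVA in the text.
p_values = {
    "watering frequency": 0.975975,
    "sunlight exposure": 3.9e-8,
    "interaction": 0.310898,
}

# An effect is called statistically significant when its p-value is below alpha.
significant = [name for name, p in p_values.items() if p < ALPHA]
print(significant)
```

Only sunlight exposure falls below the threshold, which matches the conclusion stated above.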

B. Regression Analysis
Regression is a statistical technique widely employed in finance,
investing, and various other fields to analyze and quantify the relationship
between a dependent variable (typically represented as y) and one or more
independent variables. This method aids financial and investment managers
in asset valuation and in understanding how variables, such as commodity
prices, relate to the stock prices of companies involved with those
commodities.
The objective of regression analysis is to explain the variability in a
dependent variable by means of one or more independent (or control)
variables.

Applications

There are four broad classes of applications of regression analysis.
• Descriptive or explanatory: interest may be in describing “What
factors influence variability in the dependent variable?” For example,
the factors contributing to higher sales among a company’s sales force.
• Predictive: for example, setting a normal quota or baseline sales. We
can also use the estimated equation to identify “normal” and “abnormal”
or outlier observations.
• Comparing alternative theoretical explanations:
– Consumers use a reference price in comparing alternatives.
– Consumers use specific price points in comparing alternatives.
• Decision purpose:
– Estimating variable and fixed costs having calibrated a cost function.
– Estimating sales, revenues and profits having calibrated a demand
function.
– Setting optimal values of marketing mix variables.
– Using the estimated equation for “What if” analysis.

Data Requirement
• Measurements on two or more variables, one of which must be the
dependent variable.
• The dependent variable must have interval or ratio scale measurement.
• If independent variables are nominal scaled (e.g. brand choice), then
appropriate caution must be maintained so that results from the analysis
can be interpreted. For example, it may be necessary to create dummy
variables, that is, variables that take the values 0 and 1.
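
As an illustration of the dummy-variable idea, the sketch below turns a nominal variable into 0/1 columns, leaving one level out as the reference category. The helper name `dummy_code` and the brand labels are hypothetical, introduced only for this example.

```python
def dummy_code(values, reference):
    """Return a dict mapping each non-reference level to a 0/1 column."""
    levels = sorted(set(values) - {reference})
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels
    }


# Hypothetical brand choices for five respondents (illustration only).
brands = ["A", "B", "C", "A", "B"]
dummies = dummy_code(brands, reference="A")

for level, column in dummies.items():
    print(f"brand_{level}: {column}")
```

With three brands, two dummy columns are enough: a respondent with 0 in both columns chose the reference brand. Each coefficient on a dummy column is then read as a difference from the reference brand.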

Steps in Regression Analysis

1. Decide on the purpose of the model and the appropriate dependent
variable to meet that purpose.
2. Decide on the independent variables.
3. Estimate the parameters of the regression equation.
4. Interpret the estimated parameters, goodness of fit, and qualitative and
quantitative assessment of the parameters.
5. Assess the appropriateness of the assumptions.
6. If some assumptions are not satisfied, modify and revise the estimated
equation.
7. Validate the estimated regression equation.

Estimating Parameters
• Method of least squares, or
• Method of maximum likelihood, or
• Weighted least squares, or
• Method of least absolute deviations.
We will examine several alternative approaches to estimating
parameters, including the situation where we have only two observations.
Value of dependent variable = Constant + Slope × Value of independent
variable + Error
y = a + b × x + E
• Constant (a), Slope (b) and Error (E) are unknown.
• You observe N pairs of values of the dependent and independent
variables.
• Regression analysis provides reasonable (statistically unbiased)
values for the slope(s) and intercept.
An Illustrative Example - Two observations only.
Suppose we have two observations (x1, y1) and (x2, y2), namely (5,10) and
(20,20). These observations can be shown graphically as follows.

The resulting equation would be y = 6.67 + 0.67 × x, since the slope is
(20 − 10)/(20 − 5) = 0.67 and the line must pass through both points.

Now, suppose we have two observations (x1, y1) and (x2, y2), namely (5,20)
and (20,10). These observations can be shown graphically as follows.

The resulting equation would be y = 23.33 − 0.67 × x. Now suppose we
observe five pairs of x and y observations as follows: (−2, 0), (−1, 0), (0, 1),
(1, 1) and (2, 3). These are displayed below along with the regression line,
which is shown in dashed format.
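
The examples above can be checked with a small least-squares sketch. The function name `fit_line` is an assumption made for illustration; the formulas are the usual ones for simple linear regression (slope = sum of cross-deviations divided by sum of squared x-deviations, intercept = mean of y minus slope times mean of x).

```python
def fit_line(points):
    """Least-squares intercept a and slope b for y = a + b * x."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# The two-observation cases and the five-observation case from the text.
a1, b1 = fit_line([(5, 10), (20, 20)])
a2, b2 = fit_line([(5, 20), (20, 10)])
a3, b3 = fit_line([(-2, 0), (-1, 0), (0, 1), (1, 1), (2, 3)])

print(f"y = {a1:.2f} + {b1:.2f} * x")
print(f"y = {a2:.2f} + {b2:.2f} * x")
print(f"y = {a3:.2f} + {b3:.2f} * x")
```

With only two observations the fitted line passes exactly through both points, so the error term is zero. With five observations the line y = 1 + 0.7 × x no longer fits every point, and the leftover differences are the errors E.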

Nothing much changes if we have multiple variables. We do, however,
need to account for the joint variability of the independent variables.
Consider a situation with two independent variables ($x_{1i}$ and $x_{2i}$).
That is,
$$y_i = a + b_1 x_{1i} + b_2 x_{2i} + E_i.$$
Here our interest lies in finding the best values of $a$, $b_1$ and $b_2$. To
derive these, we could follow the above steps: first average both sides,
then subtract the averages, and finally multiply by $(x_{1i} - \bar{x}_1)$
and $(x_{2i} - \bar{x}_2)$. This will give us two equations with two unknowns.
That is, with the error term omitted,
$$(y_i - \bar{y}) = b_1 (x_{1i} - \bar{x}_1) + b_2 (x_{2i} - \bar{x}_2)$$
Multiply first by $(x_{1i} - \bar{x}_1)$ and then by $(x_{2i} - \bar{x}_2)$.
This will result in
$$(y_i - \bar{y})(x_{1i} - \bar{x}_1) = b_1 (x_{1i} - \bar{x}_1)^2 + b_2 (x_{2i} - \bar{x}_2)(x_{1i} - \bar{x}_1)$$
$$(y_i - \bar{y})(x_{2i} - \bar{x}_2) = b_1 (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2) + b_2 (x_{2i} - \bar{x}_2)^2$$
We would sum both sides of both equations and divide by N − 1.
Moreover, for simplicity, we could make the following substitutions.
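
As a sketch, assuming the substitutions are the usual sample variances and covariances (the symbols $S_{11}$, $S_{22}$, $S_{12}$, $S_{y1}$ and $S_{y2}$ are notation introduced here for illustration, not taken from the source), the two summed equations become a pair of linear equations in $b_1$ and $b_2$ that can be solved directly:

```latex
\begin{aligned}
S_{jk} &= \frac{1}{N-1}\sum_{i=1}^{N}(x_{ji}-\bar{x}_j)(x_{ki}-\bar{x}_k),
\qquad
S_{yj} = \frac{1}{N-1}\sum_{i=1}^{N}(y_i-\bar{y})(x_{ji}-\bar{x}_j),
\qquad j,k \in \{1,2\} \\[4pt]
S_{y1} &= b_1 S_{11} + b_2 S_{12} \\
S_{y2} &= b_1 S_{12} + b_2 S_{22} \\[4pt]
b_1 &= \frac{S_{y1}S_{22} - S_{y2}S_{12}}{S_{11}S_{22} - S_{12}^{2}},
\qquad
b_2 = \frac{S_{y2}S_{11} - S_{y1}S_{12}}{S_{11}S_{22} - S_{12}^{2}} \\[4pt]
a &= \bar{y} - b_1\bar{x}_1 - b_2\bar{x}_2
\end{aligned}
```

The denominator $S_{11}S_{22} - S_{12}^{2}$ is zero when the two independent variables are perfectly correlated, which is why the joint variability of the independent variables matters: in that case $b_1$ and $b_2$ cannot be separated.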

C. Correlation Analysis
Correlation analysis is a statistical method used to evaluate the
strength and direction of the relationship between two or more variables.

Correlation Analysis Methodology

• Define the Problem: Identify the variables that you think might be
related. The variables must be measurable on an interval or ratio
scale.
• Data Collection: Collect data on the variables of interest. The data
could be collected through various means such as surveys,
observations, or experiments. It’s crucial to ensure that the data
collected is accurate and reliable.
• Data Inspection: Check the data for any errors or anomalies such as
outliers or missing values.
• Choose the Appropriate Correlation Method: Select the correlation
method that’s most appropriate for your data. If your data meet the
assumptions for Pearson’s correlation (interval or ratio level, linear
relationship, variables are normally distributed), use that. Otherwise,
consider a rank-based method such as Spearman’s rank correlation or
Kendall’s Tau.
• Compute the Correlation Coefficient: Once you’ve selected the
appropriate method, compute the correlation coefficient. This can be
done using statistical software such as R, Python, or SPSS, or
manually using the formulas.
• Interpret the Results: Interpret the correlation coefficient you
obtained. If the correlation is close to 1 or -1, the variables are
strongly correlated. If the correlation is close to 0, the variables have
little to no linear relationship. Also consider the sign of the correlation
coefficient: a positive sign indicates a positive relationship (as one
variable increases, so does the other), while a negative sign indicates
a negative relationship (as one variable increases, the other
decreases).
• Check the Significance: It’s also important to test the statistical
significance of the correlation. This typically involves performing a
t-test. A small p-value (commonly less than 0.05) suggests that the
observed correlation is statistically significant, that is, unlikely to
be due to random chance alone.
• Report the Results: The final step is to report your findings. This
should include the correlation coefficient, the significance level, and a
discussion of what these findings mean in the context of your research
question.
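
The compute-and-test steps can be sketched as follows. This is an illustrative example, not a prescribed procedure: the function names are assumptions, the data are the five (x, y) pairs from the regression example earlier in this report, and the critical value 3.182 is the two-sided 5% value for 3 degrees of freedom read from a standard t table.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def t_statistic(r, n):
    """t statistic for testing zero correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# The five pairs used in the regression example.
xs = [-2, -1, 0, 1, 2]
ys = [0, 0, 1, 1, 3]

r = pearson_r(xs, ys)
t = t_statistic(r, len(xs))
T_CRIT = 3.182  # two-sided 5% critical value for 3 df, from a t table

print(f"r = {r:.4f}, t = {t:.4f}, significant: {abs(t) > T_CRIT}")
```

Here r is about 0.90 and t is about 3.66, which exceeds 3.182, so the positive correlation would be reported as significant at the 0.05 level. With only five observations, that conclusion should still be treated cautiously.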

Types of Correlation
(1)Pearson Correlation
This is the most common type of correlation analysis. Pearson
correlation measures the linear relationship between two continuous
variables. Significance tests on it assume that the variables are
approximately normally distributed and that the scatter of one variable is
roughly constant across the values of the other. The correlation coefficient
(r) ranges from -1 to +1, with -1 indicating a perfect negative linear
relationship, +1 indicating a perfect positive linear relationship, and 0
indicating no linear relationship.

(2)Spearman Rank Correlation


Spearman’s rank correlation is a non-parametric measure that
assesses how well the relationship between two variables can be described
using a monotonic function. In other words, it evaluates the degree to
which, as one variable increases, the other variable tends to increase (or
decrease), without requiring that change to occur at a constant rate.
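
One common way to compute Spearman’s coefficient is to replace each value by its rank and then apply the Pearson formula to the ranks. The sketch below follows that approach; the helper names and the small data set (y equal to x squared) are hypothetical and chosen only to show a relationship that is monotonic but not linear.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Hypothetical data: y grows with x, but not at a constant rate.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

rho = spearman_rho(xs, ys)
r = pearson_r(xs, ys)
print(f"Spearman = {rho:.3f}, Pearson = {r:.3f}")
```

Because y always increases when x increases, the Spearman coefficient is 1, while the Pearson coefficient is slightly below 1 because the points do not fall on a straight line.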

(3)Kendall’s Tau
Kendall’s Tau is another non-parametric correlation measure used to
detect the strength of dependence between two variables. Kendall’s Tau is
often used for variables measured on an ordinal scale (i.e., where values can
be ranked).
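
A minimal sketch of the idea behind Kendall’s Tau counts pairs of observations that are ordered the same way on both variables (concordant) or in opposite ways (discordant). The version below is the simple tau-a form, which does not adjust for ties; the function name and the four data points are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        # Pairs tied on either variable count toward neither total.
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings given to four items by two judges.
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]

tau = kendall_tau_a(xs, ys)
print(f"tau = {tau:.3f}")
```

Of the six possible pairs, five are concordant and one is discordant, giving tau = (5 − 1)/6, or about 0.667.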

(4)Point-Biserial Correlation
This is used when you have one dichotomous and one continuous
variable, and you want to test for correlations. It’s a special case of the
Pearson correlation.

(5)Phi Coefficient
This is used when both variables are dichotomous or binary (having
two categories). It’s a measure of association for two binary variables.

(6)Canonical Correlation
This measures the correlation between two multi-dimensional
variables. Each is a set of several variables, and the method finds the
linear combination of each set such that the correlation between the two
combinations is as large as possible.

(7)Partial and Semi-Partial (Part) Correlations


These are used when the researcher wants to understand the
relationship between two variables while controlling for the effect of one or
more additional variables.

(8)Cross-Correlation
Used mostly in time series data to measure the similarity of two series
as a function of the displacement of one relative to the other.

(9)Autocorrelation
This is the correlation of a signal with a delayed copy of itself as a
function of delay. This is often used in time series analysis to help
detect repeating patterns and persistence in the data over time.
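
The following sketch computes the sample autocorrelation at a given lag using one common definition (lagged cross-products of deviations from the mean, divided by the total sum of squared deviations). The function name and the alternating series are hypothetical, chosen so the pattern is easy to see.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given non-negative lag."""
    n = len(series)
    m = sum(series) / n
    denom = sum((v - m) ** 2 for v in series)
    # Products of each value with the value `lag` steps later.
    num = sum((series[t] - m) * (series[t + lag] - m) for t in range(n - lag))
    return num / denom


# Hypothetical series that alternates every step.
series = [1, -1, 1, -1, 1, -1]

for lag in range(3):
    print(lag, round(autocorrelation(series, lag), 4))
```

The lag-1 value is negative (each value is followed by its opposite) and the lag-2 value is positive (the pattern repeats every two steps), which is the kind of repeating structure autocorrelation is used to reveal.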
III. ASSESSMENT
Situation 1:
Neuroscience researchers examined the impact of environment on rat
development. Rats were randomly assigned to be raised in one of the four
following test conditions: impoverished (wire mesh cage, housed alone),
standard (cage with other rats), enriched (cage with other rats and toys),
super enriched (cage with rats and toys changed on a periodic basis). After
two months, the rats were tested on a variety of learning measures
(including the number of trials to learn a maze to a criterion of three
perfect trials), and several neurological measures (overall cortical weight,
degree of dendritic branching, etc.). The data for the maze task are below.
Compute the appropriate test for the data provided below.

Impoverished   Standard   Enriched   Super Enriched
22             17         12         8
19             21         14         7
15             15         11         10
24             12         9          9
18             19         15         12

1. What is your computed answer?
2. What would be the null hypothesis in this study?
3. What would be the alternate hypothesis?
4. What is your Fcrit?
5. Are there any significant differences between the four testing
conditions?
6. Interpret your answer.

Situation 2:
A research study was conducted to examine the clinical efficacy of a
new antidepressant. Depressed patients were randomly assigned to one of
three groups: a placebo group, a group that received a low dose of the drug,
and a group that received a moderate dose of the drug. After four weeks of
treatment, the patients completed the Beck Depression Inventory. The
higher the score, the more depressed the patient. The data are presented
below. Compute the appropriate test.

Placebo   Low Dose   Moderate Dose
38        22         14
47        19         26
39        8          11
25        23         18
42        31         5

1. What is your computed answer?


2. What would be the null hypothesis in this study?
3. What would be the alternate hypothesis?
4. What probability level did you choose and why?
5. What is your Fcrit?
6. Is there a significant difference between the groups?
7. If there is a significant difference, where specifically are the
differences?
8. Interpret your answer.

3. Calculate the regression coefficient and obtain the lines of regression for
the following data.

4. Calculate the two regression equations of X on Y and Y on X from the
data given below, taking deviations from the actual means of X and Y.

5. The graph below represents each individual’s weight and corresponding
blood pressure. Recall the formulas for calculating a regression line from
the previous sections. Using the correlation coefficient and regression
line, interpret the graph.

Person   Weight   Blood Pressure
A        150      125
B        169      130
C        175      160
D        180      169
E        200      150

IV. REFERENCES

Howell, D. C. (2012). Statistical methods for psychology (8th ed.).
Wadsworth.

Gravetter, F. J., & Wallnau, L. B. (2016). Statistics for the behavioral
sciences (10th ed.). Cengage Learning.

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. SAGE
Publications.

National Institute of Statistics. (2020). Analysis of variance, regression, and
correlation. Retrieved from [Link]
