HO CHI MINH UNIVERSITY OF TECHNOLOGY
FACULTY OF MECHANICAL ENGINEERING
ASSIGNMENT REPORT
TITLE:
LINEAR REGRESSION MODEL FOR PREDICTING 3D PRINT QUALITY AND
STRENGTH
COURSE: PROBABILITY AND STATISTICS (MT2013)
SUPERVISOR: PHAN THỊ HƯỜNG
CLASS: CC07
GROUP: 2
Ho Chi Minh City, 2024
CONTRIBUTION TABLE
No.  Student ID  Name              Tasks                                      Contribution
1                                                                             20%
2                                                                             20%
3                                                                             20%
4                                                                             20%
5    2252838     Huỳnh Minh Triết  Code and Data Availability + Write Report  20%
Comment Evaluations
INTRODUCTION
Probability and statistics are two branches of mathematics which involve
collecting, analyzing, and interpreting numerical data to make informed decisions and
predictions. Together, they provide powerful tools for understanding uncertainty,
identifying patterns, and making evidence-based decisions in various fields in both social
and natural science. By quantifying uncertainty and uncovering insights from data,
probability and statistics enable deeper understanding of a subject.
Precision and accuracy are paramount in mechanical engineering in general and in
3D printing in particular. Statistics plays a crucial role here by providing tools to
analyze and interpret data from experiments, simulations, and real-world observations.
By applying the theory and practice of ANOVA and the linear regression model used
in statistical analysis, this study aims to determine how strongly the adjustable
parameters of a 3D printer affect print quality, accuracy, and strength.
1. DATA INTRODUCTION
1.1. Dataset Description
₋ Title: 3D Printer Dataset for Mechanical Engineers
₋ Access Link: [Link]
₋ Source information:
+ Dataset comes from research by TR/Selcuk University Mechanical
Engineering department.
+ Work is based on the Ultimaker S5 3-D printer settings and filaments.
+ Material and strength tests were carried out on a Sincotec GMBH tester
capable of pulling 20 kN.
+ There are nine setting parameters and three measured output parameters
(described in section 1.2).
+ Number of observations/measurements: 50
1.2. Variables Description
Variable                     Type          Unit   Description
Setting Parameters
Layer Height                 Quantitative  mm     Vertical thickness of each layer deposited by the 3D printer
Wall Thickness               Quantitative  mm     Width of walls or outer shell of printed object
Infill Density               Quantitative  %      Percentage of interior space filled with material
Infill Pattern               Categorical   none   Geometric pattern used to fill interior of the object
Nozzle Temperature           Quantitative  ℃      Temperature of extruder nozzle
Bed Temperature              Quantitative  ℃      Temperature of print bed
Print Speed                  Quantitative  mm/s   Speed at which printer's nozzle moves along X, Y, and Z axes
Material                     Categorical   none   Type of material used for printing
Fan Speed                    Quantitative  %      Speed of cooling fan
Output Parameters (Measured)
Roughness                    Quantitative  µm     Surface roughness or texture of printed object
Tension (ultimate) Strength  Quantitative  MPa    Maximum stress printed object can withstand before breaking under tension
Elongation                   Quantitative  %      Percentage by which printed object can stretch or deform before breaking
1.3. Procedure Steps
₋ Step 1: Data preprocessing
Data reading
Dealing with missing data, data transformation
Defining new features
Adding, removing, converting variables (if necessary)
₋ Step 2: Descriptive statistics
Using sample statistics and plots
₋ Step 3: Inferential statistics (main method)
Applying a linear regression model to analyse the relationships between the
setting parameters and to find their impact on the printer's output parameters
₋ Step 4: Prediction
Testing the assumption from the results
2. BACKGROUND
2.1. Regression Analysis
2.1.1. Definition
Regression analysis is a statistical method used to study the relationship between a
dependent variable (Y) and one or more independent variables (X), also known as
explanatory variables. The main objective of regression analysis is to predict or describe
the dependent variable based on the explanatory variables. The relationship between X
and Y can be represented in the form of a linear function or equation.
General idea: estimate a random variable Y (the dependent variable) approximately as a
function $F(X_1, \dots, X_n)$ of other random variables $X_1, \dots, X_n$ (control variables, or
independent variables). This means that when we have the values of $X_1, \dots, X_n$, we want
to estimate the value of Y from them. Function F may depend on some parameters
$\theta = (\theta_1, \dots, \theta_k)$. We can write Y as follows:
$$Y = F_\theta(X_1, \dots, X_n) + \epsilon$$
₋ Where ϵ is the error (also a random variable). We want to choose the most
appropriate function F and parameters θ so that the error is as small as possible.
₋ $\sqrt{E(|\epsilon|^2)}$ is called the standard error of the regression model: a model with a
lower standard error is considered more accurate.
₋ In the functional relationship, for each value of X, we find a unique value of Y.
However, in statistics, one value of X can correspond to multiple different values
of Y because besides the main variable X, the variable Y may also be influenced
by other factors.
2.1.2. Simple Linear Regression Model
A simple linear regression model involving a dependent variable Y and a single
independent variable X is expressed as the equation:
$$Y = \beta_0 + \beta_1 X + \epsilon$$
Where:
₋ $\beta_0$ and $\beta_1$ are unknown parameters (the intercept and the slope
coefficient of the regression line, respectively).
₋ Y is the dependent variable, and X is the independent variable.
₋ ϵ represents the error component, assumed to have a normal distribution $N(0, \sigma^2)$.
₋ The term "linear" in the simple linear regression model refers to linearity in the
regression coefficients, not necessarily in the variables Y and X.
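As a minimal sketch of how the intercept and slope are estimated in R, the example below fits a simple linear regression to simulated data. The variable names, the true coefficients (20 and 1200) and the noise level are hypothetical values chosen for illustration; they are not taken from the 3D printer dataset.

```r
# Simulated data (hypothetical values, not the 3D printer dataset)
set.seed(1)
x <- seq(0.02, 0.20, length.out = 50)               # stand-in predictor
y <- 20 + 1200 * x + rnorm(50, mean = 0, sd = 10)   # true beta0 = 20, beta1 = 1200

# Least-squares estimates of beta0 (intercept) and beta1 (slope)
fit_simple <- lm(y ~ x)
beta_hat <- coef(fit_simple)

# Closed-form check of the same estimates:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope_manual <- cov(x, y) / var(x)
intercept_manual <- mean(y) - slope_manual * mean(x)

print(beta_hat)
```

The closed-form values agree with the output of lm(), which shows that lm() is applying ordinary least squares.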
2.2. Multiple Regression Model
2.2.1. Definition
Multiple linear regression is an extension of simple linear regression. It is used to
predict the value of a response variable based on the values of two or more explanatory
variables. The variable we want to predict is called the response variable (dependent
variable). The variables we use to predict the value of the response variable are called
explanatory variables (predictor variables, independent variables).
The general form of a multiple linear regression model is:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_i X_i + \mu$$
Where:
₋ Y is the dependent variable.
₋ $X_i$ are the explanatory variables.
₋ $\beta_i$ are the regression coefficients.
₋ $\beta_0$ is the intercept.
₋ μ is the random error.
The coefficients $\beta_i$ represent the change in the expected value of Y for a one-unit change
in $X_i$, holding other variables constant.
₋ If $\beta_i > 0$: the relationship between Y and $X_i$ is positive, meaning that when $X_i$
increases (or decreases) with other independent variables held constant, Y also
increases (or decreases).
₋ If $\beta_i < 0$: the relationship between Y and $X_i$ is negative, meaning that when $X_i$
increases (or decreases) with other independent variables held constant, Y
decreases (or increases).
₋ If $\beta_i = 0$: there is no linear relationship between Y and $X_i$ once the other
variables are held constant, meaning that Y does not depend linearly on $X_i$.
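The "one-unit change, other variables held constant" reading of a coefficient can be checked numerically. The sketch below uses simulated data with two hypothetical predictors, x1 and x2; the names and values are assumptions made for illustration only.

```r
# Simulated data with two hypothetical predictors
set.seed(2)
n <- 50
x1 <- runif(n, 0.02, 0.20)
x2 <- runif(n, 190, 250)
y <- 5 + 1000 * x1 - 0.5 * x2 + rnorm(n, sd = 5)
toy <- data.frame(y, x1, x2)

fit_multi <- lm(y ~ x1 + x2, data = toy)

# Predicted change in Y when x1 rises by one unit while x2 stays fixed
base    <- data.frame(x1 = 0.10, x2 = 220)
shifted <- data.frame(x1 = 1.10, x2 = 220)
delta <- predict(fit_multi, shifted) - predict(fit_multi, base)

print(coef(fit_multi))
print(delta)   # matches the estimated coefficient of x1
```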
2.2.2. Testing the Model
In multiple regression models, the null hypothesis of the overall test states that the
model has no explanatory power, meaning all slope coefficients are equal to zero.
The Wald test (often called the F-test) is conducted as follows:
₋ Step 1: State the null hypothesis $H_0: \beta_1 = \beta_2 = \dots = \beta_i = 0$.
₋ Step 2: Regress Y on a constant term and $X_1, X_2, \dots, X_i$ to obtain the unrestricted
residual sum of squares $RSS_U$, and regress Y on the constant term alone to obtain the
restricted residual sum of squares $RSS_R$. With n observations and k estimated
coefficients (including the intercept), the test statistic is
$$F_c = \frac{(RSS_R - RSS_U)/(k - 1)}{RSS_U/(n - k)}$$
Under $H_0$, $F_c$ follows an F-distribution, which is the distribution of the ratio of two
independent chi-square variables, each divided by its degrees of freedom.
₋ Step 3: Look up the critical value in the F-table corresponding to (k – 1) degrees of
freedom for the numerator and (n – k) for the denominator, at a predetermined
significance level α.
₋ Step 4: Reject the null hypothesis $H_0$ at significance level α if $F_c > F(\alpha, k-1, n-k)$.
For the p-value method, calculate the p-value $= P(F > F_c \mid H_0)$ and reject $H_0$ if
p < α.
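The four steps above can be sketched in R as follows. The data are simulated and the variable names are hypothetical; the hand-computed statistic is compared with the F-statistic that summary() reports for the same model.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(3)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

unrestricted <- lm(y ~ x1 + x2)   # constant term plus all predictors
restricted   <- lm(y ~ 1)         # constant term only (model under H0)

rss_u <- sum(residuals(unrestricted)^2)
rss_r <- sum(residuals(restricted)^2)
k <- length(coef(unrestricted))   # number of coefficients, intercept included

# Test statistic, critical value and p-value at alpha = 0.05
alpha <- 0.05
f_c <- ((rss_r - rss_u) / (k - 1)) / (rss_u / (n - k))
f_crit <- qf(1 - alpha, df1 = k - 1, df2 = n - k)
p_value <- pf(f_c, df1 = k - 1, df2 = n - k, lower.tail = FALSE)
reject_h0 <- f_c > f_crit

print(c(F = f_c, critical = f_crit, p = p_value))
```

Comparing f_c with the critical value and comparing the p-value with α always lead to the same decision.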
2.2.3. Testing the Assumptions of the Multiple Regression Model
Recalling the regression model and its assumptions:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_i X_i + \mu$$
₋ Assumption 1: Linearity of data: the relationship between the predictor variables
X and the response variable Y is assumed to be linear.
₋ Assumption 2: Normal distribution of errors.
₋ Assumption 3: Constant variance of errors.
₋ Assumption 4: Expectation of errors = 0.
₋ Assumption 5: Independence of the errors $\mu_1, \dots, \mu_n$ of the n observations.
2.3. ANOVA
2.3.1. Definition
Analysis of Variance (ANOVA) is a parametric statistical technique used to compare
groups of data based on the mean values of samples observed from these groups. It
evaluates whether the group means are equal through hypothesis testing. In research,
ANOVA is used as a tool to examine the influence of a causal factor on an outcome
factor.
ANOVA is essentially an extension of the t-test for independent samples to the
comparison of means of multiple groups of independent observations. Unlike the t-test,
ANOVA can compare more than two groups. Note that ANOVA does not compare
variances; it analyzes variances in order to compare means.
ANOVA is used to test the hypothesis that the population means of the groups are
equal.
This technique is based on calculating the variability within groups and the variability
between group means.
There are two procedures for ANOVA: One-way ANOVA and Two-way ANOVA.
2.3.2. Two-Way Analysis of Variance
Two-way ANOVA is an extension of one-way ANOVA. With one-way ANOVA, we
have one independent variable (factor) affecting the dependent variable. With two-way
ANOVA, there are two independent variables.
Assumptions of two-way ANOVA:
The population has a normal distribution.
Each combination of factor levels is observed once, without repetition.
Steps to conduct hypothesis testing: We take non-repeated samples. The units of the first
causal factor are grouped into K groups (columns), and the units of the second causal
factor are arranged into H blocks (rows). Thus, we have a combined table of the two
causal factors consisting of K columns, H rows and K × H data cells. The total number of
observations is n = K × H.
Rows (Blocks)   Columns (Groups)
                1         2         …    K
1               X_{11}    X_{12}    …    X_{1K}
2               X_{21}    X_{22}    …    X_{2K}
…               …         …         …    …
H               X_{H1}    X_{H2}    …    X_{HK}
₋ Step 1: Calculate sample means of groups
Individual group means (K columns):
$$\bar{X}_j = \frac{1}{H} \sum_{i=1}^{H} X_{ij} \quad (j = 1, 2, \dots, K)$$
Individual block means (H rows):
$$\bar{X}_i = \frac{1}{K} \sum_{j=1}^{K} X_{ij} \quad (i = 1, 2, \dots, H)$$
Overall sample mean:
$$\bar{X} = \frac{\sum_{i=1}^{H} \sum_{j=1}^{K} X_{ij}}{n} = \frac{\sum_{i=1}^{H} \bar{X}_i}{H} = \frac{\sum_{j=1}^{K} \bar{X}_j}{K}$$
₋ Step 2: Calculate sums of squared deviations
SST: Total sum of squares, reflecting the variability of the outcome factor due to the
influence of all factors.
Formula:
$$SST = \sum_{i=1}^{H} \sum_{j=1}^{K} (X_{ij} - \bar{X})^2$$
SSK: Sum of squares between groups (columns), reflecting the variability of the outcome
factor due to the influence of the first factor (arranged by column).
Formula:
$$SSK = H \sum_{j=1}^{K} (\bar{X}_j - \bar{X})^2$$
SSH: Sum of squares between blocks (rows), reflecting the variability of the outcome
factor due to the influence of the second factor (arranged by row).
Formula:
$$SSH = K \sum_{i=1}^{H} (\bar{X}_i - \bar{X})^2$$
SSE: Sum of squares of residuals, reflecting the variability of the outcome factor due to
the influence of other unrelated factors.
Formula:
$$SSE = SST - SSK - SSH$$
₋ Step 3: Calculate variances
$$MSK = \frac{SSK}{K - 1}$$
$$MSH = \frac{SSH}{H - 1}$$
$$MSE = \frac{SSE}{(K - 1)(H - 1)}$$
₋ Step 4: Hypothesis testing
Calculate the F-test statistics (experimental F):
$$F_1 = \frac{MSK}{MSE}$$
$$F_2 = \frac{MSH}{MSE}$$
Where: $F_1$ and $F_2$ are used for the first factor and the second factor, respectively.
Find the theoretical F for the two causal factors:
+ Factor 1:
Standard F = F(K-1, (K-1)(H-1), α) is the critical value obtained from the F-distribution
table with K-1 degrees of freedom for the numerator and (K-1)(H-1) degrees of freedom
for the denominator at significance level α.
+ Factor 2:
Standard F = F(H-1, (K-1)(H-1), α) is the critical value obtained from the F-distribution
table with H-1 degrees of freedom for the numerator and (K-1)(H-1) degrees of freedom
for the denominator at significance level α.
If experimental $F_1$ > theoretical $F_1$, reject $H_0$, meaning the means of the K groups
(columns) are not all equal.
If experimental $F_2$ > theoretical $F_2$, reject $H_0$, meaning the means of the H blocks
(rows) are not all equal.
Two-Way Analysis of Variance Table:
Source of variation       Sum of squares (SS)  Degrees of freedom (df)  Mean Square (MS)  F-ratio
Between columns (groups)  SSK                  (K - 1)                  MSK               F1
Between rows (blocks)     SSH                  (H - 1)                  MSH               F2
Residual                  SSE                  (K - 1)(H - 1)           MSE
Total                     SST                  (n - 1)
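The steps and the table above can be reproduced in R on a small example. The 4 × 3 table of values below is hypothetical, and the cross-check with aov() is added only to confirm that the hand computation matches R's built-in two-way ANOVA without replication.

```r
# Hypothetical H x K table: rows are blocks, columns are groups
X <- matrix(c(12, 15, 14,
              10, 13, 15,
              11, 16, 17,
               9, 12, 13),
            nrow = 4, byrow = TRUE)
H <- nrow(X)
K <- ncol(X)

# Step 1: group (column), block (row) and overall means
col_means  <- colMeans(X)
row_means  <- rowMeans(X)
grand_mean <- mean(X)

# Step 2: sums of squares
SST <- sum((X - grand_mean)^2)
SSK <- H * sum((col_means - grand_mean)^2)
SSH <- K * sum((row_means - grand_mean)^2)
SSE <- SST - SSK - SSH

# Step 3: mean squares
MSK <- SSK / (K - 1)
MSH <- SSH / (H - 1)
MSE <- SSE / ((K - 1) * (H - 1))

# Step 4: F statistics and critical values at alpha = 0.05
F1 <- MSK / MSE
F2 <- MSH / MSE
F1_crit <- qf(0.95, K - 1, (K - 1) * (H - 1))
F2_crit <- qf(0.95, H - 1, (K - 1) * (H - 1))

# Cross-check with aov(); as.vector(X) reads the matrix column by column
long <- data.frame(value = as.vector(X),
                   block = factor(rep(seq_len(H), times = K)),
                   group = factor(rep(seq_len(K), each = H)))
aov_table <- summary(aov(value ~ group + block, data = long))[[1]]
print(aov_table)
```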
3. DATA ANALYSIS
3.1 Data reading
First, read data by using [Link] and display the data to the terminal to check if data
is successfully imported.
Figure 1: 3D Printer Dataset for Mechanical Engineers
3.2 Checking missing values
The command [Link](data) returns a table of the same shape as the data that marks each
cell holding a null (missing) value. Applying the sum command to this table therefore
gives the total number of missing cells.
Result of checking missing values: our data does not contain any null values, so no
treatment is needed and we move to the next step.
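A minimal sketch of this kind of check is given below. It assumes the base R function is.na() is the one intended, since the function name is not legible in the text above, and it uses a small hypothetical data frame with two missing cells rather than the real dataset.

```r
# Hypothetical data frame with two missing cells
toy <- data.frame(layer_height = c(0.02, 0.06, NA, 0.10),
                  roughness    = c(25, 32, 40, NA),
                  material     = c("abs", "pla", "abs", "pla"))

na_mask <- is.na(toy)                   # logical matrix, TRUE where a cell is missing
total_missing      <- sum(na_mask)      # number of missing cells
missing_per_column <- colSums(na_mask)  # missing cells in each column
rows_with_missing  <- sum(!complete.cases(toy))  # rows with at least one missing cell

print(missing_per_column)
```

Note that sum(is.na(...)) counts cells, while complete.cases() is the base R tool for counting affected rows.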
3.3 Data summary
Data statistics
First, we convert material and infill_pattern into factors by using
dat$material=[Link](dat$material)
dat$infill_pattern=[Link](dat$infill_pattern)
and then display an overview of the data using summary(dat).
Figure 2: Overview of dataset
Next, we examine the influence of the infill_pattern factor on the output parameters
by using the boxplot() function.
The result:
Figure 3: Boxplot between grid and honeycomb
Comment:
In the first graph only, the median for grid is slightly higher than the median for
honeycomb, while in the two remaining graphs the medians for grid are slightly lower
than those for honeycomb. Overall, the choice of infill_pattern does not have a large
effect on the output parameters.
After that, we repeat the procedure with the material factor and the output parameters.
The result:
Figure 4: Boxplot between abs and pla
Comment:
In the first graph only, the median for abs is higher than the median for pla, while in
the two remaining graphs the medians for abs are clearly lower than those for pla. In
conclusion, the choice of material strongly affects the output parameters.
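The group comparisons above can be sketched as follows. The six roughness values are hypothetical, and as.factor() is assumed to be the conversion function; the example shows that the thick line in each box is the group median, which is the quantity compared in the comments.

```r
# Hypothetical measurements for two materials
toy <- data.frame(
  material  = c("abs", "abs", "abs", "pla", "pla", "pla"),
  roughness = c(120, 150, 180, 90, 110, 100)
)
toy$material <- as.factor(toy$material)   # treat material as a categorical variable

# Median of roughness within each material
group_medians <- tapply(toy$roughness, toy$material, median)

# Draw the boxplot; pdf(NULL) discards the graphics when run non-interactively
pdf(NULL)
bp <- boxplot(roughness ~ material, data = toy, ylab = "roughness")
invisible(dev.off())

print(group_medians)
print(bp$stats[3, ])   # third row of the boxplot statistics holds the medians
```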
In the next step, we use histograms to examine the distributions of the output
parameters.
Figure 5: Histogram
From the three histograms above, we can see that the values are not evenly distributed.
In the histogram of roughness, most values are concentrated between 50 and 200. In the
histogram of elongation, most values are concentrated in the middle, from 1 to 2. In the
histogram of tension_strength, most values are concentrated toward the right, from 25
to 30.
3.4 Correlation coefficients between variables.
To see the linear relationship between each pair of variables, we plot the correlation
coefficients of all variables using the corrplot function and display these coefficients in
the terminal.
Figure 6: Correlogram of the data
Figure 7: Correlation parameters summary
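A correlation table of this kind can be produced with base R alone. The sketch below uses cor() on a small hypothetical data frame (corrplot itself comes from a separate package and is not needed for the numbers); the values are made up so that two of the columns are exactly proportional.

```r
# Hypothetical data: fan_speed is an exact linear function of bed_temperature
toy <- data.frame(
  bed_temperature = c(60, 65, 70, 75, 80),
  fan_speed       = c(0, 25, 50, 75, 100),
  roughness       = c(25, 40, 38, 60, 55),
  material        = factor(c("abs", "pla", "abs", "pla", "abs"))
)

# cor() only accepts numeric columns, so factors are dropped first
numeric_cols <- toy[sapply(toy, is.numeric)]
cor_matrix <- round(cor(numeric_cols), 3)

print(cor_matrix)
```

A coefficient of exactly 1 or -1 between two predictors signals that one is a linear function of the other.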
4. INFERENTIAL STATISTICS
4.1 Build linear regression model
From the given data set, we build an appropriate regression model to analyze how the
adjustment parameters affect the product after printing.
First, we define the variables as follows:
₋ The dependent variable: roughness.
₋ The independent variables: layer_height, wall_thickness, infill_density,
infill_pattern, nozzle_temperature, bed_temperature, print_speed, material,
and fan_speed.
The model is written as follows:
$$\text{roughness} = \beta_0 + \beta_1 \times \text{layer\_height} + \beta_2 \times \text{wall\_thickness} + \beta_3 \times \text{infill\_density} + \beta_4 \times \text{infill\_pattern} + \beta_5 \times \text{nozzle\_temperature} + \beta_6 \times \text{bed\_temperature} + \beta_7 \times \text{print\_speed} + \beta_8 \times \text{material} + \beta_9 \times \text{fan\_speed} + \varepsilon$$
We estimate the coefficients $\beta_i$ with $i = 0, 1, 2, \dots, 9$:
#Linear regression model
model_1=lm(roughness~layer_height+wall_thickness+infill_density+infill_pattern+
           nozzle_temperature+bed_temperature+print_speed+material+fan_speed,data=data1)
summary(model_1)
Figure 8: Results when building linear regression model 1
Comment: The figure shows that the estimates of $\beta_0$ to $\beta_9$ are -2371;
1269; 2.334; -0.04231; -0.1255; 15.06; -16.13; 0.6496; 298.5; NA, respectively.
Notably, the value of $\beta_9$ cannot be estimated because of singularities.
To examine these singularities, we check the correlations calculated in
section 3.4.
Figure 9: correlation summary
Comment:
The figure shows that the correlation coefficient between bed_temperature and
fan_speed equals 1, which means that fan_speed is an exact linear function of
bed_temperature. The two variables carry the same information, so their coefficients
cannot be estimated separately.
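This behaviour of lm() can be reproduced on simulated data. In the sketch below, all values and coefficients are hypothetical; fan_speed is constructed as an exact linear function of bed_temperature, so lm() reports NA for it, and dropping it leaves the other estimates unchanged.

```r
# Simulated data (hypothetical): fan_speed duplicates the information in bed_temperature
set.seed(4)
n <- 30
bed_temperature <- rep(c(60, 65, 70, 75, 80), each = 6)
fan_speed <- 5 * (bed_temperature - 60)          # exact linear function
layer_height <- runif(n, 0.02, 0.2)
roughness <- 10 + 800 * layer_height + 1.5 * bed_temperature + rnorm(n, sd = 5)

# Full model: the last perfectly collinear term cannot be estimated
fit_all <- lm(roughness ~ layer_height + bed_temperature + fan_speed)
print(coef(fit_all))        # the fan_speed coefficient is NA

# Reduced model without the redundant term
fit_reduced <- lm(roughness ~ layer_height + bed_temperature)
print(coef(fit_reduced))
```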
Rebuild the linear regression without fan_speed parameter.
Model 1_2: new linear regression
#New linear regression after eliminating fan_speed
model_1_2=lm(roughness~layer_height+wall_thickness+infill_density+infill_pattern+
             nozzle_temperature+bed_temperature+print_speed+material,data=data1)
summary(model_1_2)
Figure 10: New linear regression
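Comment: The values in the column "Estimate" remain the same, hence we get the
equation:
$$\text{roughness} = -2371 + 1269 \times \text{layer\_height} + 2.334 \times \text{wall\_thickness} - 0.04231 \times \text{infill\_density} - 0.1255 \times \text{infill\_patternhoneycomb} + 15.06 \times \text{nozzle\_temperature} - 16.13 \times \text{bed\_temperature} + 0.6496 \times \text{print\_speed} + 298.5 \times \text{material}$$
Now, we use the p-values from the column Pr(>|t|) to test hypotheses that help
examine whether each independent variable considerably affects the output.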
4.2 Use P-values for hypotheses testing
₋ Assuming significance level α = 0.05:
+ Hypothesis H0: βi = 0 => The variable is not statistically significant for the output
value.
+ Hypothesis H1: βi ≠ 0 => The variable is statistically significant for the output
value.
₋ From the column Pr(>|t|), the p-value of layer_height is
$2 \times 10^{-16} < \alpha = 0.05$ => we can reject H0, so adjusting this parameter has a
considerable effect on roughness.
₋ Similarly, nozzle_temperature, bed_temperature, print_speed and material have
a significant effect on roughness.
₋ On the other hand, the p-values of β2 (wall_thickness), β3 (infill_density) and β4
(infill_patternhoneycomb) are 0.29259, 0.85742 and 0.99117, respectively. We
cannot reject the null hypothesis H0 with these p-values. Hence, these variables are not
statistically significant in the regression model.
We eliminate wall_thickness, infill_density and infill_patternhoneycomb from the
model one at a time, starting with the variable that has the largest p-value.
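The p-values in the Pr(>|t|) column can also be read programmatically instead of from the printed summary. The sketch below uses simulated data in which wall_thickness is given no true effect; the variable names mirror the dataset, but all values are hypothetical.

```r
# Simulated data (hypothetical): only layer_height truly affects roughness
set.seed(5)
n <- 50
layer_height   <- runif(n, 0.02, 0.2)
wall_thickness <- runif(n, 1, 10)
roughness <- 20 + 1000 * layer_height + rnorm(n, sd = 20)

fit <- lm(roughness ~ layer_height + wall_thickness)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients
p_values <- coef_table[, "Pr(>|t|)"]

# Terms whose null hypothesis (coefficient = 0) is rejected at alpha = 0.05
alpha <- 0.05
significant <- names(p_values)[p_values < alpha]

print(round(p_values, 5))
print(significant)
```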
Model 2: removing infill_patternhoneycomb
#eliminate infill_patternhoneycomb
model_2=lm(roughness~layer_height+wall_thickness+infill_density+
           nozzle_temperature+bed_temperature+print_speed+material,data=data1)
summary(model_2)
Figure 11: Eliminate infill_patternhoneycomb
Model 3: removing infill_density
#eliminate infill_density
model_3=lm(roughness~layer_height+wall_thickness+
           nozzle_temperature+bed_temperature+print_speed+material,data=data1)
summary(model_3)
Figure 12: Eliminate infill_density
Model 4: removing wall_thickness
#eliminate wall_thickness
model_4=lm(roughness~layer_height+
           nozzle_temperature+bed_temperature+print_speed+material,data=data1)
summary(model_4)
Figure 13: Eliminate wall thickness
4.3 Use anova to compare models
In this section, we use the anova test to find the most suitable regression model for the
output data, that is, the simplest model whose independent variables still explain the
dependent variable adequately.
₋ Assuming significance level α = 0.05:
+ Hypothesis H0: βi = 0 for the removed variable => The reduced model is more effective
+ Hypothesis H1: βi ≠ 0 for the removed variable => The larger model is more effective
₋ After identifying the most appropriate model, we check the assumptions of the linear
regression model.
4.3.1 Compare models
model 1_2 vs model 2
#compare 1_2 vs 2
anova(model_1_2,model_2 )
Figure 14: Anova of model 1_2 and model 2
Comment:
We can see that the p-value is 0.9912, which is greater than the significance level
α = 0.05, so we cannot reject H0. Therefore, model 2 is more effective than model 1_2,
since removing infill_pattern does not significantly worsen the fit.
model 2 vs model 3
#compare 2 vs 3
anova(model_2,model_3 )
Figure 15: Anova of model 2 and model 3
Comment:
We can see that the p-value is 0.8557, which is greater than the significance level
α = 0.05, so we cannot reject H0. Therefore, model 3 is more effective than model 2,
since removing infill_density does not significantly worsen the fit.
model 3 vs model 4
#compare 3 vs 4
anova(model_3,model_4 )
Figure 16: Anova of model 3 and model 4
Comment:
We can see that the p-value is 0.2828, which is greater than the significance level
α = 0.05, so we cannot reject H0. Therefore, model 4 is more effective than model 3,
since removing wall_thickness does not significantly worsen the fit.
Conclusion:
From these comparisons, model 4 is the most effective, hence it is the most appropriate
linear regression model for roughness.
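The nested-model comparison used above can be sketched on simulated data. The variable names mirror the dataset but the values are hypothetical, and infill_density is given no true effect. When the two models differ by a single term, the p-value from anova() equals the t-test p-value of that term in the larger model, regardless of the order in which the two models are passed.

```r
# Simulated data (hypothetical): infill_density has no true effect on roughness
set.seed(6)
n <- 50
layer_height       <- runif(n, 0.02, 0.2)
nozzle_temperature <- runif(n, 200, 250)
infill_density     <- runif(n, 10, 90)
roughness <- 20 + 1000 * layer_height + 2 * nozzle_temperature + rnorm(n, sd = 20)

full    <- lm(roughness ~ layer_height + nozzle_temperature + infill_density)
reduced <- lm(roughness ~ layer_height + nozzle_temperature)

# F-test of H0: the coefficient of the removed variable is zero
cmp <- anova(reduced, full)
p_value <- cmp[["Pr(>F)"]][2]

# t-test p-value of the same variable in the full model
p_t <- summary(full)$coefficients["infill_density", "Pr(>|t|)"]

print(cmp)
print(c(anova = p_value, t_test = p_t))
```

If the p-value exceeds α, H0 is not rejected and the reduced model is kept.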
4.3.2 Checking the assumption of the linear regression model
₋ The regression model is $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i + \varepsilon$, with the
following assumptions:
₋ There must be a linear relationship between the outcome variable and the independent
variables.
₋ The errors have a normal distribution.
₋ The variance of the errors is constant.
₋ The errors ε have expectation 0.
We plot residual analysis graphs to make these assumptions easier to examine.
#plot graph
plot(model_4)
Figure 17: Residual analysis graphs
Comment:
₋ Graph 1 (Residuals vs Fitted) displays the residuals against the fitted values, in order
to verify the assumption of data linearity and the expectation of zero errors:
+ The red line appears almost straight, indicating that the linearity assumption of the
data is met.
+ The residuals are mostly clustered around the zero line y = 0, with only a few
outliers. This supports the assumption that the errors have expectation 0.
₋ In Graph 2 (Normal Q-Q), the standardized residuals are plotted against normal
quantiles to assess the normal distribution assumption:
+ The standardized residuals align closely with a straight line, suggesting that the
assumption of normal distribution is satisfied.
₋ Graph 3 (Scale-Location) depicts the square root of the absolute standardized
residuals, examining the constant variance assumption:
+ While some outliers are present, the points primarily cluster around the red line,
indicating an acceptable degree of variance stability.
₋ In Graph 4 (Residuals vs Leverage), any high-influence points within the dataset are
identified:
+ Points 23, 25, and 5 exhibit relatively high influence. However, since they lie
inside the Cook's distance contour lines, these points are not deemed highly
influential and do not need to be excluded from the analysis.
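The visual checks above can be complemented with a few numerical ones. The sketch below uses simulated data with hypothetical values, and the threshold of 1 for Cook's distance is a common rule of thumb assumed here for illustration, not a criterion taken from the report.

```r
# Simulated data (hypothetical) that satisfies the model assumptions
set.seed(7)
n <- 50
layer_height <- runif(n, 0.02, 0.2)
roughness <- 20 + 1000 * layer_height + rnorm(n, sd = 20)
fit <- lm(roughness ~ layer_height)

res <- residuals(fit)
mean_residual <- mean(res)            # zero up to rounding when an intercept is fitted
normality <- shapiro.test(res)        # H0: the residuals are normally distributed
cooks <- cooks.distance(fit)          # influence of each observation
influential <- which(cooks > 1)       # assumed rule-of-thumb threshold

# The four diagnostic graphs in one 2 x 2 panel;
# pdf(NULL) discards the graphics when run non-interactively
pdf(NULL)
par(mfrow = c(2, 2))
plot(fit)
invisible(dev.off())

print(normality$p.value)
print(influential)
```

A Shapiro-Wilk p-value above α means the normality assumption is not rejected.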
5. DISCUSSION AND EXTENSION
6. DATA AND CODE AVAILABILITY
Data link: Dataset
Code Link: Codelink
7. REFERENCES