Daily Gold Price Prediction Model
R.W.U.S Rajapakse
University of Colombo School of Computing
2024
Declaration
Project/Thesis title: A Model to Predict Daily Gold Price Using Machine Learning-Based
Predictive Analysis Approach
1. The project/thesis is my original work and has not been submitted previously for a
degree at this or any other University/Institute. To the best of my knowledge, it does
not contain any material published or written by another person, except as
acknowledged in the text.
2. I understand what plagiarism is, the various types of plagiarism, how to avoid it, what
my resources are, who can help me if I am unsure about a research or plagiarism
issue, as well as what the consequences are at University of Colombo School of
Computing (UCSC) for plagiarism.
3. I understand that ignorance is not an excuse for plagiarism and that I am responsible
for clarifying, asking questions and utilizing all available resources in order to
educate myself and prevent myself from plagiarizing.
4. I am also aware of the dangers of using online plagiarism checkers and sites that offer
essays for sale. I understand that if I use these resources, I am solely responsible for
the consequences of my actions.
5. I assure that any work I submit with my name on it will reflect my own ideas and
effort. I will properly cite all material that is not my own.
6. I understand that there is no acceptable excuse for committing plagiarism and that
doing so is a violation of the Student Code of Conduct.
Signature of the Student Date
(DD/MM/YYYY)
24/09/2024
Certified by Supervisor(s)
This is to certify that this project/thesis is based on the work of the above-mentioned student
under my/our supervision. The thesis has been prepared according to the format stipulated and
is of an acceptable standard.
Signature
Date 24/09/2024
I would like to dedicate this thesis to my family.
Acknowledgment
I would like to express my sincere gratitude to Mr. Viraj Welgama, my internal supervisor,
for his invaluable guidance, assistance, and unwavering support throughout the entire research
process. His extensive knowledge and wealth of experience served as a constant source of
motivation.
I extend my heartfelt thanks to all the senior lecturers, lecturers, and instructors at UCSC for
their guidance and the knowledge they generously shared with me.
I am particularly grateful to my family and friends for their constant love, support, and
encouragement throughout the duration of this research.
Special thanks and appreciation are also due to my colleagues and individuals who willingly
contributed their expertise to help me articulate my thoughts effectively.
Abstract
This thesis presents a comprehensive analysis of daily gold price prediction in the Sri Lankan
market, utilizing a range of exogenous variables to enhance forecasting accuracy. The study
explores the impact of key economic indicators, including Brent crude oil prices, USD to LKR
exchange rates, CNY to LKR exchange rates, silver prices, the S&P Sri Lanka 20 index, the Colombo Consumer
Price Index (CCPI), and gold reserves on the volatility and trends of gold prices in Sri Lanka.
Two powerful machine learning algorithms, XGBoost and Random Forest, were employed to
predict daily gold prices. The study investigates the predictive performance of these algorithms
and compares their effectiveness in capturing the intricate dynamics of the Sri Lankan gold
market. The models were trained and tested on a dataset spanning from January 2014 to
September 2022 to ensure robustness and reliability in the results.
The findings reveal that XGBoost outperformed Random Forest in terms of predictive accuracy
and model performance. The superiority of XGBoost suggests its efficacy in handling complex
relationships and nonlinear patterns within the dataset, thereby providing more accurate and
reliable predictions of daily gold prices in the Sri Lankan market.
Furthermore, the inclusion of diverse exogenous variables allows for a more holistic
understanding of the factors influencing gold prices. The study contributes valuable insights to
the financial community, policymakers, and investors.
Table of Contents
Acknowledgment
Abstract
1.2 Motivation
Appendices
REFERENCES
List of Figures
List of Tables
Table 2: Significant Lags of each Variable Selected for the Model
Table 4: Optimal Hyperparameter for XGBoost with all the Predictor Variables
Table 5: Table of Accuracy Measures for XGBoost with all the Predictor Variables
Table 6: Optimal Hyperparameters for Random Forest Regressor with all the Predictor Variables
Table 7: Table of Accuracy Measures for Random Forest Regressor with all the Predictor Variables
Table 8: Optimal Hyperparameters for XGBoost with Selected Predictor Variables
Table 9: Table of Accuracy Measures for XGBoost with the Selected Predictor Variables
Table 10: Optimal Hyperparameters for Random Forest Regressor with the Selected Predictor Variables
Table 11: Table of Accuracy Measures for Random Forest Regressor with the Selected Predictor Variables
Table 13: Important Lags to Predict Sri Lankan Gold Price Using XGBoost
List of Abbreviations
Chapter 1
Introduction
Over time, gold's value has fluctuated due to various reasons including economic conditions,
geopolitical events, supply and demand dynamics, and changes in investor sentiment [2].
Traditionally, gold prices were influenced by the gold standard, which pegged currency
values to specific amounts of gold. However, with the abandonment of the gold standard in
the early 20th century, gold prices have become subject to market forces and speculative
trading [3].
As financial markets became more complex and interconnected, the need for accurate and
timely gold price predictions intensified. Investors, traders, financial institutions, and
governments seek to understand and anticipate gold price movements to make informed
decisions about asset allocation, risk management, and hedging strategies.
In the 21st century, the emergence of machine learning (ML) as a financial forecasting tool
has opened new possibilities for understanding and predicting gold price movements.
Advanced ML algorithms can analyze vast amounts of data, including historical price
patterns, economic indicators, and sentiment analysis, to offer more accurate and timely
predictions. This is particularly valuable for investors, who can use these insights to make
informed decisions about when to buy or sell gold, or whether to include it in their portfolio
for diversification and risk management. ML algorithms, such as decision trees, random
forests, support vector machines, and neural networks, have shown promising results in
predicting asset prices. A gold price prediction system using ML could significantly benefit
market participants by providing insights into potential price trends and risks associated with
gold investments [4].
1.2 Motivation
I. Predicting gold prices accurately can have practical implications in various industries,
including finance, investment, jewelry, and mining.
II. Gold price data is usually readily available and accessible, making it easier to conduct
research and build predictive models.
III. Increase employability with experience and exposure to machine learning projects.
IV. Interest in learning different machine learning techniques
V. Interest in the “Gold” field.
1.3.2 Objectives
I. Identifying factors affecting the gold prices
II. Identifying patterns in the gold price movements
III. Identifying outliers in the gold prices
IV. Test and compare XGBoost and Random Forest Regressor algorithms with a set of
variables and identify the best-suited ML model and the variables to predict gold prices.
V. Evaluate the output of XGBoost and Random Forest Regressor using different measures
such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute
Error (MAE), Mean Absolute Percentage Error (MAPE) and coefficient of determination
(R2).
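The accuracy measures named in objective V can be written out as a short sketch. This is an illustration in plain Python rather than the code used in the study; the function name `regression_metrics` and the sample values are assumptions made for the example.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE, MAPE (in percent) and R2 for two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when an actual value is zero; prices are assumed strictly positive.
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    # R2 is undefined when all actual values are identical (ss_tot would be zero).
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "MAPE": mape, "R2": r2}


# Illustrative values only, not taken from the thesis dataset.
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [102.0, 108.0, 123.0, 129.0]
metrics = regression_metrics(actual, predicted)
```

MSE and RMSE penalize large errors more heavily than MAE does, while MAPE expresses the error relative to the price level, which makes it easier to compare across periods with very different prices.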
1.4 Scope
This project primarily focuses on developing a model to predict the daily gold price in Sri
Lanka. Given the vastness of this domain and the time limitation of one year, this project is
confined to the Sri Lankan setting. Predictions are made at a daily frequency, based on
gold prices from the first week of January 2014 to the fourth week of September 2022. Part of
the data will be used to train the model.
The price prediction takes into consideration variables that are correlated with the gold
price. These include, but are not limited to, the silver price, CCPI, USD to LKR
exchange rate, Brent crude oil price, and S&P Sri Lanka 20 index.
This project does not explore traditional forecasting methods such as Box-Jenkins
and exponential smoothing. Instead, it uses the ML models XGBoost and Random
Forest Regressor.
As for the data visualization, this aspect falls outside of the scope of this project.
Moreover, the majority of existing gold price forecasting models operate on a monthly or
quarterly basis, whereas this project stands out by predicting the daily gold price which is
more useful for the buyers and sellers than the quarterly or monthly values.
This project incorporates central bank gold reserves as an additional variable to the initial list,
aiming to analyze potential impacts associated with this inclusion.
Chapter 1 – Introduces the background, and problem definition with the novelty of the study,
scope, motivation, aims, and objectives of the research.
Chapter 2 – Reviews past studies on forecasting gold prices using ML models in order to
identify the factors that affect the gold price and the ML models that predict the gold
price well.
Chapter 4 – Describes the sources of data, the method of data collection, the variables, and
the formation of the data sets.
Chapter 6 – Discussion and conclusion about the results and suggestions for future work.
Chapter 2
Literature Review
2.1 Background
During the last few years, due to the emergence of machine learning techniques for
predictions in different fields, the prediction of gold prices also has gained popularity in the
industrial and scientific community.
Multiple research journals were analyzed prior to the start of the project to understand what
types of methods were used to predict gold prices in different countries and what factors
affect the gold price. This section only focuses on ML methods.
Restricted Boltzmann machines (RBM) were used for pre-training in Zhang and Bicong's
deep belief network (DBN) model, which was then fine-tuned by a supervised back-
propagation (BP) layer. They constructed a multi-layered neural network with the ideal 5-10-1
network topology. As input variables, they employed the Federal funds rate, the Dow Jones
Index, the crude oil price, the U.S. CPI, and the effective exchange rate. They found that the
price of gold is positively correlated with the U.S. Consumer Price Index (CPI), the price of
crude oil, and the Dow Jones Index. The gold price was negatively correlated with the Federal
funds rate and the effective exchange rate. This study was carried out with the same objective
as the previously described one, which was to forecast the monthly volatility of gold prices.
They compared the DBN model's results with those from a hybrid model of GA-BPNN, a
standard NN model of BP, and ARIMA in order to evaluate the accuracy of the model.
According to the comparison, the suggested DBN model outperforms the other models,
having the lowest MAE, RMSE, and MAPE [6].
Liu and Li used random forest to create a machine-learning model. As initial input variables,
they used the US dollar index (USDX), crude oil price (COP), Dow Jones Industrial Average
(DJIA), US CPI (USCPI), US ten-year bond futures prices (US10BFP), Hang Seng Index
(HSI), and Standard & Poor's 500 Index (S&P500) and measured the performance of the
model. They selected the Standard & Poor's 500 Index (S&P500) and the Dow Jones
Industrial Average (DJIA) as the final model input variables using a stepwise backward
variable selection procedure. The Standard & Poor's 500 Index (S&P500) and the Dow Jones
Industrial Average (DJIA) displayed prediction performances of 0.9962 and 0.9999,
respectively. This model was also built to predict the monthly gold price
[7].
Makala and Z Li compared ARIMA and SVM models for gold price prediction. They compared
SVM (Linear), SVM (Poly), SVM (RBF), and ARIMA (2, 1, 2) (2, 1, 2, 12). The outcomes
demonstrated that SVM (Poly) outperforms both the ARIMA model and the other SVM models.
SVM resulted in an RMSE of 0.028 and a MAPE of 2.5, whereas ARIMA resulted in 36.18 and
2897, respectively. They used gold price data for this study
[8].
In order to predict gold prices, Sami and Khurum contrasted artificial neural network (ANN)
models with linear regression. Their main goal was to forecast the price of gold by utilizing
the most extensive list of attributes. The input variables they used were the S&P 500 Index,
US Bond Rates, Silver Spot Rates, Platinum Spot Rates, Palladium Spot Rates, Rhodium Spot
Rates, Oil Spot Prices, NYSE Index, EGO Index, SLW Index, AU Index, ABX, BVN, China
Interest Rate, USA Interest Rate, UK Interest Rate, and Russia Interest Rate. The unique
aspect of this study is that, among these eighteen factors, some have never been used before,
such as performance metrics for China, India, and Russia, which are the top three buyers of
gold. Correlation analysis was used to determine which factors had the greatest impact on the
price of gold. They found that the stock price of Silver Wheaton
Corporation (SLW), the largest precious metals streaming company in the world, has the
highest correlation with gold rates, not the performance of the US economy (or any other
major economy), nor the price of other precious metals. Although ANN outperformed linear
regression, the difference was not statistically significant [9].
Mustafa Yurtsever employed various indicators including the effective federal funds rate,
consumer price index, S&P 500 stock market index, crude oil price, and effective exchange
rate to forecast monthly gold prices. His investigation indicated that all the mentioned factors
exert an influence on gold prices. The predictions were based on monthly data, utilizing three
different models: LSTM, Bi-LSTM, and GRU, for time series forecasting of gold prices.
Model performance was assessed using metrics such as Mean Absolute Error (MAE), Root
Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE). The outcomes
demonstrated that LSTM outperformed the other models, giving the most favorable results,
with a MAPE of 3.48, an RMSE of 61.728, and an MAE of 48.85 [10].
In their research on forecasting gold prices, Manjula and Karthikeyan incorporated variables
such as stock market performance, crude oil prices, rupee-dollar exchange rates, inflation, and
interest rates. They employed three machine-learning algorithms—gradient-boosting
regression, random forest regression, and linear regression. Their analysis, based on monthly
gold price data spanning from January 2000 to December 2018, revealed that all three models
consistently achieved R2 values exceeding 95%, indicating a notably high level of accuracy in
predicting future monthly gold prices [11].
Livieris and Pintelas employed sophisticated deep learning methods to forecast both the price
and movement of gold. Their model consisted of two main components. The first component
comprised convolutional and pooling layers, which perform mathematical operations to extract
features from the input data. The second component integrated LSTM and dense layers,
leveraging the generated features to make predictions. This model demonstrated the capability
to forecast not only the next day's gold price but also the direction of its movement (increase
or decrease). The research utilized daily gold price data in USD spanning from January 2014
to April 2018. The models exhibited an accuracy level of 55% [12].
Sandeep Patalay proposed a system based on the M5P model tree machine learning algorithm
to predict gold prices. He used S&P 500 and crude oil prices as input variables. The results
showed an accuracy of 85% [13].
Zakaria et al. proposed a novel model to forecast gold prices using a recent meta-heuristic
method named whale optimization algorithm (WOA) as a trainer for the multilayer
perceptron neural network (NN). They used the Indian rupee (INR), South African rand (ZAR),
and Chinese yuan (RMB) exchange rates, inflation rates of China and the US, crude oil prices,
copper prices, silver prices, and iron prices as input variables to the model. They used a
correlation matrix to identify the relationship between gold price and input variables. The
output of the correlation matrix showed that copper price, iron price, silver price, oil price,
South African exchange rate, and Indian exchange rate have a positive linear relationship with
the gold price. Among them, the silver price has the strongest positive relationship. China's
inflation, China's exchange rate, and the US inflation rate had weak linear relationships with
gold prices. They compared the results of their proposed model (WOA–NN) against five other
models, namely classic NN, particle swarm optimization for NN (PSO–NN), genetic
algorithm for NN (GA–NN), and grey wolf optimization for NN (GWO–NN) and ARIMA
models. They used the RMSE, MSE, STD, and R2 to evaluate each model. The results
showed that the WOA–NN model provides the highest out-of-sample R2 (0.9989), coupled with
the lowest MSE, RMSE, and STD of 0.00047, 0.02131, and 0.00340, respectively. Overall,
each model’s R2 value was above 95% [14].
Gabriel, Carrasco, Christian, et al. proposed an ML model based on Support Vector Regression
to predict the copper price. Their predictions covered horizons of 5, 10, 15, 20 and 30 days
from the copper closing price at the London Metal Exchange. An error of 2.2% in
predicting copper prices for the 5- and 10-day horizons indicated the good performance of the
model [16].
Christian and Marian conducted a study to predict the returns of four precious metals (gold,
silver, platinum and palladium) using Random Forest. They evaluated the study in two
criteria, univariate and multivariate forecast evaluation criteria. They utilized weekly returns
from various financial indicators including the S&P-500 index, the GSCI commodity-price
index, the Chicago Board Options Exchange Volatility Index (VIX), the term spread (the
difference between the 10 year treasury constant maturity rate and the 3 month treasury
constant maturity rate), the corporate bond (CB) spread (the difference between Moody’s Baa
rate and Moody’s Aaa rate), as well as weekly returns of the dollar against the yen, euro,
British pound, Canadian dollar, Australian dollar, and an aggregate trade-weighted exchange-
rate index (TWEXB) as predictor variables. Their findings concluded that multivariate
forecasts using Random Forest on precious metals are much more accurate than univariate forecasts
[17].
• XGBoost
• DBN
• Random Forest
• SVM
• ANN
• LSTM
The most common variables found in the literature that affect gold prices are listed below.
This project uses XGBoost and Random Forest Regressor for forecasting daily gold price in
Sri Lanka as the literature shows that XGBoost and Random Forests show higher
performance.
In this study, the below variables are selected as predictor variables based on the insights from
the existing literature.
Chapter 3
Methodology and Solution Design
This section focuses on explaining the systematic methods followed to meet the objectives
mentioned in the previous section. The first part explains the steps followed in the process of
building the ML model and other analyses. The second part elaborates on the theory and
methodology behind time series, machine learning techniques, and regression tasks.
The data were predominantly available in a daily format, with the exception of the CCPI and
gold reserves held by the central bank. The CCPI is provided on a monthly basis, and for
analytical purposes, it was presumed that the CCPI value remained constant throughout the
entire month. Similarly, as gold reserves tend to change infrequently, it was assumed that the
quarterly figure was representative of the entire quarter.
The CCPI serves as a valuable indicator, presenting the monthly average change in the prices
of goods. In order to maintain consistency and ensure the accuracy of our machine learning
model, it is essential to have all variables in daily format when feeding data. Therefore, for the
monthly figure of CCPI, we assume it to remain constant throughout the entire month to align
with our daily data requirements.
The new addition to the variables is the gold reserves in the central bank, which are available
in quarterly figures. Given that gold reserves typically don't change on a daily basis and the
project requires daily data, we assume that the quarterly figure remains constant throughout all
the days of that quarter.
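The constant-within-period assumption described above can be sketched with pandas. This is a minimal illustration, not the code used in the study: the column names (`date`, `ccpi`, `gold_reserves`) and all numeric values are hypothetical.

```python
import pandas as pd

# Hypothetical daily calendar that crosses a month end and a quarter end.
daily = pd.DataFrame(
    {"date": pd.to_datetime(["2014-03-28", "2014-03-31", "2014-04-01", "2014-04-02"])}
)

# Hypothetical low-frequency series: one CCPI value per month, one reserve figure per quarter.
ccpi_monthly = pd.DataFrame({"month": ["2014-03", "2014-04"], "ccpi": [105.0, 105.6]})
reserves_quarterly = pd.DataFrame(
    {"quarter": ["2014Q1", "2014Q2"], "gold_reserves": [894.28, 890.0]}
)

# Tag each day with the month and quarter it belongs to.
daily["month"] = daily["date"].dt.strftime("%Y-%m")
daily["quarter"] = (
    daily["date"].dt.year.astype(str) + "Q" + daily["date"].dt.quarter.astype(str)
)

# Copy the period value onto every day of that period, then drop the helper keys.
daily = (
    daily.merge(ccpi_monthly, on="month", how="left")
    .merge(reserves_quarterly, on="quarter", how="left")
    .drop(columns=["month", "quarter"])
)
```

Joining on an explicit period key keeps the assumption visible: every day inherits exactly the value published for its own month or quarter, and a missing period shows up as a missing value instead of being silently filled.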
Firstly, the data in the separate CSV files were collated into one master CSV file based on the
date. There were fifteen missing values in the S&P 20 column, one missing value in the silver
price column, and four missing values in the Brent crude oil price column. The missing values
were filled using the forward fill method [18],[19].
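The collation and forward-fill steps can be sketched as follows. The sketch assumes pandas and uses small in-memory frames in place of the separate CSV files; the column names are hypothetical and the few values are only for illustration.

```python
import pandas as pd

# Stand-ins for two of the separate source files (in practice these would be read from CSV).
gold = pd.DataFrame(
    {
        "date": pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-03"]),
        "gold_price": [157671.43, 158636.78, 160947.64],
    }
)
brent = pd.DataFrame(
    {
        "date": pd.to_datetime(["2014-01-01", "2014-01-03"]),  # 2014-01-02 is absent here
        "brent_usd": [110.92, 106.89],
    }
)

# Collate on the date column, keeping every date present in the gold price series.
master = gold.merge(brent, on="date", how="left").sort_values("date")
missing_before = int(master["brent_usd"].isna().sum())

# Forward fill: carry the last observed value forward into the gap.
master["brent_usd"] = master["brent_usd"].ffill()
```

Forward fill only uses values observed on or before the date being filled, so it does not leak future information into the series, which matters for a forecasting task.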
Table 1: Sample of the Dataset After Preprocessing
Date        Gold Price  USD LKR        CNY LKR        Brent Crude oil  Silver Price  CCPI    S&P 20   Gold Reserves
                        Exchange rate  Exchange rate  price USD
1/1/2014    157671.43   130.75         21.60          110.92           2548.54       104.20  3263.87  894.28
1/2/2014    158636.78   130.73         21.61          107.78           2614.96       104.20  3285.39  894.28
1/3/2014    160947.64   130.70         21.60          106.89           2634.54       104.20  3294.82  894.28
1/6/2014    162422.89   130.75         21.60          106.73           2636.19       104.20  3281.24  894.28
1/7/2014    162412.06   130.70         21.60          107.35           2596.19       104.20  3300.45  894.28
1/8/2014    160678.07   130.85         21.62          107.15           2555.57       104.20  3350.31  894.28
1/9/2014    160417.09   130.70         21.59          106.39           2555.04       104.20  3362.39  894.28
1/10/2014   160949.31   130.70         21.60          107.25           2633.63       104.20  3384.51  894.28
1/13/2014   163626.96   130.70         21.63          106.75           2667.79       104.20  3388.07  894.28
1/16/2014   162141.33   130.68         21.58          107.09           2627.90       104.20  3427.05  894.28
1/17/2014   162518.52   130.70         21.60          106.48           2653.01       104.20  3407.94  894.28
1/20/2014   164318.56   130.74         21.60          106.35           2657.30       104.20  3425.71  894.28
1/21/2014   163813.21   130.80         21.62          106.73           2599.19       104.20  3428.16  894.28
1/22/2014   162489.97   130.80         21.62          108.27           2591.40       104.20  3458.26  894.28
3.3 Feature Engineering
Checking for stationarity before feature engineering is important because the XGBoost
algorithm is not constructed to identify the quirks and patterns in time series data [20]. The time
series graphs of each variable were plotted to identify the non-stationary series. The time series
graphs for gold price, USD LKR exchange rate and CNY LKR exchange rate are shown below.
Figure 2: USD LKR Exchange Rate Time Series Graph
It is evident from the above graphs that none of the series are stationary. Further, the Augmented
Dickey-Fuller test was carried out to validate the findings from the time series graphs. Below
are the test statistics for gold price and USD LKR exchange rate respectively.
Given that all variables exhibited non-stationary behavior, first-differenced variables were
created for each variable in the analysis. This adjustment was implemented to make the data
stationary and to provide a more robust analytical framework. The Augmented Dickey-Fuller
test showed that the series were stationary after first differencing.
Subsequently, lag variables were introduced for all differenced data. Prior to the creation of
these lag variables, Cross-Correlation Function (CCF) plots were generated between the
dependent and independent variables, aiming to pinpoint the significant lags. Additionally,
Partial Autocorrelation Function (PACF) plots were employed to identify the significant lags
associated with the dependent variable. This approach was adopted to ensure
the identification and incorporation of relevant time lags in the subsequent analysis. The plots
were created for the pre-whitened data. The figures below show the CCF plots and the PACF
plot.
Figure 7: S&P 20 Index CCF Plot
Figure 11: Gold Price PACF Plot
Based on the above CCF plots and PACF plot, the below lags of each variable were selected.
Table 2: Significant Lags of each Variable Selected for the Model
Variable                  Lags
CCPI                      1, 3, 4
CNY LKR Exchange rate     1, 2, 7, 14
Crude Oil Price           1, 2, 3
S&P 20                    1, 2, 7, 21
Silver Price              1
USD LKR Exchange rate     1, 2, 7, 14
Gold Price                1, 2
As the CCF plot for gold reserves revealed the absence of significant lags, gold reserves were
excluded from further analysis.
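The idea behind reading significant lags off a CCF plot can be sketched numerically. The sketch below uses NumPy on synthetic data and makes several simplifying assumptions: it correlates the target with lagged copies of a predictor on the overlapping segment, it uses the usual approximate bound of 1.96 divided by the square root of the sample size, and it does not perform the pre-whitening step used in the study. The function names are hypothetical.

```python
import numpy as np


def cross_correlation(y, x, max_lag):
    """Sample correlation between y[t] and x[t - k] for k = 1..max_lag."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    result = {}
    for k in range(1, max_lag + 1):
        # Align y at time t with x at time t - k on the overlapping segment.
        result[k] = float(np.corrcoef(y[k:], x[:-k])[0, 1])
    return result


def significant_lags(ccf, n_obs):
    """Lags whose correlation lies outside the approximate 95% bound."""
    bound = 1.96 / np.sqrt(n_obs)
    return [k for k, r in ccf.items() if abs(r) > bound]


# Synthetic example: y depends on x two steps earlier, plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
noise = rng.normal(scale=0.5, size=500)
y = np.empty(500)
y[:2] = noise[:2]
y[2:] = 0.8 * x[:-2] + noise[2:]

ccf = cross_correlation(y, x, max_lag=5)
lags = significant_lags(ccf, n_obs=len(y))
```

A predictor whose correlations all stay inside the bound, as reported for gold reserves, offers no lag worth adding as a feature.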
Following the identification of significant lags in both the dependent and independent
variables, the subsequent analysis and model training involved utilizing the differenced data
along with the identified lagged variables.
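A minimal pandas sketch of building the model inputs from differenced data and selected lags is given below. The helper name `make_lagged_features`, the column names, and the toy values are assumptions for illustration; only the lag choices for gold price (1, 2) and silver price (1) follow the selection reported above.

```python
import pandas as pd


def make_lagged_features(df, target, lags):
    """First-difference every column, then add the requested lags of each differenced series."""
    diffed = df.diff()
    features = {}
    for column, column_lags in lags.items():
        for k in column_lags:
            # shift(k) moves values k rows down, so row t holds the value from t - k.
            features[f"{column}_diff_lag{k}"] = diffed[column].shift(k)
    out = pd.DataFrame(features, index=df.index)
    out[f"{target}_diff"] = diffed[target]
    # Rows lost to differencing and lagging contain NaN and are dropped.
    return out.dropna()


# Toy series for illustration only.
df = pd.DataFrame(
    {
        "gold_price": [100.0, 102.0, 101.0, 105.0, 107.0, 106.0],
        "silver_price": [10.0, 10.5, 10.4, 10.9, 11.0, 11.2],
    }
)
lags = {"gold_price": [1, 2], "silver_price": [1]}
features = make_lagged_features(df, target="gold_price", lags=lags)
```

Because the model is trained on differences, a predicted change has to be added back to the last observed price to obtain a forecast on the original price scale.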
correlation matrix.
The correlation matrix suggested multicollinearity between the USD LKR exchange rate and the
CNY LKR exchange rate. Hence, the XGBoost regularization parameters alpha (L1
regularization) and lambda (L2 regularization) were tuned to reduce the effect of the
multicollinearity, prevent overfitting, and improve the generalization of the model.
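One way to screen a correlation matrix for collinear predictors is sketched below with pandas and NumPy. The data are synthetic, the column names are hypothetical, and the 0.9 threshold is an assumption for the example, not a value taken from the study.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """Return (column_a, column_b, r) for every pair with |r| at or above the threshold."""
    corr = df.corr()
    columns = list(corr.columns)
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((columns[i], columns[j], r))
    return pairs


# Synthetic predictors: two exchange-rate series that move together, one unrelated series.
rng = np.random.default_rng(1)
usd = rng.normal(size=300)
data = pd.DataFrame(
    {
        "usd_lkr_diff": usd,
        "cny_lkr_diff": 0.95 * usd + rng.normal(scale=0.1, size=300),
        "silver_diff": rng.normal(size=300),
    }
)
pairs = highly_correlated_pairs(data)
```

Tree ensembles can still predict well with collinear inputs, but importance tends to be split between the correlated features, which is worth keeping in mind when interpreting variable importance.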
At its essence, ML revolves around the creation and deployment of algorithms to support these
decision-making processes and predictions. These algorithms are engineered to enhance their
performance over time, progressively gaining accuracy and effectiveness as they process more
data.
Supervised learning involves training the algorithm on a dataset with labels. The algorithm is
given input features alongside the matching output labels, learns to map the features to the
targets, and generalizes from this known data to make predictions on new, unseen data.
Regression:
Through regression, a type of supervised learning, an algorithm learns to predict continuous
values based on input data. The output labels in regression analysis are continuous, such as
stock and home prices. In ML, there are various regression methods, such as Random
Forest, Polynomial, Ridge, Decision Tree, Support Vector, and Linear regression, among
others.
Classification:
Unsupervised Learning:
In unsupervised learning, an algorithm picks up the ability to spot patterns in data without
needing to be explicitly trained on labeled instances. Finding the data's underlying structure or
distribution is the aim of unsupervised learning.
Reinforcement Learning:
By taking actions and being rewarded or penalized for them, an agent learns how to interact
with its environment through a type of ML called reinforcement learning. In reinforcement
learning, the aim is to build a strategy that maps states to actions in a way that maximizes the
total expected reward over time.
The research study outlined in this thesis addresses regression. Its primary objective is to
identify the complex relationship between the daily gold price and various independent
variables, ultimately enabling the prediction of future gold prices.
Based on the literature review in Chapter 2, XGBoost and Random Forest were chosen for
this study.
3.6.2 XGBoost
XGBoost originated as a research project within the Distributed (Deep) Machine Learning
Community (DMLC) group, led by Tianqi Chen. XGBoost is a highly optimized distributed
gradient boosting toolkit that enables the efficient and scalable training of ML models. By
combining the predictions of several weak models, this ensemble learning technique generates
more robust predictions. Because of its capacity to manage large datasets and its ability to
achieve state-of-the-art performance in numerous ML tasks, including regression and
classification, XGBoost, which stands for "Extreme Gradient Boosting," has emerged as one of
the most popular and widely applied machine learning algorithms.
In this approach, decision trees are generated sequentially. Each new tree is fitted to the
errors left by the trees built before it, so the observations that earlier trees predicted
poorly have the greatest influence on the trees that follow [22].
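The sequential idea can be illustrated with a small NumPy sketch of gradient boosting for squared error, in which each new one-split tree (a stump) is fitted to the residuals of the current ensemble. This is a simplified illustration and not XGBoost itself: it omits regularization, second-order gradient information, and multi-feature trees, and all function names are hypothetical.

```python
import numpy as np


def fit_stump(x, residual):
    """Best single-split regression stump on one feature (assumes at least two distinct x values)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # excluding the maximum keeps the right side non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def boost(x, y, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to what is still unexplained."""
    base = float(y.mean())
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_trees):
        residual = y - prediction  # negative gradient of the squared-error loss
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction


# Synthetic one-feature example.
x = np.linspace(0.0, 6.0, 60)
y = np.sin(x)
base, stumps, fitted = boost(x, y)
mse_start = float(np.mean((y - base) ** 2))
mse_end = float(np.mean((y - fitted) ** 2))
```

Each stump on its own is a weak learner, yet the training error of the combined model falls as more stumps are added; the learning rate controls how much each new stump is allowed to contribute.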
The following figure illustrates how a strong learning algorithm is produced by combining
weak learners.
Figure 14: How to Create a Powerful Learner in XGBoost
As the figure shows, Box 4 (the combination of weak learners) classifies the observations
correctly, whereas Box 1, Box 2, and Box 3 (the weak learners) each misclassify some of them.
XGBoost uses regularization to penalize complex models and prevent overfitting. One of its
special features is the ability to handle various data sparsity patterns. It also stores the
data in a block structure, which allows the data layout to be reused by successive iterations
without having to be computed afresh. XGBoost was designed to make
the best possible use of the available hardware.
Parameters
V. n_estimators – Number of trees to be built
VI. learning_rate – Step size (shrinkage) applied to each new tree's contribution during each
boosting iteration
VII. alpha – L1 regularization term in the objective function
VIII. lambda – L2 regularization term in the objective function
The Random Forest comprises numerous decision trees as its foundational learning models.
Through a process known as Bootstrap, we randomly select subsets of rows and features from
the dataset to create sample datasets for each model [23].
The maximum number of features available for splitting at each node is constrained to a
specified percentage of the total features (defined as a hyperparameter). This constraint is in
place to prevent the ensemble model from excessively relying on any single feature and
encourages a balanced utilization of all potentially predictive features.
Moreover, in the process of generating splits, each tree selects a random sample from the
original dataset, introducing an additional layer of randomness. This randomness serves as a
preventive measure against overfitting, contributing to the robustness of the ensemble model
[24].
Figure 15: Mechanism of Random Forest Regressor
The below parameters of the Random Forest were tuned in this study.
Parameters
I. max_depth – The maximum length of the path from the root node to a leaf node.
II. min_samples_leaf – The minimum number of samples required at a leaf node.
III. min_samples_split – The minimum number of observations required in a node for it to be
split.
IV. n_estimators – The number of trees to be built.
3.6.4 GridSearchCV
In machine learning, GridSearchCV (Grid Search Cross Validation) is a method used to identify
the best set of hyperparameters for a given model. It methodically examines a predetermined
range of hyperparameter values, generating a "grid" of potential configurations. It then uses
cross validation to assess each combination and chooses the one that produces the best
results. GridSearchCV improves model performance, eliminates manual trial-and-error, and
automates the hyperparameter tuning process [25].
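The idea behind grid search can be sketched in plain Python as follows. This is not
scikit-learn's GridSearchCV and not the code used in this study. In place of a cross-validated
model score it uses a toy error function, and the grid values and the names grid_search and
toy_error are assumptions made for illustration.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return the best one.

    score_fn maps a parameter dict to a validation error (lower is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid and toy error surface, for illustration only.
param_grid = {"max_depth": [2, 3, 4], "learning_rate": [0.05, 0.1, 0.2]}


def toy_error(params):
    # Stand-in for "train the model and return its cross-validated error".
    return (params["max_depth"] - 3) ** 2 + (params["learning_rate"] - 0.1) ** 2


best_params, best_score = grid_search(param_grid, toy_error)
```

The grid above has 3 x 3 = 9 combinations. The number of model fits grows multiplicatively with
every added hyperparameter, which is why the candidate ranges are usually kept small.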
Hyperparameters:
descent algorithm, the depth of a decision tree, or the number of neighbors in a k-nearest
neighbors classifier.
Cross-validation in time series analysis is used to identify the most effective set of
hyperparameter values that reduce the risk of overfitting in the time series model. In contrast
to traditional cross-validation techniques, time series cross-validation is capable of handling
the temporal structure of the dataset. This method involves dividing the dataset into n
consecutive folds, such that each fold covers a time period of similar length. The model is
then trained on earlier folds and evaluated on subsequent folds, simulating the real-world
scenario where the model must make predictions on unseen future data. Since the data is not
randomly shuffled, this technique prevents the inclusion of data from the future in the training
set. This mitigates data leakage and provides a better assessment of the model's performance
over time. Time series cross-validation helps in identifying hyperparameter configurations
that result in optimal model performance across different time periods.
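The fold construction described above can be sketched as an expanding-window splitter. This is
a minimal illustration, assuming equally sized folds with any remainder assigned to the last
test fold. It is not the splitter used in this study, and the function name is hypothetical.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs in time order.

    The data is cut into n_splits + 1 consecutive blocks. Split k trains
    on the first k blocks and tests on block k + 1, so the training set
    never contains observations that come after the test set.
    """
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = k * fold_size
        # The last test fold absorbs any leftover observations.
        test_end = n_samples if k == n_splits else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))


splits = list(expanding_window_splits(n_samples=12, n_splits=5))
first_train, first_test = splits[0]
```

In every split, the largest training index is smaller than the smallest test index, which is
the property that prevents future data from leaking into the training set.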
Expanding window and rolling origin are the two most common time series cross-validation
techniques. These methods involve iteratively moving the training and validation windows
through the dataset, allowing the model to adapt to evolving patterns and trends in the time
I. Simplifying models to make them more understandable for researchers and users.
II. Reducing training times.
III. Avoiding the challenges associated with high-dimensional data
IV. Enhancing the model's generalization by mitigating overfitting, which formally reduces
variance.
Initially, both the XGBoost and Random Forest Regressor models were trained using
differenced data containing lag variables, and their performances were evaluated.
Subsequently, the most significant features for each model were identified using the feature
importance functionality in both XGBoost and Random Forest Regressor. The models were
then retrained using only the selected important features, and their performance was evaluated
once again.
The following error measures were used, based on the literature, to compare the models
implemented for forecasting the daily gold price.
The average squared difference between the actual and the predicted value.
\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
$y_i$ = Actual value
$\hat{y}_i$ = Predicted value
$n$ = Number of observations
\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]
The average of the absolute value of the difference between the actual and the predicted
value.
\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \]
The percentage of variability within the values that can be explained by the regression model.
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \]
Mean Absolute Percentage Error (MAPE)
\[ \mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{y_i} \]
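The five measures can be computed directly from their definitions. The following is a minimal
sketch using only the Python standard library; the function name and the sample numbers are
made up for illustration and are not results from this study. MAPE is returned as a fraction,
so multiplying by 100 gives the percentage form.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE, R2 and MAPE for two equal-length sequences."""
    n = len(actual)
    errors = [y - y_hat for y, y_hat in zip(actual, predicted)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                  # residual sum of squares
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)  # total sum of squares
    # MAPE divides by the actual value, so actual values must be non-zero.
    mape = sum(abs(e) / y for e, y in zip(errors, actual)) / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": mae,
        "R2": 1 - ss_res / ss_tot,
        "MAPE": mape,
    }


actual = [100.0, 200.0, 300.0, 400.0]      # made-up values
predicted = [110.0, 190.0, 310.0, 390.0]   # made-up values
metrics = regression_metrics(actual, predicted)
```

RMSE is expressed in the same unit as the target variable, which makes it easier to interpret
than MSE, while MAPE is unit-free and allows comparison across series of different scales.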
Chapter 4
Evaluation and Results
This study focuses on predicting the gold price in Sri Lanka using exogenous variables, namely
the USD LKR exchange rate, CNY LKR exchange rate, Brent crude oil price in USD, silver price in
USD, CCPI, and S&P 20. Predictions were made at a daily frequency, based on gold prices from
January 2014 to September 2022.
Based on the above data, the gold price and S&P 20 show significant variability, while the
CNY LKR exchange rate shows low variability. The substantial variability observed in
historical gold prices, as indicated by the high standard deviation of LKR 121,826.19,
suggests that predicting future gold prices using a machine learning model may be challenging.
Figure 18: Time Plot of Gold Price
Central banks increased their gold reserves in 2022, which led to a surge in prices that year.
These purchases reflected high levels of global inflation, geopolitical uncertainty driven by
Russia’s invasion of Ukraine, and financial market turbulence.
In 2022, central banks acquired 1,136 tons of gold valued at approximately $70 billion,
marking the largest acquisition in any year since 1967. The purchases made in the fourth
quarter alone, totaling 417 tons, nearly equaled those of the entire year of 2021, which
amounted to 450 tons.
The Central Bank of Sri Lanka pegged the USD LKR rate in the range of LKR 200 – 203 and
subsequently let it float gradually. This led to a depreciation of the rupee against the USD,
driven by market fundamentals. Post Covid, Sri Lanka was struggling with economic woes and saw
a decrease in exports, which affected the country’s reserves balance. The floating of the
rupee led to the LKR depreciating against all other major currencies.
Silver Price
Silver prices followed the same trajectory as gold prices due to high global inflation and
geopolitical uncertainty.
Crude oil prices increased in 2022, mainly driven by the geopolitical tensions between Russia
and Ukraine. Oil prices were further affected in 2022 by the production cuts by OPEC+.
CCPI
The Colombo Consumer Price Index has been rising gradually since October 2021 and saw a
significant increase due to the economic crisis in Sri Lanka. Sri Lanka abandoned the peg in
early 2022, after which the Rupee depreciated steeply against the dollar, from LKR 200 to
LKR 360. This increased headline inflation and core inflation in the country. During 2022,
headline inflation reached its highest level, 73.7%, in September.
S&P 20 Index
The S&P 20 saw one of its lowest points in history in 2020, as the country came to a
standstill due to the Covid-19 pandemic. The country’s regulator issued a directive to close
the stock market commencing 18th April, and markets subsequently reopened on 11th May 2020.
In 2021, markets were driven primarily by local investors who favored particular stocks, which
influenced market capitalization. Although Covid-19 slowed the economy, there was a disconnect
between the business environment on the ground and listed stocks. Certain companies (for
example, Lanka Tiles and Rocell) benefitted heavily from the import restrictions, as they
became the main suppliers of goods which were otherwise imported. The artificial disconnect
between small businesses and large listed companies continued throughout the year. From
October 2021 the Sri Lankan Rupee was pegged to the dollar at LKR 200 – 203, which also
deepened the economic crisis.
In 2022, markets started to decline amid the uncertainty of the political situation in the
country, which commenced with the Aragalaya. Protests against the GoSL, led by then President
Gotabaya Rajapaksa, commenced in March 2022 and were followed by resignations by April 2022,
creating uncertainty in the economy. As mentioned above, Sri Lanka had artificially pegged the
Rupee to the greenback for a few months and commenced floating the Rupee gradually from March
2022. The Aragalaya peaked with the resignation of President Gotabaya Rajapaksa in July 2022,
and Parliament elected Ranil Wickremesinghe as the president of the country.
Subsequently, the country went through a severe economic crisis, with the unavailability of
fuel and gas leading to a complete slowdown in the economy. The Government had to request a
bailout package from the IMF, for the 17th time in the country's history, and received the
first tranche of USD 330m in March 2023.
Figure 25: Plot of Correlations of Variables
4.3 Data Credibility
All the data gathered for this study is secondary data, and the sources for each variable are
indicated below.
This represents the gold prices in Sri Lanka per Troy ounce in LKR from January 2014 to
September 2022.
This is the value of one U.S. Dollar in terms of Sri Lankan Rupees.
This is the value of one Chinese Yuan in terms of Sri Lankan Rupees.
Two primary types of crude oil prices that are widely referenced in the global market are Brent
crude and West Texas Intermediate (WTI) crude. Named after the Brent oil field in the North
Sea, Brent crude is extracted from several oil fields in the North Sea, including those in the
United Kingdom and Norway. This variable represents the cost of a barrel of Brent crude oil on
the global market, measured in USD.
Silver Price
This represents the silver price per ounce in USD from January 2014 to September 2022.
This indicator serves as a gauge for inflation in Sri Lanka. The index has been compiled with
several base years, including 2002, 2006, and 2013. This data is sourced from the economic data
library of the Central Bank of Sri Lanka. The year 2013 was selected as the base year for
constructing the CCPI variable in the dataset.
S&P SL 20
S&P SL 20 Index is a stock market index that represents the performance of the top 20
companies listed on the Colombo Stock Exchange (CSE) in Sri Lanka. It is designed to reflect
the overall performance of the Sri Lankan stock market by including a diversified group of
leading companies from various sectors. In April 2020, the CSE was shut down, leading to a
halt in all operations and trading activities. Hence, the data set excludes the records for that
particular month.
Moving on to the second phase, a refined approach is adopted. Both models are subjected to
retraining, this time incorporating only the most important variables as identified in the earlier
analysis. Subsequently, a thorough analysis is conducted on the results derived from this second
phase, providing a comprehensive understanding of the impact and significance of the chosen
variables.
4.5.1 Phase 1
XGBoost
Parameter Tuning
To refine the model's performance, grid search was used in conjunction with a five-fold time
series cross-validation strategy. This method systematically explores and evaluates various
combinations of hyperparameter values, so that the model is fine-tuned for optimal performance
across diverse subsets of the data. Five-fold time series cross-validation was incorporated to
obtain a robust assessment of the model's generalization capabilities, enhancing its
reliability on unseen data.
The following table shows the best parameters obtained for the XGBoost model with all the
predictor variables.
Table 4: Optimal Hyperparameters for XGBoost with all the Predictor Variables
Parameter Value
n_estimators 875
learning_rate 0.15
alpha 1.5
reg_lambda 3
max_depth 4
sub_sample 1
gamma 0.5
colsample_bytree 1
Measures of Accuracy
Values obtained for the accuracy measures are given in the following table.
Table 5: Table of Accuracy Measures for XGBoost with all the Predictor Variables
Measure Value
R2 0.88
MSE 1,914,180,697.98
RMSE 43,751.35
MAPE 5.80%
MAE 28,908.73
Feature Selection
The feature importance mechanism in XGBoost was used to identify the most influential
variables for the model's predictions. The plot of important features is shown below.
The variables with an F score above 500 were chosen as the most influential features, and
XGBoost was retrained with these variables.
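The threshold rule above amounts to a simple filter over the importance scores. The following
sketch assumes the scores are available as a dictionary; the function name and the scores are
hypothetical and are not the values obtained in this study.

```python
def select_features(importance, threshold):
    """Return the features whose importance score exceeds the threshold."""
    return [name for name, score in importance.items() if score > threshold]


# Hypothetical F scores, for illustration only.
f_scores = {
    "Crude_Lag1": 900,
    "Gold_Lag1": 750,
    "Silver_Lag1": 620,
    "CCPI_Lag1": 140,
}
selected = select_features(f_scores, threshold=500)
```

The model is then retrained on the columns named in the selected list only.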
The following table shows the best parameters obtained for the Random Forest Regressor model
with all the predictor variables.
Table 6: Optimal Hyperparameters for Random Forest Regressor with all the Predictor Variables
Parameter Value
n_estimators 890
max_depth 3
min_samples_leaf 2
min_samples_split 2
Measures of Accuracy
Values obtained for the accuracy measures are given in the following table.
Table 7: Table of Accuracy Measures for Random Forest Regressor with all the Predictor Variables
Measure Value
R2 0.88
MSE 1,937,515,837.57
RMSE 44,017.22
MAPE 8.40%
MAE 37,642.06
Feature Selection
The feature importance mechanism in Random Forest was used to identify the most influential
variables for the model's predictions. The plot of important features is shown below.
The most significant variables (from Silver_Lag1 to S&P_Lag7) were chosen as the most
influential features for the Random Forest, and the model was retrained using the selected
variables.
4.5.2 Phase 2
XGBoost
Parameter Tuning
The following table shows the best parameters obtained for the XGBoost model with the
selected predictor variables.
Parameter Value
n_estimators 1,000
learning_rate 1
alpha 7
reg_lambda 6
max_depth 4
sub_sample 1
gamma 0.5
colsample_bytree 1
Measures of Accuracy
Values obtained for the accuracy measures are given in the following table.
Table 9: Table of Accuracy Measures for XGBoost with the Selected Predictor Variables
Measure Value
R2 0.89
MSE 1,696,944,054.71
RMSE 41,193.98
MAPE 5.38%
MAE 27,155.94
Important Features
The following table shows the best parameters obtained for the Random Forest model with the
selected predictor variables.
Table 10: Optimal Hyperparameters for Random Forest Regressor with the Selected Predictor Variables
Parameter Value
n_estimators 300
max_depth 3
min_samples_leaf 2
min_samples_split 4
Measures of Accuracy
Values obtained for the accuracy measures are given in the following table.
Table 11: Table of Accuracy Measures for Random Forest Regressor with the Selected Predictor Variables
Measure Value
R2 0.81
MSE 3,045,017,182.36
RMSE 55,181.67
MAPE 7.73%
MAE 21,039.54
Important Features
Figure 29: Feature Importance of the Selected Predictor Variables in the Random Forest
4.6 Summary of Measures of Accuracy
features changes the complexity of the model. Feature importance is calculated based on how
features contribute to the model's ability to reduce the error. If the model is simpler with fewer
features, the importance scores may be distributed differently.
Based on Figure 28, it can be concluded that the following lag features are important in
predicting the Sri Lankan gold price.
Table 13: Important Lags to Predict Sri Lankan Gold Price Using XGBoost
Variable Lag
Crude Oil Price 1,2,3
S&P 20 1,2,7,21
Silver Price 1
Gold Price 1,2
USD LKR Exchange Rate 1,14
Chapter 5
Conclusion and Future Work
II. The study was constrained to the time frame, spanning from January 2014 to September
2022, due to limitations in the data available from various resources.
III. Certain variables suggested in previous literature were omitted from the forecasting of the
gold price due to unavailability of the required data.
IV. The gold price is influenced not only by the independent variables proposed in this study
but also by other factors like macroeconomic news announcements and money supply,
which play a significant role in affecting gold prices.
V. The research utilized data obtained from secondary sources, and as a result, the inherent
limitations associated with secondary data are expected to impact this study.
VI. The findings of this study are relevant exclusively to the Sri Lankan market and provide
benefits primarily for Sri Lankan policymakers and participants in the local market.
Firstly, despite the determinants proposed in this study, it is recommended that future
researchers consider incorporating additional variables that significantly impact gold prices.
Factors such as money supply, macroeconomic announcements, and jewelry demand could
provide a more comprehensive understanding of the dynamics influencing gold prices. Given
gold's dual nature as a safe haven asset and investment tool, policymakers may require a more
intricate framework to inform their investment decisions. Hence, future researchers are
encouraged to explore gold in different contexts, such as its volatility and returns rather than
just its price. This approach could offer valuable insights into the dynamics of individual
assets and portfolios, leading to improved risk and portfolio management strategies for
financial planners.
Secondly, future researchers are advised to investigate related issues under different economic
conditions, particularly during economic crises and periods of stability. Different factors may
significantly influence gold prices under varying economic circumstances, and a
comprehensive examination can provide decision-makers with tailored insights into
investment strategies.
Further research can explore the application of advanced models, such as neural networks,
GARCH, ARCH, hybrid models and similar approaches, to predict gold prices.
Furthermore, as this study focuses solely on Sri Lanka, future research should adopt a broader
perspective by considering countries with diverse economic statuses, encompassing
developed, developing, and emerging economies. By selecting multiple countries, researchers
can uncover unique findings from different regions, contributing more meaningful and
relevant information for policymakers, economists, and investors worldwide.
5.5 Conclusion
Predicting the daily gold price is challenging due to the high volatility in the price. This
study predicts Sri Lanka’s daily gold price using exogenous variables, namely Brent crude oil
price, USD LKR exchange rate, CNY LKR exchange rate, S&P SL 20 index, silver price, and
CCPI. The Augmented Dickey-Fuller test showed that all the variables, including the response
variable, were non-stationary. Hence, first differencing was applied to the full data set in
order to make the variables stationary. The significant lags of the predictor variables were
identified using the CCF plots, while the significant lags of the response variable were found
using the PACF. The XGBoost and Random Forest models were trained and tested using the
selected lag variables of the differenced data. Inverse differencing was then performed on the
predicted differenced data in order to obtain predictions on the original scale. Finally, R2,
MSE, MAPE, RMSE and MAE were used to evaluate the performance of the models with and without
feature selection.
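The preprocessing steps summarized above (first differencing, lag creation and inverse
differencing) can be sketched as follows. This is a minimal standard-library illustration with
made-up numbers, not the code in Appendix B. The function names are hypothetical, and the
inverse step assumes the predicted differences are accumulated from the last observed level.

```python
def difference(series):
    """First differencing: value at time t minus value at time t - 1."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


def make_lags(series, lags):
    """Build one row of lagged values per time step, plus the matching targets."""
    max_lag = max(lags)
    rows = [[series[t - lag] for lag in lags] for t in range(max_lag, len(series))]
    targets = series[max_lag:]
    return rows, targets


def inverse_difference(last_observed, predicted_diffs):
    """Rebuild levels by cumulatively adding differences to the last observed level."""
    levels = []
    level = last_observed
    for diff in predicted_diffs:
        level += diff
        levels.append(level)
    return levels


prices = [100.0, 102.0, 101.0, 105.0, 107.0]   # made-up prices
diffs = difference(prices)                     # series used for modelling
rows, targets = make_lags(diffs, lags=[1, 2])  # lag-1 and lag-2 predictors
rebuilt = inverse_difference(prices[0], diffs) # back on the original scale
```

Applying inverse_difference to the true differences recovers the original prices from the
second observation onwards, which is a quick check that the two transformations are consistent.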
The highest R2 value and the lowest MSE, MAPE, RMSE and MAE values indicate that, between
XGBoost and Random Forest, XGBoost is the better ML model for predicting the gold price in Sri
Lanka. The proposed XGBoost model predicts the daily gold price with an approximate error of
5.38% (MAPE). Given the high volatility in the Sri Lankan gold price, this model can be
considered moderately good.
Furthermore, it was identified that the following day’s gold price is highly affected by the
previous day’s Brent crude oil price, and that the previous day’s Brent crude oil price affects
the following day’s gold price more than the previous day’s gold price does.
The following features are required to predict the gold price in Sri Lanka.
I. Brent crude oil prices of the previous 3 days
II. S&P SL 20 values of the previous 2 days, the same day of the previous week, and the same
day 3 weeks earlier
III. Silver price of the previous day
IV. Gold prices of the previous 2 days
V. USD LKR exchange rate of the previous day and of the same day 2 weeks earlier
Appendices
Appendix B: Snapshots of the Python Code
Appendix Figure B-3: Sample Code for Differencing of the Data
Appendix Figure B-4: Sample Code for Creating the Lag Variables
Appendix Figure B-5: Sample Code for Modeling XGBoost and Random Forest
REFERENCES
[1] “The History of Gold as a Currency and Store of Value Throughout Human
Civilization.” Accessed: Aug. 01, 2023. [Online]. Available:
[Link]
and-store-of-value-throughout-human-civilization
[2] “When and Why Do Gold Prices Plummet?” Accessed: Aug. 01, 2023. [Online].
Available: [Link]
[Link]
[3] “What Is the Gold Standard? Advantages, Alternatives, and History.” Accessed: Aug.
01, 2023. [Online]. Available: [Link]
[Link]
[4] “Gold Price Prediction using Machine Learning - Javatpoint.” Accessed: Aug. 02,
2023. [Online]. Available: [Link]
machine-learning
[5] S. Ben Jabeur, S. Mefteh-Wali, and J. L. Viviani, “Forecasting gold price with the
XGBoost algorithm and SHAP interaction values,” Ann Oper Res, 2021, doi:
10.1007/s10479-021-04187-w.
[6] P. Zhang and B. Ci, “Deep belief network for gold price forecasting,” Resources
Policy, vol. 69, Dec. 2020, doi: 10.1016/[Link].2020.101806.
[7] D. Liu and Z. Li, “Gold price forecasting and related influence factors analysis based
on random forest,” in Advances in Intelligent Systems and Computing, Springer Verlag,
2017, pp. 711–723. doi: 10.1007/978-981-10-1837-4_59.
[8] D. Makala and Z. Li, “Prediction of gold price with ARIMA and SVM,” in Journal of
Physics: Conference Series, IOP Publishing Ltd, Feb. 2021. doi: 10.1088/1742-
6596/1767/1/012022.
[9] I. ul and K. Nazir, “Predicting Future Gold Rates using Machine Learning Approach,”
International Journal of Advanced Computer Science and Applications, vol. 8, no. 12,
2017, doi: 10.14569/ijacsa.2017.081213.
[10] M. YURTSEVER, “Gold Price Forecasting Using LSTM, Bi-LSTM and GRU,”
European Journal of Science and Technology, Dec. 2021, doi: 10.31590/ejosat.959405.
[11] Manjula and Karthikeyan, Gold Price Prediction using Ensemble based
Machine Learning Techniques.
[12] I. E. Livieris, E. Pintelas, and P. Pintelas, “A CNN–LSTM model for gold price time-
series forecasting,” Neural Comput Appl, vol. 32, no. 23, pp. 17351–17360, Dec. 2020,
doi: 10.1007/s00521-020-04867-x.
[13] S. Patalay, “Gold Price Prediction Using Machine Learning Model Trees.” [Online].
Available: [Link]
[14] Z. Alameer, M. A. Elaziz, A. A. Ewees, H. Ye, and Z. Jianhua, “Forecasting gold price
fluctuations using improved multilayer perceptron neural network and whale
optimization algorithm,” Resources Policy, vol. 61, pp. 250–260, Jun. 2019, doi:
10.1016/[Link].2019.02.014.
[15] D. N. Gono, H. Napitupulu, and Firdaniza, “Silver Price Forecasting Using Extreme
Gradient Boosting (XGBoost) Method,” Mathematics, vol. 11, no. 18, Sep. 2023, doi:
10.3390/math11183813.
[17] C. Pierdzioch and M. Risse, “Forecasting precious metal returns with multivariate
random forests,” Empir Econ, vol. 58, no. 3, pp. 1167–1184, Mar. 2020, doi:
10.1007/s00181-018-1558-9.
[19] M. Appuhamilage and K. Sriyalatha, “Does the All Share Price Index represent the
Colombo Stock Market ?”
[20] “XGBoost for Time Series extrapolation: You’re gonna need a bigger boat | by Saupin
Guillaume | Towards Data Science.” Accessed: Dec. 20, 2023. [Online]. Available:
[Link]
boat-9d329efa6814
[21] “What is Machine Learning? Definition, Types, Tools & More | DataCamp.” Accessed:
Nov. 04, 2023. [Online]. Available: [Link]
learning
[23] “Random Forest Regression in Python - GeeksforGeeks.” Accessed: Dec. 22, 2023.
[Online]. Available: [Link]
python/
[24] “Random Forest Regression in Python Explained | Built In.” Accessed: Dec. 21, 2023.
[Online]. Available: [Link]