Business Forecasting (MBA3017)
END-TERM PROJECT
TIME SERIES FORECASTING USING
ARIMA
Submitted to
Dr. Rosewine Joy
Professor
School of Management (SOM)
Presidency University
Submitted By:
ATMANAND P GARAG
20212MBA0055
SECTION “2”
Date of Submission:
07/12/2022
INTRODUCTION
TIME SERIES ANALYSIS
Time series forecasting focuses on analyzing how data change across equally spaced
time intervals. Time series analysis is used in a wide variety of domains, ranging
from econometrics to geology and earthquake prediction; it is also used in almost all
applied sciences and engineering. Examples of time series data include the S&P 500
Index, disease rates, mortality rates, blood pressure tracking, and global temperatures.
This report looks at how autoregressive integrated moving average (ARIMA) models
work and how they are fitted to time series data. The first point to consider before
moving forward is the difference between multivariate and univariate forecasting.
Univariate forecasting uses only the previous values of the series itself to forecast
future values, whereas multivariate forecasting also uses other predictor variables.
DATA SET
The monthly average price of crude oil in India
Date
Price($/bbl)
We will explore how crude oil prices in India have varied from 2000 to 2022.
DATA SOURCE
Socio-Economic Statistics India, Statistical Data Figures 2022 ([Link])
Time Series in R
A time series in R is used to see how a quantity behaves over a period of time. In R, it
can be created easily with the ts function and a few parameters. The function takes a
data vector and attaches a timestamp to each value, based on the start and frequency
given by the user. It is mostly used to study and forecast the behavior of a business
quantity over time: for example, sales analysis of a company, inventory analysis,
price analysis of a particular stock or market, population analysis, etc.
Syntax: objectName <- ts(data, start, end, frequency)
where,
data represents the data vector
start represents the first observation in the time series
end represents the last observation in the time series
frequency represents the number of observations per unit of time. For example,
frequency=12 for monthly data, frequency=4 for quarterly data and frequency=1 for annual data.
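The ts syntax above can be tried on a small example. This is a minimal sketch using made-up values (the vector `prices` and the object `monthly` are illustrative names, not the crude oil data used in this report):

```r
# Illustrative values only, not the crude oil data used in this report
set.seed(1)
prices <- round(runif(24, min = 20, max = 30), 2)

# frequency = 12 marks the data as monthly; start = c(2000, 4) means April 2000
monthly <- ts(prices, start = c(2000, 4), frequency = 12)

start(monthly)      # first time point (year, month)
end(monthly)        # last time point (year, month)
frequency(monthly)  # observations per year

# window() extracts a sub-period, here calendar year 2001
year_2001 <- window(monthly, start = c(2001, 1), end = c(2001, 12))
year_2001
```

When end is omitted, as here, R derives it from start, frequency and the length of the data.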
Components of Time Series Analysis
The reasons or forces that change the attributes of a time series are known as the
Components of Time Series.
The following are the components of time series −
Trend
Seasonal Variations
Cyclical Variations
Random or Irregular Movements
TIME SERIES GRAPHICS
The first thing to do in any data analysis task is to plot the data. Graphs enable many
features of the data to be visualised, including patterns, unusual observations,
changes over time, and relationships between variables. The features that are seen in
plots of the data must then be incorporated, as much as possible, into the forecasting
methods to be used. Just as the type of data determines what forecasting method to
use, it also determines what graphs are appropriate. But before we produce graphs,
we need to set up our time series in R.
For time series data, the obvious graph to start with is a time plot. That is, the
observations are plotted against the time of observation, with consecutive
observations joined by straight lines. The figure below shows the monthly average
price of crude oil in India.
Here we can see the irregular variation in the data over time.
DECOMPOSE
By default, R's decompose function uses an additive time series model. To use the
multiplicative model, we specify type = "multiplicative" in the decompose
function. The trend is estimated with a centred moving average, so it is not calculated
for the first and last few values. The seasonal component repeats from year to year.
Here we see a clean visualization showing our original time series together with its
trend, seasonal and random components.
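The behaviour described above can be checked on a simulated series. This is a minimal sketch, assuming a made-up monthly series with a linear trend and a sine-shaped seasonal pattern (all object names are illustrative):

```r
# Simulated monthly series: linear trend + seasonal pattern + noise
set.seed(42)
n <- 72
t_idx <- 1:n
x <- ts(50 + 0.5 * t_idx + 5 * sin(2 * pi * t_idx / 12) + rnorm(n),
        start = c(2000, 1), frequency = 12)

dc_add  <- decompose(x)                           # additive model (default)
dc_mult <- decompose(x, type = "multiplicative")  # multiplicative model

# The trend is a centred moving average, so the first and last
# six values of a monthly series are NA
head(dc_add$trend, 8)

# One seasonal index per month; the same 12 values repeat every year
dc_add$figure
```

Calling plot(dc_add) draws the four panels (observed, trend, seasonal, random) in one figure.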
ADF TEST
A time series is considered to be “stationary” if it has no trend, constant variance
over time, and a consistent autocorrelation structure across time.
One technique for checking this is the augmented Dickey-Fuller (ADF) test, which
uses the following null and alternative hypotheses:
H0: the time series has a unit root (it is non-stationary)
H1: the time series is stationary
Code For ADF TEST
Results
Here the p-value is greater than 0.05, so we fail to reject the null hypothesis: the series is not stationary.
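To give a feel for what the test measures, the sketch below runs a simplified Dickey-Fuller regression by hand in base R. It is not the adf.test routine used in this report: it has no augmentation lags and no trend term, the helper `df_stat` is a hypothetical name, and the cut-off of about -2.86 is the usual large-sample 5% critical value for the intercept-only case.

```r
set.seed(123)
n <- 200
random_walk <- cumsum(rnorm(n))  # has a unit root (non-stationary)
white_noise <- rnorm(n)          # stationary

# Regress the first difference on the lagged level and return the
# t-statistic of the lagged level (simplified Dickey-Fuller statistic)
df_stat <- function(y) {
  dy    <- diff(y)
  y_lag <- y[-length(y)]
  fit   <- lm(dy ~ y_lag)
  summary(fit)$coefficients["y_lag", "t value"]
}

stat_rw <- df_stat(random_walk)
stat_wn <- df_stat(white_noise)

# A statistic well below about -2.86 is evidence against a unit root
stat_rw
stat_wn
```

The stationary series produces a strongly negative statistic, while a random walk usually does not.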
Differencing to remove a trend or seasonal effects
An alternative to decomposition for removing trends is differencing. We saw in the
lecture how the difference operator works and how it can be used to remove linear
and nonlinear trends as well as various seasonal features that might be evident in the
data. As a reminder, the first difference of a series is
\nabla y_t = y_t - y_{t-1}
and the difference at lag k is
\nabla_k y_t = y_t - y_{t-k}
In R we can use the diff() function for differencing a time series. It takes 3 main
arguments: x (the data), lag (the lag at which to difference), and differences (the
order of differencing). Both lag and differences default to 1.
For example, first-differencing a time series will remove a linear trend
(i.e., differences = 1); twice-differencing will remove a quadratic trend
(i.e., differences = 2). In addition, first-differencing a time series at a lag equal to the
period will remove a seasonal pattern (e.g., set lag = 12 for monthly data).
Let’s use diff() to remove the trend and seasonal signal from the data.
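The three claims about diff() can be verified on noise-free toy series. This is a small sketch with made-up vectors (`linear`, `quadratic` and `seasonal` are illustrative names, not the project data):

```r
t_idx <- 1:48

linear    <- 3 + 2 * t_idx        # linear trend with slope 2
quadratic <- 1 + 0.5 * t_idx^2    # quadratic trend
seasonal  <- rep(c(5, 3, 1, -2, -4, -5, -4, -2, 1, 3, 4, 0), times = 4)

d1 <- diff(linear, differences = 1)     # linear trend removed: constant slope remains
d2 <- diff(quadratic, differences = 2)  # quadratic trend removed: constant remains
ds <- diff(seasonal, lag = 12)          # 12-month pattern removed: all zeros

unique(d1)
unique(d2)
unique(ds)

# Each difference shortens the series: lag * differences values are lost
length(ds)
```

On real data the differenced series will of course still contain noise; the constant and zero results here only appear because the toy series have no random component.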
CODES
STATIONARY DATA PLOT
Here the p-value is 0.01, which is below 0.05, so we reject the null hypothesis: the differenced data is stationary.
ARIMA MODEL
ARIMA stands for Autoregressive Integrated Moving Average and is specified by
three order parameters: (p, d, q).
AR(p) Autoregression: A regression model that utilizes the dependent
relationship between a current observation and observations over a previous
period. An autoregressive (AR(p)) component refers to the use of past values in
the regression equation for the time series.
I(d) Integration: Uses differencing of observations (subtracting the observation
at the previous time step from the current observation) in order to make the time
series stationary. The differencing is applied d times.
MA(q) Moving Average: A model that uses the dependency between an
observation and a residual error from a moving average model applied to lagged
observations. A moving average component depicts the error of the model as a
combination of previous error terms. The order q represents the number of terms
to be included in the model.
1. Auto-regressive Component
It implies a relationship between the value of a series at a point in time and its own
previous values. Such a relationship can exist with any order of lag. A lag is
simply the value at a previous point in time. It can have various orders as
shown in the table below. It hints toward a pointed relationship.
2. Moving Average Component
It implies that the current deviation from the mean depends on previous deviations.
Such a relationship can exist with any number of lags, and that number decides the
order of the moving average.
The moving average is the average of consecutive values at various time
periods. It can have various orders as shown in the table below. It hints toward
a distributed relationship, as the moving average is itself derived from several lags.
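The three components described above combine into a single equation. In the standard textbook form (the symbols c, \phi, \theta, \varepsilon and y' are notation introduced here for illustration, not taken from the project output), an ARIMA(p, d, q) model can be written as:

```latex
y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p}
         + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}
         + \varepsilon_t
```

Here y'_t is the series after differencing d times, the \phi terms are the AR(p) part acting on past values, the \theta terms are the MA(q) part acting on past error terms, and \varepsilon_t is white noise.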
ARIMA Modelling procedure
When fitting an ARIMA model to a set of (non-seasonal) time series data, the
following procedure provides a useful general approach.
1. Plot the data and identify any unusual observations.
2. If necessary, transform the data (using a Box-Cox transformation) to stabilize the
variance.
3. If the data are non-stationary, take the first differences of the data until the data are
stationary.
4. Examine the ACF/PACF: Is an ARIMA(p,d,0) or ARIMA(0,d,q) model
appropriate?
5. Try your chosen model(s), and use the AICc to search for a better model.
6. Check the residuals from your chosen model by plotting the ACF of the residuals,
and doing a portmanteau test of the residuals. If they do not look like white noise,
try a modified model.
7. Once the residuals look like white noise, calculate forecasts.
The Hyndman-Khandakar algorithm only takes care of steps 3–5. So even if you use it,
you will still need to take care of the other steps yourself.
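Steps 3 to 7 of the procedure above can be sketched in base R. This is an illustration under stated assumptions, not the workflow used for the crude oil data: the series is simulated, only two candidate orders are tried, and plain AIC is used in place of AICc because base R has no built-in AICc function.

```r
set.seed(2022)
# Simulated ARIMA(1,1,0) series standing in for real data
y <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.5), n = 200)

# Step 3: difference once to remove the stochastic trend
dy <- diff(y)

# Step 4: inspect the ACF and PACF of the differenced series
acf(dy, lag.max = 12, plot = FALSE)
pacf(dy, lag.max = 12, plot = FALSE)

# Step 5: fit candidate models and keep the one with the lower AIC
fit_ar <- arima(y, order = c(1, 1, 0))
fit_ma <- arima(y, order = c(0, 1, 1))
best   <- if (AIC(fit_ar) <= AIC(fit_ma)) fit_ar else fit_ma

# Step 6: Ljung-Box portmanteau test on the residuals
# (fitdf = 1 because each candidate has one estimated coefficient)
lb <- Box.test(residuals(best), lag = 10, type = "Ljung-Box", fitdf = 1)
lb$p.value  # a large p-value is consistent with white-noise residuals

# Step 7: forecast 12 steps ahead with standard errors
fc <- predict(best, n.ahead = 12)
fc$pred
fc$se
```

Steps 1 and 2 (plotting and a possible Box-Cox transformation) are left out because they depend on the data at hand.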
Autocorrelation Function (ACF)
The ACF is the correlation between a time series and a lagged version of itself, that is,
the correlation between the observation at the current time point and the observations
at previous time points. The autocorrelation function starts at lag 0, which is the
correlation of the time series with itself and therefore results in a correlation of 1.
The ACF plot can provide answers to the following questions:
Is the observed time series white noise/random?
Is an observation related to an adjacent observation, an observation twice removed,
and so on?
Can the observed time series be modeled with an MA model? If yes, what is the
order?
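The questions above can be explored numerically with acf(). This is a minimal sketch on simulated data (the objects `noise` and `ma1` are illustrative, not the project series); plot = FALSE returns the values instead of drawing the correlogram.

```r
set.seed(7)
n     <- 200
noise <- rnorm(n)                                  # white noise
ma1   <- arima.sim(model = list(ma = 0.8), n = n)  # MA(1) process

acf_noise <- acf(noise, lag.max = 5, plot = FALSE)
acf_ma1   <- acf(ma1, lag.max = 5, plot = FALSE)

# Lag 0 is the correlation of the series with itself
acf_noise$acf[1]

# Approximate 95% bounds for white noise, as drawn on the ACF plot
bound <- qnorm(0.975) / sqrt(n)
bound

# For an MA(1) process the lag-1 autocorrelation stands out,
# while later lags should mostly stay inside the bounds
round(acf_ma1$acf[, 1, 1], 2)
```

An ACF that cuts off after lag q is the usual hint that an MA(q) model may be suitable.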
Codes for Autocorrelation Function(ACF)
ACF with Lag of 01
ACF with Lag of 05
ndiffs Function
The ndiffs function estimates the number of first differences required to make a
given time series stationary.
Codes for ndiffs
Output
Here the number of differences required to make the data stationary is 1, which means
the d value in the ARIMA model is 1.
PACF
PACF is the partial autocorrelation function, which measures the partial correlation
between the series and its own lags. In simple terms, PACF can be explained using a
linear regression where we predict y(t) from y(t-1), y(t-2), and y(t-3) [2]. In
PACF, we correlate the “parts” of y(t) and y(t-3) that are not predicted by y(t-1)
and y(t-2).
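The regression view of the PACF can be illustrated in base R. In this sketch the series is a simulated AR(2) process and the object names are made up; the regression coefficient and the pacf() value are close but not identical, because pacf() derives its values from the sample autocorrelations rather than from least squares.

```r
set.seed(11)
y <- as.numeric(arima.sim(model = list(ar = c(0.6, 0.2)), n = 500))

# embed() builds the lag matrix: columns are y(t), y(t-1), y(t-2), y(t-3)
lagged <- embed(y, 4)

# Regress y(t) on its first three lags
fit <- lm(lagged[, 1] ~ lagged[, 2] + lagged[, 3] + lagged[, 4])

# The coefficient on y(t-3) approximates the partial autocorrelation at lag 3
coef_lag3 <- unname(coef(fit)[4])

pacf_vals <- pacf(y, lag.max = 3, plot = FALSE)$acf[, 1, 1]

coef_lag3
pacf_vals[3]

# At lag 1 there is nothing to partial out, so PACF equals ACF
acf_lag1 <- acf(y, lag.max = 1, plot = FALSE)$acf[2]
```

For an AR(p) process the PACF tends to cut off after lag p, which is why it is used to choose the p value.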
Codes for PACF
PACF Plot
PACF with lag of 5
From the ACF and PACF plots we identified the following candidate ARIMA models, i.e. (p, d, q):
Models p d q
A 2 1 2
B 0 1 0
C 1 1 0
D 0 1 1
E 1 1 1
Outputs of the ARIMA models
Models AIC RMSE
A 2075.14 11.00031
B 2099.85 11.75
C 2101.69 11.74
D 2101.69 11.74
E 2103.69 11.75
AIC
The Akaike information criterion (AIC) is a mathematical method
for evaluating how well a model fits the data it was generated from.
There is no value for AIC that can be considered “good” or “bad” because
we simply use AIC as a way to compare regression models. The model with the
lowest AIC offers the best fit
RMSE
Root mean squared error (RMSE) is the square root of the mean of the squared
errors. RMSE is considered an excellent general-purpose error
metric for numerical predictions. RMSE is a good measure of accuracy, but
only for comparing prediction errors of different models or model configurations
for a particular variable and not between variables, as it is scale-dependent. It
measures how closely the fitted values follow the data points.
The model with the lowest RMSE offers the best fit.
Here, model “A”, ARIMA(2,1,2), fits best because it has the lowest AIC and the lowest RMSE.
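The comparison above can be reproduced in outline with base R. This is a sketch on simulated data, so the numbers will not match the table in this report; the labels A to E mirror the candidate orders listed earlier, RMSE is computed from the in-sample residuals, and method = "ML" is chosen here only to keep the fits stable on simulated data.

```r
set.seed(99)
# Simulated series standing in for the real data
y <- arima.sim(model = list(order = c(2, 1, 2),
                            ar = c(0.5, -0.3), ma = c(0.4, 0.2)),
               n = 250)

orders <- list(A = c(2, 1, 2), B = c(0, 1, 0), C = c(1, 1, 0),
               D = c(0, 1, 1), E = c(1, 1, 1))

# Fit each candidate and collect AIC and in-sample RMSE
results <- do.call(rbind, lapply(names(orders), function(nm) {
  fit <- arima(y, order = orders[[nm]], method = "ML")
  data.frame(Model = nm,
             AIC   = AIC(fit),
             RMSE  = sqrt(mean(residuals(fit)^2)))
}))

# Lower is better for both criteria
results[order(results$AIC), ]
best_by_aic <- results$Model[which.min(results$AIC)]
best_by_aic
```

AIC penalises extra parameters while in-sample RMSE does not, so the two criteria can disagree; when they do, AIC is usually the safer guide for model selection.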
Forecast of Crude Oil Price using ARIMA
Output
Auto Arima
So far we have been manually fitting different models and deciding which one is
best. The auto.arima function automates this process. It takes the data, fits many
models of different orders and compares them using an information criterion.
However, the processing time increases substantially when it tries to fit complex
models.
Conclusion
To check the accuracy of the ARIMA model, the mean percentage error was
calculated by comparing the actual price with the forecasted price, and it
indicates that ARIMA gives accurate forecasts.
The results showed that the suggested model with the lower AIC was more
accurate in forecasting the time series. Final fitted model: (0,1,0) with fitted
values.
Reference
[Link]
[Link]
[Link]
[Link]
Codes
#ATMANAND P GARAG
#20212MBA0055
# SEC2
#END-TERM PROJECT
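# Time series objects are called ts objects
#command for timeseries is ts
#name<-ts(data,additional arguments)
# datasets, graphics and stats ship with R and need no installation;
# ets() is part of the forecast package
install.packages("forecast")
install.packages("tseries")
install.packages("ggplot2")
install.packages("readxl")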
library(forecast)
library(graphics)
library(stats)
library(tseries)
library(ggplot2)
library(datasets)
library(readxl)
# ts()- adding time series
# plot()- plot a time series
# start()- Returning the starting time of a timeseries
# end()- Returning the end time of a timeseries
#frequency()-Return the period of a timeseries
#window()-Subsets of a timeseries
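# Atma is assumed to be a data frame that has already been imported
# (e.g. with readxl::read_excel()); the file path is not given here
View(Atma)
Data<-ts(Atma$Price,start = c(2000,4),end = c(2022,10),frequency=12)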
Data
class(Data)
plot(Data)
autoplot(Data)
plot(Data)
#1. Seasonal Plot
seasonplot(Data,year.labels=TRUE,main="SEASONAL RANDOM-DATA")
#2. Decompose function is used for identifying the components in TS.
#ts components : Trend, Seasonal, Cyclical and Random
DC<-decompose(Data)
DC
plot(DC)
#3. ADF test
adf.test(Data)
#we know that we need to address two issues before we test for a stationary series
#one, we need to remove unequal variances. we do this using the log of the series
#two, we need to address the trend component
#Converting non-stationary into stationary
ndiffs(Data)
Data1<-diff(Data,differences = 1)
Data1
adf.test(Data1)
plot(Data1)
autoplot(Data1)
#ARIMA
#ARIMA- Autoregressive integrated moving average (ARIMA)
#To find Lag function is lag(ts,k)
#ACF plot autocorrelation function - Acf(ts)
#PACF plot partial auto correlation -pacf(ts)
#differential for stationarity ndiffs(ts)
#An ARIMA model is characterized by 3 terms: p, d, q
#p is the order of the AR term
#q is the order of the MA term
#p A pure Auto Regressive (AR only) model is one where Yt depends only on its
#own lags. That is, Yt is a function of the 'lags of Yt'.
#d is the number of differences required to make the time series stationary
#q Likewise a pure Moving Average (MA only) model is one where Yt depends only
#on the lagged forecast errors.
View(Data)
class(Data)
#Augmented Dickey-Fuller (ADF) test
m<-adf.test(Data)
m
#ndiffs (d)
ndiffs(Data)
#d=1
#ACF -Auto correlation
acf(Data)
acf(Data,lag.max = 1)
acf(Data,lag.max = 5)
#partial Auto correlation
pacf(Data)
pacf(Data,lag.max = 1)
pacf(Data,lag.max = 5)
[Link](Data)
#Arima Models
A<-arima(Data,order=c(2,1,2))
A
accuracy(A)
B<-arima(Data,order=c(0,1,0))
B
accuracy(B)
C<-arima(Data,order=c(1,1,0))
C
accuracy(C)
D<-arima(Data,order=c(0,1,1))
D
accuracy(D)
E<-arima(Data,order=c(1,1,1))
E
accuracy(E)
#aic values
A
B
C
D
E
#RMSE
accuracy(A)
accuracy(B)
accuracy(C)
accuracy(D)
accuracy(E)
#Forecast
FC<-forecast(A)
plot(FC)
autoplot(FC)
FC
#auto arima
auto.arima(Data)
FA<-auto.arima(Data,ic="aic",trace = TRUE)
FC<-forecast(FA)
plot(FC)
FC