ML/DL for Nifty Stock Prediction
Abstract – The key objective of a successful stock market prediction strategy is not only to generate the highest possible returns but also to minimize inaccuracies in stock price estimations. In trading, sentiment analysis helps investors make well-informed choices about where to put their money. However, forecasting stock prices is a complex task because prices are susceptible to a wide array of influences, including shifts in investor mood, economic and political landscapes, leadership transitions, and more. Predictions based solely on past data or textual content tend to be unreliable. To improve accuracy, there is a growing focus on integrating the sentiment from news sources with existing stock price information. A deep learning method has been developed to track the trends of Nifty50 stocks, utilizing data scraped from social media platforms like Twitter, Facebook, StockTwits, and YouTube. This data was cleaned and analyzed to obtain subjectivity and polarity scores, reflecting positive, neutral, or negative sentiments. By integrating these sentiment scores with market data, a novel approach was formed to predict Nifty50 returns using the deep learning model.

Keywords – Social media, sentiment analysis, deep learning, stock movement prediction

I. INTRODUCTION

A. Background
As communications technologies have advanced and high-speed internet has become more accessible globally, a diverse array of individuals from various backgrounds and cultures has increasingly engaged with social media. The ubiquitous presence of the internet has made social media networks, blogs, Facebook, and Twitter very popular and effective. People interact and share their ideas, opinions, interests, and personal information [1]. These social media channels have profoundly altered how people communicate and collaborate. However, manually analyzing the vast volume of user-generated data has become cost-prohibitive, leading to the development of automated systems like Sentiment Analysis [2]. Sentiment Analysis can swiftly determine the overall sentiment of news stories, providing a valuable tool amidst the growing popularity of these strategies. Consequently, it has become easier to comprehend the evolving trends in the stock market, offering potentially profitable returns with minimal effort [3].

Stock market analysis is a dynamic and pivotal area of inquiry, and forecasting market behavior is critically important in the contemporary era. Because the stock market is highly dynamic and uncertain, its prediction has naturally become an active research topic. The inherent complexity of predicting market fluctuations necessitates a deep and comprehensive examination of data patterns. To tackle this complexity, a blend of specialized statistical methodologies and artificial intelligence becomes indispensable, guiding us toward more precise outcomes. A spectrum of machine learning and deep learning techniques holds the promise of delivering robust predictions with reduced margins of error. These computational approaches, by analyzing historical data and identifying underlying patterns, enable stakeholders to make more informed decisions with a higher degree of confidence [4]. The convergence of AI and ML in forecasting stock market trends marks a significant shift towards a more analytical and data-driven approach to stock trading, diminishing the reliance on speculative guesswork. As computational technology advances, it promises to unlock even more sophisticated AI and ML capabilities, potentially elevating the precision and efficiency of stock market predictions to unprecedented levels [5]. This technological advancement is not only advantageous for individual traders and financial institutions
but also plays a critical role in enhancing the stability and transparency of financial markets on a global scale.

The abundance of data sources enhances the depth of understanding in stock market analysis, leading to more accurate stock price predictions than would otherwise be possible. Certain techniques establish connections between historical data and future stock price movements, utilizing past trends to forecast upcoming changes [6]. Stock investors use market trend forecasts to decide the best times to buy or sell stocks, aiming to buy low and sell high to maximize profits [7]. However, accurately predicting the stock market is challenging due to many factors, including the effects of social media. These variables can significantly impact market trends positively or negatively, making them crucial for investors to consider for successful market predictions (Khan et al., 2022).

Navigating the stock market's volatility requires a disciplined approach for investors seeking substantial returns. Before investing, diligent evaluation of a company's market performance is crucial, which often involves analyzing its presence on social media and financial news platforms. However, the sheer volume of data available from these sources exceeds what investors can feasibly process on their own, underscoring the need for automated decision support systems. Such systems leverage machine learning algorithms to sift through vast datasets, identifying trends and making predictions about stock performance [9]. The quest to pinpoint the most effective algorithms for analyzing external data sources, such as financial news and social media, is critical. Accurate predictions based on these external factors can significantly enhance investors' profits, sparking considerable interest among machine learning researchers dedicated to improving stock market investment strategies. The surge in popularity of advanced analytical strategies has significantly enhanced the clarity and comprehensibility of stock market trends. These methods offer a commendable return on investment with minimal effort required from the investor's side. Given the inherent dynamism of the stock market, where prices and trends are in constant flux, the ability to accurately forecast future movements of stock prices becomes paramount [10]. This necessitates not only a deep understanding of the market's historical and current behaviors but also an adeptness at employing sophisticated prediction tools and models. These advancements have made it increasingly feasible for investors to navigate the complexities of the stock market, thereby democratizing access to strategies that yield respectable returns with reduced effort and risk.

B. Literature Review
Various researchers have developed methodologies to enhance the accuracy of stock market predictions, employing a range of approaches. Jayanth Balaji explored the efficacy of 14 different deep learning models in forecasting the stock prices of companies, demonstrating the potential of deep learning in financial predictions [11]. Similarly, Tsung Wuu Lin focused on leveraging Artificial Neural Networks (ANN) to optimize profitability, showcasing the capability of ANN in financial modeling [12]. Autoregressive models are highlighted for their robustness in stock market forecasting, offering valuable insights into time series analysis and yielding precise predictions. Additionally, sentiment analysis has emerged as a powerful tool for stock market forecasting, with social media analytics playing a crucial role [13]. The ARIMA model, in particular, is noted for its effectiveness in sentiment analysis and in predicting time series data, underscoring the diverse methodologies researchers are implementing to tackle the dynamic challenge of stock market prediction [14]. Deng's study employed sentiments from individual, institutional, and foreign investors as predictors for the directional trends of the Shanghai Stock Exchange index [15]. The rapid advancement of the Internet, particularly social media, has made it possible for online textual content to mirror investor sentiment and forecast trends in the stock market. Consequently, there is a need for an efficient approach to derive insights from the vast volume of textual documents available [16]. Techniques like sentiment analysis, opinion mining, natural language processing, and data mining are employed to extract perspectives, feelings, and opinions from text-based content [17]. A vital component of these prediction methods is sentiment analysis, often referred to as opinion mining, which is a process used in natural language processing and text analysis to systematically identify, extract, quantify, and study affective states and subjective information. It is commonly used to determine the sentiment or emotional tone behind words in a text and to understand the attitudes, opinions, and emotions expressed [18]. The study in [19] specifically applied these sentiment analysis techniques to analyze a Twitter dataset, an endeavor that presents unique challenges due to the informal, concise, and dynamic nature of language on social media. Their work serves as an extensive guide to sentiment analysis within the realm of NLP, showcasing how different methods can be leveraged to interpret sentiments expressed in the vast and varied Twitter landscape [20]. The study by Singh in 2020 emphasizes the importance of method selection based on project needs, considering factors like computational resources, real-time processing capabilities, and desired accuracy levels. This approach underlines the nuanced considerations required to effectively employ sentiment analysis on social media data, highlighting the evolving nature of language and sentiment expression online [21].

Weng's study created an advanced financial system capable of assessing the sentiment scores of news articles related to specific stocks, to predict short-term stock price movements [22].

II. RESEARCH METHODOLOGY

A. Data Gathering
Secondary data for the proposed study was extracted from Yahoo Finance and four social media platforms (Twitter, StockTwits, Facebook, and YouTube) for the four years from 2018 to 2021. Numerous feeds are generated daily on these social media platforms about Nifty50 by different market experts. Therefore, the feeds have been aggregated and aligned day-wise, corresponding to the market return date, for the sentiment score calculation. The daily returns based on the closing price of the Nifty50 were taken as the dependent variable, to observe the trends corresponding to the derived sentiments and the market positions of Nifty50 (open, high, low, adjusted close, and volume), which serve as the independent variables. The data for Twitter (tweets) was collected via the Twitter API. Octoparse and Facepager were used to manage the data for StockTwits and Facebook, respectively. Video transcripts were taken for data collection on YouTube.

B. Data Cleaning and Pre-processing
Data preprocessing and cleaning are essential steps in preparing the dataset for the development of machine learning models for
trend prediction of Nifty50. The dataset comprised 20,867 samples with 15 attributes. Collinearity among predictor variables was assessed because it can affect model performance and interpretability [23]. Using Kendall's and Spearman's rank correlation coefficients, a correlation matrix was computed, revealing high collinearity among the attributes 'Open', 'High', 'Low', and 'Adjusted Close'. To address this, a new feature, 'New feature', was engineered to represent the average of these highly correlated attributes [24]. Outliers were detected and removed using Z-scores, which measure the deviation of each data point from the mean in terms of standard deviations [25].

$$Z = \frac{X - \mu}{\sigma} \qquad (1)$$

Where X is the data point, μ is the mean of the sample, and σ is the standard deviation of the sample.

Data points more than 3 standard deviations from the mean were considered outliers and removed from the dataset, resulting in a dataset of shape (20,589 × 5) [26].

The dataset was then standardized using the StandardScaler to ensure uniformity in feature scales, preventing any particular feature from dominating the modeling process [27].

Finally, the pre-processed dataset was split into training and testing sets, with 70% allocated for training and 30% for testing, facilitating model development and evaluation [28].

C. Data Preparation
A total of 20930 samples were collected to build a (20930, 7) matrix with input and output features for the deep-structured classifier. The input characteristics were scaled via z-score normalization [29], as mathematically represented in Equation 2:

$$I_s = (I_a - \mu) / \chi \qquad (2)$$

Where I_a and I_s represent the actual and scaled data, and µ and χ denote the sample mean and standard deviation. The scaled dataset was divided into 80% training, 10% validation, and 10% test sets. Finally, the scaled data was reshaped as (BS, 6, 1) to obtain an acceptable input to the LSTM layer. The output labels include two categories, negative and positive trends of the Nifty50 stock.

D. Machine Learning model
Machine learning techniques offer a systematic approach to analyzing large volumes of historical market data and identifying patterns that may influence future trends. By employing ML algorithms, we aim to enhance our understanding of the underlying dynamics driving Nifty50 movements and develop predictive models capable of forecasting market trends with greater accuracy [30].

a) Support Vector Machine
SVMs were developed in the 1990s by Vladimir N. Vapnik and his colleagues, and they published this work in a paper titled "Support Vector Method for Function Approximation, Regression Estimation, and Signal Processing" in 1995 [31].

Support Vector Machines (SVMs) play a pivotal role in accurate prediction, owing to their robustness and adaptability to complex market dynamics [32].

SVMs excel in discerning intricate patterns by effectively separating data points into distinct classes using hyperplanes. This capability allows SVMs to capture both linear and nonlinear relationships, making them versatile for modeling diverse behaviors [33].

The SVM algorithm is widely used in machine learning as it can handle both linear and nonlinear classification tasks. When the data is not linearly separable, kernel functions are used to transform the data to a higher-dimensional space to enable linear separation. This application of kernel functions is known as the "kernel trick", and the choice of the kernel function, such as linear, polynomial, radial basis function (RBF), or sigmoid kernels, depends on data characteristics and the specific use case.

To separate multi-dimensional data we use a hyperplane. For two-dimensional data that are linearly separable, the hyperplane is a line, y = ax + b. Renaming x as x1 and y as x2, we get:

$$a x_1 - x_2 + b = 0 \qquad (3)$$

If we define x = (x1, x2) and ω = (a, −1), we get:

$$\omega \cdot x + b = 0 \qquad (4)$$

After obtaining the hyperplane, we utilize it for making predictions. We define the hypothesis function h as:

$$h(x) = \begin{cases} +1 & \text{if } \omega \cdot x + b \geq 0 \\ -1 & \text{if } \omega \cdot x + b < 0 \end{cases} \qquad (5)$$

A point on or above the hyperplane is classified as class +1, and a point below the hyperplane is classified as class -1.

b) Gradient Boosting Classifier
Gradient Boosting Classifier (GBC), pioneered by Jerome Friedman, is an ensemble method for regression and classification. It iteratively improves the model by combining weak learners and minimizing a loss function through gradient descent [34].

The learning process aims to minimize errors from previous iterations, enhancing predictive performance. Parameter tuning, including adjusting the number of trees, learning rate, and tree depth, is pivotal for optimizing results, mitigating overfitting, and improving model accuracy [35].

The formulation for updating the prediction in gradient boosting at iteration m can be expressed as [36]:

$$\hat{y}_m(x) = \hat{y}_{m-1}(x) + \lambda \cdot h_m(x) \qquad (8)$$

Where \hat{y}_m(x) represents the predicted value at iteration m for input x, \hat{y}_{m-1}(x) is the prediction from the previous iteration, h_m(x) is the weak learner (e.g., decision tree) trained to fit the residuals, and λ is the learning rate, controlling the step size in the gradient descent process.
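The update rule in Equation 8 can be illustrated with a minimal sketch. It assumes a squared-error loss (so the negative gradient is simply the residual), one-dimensional inputs, and one-split regression stumps as weak learners. The names `fit_stump` and `gradient_boost` are hypothetical helpers written for this illustration and are not part of the XGBoost or LightGBM implementations used in the study.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a one-split regression stump h(x) to the residuals (least squares)."""
    best = None
    # Try every split point except the largest x, so both sides are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Return a predictor built by repeating y_m(x) = y_{m-1}(x) + lambda * h_m(x)."""
    base = mean(ys)                      # initial constant prediction
    learners = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient equals the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)
        learners.append(h)
        # Equation 8: move the prediction a small step along the weak learner.
        preds = [p + learning_rate * h(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(h(x) for h in learners)

    return predict


predict = gradient_boost([1, 2, 3, 4, 5, 6], [1, 1, 1, 5, 5, 5])
print(round(predict(2), 2), round(predict(5), 2))
```

A smaller `learning_rate` needs more rounds to reach the same fit but makes each step more conservative, which is the trade-off the tuning discussion above refers to.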
We study two implementations of gradient boosting machines: XGBoost [37] and LightGBM [38].

The hyperparameters tuned for the gradient boosting classifier are learning rate, n_estimators, subsample, and max depth; with these hyperparameters we aimed to achieve maximum accuracy using LightGBM.

E. Grey Wolf Optimization
Grey Wolf Optimization (GWO) is a metaheuristic optimization algorithm inspired by the social behavior and hunting mechanism of grey wolves. Introduced by Mirjalili et al. in 2014, GWO mimics the leadership hierarchy observed in wolf packs, where the alpha, beta, and delta wolves represent the pack's leaders [39].

In GWO, the search population consists of two types of grey wolves: wolf leaders (dominant wolves) and follower wolves. In the group of wolf leaders, there are three members: the alpha (α) wolf, representing the best solution found so far in the search space; the beta (β) wolf, representing the second-best solution; and the delta (δ) wolf, representing the third-best solution. The rest of the population are considered followers, namely omega (ω) wolves [40]. In this subsection, the steps that grey wolves take to attack the prey are described in sequence, along with an explanation of the social hierarchy.

1. Tracking according to social hierarchy
To mathematically model the social hierarchy of wolves when designing GWO, we consider the fittest solution as the alpha (α). Consequently, the second and third best solutions are named beta (β) and delta (δ) respectively. The rest of the candidate solutions are assumed to be omega (ω). In the GWO algorithm, the hunting (optimization) is guided by α, β, and δ. The ω wolves follow these three wolves.

2. Encircling the prey
As mentioned above, grey wolves encircle prey during the hunt. To mathematically model encircling behavior, the following equations are proposed:

$$\vec{D} = |\vec{C} \cdot \vec{X}_p(t) - \vec{X}(t)| \qquad (9)$$

$$\vec{X}(t+1) = \vec{X}_p(t) - \vec{A} \cdot \vec{D} \qquad (10)$$

Where t indicates the current iteration, \vec{A} and \vec{C} are coefficient vectors, \vec{X}_p is the position vector of the prey, and \vec{X} indicates the position vector of a grey wolf. The vectors \vec{A} and \vec{C} are calculated as follows:

$$\vec{A} = 2 a \cdot \vec{r}_1 - a \qquad (11)$$

$$\vec{C} = 2 \cdot \vec{r}_2 \qquad (12)$$

Here \vec{r}_1 and \vec{r}_2 are random vectors with components in [0, 1].

3. Hunting
To mathematically simulate the hunting behavior of grey wolves, we suppose that the alpha (best candidate solution), beta, and delta have better knowledge about the potential location of prey.

Hence, we store the top three solutions attained thus far and direct the remaining search agents, including the omegas, to adjust their positions based on the best search agents' positions. To accomplish this, we employ the following formulas:

$$\vec{D}_\alpha = |\vec{C}_1 \cdot \vec{X}_\alpha - \vec{X}| \qquad (13)$$

$$\vec{D}_\beta = |\vec{C}_2 \cdot \vec{X}_\beta - \vec{X}| \qquad (14)$$

$$\vec{D}_\delta = |\vec{C}_3 \cdot \vec{X}_\delta - \vec{X}| \qquad (15)$$

$$\vec{X}_1 = \vec{X}_\alpha - \vec{A}_1 \cdot \vec{D}_\alpha \qquad (16)$$

$$\vec{X}_2 = \vec{X}_\beta - \vec{A}_2 \cdot \vec{D}_\beta \qquad (17)$$

$$\vec{X}_3 = \vec{X}_\delta - \vec{A}_3 \cdot \vec{D}_\delta \qquad (18)$$

$$\vec{X}(t+1) = \frac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3} \qquad (19)$$

4. Attacking prey (exploitation)
To model the wolves' approach to the prey, the algorithm decreases the value of a, which sets the fluctuation range of \vec{A}, from 2 to 0 across iterations. \vec{A} is then a random value in the interval [-a, a]. When random values of \vec{A} fall within [-1, 1], a search agent's next position can be anywhere between its current position and the prey's position. Figure 5(a) illustrates that when |A| < 1, wolves are directed to attack towards the prey.

5. Searching for prey (exploration)
In the GWO algorithm, randomness is introduced through the parameters \vec{A} and \vec{C} to encourage divergence among search agents, promoting global exploration. Values of |A| greater than 1 facilitate exploration, while \vec{C} provides random weights on the prey's contribution to the distance, aiding in avoiding local optima. Unlike \vec{A}, \vec{C} is not decreased over iterations and so maintains randomness throughout optimization, preventing stagnation in local optima. GWO starts with a random population of wolves; the alpha, beta, and delta wolves iteratively estimate the prey position while the remaining wolves adjust their distances accordingly.
[Figure: flowchart of the GWO procedure, running from "Begin" through the loop tests i < S and t < M (Yes/No branches) to "Output & display optimization parameter results" and "End".]
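The GWO steps in Equations 11 to 19 can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: it assumes a linear decay of a from 2 to 0, a simple box constraint on positions, and a toy objective. The names `grey_wolf_optimize` and `sphere` are hypothetical. For hyperparameter tuning, the objective would instead map a position (for example, a learning rate and a tree depth) to a validation error.

```python
import random


def sphere(position):
    """Toy objective to minimize: sum of squares, with its minimum at the origin."""
    return sum(v * v for v in position)


def grey_wolf_optimize(objective, dim, bounds, n_wolves=20, n_iters=100, seed=0):
    """Minimize `objective` over a box using the GWO position updates."""
    rng = random.Random(seed)
    lo, hi = bounds
    wolves = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_wolves)]
    # Alpha, beta, delta: the three best solutions attained so far.
    leaders = sorted(wolves, key=objective)[:3]
    for t in range(n_iters):
        a = 2.0 * (1 - t / n_iters)      # assumed linear decay of a from 2 to 0
        new_wolves = []
        for wolf in wolves:
            new_pos = []
            for j in range(dim):
                candidates = []
                for leader in leaders:
                    A = 2 * a * rng.random() - a          # Eq. (11)
                    C = 2 * rng.random()                  # Eq. (12)
                    D = abs(C * leader[j] - wolf[j])      # Eqs. (13)-(15)
                    candidates.append(leader[j] - A * D)  # Eqs. (16)-(18)
                x = sum(candidates) / 3                   # Eq. (19)
                new_pos.append(min(max(x, lo), hi))       # clip to the search box
            new_wolves.append(new_pos)
        wolves = new_wolves
        # Keep the best three positions seen so far as the new leaders.
        leaders = sorted(leaders + wolves, key=objective)[:3]
    return leaders[0]


best = grey_wolf_optimize(sphere, dim=2, bounds=(-10.0, 10.0))
print(best, sphere(best))
```

Early iterations (large a, so |A| can exceed 1) spread the wolves out, while late iterations (small a) pull them toward the average of the three leaders, matching the exploration and exploitation phases described above.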
F. Model Paradigm
The framework proposed in this study, as depicted in Fig. 1, outlines a schematic flow diagram for predicting trends in the Nifty50 indices. The inputs and output labels are preprocessed and fed into a predictive model. A notable preprocessing step involves addressing multicollinearity, where attributes whose correlation exceeds a threshold of 0.75 are averaged into a new column, effectively treating them as a single new feature. Incorporating Grey Wolf Optimization (GWO) for hyperparameter tuning significantly enhances the predictive performance of our models. Hyperparameter tuning is vital for optimizing machine learning and deep learning models, aiming to identify the set of hyperparameters that maximizes performance metrics such as accuracy, precision, or recall. Manual hyperparameter tuning is often time-consuming and requires domain expertise. By leveraging metaheuristic algorithms like GWO, we automate the exploration of the hyperparameter space, thus mitigating the challenges associated with manual tuning.

a) Comparison of different models with and without GWO

Table 1 Train data
Measures       SVM     SVM-GWO   GBC     GBC-GWO
Accuracy %     55.81   83.32     79.47   94.22
Precision %    55.81   84.00     75.94   92.47
Recall %       61.25   86.61     92.52   96.15
F1-score       0.71    0.85      0.83    0.94

Table 1 Test data
Measures       SVM     SVM-GWO   GBC     GBC-GWO
Accuracy %     57.21   69.09     79.25   93.68
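For reference, the measures reported in the tables can be computed from predicted and true labels as in the sketch below. It assumes the standard binary-classification definitions and a positive class labelled 1. The function name `classification_metrics` and the example labels are hypothetical and are not taken from the study's data.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative labels only: 1 = positive trend, 0 = negative trend.
metrics = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                                 [1, 1, 0, 0, 0, 1, 1, 0])
print(metrics)
```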
H. CONCLUSIONS