Stock article title sentiment-based classification
using PhoBERT⋆
Nguyen Son Tung[0000−0001−9244−7093] , Nguyen Ngoc Long[0000−0002−6979−4473] ,
Trang Tran[0000−0003−3370−6272] , Nguyen Thu Thao[0000−0002−4026−7362] , Duong
T. Thu Phuong[0000−0002−4267−7162] , and Tuan Nguyen⋆⋆[0000−0002−3616−5267]
National Economics University, Hanoi, Vietnam
{209tungns,ngoclong1282001,huyentrang201ciel,thaothu2742001,
duongthithuphuong26122001}@[Link]
nttuan@[Link]
Abstract. Text classification is a typical and important part of supervised
learning. It has several applications in economics and has attracted the
attention of many stock market investors. News has long been an unanticipated
variable in stock investment that instantaneously influences stock price
directions. Faced with an enormous volume of news, investors are always
searching for models that automatically categorize news quickly and accurately.
Thus, in this research, we utilized different models, namely PhoBERT, SVM,
Logistic Regression, LSTM, Random Forest, and Naive Bayes, to classify news
articles into three categories [negative, neutral, or positive] based on their
titles. The results demonstrated that, after training on a dataset of over 1000
news samples from [Link], the PhoBERT model outperformed the other models with
an accuracy of up to 93%. The code and dataset are available at
[Link]
Keywords: classification · PhoBERT · sentiment analysis · stock articles
1 Introduction
Text classification is a traditional text processing problem in machine
learning. The basic idea is to map a text to a known topic from a finite set of
topics based on the semantics of the text. Classification is typically performed
on text documents, using selected documents and features. However, the
⋆
Copyright © by the paper’s authors. Use permitted under Creative Commons Li-
cense Attribution 4.0 International (CC BY 4.0). In: N. D. Vo, O.-J. Lee, K.-H. N.
Bui, H. G. Lim, H.-J. Jeon, P.-M. Nguyen, B. Q. Tuyen, J.-T. Kim, J. J. Jung, T. A.
Vo (eds.): Proceedings of the 2nd International Conference on Human-centered Arti-
ficial Intelligence (Computing4Human 2021), Da Nang, Viet Nam, 28-October-2021,
published at [Link]
⋆⋆
Corresponding author.
226 Tung et al.
classes are chosen before the experimental analysis, which makes this a
supervised machine learning task. Due to the growing number of documents, the
demand for text classification is expanding, and the tasks are getting
increasingly diverse, such as sentiment analysis of reviews, news
categorization, and so on. An article in a newspaper, for example, could fall
under one or more categories (such as sports, health, or information
technology). Spam filtering in email [1][2], web service classification [3][4],
fake currency identification [5], fake news identification [6], and opinion
mining [7] are also prominent and important applications in this field.
Automatically categorizing text into a certain topic makes it easier to
organize, store, and query documents later.
Forecasting is always a difficult task in the stock market because the market is
highly volatile and dynamic. Many methods have been proposed to forecast the
future direction of the stock market. Financial news, for instance, has an
immediate positive or negative impact on stock prices. Before purchasing a
stock, investors typically evaluate a company based on the activities reported
on its official website and on financial news about the company. However,
investors cannot fully assess such vast amounts of financial news data on their
own. As a result, they require a model that can assist them in quickly sorting
through financial news articles.
In this research, we collected a dataset of over 1000 financial news articles
about the stock market from the website [Link]. Afterward, we used LSTM [8],
PhoBERT [9], SVM [10], and other models to categorize the news articles in this
dataset into three categories [positive, negative, and neutral]. The PhoBERT
model achieved the best accuracy, up to 93% after training.
The remainder of this paper is organized as follows. Section 2 introduces
related works. The proposed model is introduced in Section 3. Section 4
discusses the experimental results, followed by a conclusion in Section 5.
2 Related works
Sentiment analysis, commonly treated as a classification task, aims to predict
the overall sentiment of a text, which could be a tweet or a review of a movie
or product. The main goal is to determine whether the impression conveyed by the
text is positive or negative, in some cases with a score or confidence metric.
In English, many studies on sentiment analysis have been published. For the
problem of sentiment classification, Pang et al. compared multiple supervised
learning algorithms [11], including Naive Bayes [12], KNN [13], Maximum Entropy
Models [14], and Support Vector Machines. They tested several types of features
and achieved the highest accuracy of 82.9% on a corpus of movie reviews. Zhou
utilized the Stanford Sentiment Treebank (SST) dataset [15] for sentiment
classification of movie reviews. Compared with multi-layered CNN [16] and RNN
[17] models, the architecture combining CNN and LSTM models produced better
performance. It reached an accuracy of 87.7% on the two-class (positive and
negative) dataset and 49.2% on the five-class (very positive, positive, neutral,
negative, very negative) dataset. In another sentiment analysis study performed
on the SST dataset, Manish Munikar et al. [18] applied BERT [19], the state of
the art in the NLP [20] field at the time, proposed by Google in 2018. The
architecture contained dropout regularization and a softmax classifier layer on
top of the pre-trained BERT layer. Their model achieved accuracies of 94.7% on
SST-2 and 84.2% on SST-5, surpassing every aforementioned technique.
In Vietnamese, Kieu and Pham [21] studied a corpus of computer product reviews
and proposed a rule-based system for Vietnamese sentiment classification built
on the GATE framework [22]. This approach reached 67.35% precision overall, but
designing the rules seems to be a difficult and time-consuming task. Quan et al.
[23] presented a multi-channel LSTM-CNN model for Vietnamese sentiment analysis
that combines Long Short-Term Memory (LSTM) and CNN. This combination had an
accuracy of 87.72% on the VS dataset and 59.61% on VLSP, which was also proposed
in their research.
In this paper, we propose a Vietnamese stock news sentiment classification
model, which is a novel approach to sentiment analysis in Vietnamese. The
proposed model achieved 93.12% accuracy and was built on a pre-trained PhoBERT,
a state-of-the-art language model for Vietnamese based on the BERT architecture.
In addition, we built a dataset of 1000 titles of financial articles taken from
[Link] and labeled them into three groups [negative, neutral, or positive].
3 Proposed Model
Our proposed model consists of two stages, as shown in Fig. 1. The input is the
headline of a financial article. In the first stage, the headline is
preprocessed into a format that the PhoBERT model can consume, which also
improves accuracy. In the second stage, our PhoBERT-based model assesses the
content of the headline and assigns it to one of three classes represented as
-1, 0, or 1 (i.e., -1 for negative, 0 for neutral, and 1 for positive).
Fig. 1. Proposed model
3.1 Preprocessing
The preprocessing procedure was separated into two phases. In Phase 1, we first
applied VnCoreNLP's named entity recognition [24] to extract all proper nouns,
replacing words that signify a location with the word "loc" and organization
names, stock codes, and person names with the word "name". Punctuation was then
removed to avoid confusing the model at prediction time, which increased the
model's accuracy. Because white space in Vietnamese also separates the syllables
that make up a word, in the last step of Phase 1 we used RDRSegmenter from
VnCoreNLP to segment the input into words. Furthermore, PhoBERT requires
tokenized input, so we applied a BPE tokenizer [25] to each title.
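The entity replacement and punctuation removal of Phase 1 can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: it assumes the title has already been word-segmented and NER-tagged (for example by VnCoreNLP) into (token, tag) pairs with BIO-style tags, and the function name `normalize_title`, the tag-to-placeholder mapping, and the example tags are hypothetical.

```python
import string

# Hypothetical mapping from NER entity type to placeholder token.
# Locations become "loc"; organizations and persons become "name".
PLACEHOLDER = {"LOC": "loc", "ORG": "name", "PER": "name"}


def normalize_title(tagged_tokens):
    """Replace named entities with placeholders and drop punctuation.

    tagged_tokens: list of (token, tag) pairs, where tag is "O" or a
    BIO-style label such as "B-ORG" / "I-ORG" (assumed input format).
    """
    out = []
    for token, tag in tagged_tokens:
        if tag != "O":
            prefix, _, entity_type = tag.partition("-")
            placeholder = PLACEHOLDER.get(entity_type)
            if placeholder is not None:
                # Emit one placeholder per entity, at its first token only.
                if prefix == "B":
                    out.append(placeholder)
                continue
        # Strip leading/trailing ASCII punctuation; skip emptied tokens.
        cleaned = token.strip(string.punctuation)
        if cleaned:
            out.append(cleaned)
    return " ".join(out)


# Example tags are made up for illustration.
title = [
    ("Vĩnh_Hoàn", "B-ORG"), ("(", "O"), ("VHC", "B-ORG"), (")", "O"),
    (":", "O"), ("Doanh_thu", "O"), ("tháng", "O"), ("4/2021", "O"),
    ("đạt", "O"), ("800", "O"), ("tỷ", "O"), ("đồng", "O"),
]
print(normalize_title(title))
```

Only punctuation at the edges of a token is stripped here, so the underscores that join syllables in segmented words and the slash in dates survive.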
In Phase 2, the symbol vocabulary was initialized with the character vocabulary,
and each word was represented as a sequence of characters plus a unique
end-of-word symbol "</s>", which allows the original tokenization to be
restored. We then counted all symbol pairs iteratively and replaced each
occurrence of the most frequent pair ("A", "B") with the new symbol "AB". Each
merge operation produces a new symbol that represents a character n-gram. BPE
does not require a shortlist because frequently occurring character n-grams (or
complete words) are eventually merged into a single symbol. Thus, the size of
the final symbol vocabulary equals the size of the initial vocabulary plus the
number of merge operations.
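The merge procedure can be illustrated with a small sketch of BPE learning on a toy corpus. The corpus, the function names, and the number of merges are assumptions chosen for illustration; PhoBERT's actual subword vocabulary was learned on its pre-training corpus and is simply loaded, not re-learned, in our pipeline.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with one merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent pair num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Toy corpus: each word is a space-separated character sequence that
# ends with the end-of-word symbol "</s>", mapped to its frequency.
toy_vocab = {
    "l o w </s>": 5,
    "l o w e r </s>": 2,
    "n e w e s t </s>": 6,
    "w i d e s t </s>": 3,
}
merges, final_vocab = learn_bpe(toy_vocab, num_merges=3)
print(merges)
print(final_vocab)
```

Each merge adds exactly one symbol to the inventory, which is why the final vocabulary size is the initial character vocabulary plus the number of merges.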
Then we mapped each subword to its corresponding ID in the PhoBERT vocabulary.
Because titles vary in length, we padded the sequences to a common length:
sequences shorter than 125 subwords are padded with 0 at the end, while longer
ones are truncated to 125.
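A minimal sketch of this padding and truncation step is given below. The length 125 and the padding value 0 come from the description above; the function name, the example IDs, and the accompanying attention mask (a common companion to padded inputs that is not described in this paper) are assumptions for illustration.

```python
MAX_LEN = 125  # sequence length stated in the text
PAD_ID = 0     # padding value stated in the text


def pad_or_truncate(ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (padded_ids, mask), both of length max_len.

    Long inputs are cut to max_len; short inputs are right-padded.
    mask is 1 for real subwords and 0 for padding (assumed convention).
    """
    kept = list(ids)[:max_len]
    n_pad = max_len - len(kept)
    mask = [1] * len(kept) + [0] * n_pad
    return kept + [pad_id] * n_pad, mask


# Hypothetical subword IDs for a short title.
short_ids = [5, 17, 42]
padded, mask = pad_or_truncate(short_ids)
print(len(padded), padded[:5], sum(mask))
```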
3.2 Training details
First, we installed the necessary dependencies: the transformers library, the
VnCoreNLP Python wrapper with its word segmentation component (RDRSegmenter),
and fastBPE to convert the input text into a list of subwords. After the
training and test datasets were prepared as described in Section 3.1, a
DataLoader was created to feed data into the model. We then loaded PhoBERTBASE
from HuggingFace's transformers library as the pre-trained model. For
optimization we chose AdamW from the transformers library, an improved version
of the Adam optimizer. We used a batch size of 32 and trained for 10 epochs,
divided into two stages with different learning rates. In the first 5 epochs,
the learning rate was α = 5e-6 so that the loss would converge faster. Once the
first stage was completed, we reduced the learning rate to α = 5e-7 to reach the
smallest possible loss and trained for another 5 epochs, after which the loss on
the validation dataset appeared to stabilize.
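The two-stage learning rate schedule can be written down compactly. The sketch below is framework-free and only encodes the values stated above (two stages of 5 epochs at 5e-6 and 5e-7); the function and constant names are hypothetical, and how the value is handed to the optimizer is left as a comment.

```python
EPOCHS_PER_STAGE = 5
STAGE_LEARNING_RATES = (5e-6, 5e-7)  # stage 1, stage 2 (from the text)


def learning_rate_for_epoch(epoch):
    """Return the learning rate for a 0-based epoch index."""
    stage = epoch // EPOCHS_PER_STAGE
    if not 0 <= stage < len(STAGE_LEARNING_RATES):
        raise ValueError(f"epoch {epoch} is outside the 10-epoch schedule")
    return STAGE_LEARNING_RATES[stage]


schedule = [learning_rate_for_epoch(epoch) for epoch in range(10)]
for epoch, lr in enumerate(schedule, start=1):
    # In a PyTorch training loop, one would typically assign this value to
    # the "lr" entry of each optimizer parameter group before the epoch.
    print(f"epoch {epoch:2d}: lr = {lr:.0e}")
```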
Finally, the softmax classification layer (which has three nodes corresponding
to the three classes in the dataset) outputs the probabilities of the input text
belonging to each class label, with the probabilities summing to 1. This dense
layer is a fully connected layer with the softmax activation function. The
softmax function $\sigma : \mathbb{R}^K \to \mathbb{R}^K$ is given in (1).

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \dots, K \tag{1}$$

where $z = (z_1, \dots, z_K) \in \mathbb{R}^K$ is the softmax layer's
intermediate output (also called logits). The predicted label for the input is
the one whose output node has the highest probability. The output of the
proposed model is represented as -1, 0, or 1. The entire training process was
implemented in the PyTorch [26] framework.
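As a worked illustration of (1), the following sketch computes the softmax of three logits and picks the label with the highest probability. The logit values are made up, and the assignment of output nodes to the labels -1, 0, 1 in that order is an assumption; subtracting the maximum logit before exponentiating is a standard numerical-stability step that leaves the result of (1) unchanged.

```python
import math

LABELS = (-1, 0, 1)  # assumed order of the three output nodes


def softmax(logits):
    """Softmax as in (1), shifted by max(logits) for numerical stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, labels=LABELS):
    """Return the label of the most probable class and all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical logits for one headline.
label, probs = predict_label([-1.2, 0.3, 2.5])
print(label, [round(p, 4) for p in probs])
```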
3.3 Method
BERT. BERT stands for Bidirectional Encoder Representations from Transformers.
It is a transformer-based [27] machine learning technique developed by Google
for pre-training in natural language processing (NLP). Jacob Devlin and his
colleagues created and published BERT in 2018. BERT was originally released as
two English models. The first, BERTBASE, consists of 12 encoders with 12
bidirectional self-attention heads. The second, BERTLARGE, consists of 24
encoders with 16 bidirectional self-attention heads. BERT is pre-trained on
unlabeled text: 800 million words from BooksCorpus and 2,500 million words from
Wikipedia.
BERT achieved strong results on more than 10 natural language processing tasks.
It raised the GLUE benchmark score to 80.5%, pushed MultiNLI accuracy to 86.7%,
and gained 5.1 points absolute in SQuAD v2.0 Test F1, among others. With L
denoting the number of transformer blocks, H the size of the embedding vector,
and A the number of heads in each multi-head attention layer, BERT has two
architectures:
– BERTBASE (L=12, H=768, A=12): Total parameters are 110 million.
– BERTLARGE (L=24, H=1024, A=16): Total parameters are 340 million.
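The parameter totals can be roughly reproduced from L and H. The sketch below is a back-of-the-envelope count under assumptions that are not stated in this paper: a 30,522-token WordPiece vocabulary, 512 positions, 2 segment types, a feed-forward inner size of 4H, and a pooler layer, as in the publicly released English BERT models. The function name is hypothetical.

```python
def bert_parameter_count(L, H, vocab_size=30522, max_positions=512,
                         type_vocab=2):
    """Approximate number of weights in a BERT encoder with pooler."""
    ff = 4 * H  # assumed feed-forward inner size
    # Token, position and segment embeddings, plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * H + 2 * H
    # Query, key, value and output projections (weights + biases).
    attention = 4 * (H * H + H)
    # Two dense layers of the feed-forward block.
    feed_forward = (H * ff + ff) + (ff * H + H)
    # Two LayerNorms per block (scale + bias each).
    layer_norms = 2 * (2 * H)
    per_block = attention + feed_forward + layer_norms
    pooler = H * H + H
    return embeddings + L * per_block + pooler


base = bert_parameter_count(L=12, H=768)
large = bert_parameter_count(L=24, H=1024)
print(f"BERT base:  {base / 1e6:.1f}M parameters")
print(f"BERT large: {large / 1e6:.1f}M parameters")
```

The number of heads A does not enter the count: the heads split the H-dimensional projections among themselves, so changing A changes how the weights are grouped, not how many there are.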
PhoBERT. The release of the BERT model marked a watershed moment in the NLP
field. Following its public release, a slew of open-source BERT training
programs sprang up, and numerous monolingual and multilingual pre-trained BERT
models are now in common use. Among them, PhoBERT was trained specifically for
Vietnamese and released by VinAI Research in March 2020.
PhoBERT is based on the design and approach of RoBERTa [28], which was
introduced by Facebook in 2019 and is an improvement over the original BERT.
PhoBERT was trained on about 20GB of data, comprising approximately 1GB from the
Vietnamese Wikipedia corpus and the remaining 19GB from a Vietnamese news
corpus. This type of data is also ideal for training a model like BERT. PhoBERT,
like BERT, is available in two versions: PhoBERTBASE with 12 transformer blocks
and PhoBERTLARGE with 24 transformer blocks.
VnCoreNLP. VnCoreNLP (A Vietnamese Natural Language Processing Toolkit) is a
Java toolkit designed to aid Vietnamese NLP research. Through essential NLP
components such as word segmentation, POS tagging, and NER, VnCoreNLP provides
extensive linguistic annotations. Several standard Vietnamese datasets have been
published in the course of NLP research. In early 2013, the first VLSP
evaluation campaign used datasets for word segmentation and POS tagging. In
2014, a high-quality dependency treebank was published, and a NER dataset was
published for the 2016 VLSP evaluation campaign. The architecture of the system
is depicted in Fig. 2.
Fig. 2. Pipeline architecture of VnCoreNLP
4 Experimental results
4.1 Dataset
To use PhoBERT to evaluate and categorize the impact of news, we built a dataset
of 1000 titles of financial articles taken from [Link] and, with the help of
experts, labeled them into three groups [negative, neutral, or positive]. The
dataset contains 187 articles with a negative impact, 248 articles with no
impact, and 565 articles with a positive impact. We then divided the dataset
into three sets: 80% for training, 10% for validation, and 10% for testing. The
training set was used to train the model, the validation set was used to tune
the hyper-parameters, and the final model was evaluated on the test set.
Examples from our dataset are shown in Table 1.
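The 80/10/10 division can be sketched as below. Whether the split was shuffled or stratified is not specified above, so the random shuffle, the seed, and the function name are assumptions for illustration; the class counts are the ones reported for the dataset, with placeholder labels standing in for the actual titles.

```python
import random


def split_dataset(items, train=0.8, val=0.1, seed=42):
    """Shuffle and split items into train / validation / test lists."""
    items = list(items)
    rng = random.Random(seed)  # fixed seed (assumed) for reproducibility
    rng.shuffle(items)
    n_train = int(round(len(items) * train))
    n_val = int(round(len(items) * val))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


# Placeholder labels with the reported class counts.
labels = [-1] * 187 + [0] * 248 + [1] * 565
train_set, val_set, test_set = split_dataset(labels)
print(len(train_set), len(val_set), len(test_set))
```

With 56.5% of the titles labeled positive, a classifier that always predicts the positive class would already reach about 56.5% accuracy on the full dataset, which is a useful reference point when reading the results.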
4.2 Result
We experimented with 6 different models on the same dataset, using various
preprocessing techniques to achieve the best possible results. The outcomes are
reported in Table 2.
As Table 2 shows, our model, PhoBERT, outperformed other popular and
sophisticated NLP models with 93.12% accuracy, 10.54 percentage points higher
than the second-best approach, Logistic Regression. Our model also achieved the
best performance on the other metrics: precision, recall, and F1 score.
Table 1. Dataset examples
Label | Titles | Titles (English)
Positive (1) | Vĩnh Hoàn (VHC): Doanh thu tháng 4/2021 đạt 800 tỷ đồng, các thị trường xuất khẩu đồng loạt tăng tốt | Vinh Hoan (VHC): April 2021 revenue reached VND 800 billion, export markets simultaneously increased well
Neutral (0) | Lịch sự kiện và tin vắn chứng khoán ngày 17/5 | Calendar of events and short stock news on May 17
Negative (-1) | Khối ngoại tiếp tục bán ròng gần 630 tỷ đồng trong phiên 18/5 | Foreign investors continued to net sell nearly VND 630 billion in the May 18 session
Table 2. Experimental results (%) of our model compared to other models
Model                 Accuracy  Precision  Recall  F1 score
Logistic Regression   82.58     82.83      78.64   80.00
SVM                   80.09     80.39      76.66   77.91
Naive Bayes           79.60     80.36      76.00   77.22
Random Forest         80.10     83.64      74.25   77.04
LSTM                  75.56     72.67      73.42   72.22
PhoBERT               93.12     90.97      92.64   90.63
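For reference, the four metrics in Table 2 can be computed from true and predicted labels as sketched below. The averaging scheme over the three classes is not stated in this paper; the sketch assumes macro averaging (the unweighted mean of per-class scores), and the function name and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, labels=(-1, 0, 1)):
    """Accuracy plus macro-averaged precision, recall and F1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example, not taken from our dataset.
y_true = [1, 1, 1, 0, 0, -1]
y_pred = [1, 1, 0, 0, 0, -1]
metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in metrics.items()})
```

Under macro averaging the F1 score is the mean of per-class F1 values, so it can fall below both the averaged precision and the averaged recall, as it does in this toy example.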
5 Conclusion
In this research, sentiment classification was performed on Vietnamese stock
articles collected from the site [Link]. The whole dataset consists of 1000
titles divided into positive, neutral, and negative news. Because white space in
Vietnamese also separates the syllables that make up a word, we used
RDRSegmenter from VnCoreNLP (a word segmentation library published by the author
of PhoBERT) to segment the input into words. We then used the BPE encoder to
convert the text into a list of subwords and mapped each subword to its ID in
the PhoBERT vocabulary. The proposed model employed PhoBERT, a state-of-the-art
language model for Vietnamese, and the entire training process was implemented
in PyTorch. As a result, our approach achieved 93.12% accuracy, surpassing other
popular and sophisticated NLP models on the same dataset.
As previously stated, financial news can directly impact stock prices, so our
proposed model could be used to assist with stock price forecasting problems
using machine learning. In future work, we also want to explore the effect of
using word embeddings on sentiment classification. Furthermore, we aim to in-
vestigate multiclass categorization of news data using various Deep Learning
models, which will comprise classes such as the economy, sports, health, and
technology. Also, we intend to extend sentiment classification to more domains
by crawling data from numerous Vietnamese websites, such as product reviews,
hotel reviews, and book reviews.
References
1. Bhowmick, A., Hazarika, S.: Machine Learning for E-mail Spam Filtering: Review,
Techniques and Trends. ArXiv. (2016) [Link]
2. J, R.K., G, M., P, S.: Email Spam Detection using Machine Learning Techniques.
IARJSET. 8, 189–193 (2020). [Link]
3. Shafi, S., Qamar, U.: [WiP] Web Services Classification Using an Improved Text
Mining Technique, 2018 IEEE 11th Conference on Service-Oriented Computing and
Applications (SOCA). 210-215 (2018) [Link]
4. Crasso, M., Zunino, A., Campo, M.: AWSC: An approach to Web service classifi-
cation based on machine learning techniques. INTELIGENCIA ARTIFICIAL. 12,
(2008). [Link]
5. P Gayathri: Texture Classification for Fake Indian Currency Detec-
tion. International Journal of Engineering Research and. V9, (2020).
[Link]
6. Jain, A., Kasbe, A.: Fake News Detection, 2018 IEEE International Students’ Con-
ference on Electrical, Electronics and Computer Science (SCEECS). 1-5 (2018)
[Link]
7. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends
in Information Retrieval. 2, 1–135 (2008).
8. Staudemeyer, R., Morris, E.: Understanding LSTM - a tutorial into Long
Short-Term Memory Recurrent Neural Networks. (2019).
9. Nguyen, D., Nguyen, A.: PhoBERT: Pre-trained language models for Vietnamese.
Association for Computational Linguistics (2020).
10. M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt and B. Scholkopf: Support vector
machines, in IEEE Intelligent Systems and their Applications, vol. 13, no. 4, pp.
18-28 (1998), doi: 10.1109/5254.708428.
11. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends
in Information Retrieval. 2, 1–135 (2008).
12. Zhang, H., Li, D.: Naïve Bayes Text Classifier, 2007 IEEE International
Conference on Granular Computing (GRC 2007). 708-708 (2007).
[Link]
13. Cunningham, P., Delany, S.J.: k-Nearest Neighbour Classifiers: 2nd Edition (with
Python examples). arXiv:2004.04523 [cs, stat]. (2020).
14. Ziebart, B., Maas, A., Bagnell, J., Dey, A.: Maximum Entropy Inverse Reinforce-
ment Learning. (2008)
15. Zhang, L., Wang, S., Liu, B.: Deep Learning for Sentiment Analysis : A Survey.
arXiv:1801.07883 [cs, stat]. (2018).
16. Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X.,
Wang, G., Cai, J., Chen, T.: Recent advances in convolutional neural networks. Pat-
tern Recognition. 77, 354–377 (2018). [Link]
17. Sherstinsky, A.: Fundamentals of Recurrent Neural Network (RNN) and Long
Short-Term Memory (LSTM) network. Physica D: Nonlinear Phenomena. 404,
132306 (2020). [Link]
18. Munikar, M., Shakya, S., Shrestha, A.: Fine-grained Sentiment Classification using
BERT. (2019)
19. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training
of Deep Bidirectional Transformers for Language Understanding. (2019).
20. Gelbukh, A.: Natural language processing. Fifth International Conference on Hy-
brid Intelligent Systems (HIS’05). (2005) [Link]
21. Kieu, B.T., Pham, S.B.: Sentiment Analysis for Vietnamese. 2010 Second Interna-
tional Conference on Knowledge and Systems Engineering. (2010)
22. Huynh, T., Hoang, K.: GATE framework based metadata extraction from scientific
papers. 2010 International Conference on Education and Management Technology.
(2010). [Link]
23. Vo, Q.-H., Nguyen, H.-T., Le, B., Nguyen, M.-L.: Multi-channel LSTM-
CNN model for Vietnamese sentiment analysis, 2017 9th International
Conference on Knowledge and Systems Engineering (KSE). 24-29. (2017)
[Link]
24. Vu, T., Quoc Nguyen, D., Nguyen, D., Dras, M., Johnson, M.: VnCoreNLP: A
Vietnamese Natural Language Processing Toolkit. (2018).
25. Wang, C., Cho, K., Gu, J.: Neural Machine Translation with Byte-Level Subwords.
Proceedings of the AAAI Conference on Artificial Intelligence. 34, 9154–9160 (2020).
[Link]
26. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G.,
Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang,
E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L.,
Bai, J., Chintala, S.: PyTorch: An Imperative Style, High-Performance Deep
Learning Library. (2019).
27. Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in Transformer.
28. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M.,
Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT
Pretraining Approach. (2019).