
End-to-End Multimodal Sentiment Analysis

The paper presents a novel end-to-end model for Multimodal Aspect-Based Sentiment Analysis (MABSA) that integrates text and images using the Boosting technique (RoBERTa-LGBM) to improve sentiment analysis accuracy. The model demonstrates superior performance on Twitter datasets, achieving higher accuracy compared to existing methods. The study emphasizes the importance of combining textual and visual data to gain comprehensive insights into customer opinions and preferences.


2023 Seventh International Conference on Image Information Processing (ICIIP)

A Transformer Model for End-to-End Image and Text Aspect-Based Sentiment Analysis
2023 Seventh International Conference on Image Information Processing (ICIIP) | 979-8-3503-7140-6/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICIIP61524.2023.10537622

Amit Chauhan
Department of Computer Science, JUIT, Waknaghat, Solan, India
chauhanamit37@[Link]

Aman Sharma
Department of Computer Science, JUIT, Waknaghat, Solan, India
[Link]@[Link]

Rajni Mohana
Department of Computer Science, Amity School of Engineering, Mohali, Punjab, India
[Link]@[Link]

Abstract—Opinion mining is an increasingly important field with tremendous potential, and it includes End-to-End Multimodal Aspect-Based Sentiment Analysis (MABSA). MABSA aims to identify aspect-sentiment pairs from a combination of text and images. However, many MABSA methods do not incorporate aspect and sentiment information in their textual and visual representations, which limits their ability to account for the distinct effects of visual elements on each word or aspect. In this paper, we propose an end-to-end model for multimodal tasks that combines text and images using the Boosting technique (RoBERTa-LGBM). We achieved state-of-the-art results on both datasets, with accuracy gains of 1.55% on the Twitter 2015 dataset and 1.78% on the Twitter 2017 dataset.

Index Terms—Aspect Based Sentiment Analysis, LGBM, BERT, Ensemble

I. INTRODUCTION

Sentiment analysis (SA) involves identifying and extracting subjective information from text data and classifying opinions as positive, negative, or neutral [1]. SA has gained immense significance in recent years owing to the vast amount of user-generated content on platforms such as social media and product review sites. Aspect-based sentiment analysis (ABSA) is an advanced form of sentiment analysis that considers the different aspects or features of a product or service about which a user may be expressing an opinion [2]. As an extension of SA, ABSA measures more than how a product or service is perceived overall: it aims to identify the sentiment attached to each distinct component or attribute of the product or service. Such attributes may include cost, quality, customer support, and timeliness. By examining these particular facets, ABSA offers a more in-depth understanding of consumer preferences and attitudes, which businesses looking to improve their goods or services may find valuable [3]. ABSA thus identifies opinions and emotions in text data, and businesses use it to understand customer feedback and reviews.

In ABSA, the text is analyzed at a more detailed level to identify the sentiment polarity associated with each aspect or feature of a product or service about which a user may express an opinion. For instance, in a restaurant review, the aspects could be the quality of the food, the service, the ambience, and the price. ABSA draws on rule-based, unsupervised, and supervised methods to extract insights into customer opinions and preferences. Rule-based methods use a set of predefined rules to identify the aspects and their corresponding sentiments in the text [4]. Clustering algorithms group similar aspects and sentiments, while supervised machine learning algorithms learn to predict the aspect and sentiment of a text from labelled data.

Fig. 1. Steps Involved in Aspect Based Sentiment Analysis

Figure 1 shows the steps involved in ABSA. State-of-the-art ABSA models use deep learning techniques such as RNNs, CNNs, and Transformer models. They can process complex and variable-length inputs and capture context and dependencies between different aspects and sentiments in the text.

Generally, ABSA is used in areas including e-commerce, healthcare, and online reviews. It enables businesses to monitor customer feedback and improve product development.

Multimodal Aspect-Based Sentiment Analysis (MABSA) is a recently popular field in sentiment analysis. Its primary goal is to determine the sentiment polarity of the various characteristics or qualities of a product or service by fusing textual and visual data. Because it incorporates data from several modalities, MABSA is more complicated than regular ABSA. However, it may also give businesses a more accurate and comprehensive insight into the preferences and opinions of their customers, which could enable them to enhance their products or services.

MABSA offers information that is not contained in the text alone. This can include information about a product's appearance and quality or a service provider's behaviour, which can be used to deduce the sentiment behind a particular feature. Furthermore, visual data can help clarify the polarity of ambiguous words or phrases in the text, increasing the precision of sentiment analysis.

A. Motivation

The application of Multimodal Aspect-Based Sentiment Analysis (MABSA) has attracted much interest lately, and researchers are conducting many investigations to create more efficient MABSA models. Despite these efforts, several difficulties remain. Examples include accurately aligning text and visual data, recognising relevant elements and emotions, and combining data from multiple sources. Effective MABSA models can benefit industries such as e-commerce, hospitality, and entertainment by providing valuable insights into customer opinions, feedback, and preferences. Further research and development is therefore necessary to create more accurate and efficient MABSA models.

B. Paper Organisation

This article is organised as follows. The introduction and motivation come first, followed by a section that reviews prior research, a preliminaries section that introduces key terms, a proposed work section that describes the research procedure and data collection methods, a results section that presents the study's findings, and a conclusion that summarises the findings, assesses their significance and implications, and offers possible directions for further investigation.

II. RELATED WORK

Yang et al. [5] introduced a new task named Multimodal Entity-Category-Sentiment Triple Extraction (MECSTE), which aims to extract entities, their corresponding fine-grained categories, and sentiment polarities from text simultaneously. They created two datasets for this task from two existing Twitter corpora and developed a generative multimodal approach using a pre-trained sequence-to-sequence model. The authors propose transforming entity-category-sentiment triples into natural language sentences, framing MECSTE as a paraphrase generation problem. They demonstrate the superiority of their method through extensive experiments on annotated Twitter datasets.

Yang et al. [6] introduced a new dataset called Multimodal Aspect-Category Sentiment Analysis (MACSA), which contains over 21,000 text and image pairs with detailed annotations for both textual and visual content. The dataset uses the aspect category as a pivot to align the elements between the two modalities. Using this dataset, the authors proposed the Multimodal ACSA task and a graph-based model called the Multimodal Graph-based Aligned Model (MGAM), which employs a fine-grained cross-modal fusion method. The experimental results suggest that the proposed method can serve as a baseline for future research on this dataset.

Wang et al. [7] proposed an interactive fusion network that uses recurrent attention for multimodal aspect-based sentiment analysis. First, two encoders encode the text and image data. Then, an attention mechanism obtains the semantic information of the image at the token level. After that, a GRU filters out the noise in the image and fuses information from the different modalities. Finally, a decoder with recurrent attention progressively learns aspect-specific sentiment features for classification. Results on two Twitter datasets demonstrate that the proposed method outperforms all baselines.

De Bruyne et al. [8] presented a new dataset for Aspect-Based Emotion Analysis (ABEA) and explored the potential of multimodal co-reference resolution within an ABEA framework. The dataset contains 4,900 comments on 175 images, annotated with aspect and emotion categories along with the emotional dimensions of valence and arousal. The initial experiments indicate that ABEA does not benefit from multimodal co-reference resolution and that aspect and emotion classification require only textual information. However, image recognition could be crucial when more specific information about aspects is required.

Ektefaie et al. [9] employed graph-based artificial intelligence methods to combine various modalities by leveraging cross-modal dependencies through geometric relationships. The researchers combined different datasets using graphs and fed them into advanced multimodal architectures, which were classified as image-focused, knowledge-based, or language-oriented models. The paper also presents a road map for multimodal graph learning, which can be used to explore existing techniques and develop new models.

III. PRELIMINARIES

A. Ensemble

Ensemble learning is an effective method for increasing model accuracy in machine learning and deep learning. In essence, it combines the predictions of several models, each with its own advantages and disadvantages, to produce a prediction that is more trustworthy and accurate. Imagine it as a team of specialists working together to find a solution to a dilemma. Because each expert has a unique set of abilities and expertise, when they collaborate, they can produce a better solution than any
one of them could alone. Ensemble learning can be applied in several ways, including bagging, boosting, and stacking. Bagging trains several models on distinct portions of the data, whereas boosting concentrates on the samples that were previously misclassified. Stacking combines the predictions of several models.

Ensemble learning is a flexible method that works well for a wide range of machine learning problems. In this article, we applied boosting, a popular and practical technique for raising the accuracy of machine learning models.

B. LGBM

LGBM, which stands for Light Gradient Boosting Machine [10], is a well-known machine learning method for classification, regression, and ranking tasks, noted for its speed and accuracy. Because it can handle large datasets with many features while remaining efficient in training and prediction, data scientists use it in competitions and real-world applications.

One of LGBM's strongest points is the ease with which it can be tuned and optimised, particularly when dealing with missing data and categorical features. It is therefore a good option for those who wish to get the most out of their data analysis.

Another feature that sets LGBM apart is its capacity to handle imbalanced data. For instance, in fraud or anomaly detection, the data may be highly imbalanced. Because LGBM's objective function is based on gradient boosting, which can be tailored to accommodate various data imbalances, it can handle such cases well.

C. RoBERTa

RoBERTa is a natural language processing model developed by Facebook AI. Its strong performance across a range of Natural Language Processing (NLP) tasks has made it very popular. RoBERTa is an enhanced version of BERT, the original Bidirectional Encoder Representations from Transformers model, and has attained state-of-the-art results.

Compared to BERT, RoBERTa was trained for a substantially longer period of time and on a much larger dataset. These elements play a part in its exceptional NLP performance [10] and allowed it to surpass its predecessor. It also brought several significant changes to the training procedure, such as the elimination of the next sentence prediction task and the introduction of dynamic masking, which contributed to the diversification of the training data. One of RoBERTa's primary benefits is its adaptability across a range of NLP tasks, including named entity recognition, text categorization, and question answering. Because of its outstanding performance, it is frequently chosen for NLP applications, and the NLP community is still actively researching this topic.

IV. PROPOSED WORK

Fig. 2. The Architecture of the study

Our study involved several steps, as shown in Figure 2. First, we collected both Twitter datasets (2015 and 2017) and preprocessed them. Next, we generated captions and paired them with the corresponding images. Following this, we paired aspects with their respective sentiments. In the final stage, we evaluated the model using accuracy and F1-Measure.

A. Pre-processing

The preprocessing steps are shown in Figure 3. Various techniques were used to clean the study's data [11]. We first divided the text into smaller, meaningful units called tokens, a technique called tokenization. Then, we removed common words such as "and," "the," "in," "of," "to," "is," and "a," because they carry little useful information. Next, we made the text easier to work with by converting all letters to lowercase and removing punctuation. Finally, we used GloVe word embeddings to create word vectors that helped us analyze the data more effectively [12].

B. Caption Generation

Generating captions refers to the task of creating a written description that precisely represents the contents of an image or video. With the advancements in deep learning and computer vision technologies, caption generation has emerged as a
critical research area in artificial intelligence [13]. The ultimate goal of caption generation is to create image and video descriptions that are similar to those made by humans. This can be especially beneficial for people with visual impairments or those who seek more information and context while browsing through vast amounts of visual content. Caption generation can also be valuable in a variety of applications, such as image search, product recommendations, and social media platforms, where images and videos are frequently shared and viewed.

Fig. 3. Steps involved in Preprocessing

C. Image-Text Pairing

An important component of the multimodal ABSA experience is matching pertinent images with textual content. The ultimate goal of this approach is to give users a thorough grasp of the material that is delivered, which increases engagement with the content and enhances the user's cognitive experience.

Multimodal ABSA does this by analysing the content of the image and matching it with pertinent text using sophisticated algorithms. This matching is intended to give users the most relevant and accurate information possible. For example, if a picture shows someone holding a coffee cup, the accompanying text might describe the different kinds of coffee that the store sells [13]. Pairing text with images is also a useful way to improve content accessibility for people with impairments: users with visual impairments can still comprehend the information and context if an image description is provided. The multimodal ABSA image-text matching feature is a potent tool that effectively combines textual and visual data, enhancing the user experience in general.

D. Dataset

Twitter 15 and Twitter 17 are widely used multimodal datasets in ABSA research [14]. Text and image data have recently been added to datasets. Researchers now have access to a wider range of data, which helps them better grasp the comments and viewpoints of customers.

In 2018, Wang et al. presented the multimodal Twitter 15 dataset. This collection includes more than 14,000 tweets with aspect categories, sentiment polarities, and pertinent photos. The photographs were gathered with Google Image Search, guided by the aspect categories listed in the tweets. The dataset is divided into training, validation, and test sets: eighty percent of the data are in the training set, and ten percent are in each of the validation and test sets.

Fig. 4. Sample of dataset

The multimodal Twitter 17 dataset was presented by Huang et al. in 2019. It comprises more than 6,000 tweets annotated with aspect categories, sentiment polarities, and photos. Based on the aspect categories specified in the tweets, these photographs were collected via image search engines and Twitter's APIs. Eighty percent of the dataset is in the training set, ten percent in the validation set, and ten percent in the test set.

Because these multimodal datasets capture both the textual and visual parts of the data, they offer researchers a more thorough understanding of customer reviews and opinions, which makes them indispensable. They can also help improve ABSA models by offering further data that can be used to define aspect categories and emotion polarities more accurately.

Research on ABSA can benefit greatly from the multimodal data provided by the Twitter 15 and Twitter 17 datasets. These datasets offer a realistic and varied set of data that can be used to create sophisticated models for analysing customer opinions and feedback on social media platforms.
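The parts of this section that the text specifies concretely (the text clean-up of the pre-processing step, the 80/10/10 split of each dataset, and the accuracy and F1-Measure used for evaluation) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `preprocess`, `split_80_10_10`, `accuracy`, and `macro_f1` are hypothetical, the shuffle seed is arbitrary, tokenization is assumed to be whitespace-based, lowercasing is applied before stop-word removal so that matching is case-insensitive, and F1-Measure is assumed to be macro-averaged over the sentiment classes. GloVe embeddings, caption generation, and the RoBERTa-LGBM model itself are outside the sketch.

```python
import random
import string

# The seven stop words quoted in the pre-processing description.
STOP_WORDS = {"and", "the", "in", "of", "to", "is", "a"}


def preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words.

    Note: string.punctuation includes '#' and '@', so hashtag and mention
    markers are stripped as well (an assumption, not stated in the paper).
    """
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in STOP_WORDS]


def split_80_10_10(samples, seed=0):
    """Shuffle a copy of the samples and split it into train/validation/test."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10   # eighty percent for training
    n_val = len(items) // 10         # ten percent for validation
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]   # remainder (about ten percent) for testing
    return train, val, test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(preprocess("The service is GREAT, and the price is fair!"))
    train, val, test = split_80_10_10(range(100))
    print(len(train), len(val), len(test))
    gold = ["pos", "neg", "neu", "pos"]
    pred = ["pos", "neg", "pos", "pos"]
    print(accuracy(gold, pred), macro_f1(gold, pred))
```

Macro averaging is chosen here because it weights each sentiment class equally, which matters when one polarity dominates the tweets; if the reported F1-M was computed differently, only `macro_f1` would need to change.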

V. RESULTS

This section discusses the analysis and results of our proposed framework. We compared our model with existing models on accuracy and F1 metrics.

For analysis, we combined RoBERTa with boosting in the form of the Light Gradient Boosting Machine (LGBM). First, we used RoBERTa, a pre-trained neural network model for natural language processing, to train our model. We then applied LGBM to the RoBERTa outputs. Boosting was used to raise the accuracy and performance of the model. By combining these two methods, we were able to analyse the data efficiently and derive trustworthy conclusions.

For the Twitter 2015 and 2017 datasets, we obtained state-of-the-art results. In particular, accuracy increased by 1.55 percentage points on the Twitter 2015 dataset (78.58 versus 77.03) and by 1.78 percentage points on the Twitter 2017 dataset (71.55 versus 69.77), relative to TomBERT (Faster R-CNN), the strongest baseline in Table I. Table I compares our model with the baseline models on both datasets.

TABLE I
COMPARISON WITH BASELINE MODELS (ACCURACY AND F1-M IN %)

                               Twitter-2015         Twitter-2017
Models                         Accuracy   F1-M      Accuracy   F1-M
TomBERT (ResNet) [15]          76.60      71.57     69.42      67.70
TomBERT (Faster R-CNN) [15]    77.03      72.85     69.77      67.59
LGBM+RoBERTa (Proposed)        78.58      72.88     71.55      69.20

Figures 5 and 6 show the results for Twitter 2015 and Twitter 2017, comparing the baseline models with our proposed model.

Fig. 5. Twitter 2015 Dataset Results

Fig. 6. Twitter 2017 Dataset Results

VI. CONCLUSION

In this study, we have presented an end-to-end aspect-based sentiment analysis approach using a transformer model. The main conclusion is that a boosting strategy can outperform earlier models. We achieved state-of-the-art results on both the Twitter 2015 and Twitter 2017 datasets: accuracy improved by 1.55% on Twitter 2015 and by 1.78% on Twitter 2017. At the moment, emoticon analysis is restricted to a particular set of information. However, we acknowledge the significance of understanding how these symbols are used in practice. Our goal is therefore to expand the scope of our study to a greater variety of datasets so that we can learn more about the subtleties of emoticon usage.

REFERENCES

[1] Y. Wang, G. Huang, J. Li, H. Li, Y. Zhou, and H. Jiang, “Refined global word embeddings based on sentiment concept for sentiment analysis,” IEEE Access, vol. 9, pp. 37075–37085, 2021.
[2] H. Silva, E. Andrade, D. Araújo, and J. Dantas, “Sentiment analysis of tweets related to SUS before and during COVID-19 pandemic,” IEEE Latin America Transactions, vol. 20, no. 1, pp. 6–13, 2022.
[3] J. He, A. Wumaier, Z. Kadeer, W. Sun, X. Xin, and L. Zheng, “A local and global context focus multilingual learning model for aspect-based sentiment analysis,” IEEE Access, vol. 10, pp. 84135–84146, 2022.
[4] Y. Bie and Y. Yang, “A multitask multiview neural network for end-to-end aspect-based sentiment analysis,” Big Data Mining and Analytics, vol. 4, no. 3, pp. 195–207, 2021.
[5] L. Yang, J. Wang, J.-C. Na, and J. Yu, “Generating paraphrase sentences for multimodal entity-category-sentiment triple extraction,” Knowledge-Based Systems, vol. 278, p. 110823, 2023.
[6] H. Yang, Y. Zhao, J. Liu, Y. Wu, and B. Qin, “MACSA: A multimodal aspect-category sentiment analysis dataset with multimodal fine-grained aligned annotations,” arXiv preprint arXiv:2206.13969, 2022.
[7] J. Wang, Q. Wang, Z. Wen, X. Liang, and R. Xu, “Interactive fusion network with recurrent attention for multimodal aspect-based sentiment analysis,” in CAAI International Conference on Artificial Intelligence. Springer, 2022, pp. 298–309.
[8] L. De Bruyne, A. Karimi, O. De Clercq, A. Prati, and V. Hoste, “Aspect-based emotion analysis and multimodal coreference: A case study of customer comments on Adidas Instagram posts,” in Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 574–580.
[9] Y. Ektefaie, G. Dasoulas, A. Noori, M. Farhat, and M. Zitnik, “Multimodal learning with graphs,” Nature Machine Intelligence, vol. 5, no. 4, pp. 340–350, 2023.
[10] P. Thiengburanathum and P. Charoenkwan, “SETAR: Stacking ensemble learning for Thai sentiment analysis using RoBERTa and hybrid feature representation,” IEEE Access, vol. 11, pp. 92822–92837, 2023.
[11] J. Khan, A. Alam, and Y. Lee, “Intelligent hybrid feature selection for textual sentiment classification,” IEEE Access, vol. 9, pp. 140590–140608, 2021.
[12] B. Jabir, I. De La Torre Díez, E. F. B. Thompson, D. L. R. Vargas, and G. K. Castilla, “Ensemble partition sampling (EPS) for improved multi-class classification,” IEEE Access, vol. 11, pp. 48221–48235, 2023.
[13] B. L. V. S. Aditya and S. N. Mohanty, “Heterogenous social media analysis for efficient deep learning fake-profile identification,” IEEE Access, vol. 11, pp. 99339–99351, 2023.
[14] L. Xu and W. Wang, “Improving aspect-based sentiment analysis with contrastive learning,” Natural Language Processing Journal, vol. 3, p. 100009, 2023.
[15] J. Yu and J. Jiang, “Adapting BERT for target-oriented multimodal sentiment classification,” IJCAI, 2019.
