Evaluating LLMs for Text Summarization

A Comprehensive Evaluation of Large Language Models for Summarization Evaluation

Bhaskar Vivek Agarwal Deepeshwar Manish Kumar Soum Nag
B.Tech in Computer Science, BML Munjal University, Gurugram, India

Abstract—This paper discusses the challenges in evaluating the quality of text summarization. It highlights the divergence between existing automatic evaluation metrics such as BLEU/ROUGE and human assessment, which can consider both objective criteria (like grammar and correctness) and subjective ones (such as informativeness, succinctness, and appeal). To address these challenges, the paper proposes a new evaluation framework based on large language models (LLMs). This framework aims to provide a comprehensive evaluation by comparing the generated text with the reference text from both objective and subjective perspectives. The key aspects of the proposed approach are: (1) modeling objective and subjective dimensions of the generated text using a role-player prompting mechanism; (2) introducing a context-based prompting mechanism to generate dynamic role-player profiles based on the input context; and (3) designing a multi-role-player prompting technology that uses batch prompting and integrates multiple outputs into the final evaluation results. Experimental results on three real-world summarization datasets show that the proposed model is highly competitive and has a very high consistency with human annotators.

Index Terms—Text Summarization, Evaluation Metrics, BLEU/ROUGE, Human Evaluation, Objective Criteria, Subjective Criteria, Large Language Models (LLMs), Evaluation Framework, Role-Player Prompting, Context-Based Prompting, Multi-Role-Player Prompting, Batch Prompting, Consistency with Human Annotators, Automatic Evaluation, Comprehensive Evaluation.

I. INTRODUCTION
Reinforcement Learning from Human Feedback (RLHF) is a methodology in which human preferences are used to train and align large language models (LLMs) with desired behaviors and responses. The fundamental concept behind RLHF is to leverage human judgment to guide the training process, so that the models produce outputs that align with human expectations and ethical standards. This approach is particularly crucial given the complex and nuanced nature of many tasks that LLMs are deployed to handle, such as natural language understanding, problem-solving, and generation. [...] in the field of aligning large language models with human preference.

Text summarization has numerous applications across various research and practical fields. Recent studies have highlighted a significant gap between existing metrics like BLEU, ROUGE, and BERTScore, and human annotations. While traditional overlap-based and model-based metrics can measure lexical or semantic similarity between generated text and reference text, they fail to capture specific dimensions such as coherence, grammar, and interestingness. As shown in Figure 1, the summarization task exposes the limitations of metrics like BLEU and ROUGE, which are inadequate for assessing the true quality of the text beyond a certain threshold. To align human evaluation with automatic metrics, we face two main challenges: 1) how to model objective evaluation criteria such as coherence and grammar; 2) how to model subjective evaluation criteria such as interestingness, comprehensiveness, and usefulness from the users' perspective. Natural language has various modes of expression for the same concept, so assessing its quality based on a few static criteria is hard.

[...] offers a streamlined, reinforcement-free alternative that addresses many of the scalability and stability issues associated with traditional methods.
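To make the limitation of overlap-based metrics concrete, the following minimal sketch computes a simplified ROUGE-1 F1 score (clipped unigram overlap, whitespace tokenization, no stemming) in plain Python. It is an illustration under those simplifying assumptions, not the official ROUGE implementation, and the three example sentences are invented for this sketch.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercase whitespace tokens."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Invented examples: a faithful paraphrase versus an ungrammatical word shuffle.
reference = "the patient booked a doctor appointment for monday"
paraphrase = "a medical visit was scheduled for the start of the week"
shuffled = "monday for appointment doctor a booked patient the"

paraphrase_score = rouge1_f1(paraphrase, reference)
shuffled_score = rouge1_f1(shuffled, reference)
print(f"paraphrase: {paraphrase_score:.2f}  shuffled: {shuffled_score:.2f}")
```

In this toy case the word shuffle reuses every reference token and therefore receives a perfect unigram score, while the paraphrase, which a human reader would likely prefer, scores much lower. Coherence and grammar do not enter the computation at all.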

II. DATASET
We use the DialogSum dataset from Hugging Face, a comprehensive dialogue summarization dataset containing 13,460 dialogues and specifically curated for training and evaluating NLP models that perform abstractive summarization. The dataset encompasses a wide range of everyday conversational topics, including medical appointments, casual discussions, and event planning, providing an extensive source for understanding and summarizing real-life dialogues.
A. Key Characteristics of the DialogSum Dataset
• Dialogue: The full text of the conversation.
• Summary: A manually crafted summary of the dialogue.
• Topic: A succinct topic or headline of the dialogue, serving as an additional summary or focus point.
• Id: A unique identifier for each dialogue instance.
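
As a rough illustration of these four fields, the sketch below models one record as a Python dataclass and turns it into a summarization prompt. The class name DialogSumRecord, the helper build_prompt, the prompt wording, and the record contents are all hypothetical and are not taken from the dataset or from our pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DialogSumRecord:
    """One dialogue instance with the four fields listed above."""
    id: str
    dialogue: str
    summary: str
    topic: str


def build_prompt(record: DialogSumRecord) -> str:
    """Wrap the dialogue in a simple instruction (the wording is an assumption)."""
    return (
        "Summarize the following conversation.\n\n"
        f"{record.dialogue}\n\n"
        "Summary: "
    )


# Invented record, used only to show the shape of the data.
example = DialogSumRecord(
    id="example_0",
    dialogue=(
        "#Person1#: Can I see the doctor on Monday?\n"
        "#Person2#: Yes, 10 a.m. is free."
    ),
    summary="#Person1# books a doctor appointment for Monday at 10 a.m.",
    topic="medical appointment",
)

prompt = build_prompt(example)
print(prompt)
```

The reference summary is deliberately kept out of the prompt: it serves as the target during supervised fine-tuning and as the comparison text during evaluation.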

III. METHODOLOGY
In this work, we use reinforcement learning (RL) together with a reward model that tests for toxicity. RL is a type of machine learning in which an agent takes actions in an environment with the aim of maximizing its cumulative reward. The agent's behavior is defined by its policy, and the goal of RL is for the agent to learn an optimal, or nearly optimal, policy that maximizes the reward function. Here the original policy is the instruct PEFT model, that is, the LLM before detoxification. One could ask human labelers to give feedback on the toxicity of the outputs, but using them for the entire fine-tuning process can be expensive. A practical way to avoid that cost is to use feedback generated by a model: a reward model that encourages the agent to detoxify the dialogue summaries. The intuitive approach is to do some form of sentiment analysis across two classes (not hate and hate) and give a higher reward when there is a higher chance of getting the class not hate as an output.

We use Meta AI's RoBERTa-based hate speech model as the reward model. This model outputs logits and then predicts probabilities across two classes: not hate and hate. The logit of the not hate output is taken as a positive reward, and the model is then fine-tuned with PPO using those reward values. To set this up, create an instance of the required model class for the RoBERTa model and load a tokenizer to test the model. Note that model label 0 corresponds to the class not hate and label 1 to the class hate.

To test the reward model, take some non-toxic text, tokenize it, and pass it to the model. Print the output logits, the probabilities, and the corresponding reward that will be used for fine-tuning. The outputs are the logits for both the not hate (positive) and hate (negative) classes, but PPO uses only the logit of the not hate class as the positive reward signal that helps detoxify the LLM outputs.
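
The step from logits to probabilities to reward can be sketched without loading the classifier. In the minimal example below, the two logit pairs are made-up values standing in for real model outputs; only the label order (0 = not hate, 1 = hate) and the choice of the not hate logit as the reward follow the description above.

```python
import math
from typing import List

NOT_HATE_INDEX = 0  # label 0 = not hate, label 1 = hate, as stated above


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_from_logits(logits: List[float]) -> float:
    """The reward is the raw logit of the not hate class, not its probability."""
    return logits[NOT_HATE_INDEX]


# Hypothetical logits; a real run would take them from the hate speech model.
non_toxic_logits = [3.1, -2.7]
toxic_logits = [-0.7, 0.4]

for name, logits in [("non-toxic", non_toxic_logits), ("toxic", toxic_logits)]:
    probs = softmax(logits)
    reward = reward_from_logits(logits)
    print(f"{name}: logits={logits} probs={probs} reward={reward}")
```

Using the logit rather than the probability keeps the reward unbounded, so a strongly non-toxic text is rewarded more than a mildly non-toxic one even when both probabilities are close to 1.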
Evaluate Toxicity
To evaluate the model before and after fine-tuning/detoxification, you need to set up the toxicity evaluation metric. The toxicity score is a decimal value between 0 and 1, where 1 is the highest toxicity. This evaluator can be used to compute the toxicity of the prepared dialogues. You will need to pass the test dataset (dataset["test"]), the same tokenizer that was used earlier, the frozen PEFT model, and the toxicity evaluator. It is convenient to wrap the required steps in the function evaluate_toxicity and then calculate the model toxicity before fine-tuning/detoxification.

Figure 1: Toxicity tokenizer.

Perform Fine-Tuning to Detoxify the Summaries
Optimize an RL policy against the reward model using Proximal Policy Optimization (PPO). Set up the configuration parameters, then load the ppo_model and the tokenizer. You will also load a frozen version of the model, ref_model. The first model is optimized, while the second serves as a reference for calculating the KL divergence from the starting point. This works as an additional reward signal in PPO training to make sure the optimized model does not deviate too much from the original model. The fine-tuning loop consists of the following main steps:
1) Get the query responses from the policy LLM (PEFT model).
2) Get sentiments for the query/responses from the hate speech RoBERTa model.
3) Optimize the policy with PPO using the (query, response, reward) triplet.
The operation is running if you see the following metrics appearing:
• objective/kl: minimize KL divergence.
• ppo/returns/mean: maximize mean returns.
• ppo/policy/advantages_mean: maximize advantages.
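
The role of the frozen reference model can be illustrated with a small sketch of one common way to shape the PPO reward: each response token is penalized in proportion to the difference between the policy and reference log-probabilities, and the reward-model score is added on the final token. The function name kl_penalized_rewards, the coefficient kl_coef, and all numbers are assumptions for illustration; this is not the training code used here.

```python
from typing import List


def kl_penalized_rewards(
    policy_logprobs: List[float],
    ref_logprobs: List[float],
    score: float,
    kl_coef: float = 0.1,
) -> List[float]:
    """Per-token rewards: a KL penalty on every token plus the score on the last one.

    policy_logprobs / ref_logprobs hold log p(token) of the sampled response
    tokens under the optimized policy and the frozen reference model.
    """
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must be non-empty and of equal length")
    # The per-token log-ratio is a sample-based estimate of the KL divergence.
    rewards = [-kl_coef * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += score  # the reward-model score arrives at the end of the response
    return rewards


# Hypothetical three-token response with a not hate logit of 2.0 as its score.
policy_logprobs = [-1.0, -0.5, -2.0]
ref_logprobs = [-1.2, -0.5, -1.5]
rewards = kl_penalized_rewards(policy_logprobs, ref_logprobs, score=2.0)
print(rewards)
```

When the policy assigns the same log-probabilities as the reference model, the penalty vanishes and only the reward-model score remains. As the policy drifts away on the tokens it samples, the penalty grows, which is the behavior that the objective/kl metric tracks.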

Figure 2: Output for a non-toxic text.

Figure 3: Output for a toxic text.
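
The before/after comparison described under Evaluate Toxicity can be sketched as a function that generates a summary for each test dialogue, scores it, and aggregates the scores. The sketch below is a minimal version under stated assumptions: generate_fn and toxicity_fn are stand-ins for the PEFT model with its tokenizer and for the toxicity evaluator, the toy implementations and dialogues are invented, and reporting the mean and standard deviation is our choice for this illustration.

```python
from statistics import mean, pstdev
from typing import Callable, Iterable, Tuple


def evaluate_toxicity(
    dialogues: Iterable[str],
    generate_fn: Callable[[str], str],
    toxicity_fn: Callable[[str], float],
) -> Tuple[float, float]:
    """Return (mean, standard deviation) of toxicity over generated summaries."""
    scores = []
    for dialogue in dialogues:
        summary = generate_fn(dialogue)
        score = toxicity_fn(summary)
        if not 0.0 <= score <= 1.0:
            raise ValueError("toxicity scores are expected to lie in [0, 1]")
        scores.append(score)
    if not scores:
        raise ValueError("no dialogues to evaluate")
    return mean(scores), pstdev(scores)


# Toy stand-ins (hypothetical): the real pipeline would call the model and evaluator.
def toy_generate(dialogue: str) -> str:
    return dialogue.split(".")[0]


def toy_toxicity(text: str) -> float:
    return 0.9 if "stupid" in text.lower() else 0.1


dialogues = [
    "You are stupid. I am leaving.",
    "Let us meet at noon. See you then.",
]
mean_toxicity, std_toxicity = evaluate_toxicity(dialogues, toy_generate, toy_toxicity)
print(f"mean={mean_toxicity:.2f} std={std_toxicity:.2f}")
```

Running the same function once with the model before detoxification and once after it gives two comparable numbers, which is the point of wrapping the steps in a single function.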

1) Human Feedback Collection: Similar to PPO, pairs of model-generated responses to prompts are presented to human annotators, who indicate their preferences.
2) Reward Model Training: A reward model is trained using the collected human feedback. This model predicts a reward score based on the likelihood of a response being preferred by humans.
3) Reinforcement Learning: The policy model (language model) is further fine-tuned using reinforcement learning algorithms such as Proximal Policy Optimization (PPO).
4) Evaluation: The performance of the RLHF-trained model is evaluated using a separate validation set from the DialogSum dataset.

IV. RESULTS
The model was trained through fine-tuning and reinforcement learning optimization. The following are our training outputs:
• training loss: 0.40485486303243723
• train runtime: 4124.22
• train samples per second: 0.21
• train steps per second: 0.023
• epoch: 2.980392156862745

Table 5: Model Training Outputs for mean and median values

The model improved its ability to generate well-reasoned and good quality responses.
