Evaluating LLMs for Text Summarization

Uploaded by

bhaskar agarwal
Copyright
© All Rights Reserved

A Comprehensive Evaluation of Large Language Models for Summarization Evaluation


Bhaskar Vivek Agarwal Deepeshwar Manish Kumar Soum Nag
B.Tech in Computer Science, BML Munjal University, Gurugram, India

Abstract—This paper addresses the challenges in evaluating the quality
of text summarization. It highlights the divergence between existing
automatic evaluation metrics like BLEU/ROUGE and human assessment,
which can consider both objective criteria (like grammar and
correctness) and subjective ones (such as informativeness,
succinctness, and appeal). To address these challenges, the paper
proposes a new evaluation framework based on large language models
(LLMs). This framework aims to provide a comprehensive evaluation by
comparing the generated text with the reference text from both
objective and subjective perspectives. The key aspects of the proposed
approach are: 1) modeling objective and subjective dimensions of the
generated text using a role-player prompting mechanism; 2) introducing
a context-based prompting mechanism to generate dynamic role-player
profiles based on the input context; and 3) designing a
multi-role-player prompting technique that uses batch prompting and
integrates multiple outputs into the final evaluation results.
Experimental results on three real-world summarization datasets show
that the proposed model is highly competitive and highly consistent
with human annotators.
Index Terms— Text Summarization, Evaluation Metrics, BLEU/ROUGE,
Human Evaluation, Objective Criteria, Subjective Criteria, Large Language
Models (LLMs), Evaluation Framework, Role-Player Prompting, Context-
Based Prompting, Multi-Role-Player Prompting, Batch Prompting,
Consistency with Human Annotators, Automatic Evaluation, Comprehensive
Evaluation.

I. INTRODUCTION
Reinforcement Learning from Human Feedback (RLHF) is a methodology in
which human preferences are used to train and align large language
models (LLMs) with desired behaviors and responses. The fundamental
concept behind RLHF is to leverage human judgment to guide the
training process, thus ensuring that the models produce outputs that
align with human expectations and ethical standards. This approach is
particularly crucial given the complex and nuanced nature of many
tasks that LLMs are deployed to handle, such as natural language
understanding, problem-solving, and generation.
[...] mends in the field of aligning large language models with
human preference.
Text summarization has numerous applications across various research
and practical fields. Recent studies have highlighted a significant
gap between existing metrics like BLEU, ROUGE, and BERTScore and human
annotations. While traditional overlap-based and model-based metrics
can measure lexical or semantic similarity between generated text and
reference text, they fail to capture specific dimensions such as
coherence, grammar, and interestingness. As shown in Figure 1, the
summarization task exposes the limitations of metrics like BLEU and
ROUGE, which are inadequate for assessing the true quality of the text
beyond a certain threshold. To align automatic metrics with human
evaluation, we face two main challenges: 1) how to model objective
evaluation criteria such as coherence and grammar; 2) how to model
subjective evaluation criteria such as interestingness,
comprehensiveness, and usefulness from the users' perspective. Natural
language has various modes of expression for the same concept, so
assessing its quality based on a few static criteria is hard.
[...] offers a streamlined, reinforcement-free alternative that
addresses many of the scalability and stability issues associated
with traditional methods.

II. DATASET
We use the DialogSum dataset from Hugging Face, a comprehensive
dialogue summarization dataset containing 13,460 dialogues,
specifically curated for training and evaluating NLP models that
perform abstractive summarization. The dataset encompasses a wide
range of everyday conversational topics, including medical
appointments, casual discussions, and event planning, providing an
extensive source for understanding and summarizing real-life
dialogues.
A. Key Characteristics of the DialogSum Dataset
• Dialogue: The entire conversation text.
• Summary: A manually crafted summary of the dialogue.
• Topic: A succinct topic or headline of the dialogue,
serving as an additional summary or focus point.
• Id: A unique identifier for each dialogue instance.
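
To make the record layout concrete, the following is a minimal sketch of one row with the four fields listed above, together with a helper that wraps a dialogue in a summarization instruction. The class name `DialogSumRecord`, the function `build_prompt`, the prompt wording, and the example row are all hypothetical illustrations; they are not taken from the dataset or from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DialogSumRecord:
    """One dataset row holding the four fields described in the text."""
    id: str
    dialogue: str
    summary: str
    topic: str


def build_prompt(record: DialogSumRecord) -> str:
    """Wrap a dialogue in a summarization instruction (hypothetical template)."""
    return (
        "Summarize the following conversation.\n\n"
        f"{record.dialogue}\n\n"
        "Summary: "
    )


# Invented example row for illustration only; not copied from the dataset.
example = DialogSumRecord(
    id="example_0",
    dialogue="Person1: Can I book a check-up for Friday?\n"
             "Person2: Yes, 10 a.m. is free.",
    summary="Person1 books a check-up for Friday at 10 a.m.",
    topic="medical appointment",
)

prompt = build_prompt(example)
print(prompt)
```

The manually written `summary` field would serve as the reference against which a generated summary is compared.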

III. METHODOLOGY
In this document, we use reinforcement learning (RL) with a reward
model that tests toxicity. RL is a type of machine learning in which
an agent takes actions in an environment with the aim of maximizing
its cumulative reward. The agent's behavior is defined by its policy,
and the goal of RL is for the agent to learn an optimal, or nearly
optimal, policy that maximizes the reward function. The original
policy is based on the instruct PEFT model, that is, the LLM before
detoxification. We could ask human labelers to give feedback on the
toxicity of the outputs. However, it is expensive to use them for the
entire fine-tuning process. A practical way to avoid that is to use
feedback generated by a reward model that encourages the agent to
detoxify the dialogue summaries. The intuitive approach is to do some
form of sentiment analysis across two classes (not hate and hate) and
give a higher reward when there is a higher chance of getting the
class not hate as an output.

We use Meta AI's RoBERTa-based hate speech model as the reward model.
This model outputs logits and then predicts probabilities across two
classes: not hate and hate. The logit of the not hate output is taken
as a positive reward. Then, the model is fine-tuned with PPO using
those reward values. We create an instance of the required model class
for the RoBERTa model and load a tokenizer to test the model. Note
that label 0 corresponds to the class not hate and label 1 to the
class hate.

To test the reward model, we take some non-toxic text, tokenize it,
and pass it to the model, then print the output logits, the
probabilities, and the corresponding reward that will be used for
fine-tuning. The outputs are the logits for both the not hate
(positive) and hate (negative) classes, but PPO uses only the logit of
the not hate class as the positive reward signal that helps detoxify
the LLM outputs.
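
The mapping from classifier output to reward described above can be sketched as follows. This is a minimal illustration that assumes the two-class label order stated in the text (index 0 for not hate, index 1 for hate). The logit values are made up to stand in for classifier outputs and are not results from the actual model; the helper names are hypothetical.

```python
import math

# Label order stated in the text: index 0 = "not hate", index 1 = "hate".
NOT_HATE_INDEX = 0


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_from_logits(logits):
    """Use the raw 'not hate' logit as the scalar reward passed to PPO."""
    return logits[NOT_HATE_INDEX]


# Made-up logits standing in for classifier outputs; not real model results.
non_toxic_logits = [3.1, -2.4]
toxic_logits = [-0.7, 0.4]

for name, logits in [("non-toxic", non_toxic_logits), ("toxic", toxic_logits)]:
    probs = softmax(logits)
    print(name, "probabilities:", probs, "reward:", reward_from_logits(logits))
```

Using the logit rather than the probability keeps the reward unbounded, so confident not hate predictions are rewarded more strongly than a probability capped at 1 would allow.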
Evaluate Toxicity
To evaluate the model before and after fine-tuning/detoxification, we
set up the toxicity evaluation metric. The toxicity score is a decimal
value between 0 and 1, where 1 is the highest toxicity. The evaluator
can be used to compute the toxicity of the prepared dialogues. The
evaluation needs the test dataset (dataset["test"]), the same
tokenizer that was used earlier, the frozen PEFT model, and the
toxicity evaluator. It is convenient to wrap the required steps in the
function evaluate_toxicity and use it to calculate the model toxicity
before fine-tuning/detoxification.

Figure 1: Toxicity tokenizer.

Perform Fine-Tuning to Detoxify the Summaries
We optimize an RL policy against the reward model using Proximal
Policy Optimization (PPO). First, we set up the configuration
parameters and load the ppo_model and the tokenizer. We also load a
frozen version of the model, ref_model. The first model is optimized,
while the second serves as a reference for calculating the KL
divergence from the starting point. This works as an additional reward
signal in PPO training to make sure the optimized model does not
deviate too much from the original model. The fine-tuning loop
consists of the following main steps:
1) Get the query responses from the policy LLM (PEFT model).
2) Get sentiments for the query/responses from the hate speech
RoBERTa model.
3) Optimize the policy with PPO using the (query, response, reward)
triplet.
The operation is running if you see the following metrics appearing:
• objective/kl: minimize KL divergence.
• ppo/returns/mean: maximize mean returns.
• ppo/policy/advantages_mean: maximize advantages.
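
The role of the frozen reference model as an additional reward signal can be illustrated with a small sketch. It shows one common way to combine the two signals: every response token is penalized in proportion to how far the policy's log-probability has moved from the reference model's, and the reward-model score is added on the last token. The function name, the coefficient `beta`, and all numbers are assumptions for illustration; this is not necessarily the exact computation performed by the training library used in the paper.

```python
def kl_penalized_rewards(policy_logprobs, ref_logprobs, score, beta=0.2):
    """Return per-token rewards and the summed KL estimate for one response.

    policy_logprobs / ref_logprobs: log-probabilities that the optimized
    and frozen models assign to each generated token.
    score: scalar from the reward model (for example, the 'not hate' logit).
    beta: hypothetical penalty coefficient.
    """
    if len(policy_logprobs) != len(ref_logprobs) or not policy_logprobs:
        raise ValueError("need two equal-length, non-empty log-prob lists")
    # Per-token KL estimate: log pi(token) - log pi_ref(token).
    kl_per_token = [p - r for p, r in zip(policy_logprobs, ref_logprobs)]
    # Every token pays the KL penalty.
    rewards = [-beta * kl for kl in kl_per_token]
    # The reward-model score is credited on the final token only.
    rewards[-1] += score
    return rewards, sum(kl_per_token)


# Made-up log-probabilities for a two-token response.
unchanged, kl_unchanged = kl_penalized_rewards([-1.0, -2.0], [-1.0, -2.0], score=1.5)
drifted, kl_drifted = kl_penalized_rewards([-0.5, -1.0], [-1.0, -2.0], score=1.5)
print("no drift:", unchanged, "KL:", kl_unchanged)
print("drifted: ", drifted, "KL:", kl_drifted)
```

With identical log-probabilities the KL estimate is zero and the response keeps the full score. Once the policy drifts from the reference, the same score is reduced by the penalty, which is what keeps the optimized model close to its starting point.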

Figure 2: Output for a non-toxic text.

Figure 3: Output for a toxic text.
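
The toxicity evaluation step can be sketched in simplified form as scoring each generated text and aggregating the scores. The version below is a stand-in under stated assumptions: it takes already generated texts and an arbitrary scoring function instead of the model, tokenizer, and evaluator described in the text, and it reports the mean and standard deviation as one plausible aggregation. The keyword-based `toy_scorer` is purely hypothetical and exists only so the example runs without downloading a classifier.

```python
from statistics import mean, pstdev


def evaluate_toxicity_sketch(texts, score_fn):
    """Score each text with score_fn (0 = benign, 1 = most toxic) and
    return the mean and population standard deviation of the scores."""
    scores = [score_fn(text) for text in texts]
    if not scores:
        raise ValueError("no texts to evaluate")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("toxicity scores must lie in [0, 1]")
    return mean(scores), pstdev(scores)


def toy_scorer(text):
    """Hypothetical stand-in: fraction of words found in a tiny blocklist."""
    blocklist = {"stupid", "idiot"}
    words = [w.strip(".,!?") for w in text.lower().split()]
    return sum(w in blocklist for w in words) / max(len(words), 1)


# Invented sample outputs; not taken from the paper's experiments.
samples = ["You are an idiot", "Have a nice day"]
toxicity_mean, toxicity_std = evaluate_toxicity_sketch(samples, toy_scorer)
print("mean:", toxicity_mean, "std:", toxicity_std)
```

Running the same function on outputs generated before and after detoxification gives two comparable pairs of numbers.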


1) Human Feedback Collection: Similar to PPO, pairs of
model-generated responses to prompts are presented to human
annotators, who indicate their preferences.
2) Reward Model Training: A reward model is trained using the
collected human feedback. This model predicts a reward score based on
the likelihood of a response being preferred by humans.
3) Reinforcement Learning: The policy model (language model) is
further fine-tuned using reinforcement learning algorithms like
Proximal Policy Optimization (PPO).
4) Evaluation: The performance of the RLHF-trained model is evaluated
using a separate validation set from the DialogSum dataset.
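
Step 2 above can be illustrated with the pairwise loss that is commonly used to train reward models from preference pairs (a Bradley-Terry style objective). The text does not state which loss the authors used, so this is an assumption; the function names and reward values are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the preferred response gets the higher reward.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(reward_chosen, reward_rejected):
    """Modeled probability that annotators prefer the 'chosen' response."""
    return math.exp(-pairwise_preference_loss(reward_chosen, reward_rejected))


# Made-up reward scores for one annotated pair.
print("correct ranking loss:", pairwise_preference_loss(2.0, 0.0))
print("wrong ranking loss:  ", pairwise_preference_loss(0.0, 2.0))
print("tie probability:     ", preference_probability(1.0, 1.0))
```

Minimizing this loss over the collected pairs pushes the reward model to score preferred responses higher, which is the quantity the reinforcement learning step in 3) then maximizes.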

IV. RESULTS
After fine-tuning the model and optimizing it with reinforcement
learning, we obtained the following training outputs:
• training loss: 0.40485486303243723
• train runtime: 4124.22
• train samples per second: 0.21
• train steps per second: 0.023
• epoch: 2.980392156862745

Table 5: Model Training Outputs for mean and median values

The model improved its ability to generate well-reasoned, good-quality
responses.
