RLAIF: AI Feedback in Reinforcement Learning
with AI Feedback
Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu,
Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, Sushant Prakash
Google Research
{harrisonlee,samratph,hassan}@[Link]
“1” and “2” and compute the softmax to obtain a preference distribution.
There are numerous alternatives to obtain preference labels from LLMs, such as extracting the preference from a free-form generated response (e.g. “The first response is better”), or representing the preference distribution as a one-hot encoding. However, we choose our method because it is straightforward to implement and conveys more information than a one-hot encoding through its distributed representation of preferences.
We experiment with two styles of preambles: “Base”, which essentially asks “which response is better?”, and “Detailed”, which resembles detailed rating instructions that would be given to human preference annotators (see Table 16 for preambles for the summarization task). We also experiment with in-context learning (Brown et al., 2020), where high-quality exemplars were hand-selected to cover a range of topics.
2.1.1 Addressing Position Bias
The order in which candidates are shown to an LLM can bias which candidate it prefers (Pezeshkpour and Hruschka, 2023; Wang et al., 2023). We find evidence of position bias, which is more pronounced with smaller sizes of LLM labelers (see Appendix B).
To mitigate position bias in preference labeling, we make two inferences for every pair of candidates, where the order in which candidates are presented to the LLM is reversed for the second inference. The results from both inferences are then averaged to obtain the final preference distribution.
2.1.2 Chain-of-thought Reasoning
We experiment with eliciting chain-of-thought (CoT) reasoning (Wei et al., 2022) from our AI labelers through a two-step inference procedure. First, we replace the Ending of the standard prompt (e.g. “Preferred Summary=”) with a sentence asking for thoughts and explanation (e.g. “Consider the coherence, accuracy, coverage, and overall quality of each summary and explain which one is better. Rationale:”) and then decode a response from the LLM. Then, we concatenate the original prompt, the response, and the standard Ending string together, and follow the scoring procedure in Section 2.1 to obtain a preference distribution. See Figure 3 for an illustration.
In zero-shot prompts, the LLM is not given an example of what reasoning should look like. In few-shot prompts, we provide examples of CoT reasoning for the model to follow. See Tables 17 and 18 for examples.
Figure 3: An illustration of the process of obtaining AI-generated labels for summarization preferences. The LLM is first prompted to explain its thoughts on the quality of the two candidates (blue). The LLM’s response is then appended to the original prompt (orange) and fed to the LLM a second time to generate a preference distribution over “1” vs. “2” based on their log-probabilities (green).
2.2 Reinforcement Learning from AI Feedback
2.2.1 Distilled RLAIF
We describe our adaptation of the canonical RLAIF setup below, which we also refer to as “distilled RLAIF”. Unless otherwise mentioned, RLAIF is carried out using this method.
After labeling preferences with an LLM, a reward model (RM) is trained on these labels. Since our approach produces soft labels (e.g. [0.6, 0.4]), we apply a cross-entropy loss to the softmax of the reward scores generated by the RM. The softmax converts the RM scores into a probability distribution. We note that training an RM on a dataset of AI labels can be viewed as a form of model distillation.
Finally, we conduct reinforcement learning to train the RLAIF policy model, using the RM to assign rewards to model responses.
2.2.2 Direct RLAIF
An alternative approach is to directly use LLM feedback as the reward signal in RL. This enables bypassing the intermediate stage of training an RM that approximates the preferences of the LLM.
The LLM is prompted to rate the quality of a generation between 1 and 10. Similar to the format mentioned in Section 2.1, the prompt contains high-level details on the structure of the input and the dimensions along which to rate a generation (e.g. factuality, coherence). Then, the likelihood of each score token between 1 and 10 is computed, the likelihoods are normalized to a probability distribution, a weighted score is calculated as $s(x|c) = \sum_{i=1}^{10} i \, P(i|x, c)$, and then the score is again normalized to the range [−1, 1]. Additional details on the prompting technique can be found in Appendix D.
Finally, RL is conducted in a similar manner to “distilled RLAIF”, where the direct score is used as the reward instead of the score from an RM. This approach is more computationally expensive than the canonical setup when the AI labeler is larger than the RM.
2.3 Evaluation
We evaluate our results with three metrics: AI Labeler Alignment, Win Rate, and Harmless Rate.
AI Labeler Alignment measures the accuracy of AI-labeled preferences with respect to human preferences. For a single example, a soft AI-labeled preference is first converted to a binary representation (e.g. [0.6, 0.4] → [1, 0]). Then, a 1 is assigned if the label agrees with the human preference and 0 otherwise. The alignment accuracy $z_{acc}$ can be expressed as follows:
$$ z_{acc} = \frac{1}{D} \sum_{i=1}^{D} \mathbb{1}\left[ \arg\max_{j} P^{AI}_{i,j} = p^{H}_{i} \right], $$
where $D$ is the size of the preference dataset, $P^{AI} \in \mathbb{R}^{D \times 2}$ is the matrix of soft AI preferences, and $p^{H} \in \mathbb{R}^{D}$ is the corresponding vector of human preferences, containing elements 0 or 1 to denote whether the first or second response is preferred, respectively.
Win Rate evaluates the end-to-end quality of two policies by measuring how often one policy is preferred by human annotators over another. Given an input and two generations, human annotators select which generation they prefer. The percentage of instances where policy A is preferred over policy B is referred to as the “win rate of A vs. B”. A 50% win rate indicates that A and B are equally preferred.
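The labeling and reward computations described in Sections 2.1 and 2.2 can be illustrated with a minimal Python sketch. This is not the authors' implementation: all function names are hypothetical, the log-probabilities are assumed to be supplied by whatever LLM scoring interface is in use, and the linear rescaling of the direct score to [−1, 1] is an assumption, since the text only states the target range.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def preference_from_logprobs(logp_1, logp_2):
    """Soft preference [P(first), P(second)] from the log-probabilities
    of generating the tokens "1" and "2" (Section 2.1)."""
    return softmax([logp_1, logp_2])


def debiased_preference(pref_original, pref_swapped):
    """Average two inferences (Section 2.1.1). pref_swapped comes from the
    prompt with candidates in reversed order, so it is flipped back into
    the original candidate order before averaging."""
    flipped = pref_swapped[::-1]
    return [(a + b) / 2 for a, b in zip(pref_original, flipped)]


def soft_label_cross_entropy(reward_a, reward_b, soft_label):
    """Cross-entropy between a soft AI label and the softmax of the two
    RM scores (Section 2.2.1)."""
    probs = softmax([reward_a, reward_b])
    return -sum(t * math.log(p) for t, p in zip(soft_label, probs))


def direct_reward(score_logprobs):
    """Direct RLAIF reward (Section 2.2.2): normalize the likelihoods of
    the score tokens "1".."10", take the expected score, then map it to
    [-1, 1]. The linear map from [1, 10] is an assumption."""
    probs = softmax(score_logprobs)
    expected = sum(i * p for i, p in zip(range(1, 11), probs))
    return 2 * (expected - 1) / 9 - 1
```

For instance, `debiased_preference([0.8, 0.2], [0.4, 0.6])` flips the second inference to `[0.6, 0.4]` and averages the two, giving a final soft label close to `[0.7, 0.3]`.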
Harmless Rate measures the percentage of responses that are considered harmless by human evaluators. We evaluate the harmless dialogue generation task with this metric instead of Win Rate, because we find that many responses are equally safe, making it difficult to assign relative rankings.
3 Experimental Details
3.1 Datasets
We use the following datasets for our experiments:
• Reddit TL;DR (Stiennon et al., 2020): posts from Reddit [3] accompanied by summaries of the posts.
• OpenAI’s Human Preferences (Stiennon et al., 2020): a dataset created from a subset of Reddit TL;DR. Each example comprises a post, two candidate summaries, and a rating from a human annotator indicating which summary is preferred.
• Anthropic Helpful and Harmless Human Preferences (Bai et al., 2022a): conversations between a human and an AI assistant, where each conversation has two possible AI assistant responses, one preferred and the other non-preferred, according to a human annotator. Preference is based on which response is more informative and honest for the helpful task, and which response is safer for the harmless task.
More dataset details can be found in Appendix C.
We also experimented with the Stanford Human Preferences dataset (Ethayarajh et al., 2022), but we found that both RLHF and RLAIF policies did not show meaningful improvements over the SFT baseline after correcting for length biases, using the procedure in Appendix J.
3.2 LLM Labeling
To enable fast experiment iteration when evaluating AI labeling techniques, we randomly downsampled the training split of each preference dataset. For summarization, an additional filter was applied to only include examples where human annotators preferred one summary over the other with high confidence [4]. After downsampling and filtering, there were roughly 3-4k examples for each task [5]. AI labeler alignment metrics were calculated on these downsampled datasets.
PaLM 2 (Google et al., 2023) is used as the LLM for labeling preferences. The versions used are instruction-tuned but not previously trained with RL. Unless otherwise specified, AI labels were generated using PaLM 2 Large (L) with the best-performing prompt in Section 4.4. For more details on LLM labeling, see Appendix D.
3.3 Model Training
All SFT models are initialized from PaLM 2 Extra-Small (XS). For summarization, the SFT model is produced by fine-tuning PaLM 2 XS on the Reddit TL;DR dataset. For all other tasks, an instruction-tuned variant of PaLM 2 is used in lieu of task-specific fine-tuning.
RMs are also derived from PaLM 2 XS. RMs are fine-tuned on the entire training split of the corresponding preference dataset, where the label is the AI preference for AI feedback RMs and the original human preference label in the dataset for human feedback RMs. RM accuracies can be found in Appendix G.
In the RL phase, the policy is trained with a modified version of REINFORCE (Williams, 1992) adapted to the language modeling domain. Many recent works use Proximal Policy Optimization (PPO) (Schulman et al., 2017), a related method that adds a few techniques to make training more conservative and stable (e.g. clipping the objective function). We instead use REINFORCE with a baseline, given that it is simpler yet still effective for the problem at hand. Both policy and value models are initialized from the SFT model. For summarization, the policy is rolled out on the training split of the Reddit TL;DR dataset. In other words, the initial states for RL are the original posts from the dataset prior to summarization. For the helpful and harmless tasks, the initial states are drawn from the training splits of the preference datasets. For summarization, simple post-processing is applied to responses generated by RL-trained policies as described in Appendix E.
For additional details on the RL formulation and model training, see Appendices F and G.
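To make the idea of REINFORCE with a baseline concrete, the following toy sketch applies it to a softmax policy over a handful of discrete actions. It is only an illustration under stated assumptions: the paper trains a language-model policy with a learned value model as the baseline, whereas this sketch uses a bandit setting and a running-mean baseline, and every name in it (`reinforce_step`, `train`, the step sizes) is hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, baseline, lr=0.1):
    """One REINFORCE update for a softmax policy.

    The gradient of log pi(action) with respect to the logits is
    onehot(action) - pi, and it is scaled by the advantage
    (reward - baseline) to reduce variance.
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        theta + lr * advantage * ((1.0 if k == action else 0.0) - p)
        for k, (theta, p) in enumerate(zip(logits, probs))
    ]


def train(rewards, steps=2000, lr=0.1, seed=0):
    """Toy loop: each action has a fixed reward; the baseline is a running
    mean of observed rewards (an assumption standing in for a value model)."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        logits = reinforce_step(logits, action, reward, baseline, lr)
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)
```

Calling `train([0.1, 0.5, 0.9])` returns the final action probabilities; in expectation the updates shift probability mass toward the higher-reward actions.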
[3] [Link]
[4] This follows the evaluation procedure in Stiennon et al. (2020). Examples with confidence scores of 1, 2, 8, and 9 were considered to be “high-confidence”.
[5] We sample 15%, 10%, and 10% of the training splits for summarization, helpful dialogue generation, and harmless dialogue generation, respectively.
3.4 Human Evaluation
For experiments evaluated by win rates, evaluators were presented with an input context and multiple responses generated from different policies (e.g. RLAIF, RLHF, and SFT). They were then asked to rank responses in order of quality without ties, as seen in Figure 4. Input contexts were drawn from test splits of datasets, which were not used for training or any other evaluation [6]. Rankings were used to calculate win rates with respect to pairs of policies. For harmless dialogue generation, evaluators were asked to independently rate each response as harmless or harmful.
For more details on human evaluation, see Appendix I.
4 Results
4.1 RLAIF vs. RLHF
RLAIF achieves performance gains on par with or better than RLHF on all three tasks (see Figure 1 and Table 1). RLAIF and RLHF are preferred by human evaluators over the baseline SFT policy 71% and 73% of the time for summarization [7] and 63% and 64% for helpful dialogue generation, respectively. The difference in win rates between RLAIF vs. SFT and RLHF vs. SFT is not statistically significant. When directly comparing RLAIF against RLHF, they are equally preferred, i.e. the win rate is not statistically significantly different from 50%. For harmless dialogue generation, RLAIF achieves a harmless rate of 88%, outperforming both RLHF and SFT, which score 76% and 64%, respectively [8].
Figure 5 contains an example of SFT, RLAIF, and RLHF summaries. To better understand how RLAIF compares to RLHF, we qualitatively compare responses generated by both policies for summarization in Section 5.
As observed in Stiennon et al. (2020), RLAIF and RLHF policies tend to generate longer responses than the SFT policy, which may be partially responsible for their higher win rates. We conduct post-hoc analysis to control for length and find that both RLAIF and RLHF policies still outperform the SFT policy, and by similar margins to one another. See Appendix J for details.
One natural question that arises is whether there is value in combining human and AI feedback. We experimented with combining both types of feedback but did not see an improvement beyond using human feedback alone. However, we believe that there are several alternative training setups that could demonstrate value in combining both forms of feedback. See Appendix K for details.
These results suggest that RLAIF is a viable alternative to RLHF that does not depend on human annotation. In addition to expediting labeling time and reducing dependence on annotation services, another key benefit of AI labeling is cost reduction. We estimate the cost of labeling with an LLM to be over 10x cheaper than human annotation. See Appendix L for detailed calculations.
4.2 Towards Self-Improvement
In Section 4.1, the LLM used to label preferences (PaLM 2 L) is much larger than the policy being trained (PaLM 2 XS). Going one step further, one might wonder if RLAIF can yield improvements when the AI labeler is the same size as the policy. On the task of summarization, we conduct RLAIF where PaLM 2 XS is used as the AI labeler instead of PaLM 2 L. The rest of the setup mimics the experiment in Section 4.1. We refer to this setup as “same-size RLAIF”.
Human annotators prefer same-size RLAIF 68% of the time over SFT (see Table 1). For reference, RLAIF using an AI labeler larger than the policy is preferred 71% of the time over SFT [9]. This result demonstrates that RLAIF can yield improvements even when the AI labeler is the same size as the policy LLM.
We note that the AI labeler and initial policy are not the exact same model. The AI labeler is the instruction-tuned PaLM 2 XS, whereas the initial policy is PaLM 2 XS fine-tuned on Reddit TL;DR summarization. Additionally, the summaries rated by the AI labeler were generated by policies created by the original dataset curators. For these reasons, we do not consider this experiment a strict case of “self-improvement” (Huang et al., 2022). However, we believe that these results show great promise for this research direction.
[6] For summarization, we used the test split of Reddit TL;DR. For helpful and harmless dialogue generation, we used test splits from the preference datasets, detailed in Appendix C.
[7] RLAIF and RLHF are also preferred over the human reference summaries in Reddit TL;DR 79% and 80% of the time, respectively.
[8] RLAIF achieves a statistically significant improvement over RLHF and SFT, according to a two-sample t-test.
[9] The difference in win rates between “same-size RLAIF vs. SFT” and “RLAIF vs. SFT” is not statistically significant. For a two-sample t-test, p-value = 0.07. At alpha = 0.05, this difference is not statistically significant.
                                   Win Rate                             Harmless Rate
Comparison                         Summarization   Helpful dialogue     Model   Harmless dialogue
RLAIF vs SFT                       71%             63%                  SFT     64%
RLHF vs SFT                        73%             64%                  RLHF    76%
RLAIF vs RLHF                      50%             52%                  RLAIF   88%
Same-size RLAIF vs SFT             68%
Direct RLAIF vs SFT                74%
Direct RLAIF vs Same-size RLAIF    60%
Table 1: Left side: Win rates when comparing generations from two different models for the summarization and the helpful dialogue tasks, judged by human evaluators. Right side: Harmless rates across policies for the harmless dialogue task, judged by human evaluators.
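The three metrics behind these numbers (Sections 2.3 and 3.4) are simple to compute. The sketch below is an illustrative reading of the definitions rather than the evaluation code used for the paper; the function names and the data layout (rankings as best-first lists of policy names, harmless judgments as booleans) are assumptions.

```python
def alignment_accuracy(ai_prefs, human_prefs):
    """AI Labeler Alignment: fraction of examples where the argmax of the
    soft AI preference matches the human label (0 = first response
    preferred, 1 = second). An exact tie resolves to index 0 here, which
    is an arbitrary choice."""
    hits = 0
    for probs, human in zip(ai_prefs, human_prefs):
        ai_choice = max(range(len(probs)), key=probs.__getitem__)
        hits += int(ai_choice == human)
    return hits / len(human_prefs)


def win_rate(rankings, policy_a, policy_b):
    """Win rate of A vs. B from evaluator rankings. Each ranking lists
    policy names from best to worst, without ties."""
    wins = sum(1 for r in rankings if r.index(policy_a) < r.index(policy_b))
    return wins / len(rankings)


def harmless_rate(judgments):
    """Fraction of responses that evaluators rated as harmless."""
    return sum(1 for harmless in judgments if harmless) / len(judgments)
```

Because every evaluator ranks all policies for a given input, one set of rankings yields the win rate for each pair of policies, for example `win_rate(rankings, "RLAIF", "SFT")` and `win_rate(rankings, "RLAIF", "RLHF")`.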
We would like to thank many people who have helped make this work complete. We thank Chen Zhu for optimizing our LLM inference setup, Le Hou for suggesting prompt improvements and experimenting with self-consistency, Léonard Hussenot for bringing the problem of position bias in LLMs to our attention, and Bradley Green, Ewa Dominowska, and Blaise Aguera y Arcas for supporting this research.
We thank everyone who thoroughly reviewed our work and provided valuable feedback: Hakim Sidahmed, Meiqi Guo, Michal Valko, Nevan Wichers, Sian Gooding, and Yuan Cao.
We thank Mo Azar, Daniel Guo, Andrea Michi, Nicolas Perez-Nieves, and Marco Selvi for their work in developing an RLAIF training setup that directly prompts an LLM to obtain reward scores.
Finally, we thank the individuals who designed and built the RL training infrastructure used in this paper: Léonard Hussenot, Johan Ferret, Robert Dadashi, Geoffrey Cideron, Alexis Jacq, Sabela Ramos, Piotr Stanczyk, Sertan Girgin, Danila Sinopalnikov, Amélie Héliou, Nikola Momchev, and Olivier Bachem.
References
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep [...]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
Bosheng Ding, Chengwei Qin, Linlin Liu, Yew Ken Chia, Boyang Li, Shafiq Joty, and Lidong Bing. 2023. Is GPT-3 a good data annotator? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11173–11195, Toronto, Canada. Association for Computational Linguistics.
Kawin Ethayarajh, Yejin Choi, and Swabha Swayamdipta. 2022. Understanding dataset difficulty with V-usable information. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 5988–6008. PMLR.
Tom Everitt and Marcus Hutter. 2016. Avoiding wireheading with value reinforcement learning. In Artificial General Intelligence: 9th International Conference, AGI 2016, New York, NY, USA, July 16-19, 2016, Proceedings 9, pages 12–22. Springer.
Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.
Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A survey of data augmentation approaches for NLP. In Findings of the Association
for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, Online. Association for Computational Linguistics.
Roy Fox, Ari Pakman, and Naftali Tishby. 2015. Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562.
Yang Gao, Christian M Meyer, Mohsen Mesgar, and Iryna Gurevych. 2019. Reward learning for efficient reinforcement learning in extractive document summarisation. arXiv preprint arXiv:1907.12894.
Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. 2019. A theory of regularized markov decision processes. In International Conference on Machine Learning, pages 2160–2169. PMLR.
Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. Chatgpt outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056.
Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. 2022. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
Google. 2023. Ai platform data labeling service pricing. [Link] ai-platform/data-labeling/pricing#labeling_costs. Accessed: 2023-09-28.
Rohan Anil Google, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. 2023. Palm 2 technical report.
Ronald A Howard. 1960. Dynamic programming and markov processes. John Wiley.
Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2022. Large language models can self-improve. arXiv preprint arXiv:2210.11610.
Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato, Richard E Turner, and Douglas Eck. 2017. Sequence tutor: Conservative fine-tuning of sequence generation models with kl-control. In International Conference on Machine Learning, pages 1645–1654. PMLR.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
M. G. Kendall and B. Babington Smith. 1939. The Problem of m Rankings. The Annals of Mathematical Statistics, 10(3):275–287.
Minae Kwon, Sang Michael Xie, Kalesha Bullard, and Dorsa Sadigh. 2022. Reward design with language models. In The Eleventh International Conference on Learning Representations.
Viet Dac Lai, Chien Van Nguyen, Nghia Trung Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan A Rossi, and Thien Huu Nguyen. 2023. Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv preprint arXiv:2307.16039.
Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang, Yuanyuan Yang, Jiaming Tian, Hao He, Antong Li, Mengshen He, Zhengliang Liu, et al. 2023. Summary of chatgpt/gpt-4 research and perspective towards the future of large language models. arXiv preprint arXiv:2304.01852.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.
James Manyika. 2023. An overview of bard: an early experiment with generative ai. [Link] documents/[Link]. Accessed: 2023-08-23.
Yu Meng, Martin Michalski, Jiaxin Huang, Yu Zhang, Tarek Abdelzaher, and Jiawei Han. 2023. Tuning language models as training data generators for augmentation-enhanced few-shot learning. In International Conference on Machine Learning, pages 24457–24477. PMLR.
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064.
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
OpenAI. 2023a. Gpt-4 technical report.
OpenAI. 2023b. Openai pricing. [Link] com/pricing. Accessed: 2023-09-28.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
Pouya Pezeshkpour and Estevam Hruschka. 2023. Large language models sensitivity to the order of options in multiple-choice questions. arXiv preprint arXiv:2308.11483.
Paul Roit, Johan Ferret, Lior Shani, Roee Aharoni, Geoffrey Cideron, Robert Dadashi, Matthieu Geist, Sertan Girgin, Léonard Hussenot, Orgad Keller, et al. 2023. Factually consistent summarization via reinforcement learning with textual entailment feedback. arXiv preprint arXiv:2306.00186.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. CoRR, abs/1804.04235.
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.
Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12.
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, and Huan Sun. 2022a. Towards understanding chain-of-thought prompting: An empirical study of what matters. arXiv preprint arXiv:2212.10001.
Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926.
Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021a. Want to reduce labeling cost? gpt-3 can help. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4195–4205.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022b. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
Zirui Wang, Adams Wei Yu, Orhan Firat, and Yuan Cao. 2021b. Towards zero-label language learning. arXiv preprint arXiv:2109.09193.
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256.
Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018. A study of reinforcement learning for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3612–3621.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Yuxiang Wu and Baotian Hu. 2018. Learning to extract coherent summary via deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, page 5602.
Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng,
and Yuandong Tian. 2023. Rlcd: Reinforcement
learning from contrast distillation for language model
alignment.
Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B
Brown, Alec Radford, Dario Amodei, Paul Chris-
tiano, and Geoffrey Irving. 2019. Fine-tuning lan-
guage models from human preferences. arXiv
preprint arXiv:1909.08593.
A RLHF Preliminaries B Position Bias in LLM Labelers
We review the RLHF pipeline introduced in Sti-
ennon et al. (2020); Ouyang et al. (2022), which consists of 3 phases: supervised fine-tuning, reward model training, and reinforcement learning.

A.1 Supervised Fine-tuning
A pre-trained LLM is fine-tuned on a high quality labeled dataset for a downstream task (e.g. summarization) using token-level supervision to produce a supervised fine-tuned (SFT) model $\pi^{SFT}$.

A.2 Reward Modeling
Given an input $x$, we sample a pair of responses $(y_1, y_2) \sim \pi$ from one or more models, where oftentimes $\pi$ is the SFT model. The input and responses are sent to human annotators to rate which response is better according to some criteria. These annotations form a dataset of triplets $\mathcal{D} = \{(x, y_w, y_l)\}$, where $y_w$ and $y_l$ are the preferred and non-preferred responses, respectively. A reward model (RM) $r_\phi$ is trained by minimizing the following loss:

$$\mathcal{L}_r(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big],$$

where $\sigma$ is the sigmoid function.

A.3 Reinforcement Learning
A policy $\pi_\theta^{RL}$ is initialized from the SFT model weights and then optimized with reinforcement learning to maximize the reward given by the RM, which serves as a proxy for human preferences. Optionally, a Kullback-Leibler (KL) divergence term $D_{KL}$ is added to the objective to penalize $\pi_\theta^{RL}$ for deviating from the original SFT policy $\pi^{SFT}$, controlled by the hyperparameter $\beta$ (Fox et al., 2015; Geist et al., 2019). The KL loss helps prevent $\pi_\theta^{RL}$ from drifting into a region where it generates language that is highly rewarded by the RM yet consists of low-quality or unnatural language - a phenomenon known as “reward hacking” (Everitt and Hutter, 2016; Amodei et al., 2016). The optimization objective is described by the equation below:

$$J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot|x)} \Big[ (1 - \beta)\, r_\phi(y|x) - \beta\, D_{KL}\big( \pi_\theta^{RL}(y|x) \,\|\, \pi^{SFT}(y|x) \big) \Big],$$

where $\beta$ is a hyperparameter between 0 and 1.

Model Size    % Same Position Preferred
PaLM 2 L      18%
PaLM 2 S      21%
PaLM 2 XS     56%

Table 4: Position bias is more prevalent in smaller model sizes, measured by the percentage of examples where the LLM prefers the same position even after swapping the order of candidates (“% Same Position Preferred”). Analysis is conducted using the “Detailed + CoT 0-shot” prompt for the summarization task.

Our analysis on the summarization task suggests that the LLMs used for preference labeling are biased by the order in which candidates are shown. For each example in our AI labeling evaluation set, we query the LLM preferences for the pair of candidates, swap the order in which candidates are presented, and then query the LLM preferences again.
We consider an LLM to be more biased if it prefers the same position on both the original and reversed inferences. For example, let candidates A and B be in positions 1 and 2 for the first inference and in positions 2 and 1 for the second inference. If the LLM prefers the same position on both inferences, we consider the LLM to be position-biased. We measure position bias by computing “% Same Position Preferred” - the percentage of inference pairs where this occurs. A higher metric value indicates a more biased LLM.
We find that PaLM 2 L, S, and XS prefer the same position 18%, 21%, and 56% of the time, respectively, suggesting that position bias is inversely correlated with model size (see Table 4). One hypothesis is that larger models are more capable and therefore more faithfully judge preferences based on the content of the candidates rather than their positions, which are supposed to be immaterial.
We also observe that for PaLM 2 L, of the 18% of cases where it prefers the same position on both inferences, 94% of the time it prefers the first candidate shown. On the other hand, PaLM 2 S and XS show affinity for the second candidate shown when the same position is preferred on both inferences, preferring it 91% and 99% of the time, respectively. These biases are statistically significant under a two-sided binomial test at α = 0.05.
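The “% Same Position Preferred” metric used in the position bias analysis can be sketched in a few lines. This is only an illustration: the function name, the input encoding (preferred position 1 or 2 for the original and swapped inference), and the toy data are assumptions made here, not details from the paper.

```python
from typing import List, Tuple


def position_bias_stats(pairs: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (share of pairs with the same position preferred,
    share of those same-position cases that favor position 1).

    Each pair holds the preferred position (1 or 2) from the original
    inference and from the inference with the candidate order swapped.
    """
    same = [first for first, second in pairs if first == second]
    same_rate = len(same) / len(pairs)
    # Among position-biased cases, how often is the first slot preferred?
    first_share = same.count(1) / len(same) if same else 0.0
    return same_rate, first_share


# Toy data (made up for illustration): 2 of 5 pairs keep the same position.
toy_pairs = [(1, 2), (2, 1), (1, 1), (1, 2), (2, 2)]
same_rate, first_share = position_bias_stats(toy_pairs)
print(f"same position preferred: {same_rate:.0%}, of which first slot: {first_share:.0%}")
```

A pair such as (1, 2) means the model followed the content when the order flipped, while (1, 1) or (2, 2) means it followed the position.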
C Dataset Details
For summarization, we use the filtered Reddit TL;DR dataset (Stiennon et al., 2020), containing posts from Reddit12 that have been filtered to ensure high quality. The dataset contains 123k posts, where ∼5% is held out as a validation set.
Additionally, we use OpenAI’s human preference dataset created from the filtered Reddit TL;DR dataset. For a given post, two candidate summaries were generated - often from different policies, and human labelers were asked to rate which summary they preferred. The total dataset comprises 92k pairwise comparisons.
For helpful and harmless dialogue generation, we use Anthropic’s Helpful and Harmless preference datasets13 (Bai et al., 2022a). Each example consists of a conversation history between a human and an AI assistant accompanied by a preferred and non-preferred response from the AI assistant. Preference is based on which response is more helpful and honest for the helpful task, and which response is safer and less harmful for the harmless task. Each dataset comprises over 40k training examples and 2k test examples. We further split the test sets into validation and test sets by randomly assigning two-thirds of examples to validation and one-third to test.

12 [Link]
13 We use the helpful-base and harmless-base datasets from [Link]datasets/Anthropic/hh-rlhf.

D LLM Labeling Details
For LLM labeling, we set a maximum input context length of 4096 tokens. For chain-of-thought generation, we set a maximum decoding length of 512 tokens and sample with temperature T = 0.0 (i.e. greedy decoding). For self-consistency experiments in Appendix M, we use temperatures varying from T = 0.3 to T = 1.0 with top-K sampling (Fan et al., 2018), where K = 40.
In Section 4.3, we use the AI labeler to directly compute a score that we leverage as the reward for RL. We use the following prompt: “You are an expert summary rater. Given a TEXT (completed with a SUBREDDIT and a TITLE) and a SUMMARY, your role is to provide a SCORE from 1 to 10 that rates the quality of the SUMMARY given the TEXT, with 1 being awful and 10 being a perfect SUMMARY.”, followed by the input Reddit post, then the summary to score preceded by “SUMMARY: ”, and a final “SCORE: ”.
PaLM 2 models are publicly available through Google Cloud’s Vertex AI14 , though we cannot guarantee full reproducibility as the models accessible through Google Cloud are subject to change.

14 [Link]docs/generative-ai/learn/models

E Post-RL Response Formatting
For summarization, we observed that summaries generated by RLHF and RLAIF policies often included superfluous symbols like periods or spaces at the end of the response - possibly due to “reward hacking”. Given that these extra tokens do not have any meaningful content, we programmatically removed certain symbols at the end of summaries. This ensured that human evaluators could focus on the content and not be distracted by the formatting of the response.

F REINFORCE for Language Models
Consider a deterministic, finite-horizon MDP $M = (\mathcal{X}, \mathcal{A}, R, P, \gamma)$ (Howard, 1960). At each step $t$, given the current state $X_t \in \mathcal{X}$ and the next action $A_t \in \mathcal{A}$, the model receives a reward $R_t = R(X_t, A_t)$ and transitions to the next state $X_{t+1} = P(X_t, A_t)$.
In the context of language models, $X_t$ is the concatenation of the input text and all text generated by the policy until time $t$. Action $A_t$ is the token from the considered vocabulary decoded at time $t$ by the stochastic policy $\pi_\theta(\cdot|X_t)$, where $\theta$ represents the policy parameters. Finally, the reward $R_t$ is given by the RM. The RM is only evaluated when the language model response has been fully generated; all rewards prior to the final token are set to 0, while the reward corresponding to the final token is set to $R_T$.
The cumulative sum of rewards received when following the policy $\pi_\theta$ from time-step $t$ is called the return. Generally, it is defined as $Z_t = \sum_{s=t}^{T} \gamma^{s-t} R_s$. However, since only the terminal reward is non-zero and we set $\gamma = 1$, the return can be simplified to $Z_t = R_T$.
Given a trajectory $(X_t, A_t, R_t)_{t=0}^{T}$ generated under $\pi_\theta$, the policy gradient loss from REINFORCE is then defined as follows:

$$\mathcal{L}_{PG}(\theta) = -\sum_t \log \pi_\theta(A_t | X_t)\, \overline{\big( Z_t - V_\psi^\pi(X_t) \big)},$$
where the bar notation denotes that no gradient is passed through the advantage term during backpropagation.
The baseline value function $V_\psi^\pi(x)$ estimates the return-to-go $Z_t$ when following the policy $\pi_\theta$ and is parameterized by $\psi$ (Williams, 1992; Sutton et al., 1999). It is trained with the following loss:

$$\mathcal{L}_V(\psi) = \sum_t \big( Z_t - V_\psi^\pi(X_t) \big)^2.$$

In practice, we optimize the regularized objective in Sec. A.3. We incorporate the KL divergence in the policy gradient loss described above, as commonly seen in other work (Jaques et al., 2017).

G Model Training Details
SFT models for the summarization task are trained on the Reddit TL;DR dataset, with a batch size of 128 for a single epoch. We use the Adafactor (Shazeer and Stern, 2018) optimizer with a learning rate of $10^{-5}$, and the maximum input and output lengths are 1024 and 128 tokens, respectively. For helpful and harmless dialogue generation tasks, an instruction-tuned version of PaLM 2 XS serves as the SFT model.
RMs for all tasks are trained until the training loss and accuracy curves plateau, which happens in 2-3 epochs. We use the Adafactor optimizer with a learning rate of $10^{-5}$. Batch size is 128 for summarization RMs and 32 for RMs of other tasks. We train all our RMs with maximum input length of 1152 tokens to account for 1024 context tokens and 128 response tokens. We report the accuracies of the RMs in Appendix H.
For summarization, the AI feedback RM is initialized from the SFT model (i.e. PaLM 2 XS fine-tuned on Reddit TL;DR), and the human feedback RM is initialized from PaLM 2 XS. We experimented with initializing the human feedback RM from the SFT model but found that it resulted in lower accuracy on the held out set of human preferences (see Table 6). For helpful and harmless dialogue generation tasks, we initialize both the human and AI feedback RMs from the instruction-tuned version of PaLM 2 XS.
For reinforcement learning, we use the SFT model for each task as the initial policy. We sample from our language model policies for all tasks with a temperature of T = 0.9 to encourage exploration. We train with a batch size of 128 and learning rate of $10^{-5}$ for 8 epochs. We set β = 0.05 for the KL divergence loss.
To select the final checkpoint for each RL policy, we first selected 4 candidate checkpoints from RL training that scored high rewards on validation prompts. We then prompted an off-the-shelf LLM to judge the win rate of the RL checkpoint’s summaries vs. the SFT policy’s summaries. We also conducted manual inspection of a dozen examples. We picked the checkpoint with the best combination of win rate and quality as judged by manual inspection as our final RL policy.

H Reward Model Accuracy

Task                 Human Feedback    AI Feedback
Summarization        79.3%             74.2%
Helpful Dialogue     76.0%             67.8%
Harmless Dialogue    72.1%             69.7%

Table 5: Pairwise accuracies of human feedback and AI feedback reward models across all tasks. Metrics are calculated on a held out set of human preference data for each task.

Initialization    Human Feedback    AI Feedback
PaLM 2 XS         79.3%             73.0%
SFT               78.7%             74.2%

Table 6: Results of initializing the summarization RMs on PaLM 2 XS vs. the SFT model.

RM Variant                          AI Feedback
Trained on “Base 0-shot” labels     77.9%
Trained on labels from PaLM 2 XS    66.4%

Table 7: Accuracy values for variants of RMs trained on AI labels for the task of summarization.

Pairwise Accuracy for RMs measures how accurate a trained reward model is with respect to a held-out set of human preferences. Given an input context and pair of candidate responses, the value is 1 if the RM scores the preferred candidate higher than the non-preferred candidate, according to the human label. Otherwise the value is 0. This quantity is averaged over multiple examples to obtain the pairwise accuracy of the RM.
We report RM accuracy on a held out set of human preferences for all tasks in Table 5. For summarization, we also report RM accuracy when
initializing on different checkpoints in Table 6. In Table 7, we report accuracy for RM variants used in the end-to-end sensitivity experiment in Appendix N and the same-size RLAIF experiment in Section 4.2.
We observe that RMs trained on human feedback outperform those trained on AI feedback, both of which are measured against a held out set of human preferences. This pattern seems natural, given that the human feedback RMs are trained on data drawn from the same distribution as the validation dataset. However, it is interesting to note that despite the gap in accuracy between AI and human preference RMs, RLAIF achieves comparable results to RLHF on two tasks and surpasses RLHF on one task. Additionally, we note that the summarization RMs trained on “Base 0-shot” and “Detailed + CoT 0-shot” (i.e. the default prompting technique) achieve accuracies of 77.9% and 74.2%, respectively, which is the inverse order of their final performance after RL (see Appendix N). These gaps in RM accuracy suggest that RM accuracy, while correlated with RM usefulness, may not be a perfect reflection of RM effectiveness in RLHF and RLAIF. Ultimately, we believe that the usefulness of RMs is assessed through conducting RL and evaluating the final policies through human evaluation.

I Human Evaluation Details
To conduct human evaluation, in total we created ∼2k unique rating instances. Each instance comprised a single context and three distinct model responses (e.g. responses from SFT, RLAIF, and RLHF policies), resulting in a total of ∼6k unique (context, response) pairs subjected to human evaluation. Additionally, each instance was assessed by three independent raters, resulting in ∼18k (context, response, rating) tuples.
We measure the inter-annotator agreement with Kendall’s Coefficient of Concordance W (Kendall and Smith, 1939) - a non-parametric statistic for assessing the agreement among multiple raters ranking multiple items. The values of Kendall’s W range from 0 to 1, where 0 indicates no agreement and 1 indicates perfect agreement. We conducted multiple human evaluation sessions, and the W statistic ranged from 0.6 to 0.7, indicating a reasonable level of agreement.

J Controlling for Response Length
Response length can often influence human evaluators’ perception of quality (Stiennon et al., 2020), and our various policies generate responses that differ in length. For example, in the summarization task, the summaries produced by RLAIF, RLHF, and SFT policies sent to human evaluation have an average character-length of 164, 161, and 132, respectively. For all experiments presented in this paper, we conduct post-hoc analysis to estimate the win rates after controlling for length.
We take an approach similar to Stiennon et al. (2020) and calculate the “length-adjusted win rate of policy A vs. policy B”. Given policy A, we train a logistic regression model where the input is the ratio of policy A’s response length to policy B’s response length (in characters), and the target is a binary label indicating whether policy A’s response was preferred over policy B’s response. After fitting the model, we estimate a length-controlled win rate by asking the logistic regressor to predict the win rate given a length ratio of 1.0, which represents the scenario where both responses are of equal length.
After controlling for length for the summarization task, our length-adjusted win rates for RLAIF and RLHF vs. SFT are 59% and 61%, respectively (see Table 8). Both RL policies continue to outperform the SFT policy by a similar margin, supporting our initial statement that RLAIF is comparable to RLHF.
We reach similar conclusions for the helpful dialogue generation task (Table 9), same-size RLAIF and direct RLAIF experiments (Table 11), the end-to-end sensitivity to AI labeler alignment experiment (Table 12), and combining human and AI feedback (Table 13).
For the harmless dialogue generation task, the setup is slightly different. Since human evaluators rated each response independently as harmful or harmless, we compute the harmless rate instead of the win rate. We use the average generation length from the SFT policy as the reference point for all other policies (Table 10).
We note that this post-hoc method of controlling for length is imperfect, as it assumes the logistic regression model accurately learns the relationship between summary length and human preference. A more principled approach would be to encourage all policies to generate summaries of similar length through an auxiliary training loss.
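The length-adjusted win rate described in Appendix J can be sketched as a one-feature logistic regression evaluated at a length ratio of 1.0. This is a minimal illustration under stated assumptions: the paper does not say how the regression was fitted, so plain batch gradient descent is used here, and the function names, step count, learning rate, and toy data are all hypothetical.

```python
import math
from typing import List, Tuple


def fit_logistic(xs: List[float], ys: List[int],
                 steps: int = 5000, lr: float = 0.1) -> Tuple[float, float]:
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def length_controlled_win_rate(length_ratios: List[float], a_preferred: List[int]) -> float:
    """Predicted win rate of policy A when both responses have equal length (ratio 1.0)."""
    w, b = fit_logistic(length_ratios, a_preferred)
    return 1.0 / (1.0 + math.exp(-(w * 1.0 + b)))


# Toy data (made up): len(A) / len(B) in characters, and whether A was preferred.
ratios = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
wins = [0, 0, 1, 0, 1, 1, 1, 1]
raw_rate = sum(wins) / len(wins)
rate = length_controlled_win_rate(ratios, wins)
print(f"raw win rate: {raw_rate:.3f}, length-controlled: {rate:.3f}")
```

In this toy data policy A tends to win when its responses are longer, so the length-controlled estimate comes out below the raw win rate, mirroring the direction of the corrections in Table 8.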
Models           Length uncorrected    Length corrected
RLAIF vs SFT     71%                   59%
RLHF vs SFT      73%                   61%
RLAIF vs RLHF    50%                   47%

Table 8: Length-controlled win rate for the summarization task.

Models           Length uncorrected    Length corrected
RLAIF vs SFT     63%                   61%
RLHF vs SFT      64%                   61%
RLAIF vs RLHF    52%                   50%

Table 9: Length-controlled win rate for the helpful dialogue generation task.

Models    Length uncorrected    Length corrected
SFT       64%                   64%
RLHF      76%                   78%
RLAIF     88%                   91%

Table 10: Length-controlled harmless rate for the harmless dialogue generation task. We used the average generation length from the SFT model as reference length to compute the length-controlled harmless rate for RLHF and RLAIF.

Models                               Length uncorrected    Length corrected
Same-size RLAIF vs SFT               68%                   59%
Direct RLAIF vs SFT                  74%                   65%
Direct RLAIF vs Same-size RLAIF      60%                   56%

Table 11: Length-controlled win rate for same-size RLAIF and direct RLAIF.

K Combining Human and AI Feedback
We investigate the effectiveness of combining human feedback and AI feedback on the task of summarization. We refer to this approach as RLHF + RLAIF and compare it against RLHF.
First, given contexts randomly drawn from the Reddit TL;DR dataset, responses are generated by RLHF and SFT policies with temperature T = 1.0. The instruction-tuned PaLM 2 L is then called to generate AI preferences. Finally, a new RM is trained on both the entire OpenAI human preference dataset and an equivalent size AI preference dataset.
We observe that RLHF + RLAIF does not improve beyond RLHF alone. RLHF + RLAIF and RLHF achieve win rates of 71% and 74% over SFT, respectively. The difference in win-rate is not statistically significant. When compared head-to-head, raters prefer both policies equally.
While this experiment did not show positive results from combining RLAIF and RLHF, there are many alternative setups which could prove successful. One such setup could involve first conducting RLAIF, then collecting generations and human preferences using the RLAIF policy as the initialization point for RLHF. In this curriculum learning approach, RLAIF can be viewed as a “warm-up” policy, which is then refined with RLHF. Another possible setup could involve collecting much more AI feedback than human feedback, since it is much less expensive to collect (see Appendix L). We leave this exploration to future work.

L Cost of LLM vs. Human Labeling
Using LLMs as data annotators can be much less costly than hiring human annotators (Wang et al., 2021a). We estimate AI preference labeling to be over 10x less costly than human preference labeling following the calculations below.
At the time of writing, GPT-4 charges $0.03 USD and $0.06 USD for every 1,000 tokens to encode and decode, respectively (OpenAI, 2023b). For labeling TL;DR preferences with an LLM, our average token lengths were as follows:

1. Input prompt length - 830 tokens (using the “Detailed + CoT 0-shot” prompt)
2. Generated chain-of-thought rationale - 61 tokens

Additionally, to debias position, we repeat each labeling procedure after inverting the order in which a pair of responses are shown. Our estimated AI labeling cost per example is $0.06 USD15 .
In comparison, Google Cloud’s human annotation service charges approximately $0.11 USD / 50 words for classification tasks at the time of writing16 (Google, 2023). We assume that each classification task only consists of reading a document and two candidate summaries, which have a combined average word length of 304 words. We estimate the human labeling cost per example to be $0.67 USD (304 words * $0.11 / 50 words).
We recognize that this cost analysis does not account for all factors, such as the cost of training human annotators, tasking multiple human annotators to rate the same instance for robustness, the cost of expert vs. crowd-sourced annotators, or the cost of setting up LLM labeling.

15 2 inferences * (830 encoder tokens * $0.03 / 1,000 tokens + 61 decoder tokens * $0.06 / 1,000 tokens) = $0.057 ≈ $0.06
16 Google Cloud charges between $90 and $129 per 1,000 units, where each unit is 50 words for a classification task. We average the lower and upper bound costs and convert from units to words - ($90 / 1,000 units + $129 / 1,000 units) / 2 * 1 unit / 50 words = $0.1095 USD / 50 words

Models                          Length uncorrected    Length corrected
Base RLAIF vs SFT               63%                   59%
Detailed RLAIF vs SFT           67%                   63%
Base RLAIF vs Detailed RLAIF    41%                   45%

Table 12: Length-controlled win rate for the experiment on end-to-end sensitivity to AI labeler alignment. Base RLAIF and Detailed RLAIF correspond to “Base 0-shot” RLAIF and “Detailed CoT 0-shot” RLAIF described in Appendix N, respectively.

Models                  Length uncorrected    Length corrected
RLHF + RLAIF vs SFT     71%                   61%
RLHF vs SFT             74%                   67%
RLHF + RLAIF vs RLHF    48%                   46%

Table 13: Length-controlled win rate for experiments combining human and AI feedback.

Self-Consistency      AI Labeler Alignment
1 sample, T=0.0       78.0%
16 samples, T=0.3     76.2%
16 samples, T=0.5     75.1%
16 samples, T=0.7     74.0%
16 samples, T=1.0     72.8%

Table 14: Sampling multiple chain-of-thought rationales with T > 0 results in lower alignment with human preferences. Note: 1 and 16 samples represent 2 and 32 inferences given our position debiasing technique (see Section 2.1.1).

M Self-Consistency
For chain-of-thought prompts, we also experiment with self-consistency (Wang et al., 2022b) - a technique to generate robust chain-of-thought rationales. In self-consistency, multiple chain-of-thought rationales are sampled with temperature T > 0, and LLM preference distributions are obtained for each one. The results are then averaged to obtain the final preference distribution.
On the task of summarization, we experiment with self-consistency using 4 and 16 samples under decoding temperatures ranging from 0.3 to 1.0 (see Table 14)17 . In all settings, self-consistency decreases AI labeler alignment versus the baseline without self-consistency. Our experiments show that alignment decreases as temperature increases, with the largest drop of over 5% at T = 1.0. In our experiments, using 4 vs. 16 self-consistency samples does not impact AI labeler alignment.
Manually inspecting chain-of-thought rationales did not reveal any common patterns for why self-consistency might degrade alignment (examples in Table 20). One hypothesis is that using a temperature of T > 0 leads the model to generate lower quality rationales compared to greedy decoding, ultimately leading to worse accuracy overall.

17 Results of using 4 samples are not shown because they only differ from the 16-sample results by ±0.4%.

N End-to-end Sensitivity to AI Labeler Alignment
We assess the end-to-end sensitivity of the RLAIF policies to AI labeler alignment on the task of summarization. Since human judgement is subjective and prone to noise, we test whether better AI labeler alignment leads to improved downstream performance. We train two RLAIF policies that only differ in the prompting technique used for AI labeling - “Base 0-shot” and “Detailed CoT 0-shot”, yielding 76.1% and 78.0% AI labeler alignment, respectively.
When compared head-to-head, human evaluators prefer “Detailed CoT 0-shot” RLAIF 59% of the time over “Base 0-shot” RLAIF18 . This result suggests that small gains in AI labeler alignment may lead to noticeable improvements in the final RL policies. However, this study is limited, and further experiments are required to draw generalizable conclusions.

18 Result is statistically significantly different from 50%.
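As a worked check of the cost estimate in Appendix L, the arithmetic from its two footnotes can be reproduced directly. The function and parameter names below are illustrative assumptions; the prices, token counts, and word counts are the figures quoted in that appendix.

```python
def ai_label_cost(prompt_tokens: int = 830, rationale_tokens: int = 61,
                  encode_price_per_1k: float = 0.03, decode_price_per_1k: float = 0.06,
                  inferences: int = 2) -> float:
    """Cost in USD of one AI preference label (two inferences for position debiasing)."""
    per_inference = (prompt_tokens * encode_price_per_1k / 1000
                     + rationale_tokens * decode_price_per_1k / 1000)
    return inferences * per_inference


def human_label_cost(words: int = 304, low_per_1k_units: float = 90.0,
                     high_per_1k_units: float = 129.0, words_per_unit: int = 50) -> float:
    """Cost in USD of one human label, averaging the quoted lower and upper unit prices."""
    price_per_unit = (low_per_1k_units + high_per_1k_units) / 2 / 1000
    return words * price_per_unit / words_per_unit


ai_cost = ai_label_cost()
human_cost = human_label_cost()
print(f"AI: ${ai_cost:.3f}, human: ${human_cost:.2f}, ratio: {human_cost / ai_cost:.1f}x")
```

With these inputs the ratio lands a little above 11, consistent with the "over 10x" claim; the unrounded unit price of $0.1095 is used here rather than the rounded $0.11.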
Preamble A good summary is a shorter piece of text that has the
essence of the original. ... Given a piece of text and two
of its possible summaries, output 1 or 2 to indicate which
summary best adheres to coherence, accuracy, coverage, and
overall quality as defined above.
Preferred Summary=1
Table 15: An example of a prompt fed to an off-the-shelf LLM to generate AI preference labels for summarization.
{text}, {summary1}, and {summary2} are populated with unlabeled examples, and a preference distribution
is obtained by computing the softmax of the log-probabilities of generating the tokens “1” vs. “2”.
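The scoring step described in the Table 15 caption, a softmax over the log-probabilities of the tokens “1” and “2”, can be sketched as follows. The function name and the example log-probabilities are assumptions for illustration; how the log-probabilities are read from a particular LLM API is left out.

```python
import math
from typing import Tuple


def preference_distribution(logprob_1: float, logprob_2: float) -> Tuple[float, float]:
    """Softmax over the log-probabilities of generating "1" vs. "2"."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logprob_1, logprob_2)
    e1 = math.exp(logprob_1 - m)
    e2 = math.exp(logprob_2 - m)
    total = e1 + e2
    return e1 / total, e2 / total


# Made-up log-probabilities: the model puts mass 0.6 on "1" and 0.2 on "2",
# with the remaining mass on other tokens.
p1, p2 = preference_distribution(math.log(0.6), math.log(0.2))
print(f"P(summary 1 preferred) = {p1:.2f}, P(summary 2 preferred) = {p2:.2f}")
```

Renormalizing over only the two label tokens gives a soft label rather than a one-hot choice.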
Figure 4: A screenshot of the user interface presented to human evaluators, ultimately used to calculate win rates.
Raters are shown a context and asked to rank the quality of candidate responses.
“Base” preamble You are an expert summary rater. Given a piece of text and
two of its possible summaries, output 1 or 2 to indicate
which summary is better.
“Detailed” preamble A good summary is a shorter piece of text that has the
essence of the original. It tries to accomplish the same
purpose and conveys the key information from the original
post. Below we define four evaluation axes for summary
quality: coherence, accuracy, coverage, and overall quality.
Table 16: The “Base” and “Detailed” preambles given to the LLM labeler to obtain preference labels for the
summarization task.
Preamble A good summary is a shorter piece of text that has the
essence of the original. It tries to accomplish the same
purpose and conveys the key information from the original
post. Below we define four evaluation axes for summary
quality: coherence, accuracy, coverage, and overall quality.
Rationale:
Table 17: The prompt used for the “Detailed + CoT 0-shot” for summarization. For CoT prompts, we first decode a
response from the LLM and then concatenate it with the original prompt and the ending “Preferred Summary=”
before following the scoring procedure in Section 2.1 to obtain a preference distribution.
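The two-step chain-of-thought procedure in the Table 17 caption can be sketched as below. This is a hedged illustration: `generate` and `token_logprobs` are hypothetical callables standing in for whatever LLM interface is available, the newline used to join the pieces is an assumption, and the stubs at the bottom return made-up values only so that the sketch runs.

```python
import math
from typing import Callable, Sequence, Tuple


def cot_preference(
    prompt_body: str,
    generate: Callable[[str], str],                                  # hypothetical decoder
    token_logprobs: Callable[[str, Sequence[str]], Sequence[float]],  # hypothetical scorer
    rationale_ending: str = "Rationale:",
    scoring_ending: str = "Preferred Summary=",
) -> Tuple[float, float]:
    # Step 1: decode a free-form rationale from the prompt ending in "Rationale:".
    rationale = generate(prompt_body + rationale_ending)
    # Step 2: concatenate prompt, rationale, and the standard ending, then score "1" vs. "2".
    scoring_prompt = prompt_body + rationale_ending + rationale + "\n" + scoring_ending
    lp1, lp2 = token_logprobs(scoring_prompt, ("1", "2"))
    m = max(lp1, lp2)
    e1, e2 = math.exp(lp1 - m), math.exp(lp2 - m)
    return e1 / (e1 + e2), e2 / (e1 + e2)


# Stub callables with made-up outputs, used only to exercise the control flow.
seen_prompts = []

def fake_generate(prompt: str) -> str:
    return " Summary 1 is more accurate."

def fake_token_logprobs(prompt: str, tokens: Sequence[str]) -> Sequence[float]:
    seen_prompts.append(prompt)
    return (math.log(0.8), math.log(0.2))

p1, p2 = cot_preference("Text - ...\nSummary 1 - ...\nSummary 2 - ...\n",
                        fake_generate, fake_token_logprobs)
```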
Preamble A good summary is a shorter piece of text that has the
essence of the original. ... Given a piece of text and
two of its possible summaries, explain which summary best
adheres to coherence, accuracy, coverage, and overall quality
as defined above.
Thoughts on Summary 1 -
Coherence - 7. Rationale: The summary is generally
understandable, though it could be written with better
grammar.
Accuracy - 9. Rationale: The summary doesn’t say things
that aren’t in the original text, and isn’t misleading.
Coverage - 6. Rationale: The summary covers most of the
important information in the post and conveys the gist of
the original text. However, it places more emphasis on “no
contact” and could have mentioned the smothering/neediness to
be more complete.
Overall Quality - 7. Rationale: The summary represents
the post fairly well with only minor areas where it could be
improved.
Thoughts on Summary 2 -
Coherence - 3. Rationale: The summary is long-winded and
has several grammatical errors.
Accuracy - 4. Rationale: The summary mentions that the
author broke no contact, but this is incorrect. Otherwise,
it is accurate.
Coverage - 8. Rationale: The summary covers the key points
in the original text.
Overall Quality - 4. Rationale: The summary is somewhat
misleading and doesn’t convey the original text’s key points
well.
Preferred Summary=1
Table 18: The template used for the “Detailed + CoT 1-shot” prompt for summarization, with some text removed
for brevity.
Sample to Annotate Text - I met my current girlfriend online around 6 months ago
when another one of our online friends was going through some
problems. ...
Thoughts on Summary 2 -
Coherence - 9. Rationale: The summary is concise and easy
to understand.
Accuracy - 9. Rationale: The summary is accurate and
mentions that the girlfriend hasn’t begun college yet.
Coverage - 9. Rationale: The summary covers the main points
of the post and mentions that the girlfriend hasn’t begun
college yet.
Overall Quality - 9. Rationale: The summary is concise,
accurate, and covers the main points of the post.
Table 19: An example of the different chain-of-thought rationales produced by the 0-shot (“Detailed + CoT 0-shot”)
vs. 1-shot (“Detailed + CoT 1-shot”) prompts for summarization. We note that the CoT 0-shot rationale is inaccurate,
as it mistakenly claims that Summary 1 states the girlfriend has graduated from college.
Sample to Annotate Text - I feel that out of principle I should be refunded
the adoption fee since the agency’s foster home infected the
kittens with the parasite. Both cats were born in the foster
home and there are 20 other cats. Do I have any legal right
to ask for the fee back? Or help with the cost of treating?
They had a disclaimer that they would not be held liable for
any vet bills incurred but I feel that as an agency whose
main purpose is finding forever home for “healthy, sociable
kittens” (as their website suggests) should be held liable in
some way.
Table 20: An example comparing chain-of-thought rationales produced at different temperatures for self-consistency
experiments. Rationale @ T = 1.0 is worse because it mis-attributes the hallucination about Giardia to Summary 2
instead of Summary 1.
Rationale:
Table 21: The “Base + CoT 0-shot” prompting template for the helpful dialogue generation task. The AI labels
generated using this prompt were used to conduct RLAIF experiments in Section 4.1.
Preamble You are an expert rater of harmful assistant responses.
Given the context and the two responses choose the most
harmful response.
Preferred Response=2
Preferred Response=1
Table 22: The “Base + 2-shot” prompting template for the harmless dialogue generation task. The AI labels
generated using this prompt were used to conduct RLAIF experiments in Section 4.1.
Figure 5: Example summaries generated by SFT, RLHF, and RLAIF policies for a Reddit post. RLHF and RLAIF
produce higher quality summaries than SFT, which fails to capture key details. Salient details are in bold.
Example #3 RLAIF summary: I’m a nice, chill girl who is often described
as “good” but I’m jealous of the girls that guys get enamored
with so easily. What can I do to improve myself or how I
communicate/interact with guys to make myself into someone a
guy wants to be with for the long haul?
Table 23: We observe that the RLHF policy sometimes hallucinates when the RLAIF policy does not. Hallucinations
are highlighted in red.
Example #1 RLAIF summary: Boyfriend is overly flirtatious with other
girls, I’ve talked to him about it, he doesn’t seem to care.
It’s causing trust issues. Am I overreacting? What else can
I do?
Example #2 RLAIF summary: Asked a girl to prom, things were going great
until I asked her. Now our conversations are awkward and I’m
not sure if I should ask her out. Should I just give up?
Table 24: We observe that summaries from the RLAIF policy are sometimes less coherent and grammatical
than summaries from the RLHF policy. Less coherent phrases are highlighted in red.
RLAIF offers significant scalability advantages over RLHF because it circumvents the need for time-consuming and expensive human annotation by using AI to generate feedback. This method not only reduces the labor and costs associated with human feedback (reported to be over 10 times cheaper) but also speeds up the labeling process, thereby improving resource efficiency.
Directly prompting an LLM for reward scores during reinforcement learning bypasses the step of distilling LLM preferences into a reward model. This approach achieves a higher win rate over the supervised fine-tuning baseline than the canonical RLAIF setup, indicating performance improvements in text generation tasks when the LLM is leveraged for immediate feedback.
Challenges with AI-generated labels may include potential bias, lack of nuanced understanding, and variability in label quality. These challenges can be mitigated by refining the LLM training process to capture a diverse array of user preferences, employing techniques like chain-of-thought reasoning, and continuously comparing AI labels against heuristic benchmarks to ensure accuracy and consistency.
The critical evaluation axes for assessing summary quality include coherence, accuracy, coverage, and overall quality. These axes influence preference labeling by shaping the criteria for comparing and rating different summaries, ensuring that preferences align with the strength of summaries in readability (coherence), fidelity to the source (accuracy), comprehensiveness (coverage), and overall representation quality. These axes guide raters on which summary better represents the underlying text.
RLAIF maintains alignment with human preferences by utilizing AI-generated preference labels that have demonstrated a high degree of accuracy in reflecting human judgment. Techniques such as soliciting chain-of-thought reasoning improve this alignment, while detailed preambles and few-shot prompting selectively enhance performance depending on the task.
RLAIF, or Reinforcement Learning from AI Feedback, utilizes large language models to generate preference labels for training, instead of human feedback as in RLHF (Reinforcement Learning from Human Feedback). It achieves comparable or superior performance to RLHF across summarization, helpful dialogue generation, and harmless dialogue generation tasks. RLAIF outperforms the supervised fine-tuning baseline and is equally preferred to RLHF by human evaluators in terms of alignment with human preferences. It also outperforms RLHF in harmless dialogue generation.
The experiment, referred to as 'same-size RLAIF', involved using an AI labeler with the same number of parameters as the policy LLM, specifically the instruction-tuned PaLM 2 XS. Despite the similar size, RLAIF was still preferred over the supervised fine-tuning baseline in summarization tasks, indicating that RLAIF can still offer improvements even without a larger AI labeler.
Combining human and AI feedback in the studies did not show an improvement over human feedback alone possibly due to overlapping benefits of each method or the unsuitability of current integration methods. Future research could explore alternative setups for combining feedback, such as refining how each form of feedback is weighted or leveraged within the training process, or innovating hybrid models that better capture advantages of both feedback types.
'Constitutional AI' enhances training by employing AI-generated revisions to iteratively improve responses according to predefined value statements, leveraging both human and AI preferences. Although a hybrid approach using 'Constitutional AI' was shown to outperform supervised fine-tuning, the study by Bai et al. did not directly compare RLAIF against human feedback, leaving the question of their relative efficacy unanswered.
Same-size policy experiments challenge conventional views of model hierarchy by showing that substantial performance gains are possible without increasing model size. These findings imply that 'self-improvement' in language models may leverage efficiency rather than expanding model scales, prompting new research avenues in self-revising models that optimize existing resources.