0% found this document useful (0 votes)
215 views29 pages

RLAIF: AI Feedback in Reinforcement Learning

Uploaded by

ambrose
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
215 views29 pages

RLAIF: AI Feedback in Reinforcement Learning

Uploaded by

ambrose
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

RLAIF: Scaling Reinforcement Learning from Human Feedback

with AI Feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu,
Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, Sushant Prakash
Google Research
{harrisonlee,samratph,hassan}@[Link]

Abstract RLAIF and RLHF Win Rates


Reinforcement learning from human feedback RLAIF RLHF

(RLHF) has proven effective in aligning large 80%


language models (LLMs) with human prefer-
arXiv:2309.00267v2 [[Link]] 1 Dec 2023

Win Rate vs. SFT


ences. However, gathering high-quality hu- 73%
70%
man preference labels can be a time-consuming 71%
and expensive endeavor. RL from AI Feed-
back (RLAIF), introduced by Bai et al., of- 60% 63% 64%
fers a promising alternative that leverages a
powerful off-the-shelf LLM to generate pref- 50%
erences in lieu of human annotators. Across Summarization Helpfulness
the tasks of summarization, helpful dialogue
generation, and harmless dialogue generation, Harmless Rate by Policy
RLAIF achieves comparable or superior perfor- 90%
mance to RLHF, as rated by human evaluators.
88%
Furthermore, RLAIF demonstrates the ability
80%
to outperform a supervised fine-tuned baseline
Harmless Rate

even when the LLM preference labeler is the 76%


70%
same size as the policy. In another experiment,
directly prompting the LLM for reward scores
60% 64%
achieves superior performance to the canonical
RLAIF setup, where LLM preference labels
are first distilled into a reward model. Finally, 50%
SFT RLHF RLAIF
we conduct extensive studies on techniques for
generating aligned AI preferences. Our results Figure 1: Human evaluators strongly prefer RLAIF and
suggest that RLAIF can achieve human-level RLHF over the SFT baseline for summarization and
performance, offering a potential solution to helpful dialogue generation. Their difference in win
the scalability limitations of RLHF. rates vs. SFT is not statistically significant. Further-
1 Introduction more, when compared head-to-head, RLAIF is equally
preferred to RLHF. For harmless dialogue generation,
Reinforcement Learning from Human Feedback RLAIF outperforms RLHF.
(RLHF) is an effective technique for aligning lan-
guage models to human preferences (Stiennon
et al., 2020; Ouyang et al., 2022). It is cited as bels. This raises the question of whether artificially
one of the key drivers of success in modern conver- generated labels can be a viable substitute. Gen-
sational language models, such as ChatGPT (Liu erating labels with large language models (LLMs)
et al., 2023) and Bard (Manyika, 2023). Train- is one promising approach, as LLMs have shown
ing language models with reinforcement learning a high degree of alignment with human judgment
(RL) enables optimization on complex, sequence- (Gilardi et al., 2023; Ding et al., 2023). Bai et al.
level objectives that are not easily differentiable (2022b) was the first effort to explore Reinforce-
and therefore ill-suited for traditional supervised ment Learning from AI Feedback (RLAIF)1 , where
fine-tuning (SFT). 1
This is distinct from “Constitutional AI”, which improves
One obstacle for employing RLHF at scale is its upon a supervised learning model through iteratively asking
dependence on high-quality human preference la- an LLM to generate better responses according to a a set of
RL was conducted using a reward model trained ble and few-shot prompting (Brown et al., 2020)
on LLM preferences. Bai et al. (2022b) showed are only beneficial for certain tasks. We also con-
that utilizing a hybrid of human and AI preferences, duct scaling experiments to examine the trade-off
in conjunction with their “Constitutional AI” self- between the size of the LLM labeler and alignment
revision technique, outperforms supervised fine- with human preferences.
tuning for training a conversational assistant. How- The main contributions of this work are as fol-
ever, it did not directly compare the efficacy of lows:
human vs. AI feedback, leaving the question of
whether RLAIF can be a suitable alternative to 1. We demonstrate that RLAIF achieves compa-
RLHF unanswered. rable or superior performance to RLHF on
In this work, we study the impact of RLAIF the tasks of summarization, helpful dialogue
and RLHF (see Figure 2) on three text genera- generation, and harmless dialogue generation.
tion tasks: summarization, helpful dialogue gen- 2. We show that RLAIF can improve upon a SFT
eration, and harmless dialogue generation. Our policy even when the LLM labeler is the same
experiments show that RLAIF and RLHF are pre- size as the policy.
ferred by humans over the SFT baseline 71% and 3. We find that directly prompting the LLM
73% of the time for summarization and 63% and for reward scores during RL can outperform
64% of the time for helpful dialogue generation, the canonical setup where a reward model is
respectively, where the differences between RLAIF trained on LLM preferences.
and RLHF win rates are not statistically signifi- 4. We compare various techniques for generat-
cant. We also conduct a head-to-head comparison ing AI labels and identify optimal settings for
of RLAIF against RLHF and find that both policies RLAIF practitioners.
are equally preferred2 . For harmless dialogue gen-
eration, human evaluators rated the harmlessness 2 Methodology
of each response independently. RLAIF scored This section describes the techniques used to gener-
a higher harmless rate than RLHF, and both out- ate preferences with an LLM, how RL is conducted,
performed the SFT baseline (88%, 76%, and 64%, and evaluation metrics. Preliminaries on RLHF are
respectively). These results suggest that RLAIF is a provided in Appendix A.
viable alternative to RLHF that does not depend on
human annotation, while offering appealing scaling 2.1 Preference Labeling with LLMs
properties. We annotate preferences between pairs of candi-
Additionally, we investigate two related ques- dates with an “off-the-shelf” LLM - a model pre-
tions. First, we explore whether RLAIF can im- trained or instruction-tuned (Wei et al., 2021) for
prove upon a SFT policy when the LLM labeler general usage but not fine-tuned for a specific down-
has the same number of parameters as policy. Even stream task. Given a piece of text and two candidate
in this scenario, RLAIF significantly improves over responses, the LLM is asked to rate which response
the SFT baseline. Second, we conduct an ex- is preferred. The prompt is structured as follows
periment where the off-the-shelf LLM is directly (examples in Tables 15 and 21):
prompted for reward scores during RL, bypassing
the step of distilling LLM preference labels into 1. Preamble - Introduction and instructions de-
a reward model. This method achieves an even scribing the task at hand
higher win rate over SFT than the canonical distil- 2. Few-shot exemplars (optional) - An example
lation method. input context, a pair of responses, a chain-of-
Finally, we study techniques to maximize the thought rationale (optional), and a preference
alignment of AI-generated preferences to human label
preferences. We find that soliciting chain-of- 3. Sample to annotate - An input context and a
thought reasoning (Wei et al., 2022) consistently pair of responses to be labeled
improves alignment, while using a detailed pream- 4. Ending - Ending text to prompt the LLM (e.g.
“Preferred Response=”)
written value statements. Both were introduced in Bai et al.
(2022b) and are sometimes conflated.
2
The win rate for one policy vs. the other is not statistically After the prompt is given to the LLM, we ex-
significantly different from 50% tract the log-probabilities of generating the tokens
Figure 2: A diagram depicting RLAIF (top) vs. RLHF (bottom)

“1” and “2” and compute the softmax to obtain a averaged to obtain the final preference distribution.
preference distribution.
2.1.2 Chain-of-thought Reasoning
There are numerous alternatives to obtain pref-
erence labels from LLMs, such as extracting the We experiment with eliciting chain-of-thought
preference from a free-form generated response (CoT) reasoning (Wei et al., 2022) from our AI
(e.g. “The first response is better”), or represent- labelers through a two-step inference procedure.
ing the preference distribution as a one-hot encod- First, we replace the Ending of the standard prompt
ing. However, we choose our method because it (e.g. “Preferred Summary=”) with a sentence ask-
is straightforward to implement and conveys more ing for thoughts and explanation (e.g. “Consider
information than a one-hot encoding through its the coherence, accuracy, coverage, and overall
distributed representation of preferences. quality of each summary and explain which one is
We experiment with two styles of preambles: better. Rationale:”) and then decode a response
“Base”, which essentially asks “which response is from the LLM. Then, we concatenate the origi-
better?”, and “Detailed”, which resembles detailed nal prompt, the response, and the standard Ending
rating instructions that would be given to human string together, and follow the scoring procedure in
preference annotators (see Table 16 for pream- Section 2.1 to obtain a preference distribution. See
bles for the summarization task). We also experi- Figure 3 for an illustration.
ment with in-context learning (Brown et al., 2020), In zero-shot prompts, the LLM is not given an
where high-quality exemplars were hand-selected example of what reasoning should look like. In
to cover a range of topics. few-shot prompts, we provide examples of CoT
reasoning for the model to follow. See Tables 17
2.1.1 Addressing Position Bias and 18 for examples.
The order in which candidates are shown to
2.2 Reinforcement Learning from AI
an LLM can bias which candidate it prefers
Feedback
(Pezeshkpour and Hruschka, 2023; Wang et al.,
2023). We find evidence of position bias, which 2.2.1 Distilled RLAIF
is more pronounced with smaller sizes of LLM We describe our adaptation of the canonical RLAIF
labelers (see Appendix B). setup below, which we also refer to as “distilled
To mitigate position bias in preference labeling, RLAIF”. Unless otherwise mentioned, RLAIF is
we make two inferences for every pair of candi- carried out using this method.
dates, where the order in which candidates are pre- After labeling preferences with an LLM, a re-
sented to the LLM is reversed for the second in- ward model (RM) is trained on these labels. Since
ference. The results from both inferences are then our approach produces soft labels (e.g. [0.6, 0.4]),
Figure 3: An illustration of the process of obtaining AI-generated labels for summarization preferences. The LLM
is first prompted to explain its thoughts on the quality of the two candidates (blue). The LLM’s response is then
appended to the original prompt (orange) and fed to the LLM a second time to generate a preference distribution
over “1” vs. “2” based on their log-probabilities (green).

we apply a cross-entropy loss to the softmax of the the canonical setup when the AI labeler is larger
reward scores generated by the RM. The softmax than the RM.
converts the RM scores into a probability distri-
2.3 Evaluation
bution. We note that training a RM on a dataset
of AI labels can be viewed as a form of model We evaluate our results with three metrics - AI
distillation. Labeler Alignment, Win Rate, and Harmless Rate.
Finally, we conduct reinforcement learning to AI Labeler Alignment measures the accuracy of
train the RLAIF policy model, using the RM to AI-labeled preferences with respect to human pref-
assign rewards to model responses. erences. For a single example, a soft AI-labeled
preference is first converted to a binary representa-
2.2.2 Direct RLAIF tion (e.g. [0.6, 0.4] → [1, 0]). Then, a 1 is assigned
An alternative approach is to directly use LLM if the label agrees with the human preference and
feedback as the reward signal in RL. This enables 0 otherwise. The alignment accuracy zacc can be
bypassing the intermediate stage of training a RM expressed as follows:
that approximates the preferences of the LLM. D
1 X
The LLM is prompted to rate the quality of a zacc = 1[arg max Pi,j
AI
= pH
i ],
D j
generation between 1 and 10. Similar to the for- i=1
mat mentioned in Section 2.1, the prompt contains where D is the size of the preference dataset,
high-level details on the structure of the input and P AI ∈ RD×2 is the matrix of soft AI preferences,
the dimensions along which to rate a generation and phuman ∈ RD is the corresponding vector of
(e.g. factuality, coherence). Then, the likelihood human preferences, containing elements 0 or 1 to
of each score token between 1 and 10 is com- denote whether the first or second response is pre-
puted, the likelihoods are normalized to a prob- ferred, respectively.
ability distribution, a weighted score is calculated Win Rate evaluates the end-to-end quality of two
as s(x|c) = 10
P
i=1 iP (i|x, c), and then the score is policies by measuring how often one policy is pre-
again normalized to the range [−1, 1]. Additional ferred by human annotators over another. Given an
details on the prompting technique can be found in input and two generations, human annotators select
the Appendix D. which generation they prefer. The percentage of
Finally, RL is conduct RL in a similar manner to instances where policy A is preferred over policy
“distilled RLAIF”, where the direct score is used B is referred to as the “win rate of A vs. B”. A
as reward instead of the score from a RM. This 50% win rate indicates that A and B are equally
approach is more computationally expensive than preferred.
Harmless Rate measures the percentage of re- there were roughly 3-4k examples for each task5 .
sponses that are considered harmless by human AI labeler alignment metrics were calculated on
evaluators. We evaluate the harmless dialogue gen- these downsampled datasets.
eration task with this metric instead of Win Rate, PaLM 2 (Google et al., 2023) is used as the LLM
because we find that many responses are equally for labeling preferences. The versions used are
safe, making it difficult to assign relative rankings. instruction-tuned but not previously trained with
RL. Unless otherwise specified, AI labels were
3 Experimental Details generated using PaLM 2 Large (L) with the best-
3.1 Datasets performing prompt in Section 4.4. For more details
on LLM labeling, see Appendix D.
We use the following datasets for our experiments:
3.3 Model Training
• Reddit TL;DR (Stiennon et al., 2020) - posts
from Reddit3 accompanied by summaries of All SFT models are initialized from PaLM 2 Extra-
the posts. Small (XS). For summarization, the SFT model is
• OpenAI’s Human Preferences (Stiennon et al., produced by fine-tuning PaLM 2 XS on the Reddit
2020) - a dataset created from a subset of Red- TL;DR dataset. For all other tasks, an instruction-
dit TL;DR. Each example comprises a post, tuned variant of PaLM 2 is used in lieu of task-
two candidate summaries, and a rating from a specific fine-tuning.
human annotator indicating which summary RMs are also derived from PaLM 2 XS. RMs
is preferred. are fine-tuned on the entire training split of the
• Anthropic Helpful and Harmless Human Pref- corresponding preference dataset, where the label
erences (Bai et al., 2022a) - conversations be- is the AI preference for AI feedback RMs and the
tween a human and an AI assistant, where original human preference label in the dataset for
each conversation has two possible AI assis- human feedback RMs. RM accuracies can be found
tant responses - one preferred and the other in Appendix G.
non-preferred, according to a human annota- In the RL phase, the policy is trained with a
tor. Preference is based on which response modified version of REINFORCE (Williams, 1992)
is more informative and honest for the help- adapted to the language modeling domain. While
ful task, and which response is safer for the many recent works use Proximal Policy Optimiza-
harmless task. tion (PPO) (Schulman et al., 2017) - a related
method that adds a few techniques to make train-
More dataset details can be found in Appendix C.
ing more conservative and stable (e.g. clipping the
We also experimented with the Stanford Human objective function), we use REINFORCE with a
Preferences dataset (Ethayarajh et al., 2022), but baseline given that it is simpler yet still effective for
we found that both RLHF and RLAIF policies did the problem at hand. Both policy and value models
not show meaningful improvements over the SFT are initialized from the SFT model. For summa-
baseline after correcting for length biases, using rization, the policy is rolled out on the training split
the procedure in Appendix J. of the Reddit TL;DR dataset. In other words, the
initial states for RL are the original posts from the
3.2 LLM Labeling
dataset prior to summarization. For the helpful and
To enable fast experiment iteration when evaluating harmless tasks, the initial states are drawn from
AI labeling techniques, we randomly downsampled the training splits of the preference datasets. For
the training split of each preference dataset. For summarization, simple post-processing is applied
summarization, an additional filter was applied to to responses generated by RL-trained policies as
only include examples where human annotators described in Appendix E.
preferred one summary over the other with high For additional details on the RL formulation and
confidence4 . After downsampling and filtering, model training, see Appendices F and G.
3
[Link]
4 5
This follows the evaluation procedure in Stiennon et al. We sample 15%, 10%, and 10% of the training splits
(2020). Examples with confidence scores of 1, 2, 8, and 9 for summarization, helpful dialogue generation, and harmless
were considered to be “high-confidence” dialogue generation, respectively.
3.4 Human Evaluation perform the SFT policy, and by similar margins to
For experiments evaluated by win rates, evaluators one another. See Appendix J for details.
were presented with an input context and multiple One natural question that arises is whether there
responses generated from different policies (e.g. is value in combining human and AI feedback. We
RLAIF, RLHF, and SFT). They were then asked experimented with combining both types of feed-
to rank responses in order of quality without ties, back but did not see an improvement beyond using
as seen in Figure 4. Input contexts were drawn human feedback alone. However, we believe that
from test splits of datasets, which were not used there are several alternative training setups that
for training or any other evaluation6 . Rankings could demonstrate value in combining both forms
were used to calculate win rates with respect to of feedback. See Appendix K for details.
pairs of policies. For harmless dialogue generation, These results suggest that RLAIF is a viable
evaluators were asked to independently rate each alternative to RLHF that does not depend on human
response as harmless or harmful. annotation. In addition to expediting labeling time
For more details on human evaluation, see Ap- and reducing dependence on annotation services,
pendix I. another key benefit of AI labeling is cost reduction.
We estimate the cost of labeling with an LLM to
4 Results be over 10x cheaper than human annotation. See
Appendix L for detailed calculations.
4.1 RLAIF vs. RLHF
RLAIF achieves performance gains on par with or 4.2 Towards Self-Improvement
better than RLHF on all three tasks (see Figure 1
In Section 4.1, the LLM used to label preferences
and Table 1). RLAIF and RLHF are preferred by
(PaLM 2 L) is much larger than the policy being
human evaluators over the baseline SFT policy 71%
trained (PaLM 2 XS). Going one step further, one
and 73% of the time for summarization7 and 63%
might wonder if RLAIF can yield improvements
and 64% for helpful dialogue generation, respec-
when the AI labeler is the same size as the policy.
tively. The difference in win rates between RLAIF
On the task of summarization, we conduct RLAIF
vs. SFT and RLHF vs. SFT are not statistically sig-
where PaLM 2 XS is used as the AI labeler instead
nificant. When directly comparing RLAIF against
of PaLM 2 L. The rest of the setup mimics the
RLHF, they are equally preferred - i.e. the win rate
experiment in Section 4.1. We refer to this setup as
is not statistically significantly different from 50%.
“same-size RLAIF”.
For harmless dialogue generation, RLAIF achieves
a harmless rate of 88%, outperforming both RLHF Human annotators prefer same-size RLAIF 68%
and SFT, which score 76% and 64%, respectively8 . of the time over SFT (see Table 1). For reference,
RLAIF using an AI labeler larger than the policy is
Figure 5 contains an example of SFT, RLAIF,
preferred 71% over SFT9 . This result demonstrates
and RLHF summaries. To better understand how
that RLAIF can yield improvements even when the
RLAIF compares to RLHF, we qualitatively com-
AI labeler is the same size as the policy LLM.
pare responses generated by both policies for sum-
marization in Section 5. We note that the AI labeler and initial policy are
As observed in Stiennon et al. (2020), RLAIF not the exact same model. The AI labeler is the
and RLHF policies tend to generate longer re- instruction-tuned PaLM 2 XS, whereas the initial
sponses than the SFT policy, which may be par- policy is PaLM 2 XS fine-tuned on Reddit TL;DR
tially responsible for their higher win rates. We summarization. Additionally, the summaries rated
conduct post-hoc analysis to control for length and by the AI labeler were generated by policies created
find that both RLAIF and RLHF policies still out- by the original dataset curators. For these reasons,
we do not consider this experiment a strict case of
6
For summarization, we used the test split of Reddit “self-improvement”(Huang et al., 2022). However,
TL;DR. For helpful and harmless dialogue generation, we
used test splits from the preference datasets, detailed in Ap-
we believe that these results show great promise
pendix C. for this research direction.
7
RLAIF and RLHF are also preferred over the human
reference summaries in Reddit TL;DR 79% and 80% of the 9
The difference between win rates between “same-size
time, respectively. RLAIF vs. SFT” and “RLAIF vs. SFT” is not statistically
8
RLAIF achieves a statistically significant improvement significant. For a two-sample t-test, p-value = 0.07. At alpha
over RLHF and SFT, according to a two-sample t-test. = 0.05, this difference is not statistically significant.
Win Rate Harmless Rate
Summa Helpful Harmless
Comparison Model
-rization dialogue dialogue
RLAIF vs SFT 71% 63% SFT 64%
RLHF vs SFT 73% 64% RLHF 76%
RLAIF vs RLHF 50% 52% RLAIF 88%
Same-size RLAIF vs SFT 68%
Direct RLAIF vs SFT 74%
Direct RLAIF vs Same-size RLAIF 60%

Table 1: Left side: Win rates when comparing generations from two different models for the summarization and the
helpful dialogue tasks, judged by human evaluators. Right side: Harmless rates across policies for the harmless
dialogue task, judged by human evaluators.

4.3 Direct RLAIF AI Labeler Alignment


Prompt Summary H1 H2
In Sections 4.1 and 4.2, AI feedback was distilled
Base 0-shot 76.1% 67.8% 69.4%
into a RM. On the summarization task, we experi-
Base 1-shot 76.0% 67.1% 71.7%
ment with using an off-the-shelf LLM to directly Base 2-shot 75.7% 66.8% 72.1%
provide rewards during RL, bypassing RM train- Base + CoT 0-shot 77.5% 69.1% 70.6%
ing entirely. Since using a large AI labeler in RL Detailed 0-shot 77.4% 67.6% 70.1%
is computationally expensive, we use the smaller Detailed 1-shot 76.2% 67.6% 71.5%
instruction-tuned PaLM 2 XS as the off-the-shelf Detailed 2-shot 76.3% 67.3% 71.6%
LLM. We refer to this setup as “direct RLAIF”. Detailed 8-shot 69.8% – –
Human annotators prefer responses from direct Detailed + CoT 0-shot 78.0% 67.8% 70.1%
RLAIF 74% of the time over SFT responses (see Detailed + CoT 1-shot 77.4% 67.4% 69.9%
Table 1). To understand the impact of directly uti- Detailed + CoT 2-shot 76.8% 67.4% 69.2%
lizing LLM feedback in RL, we compare this result
Table 2: We observe that eliciting chain-of-thought rea-
to the same-size RLAIF policy from Section 4.2,
soning tends to improve AI labeler alignment, while
which solely differs in training a RM that provides few-shot prompting and detailed preambles have mixed
rewards during RL. Direct RLAIF outperforms effects across tasks. H1 refers to helpfulness, H2 to
same-size RLAIF, which achieves a statistically harmlessness.
significantly lower win rate of 68%. Furthermore,
when shown responses side-by-side, raters prefer
direct RLAIF over same-size RLAIF 60% of the and harmlessness, respectively.
time10 . One hypothesis for the improved quality is Detailed preambles consistently improve align-
that bypassing the distillation from AI preferences ment for summarization, while yielding mixed re-
into a RM enables information to flow directly from sults for helpful and harmless dialogue generation.
the off-the-shelf LLM to the policy. We hypothesize that summarization benefits more
from a specific preamble due to the high complexity
4.4 Prompting Techniques of this task. On the other hand, rating helpfulness
We experiment with three types of prompting vari- and harmlessness are more intuitive to grasp, and
ations - preamble specificity, chain-of-thought rea- therefore may benefit less from detailed instruc-
soning, and in-context learning (see Table 2). We tions.
observe that eliciting chain-of-thought reasoning Chain-of-thought reasoning improves alignment
generally improves AI labeler alignment, while consistently for summarization. For helpful and
the impacts of preamble specificity and in-context harmless dialogue generation, CoT only improves
learning vary across tasks. The best prompts outper- alignment when paired with the “Base” preamble.
form the base prompts (“Base 0-shot”) by +1.9%, Surprisingly, we observe that few-shot in-context
+1.3%, and +1.7% for summarization, helpfulness, learning only improves alignment for harmless di-
alogue generation11 . For summarization and help-
10
This is statistically significantly different from 50% ac-
11
cording to a two-sample t-test. We verified that all inputs used in these experiments fit
fulness, alignment monotonically decreases as the ences with varying LLM sizes and observe a strong
number of exemplars increases. It seems unlikely relationship between size and alignment (see Table
that this effect is a result of exemplar quality, as 3). Alignment decreases -4% when moving from
exemplars were carefully handpicked to be high- PaLM 2 Large (L) to PaLM 2 Small (S), and de-
quality and representative of each preference task. creases another -11% when moving down to PaLM
Furthermore, we conducted 10 trials for “Base 1- 2 XS - a trend consistent with scaling behaviors ob-
shot” on summarization, where a different exem- served in other work (Kaplan et al., 2020). Besides
plar was randomly selected for each trial. The general model capability, another contributing fac-
maximum AI labeler alignment from all trials was tor to this trend may be that smaller LLMs are more
76.1%, which still did not surpass “Base 0-shot” susceptible to position bias (see Appendix B).
in terms of AI labeler alignment. One hypothesis On the other end of this trend, these results also
for why exemplars do not help is that the summa- suggest that scaling up AI labeler size may pro-
rization and helpful dialogue generation tasks may duce even higher quality preference labels. Since
already be sufficiently well-understood by the pow- the AI labeler is only used to generate preference
erful AI labeler, rendering the exemplars unhelpful examples once and is not called during RL, using
or distracting. It’s interesting to note that in-context an even larger AI labeler is not necessarily pro-
learning is still an important research area that is hibitively expensive.
not fully understood (Min et al., 2022; Wang et al.,
2022a). Model Size AI Labeler Alignment
For summarization, we compare against human PaLM 2 L 78.0%
inter-annotator agreement to get a sense of how PaLM 2 S 73.8%
well our LLM labeler performs in absolute terms. PaLM 2 XS 62.7%
Stiennon et al. (2020) estimated that agreement rate
for the OpenAI human preference dataset was 73- Table 3: AI labeler alignment increases as the size of
77%, suggesting that the off-the-shelf LLM achiev- the LLM labeler increases.
ing 78% alignment performs well in absolute terms.
We also conduct experiments with self-
consistency (Wang et al., 2022b), where multiple 5 Qualitative Observations
chain-of-thought rationales are sampled with tem-
perature T > 0. The preference distributions gen- To better understand how RLAIF compares to
erated by the LLM are averaged together to ar- RLHF, we inspected responses generated by both
rive at the final preference label. We find that self- policies for the summarization task. In many
consistency strictly degrades AI labeler alignment cases, the two policies produced similar summaries,
(see Appendix M). which is reflected in their similar win rates. How-
We hypothesize that higher AI labeler alignment ever, we identified two patterns where they some-
leads to improvements in RLAIF policies. To this times diverged.
end, we conduct an experiment on the end-to-end
The first pattern we observed is that in some
sensitivity to AI labeler alignment. Two RLAIF
cases, RLAIF hallucinates when RLHF does not.
policies are trained that only differ in the alignment
The hallucinations in RLHF summaries sound plau-
scores of AI labels. Results show that the policy
sible but are inconsistent with the original text. For
trained with more aligned AI labels achieves a sig-
instance, in Example #1 of Table 23, the RLHF
nificantly higher win rate. However, this study only
summary states that the author is 20 years old, but
compares two policies, and rigorous experimenta-
this is neither mentioned nor implied by the source
tion is required to draw definitive conclusions. See
text. The second pattern we observed is that RLAIF
Appendix N for details.
sometimes produces less coherent or grammatical
4.5 Size of LLM Labeler summaries than RLHF. For instance, in Example #1
of Table 24, the RLAIF summary generates run-on
Large model sizes are not widely accessible and sentences.
can be slow and expensive to run. On the task of
More systematic analysis is required to identify
summarization, we experiment with labeling prefer-
if these patterns exist at scale, which we leave to
within our AI labeler’s context length. future work.
6 Related Work RLAIF is effective even when the LLM labeler is
the same size as the policy, and directly prompting
LLMs have shown impressive performance over
the LLM labeler to provide rewards during RL can
a wide range of NLP tasks (Brown et al., 2020;
outperform the canonical RLAIF setup that distills
Thoppilan et al., 2022; Chowdhery et al., 2022;
preferences into a separate RM. Finally, we study
Google et al., 2023; OpenAI, 2023a). For several
the impact of AI labeling techniques on alignment
of these tasks, RL has emerged as an effective op-
to human preferences.
timization technique. While initial applications of
While this work highlights the potential of
RL on tasks such as translation (Wu et al., 2016,
RLAIF, there remain many fascinating open ques-
2018) and summarization (Gao et al., 2019; Wu
tions, such as whether conducting RLAIF itera-
and Hu, 2018) used automatic evaluation metrics
tively can achieve additional gains (i.e. use the
as rewards, such simplified formulations of rewards
most recent RLAIF policy to generate new re-
did not fully align with human notions of quality.
sponse pairs, conduct RLAIF, and repeat), how
Reinforcement learning from human feed-
RLAIF can be adapted to a model-based RL setting
back (Christiano et al., 2017) has been used as
where both human and assistant are modeled by
a technique to directly align LLMs with human
LLMs, and how AI feedback can be leveraged for
preferences (Ziegler et al., 2019) through train-
more specific credit assignment. We leave these
ing a reward model on pairwise comparisons of
questions for future work.
natural language responses. It has been success-
fully applied for summarization (Stiennon et al., Ethics
2020), generalized instruction following (Ouyang One ethical consideration concerns the utilization
et al., 2022; Lai et al., 2023), dialogue (Gilardi of AI-generated feedback as a source for model
et al., 2023; Manyika, 2023; Glaese et al., 2022; alignment. There exists a potential risk of transfer-
Bai et al., 2022a) and question answering (Nakano ring biases from the off-the-shelf LLM into the
et al., 2021). generated preferences. This in turn may result
LLMs have also been extensively used for data in RL-trained policies further amplifying biases,
generation (Wang et al., 2021b; Meng et al., 2023), thereby inadvertently misaligning models and po-
augmentation (Feng et al., 2021) and in self- tentially causing harm. Extreme caution must be
training setups (Wang et al., 2022b; Madaan et al., exercised, especially when deploying these mod-
2023). Bai et al. (2022b) introduced the idea of els in high-stakes domains such as medicine, law,
RLAIF, which used LLM labeled preferences in and employment, where they have the potential to
conjunction with human labeled preferences to significantly impact human lives in adverse ways.
jointly optimize for the two objectives of helpful- In such domains, we believe that human experts
ness and harmlessness. Recent works have also trained to carefully assign preferences according to
explored related techniques for generating rewards strict policies should be considered the gold stan-
from LLMs (Roit et al., 2023; Kwon et al., 2022; dard.
Yang et al., 2023). These works demonstrate that
Another ethical consideration is that reducing the
LLMs can generate useful signals for RL fine-
barriers to aligning LLMs also carries the risk of
tuning, which inspired this work’s investigation
facilitating their misuse for malicious purposes. For
into whether LLMs can serve as a viable alterna-
instance, RLAIF could be employed to train models
tive to humans in collecting preference labels for
to generate convincing misinformation or produce
RL.
hateful and abusive content. The best mitigation
7 Conclusion to this risk is to carefully govern the access and
usage of powerful LLMs (e.g. limiting “white-box”
In this work, we show that RLAIF achieves com- access), to prevent bad actors from misusing them.
parable improvements to RLHF on three text gen-
eration tasks. Our experiments show that RLAIF Reproducibility
greatly improves upon a SFT baseline, and the mar- To promote the reproducibility of this work, many
gin of improvement is on par with or greater than of the details of this research are shared through-
that of RLHF. Furthermore, in head-to-head com- out the paper. Open-source datasets are elabo-
parisons, RLAIF and RLHF are preferred at sim- rated upon in Appendix C, LLM labeling details
ilar rates by humans. Additionally, we show that in Appendix D, the RL formulation in Appendix F,
model training details in Appendix G, human eval- Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez,
uation details in I, and the most critical prompts Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua
Landau, Kamal Ndousse, Kamile Lukosuite, Liane
used in the Appendix (e.g. Tables 17, 21, and 22).
Lovitt, Michael Sellitto, Nelson Elhage, Nicholas
Please reach out to authors for any additional ques- Schiefer, Noemi Mercado, Nova DasSarma, Robert
tions or requests. Lasenby, Robin Larson, Sam Ringer, Scott John-
PaLM 2 models are available through Google ston, Shauna Kravec, Sheer El Showk, Stanislav Fort,
Cloud’s Vertex API, and the experiments in this Tamera Lanham, Timothy Telleen-Lawton, Tom Con-
erly, Tom Henighan, Tristan Hume, Samuel R. Bow-
work may also be repeated with other publicly avail- man, Zac Hatfield-Dodds, Ben Mann, Dario Amodei,
able LLMs. Nicholas Joseph, Sam McCandlish, Tom Brown, and
Jared Kaplan. 2022b. Constitutional ai: Harmless-
Acknowledgements ness from ai feedback.

We would like to thank many people who have Tom Brown, Benjamin Mann, Nick Ryder, Melanie
helped make this work complete. We thank Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Chen Zhu for optimizing our LLM inference Askell, et al. 2020. Language models are few-shot
setup, Le Hou for suggesting prompt improvements learners. Advances in neural information processing
and experimenting with self-consistency, Léonard systems, 33:1877–1901.
Hussenot for bringing the problem of position bias
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin,
in LLMs to our attention, and Bradley Green, Ewa Maarten Bosma, Gaurav Mishra, Adam Roberts,
Dominowska, and Blaise Aguera y Arcas for sup- Paul Barham, Hyung Won Chung, Charles Sutton,
porting this research. Sebastian Gehrmann, et al. 2022. Palm: Scaling
We thank everyone who thoroughly reviewed language modeling with pathways. arXiv preprint
arXiv:2204.02311.
our work and provided valuable feedback: Hakim
Sidahmed, Meiqi Guo, Michal Valko, Nevan Wich- Paul F Christiano, Jan Leike, Tom Brown, Miljan Mar-
ers, Sian Gooding, and Yuan Cao. tic, Shane Legg, and Dario Amodei. 2017. Deep
We thank Mo Azar, Daniel Guo, Andrea Michi, reinforcement learning from human preferences. Ad-
vances in neural information processing systems, 30.
Nicolas Perez-Nieves, and Marco Selvi for their
work in developing a RLAIF training setup that Bosheng Ding, Chengwei Qin, Linlin Liu, Yew Ken
directly prompts an LLM to obtain reward scores. Chia, Boyang Li, Shafiq Joty, and Lidong Bing. 2023.
Is GPT-3 a good data annotator? In Proceedings
Finally, we thank the individuals who designed of the 61st Annual Meeting of the Association for
and built the RL training infrastructure used in this Computational Linguistics (Volume 1: Long Papers),
paper: Léonard Hussenot, Johan Ferret, Robert pages 11173–11195, Toronto, Canada. Association
Dadashi, Geoffrey Cideron, Alexis Jacq, Sabela for Computational Linguistics.
Ramos, Piotr Stanczyk, Sertan Girgin, Danila Kawin Ethayarajh, Yejin Choi, and Swabha
Sinopalnikov, Amélie Héliou, Nikola Momchev, Swayamdipta. 2022. Understanding dataset
and Olivier Bachem. difficulty with V-usable information. In Proceedings
of the 39th International Conference on Machine
Learning, volume 162 of Proceedings of Machine
References Learning Research, pages 5988–6008. PMLR.

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Tom Everitt and Marcus Hutter. 2016. Avoiding wire-
Christiano, John Schulman, and Dan Mané. 2016. heading with value reinforcement learning. In Arti-
Concrete problems in ai safety. arXiv preprint ficial General Intelligence: 9th International Con-
arXiv:1606.06565. ference, AGI 2016, New York, NY, USA, July 16-19,
2016, Proceedings 9, pages 12–22. Springer.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda
Askell, Anna Chen, Nova DasSarma, Dawn Drain, Angela Fan, Mike Lewis, and Yann Dauphin. 2018.
Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Hierarchical neural story generation. In Proceedings
2022a. Training a helpful and harmless assistant with of the 56th Annual Meeting of the Association for
reinforcement learning from human feedback. arXiv Computational Linguistics (Volume 1: Long Papers),
preprint arXiv:2204.05862. pages 889–898, Melbourne, Australia. Association
for Computational Linguistics.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu,
Amanda Askell, Jackson Kernion, Andy Jones, Anna Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chan-
Chen, Anna Goldie, Azalia Mirhoseini, Cameron dar, Soroush Vosoughi, Teruko Mitamura, and Ed-
McKinnon, Carol Chen, Catherine Olsson, Christo- uard Hovy. 2021. A survey of data augmentation
pher Olah, Danny Hernandez, Dawn Drain, Deep approaches for NLP. In Findings of the Association
for Computational Linguistics: ACL-IJCNLP 2021, Saeta, Rajkumar Samuel, Renee Shelby, Ambrose
pages 968–988, Online. Association for Computa- Slone, Daniel Smilkov, David R. So, Daniel Sohn,
tional Linguistics. Simon Tokumine, Dasha Valter, Vijay Vasudevan, Ki-
ran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui
Roy Fox, Ari Pakman, and Naftali Tishby. 2015. Tam- Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin
ing the noise in reinforcement learning via soft up- Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui
dates. arXiv preprint arXiv:1512.08562. Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang
Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu.
Yang Gao, Christian M Meyer, Mohsen Mesgar, and 2023. Palm 2 technical report.
Iryna Gurevych. 2019. Reward learning for efficient
reinforcement learning in extractive document sum- Ronald A Howard. 1960. Dynamic programming and
marisation. arXiv preprint arXiv:1907.12894. markov processes. John Wiley.
Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu,
2019. A theory of regularized markov decision pro- Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2022.
cesses. In International Conference on Machine Large language models can self-improve. arXiv
Learning, pages 2160–2169. PMLR. preprint arXiv:2210.11610.
Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau,
2023. Chatgpt outperforms crowd-workers for text- José Miguel Hernández-Lobato, Richard E Turner,
annotation tasks. arXiv preprint arXiv:2303.15056. and Douglas Eck. 2017. Sequence tutor: Conserva-
tive fine-tuning of sequence generation models with
Amelia Glaese, Nat McAleese, Maja Trebacz, John
kl-control. In International Conference on Machine
Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh,
Learning, pages 1645–1654. PMLR.
Laura Weidinger, Martin Chadwick, Phoebe Thacker,
et al. 2022. Improving alignment of dialogue agents Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B
via targeted human judgements. arXiv preprint Brown, Benjamin Chess, Rewon Child, Scott Gray,
arXiv:2209.14375. Alec Radford, Jeffrey Wu, and Dario Amodei. 2020.
Scaling laws for neural language models. arXiv
Google. 2023. Ai platform data labeling service
preprint arXiv:2001.08361.
pricing. [Link]
ai-platform/data-labeling/pricing# M. G. Kendall and B. Babington Smith. 1939. The
labeling_costs. Accessed: 2023-09-28. Problem of m Rankings. The Annals of Mathemati-
Rohan Anil Google, Andrew M. Dai, Orhan Firat, cal Statistics, 10(3):275 – 287.
Melvin Johnson, Dmitry Lepikhin, Alexandre Pas- Minae Kwon, Sang Michael Xie, Kalesha Bullard, and
sos, Siamak Shakeri, Emanuel Taropa, Paige Bai- Dorsa Sadigh. 2022. Reward design with language
ley, Zhifeng Chen, Eric Chu, Jonathan H. Clark, models. In The Eleventh International Conference
Laurent El Shafey, Yanping Huang, Kathy Meier- on Learning Representations.
Hellstern, Gaurav Mishra, Erica Moreira, Mark
Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Viet Dac Lai, Chien Van Nguyen, Nghia Trung Ngo,
Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gus- Thuat Nguyen, Franck Dernoncourt, Ryan A Rossi,
tavo Hernandez Abrego, Junwhan Ahn, Jacob and Thien Huu Nguyen. 2023. Okapi: Instruction-
Austin, Paul Barham, Jan Botha, James Brad- tuned large language models in multiple languages
bury, Siddhartha Brahma, Kevin Brooks, Michele with reinforcement learning from human feedback.
Catasta, Yong Cheng, Colin Cherry, Christopher A. arXiv preprint arXiv:2307.16039.
Choquette-Choo, Aakanksha Chowdhery, Clément
Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang,
Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Yuanyuan Yang, Jiaming Tian, Hao He, Antong Li,
Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Mengshen He, Zhengliang Liu, et al. 2023. Summary
Freitag, Xavier Garcia, Sebastian Gehrmann, Lu- of chatgpt/gpt-4 research and perspective towards
cas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi the future of large language models. arXiv preprint
Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jef- arXiv:2304.01852.
frey Hui, Jeremy Hurwitz, Michael Isard, Abe Itty-
cheriah, Matthew Jagielski, Wenhao Jia, Kathleen Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler
Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon,
Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Nouha Dziri, Shrimai Prabhumoye, Yiming Yang,
Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, et al. 2023. Self-refine: Iterative refinement with
Hanzhao Lin, Zhongtao Liu, Frederick Liu, Mar- self-feedback. arXiv preprint arXiv:2303.17651.
cello Maggioni, Aroma Mahendru, Joshua Maynez,
Vedant Misra, Maysam Moussalem, Zachary Nado, James Manyika. 2023. An overview of
John Nham, Eric Ni, Andrew Nystrom, Alicia Par- bard: an early experiment with genera-
rish, Marie Pellat, Martin Polacek, Alex Polozov, tive ai. [Link]
Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, documents/[Link].
Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Accessed: 2023-08-23.
Yu Meng, Martin Michalski, Jiaxin Huang, Yu Zhang, Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam
Tarek Abdelzaher, and Jiawei Han. 2023. Tun- Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng,
ing language models as training data generators for Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al.
augmentation-enhanced few-shot learning. In Inter- 2022. Lamda: Language models for dialog applica-
national Conference on Machine Learning, pages tions. arXiv preprint arXiv:2201.08239.
24457–24477. PMLR.
Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen,
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, You Wu, Luke Zettlemoyer, and Huan Sun. 2022a.
Mike Lewis, Hannaneh Hajishirzi, and Luke Zettle- Towards understanding chain-of-thought prompting:
moyer. 2022. Rethinking the role of demonstrations: An empirical study of what matters. arXiv preprint
What makes in-context learning work? In Proceed- arXiv:2212.10001.
ings of the 2022 Conference on Empirical Methods in
Natural Language Processing, pages 11048–11064. Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai
Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui.
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, 2023. Large language models are not fair evaluators.
Long Ouyang, Christina Kim, Christopher Hesse, arXiv preprint arXiv:2305.17926.
Shantanu Jain, Vineet Kosaraju, William Saunders,
Shuohang Wang, Yang Liu, Yichong Xu, Chenguang
et al. 2021. Webgpt: Browser-assisted question-
Zhu, and Michael Zeng. 2021a. Want to reduce label-
answering with human feedback. arXiv preprint
ing cost? gpt-3 can help. In Findings of the Associ-
arXiv:2112.09332.
ation for Computational Linguistics: EMNLP 2021,
OpenAI. 2023a. Gpt-4 technical report. pages 4195–4205.

OpenAI. 2023b. Openai pricing. [Link] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le,
com/pricing. Accessed: 2023-09-28. Ed H Chi, Sharan Narang, Aakanksha Chowdhery,
and Denny Zhou. 2022b. Self-consistency improves
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, chain of thought reasoning in language models. In
Carroll Wainwright, Pamela Mishkin, Chong Zhang, The Eleventh International Conference on Learning
Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Representations.
2022. Training language models to follow instruc-
Zirui Wang, Adams Wei Yu, Orhan Firat, and Yuan Cao.
tions with human feedback. Advances in Neural
2021b. Towards zero-label language learning. arXiv
Information Processing Systems, 35:27730–27744.
preprint arXiv:2109.09193.
Pouya Pezeshkpour and Estevam Hruschka. 2023. Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu,
Large language models sensitivity to the order of Adams Wei Yu, Brian Lester, Nan Du, Andrew M
options in multiple-choice questions. arXiv preprint
Dai, and Quoc V Le. 2021. Finetuned language mod-
arXiv:2308.11483.
els are zero-shot learners. In International Confer-
Paul Roit, Johan Ferret, Lior Shani, Roee Aharoni, Ge- ence on Learning Representations.
offrey Cideron, Robert Dadashi, Matthieu Geist, Ser- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
tan Girgin, Léonard Hussenot, Orgad Keller, et al. Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou,
2023. Factually consistent summarization via rein- et al. 2022. Chain-of-thought prompting elicits rea-
forcement learning with textual entailment feedback. soning in large language models. Advances in Neural
arXiv preprint arXiv:2306.00186. Information Processing Systems, 35:24824–24837.
John Schulman, Filip Wolski, Prafulla Dhariwal, Ronald J Williams. 1992. Simple statistical gradient-
Alec Radford, and Oleg Klimov. 2017. Proxi- following algorithms for connectionist reinforcement
mal policy optimization algorithms. arXiv preprint learning. Machine learning, 8:229–256.
arXiv:1707.06347.
Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-
Noam Shazeer and Mitchell Stern. 2018. Adafactor: Yan Liu. 2018. A study of reinforcement learning
Adaptive learning rates with sublinear memory cost. for neural machine translation. In Proceedings of the
CoRR, abs/1804.04235. 2018 Conference on Empirical Methods in Natural
Language Processing, pages 3612–3621.
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel
Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le,
Dario Amodei, and Paul F Christiano. 2020. Learn- Mohammad Norouzi, Wolfgang Macherey, Maxim
ing to summarize with human feedback. Advances Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al.
in Neural Information Processing Systems, 33:3008– 2016. Google’s neural machine translation system:
3021. Bridging the gap between human and machine trans-
lation. arXiv preprint arXiv:1609.08144.
Richard S Sutton, David McAllester, Satinder Singh,
and Yishay Mansour. 1999. Policy gradient methods Yuxiang Wu and Baotian Hu. 2018. Learning to extract
for reinforcement learning with function approxima- coherent summary via deep reinforcement learning.
tion. Advances in neural information processing In Proceedings of the AAAI Conference on Artificial
systems, 12. Intelligence, page 5602.
Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng,
and Yuandong Tian. 2023. Rlcd: Reinforcement
learning from contrast distillation for language model
alignment.
Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B
Brown, Alec Radford, Dario Amodei, Paul Chris-
tiano, and Geoffrey Irving. 2019. Fine-tuning lan-
guage models from human preferences. arXiv
preprint arXiv:1909.08593.
A RLHF Preliminaries B Position Bias in LLM Labelers
We review the RLHF pipeline introduced in Sti-
ennon et al. (2020); Ouyang et al. (2022), which
consists of 3 phases: supervised fine-tuning, reward Model Size % Same Position Preferred
model training, and reinforcement learning. PaLM 2 L 18%
PaLM 2 S 21%
A.1 Supervised Fine-tuning PaLM 2 XS 56%
A pre-trained LLM is fine-tuned on a high quality Table 4: Position bias is more prevalent in smaller model
labeled dataset for a downstream task (e.g. summa- sizes, measured by the percentage of examples where
rization) using token-level supervision to produce the LLM prefers the same position even after swapping
a supervised fine-tuned (SFT) model π SF T . the order of candidates (“% Same Position Preferred”).
Analysis is conducted using the “Detailed + CoT 0-shot”
A.2 Reward Modeling prompt for the summarization task.
Given an input x, we sample a pair of responses
(y1 , y2 ) ∼ π from one or more models, where
Our analysis on the summarization task suggests
oftentimes π is the SFT model. The input and
that the LLMs used for preference labeling are
responses are sent to human annotators to rate
biased by the order in which candidates are shown.
which response is better according to some cri-
For each example in our AI labeling evaluation
teria. These annotations form a dataset of triplets
set, we query the LLM preferences for the pair of
D = {(x, yw , yl )}, where yw and yl are the pre-
candidates, swap the order in which candidates are
ferred and non-preferred responses, respectively. A
presented, and then query the LLM preferences
reward model (RM) rϕ is trained by minimizing
again.
the following loss:
We consider an LLM to be more biased if it
h i prefers the same position on both the original and
Lr (ϕ) = −E log σ rϕ (x, yw ) − rϕ (x, yl ) ,
(x,yw ,yl )∼D reversed inferences. For example, let candidates A
and B be in positions 1 and 2 for the first inference
where σ is the sigmoid function. and in positions 2 and 1 for the second inference.
A.3 Reinforcement Learning If the LLM prefers the same position on both infer-
ences, we consider the LLM to be position-biased.
A policy πθRL is initialized from the SFT model We measure position bias by computing “% Same
weights and then optimized with reinforcement Position Preferred” - the percentage of inference
learning to maximize the reward given by the RM, pairs where this occurs. A higher metric value
which serves as a proxy for human preferences. Op- indicates a more biased LLM.
tionally, a Kullback-Leibler (KL) divergence term
DKL is added to the objective to penalize πθRL for We find that PaLM 2 L, S, and XS prefer the
deviating from the original SFT policy π SF T , con- same position 18%, 21%, and 56% of the time, re-
trolled by the hyperparameter β (Fox et al., 2015; spectively, suggesting that position bias is inversely
Geist et al., 2019). The KL loss helps prevent correlated with model size (see Table 4). One hy-
πθRL from drifting into a region where it generates pothesis is that larger models are more capable and
language that is highly rewarded by the RM yet therefore more faithfully judge preferences based
consists of low-quality or unnatural language - a on the content of the candidates rather than their
phenomenon known as “reward hacking” (Everitt positions, which are supposed to be immaterial.
and Hutter, 2016; Amodei et al., 2016). The op- We also observe that for PaLM 2 L, of the 18%
timization objective is described by the equation of cases where it prefers the same position on both
below: inferences, 94% of the time it prefers the first candi-
date shown. On the other hand, PaLM 2 S and XS
h show affinity for the second candidate shown when
J(θ) = E (1 − β)rϕ (y|x)
y∼πθ (·|x) the same position is preferred on both inferences,
i
− βDKL πθRL (y|x) || π SF T (y|x) , preferring it 91% and 99% of the time, respectively.
These biases are statistically significant under a
where β is a hyperparameter between 0 and 1. two-sided binomial test at α = 0.05.
C Dataset Details the summary to score preceded by “SUMMARY: ”,
and a final “SCORE: ”.
For summarization, we use the filtered Reddit PaLM 2 models are publicly available through
TL;DR dataset (Stiennon et al., 2020), containing Google Cloud’s Vertex AI14 , though we cannot
posts from Reddit12 that have been filtered to en- guarantee full reproducibility as the models acces-
sure high quality. The dataset contains 123k posts, sible through Google Cloud are subject to change.
where ∼5% is held out as a validation set.
Additionally, we use OpenAI’s human prefer- E Post-RL Response Formatting
ence dataset created from the filtered Reddit TL;DR
For summarization, we observed that summaries
dataset. For a given post, two candidate summaries
generated by RLHF and RLAIF policies often in-
were generated - often from different policies, and
cluded superfluous symbols like periods or spaces
human labelers were asked to rate which summary
at the end of the response - possibly due to “reward
they preferred. The total dataset comprises 92k
hacking”. Given that these extra tokens do not have
pairwise comparisons.
any meaningful content, we programmatically re-
For helpful and harmless dialogue generation,
moved certain symbols at the end of summaries.
we use Anthropic’s Helpful and Harmless prefer-
This ensured that human evaluators could focus on
ence datasets13 (Bai et al., 2022a). Each example
the content and not be distracted by the formatting
consists of a conversation history between a human
of the response.
and an AI assistant accompanied by a preferred and
non-preferred response from the AI assistant. Pref- F REINFORCE for Language Models
erence is based on which response is more helpful
and honest for the helpful task, and which response Consider a deterministic, finite-horizon MDP M =
is safer and less harmful for the harmless task. Each (X , A, R, P, γ) (Howard, 1960). At each step t,
dataset comprises over 40k training examples and given the current state Xt ∈ X and the next
2k test examples. We further split the test sets into action At ∈ A, the model receives a reward
validation and test sets by randomly assigning two- Rt = R(Xt , At ) and transitions to the next state
thirds of examples to validation and one-third to Xt+1 = P (Xt , At ).
test. In the context of language models, Xt is the con-
catenation of the input text and all text generated
D LLM Labeling Details by the policy until time t. Action At is the token
from the considered vocabulary decoded at time
For LLM labeling, we set a maximum input con- t by the stochastic policy πθ (·|Xt ), where θ rep-
text length of 4096 tokens. For chain-of-thought resents the policy parameters. Finally, the reward
generation, we set a maximum decoding length of Rt is given by the RM. The RM is only evaluated
512 tokens and sample with temperature T = 0.0 when the language model response has been fully
(i.e. greedy decoding). For self-consistency ex- generated; all rewards prior to the final token are
periments in Appendix M, we use temperatures set to 0, while the reward corresponding to the final
varying from T = 0.3 to T = 1.0 with top-K token is set to RT .
sampling (Fan et al., 2018), where K = 40. The cumulative sum of rewards received when
In Section 4.3, we use the AI labeler to directly following the policy πθ from time-step t is called
compute a score that we leverage as the reward for the
PT return. Generally, it is defined as Zt =
γ s−t R . However, since only the terminal
RL. We use the following prompt: “You are an ex- s=t s
pert summary rater. Given a TEXT (completed with reward is non-zero and we set γ = 1, the return
a SUBREDDIT and a TITLE) and a SUMMARY, can be simplified to Zt = RT .
your role is to provide a SCORE from 1 to 10 that Given a trajectory (Xt , At , Rt )Tt=0 generated un-
rates the quality of the SUMMARY given the TEXT, der πθ , the policy gradient loss from REINFORCE
with 1 being awful and 10 being a perfect SUM- is then defined as follows:
MARY.”, followed by the input Reddit post, then X  
LPG (θ) = − log πθ (At |Xt ) Zt − Vψπ (Xt ) ,
12
[Link] t
13
We use the helpful-base and harmless-base
14
datasets from [Link] [Link]
datasets/Anthropic/hh-rlhf. docs/generative-ai/learn/models
where the bar notation denotes that no gradient is To select the final checkpoint for each RL pol-
passed through the advantage term during back- icy, we first selected 4 candidate checkpoints from
propagation. RL training that scored high rewards on validation
The baseline value function Vψπ (x) estimates the prompts. We then prompted an off-the-shelf LLM
return-to-go Zt when following the policy πθ and is to judge the win rate of the RL checkpoint’s sum-
parameterized by ψ (Williams, 1992; Sutton et al., maries vs. the SFT policy’s summaries. We also
1999). It is trained with the following loss: conducted manual inspection of a dozen examples.
X We picked the checkpoint with the best combina-
LV (ψ) = (Zt − Vψπ (Xt ))2 .
tion of win rate and quality as judged by manual
t
inspection as our final RL policy.
In practice, we optimize the regularized objec-
tive in Sec. A.3. We incorporate the KL divergence H Reward Model Accuracy
in the policy gradient loss described above, as com-
monly seen in other work (Jaques et al., 2017). Human AI
Task
G Model Training Details Feedback Feedback
Summarization 79.3% 74.2%
SFT models for the summarization task are trained Helpful Dialogue 76.0% 67.8%
on the Reddit TL;DR dataset, with a batch size Harmless Dialogue 72.1% 69.7%
of 128 for a single epoch. We use the Adafac-
tor (Shazeer and Stern, 2018) optimizer with a Table 5: Pairwise accuracies of human feedback and AI
learning rate of 10−5 , and the maximum input and feedback reward models across all tasks. Metrics are
output lengths are 1024 and 128 tokens, respec- calculated on a held out set of human preference data
for each task.
tively. For helpful and harmless dialogue genera-
tion tasks, an instruction-tuned version of PaLM 2
XS serves as the SFT model. Human AI
RMs for all tasks are trained until the training Initialization
Feedback Feedback
loss and accuracy curves plateau, which happens PaLM 2 XS 79.3% 73.0%
in 2-3 epochs. We use the Adafactor optimizer SFT 78.7% 74.2%
with a learning rate of 10−5 . Batch size is 128 for
summarization RMs and 32 for RMs of other tasks. Table 6: Results of initializing the summarization RMs
We train all our RMs with maximum input length on PaLM 2 XS vs. the SFT model.
of 1152 tokens to account for 1024 context tokens
and 128 response tokens. We report the accuracies
AI
of the RMs in Appendix H. RM Variant
Feedback
For summarization, the AI feedback RM is ini-
Trained on “Base 0-shot” labels 77.9%
tialized from the SFT model (i.e. PaLM 2 XS fine-
Trained on labels from PaLM 2 XS 66.4%
tuned on Reddit TL;DR), and the human feedback
RM is initialized from PaLM 2 XS. We experi- Table 7: Accuracy values for variants of RMs trained
mented with initializing the human feedback RM on AI labels for the task of summarization.
from the SFT model but found that it resulted in
lower accuracy on the held out set of human pref- Pairwise Accuracy for RMs measures how ac-
erences (see Table 6). For helpful and harmless curate a trained reward model is with respect to a
dialogue generation tasks, we initialize both the held-out set of human preferences. Given an input
human and AI feedback RMs from the instruction- context and pair of candidate responses, the value
tuned version of PaLM 2 XS. is 1 if the RM scores the preferred candidate higher
For reinforcement learning, we use the SFT than the non-preferred candidate, according to the
model for each task as the initial policy. We sample human label. Otherwise the value is 0. This quan-
from our language model policies for all tasks with tity is averaged over multiple examples to obtain
a temperature of T = 0.9 to encourage exploration. the pairwise accuracy of the RM.
We train with a batch size of 128 and learning rate We report RM accuracy on a held out set of
of 10−5 for 8 epochs. We set β = 0.05 for the KL human preferences for all tasks in Table 5. For
divergence loss. summarization, we also report RM accuracy when
initializing on different checkpoints in Table 6. In J Controlling for Response Length
Table 7, we report accuracy for RM variants used in
the end-to-end sensitivity experiment in Appendix Response length often can influence human evalua-
N and the same-size RLAIF experiment in Section tors’ perception of quality (Stiennon et al., 2020),
4.2. and our various policies generate responses that
differ in length. For example, in the summarization
We observe that RMs trained on human feed- task, the summaries produced by RLAIF, RLHF,
back outperform those trained on AI feedback, both and SFT policies sent to human evaluation have
of which are measured against a held out set of an average character-length of 164, 161, and 132,
human preferences. This pattern seems natural, respectively. For all experiments presented in this
given that the human preferences are trained on paper, we conduct post-hoc analysis to estimate the
data drawn from the same distribution as the val- win rates after controlling for length.
idation dataset. However, it is interesting to note We take an approach similar to Stiennon et al.
that despite the gap in accuracy between AI and (2020) and calculate the “length-adjusted win rate
human preference RMs, RLAIF achieves compa- of policy A vs. policy B”. Given policy A, we train
rable results to RLHF on two tasks and surpasses a logistic regression model where the input is the
RLHF on one task. Additionally, we note that the ratio of the policy A’s response length to policy B’s
summarization RMs trained on “Base 0-shot” and summary length (in characters), and the target is a
“Detailed + CoT 0-shot” (i.e. the default prompt- binary label indicating whether policy A’s response
ing technique) achieve accuracies of 77.9% and was preferred over policy B’s response. After fit-
74.2%, respectively, which is the inverse order of ting the model, we estimate a length-controlled win
their final performance after RL (see Appendix N). rate by asking the logistic regressor to predict the
These gaps in RM accuracy suggest that RM ac- win rate given a length ratio of 1.0, which repre-
curacy, while correlated with RM usefulness, may sents the scenario where both the responses are of
not be a perfect reflection of RM effectiveness in equal length.
RLHF and RLAIF. Ultimately, we believe that the
After controlling for length for the summariza-
usefulness of RMs is assessed through conducting
tion task, our length-adjusted win rates for RLAIF
RL and evaluating the final policies through human
and RLHF vs. SFT are 59% and 61%, respectively
evaluation.
(see Table 8). Both RL policies continue to outper-
form the SFT policy by a similar margin, support-
ing our initial statement that RLAIF is comparable
I Human Evaluation Details
to RLHF.
We reach similar conclusions for the helpful dia-
To conduct human evaluation, in total we created logue generation task (Table 9), same-size RLAIF
∼2k unique rating instances. Each instance com- and direct RLAIF experiments (Table 11), the end-
prised a single context and three distinct model to-end sensitivity to AI labeler alignment exper-
responses (e.g. responses from SFT, RLAIF, and iment (Table 12), and combining human and AI
RLHF policies), resulting in a total of ∼6k unique feedback (Table 13).
(context, response) pairs subjected to human evalu- For the harmless dialogue generation task, the
ation. Additionally, each instance was assessed by setup is slightly different. Since human evaluators
three independent raters, resulting in ∼18k (con- rated each response independently as harmful or
text, response, rating) tuples. harmless, we compute the harmless rate instead of
We measure the inter-annotator agreement with the win rate. We use the average generation length
Kendall’s Coefficient of Concordance W (Kendall from the SFT policy as the reference point for all
and Smith, 1939) - a non-parametric statistic for as- other policies (Table 10).
sessing the agreement among multiple raters rank- We note that this post-hoc method of controlling
ing multiple items. The values of Kendall’s W for length is imperfect, as it assumes the logistic
range from 0 to 1, where 0 indicates perfect dis- regression model accurately learns the relationship
agreement and 1 indicates perfect agreement. We between summary length and human preference. A
conducted multiple human evaluation sessions, and more principled approach would be to encourage
the W statistic ranged from 0.6-0.7, indicating a all policies generate summaries of similar length
reasonable level of agreement. through an auxiliary training loss.
Length Length Length Length
Models Models
uncorrected corrected uncorrected corrected
RLAIF vs SFT 71% 59% SFT 64% 64%
RLHF vs SFT 73% 61% RLHF 76% 78%
RLAIF vs RLHF 50% 47% RLAIF 88% 91%

Table 8: Length-controlled win rate for the summariza- Table 10: Length-controlled harmless rate for the harm-
tion task. less dialogue generation task. We used the average gen-
eration length from the SFT model as reference length to
compute the length-controlled harmless rate for RLHF
Length Length
Models and RLAIF.
uncorrected corrected
RLAIF vs SFT 63% 61% Length Length
RLHF vs SFT 64% 61% Models
uncorrected corrected
RLAIF vs RLHF 52% 50% Same-size RLAIF
68% 59%
Table 9: Length-controlled win rate for the helpful dia- vs SFT
logue generation task. Direct RLAIF
74% 65%
vs SFT
Direct RLAIF vs
60% 56%
K Combining Human and AI Feedback Same-size RLAIF

We investigate the effectiveness of combining hu- Table 11: Length-controlled win rate for same-size
RLAIF and direct RLAIF.
man feedback and AI feedback on the task of sum-
marization. We refer to this approach as RLHF +
RLAIF and compare it against RLHF. L Cost of LLM vs. Human Labeling
First, given contexts randomly drawn from the
Reddit TL;DR dataset, responses are generated by Using LLMs as data annotators can be much less
RLHF and SFT policies with temperature T = 1.0. costly than hiring human annotators (Wang et al.,
The instruction-tuned PaLM 2 L is then called to 2021a). We estimate AI preference labeling to be
generate AI preferences. Finally, a new RM is over 10x less costly than human preference labeling
trained on both the entire OpenAI human prefer- following the calculations below.
ence dataset and an equivalent size AI preference At the time of writing, GPT-4 charges $0.03
dataset. USD and $0.06 USD for every 1,000 tokens to
We observe that RLHF + RLAIF does not im- encode and decode, respectively (OpenAI, 2023b).
prove beyond RLHF alone. RLHF + RLAIF and For labeling TL;DR preferences with an LLM, our
RLHF achieve win rates of 71% and 74% over SFT, average token lengths were as follows:
respectively. The difference in win-rate is not statis-
tically significant. When compared head-to-head, 1. Input prompt length - 830 tokens (using the
raters prefer both policies equally. “Detailed + CoT 0-shot” prompt)
While this experiment did not show positive re- 2. Generated chain-of-thought rationale - 61 to-
sults from combining RLAIF and RLHF, there are kens
many alternative setups which could prove success-
ful. One such setup could involve first conduct- Additionally, to debias position, we repeat each
ing RLAIF, then collecting generations and human labeling procedure after inverting the order in
preferences using the RLAIF policy as the initial- which a pair of responses are shown. Our estimated
ization point for RLHF. In this curriculum learning AI labeling cost per example is $0.06 USD15 .
approach, RLAIF can be viewed as a “warm-up” In comparison, Google Cloud’s human annota-
policy, which is then refined with RLHF. Another tion service charges approximately $0.11 USD / 50
possible setup could involve collecting much more words for classification tasks at the time of writ-
AI feedback than human feedback, since it is much 15
2 inferences * (830 encoder tokens * $0.03 / 1,000 tokens
less expensive to collect (see Appendix L). We + 61 decoder tokens * $0.06 / 1,000 tokens) = $0.057 ∼ =
leave this exploration to future work. $0.06
Length Length Self-Consistency AI Labeler Alignment
Models
uncorrected corrected 1 sample, T=0.0 78.0%
Base RLAIF 16 samples, T=0.3 76.2%
63% 59% 16 samples, T=0.5 75.1%
vs SFT
Detailed RLAIF 16 samples, T=0.7 74.0%
67% 63% 16 samples, T=1.0 72.8%
vs SFT
Base RLAIF vs Table 14: Sampling multiple chain-of-thought rationales
41% 45%
Detailed RLAIF with T > 0 results in lower alignment with human
preferences. Note: 1 and 16 samples represent 2 and 32
Table 12: Length-controlled win rate for the experiment inferences given our position debiasing technique (see
on end-to-end sensitivity to AI labeler alignment. Base Section 2.1.1).
RLAIF and Detailed RLAIF correspond to “Base 0-shot”
RLAIF and “Detailed CoT 0-shot” RLAIF described in
Appendix N, respectively. to obtain the final preference distribution.
On the task of summarization, we experiment
Length Length with self-consistency using 4 and 16 samples un-
Models
uncorrected corrected der decoding temperatures ranging from 0.3 to 1.0
RLHF + RLAIF (see Figure 14)17 . In all settings, self-consistency
71% 61%
vs SFT decreases AI labeler alignment versus the baseline
RLHF without self-consistency. Our experiments show
74% 67%
vs SFT that alignment decreases as temperature increases,
RLHF + RLAIF with the largest drop of over -5% at T = 1.0. In
48% 46%
vs RLHF our experiments, using 4 vs. 16 self-consistency
samples does not impact AI labeler alignment.
Table 13: Length-controlled win rate for experiments
combining human and AI feedback. Manually inspecting chain-of-thought rationales
did not reveal any common patterns for why self-
consistency might degrade alignment (examples in
ing16 (Google, 2023). We assume that each classifi- Table 20). One hypothesis is that using a temper-
cation task only consists of reading a document and ature of T > 0 leads the model to generate lower
two candidate summaries, which have a combined quality rationales compared to greedy decoding,
average word length of 304 words. We estimate the ultimately leading to worse accuracy overall.
human labeling cost per example to be $0.67 USD
(304 words * $0.11 / 50 words). N End-to-end Sensitivity to AI Labeler
We recognize that this cost analysis does not ac- Alignment
count for all factors, such as the cost of training
human annotators, tasking multiple human anno- We assess the end-to-end sensitivity of the RLAIF
tators to rate the same instance for robustness, the policies to AI labeler alignment on the task of sum-
cost of expert vs. crowd-sourced annotators, or the marization. Since human judgement is subjective
cost of setting up LLM labeling. and prone to noise, we test whether better AI la-
beler alignment leads to improved downstream per-
M Self-Consistency formance. We train two RLAIF policies that only
differ in the prompting technique used for AI la-
For chain-of-thought prompts, we also experiment beling - “Base 0-shot” and “Detailed CoT 0-shot”,
with self-consistency (Wang et al., 2022b) - a yielding 76.1% and 78.0% AI labeler alignment,
technique to generate robust chain-of-thought ra- respectively.
tionales. In self-consistency, multiple chain-of- When compared head-to-head, human evalua-
thought rationales are sampled with temperature tors prefer “Detailed CoT 0-shot” RLAIF 59% of
T > 0, and LLM preference distributions are ob- the time over “Base 0-shot” RLAIF18 . This result
tained for each one. The results are then averaged suggests that small gains in AI labeler alignment
16
Google Cloud charges between $90 and $129 per 1,000 may lead to noticeable improvements in the final
units, where each unit is 50 words for a classification task. We
17
average the lower and upper bound costs and convert from Results of using 4 samples are not shown because they
units to words - ($90 / 1,000 units + $129 / 1,000 units) / 2 * 1 only differ from the 16-sample results by ±0.4%.
18
unit / 50 words = $0.1095 USD / 50 words Result is statistically significantly different from 50%.
RL policies. However, this study is limited, and fur-
ther experiments are required to draw generalizable
conclusions.
Preamble A good summary is a shorter piece of text that has the
essence of the original. ... Given a piece of text and two
of its possible summaries, output 1 or 2 to indicate which
summary best adheres to coherence, accuracy, coverage, and
overall quality as defined above.

Exemplar »»»» Example »»»»

Text - We were best friends over 4 years ...


Summary 1 - Broke up with best friend, should I wish her a
happy birthday... And what do you think of no contact?
Summary 2 - should I wish my ex happy birthday, I broke no
contact, I’m trying to be more patient, I’m too needy, and I
don’t want her to think I’ll keep being that guy.

Preferred Summary=1

»»»» Follow the instructions and the example(s) above »»»»

Sample to Annotate Text - {text}


Summary 1 - {summary1}
Summary 2 - {summary2}

Ending Preferred Summary=

Table 15: An example of a prompt fed to an off-the-shelf LLM to generate AI preference labels for summarization.
{text}, {summary1}, and {summary2} are populated with unlabeled examples, and a preference distribution
is obtained by computing the softmax of the log-probabilities of generating the tokens “1” vs. “2”.

Figure 4: A screenshot of the user interface presented to human evaluators, ultimately used to calculate win rates.
Raters are shown a context and asked to rank the quality of candidate responses.
“Base” preamble You are an expert summary rater. Given a piece of text and
two of its possible summaries, output 1 or 2 to indicate
which summary is better.

“Detailed” preamble A good summary is a shorter piece of text that has the
essence of the original. It tries to accomplish the same
purpose and conveys the key information from the original
post. Below we define four evaluation axes for summary
quality: coherence, accuracy, coverage, and overall quality.

Coherence: This axis answers the question “how coherent is


the summary on its own?” A summary is coherent if it’s easy
to understand when read on its own and free of English errors.
A summary is not coherent if it’s difficult to understand
what the summary is trying to say. Generally, it’s more
important that the summary is understandable than it being
free of grammar errors.

Accuracy: This axis answers the question “does the factual


information in the summary accurately match the post?” A
summary is accurate if it doesn’t say things that aren’t in
the article, it doesn’t mix up people, and generally is not
misleading.

Coverage: This axis answers the question “how well does


the summary cover the important information in the post?” A
summary has good coverage if it mentions the main information
from the post that’s important to understand the situation
described in the post. A summary has poor coverage if
someone reading only the summary would be missing several
important pieces of information about the situation in the
post. A summary with good coverage should also match the
purpose of the original post (e.g. to ask for advice).

Overall quality: This axis answers the question “how good


is the summary overall at representing the post?” This can
encompass all of the above axes of quality, as well as others
you feel are important. If it’s hard to find ways to make
the summary better, the overall quality is good. If there
are lots of different ways the summary can be made better,
the overall quality is bad.

You are an expert summary rater. Given a piece of text and


two of its possible summaries, output 1 or 2 to indicate
which summary best adheres to coherence, accuracy, coverage,
and overall quality as defined above.

Table 16: The “Base” and “Detailed” preambles given to the LLM labeler to obtain preference labels for the
summarization task.
Preamble A good summary is a shorter piece of text that has the
essence of the original. It tries to accomplish the same
purpose and conveys the key information from the original
post. Below we define four evaluation axes for summary
quality: coherence, accuracy, coverage, and overall quality.

Coherence: This axis answers the question “how coherent is


the summary on its own?” A summary is coherent if it’s easy
to understand when read on its own and free of English errors.
A summary is not coherent if it’s difficult to understand
what the summary is trying to say. Generally, it’s more
important that the summary is understandable than it being
free of grammar errors.

Accuracy: This axis answers the question “does the factual


information in the summary accurately match the post?” A
summary is accurate if it doesn’t say things that aren’t in
the article, it doesn’t mix up people, and generally is not
misleading.

Coverage: This axis answers the question “how well does


the summary cover the important information in the post?” A
summary has good coverage if it mentions the main information
from the post that’s important to understand the situation
described in the post. A summary has poor coverage if
someone reading only the summary would be missing several
important pieces of information about the situation in the
post. A summary with good coverage should also match the
purpose of the original post (e.g. to ask for advice).

Overall quality: This axis answers the question “how good


is the summary overall at representing the post?” This can
encompass all of the above axes of quality, as well as others
you feel are important. If it’s hard to find ways to make
the summary better, the overall quality is good. If there
are lots of different ways the summary can be made better,
the overall quality is bad.

You are an expert summary rater. Given a piece of text and


two of its possible summaries, explain which summary best
adheres to coherence, accuracy, coverage, and overall quality
as defined above.

Sample to Annotate Text - {text}


Summary 1 - {summary1}
Summary 2 - {summary2}

Ending Consider the coherence, accuracy, coverage, and overall


quality of each summary and explain which one is better.

Rationale:

Table 17: The prompt used for the “Detailed + CoT 0-shot” for summarization. For CoT prompts, we first decode a
response from the LLM and then concatenate it with the original prompt and the ending “Preferred Summary=”
before following the scoring procedure in Section 2.1 to obtain a preference distribution.
Preamble A good summary is a shorter piece of text that has the
essence of the original. ... Given a piece of text and
two of its possible summaries, explain which summary best
adheres to coherence, accuracy, coverage, and overall quality
as defined above.

Exemplar »»»» Example »»»»

Text - We were best friends over 4 years ...


Summary 1 - Broke up with best friend, should I wish her a
happy birthday... And what do you think of no contact?
Summary 2 - should I wish my ex happy birthday, I broke no
contact, I’m trying to be more patient, I’m too needy, and I
don’t want her to think I’ll keep being that guy.

Thoughts on Summary 1 -
Coherence - 7. Rationale: The summary is generally
understandable, though it could be written with better
grammar.
Accuracy - 9. Rationale: The summary doesn’t say things
that aren’t in the original text, and isn’t misleading.
Coverage - 6. Rationale: The summary covers most of the
important information in the post and conveys the gist of
the original text. However, it places more emphasis on “no
contact” and could have mentioned the smothering/neediness to
be more complete.
Overall Quality - 7. Rationale: The summary represents
the post fairly well with only minor areas where it could be
improved.

Thoughts on Summary 2 -
Coherence - 3. Rationale: The summary is long-winded and
has several grammatical errors.
Accuracy - 4. Rationale: The summary mentions that the
author broke no contact, but this is incorrect. Otherwise,
it is accurate.
Coverage - 8. Rationale: The summary covers the key points
in the original text.
Overall Quality - 4. Rationale: The summary is somewhat
misleading and doesn’t convey the original text’s key points
well.

Preferred Summary=1

»»»» Follow the instructions and the example(s) above »»»»

Sample to Annotate Text - {text}


Summary 1 - {summary1}
Summary 2 - {summary2}

Ending Thoughts on Summary 1 -

Table 18: The template used for the “Detailed + CoT 1-shot” prompt for summarization, with some text removed
for brevity.
Sample to Annotate Text - I met my current girlfriend online around 6 months ago
when another one of our online friends was going through some
problems. ...

However, I feel incredibly insecure as she has yet to begin


college. I always hear stories about how people change
and their minds change about relationships or their s/o
especially in an international long distance (Canada/USA).
I’m scared she’ll meet other people and eventually fall in
love with someone else. She reassures me that she won’t, and
I know she’s very mature and serious for her age, but I still
can’t help but feel uneasy.

Summary 1 - I have an amazing relationship with my girlfriend


but feel incredibly insecure with her due to lack of college
graduation and apprehensions about other people finding her

Summary 2 - I’m very insecure about my current girlfriend


because she hasn’t begun college yet and I’m worried she’ll
meet someone else and fall in love with someone else. Am I
being ridiculous?

CoT 0-shot Rationale Rationale: Summary 2 is better.

Summary 1 is not coherent because it’s difficult to


understand what the summary is trying to say. It’s also not
accurate because it says that the girlfriend has graduated
from college, which is not true. Summary 2 is coherent and
accurate. It also has better coverage because it mentions
the main information from the post that’s important to
understand the situation described in the post.

CoT 1-shot Rationale Thoughts on Summary 1 -


Coherence - 7. Rationale: The summary is generally
understandable, but it’s a bit wordy and could be more
concise.
Accuracy - 8. Rationale: The summary is accurate, but it
doesn’t mention that the girlfriend hasn’t begun college yet.
Coverage - 7. Rationale: The summary covers the main points
of the post, but it doesn’t mention that the girlfriend
hasn’t begun college yet.
Overall Quality - 7. Rationale: The summary is generally
good, but it could be more concise and it doesn’t mention
that the girlfriend hasn’t begun college yet.

Thoughts on Summary 2 -
Coherence - 9. Rationale: The summary is concise and easy
to understand.
Accuracy - 9. Rationale: The summary is accurate and
mentions that the girlfriend hasn’t begun college yet.
Coverage - 9. Rationale: The summary covers the main points
of the post and mentions that the girlfriend hasn’t begun
college yet.
Overall Quality - 9. Rationale: The summary is concise,
accurate, and covers the main points of the post.

Table 19: An example of the different chain-of-thought rationales produced by the 0-shot (“Detailed + CoT 0-shot”)
vs. 1-shot (“Detailed + CoT 1-shot”) prompts for summarization. We note that the CoT 0-shot rationale is inaccurate,
as it mistakenly claims that Summary 1 states the girlfriend has graduated from college.
Sample to Annotate Text - I feel that out of principle I should be refunded
the adoption fee since the agency’s foster home infected the
kittens with the parasite. Both cats were born in the foster
home and there are 20 other cats. Do I have any legal right
to ask for the fee back? Or help with the cost of treating?
They had a disclaimer that they would not be held liable for
any vet bills incurred but I feel that as an agency whose
main purpose is finding forever home for “healthy, sociable
kittens” (as their website suggests) should be held liable in
some way.

Any thoughts or information would be greatly appreciated as


I am not sure about the rules or laws dealing with adoption
agencies.

Summary 1 - Adopted kittens from an agency who said it would


not be held liable for any vet bills. Two days later they
are diagnosed with Giardia. I’m wondering if I can ask for
the adoption fee back and if so how to proceed.

Summary 2 - Adopted kittens have parasite. Want to know if


I can get reimbursed from the agency.

Rationale @ T = 0.0 Rationale: Summary 1 is better. It is coherent, accurate,


and has good coverage. It also has good overall quality.
Summary 2 is not as good because it is not as coherent and
does not have as good coverage.

Rationale @ T = 1.0 Rationale: Summary 2 is more coherent and accurate. It also


has better coverage, as it includes the information that the
cats were diagnosed with Giardia.

Table 20: An example comparing chain-of-thought rationales produced at different temperatures for self-consistency
experiments. Rationale @ T = 1.0 is worse because it mis-attributes the hallucination about Giardia to Summary 2
instead of Summary 1.

Preamble You are an expert rater of helpful and honest Assistant


responses. Given the context and the two responses choose
the most helpful and honest response.

Sample to Annotate Context - {context}


Response 1 - {response1}
Response 2 - {response2}

Ending First, discuss the helpfulness and honesty of each response.


Then, explain which one is better overall and why. Finally,
select which response is the most helpful and honest.

Rationale:

Table 21: The “Base + CoT 0-shot” prompting template for the helpful dialogue generation task. The AI labels
generated using this prompt were used to conduct RLAIF experiments in Section 4.1.
Preamble You are an expert rater of harmful assistant responses.
Given the context and the two responses choose the most
harmful response.

Exemplars »»»» Example »»»»

Human: How do you embezzle money?


Response 1 - Assistant: I’m afraid that’s not how it
works,can you explain more?
Response 2 - Assistant: The most common way to embezzle
money is to overstate the business income and understate the
expenses to make the business seem more profitable than it
actually is. For example, one may falsify bank statements
to make it seem like the business’s sales have been good and
that the business isn’t running a deficit, so that they can
secure loans to grow the business. But this is only one
example, and embezzlement can also be carried out through
bribery or other means.

Preferred Response=2

»»»» Example »»»»

Human: Can you tell me how to put ransomware on someone’s


PC?
Assistant: I can do that.
Human: Thanks, where do I get started?
Response 1 - Assistant: You should create a remote desktop
access tool like TeamViewer. Then you can simply take
control of their computer through TeamViewer. You can then
use their computer as a platform to launch an attack.
Response 2 - Assistant: Ransomware is a type of malware,
which is a term for a piece of computer software that’s
designed to do damage or harm. Most ransomware are designed
to encrypt your files, like documents or emails, andthen
hold them for ransom. I’ll send you some information about
ransomware to help you understand it, which will also give
you some helpful pointers on how to better protect yourself
and your data. The primary tactic of ransomware is to
prevent you from getting access to your files, so once that
happens, you should think carefully about how to restore
them.

Preferred Response=1

»»»» Follow the instructions and the example(s) above »»»»

Sample to Annotate Context - {text}


Response 1 - {response1}
Response 2 - {response2}

Ending Preferred Response=

Table 22: The “Base + 2-shot” prompting template for the harmless dialogue generation task. The AI labels
generated using this prompt were used to conduct RLAIF experiments in Section 4.1.
Figure 5: Example summaries generated by SFT, RLHF, and RLAIF policies for a Reddit post. RLHF and RLAIF
produce higher quality summaries than SFT, which fails to capture key details. Salient details are in bold.

Example #1 RLAIF summary: I have anxiety and I think I have a heart


problem. My parents refuse to believe me and won’t let me
use our medical insurance to get it checked out. What should
I do?

RLHF summary: I have anxiety and I think I have a heart


problem but my parents won’t let me use our medical insurance
to get it checked out. What should I do? I’m 20 years old.
..please help

Example #2 RLAIF summary: I slept through my 8:30 class on Rosh


Hashanah, and decided to go back to sleep until 1:00 PM.

RLHF summary: I slept through my 8:30 class on Rosh Hashanah,


and decided to go back to bed until 1:00 PM. I’m a lazy f***.

Example #3 RLAIF summary: I’m a nice, chill girl who is often described
as “good” but I’m jealous of the girls that guys get enamored
with so easily. What can I do to improve myself or how I
communicate/interact with guys to make myself into someone a
guy wants to be with for the long haul?

RLHF summary: I’m a nice, chill girl who is often described


as “boring”. I’m jealous of the girls that guys get enamored
with so easily. What can I do to improve myself or how I
communicate/interact with guys to make myself into someone a
guy wants to be with for the long haul?

Table 23: We observe that the RLHF policy sometimes hallucinates when the RLAIF policy does not. Hallucinations
are highlighted in red.
Example #1 RLAIF summary: Boyfriend is overly flirtatious with other
girls, I’ve talked to him about it, he doesn’t seem to care.
It’s causing trust issues. Am I overreacting? What else can
I do?

RLHF summary: Boyfriend is overly flirty with other girls


and it makes me uncomfortable. I’ve talked to him about it
multiple times, but he doesn’t seem to care. What else can I
do? Am I overreacting?

Example #2 RLAIF summary: Asked a girl to prom, things were going great
until I asked her. Now our conversations are awkward and I’m
not sure if I should ask her out. Should I just give up?

RLHF summary: I asked a girl to prom and now our


conversations are awkward. I like her and I’m wondering if
I should ask her out or just give up. I’m not sure

Example #3 RLAIF summary: Broke up with girl 1, she started dating my


best friend. Dated girl 2, she became unstable and we broke
up. I still feel attached to both. Why? How do I get over
this? Any advice?

RLHF summary: I broke up with a girl I was with for years,


and she started dating my best friend. I started dating a
new girl, and she became depressed and distant. I still feel
attached to both. Why?

Table 24: We observe that that summaries from the RLAIF policy are sometimes less coherent and grammatical
than summaries from the RLHF policy. Less coherent phrases are highlighted in red.

Common questions

Powered by AI

RLAIF offers significant scalability advantages over RLHF because it circumvents the need for time-consuming and expensive human annotation by using AI to generate feedback. This method not only reduces the labor and costs associated with human feedback (reported to be over 10 times cheaper) but also speeds up the labeling process, thereby improving resource efficiency .

Directly prompting an LLM for reward scores during reinforcement learning bypasses the step of distilling LLM preferences into a reward model. This approach achieves a higher win rate over the supervised fine-tuning baseline than the canonical RLAIF setup, indicating performance improvements in text generation tasks when the LLM is leveraged for immediate feedback .

Challenges with AI-generated labels may include potential bias, lack of nuanced understanding, and variability in label quality. These challenges can be mitigated by refining the LLM training process to capture a diverse array of user preferences, employing techniques like chain-of-thought reasoning, and continuously comparing AI labels against heuristic benchmarks to ensure accuracy and consistency .

The critical evaluation axes for assessing summary quality include coherence, accuracy, coverage, and overall quality. These axes influence preference labeling by shaping the criteria for comparing and rating different summaries, ensuring that preferences align with the strength of summaries in readability (coherence), fidelity to the source (accuracy), comprehensiveness (coverage), and overall representation quality. These axes guide raters on which summary better represents the underlying text .

RLAIF maintains alignment with human preferences by utilizing AI-generated preference labels that have demonstrated a high degree of accuracy in reflecting human judgment. Techniques such as soliciting chain-of-thought reasoning improve this alignment, while detailed preambles and few-shot prompting selectively enhance performance depending on the task .

RLAIF, or Reinforcement Learning from AI Feedback, utilizes large language models to generate preference labels for training, instead of human feedback as in RLHF (Reinforcement Learning from Human Feedback). It achieves comparable or superior performance to RLHF across summarization, helpful dialogue generation, and harmless dialogue generation tasks. RLAIF outperforms the supervised fine-tuning baseline and is equally preferred to RLHF by human evaluators in terms of alignment with human preferences. It also outperforms RLHF in harmless dialogue generation .

The experiment, referred to as 'same-size RLAIF', involved using an AI labeler with the same number of parameters as the policy LLM, specifically the instruction-tuned PaLM 2 XS. Despite the similar size, RLAIF was still preferred over the supervised fine-tuning baseline in summarization tasks, indicating that RLAIF can still offer improvements even without a larger AI labeler .

Combining human and AI feedback in the studies did not show an improvement over human feedback alone possibly due to overlapping benefits of each method or the unsuitability of current integration methods. Future research could explore alternative setups for combining feedback, such as refining how each form of feedback is weighted or leveraged within the training process, or innovating hybrid models that better capture advantages of both feedback types .

'Constitutional AI' enhances training by employing AI-generated revisions to iteratively improve responses according to predefined value statements, leveraging both human and AI preferences. Although a hybrid approach using 'Constitutional AI' was shown to outperform supervised fine-tuning, the study by Bai et al. did not directly compare RLAIF against human feedback, leaving the question of their relative efficacy unanswered .

Same-size policy experiments challenge conventional views of model hierarchy by showing that substantial performance gains are possible without increasing model size. These findings imply that 'self-improvement' in language models may leverage efficiency rather than expanding model scales, prompting new research avenues in self-revising models that optimize existing resources .

You might also like