
Towards Reliable Alignment: Uncertainty-aware RLHF
Debangshu Banerjee$^{1}$ and Aditya Gopalan$^{2}$
arXiv:2410.23726v1 [[Link]] 31 Oct 2024

$^{1,2}$ Department of Electrical and Communication Engineering, Indian Institute of Science, India

November 1, 2024

Abstract
Recent advances in aligning Large Language Models with human preferences
have benefited from larger reward models and better preference data. However,
most of these methodologies rely on the accuracy of the reward model. The reward
models used in Reinforcement Learning with Human Feedback (RLHF) are typi-
cally learned from small datasets using stochastic optimization algorithms, making
them prone to high variability. We illustrate the inconsistencies between reward
models empirically on numerous open-source datasets.
We theoretically show that the fluctuation of the reward models can be detri-
mental to the alignment problem because the derived policies are more overfitted to
the reward model and, hence, are riskier if the reward model itself is uncertain. We
use concentration of measure to motivate an uncertainty-aware, conservative algo-
rithm for policy optimization. We show that such policies are more risk-averse in
the sense that they are more cautious of uncertain rewards. We theoretically prove
that our proposed methodology has less risk than the vanilla method.
We corroborate our theoretical results with experiments based on designing an
ensemble of reward models. We use this ensemble of reward models to align a
language model using our methodology and observe that our empirical findings
match our theoretical predictions.

1 Introduction
Reinforcement Learning with Human Feedback (RLHF) (Christiano et al., 2017; Ziegler
et al., 2019) is an influential training approach in modern artificial intelligence research,
particularly in the domain of large language models (LLMs). Notable examples in-
clude the revolutionary ChatGPT (OpenAI, 2023), Claude (Anthropic, 2023), Gem-
ini (Team et al., 2023) and LLaMA-3 (Meta, 2024). RLHF is a fine-tuning method
to align the behavior of LLMs with human values and preferences. It has been instru-
mental in addressing challenges related to model alignment, where the goal is to ensure
that an AI system adheres to specific ethical, safety, and utility guidelines defined by its
human users. The standard reward-model RLHF framework (Ouyang et al., 2022; Bai
et al., 2022b; Touvron et al., 2023) assumes a preference model based on an underlying
reward model to accurately capture human preferences. The reward model is trained to
predict how well a given response aligns with preferences provided by human evaluators,
thus acting as a proxy for human judgment. It then serves as the reward signal in downstream
reinforcement learning to improve the LLM.

Prompt:
Who created the Superman cartoon character?

Response:
Superman, the iconic comic book superhero,
was created by writer Jerry Siegel and artist
Joe Shuster. Superman first appeared in Action
Comics #1, which was published by Detective
Comics, Inc. (later DC Comics) in June
1938. The character’s immense popularity
established him as one of the most enduring
and recognizable figures in the superhero genre.

Figure 1: Reward scores assigned by 10 reward models on the same prompt-response pair. The reward
models are identical in that they are trained independently on the same dataset, with the same hyperparameters
and number of epochs. Despite this, we see a wide variation in the score assigned by each model.

Challenges of Reward Model Reliability A critical issue in RLHF is the reliability
of the learned reward model. For example, look at Figure 1, which shows the reward
score assigned to the same prompt-response pair by 10 independently trained identical
reward models on the same preference data. Several factors contribute to the uncer-
tainty and potential unreliability of the reward model:
• Limited Dataset Size: The reward model is typically trained on a much smaller
dataset than the vast corpora used to pre-train the LLM. For instance, while an
LLM may be pre-trained on billions of tokens, the reward model might be trained
on a few hundred thousand human-labeled prompt-response pairs. This discrep-
ancy in the data scale can limit the generalization capability of the reward model,
leading to noisy estimates of response quality.
• Stochastic, Incomplete Optimization: The reward model is trained using stochas-
tic gradient descent (SGD) or variants, introducing inherent randomness into the
optimization process. Using mini-batches of data means that different instances
of the reward model, even when trained on the same dataset, may produce dif-
ferent evaluations of the same response due to the randomness in parameter up-
dates. This stochasticity can result in high variance in the model’s predictions.
Additionally, the optimization process to find a reward model is not completed –
typically 1 or 2 passes over the dataset (Stiennon et al., 2020; Meta, 2024) – to
avoid overfitting.
Thus, a single reward model should not be viewed as an infallible oracle for as-
sessing response quality. Its predictions are inherently uncertain, leading to challenges
when fine-tuning the LLM. Overfitting the LLM to a noisy reward model can result in

degraded performance, as the model may learn to optimize for the idiosyncrasies of the
reward model rather than true human preferences.

Contributions We enumerate the contributions made in this work:

1. We provide comprehensive empirical evidence using open-source datasets to
demonstrate the variability inherent in reward modeling.
2. We introduce a conservative policy optimization method incorporating uncer-
tainty measures derived from reward model training.
3. We rigorously demonstrate, through theoretical analysis and experiments on LLMs,
that our risk-aware conservative policy scheme significantly reduces the likeli-
hood of policy degradation.

RLHF preliminaries The standard RLHF setup (Christiano et al., 2017; Ziegler
et al., 2019) is described as follows. Given a prompt $x$, the LLM generates two responses,
$y^1$ and $y^2$. A human evaluator selects the preferred response, forming a
dataset of the form $\{(x_i, y_i^1, y_i^2)\}_{i=1}^{n}$, where $x_i$ is the prompt, and $y_i^1, y_i^2$ are
model-generated responses (in the loss below, $y_i^1$ denotes the preferred one). These pairwise
comparisons encode ordinal preferences, used to train the reward model. The reward model,
$r_\theta$, assigns a scalar reward to each prompt-response pair $(x, y)$, reflecting its likelihood
of being preferred. The Bradley-Terry model (Bradley and Terry, 1952) estimates the probability
that $y^1$ is preferred over $y^2$ as
$P(y^1 \text{ is preferred over } y^2) = \sigma\big(r_\theta(x, y^1) - r_\theta(x, y^2)\big)$, where
$\sigma(z) = \frac{1}{1 + e^{-z}}$ is the logistic sigmoid function. The reward model is trained by
minimizing the negative log-likelihood of human preferences:

$$\min_\theta \; \frac{1}{n} \sum_{i=1}^{n} -\ln \sigma\big(r_\theta(x_i, y_i^1) - r_\theta(x_i, y_i^2)\big).$$


This loss function seeks to adjust the parameters θ of the reward model such that the
predicted rewards for preferred responses are consistently higher than those for less
preferred responses, as judged by human evaluators. Using the Bradley-Terry model
ensures that the reward model produces outputs that align with human feedback. Once
trained, the reward model is used to fine-tune the LLM via reinforcement learning (e.g.,
PPO (Schulman et al., 2017)). The objective is to maximize the reward for new prompts
while constraining divergence from the reference policy π0 :

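$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot|x)}\big[r_\theta(x, y)\big] \quad \text{s.t.} \quad \mathrm{KL}(\pi \,\|\, \pi_0) \le \varepsilon. \tag{1}$$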

Solving this optimization adjusts the LLM to generate responses that align with the
reward model to better reflect human preferences. However, the reward function rθ
above in Equation 1 can be inherently highly variable, as seen in Figure 1.
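
The Bradley-Terry training loss described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names and the reward scores are hypothetical, and the rewards are passed in as plain arrays rather than produced by a neural reward model.

```python
import numpy as np

def bradley_terry_nll(r_preferred, r_rejected):
    """Mean negative log-likelihood  -ln sigma(r(x, y^1) - r(x, y^2))."""
    margin = np.asarray(r_preferred, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -ln sigma(m) = ln(1 + exp(-m)); logaddexp avoids overflow for large |m|.
    return float(np.mean(np.logaddexp(0.0, -margin)))

def preference_probability(r_1, r_2):
    """Bradley-Terry probability that response 1 is preferred over response 2."""
    diff = np.asarray(r_1, dtype=float) - np.asarray(r_2, dtype=float)
    return 1.0 / (1.0 + np.exp(-diff))

# Hypothetical reward scores for three preference pairs.
preferred = [1.2, 0.3, 2.0]
rejected = [0.4, 0.5, -1.0]
print(bradley_terry_nll(preferred, rejected))
```

The loss shrinks as the margin between preferred and rejected rewards grows, which is the behavior the paragraph above describes.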
To illustrate the impact of uncertainty in reward models, consider a simple three-
armed bandit problem. Aligning a language model can be viewed as a contextual bandit
scenario where the policy assigns probabilities to each arm to maximize the expected
return. In this example, the true rewards (shown in green in Figure 2) are $r_1^* < r_2^* < r_3^*$,
with Arm 1 having the lowest mean reward and Arms 2 and 3 having higher rewards.
However, the estimated rewards (depicted in blue as $\hat{R}_1$, $\hat{R}_2$, and $\hat{R}_3$) inaccurately
suggest that Arm 1 has the highest reward. If probabilities are assigned solely based on

these estimates, Arm 1 will receive the highest probability, leading to a lower true re-
turn since its actual reward is the lowest. However, when considering the uncertainty
intervals (shown in red in Figure 2), it becomes evident that Arm 1’s high estimated
reward comes with significant uncertainty. Arms 2 and 3 exhibit much less uncertainty,
albeit having lower estimated rewards. A more conservative strategy that accounts for
this uncertainty would allocate greater probabilities to Arms 2 and 3, leveraging their
more reliable estimates. This example highlights the trade-off between pursuing high-
risk strategies and opting for lower-reward, lower-risk approaches in policy optimiza-
tion. It demonstrates the importance of incorporating uncertainty into the fine-tuning
process.

[Figure 2 plot: reward (vertical axis) for Response/Arm 1, Response/Arm 2, and Response/Arm 3 (horizontal axis).]

Figure 2: A 3-armed bandit problem illustrating true rewards $r_1^*, r_2^*, r_3^*$ (green circles), estimated
rewards $\hat{R}_1, \hat{R}_2, \hat{R}_3$ (blue circles), and uncertainty intervals (red brackets). Arm 1 has the lowest
true reward but the highest estimate $\hat{R}_1$. In contrast, Arms 2 and 3 have lower reward estimates $\hat{R}_2$
and $\hat{R}_3$, respectively. A naive policy improvement based on only the estimated rewards $\hat{R}_i$ would increase
the probability on Arm 1, leading to a lower (true) expected return. A more conservative policy improvement
strategy should factor in the uncertainty of the estimate of Arm 1 and assign a lower probability to it,
resulting in a higher expected return.

Related Work The pitfalls of overly relying on reward models (as proxies for actual
tasks) in RLHF have been extensively documented, often referred to as reward hack-
ing (Amodei et al., 2016) or reward overoptimization (Gao et al., 2023). For example,
Shen et al. (2023) demonstrate that even large models resort to random guessing when
faced with conflicting instructions and responses. Researchers have explored using reward
model ensembles to mitigate reward hacking (Coste et al., 2023;
Eisenstein et al., 2023; Zhang et al., 2024). Leveraging conservative lower confidence
bounds (LCBs) on reward to guide the training of LLMs has been investigated by Zhai
et al. (2023); Xiong et al. (2024); Liang et al. (2022) and Zhang et al. (2024). Ramé
et al. (2024) use a weighted average of an ensemble of reward models as a reward estimate.
Methods for uncertainty quantification in deep learning using model ensembles
have been studied by Lakshminarayanan et al. (2016), Liang et al. (2022), Zhai et al.
(2023), Coste et al. (2023), and Zhang et al. (2024), among others. Other approaches include
Lou et al. (2024), where a reparameterization trick is used to learn uncertainties, similar
to the dropout method employed by Gal and Ghahramani (2016). In this work, we uti-
lize an ensemble of reward models to help quantify reward uncertainty. Our approach
mirrors the ensemble reward modeling method of Zhang et al. (2024); however, we
enhance the training efficiency by freezing the foundation layers when creating ensem-
bles. Our problem formulation is also distinct from the LCB estimates used in previous
studies, and offers a principled and practical approach to leverage uncertainty in reward
models to perform reliable policy improvement.

2 Mathematical Modeling
Notations: We assume that prompts are strings denoted by $x$ from a prompt set $\mathcal{X}$,
and responses are strings denoted by $y$ from a response set $\mathcal{Y}$. A reward model assigns a
scalar value to each prompt-response pair $(x, y)$. We consider the learned reward model
$\hat{R}$ as a sample estimate of the true human-representative reward model $r^*$. Assuming
$\mathcal{X}$ and $\mathcal{Y}$ are finite with cardinalities $X$ and $Y$, respectively, both $\hat{R}$ and $r^*$ can be
viewed as elements of $\mathbb{R}^{XY}$. A large language model, for our purposes, is a policy
$\pi$ that defines a distribution over responses $\mathcal{Y}$ given a prompt $x$. We also introduce
a distribution $\mathcal{D}$ over prompts, representing their ambient frequency in nature. With
a slight abuse of notation, we treat the policy $\pi$ as the induced joint distribution over
prompts and responses. This allows us to simplify notation by expressing the average
reward $\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}[\hat{R}(x, y)]$ as $\hat{R}^\top \pi$. We denote a covariance matrix by $\Sigma$,
use $\|x\|_2$ to represent the Euclidean ($\ell_2$) norm, and define $\|x\|_\Sigma^2$ as the quadratic form $x^\top \Sigma x$.

Noisy Reward Model We consider the true reward function $r^*$, which is unknown,
and the learned reward model $\hat{R}$, which estimates $r^*$ but is subject to noise due to finite
and imperfect training data. We assume:
Assumption 2.1. For any $(x, y)$, the estimated reward $\hat{R}(x, y)$ is a Gaussian perturbation
of $r^*(x, y)$:

$$\hat{R}(x, y) = r^*(x, y) + \mathcal{N}\big(0, \sigma^2(x, y)\big),$$

where $\mathcal{N}(0, \sigma^2(x, y))$ is a Gaussian random variable with mean zero and variance
$\sigma^2(x, y)$. We assume that the estimates $\hat{R}(x, y)$ are independent across different $(x, y)$.

Thus, $\hat{R} \sim \mathcal{N}(r^*, \Sigma)$, where $\Sigma$ is a diagonal matrix with entries $\sigma^2(x, y)$. Our
goal is to optimize the policy $\pi$ to maximize the expected reward estimated by $\hat{R}$.
Let $\pi_0$ be a reference policy (e.g., from pre-training), and define $d = \pi - \pi_0$. Since
$\hat{R} \sim \mathcal{N}(r^*, \Sigma)$, the scalar $\hat{R}^\top d$ is normally distributed with mean $r^{*\top} d$ and variance
$d^\top \Sigma d$: $\hat{R}^\top d \sim \mathcal{N}\big(r^{*\top} d,\; d^\top \Sigma d\big)$. To prevent the policy $\pi$ from deviating too much
from the reference policy $\pi_0$, we constrain $d$ to lie within a feasible set $\mathcal{D} \subset \mathbb{R}^{XY}$.

Lower Bound on the True Objective Function The following theorem provides
a bound on the optimization problem that accounts for the uncertainty in the reward
estimates. The proof is presented in Appendix 6.
Theorem 2.2. Under Assumption 2.1, for any $\beta > 0$, the following holds with probability
at least $1 - \exp\left(-\frac{XA}{\beta^2}\right)$:

$$\sup_{d \in \mathcal{D}} \; \hat{R}^\top d - \beta \|d\|_\Sigma \;\le\; \sup_{d \in \mathcal{D}} \; r^{*\top} d.$$

The above theorem implies that the optimization problem on the left-hand side is
a high-probability lower bound for the true optimization problem, which depends on
the unknown reward function $r^*$. Given that $r^*$ is not directly available, but we do
have access to noisy estimates $\hat{R}$, we propose the following optimization problem as a
practical substitute:

$$\sup_{d \in \mathcal{D}} \; \hat{R}^\top d - \beta \|d\|_\Sigma. \tag{2}$$

This formulation leads to the following constrained optimization problem:

$$\max_{\pi} \; \hat{R}^\top \pi \quad \text{subject to} \quad (\pi - \pi_0)^\top \Sigma (\pi - \pi_0) \le \varepsilon,$$

for some ε > 0. The weighted constraint on the policy update penalizes deviations
more heavily for prompt-response pairs with higher variance in the reward estimates,
thereby incorporating the uncertainty into the optimization process.
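
As a rough illustration of how the weighting acts, the constrained problem above has a closed form if the simplex constraints on $\pi$ are ignored and $\Sigma$ is positive definite with $\hat{R} \neq 0$ (these are simplifying assumptions made here for the sketch, not claims taken from the analysis). Writing $d = \pi - \pi_0$, the objective $\hat{R}^\top \pi$ differs from $\hat{R}^\top d$ only by the constant $\hat{R}^\top \pi_0$, and a standard Lagrangian argument gives:

```latex
\begin{align*}
\mathcal{L}(d, \lambda) &= \hat{R}^\top d - \lambda \left( d^\top \Sigma d - \varepsilon \right), \qquad \lambda > 0, \\
\nabla_d \mathcal{L} = 0 \;&\Longrightarrow\; d = \tfrac{1}{2\lambda}\, \Sigma^{-1} \hat{R}, \\
d^\top \Sigma d = \varepsilon \;&\Longrightarrow\; 2\lambda = \sqrt{\hat{R}^\top \Sigma^{-1} \hat{R} \,/\, \varepsilon}, \\
d^\star &= \sqrt{\varepsilon}\; \frac{\Sigma^{-1} \hat{R}}{\sqrt{\hat{R}^\top \Sigma^{-1} \hat{R}}}.
\end{align*}
```

With a diagonal $\Sigma$, each coordinate of $d^\star$ is proportional to $\hat{R}(x, y) / \sigma^2(x, y)$, so prompt-response pairs with noisier reward estimates receive smaller updates.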
Remark 2.3. Note that similar variants of the constrained optimization problem have
been explored previously in the literature. For example, the unconstrained version of
our approach is equivalent to the vanilla policy gradient method (Sutton et al., 1999).
The standard RLHF formulation typically employs the PPO algorithm (Schulman et al.,
2017), which is defined with a KL-divergence constraint, although the choice of distance
metric is not unique. For example, an $\ell_2$ approximation of the KL-divergence
constraint results in the unweighted constraint $\|\pi - \pi_0\|_2^2 \le \varepsilon$. Another widely used
technique is the natural policy gradient, as implemented in the Trust Region Policy Op-
timization (TRPO) algorithm (Schulman, 2015). TRPO adjusts the constraint based on
the Fisher information matrix I, leading to the constraint: (π − π0 )⊤ I(π − π0 ) ⩽ ε,
where I adapts the penalization according to the sensitivity of the policy.
In our experiments, we use a variance-adjusted KL-divergence constraint:

$$\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot|x)}\left[\sigma^2(x, y) \ln \frac{\pi(y|x)}{\pi_0(y|x)}\right] \le \varepsilon.$$

This formulation integrates seamlessly with existing PPO subroutines, such as those
provided in the TRL Library (von Werra et al., 2020)$^1$.
$^1$ TRL package from Hugging Face
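
The variance-adjusted constraint above can be estimated from samples. The sketch below is a minimal Monte Carlo estimator under stated assumptions: the function name is hypothetical, the log-probabilities and variances are supplied as arrays with one entry per sampled pair $(x, y)$, $y \sim \pi(\cdot|x)$, and nothing here calls the TRL API.

```python
import numpy as np

def variance_weighted_kl(logp_policy, logp_reference, variances):
    """Monte Carlo estimate of E[ sigma^2(x, y) * ln(pi(y|x) / pi0(y|x)) ].

    Each argument holds one entry per sampled pair (x, y), with y drawn from pi(.|x).
    """
    log_ratio = np.asarray(logp_policy, dtype=float) - np.asarray(logp_reference, dtype=float)
    return float(np.mean(np.asarray(variances, dtype=float) * log_ratio))

# Hypothetical values for two sampled responses.
estimate = variance_weighted_kl(
    logp_policy=[-1.0, -2.0],
    logp_reference=[-1.5, -2.5],
    variances=[2.0, 4.0],
)
print(estimate)
```

Setting all variances to one recovers the usual sample-based KL estimate, so the weighting only changes how strongly each sampled pair counts toward the budget $\varepsilon$.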

3 Theoretical Analysis
We compare the performance of the variance-aware LLM alignment methodology with
its variance-unaware counterpart to evaluate how incorporating reward estimate uncer-
tainty affects policy robustness and effectiveness, especially in scenarios with noisy
reward estimates. We consider two policies, π1 and π2 , derived from different opti-
mization formulations.
Definition 3.1 (Variance-Unaware Policy, $\pi_1$). The policy obtained by solving the
unweighted $\ell_2$ constraint problem:

$$\pi_1 = \arg\max_{\pi} \; \pi^\top \hat{R} \quad \text{subject to} \quad \|\pi - \pi_0\|_2^2 \le \varepsilon.$$

Definition 3.2 (Variance-Aware Policy, $\pi_2$). The policy obtained by solving the
variance-weighted $\ell_2$ constraint problem:

$$\pi_2 = \arg\max_{\pi} \; \pi^\top \hat{R} \quad \text{subject to} \quad \|\pi - \pi_0\|_\Sigma^2 \le \tilde{\varepsilon}.$$

To compare both methods fairly, we set ε̃ = λmin (Σ) · ε; this has the effect of
aligning the largest ellipsoid of the covariance-weighted constraint with the sphere of
the traditional ℓ2 constraint.

Main Result We evaluate the expected true rewards $\pi_i^\top r^*$ for $i = 1, 2$, where $r^*$ is
the true (unknown) reward vector for both methods and compare them to $\pi_0^\top r^*$. We
aim to show that $\pi_2$ is less likely to underperform relative to $\pi_0$ than $\pi_1$, indicating that
the variance-aware method is less risky when reward estimates are uncertain.
Theorem 3.3. Consider policies $\pi_1$ and $\pi_2$ as defined in Definitions 3.1 and 3.2 respectively.
With $\tilde{\varepsilon}$ set as $\lambda_{\min}(\Sigma)\,\varepsilon$ to ensure the optimization domain of the variance-aware
method is only as large as the variance-unaware method, we have the following result:

$$P\big(\pi_2^\top r^* \le \pi_0^\top r^*\big) \;\le\; P\big(\pi_1^\top r^* \le \pi_0^\top r^*\big).$$

Remark 3.4. Thus, the variance-aware method (π2 ) has a lower probability of underper-
forming relative to π0 than the variance-unaware method (π1 ). Theorem 3.3 highlights
the trade-off between risk and reward. While the variance-unaware policy (π1 ) may
achieve higher rewards when $\hat{R}$ is accurate, it is riskier as it ignores estimate uncertainty.
The variance-aware policy ($\pi_2$) reduces underperformance risk by accounting
for reward estimate variance. The proof of the theorem is presented in Appendix 6.
Remark 3.5. Our variance-aware policy is closely related to another reward-to-variability
ratio known in finance literature as the Sharpe Ratio (Sharpe, 1966), which balances
expected return against risk.
Figure 3: In the high-variability setting, variances of reward estimates range between (3, 100).
Method 2 (variance-aware) exhibits significantly lower return variance than Method 1
(variance-unaware), confirming its risk-averse nature. The standard deviation for Method 2 is
0.04, while for Method 1 it is 0.13. The mean returns for both methods are comparable: 4.643
for Method 1 and 4.644 for Method 2.

Figure 4: In the low-variability setting, variances of reward estimates range between (70, 100).
Both methods perform similarly, with Method 2 (variance-aware) having a standard deviation of
0.12 and Method 1 (variance-unaware) having a standard deviation of 0.14. The mean returns for
Method 1 and Method 2 are 0.14 and 0.13, respectively.

Figure 5: Distribution of policy returns under different variability settings. In both cases, the true reward
vector $r^*$ is fixed, and reward estimates $\hat{R}$ are sampled from a multivariate Gaussian distribution with the
specified covariance matrices. The histograms show the frequency of policy returns under both methods,
illustrating the risk-averse nature of Method 2 in the high-variability setting and the convergence of both
methods in the low-variability setting.

Theorem 3.6. Consider the optimization problem:

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot|x)}\big[\hat{R}(x, y)\big] \quad \text{subject to} \quad \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot|x)}\left[\sigma^2(x, y) \ln \frac{\pi(y|x)}{\pi_0(y|x)}\right] \le \varepsilon,$$

where $\hat{R}(x, y)$ and $\sigma^2(x, y)$ are the reward estimate and its variance. The optimal
policy is:

$$\pi^*(y|x) \propto \pi_0(y|x) \exp\left(\frac{\hat{R}(x, y)}{\beta \sigma^2(x, y)}\right),$$

for some $\beta > 0$.

The proof is presented in Appendix 6. Thus, the optimal policy reweights the reference policy
through the ratio $\hat{R}(x, y) / \sigma^2(x, y)$, a reward-to-variability ratio analogous to the Sharpe
Ratio, which measures the return of an investment after adjusting for its risk.
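
To make the form of the optimal policy in Theorem 3.6 concrete, the sketch below evaluates it for a single prompt with three candidate responses, loosely echoing the bandit of Figure 2. All numbers are hypothetical, the function names are ours, and $\beta = 1$ is an arbitrary choice; the variance-unaware baseline is obtained by setting every variance to one.

```python
import numpy as np

def variance_aware_policy(pi0, r_hat, variances, beta=1.0):
    """pi*(y) proportional to pi0(y) * exp(r_hat(y) / (beta * sigma^2(y))), one prompt."""
    scores = np.asarray(r_hat, dtype=float) / (beta * np.asarray(variances, dtype=float))
    logits = np.log(np.asarray(pi0, dtype=float)) + scores
    logits = logits - logits.max()  # stabilise the softmax
    weights = np.exp(logits)
    return weights / weights.sum()

def variance_unaware_policy(pi0, r_hat, beta=1.0):
    """Same exponential tilt, but treating every reward estimate as equally reliable."""
    return variance_aware_policy(pi0, r_hat, np.ones(len(r_hat)), beta)

# Hypothetical three-arm example: Arm 1 has the highest estimate but is very noisy.
pi0 = [1 / 3, 1 / 3, 1 / 3]
r_true = np.array([0.5, 1.8, 2.0])
r_hat = [3.0, 2.0, 1.5]
variances = [9.0, 0.25, 0.25]

aware = variance_aware_policy(pi0, r_hat, variances)
unaware = variance_unaware_policy(pi0, r_hat)
print("variance-aware:  ", aware, "true return", aware @ r_true)
print("variance-unaware:", unaware, "true return", unaware @ r_true)
```

In this toy setting the variance-unaware policy concentrates on the arm with the inflated estimate, while the variance-aware policy shifts its mass toward the arms whose estimates are more reliable.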

Variability in the Variance The variance-aware method’s advantages are more sig-
nificant when reward estimate variances vary across prompt-response pairs. If vari-
ances are homogeneous, both methods perform similarly since the covariance-weighted
constraint becomes proportional to the traditional $\ell_2$ constraint. We conduct simulations
to illustrate the benefits of the variance-aware method. We fix a true reward
vector $r^*$ (dimension 1000) and sample reward estimates $\hat{R}$ from $\mathcal{N}(r^*, \Sigma)$ under two
settings: high and low variance variability. In the high-variability setting (Figure 3), the
variance-aware method ($\pi_2$) shows significantly lower return variance compared to the
variance-unaware method ($\pi_1$), confirming its risk-averse nature. In the low-variability
setting (Figure 4), both methods perform similarly, aligning with theoretical predic-
tions. These results confirm our theoretical insights and demonstrate the practical util-
ity of variance-aware policy optimization in aligning LLMs with human preferences.
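
A simulation in the spirit of the one described above can be sketched as follows. This is not the authors' code: it ignores the simplex constraints on $\pi$ so that both updates $d = \pi - \pi_0$ have closed forms, uses a smaller dimension (100 rather than 1000), and draws a hypothetical $r^*$ and variance vector. It estimates how often each method underperforms the reference policy, the quantity compared in Theorem 3.3.

```python
import math
import numpy as np

def underperformance_rates(r_star, variances, eps=1.0, trials=5000, seed=0):
    """Monte Carlo estimates of P(d^T r* <= 0) for the two updates d = pi - pi0."""
    rng = np.random.default_rng(seed)
    eps_tilde = variances.min() * eps  # eps~ = lambda_min(Sigma) * eps
    # Draw `trials` noisy reward vectors R_hat ~ N(r*, Sigma) with diagonal Sigma.
    noise = np.sqrt(variances) * rng.standard_normal((trials, r_star.size))
    r_hat = r_star + noise
    # Variance-unaware: maximise R_hat^T d subject to ||d||_2^2 <= eps.
    d1 = math.sqrt(eps) * r_hat / np.linalg.norm(r_hat, axis=1, keepdims=True)
    # Variance-aware: maximise R_hat^T d subject to d^T Sigma d <= eps~.
    w = r_hat / variances  # Sigma^{-1} R_hat
    scale = np.sqrt(np.sum(r_hat * w, axis=1, keepdims=True))  # sqrt(R^T Sigma^-1 R)
    d2 = math.sqrt(eps_tilde) * w / scale
    gain1, gain2 = d1 @ r_star, d2 @ r_star  # true improvement over pi0
    return float(np.mean(gain1 <= 0)), float(np.mean(gain2 <= 0))

# Hypothetical high-variability setting: variances spread over (3, 100).
rng = np.random.default_rng(1)
r_star = rng.standard_normal(100)
variances = rng.uniform(3.0, 100.0, size=100)
p_unaware, p_aware = underperformance_rates(r_star, variances)
print("P(underperform): unaware", p_unaware, "aware", p_aware)
```

When all variances are equal the two updates point in the same direction, so the two rates coincide, which matches the homogeneous-variance remark above.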

4 Reward Modeling
In this section, we discuss the process of reward modeling using the Gemma-2B-
it model (Team et al., 2024), an instruction-tuned version of the foundational model
Gemma-2B. Our reward modeling methodology uses an ensemble of models, specif-
ically 10 independent reward models, to compute the reward variance across different
instances of the same prompt-response pair. This ensemble-based approach allows
us to better capture the uncertainty in the reward estimates and to analyze the vari-
ability between otherwise identical reward models. The following paragraphs detail
the methodology used to learn the ensemble of reward models, the dataset used for
training and evaluation, and the observations drawn from the ensemble’s performance
across multiple benchmarks.

Dataset To train our reward models, we utilize an existing open-source preference
dataset (Dong et al., 2024), which is available publicly via HuggingFace$^2$. This curated
dataset contains approximately 50,000 labeled preference pairs. It is constructed
by combining several well-known, open-source datasets. The included datasets are
HH-RLHF (Bai et al., 2022a), SHP (Ethayarajh et al., 2022), HelpSteer (Wang et al.,
2023), PKU-SafeRLHF (Ji et al., 2024), UltraFeedback (Cui et al., 2023), UltraIn-
teract (Yuan et al., 2024), Distilabel-Capybara (Daniele, 2023), and Distilabel-Orca3
(Lian et al., 2023). The combined dataset has undergone preprocessing to filter out sub-
quality data, specifically removing 10% of the original dataset to ensure the quality of
the training samples. The final dataset contains human preferences where, for each
prompt, two responses are given: one preferred and the other rejected. The prefer-
ence labels serve as the ground truth for training our ensemble of reward models. This
dataset provides a comprehensive and diverse set of prompt-response pairs, making it
suitable for training a robust reward model ensemble that can generalize across various
domains and tasks. We refer readers to the original work of Dong et al. (2024) for
further details on the dataset construction and preprocessing steps.

Methodology We use the Gemma-2B-it (Gemma, 2024) model as the foundation
for our reward models. The instruction-tuned nature of this model makes it a strong
candidate for reward modeling tasks, as it has been fine-tuned to follow human instructions
closely. The size of Gemma-2B-it is approximately 9.34 GB on disk, including
a scalar reward head. Storing an ensemble of 10 fully independent reward models
would therefore require approximately 90 GB. To accelerate the
training process and optimize memory usage, we employ the following methodology:
2 [Link]/weqweasdas/preference dataset mix2

Model                                      Average Score   Chat   Chat-Hard   Safety   Reasoning   Prior Sets
GRM-Gemma-2B-sftreg (Yang et al., 2024)    74.7            95.5   48.7        80.0     76.8        69.8
Gemma-2B-rewardmodel-baseline              73.1            94.1   46.9        79.7     73.8        69.0
Our Model                                  69.4            95.6   44.5        55.9     81.8        69.0
Qwen1.5-72B-Chat (Bai et al., 2023)        68.2            62.3   66.0        72.0     85.5        42.3
MiniCPM-2B-dpo-fp32 (Hu et al., 2024)      66.2            89.1   49.3        52.5     82.3        49.6
RM-Gemma-2B (Dong et al., 2023)            64.2            94.4   40.8        44.0     76.4        66.5

Table 1: Comparison of our ensemble of reward models to other SOTA 2B models on the RewardBenchmark
platform. The Prior Sets are given 50% weightage in the final score. Our model shows competitive
performance compared to others, highlighting its efficacy in reward modeling tasks.

• Initial Training: We begin by training a single instance of the full Gemma-2B-it
model with a scalar reward head on the preference dataset. The reward
head is a simple linear layer with dimensions 2048 × 1. We use early-stopping
during training to prevent overfitting and ensure generalization. Specifically, we
stop training when the loss reaches 0.3, as this strikes a balance between model
complexity and the risk of overfitting.
• Parallel Reward Heads: Once the initial model is partially trained, we attach 9
additional reward heads in parallel with the original reward head (Zhang et al.,
2024). Each reward head is a linear layer with the same dimensions as the first
(2048×1). The model now outputs a 10-dimensional vector, where each element
corresponds to the reward output of one of the 10 models in the ensemble. This
configuration allows us to efficiently compute the rewards for all models in a
single forward pass.
• Freezing the Foundation Model: To reduce computational complexity and en-
sure faster training, we freeze the weights of the foundation model (i.e., the pre-
trained layers of Gemma-2B-it) and train only the reward heads. This allows us
to simulate training 10 independent reward models in parallel while sharing the
foundation model across all reward heads. We employ an additive loss function
during training: $\text{loss} = \sum_{i=1}^{10} \ell(\theta_i)$, where each $\theta_i$ represents the parameters of
the i-th reward head. This approach ensures that all reward heads are trained
independently but computationally efficiently. In this sense, our methodology
differs from the one used in Zhang et al. (2024).
By freezing the foundational layers and focusing the training on the reward heads, we
can significantly reduce the computational and storage costs associated with training
an ensemble of models. The final ensemble model occupies approximately 9.34 GB on
disk, and the total number of trainable parameters across all reward heads is 20,480.
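
The parallel-head setup described above can be sketched in NumPy. This is an assumption-laden illustration rather than the training code: the frozen backbone is replaced by random stand-in features, the heads are taken to be bias-free linear maps (which is what the stated count of 20,480 trainable parameters suggests), and the per-head loss $\ell$ is assumed to be the Bradley-Terry negative log-likelihood.

```python
import numpy as np

HIDDEN_SIZE, NUM_HEADS = 2048, 10

def init_heads(seed=0):
    """Ten bias-free linear reward heads stacked as one (2048, 10) matrix."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, 0.02, size=(HIDDEN_SIZE, NUM_HEADS))

def ensemble_rewards(features, heads):
    """features: (batch, 2048) frozen-backbone outputs -> (batch, 10) rewards."""
    return features @ heads

def additive_loss(feat_chosen, feat_rejected, heads):
    """loss = sum_i l(theta_i), with l assumed to be the Bradley-Terry NLL."""
    margins = ensemble_rewards(feat_chosen, heads) - ensemble_rewards(feat_rejected, heads)
    per_head = np.logaddexp(0.0, -margins).mean(axis=0)  # one loss value per head
    return float(per_head.sum())

# Stand-in features for a batch of 4 preference pairs (hypothetical data).
rng = np.random.default_rng(1)
chosen = rng.standard_normal((4, HIDDEN_SIZE))
rejected = rng.standard_normal((4, HIDDEN_SIZE))
heads = init_heads()
rewards = ensemble_rewards(chosen, heads)
print("trainable parameters:", heads.size)
print("per-response mean and sample variance:", rewards.mean(axis=1), rewards.var(axis=1, ddof=1))
print("additive loss:", additive_loss(chosen, rejected, heads))
```

Because the loss is a plain sum, the gradient with respect to head $i$ depends only on $\ell(\theta_i)$, so the heads are updated independently even though one forward pass serves all of them.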

Evaluation To assess the performance of our ensemble reward models, we utilize
the RewardBenchmark platform (Lambert et al., 2024)$^3$, a widely used platform that
offers curated datasets and evaluation metrics specifically designed for benchmarking
reward models. This platform provides an in-depth evaluation across multiple datasets,
each designed to test different aspects of reward modeling, such as conversational ability,
safety, and reasoning. The evaluation is conducted on four primary datasets: Chat
(Li et al., 2023; Zheng et al., 2023), Chat-Hard (Zheng et al., 2023), Safety (Röttger
et al., 2023; Dong et al., 2023), and Reasoning (Muennighoff et al., 2023; Lightman
et al., 2023). Additionally, there is a fifth dataset called Prior, which consists of subsets
of various other datasets including Anthropic Helpful (Bai et al., 2022a), BIG-Bench
(Askell et al., 2021), Stanford Human Preferences (SHP) (Ethayarajh et al., 2022)
and Learning to Summarize (Stiennon et al., 2020) and is given a 50% weightage
in the overall score. The platform evaluates models based on a comprehensive list of
metrics, providing a holistic view of the model’s ability to predict human preferences.
We refer readers to the original work for a more detailed explanation of the dataset
composition. We compare the average performance of our ensemble model to other
state-of-the-art (SOTA) models with similar model sizes (2B parameters). Table 1
summarizes the results of this comparison. Our ensemble reward model demonstrates
performance comparable to other SOTA 2B models, confirming its efficacy as a reliable
reward estimation framework.
$^3$ [Link]

[Figure 6 panels: (a) Chat, (b) Chat Hard, (c) Safety, (d) Reasoning (top row); (e) Chat, (f) Chat Hard, (g) Safety, (h) Reasoning (bottom row).]

Figure 6: (Top Row) The distribution of sample variances of the reward on the accepted responses. The
sample variance is computed across the 10 reward models. We note from the median of the sample variances
that half of the dataset tends to have variances of the rewards greater than 3.81, with a maximum close to 10.
This corroborates our hypothesis that different reward models will exhibit variability in their reward
assignments for the same prompt-response pair. (Bottom Row) The distribution of sample variance of the
reward difference between accepted and rejected responses. The figure shows that the reward models are not
merely translations of one another, and the variance arises due to the statistical nature of learning these
reward models and the stochasticity of the optimization process.

Observations To corroborate our hypothesis that identically trained reward models
disagree on the same prompt-response pair, we run our experiment on the 4 datasets
provided in the RewardBenchmark platform, namely the Chat, Chat-Hard, Safety and
Reasoning datasets. For example, the Chat dataset contains 358 prompt-response
pairs in the form $(x, y^1, y^2)$, where $y^1$ is the accepted response, and $y^2$ is the rejected
response. The Chat dataset is a mixture of multiple sources, including AlpacaEval
Easy, AlpacaEval, AlpacaEval Hard (Li et al., 2023), MT Bench Easy, and MT
Bench Medium (Zheng et al., 2023). The composition of the other datasets can be
found in the original work of Lambert et al. (2024). We analyze the variance of the
rewards assigned to the accepted responses across the 10 models in the ensemble. For
each prompt $x$, we compute the reward for the accepted response $r_i(x, y^1)$ using the
$i$-th reward model. We then compute the sample variance of the rewards for
each accepted response across the 10 models and plot the distribution of the sample
variance of the entire dataset. The top row of Figure 6 shows the histogram of the com-
puted sample variances in each dataset. We observe that the variances in the rewards
range between 3 and 14, with a mean variance greater than 4 and a median variance
greater than 3 for each dataset. This indicates that there is non-negligible variability in
the rewards assigned by the different models in the ensemble, even though the models
are trained on the same dataset. This lack of uniformity can be attributed to factors such
as the finite size of the training data and the inherent stochasticity of the optimization
process used during training. These findings align with our hypothesis that different
reward models can exhibit notable disagreement in their reward assignments for the
same prompt-response pair, even when trained on identical data. To further explore
this variability, we analyze the variance distribution of the differences between the re-
wards assigned to the accepted and rejected responses. The bottom row of Figure 6
presents this distribution, illustrating that the reward models are not simply translations
of one another. Models that differed only by a constant shift would assign identical reward
differences, so the sample variance of those differences would vanish, leading to a Dirac
distribution centered at zero. However, the distribution
as observed shows that this is not the case, supporting the notion that the observed
variance arises from the statistical and stochastic nature of the learning process.
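The two variance statistics described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, the array layout (one row per reward model, one column per prompt-response pair), and the synthetic rewards are all assumptions. The synthetic ensemble is built so that the models differ only by a constant shift, which is exactly the case the translation argument rules out.

```python
import numpy as np

def ensemble_variance_stats(r_accepted, r_rejected):
    """Per-pair sample variances across an ensemble.

    r_accepted, r_rejected: arrays of shape (n_models, n_pairs) holding the
    reward each model assigns to the accepted / rejected response.
    """
    r_accepted = np.asarray(r_accepted, dtype=float)
    r_rejected = np.asarray(r_rejected, dtype=float)
    # Sample variance (ddof=1) across models for each accepted response.
    var_accepted = r_accepted.var(axis=0, ddof=1)
    # Sample variance across models of the accepted-minus-rejected difference.
    var_gap = (r_accepted - r_rejected).var(axis=0, ddof=1)
    return var_accepted, var_gap

# Synthetic stand-in (not real data): 10 models that are pure translations
# of one another, evaluated on 358 pairs.
rng = np.random.default_rng(0)
base_acc = rng.normal(size=358)
base_rej = rng.normal(size=358)
shift = rng.normal(scale=2.0, size=(10, 1))  # one constant offset per model
var_acc, var_gap = ensemble_variance_stats(base_acc + shift, base_rej + shift)
```

For pure translations the accepted-reward variance is positive while the variance of the difference vanishes, so a non-degenerate histogram of the second quantity indicates disagreement beyond a constant offset.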

5 Proximal Policy Optimization (PPO)


This section describes our methodology for fine-tuning the GPT-2 (Radford et al.,
2019) language model using a variance-aware approach. Our approach builds on
the standard Proximal Policy Optimization (PPO) framework (Schulman et al., 2017),
modified to incorporate uncertainty in the reward estimates. The goal is to demonstrate
how accounting for variance in reward models can lead to more robust and safe
policies. We chose GPT-2 because PPO is comparatively easy to run on it: training
large language models with PPO is known to be difficult because of instability and
sensitivity to hyperparameters (Choshen et al., 2019), dependence on code-level
optimizations (Engstrom et al., 2020), and resource intensiveness.

Dataset For prompt sampling, we use the IMDB dataset (Maas et al., 2011), which is
publicly available via Hugging Face (stanfordnlp/imdb). The train split of this dataset
consists of 25,000 rows. We sample a prompt x from each row, with a random length of
between 2 and 8 tokens. These sampled prompts serve as input to the language model
during the training process, where responses are generated and evaluated by our reward
models.
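A minimal sketch of the prompt sampling step is given below. It rests on two assumptions that the text does not state: the prompt is taken as a prefix of the review, and whitespace splitting stands in for the GPT-2 tokenizer. The function name and the example review are hypothetical.

```python
import random

def sample_prompt(review_text, rng, min_tokens=2, max_tokens=8):
    """Return a prefix of the review with a random length in [min_tokens, max_tokens]."""
    # Assumption: whitespace tokens stand in for GPT-2 tokenizer tokens.
    tokens = review_text.split()
    # randint is inclusive on both ends.
    length = rng.randint(min_tokens, max_tokens)
    return " ".join(tokens[:length])

rng = random.Random(0)
review = "This movie was a surprisingly heartfelt drama with strong performances"
prompts = [sample_prompt(review, rng) for _ in range(5)]
```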

Methodology We use GPT-2 as the base language model for fine-tuning. The responses
generated by GPT-2 have a maximum length of 10 tokens. For each prompt-response
pair (x, y), we compute rewards and variances from each of the 10 reward models in
our ensemble. The reward for a given pair is adjusted by penalizing the score with the
variance-weighted KL divergence between the current policy $\pi$ and the reference
policy $\pi_0$; that is, the adjusted reward is given by

$$
R_i(x, y) = r_i(x, y) - \beta \, \sigma(x, y) \ln \frac{\pi(y|x)}{\pi_0(y|x)},
$$

where $r_i(x, y)$ is the reward from the $i$-th model. Note that this estimate differs
from the lower confidence estimate $r_i - \beta\sigma$ used in previous works (Zhang et al.,
2024). Using this variance-weighted reward, we perform PPO to update the policy. For
each reward model, we run 4 independent trials of PPO, resulting in 4 policies per
reward model. In total, we train 40 independent policies, which we label as the
variance-aware policies. These policies are compared with another set of policies
trained using the conventional PPO method as implemented in the TRL library (von
Werra et al., 2020). To ensure a fair comparison between the two methods, we tune the
value of β experimentally to equalize the KL divergence between the final policy and
the reference policy across both sets of policies.
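The adjusted reward can be illustrated with a short numerical sketch. The function and argument names are hypothetical, the inputs are made-up numbers, and the log-probabilities are assumed to be sequence-level values for the whole response; none of this is taken from the authors' implementation.

```python
import numpy as np

def variance_weighted_reward(r_i, sigma, logp_policy, logp_ref, beta):
    """Adjusted reward r_i - beta * sigma * ln(pi(y|x) / pi_0(y|x)).

    sigma is the uncertainty weight sigma(x, y) from the formula in the text;
    logp_policy and logp_ref are log pi(y|x) and log pi_0(y|x).
    """
    log_ratio = np.asarray(logp_policy) - np.asarray(logp_ref)
    return np.asarray(r_i) - beta * np.asarray(sigma) * log_ratio

# Two responses with the same raw reward and log-ratio but different uncertainty.
adjusted = variance_weighted_reward(
    r_i=np.array([1.0, 1.0]),
    sigma=np.array([0.5, 2.0]),
    logp_policy=np.array([-3.0, -3.0]),
    logp_ref=np.array([-3.4, -3.4]),
    beta=0.05,
)
```

With an identical log-ratio of 0.4, the response with the larger uncertainty weight receives the larger penalty, which is the conservative behaviour the method aims for.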

Evaluation To assess the quality of the trained policies, we evaluate them using a
large reward model that serves as a judge. Specifically, we use the FsfairX-LLaMA3-
RM-v0.1 reward model (Dong et al., 2023; Xiong et al., 2024), which is based on
Llama-3-8B and currently ranks 17th on the RewardBenchmark platform. This reward
model acts as an evaluator by scoring the prompt-response pairs generated by the
trained policies. Each of the 40 policies from the variance-aware set is used to generate
responses for the test split of the IMDB dataset. The responses are then scored by
the judge reward model, and the scores are averaged over the entire test dataset, giving
one average reward per policy. This process results in a distribution of average rewards
for the variance-aware policies. We repeat the same evaluation for the vanilla-PPO
policies, generating another reward distribution based on their performance. As a
baseline, we also evaluate the performance of the reference policy, GPT-2, using the
same reward model. The reward distributions for all three sets of policies are compared
and plotted in Figure 7.
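The aggregation step can be sketched as below: each policy contributes one average judge score, and the distribution of those averages is summarized by its mean and variance. The function name and toy scores are hypothetical, and the use of the sample variance (rather than the population variance) is an assumption.

```python
import statistics

def summarize_policy_scores(per_policy_scores):
    """per_policy_scores: one list of judge scores per trained policy."""
    # One average reward per policy over its test-set responses.
    averages = [statistics.fmean(scores) for scores in per_policy_scores]
    # Mean and sample variance of the distribution of per-policy averages.
    return averages, statistics.fmean(averages), statistics.variance(averages)

# Toy scores for three hypothetical policies (not results from the paper).
toy_scores = [[0.1, 0.3], [0.2, 0.4], [0.3, 0.5]]
averages, mean_reward, reward_variance = summarize_policy_scores(toy_scores)
```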

Observations In Figure 7, indigo marks the true reward distribution of the reference
policy (GPT-2) as measured by the judge reward model, red marks the true reward
distribution of the variance-aware policies, and cyan marks the true reward distribution
of the vanilla PPO policies. As the figure shows, both methods achieve a higher mean
reward than the reference policy, which has a mean reward of 0.19. The variance-aware
policies improve on the reference policy, with a mean reward of 0.22 and a variance of
0.012. These policies are trained to be more conservative, which leads to a more robust,
albeit less aggressive, improvement in the reward scores. The vanilla PPO policies
achieve the highest average reward, with a mean of 0.34, but also a significantly higher
variance of 0.06. This suggests that while ignoring variance in the reward model can
result in larger potential gains, it comes with increased variability and risk, making
these policies more sensitive to noise in the reward estimates. The results suggest that
the variance-aware approach offers a more stable, risk-averse policy.

Figure 7: The reward distribution for the two methods compared with the reference policy's quality. The
distribution marked in indigo represents the reward distribution for the reference policy, based on 40 samples
of the average reward determined by the judge reward model on responses generated by GPT-2. The reward
distribution from the reference policy has a mean of 0.19 and a variance of 0.002. The reward distribution
for the variance-aware method (in red) has a mean of 0.22 and a variance of 0.012. The reward distribution
for the vanilla PPO method (in cyan) has a mean of 0.34 and a variance of 0.06.

References
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan
Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.
AI Anthropic. Introducing claude, 2023.

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan,
Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language
assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan,
Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint
arXiv:2309.16609, 2023.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova Das-
Sarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training
a helpful and harmless assistant with reinforcement learning from human feedback.
arXiv preprint arXiv:2204.05862, 2022a.

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion,
Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKin-
non, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint
arXiv:2212.08073, 2022b.
Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs:
I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.

Leshem Choshen, Lior Fox, Zohar Aizenbud, and Omri Abend. On the weak-
nesses of reinforcement learning for neural machine translation. arXiv preprint
arXiv:1907.01752, 2019.
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario
Amodei. Deep reinforcement learning from human preferences. Advances in neural
information processing systems, 30, 2017.
Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model en-
sembles help mitigate overoptimization. arXiv preprint arXiv:2310.02743, 2023.
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie,
Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with
high-quality feedback. arXiv preprint arXiv:2310.01377, 2023.
Luigi Daniele. Suphavadeeprasit. Amplify-Instruct: Synthetically Generated Diverse
Multi-turn Conversations for Effecient LLM Training. arXiv preprint arXiv:(coming
soon), 2023.

Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan,
Shizhe Diao, Jipeng Zhang, KaShun SHUM, and Tong Zhang. RAFT: Re-
ward ranked finetuning for generative foundation model alignment. Transac-
tions on Machine Learning Research, 2023. ISSN 2835-8856. URL https:
//[Link]/forum?id=m7p5O7zblY.

Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan
Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward
modeling to online rlhf. arXiv preprint arXiv:2405.07863, 2024.
Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D’Amour,
DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachan-
dran, et al. Helping or herding? reward model ensembles mitigate but do not elimi-
nate reward hacking. arXiv preprint arXiv:2312.09244, 2023.
Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos,
Larry Rudolph, and Aleksander Madry. Implementation matters in deep policy gra-
dients: A case study on ppo and trpo. arXiv preprint arXiv:2005.12729, 2020.

Kawin Ethayarajh, Yejin Choi, and Swabha Swayamdipta. Understanding dataset diffi-
culty with v-usable information. In International Conference on Machine Learning,
pages 5988–6008. PMLR, 2022.
Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Represent-
ing model uncertainty in deep learning. In International Conference on Machine
Learning, 2016.
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overopti-
mization. In International Conference on Machine Learning, pages 10835–10866.
PMLR, 2023.

Gemma. [Link] 2024.
Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng,
Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the po-
tential of small language models with scalable training strategies. arXiv preprint
arXiv:2404.06395, 2024.

Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen,
Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved
safety alignment of llm via a human-preference dataset. Advances in Neural Infor-
mation Processing Systems, 36, 2024.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Ensemble-based
uncertainty estimation for deep learning. arXiv preprint arXiv:1612.01474, 2016.
Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin,
Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. Re-
wardbench: Evaluating reward models for language modeling. arXiv preprint
arXiv:2403.13787, 2024.
Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos
Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacaeval: An automatic eval-
uator of instruction-following models, 2023.
W Lian, B Goodson, E Pentland, et al. Openorca: An open dataset of gpt augmented
flan reasoning traces, 2023.
Xinran Liang, Katherine Shu, Kimin Lee, and Pieter Abbeel. Reward uncer-
tainty for exploration in preference-based reinforcement learning. arXiv preprint
arXiv:2205.12401, 2022.

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy
Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by
step. arXiv preprint arXiv:2305.20050, 2023.
Xingzhou Lou, Dong Yan, Wei Shen, Yuzi Yan, Jian Xie, and Junge Zhang.
Uncertainty-aware reward model: Teaching reward models to know what is un-
known. arXiv preprint arXiv:2410.00847, 2024.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and
Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings
of the 49th Annual Meeting of the Association for Computational Linguistics: Hu-
man Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011.
Association for Computational Linguistics. URL [Link]
anthology/P11-1015.
AI Meta. Introducing meta llama 3: The most capable openly available llm to date.
Meta AI, 2024.

Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue
Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Long-
pre. Octopack: Instruction tuning code large language models. arXiv preprint
arXiv:2308.07124, 2023.
OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela
Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Train-
ing language models to follow instructions with human feedback. Advances in neu-
ral information processing systems, 35:27730–27744, 2022.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever,
et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9,
2019.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Er-
mon, and Chelsea Finn. Direct preference optimization: Your language model is
secretly a reward model. Advances in Neural Information Processing Systems, 36,
2024.
Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey
Cideron, Olivier Bachem, and Johan Ferret. Warm: On the benefits of weight aver-
aged reward models. arXiv preprint arXiv:2401.12187, 2024.

Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi,
and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in
large language models. arXiv preprint arXiv:2308.01263, 2023.
John Schulman. Trust region policy optimization. arXiv preprint arXiv:1502.05477,
2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov.
Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
William F Sharpe. Mutual fund performance. The Journal of business, 39(1):119–138,
1966.

Lingfeng Shen, Sihao Chen, Linfeng Song, Lifeng Jin, Baolin Peng, Haitao Mi, Daniel
Khashabi, and Dong Yu. The trickle-down impact of reward (in-) consistency on
rlhf. arXiv preprint arXiv:2309.16155, 2023.
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss,
Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with
human feedback. Advances in Neural Information Processing Systems, 33:3008–
3021, 2020.
Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy
gradient methods for reinforcement learning with function approximation. Advances
in neural information processing systems, 12, 1999.

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac,
Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth,
et al. Gemini: a family of highly capable multimodal models. arXiv preprint
arXiv:2312.11805, 2023.
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupati-
raju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette
Love, et al. Gemma: Open models based on gemini research and technology. arXiv
preprint arXiv:2403.08295, 2024.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yas-
mine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhos-
ale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint
arXiv:2307.09288, 2023.
Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tris-
tan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gal-
louédec. Trl: Transformer reinforcement learning. [Link]
huggingface/trl, 2020.
Zhilin Wang, Yi Dong, Jiaqi Zeng, Virginia Adams, Makesh Narsimhan Sreedhar,
Daniel Egert, Olivier Delalleau, Jane Polak Scowcroft, Neel Kant, Aidan Swope,
et al. Helpsteer: Multi-attribute helpfulness dataset for steerlm. arXiv preprint
arXiv:2311.09528, 2023.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue,
Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davi-
son, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Can-
wen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and
Alexander M. Rush. Transformers: State-of-the-art natural language processing.
In Proceedings of the 2020 Conference on Empirical Methods in Natural Lan-
guage Processing: System Demonstrations, pages 38–45, Online, October 2020.
Association for Computational Linguistics. URL [Link]
anthology/[Link]-demos.6.
Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and
Tong Zhang. Iterative preference learning from human feedback: Bridging theory
and practice for rlhf under kl-constraint. ICML, 2024.
Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, and Tong Zhang. Regularizing
hidden states enables learning generalizable reward model for llms. arXiv preprint
arXiv:2406.10216, 2024.
Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji
Shan, Huimin Chen, Ruobing Xie, Yankai Lin, et al. Advancing llm reasoning
generalists with preference trees. arXiv preprint arXiv:2404.02078, 2024.
Yuanzhao Zhai, Han Zhang, Yu Lei, Yue Yu, Kele Xu, Dawei Feng, Bo Ding, and
Huaimin Wang. Uncertainty-penalized reinforcement learning from human feedback
with diverse reward lora ensembles. arXiv preprint arXiv:2401.00243, 2023.

Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, and Chuang
Gan. Improving reinforcement learning from human feedback with efficient reward
model ensemble. arXiv preprint arXiv:2401.16635, 2024.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yong-
hao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-
judge with mt-bench and chatbot arena. Advances in Neural Information Processing
Systems, 36:46595–46623, 2023.
Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario
Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from
human preferences. arXiv preprint arXiv:1909.08593, 2019.

6 Proofs
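Theorem 2.2. Under Assumption 2.1, for any $\beta > 0$, the following holds with
probability at least $1 - \exp\left(-\frac{XA}{\beta^2}\right)$:

$$
\sup_{d \in D} \left( \widehat{R}^\top d - \beta \|d\|_{\Sigma} \right) \le \sup_{d \in D} r^{*\top} d.
$$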

Proof. The result follows from a standard self-normalizing bound for Gaussian random
variables. Specifically, for any $\delta > 0$, the following inequality holds with high
probability:

$$
\|\widehat{R} - r^*\|_{\Sigma^{-1}} \le \sqrt{XA \ln(1/\delta)},
$$

with probability at least $1 - \delta$, since $\|\widehat{R} - r^*\|_{\Sigma^{-1}}$ is the
self-normalized Euclidean norm of a standard Gaussian random variable in $XA$
dimensions. By applying the Cauchy-Schwarz inequality, we have, for any $d \in D$:

$$
\langle d, \widehat{R} - r^* \rangle \le \|d\|_{\Sigma} \, \|\widehat{R} - r^*\|_{\Sigma^{-1}}.
$$

Substituting the bound on $\|\widehat{R} - r^*\|_{\Sigma^{-1}}$, we obtain:

$$
\langle d, \widehat{R} - r^* \rangle \le \|d\|_{\Sigma} \sqrt{XA \ln(1/\delta)}.
$$

This completes the proof.
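The Cauchy-Schwarz step in the weighted norms can be checked numerically. The sketch below is only a sanity check under assumed inputs: the dimension, the randomly generated positive definite $\Sigma$, and the random directions $d$ are arbitrary choices, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 6
A = rng.normal(size=(dim, dim))
Sigma = A @ A.T + 0.1 * np.eye(dim)   # symmetric positive definite covariance
Sigma_inv = np.linalg.inv(Sigma)
r_star = rng.normal(size=dim)         # stand-in for the true reward vector

def check_once():
    """Return both sides of <d, R_hat - r*> <= ||d||_Sigma * ||R_hat - r*||_{Sigma^-1}."""
    R_hat = rng.multivariate_normal(r_star, Sigma)  # noisy reward estimate
    d = rng.normal(size=dim)                        # arbitrary direction
    err = R_hat - r_star
    lhs = d @ err
    rhs = np.sqrt(d @ Sigma @ d) * np.sqrt(err @ Sigma_inv @ err)
    return lhs, rhs

results = [check_once() for _ in range(1000)]
# Small tolerance absorbs floating-point rounding.
violations = sum(lhs > rhs + 1e-9 for lhs, rhs in results)
```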


Theorem 3.3. Consider policies $\pi_1$ and $\pi_2$ as defined in Definitions 3.1 and 3.2
respectively. With $\tilde{\varepsilon}$ set as $\lambda_{\min}(\Sigma)\varepsilon$ to ensure the optimization domain of the
variance-aware method is only as large as the variance unaware method, we have the
following result:

$$
P\left( \pi_2^\top r^* \le \pi_0^\top r^* \right) \le P\left( \pi_1^\top r^* \le \pi_0^\top r^* \right).
$$

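Proof. Both optimization problems ((3.1) and (3.2)) involve maximizing a linear
function over a convex domain. Thus, the maximum occurs at the boundary of the
feasible region, allowing us to replace the inequality constraints in (3.1) and (3.2) with
equality constraints. We can solve these optimization problems using the method of
Lagrange multipliers. For the variance-aware optimization problem (3.2), the
Lagrangian formulation is:

$$
\pi_2 = \operatorname*{argmax}_{\pi} \left[ \widehat{R}^\top \pi - \beta (\pi - \pi_0)^\top \Sigma (\pi - \pi_0) \right],
$$

where $\beta$ is the Lagrange multiplier associated with the covariance-weighted $\ell_2$
constraint. The solution to this optimization problem is given by:

$$
\pi_2 = \pi_0 + \frac{1}{2\beta} \Sigma^{-1} \widehat{R}. \qquad (3)
$$

To satisfy the constraint $\|\pi_2 - \pi_0\|_{\Sigma}^2 = \tilde{\varepsilon}$, we determine $\beta$ as:

$$
\beta = \frac{1}{2} \sqrt{\frac{\widehat{R}^\top \Sigma^{-1} \widehat{R}}{\tilde{\varepsilon}}}.
$$

Substituting this back into the solution for $\pi_2$ yields:

$$
\pi_2 = \pi_0 + \sqrt{\frac{\tilde{\varepsilon}}{\widehat{R}^\top \Sigma^{-1} \widehat{R}}} \, \Sigma^{-1} \widehat{R}.
$$

Similarly, for the variance-unaware policy $\pi_1$, solving the optimization problem (3.1)
yields:

$$
\pi_1 = \pi_0 + \sqrt{\frac{\varepsilon}{\widehat{R}^\top \widehat{R}}} \, \widehat{R}.
$$

Next, we compute the expected true rewards under both policies. The true reward under
$\pi_1$ is:

$$
\pi_1^\top r^* = \pi_0^\top r^* + \sqrt{\frac{\varepsilon}{\widehat{R}^\top \widehat{R}}} \, \widehat{R}^\top r^*,
$$

and under $\pi_2$, the true reward is:

$$
\pi_2^\top r^* = \pi_0^\top r^* + \sqrt{\frac{\tilde{\varepsilon}}{\widehat{R}^\top \Sigma^{-1} \widehat{R}}} \, \widehat{R}^\top \Sigma^{-1} r^*.
$$

Both policies underperform relative to $\pi_0$ if their corresponding rewards are less than or
equal to $\pi_0^\top r^*$. For $\pi_1$, this occurs if $\widehat{R}^\top r^* \le 0$, and for $\pi_2$, this occurs if
$\widehat{R}^\top \Sigma^{-1} r^* \le 0$. Since $\widehat{R}$ is normally distributed with mean $r^*$ and covariance $\Sigma$,
we have:

$$
\widehat{R}^\top r^* \sim N\left( \|r^*\|^2, \; r^{*\top} \Sigma r^* \right),
\qquad
\widehat{R}^\top \Sigma^{-1} r^* \sim N\left( r^{*\top} \Sigma^{-1} r^*, \; r^{*\top} \Sigma^{-1} r^* \right).
$$
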
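Thus, the probabilities of underperformance are given by:

$$
P\left( \widehat{R}^\top r^* \le 0 \right) = \Phi\left( - \frac{\|r^*\|^2}{\sqrt{r^{*\top} \Sigma r^*}} \right),
\qquad
P\left( \widehat{R}^\top \Sigma^{-1} r^* \le 0 \right) = \Phi\left( - \sqrt{r^{*\top} \Sigma^{-1} r^*} \right),
$$

where $\Phi$ is the standard normal cumulative distribution function. Using the Cauchy-
Schwarz inequality:

$$
\|r^*\|^2 = r^{*\top} \Sigma^{-1/2} \Sigma^{1/2} r^*
\le \left\| \Sigma^{-1/2} r^* \right\| \left\| \Sigma^{1/2} r^* \right\|
= \sqrt{r^{*\top} \Sigma^{-1} r^*} \, \sqrt{r^{*\top} \Sigma r^*}.
$$

Thus, we conclude:

$$
- \frac{\|r^*\|^2}{\sqrt{r^{*\top} \Sigma r^*}} \ge - \sqrt{r^{*\top} \Sigma^{-1} r^*}.
$$

Since the cumulative distribution function $\Phi$ is increasing, it follows that:

$$
P\left( \pi_2^\top r^* \le \pi_0^\top r^* \right) \le P\left( \pi_1^\top r^* \le \pi_0^\top r^* \right).
$$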
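The two conclusions of the proof can be checked numerically on a random instance. This is a sketch under assumed inputs: the dimension, the random positive definite $\Sigma$, the vector standing in for $r^*$, the uniform $\pi_0$, and the value of $\tilde{\varepsilon}$ are arbitrary. It only verifies the covariance-weighted norm constraint and the ordering of the two underperformance probabilities.

```python
import math
import numpy as np

def std_normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

rng = np.random.default_rng(1)
dim = 5
A = rng.normal(size=(dim, dim))
Sigma = A @ A.T + 0.5 * np.eye(dim)   # symmetric positive definite
Sigma_inv = np.linalg.inv(Sigma)
r_star = rng.normal(size=dim)

# Underperformance probabilities from the closed-form Gaussian expressions.
p_unaware = std_normal_cdf(-(r_star @ r_star) / math.sqrt(r_star @ Sigma @ r_star))
p_aware = std_normal_cdf(-math.sqrt(r_star @ Sigma_inv @ r_star))

# Closed-form variance-aware update and its constraint value.
R_hat = rng.multivariate_normal(r_star, Sigma)
pi0 = np.full(dim, 1.0 / dim)
eps_tilde = 0.01
scale = math.sqrt(eps_tilde / (R_hat @ Sigma_inv @ R_hat))
pi2 = pi0 + scale * (Sigma_inv @ R_hat)
constraint_value = (pi2 - pi0) @ Sigma @ (pi2 - pi0)   # should equal eps_tilde
```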


 

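Theorem 3.6. Consider the optimization problem:

$$
\max_{\pi} \; E_{x \sim D,\, y \sim \pi(\cdot|x)} \left[ \widehat{R}(x, y) \right]
\quad \text{subject to} \quad
E_{x \sim D,\, y \sim \pi(\cdot|x)} \left[ \sigma^2(x, y) \ln \frac{\pi(y|x)}{\pi_0(y|x)} \right] \le \varepsilon,
$$

where $\widehat{R}(x, y)$ and $\sigma^2(x, y)$ are the reward estimate and its variance. The optimal
policy is:

$$
\pi^*(y|x) \propto \pi_0(y|x) \exp\left( \frac{\widehat{R}(x, y)}{\beta \sigma^2(x, y)} \right),
$$

for some $\beta > 0$.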


Proof. The constrained optimization problem can be transformed into an unconstrained
optimization problem by introducing a Lagrange multiplier $\beta > 0$:

$$
\operatorname*{argmax}_{\pi} \; E_{x \sim D,\, y \sim \pi(\cdot|x)} \left[ \frac{\widehat{R}(x, y)}{\beta \sigma^2(x, y)} - \ln \frac{\pi(y|x)}{\pi_0(y|x)} \right].
$$

The proof follows standard techniques and can be found in Rafailov et al. (2024)
(Appendix A.1).
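The form of the optimal policy can be illustrated on a single prompt with a handful of candidate responses. The function name and all numbers below are illustrative assumptions; the sketch only shows how the variance in the exponent tempers the tilt toward high-reward responses.

```python
import numpy as np

def variance_aware_policy(pi0, R_hat, sigma2, beta):
    """pi*(y|x) proportional to pi0(y|x) * exp(R_hat(x, y) / (beta * sigma2(x, y)))."""
    logits = np.log(pi0) + R_hat / (beta * sigma2)
    logits -= logits.max()            # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()    # normalize over the candidate responses

# Four candidate responses for one prompt (illustrative numbers only).
pi0 = np.array([0.25, 0.25, 0.25, 0.25])     # uniform reference policy
R_hat = np.array([1.0, 1.0, 0.0, 0.0])       # estimated rewards
sigma2 = np.array([0.5, 4.0, 0.5, 4.0])      # reward variances
pi_star = variance_aware_policy(pi0, R_hat, sigma2, beta=1.0)
```

The first two responses share the same estimated reward, but the one with the smaller variance receives more probability mass. For the two zero-reward responses the variance has no effect, because the exponent is zero either way.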

7 Experimental Details for Reward Modeling
The hyperparameter details used in the single reward-head modeling are given in Table
2. Other parameters are kept as in Wolf et al. (2020). Table 3 summarizes the hardware
specifications and resource consumption during the single reward-head training pro-
cess, including GPU memory, disk space, and total training time. The model is trained
using four NVIDIA A40 GPUs, each with 48 GB of memory. The total disk space for
storing the dataset, model checkpoints, and logs is approximately 30 GB. Training time
is 51 hours.
Hyperparameter          Value
Effective Batch Size    32
Learning Rate           1e-5
Optimizer               Paged AdamW 32bit
Weight Decay            0.001
LR Scheduler            cosine
Epochs                  1
Global Train Steps      4125

Table 2: Hyperparameters used in training the Single Reward Model.

Resource                     Details
GPU Model                    NVIDIA A40 (40 GB)
Number of GPUs               4
Total GPU Memory             12.68 GB
Total Disk Space Required    30 GB
Total Training Time          51 hours

Table 3: Hardware requirements for training the single reward model.

The hyperparameter details used in ensemble reward modeling are given in Table
4. Other parameters are kept as in Wolf et al. (2020). Table 5 summarizes the hardware
specifications and resource consumption during the ensemble training process, includ-
ing GPU memory, disk space, and total training time. The model is trained using four
NVIDIA A40 GPUs, each with 48 GB of memory. The total disk space for storing the
dataset, model checkpoints, and logs is approximately 40 GB. Training time is 7 hours.

Hyperparameter          Value
Effective Batch Size    32
Learning Rate           1e-5
Optimizer               Paged AdamW 32bit
Weight Decay            0.001
LR Scheduler            cosine
Epochs                  0.5
Global Train Steps      2060

Table 4: Hyperparameters used in training the Ensemble Reward Model.

Resource                     Details
GPU Model                    NVIDIA A40
Number of GPUs               4
Total GPU Memory             6.12 GB
Total Disk Space Required    38 GB
Total Training Time          7 hours

Table 5: Hardware requirements for training the ensemble reward model.
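As a quick consistency check on Tables 2 and 4, the number of training examples processed can be estimated from the step counts. The sketch assumes that each global step consumes exactly one effective batch, which the tables do not state explicitly; the function name is hypothetical.

```python
def examples_seen(global_steps, effective_batch_size):
    """Training examples processed, assuming one effective batch per global step."""
    return global_steps * effective_batch_size

single = examples_seen(4125, 32)     # Table 2: 1 epoch
ensemble = examples_seen(2060, 32)   # Table 4: 0.5 epochs
implied_epoch_fraction = ensemble / single
```

Under this assumption the single reward model sees 132,000 examples in its one epoch and the ensemble heads see 65,920, which is close to the half epoch reported in Table 4.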

Figures 8(a) and 8(b) depict the training loss curves for both the single and ensem-
ble reward models. In particular, we early-stop the fine-tuning of the single reward-
head model when the loss dips below the 0.4 mark. We then attach 10 reward heads
parallel to the final layer, freeze the base model, and retrain only the reward heads until
the average training loss for each reward head is close to 0.2.
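The head-retraining step can be sketched in a simplified setting. The example below is not the authors' training code: random vectors stand in for the frozen base model's final-layer features, the heads are plain linear maps trained with full-batch gradient descent on the Bradley-Terry loss, and the function name, dimensions, learning rate, and step count are assumptions.

```python
import numpy as np

def train_heads(feat_chosen, feat_rejected, n_heads=10, lr=0.1, steps=200, seed=0):
    """Train n_heads linear reward heads on frozen features (Bradley-Terry loss)."""
    rng = np.random.default_rng(seed)
    # One weight row per head, each with its own random initialization.
    W = rng.normal(scale=0.1, size=(n_heads, feat_chosen.shape[1]))
    diff = feat_chosen - feat_rejected        # features are fixed: the base is frozen
    losses = []
    for _ in range(steps):
        margin = diff @ W.T                   # r(chosen) - r(rejected), per pair and head
        # -log sigmoid(margin), averaged over pairs, one value per head.
        losses.append(np.logaddexp(0.0, -margin).mean(axis=0))
        sig = 1.0 / (1.0 + np.exp(-margin))
        grad = -((1.0 - sig).T @ diff) / diff.shape[0]
        W -= lr * grad                        # only the heads are updated
    return W, np.array(losses)

# Synthetic preference pairs: the chosen item has the higher hidden "true" reward.
rng = np.random.default_rng(1)
w_true = rng.normal(size=8)
a, b = rng.normal(size=(256, 8)), rng.normal(size=(256, 8))
a_better = (a @ w_true >= b @ w_true)[:, None]
chosen, rejected = np.where(a_better, a, b), np.where(a_better, b, a)

W, losses = train_heads(chosen, rejected)
head_rewards = chosen @ W.T                                   # (n_pairs, n_heads)
reward_var_across_heads = head_rewards.var(axis=1, ddof=1)    # per-pair disagreement
```

In this toy setting the heads differ only through their initialization, yet they still assign different rewards to the same input, which is the per-pair variance the ensemble is meant to expose.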
In Figure 9, we present the performance of the ten models evaluated across four
datasets on the RewardBenchmark platform: Chat, Chat-Hard, Reasoning, and
Safety. In particular, we compare the ensemble models, which are trained with a frozen
base, against a fully fine-tuned single reward-head model. Our results indicate that the
models within the ensemble perform on par with each other and are comparable to the
fully fine-tuned single reward-head model.

Figure 8: Training Loss for Reward Modelling. (a) Training Loss for a Single Reward Model. (b) Training
Loss for an Ensemble of 10 reward models.

Figure 9: The comparison of each model in the ensemble with the single reward-head model on all evaluation
datasets of the RewardBenchmark platform: (a) Chat, (b) Chat Hard, (c) Safety, (d) Reasoning. In particular,
the 10 blue bars indicate the model accuracy for each of the 10 models. The accuracy of the base model is
given in orange. We see that for each of the 10 models in the ensemble, the performance is comparable with
the base model.

8 Experimental Details for PPO Training


The hyperparameters used in both the vanilla and the variance-aware PPO training
are given in Tables 6 and 7. Most of the hyperparameters are taken as in von Werra
et al. (2020). The major difference between the two methods is the choice of the β
parameter, which controls the constraint domain of the optimization problem. To be
consistent, we choose the β parameter such that the KL divergence from the reference
policy is roughly the same for both methods. This ensures that the search domains for
both methods are roughly the same. The β parameter appears as the Initial KL Coeff
variable in the hyperparameter tables.
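The idea of choosing β so that a target KL divergence is reached can be illustrated on a toy problem. The sketch below is only an analogue, not the experimental procedure: it uses the closed-form KL-regularized policy over a few discrete responses instead of PPO training, and the function names, rewards, search bounds, and target KL of 0.1 are assumptions.

```python
import math
import numpy as np

def tilted(pi0, r, beta):
    """Closed-form KL-regularized policy: pi proportional to pi0 * exp(r / beta)."""
    logits = np.log(pi0) + r / beta
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

def kl_categorical(p, q):
    """KL(p || q) for categorical distributions, skipping zero-mass entries of p."""
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

def beta_for_target_kl(pi0, r, target_kl, lo=1e-3, hi=1e3, iters=100):
    """Geometric bisection on beta; the KL shrinks as beta grows."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if kl_categorical(tilted(pi0, r, mid), pi0) > target_kl:
            lo = mid   # KL too large: a stronger penalty is needed
        else:
            hi = mid
    return math.sqrt(lo * hi)

pi0 = np.full(4, 0.25)
r = np.array([1.0, 0.5, 0.0, -1.0])
beta = beta_for_target_kl(pi0, r, target_kl=0.1)
achieved_kl = kl_categorical(tilted(pi0, r, beta), pi0)
```

Running the same search with a different reward signal would generally return a different β for the same target KL, which is consistent with the two methods using different Initial KL Coeff values.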
Hyperparameter          Value
Effective Batch Size    128
Learning Rate           1.414e-5
Epochs                  1
Steps                   192
Initial KL Coeff        0.2
Adaptive KL Control     False

Table 6: Hyperparameters used in training with vanilla PPO method.

Hyperparameter          Value
Effective Batch Size    128
Learning Rate           1.5e-5
Epochs                  1
Steps                   192
Initial KL Coeff        0.05
Adaptive KL Control     False

Table 7: Hyperparameters used in training with Variance Aware PPO method.

Table 8 summarizes the hardware specifications and resource consumption for training
a single GPT-2 model using PPO, including GPU memory, disk space, and total
training time. The model is trained using four NVIDIA A40 GPUs, each with 48 GB
of memory. The total disk space for storing the dataset, model checkpoints, and logs is
approximately 6.55 GB. Training time is roughly 4 hours.

Resource Details
GPU Model NVIDIA A40
Number of GPUs 4
Total GPU Memory 18.4 GB
Total Disk Space Required 6.55 GB
Total Training Time 3.86 hours

Table 8: Hardware requirements for training a single PPO model.

Figure 10 shows the evolution of the KL divergence between the trained and reference
policies for both methods. The average and standard deviation of the KL divergence
over the 40 policies are plotted for each method. As can be seen, with high
probability the KL divergence for both methods lies within the range of 1.2 to 1.4.
Each of the 40 independent policies was run with an initial random seed of 0.

Figure 10: The trajectories of the KL divergence as a function of training steps are plotted for both
methods. Specifically, we plot the mean KL and the standard deviation of the KL for the 40 independently
trained policies for both methods. Green denotes the KL trajectory for the vanilla PPO method, whereas
blue indicates the variance-aware method. As can be seen, by the end of the training, with high probability,
the KL divergence of the final policy from the reference policy is roughly the same for both methods. In
particular, both methods produce policies whose KL divergences from the reference policy lie between 1.2
and 1.4.

Figure 11 shows the evolution of the rewards collected by the policies for both
methods. The average and standard deviation of the rewards for the 40 policies for
both sets of methods are plotted.

Figure 11: The trajectories of the proxy reward as a function of training steps are plotted for both methods.
Specifically, we plot the mean proxy reward and the standard deviation of the proxy rewards for the 40
independently trained policies for both methods. Green denotes the trajectory for the vanilla PPO method,
whereas blue indicates the variance-aware method.

In Figure 12, we repeat the experiment of Section 5, but this time with 100 sample
policies trained using the vanilla and the variance aware method and evaluated using
the judge reward model.

Figure 12: The reward distribution for the two methods compared with the reference policy’s quality.
The distribution marked in indigo represents the reward distribution for the reference policy, based on 100
samples of the average reward determined by the judge reward model on responses generated by GPT-2.
The reward distribution from the reference policy has a mean of 0.15 and a variance of 0.012. The reward
distribution for the variance-aware method (in red) has a mean of 0.41 and a variance of 0.016. The reward
distribution for the vanilla PPO method (in cyan) has a mean of 0.43 and a variance of 0.038.
