0% found this document useful (0 votes)

4 views24 pages

Reward Collapse in Aligning Large Language Models

This paper discusses the phenomenon of reward collapse in large language models (LLMs), where the reward distribution becomes identical regardless of the prompts during training, leading to miscalibration of human preferences. The authors propose a prompt-aware optimization scheme to mitigate this issue by using distinct utility functions based on the nature of the prompts, allowing for a more accurate reward distribution. Experimental results demonstrate that this approach significantly alleviates reward collapse and improves the training of reward models.

Uploaded by

therryleseul

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

4 views24 pages

Reward Collapse in Aligning Large Language Models

Uploaded by

therryleseul

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Reward Collapse in Aligning Large Language Models

Ziang Song* Tianle Cai† Jason D. Lee† Weijie J. Su‡

May 25, 2023

arXiv:2305.17608v1 [[Link]] 28 May 2023

Abstract
The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4
are in part unleashed by aligning them with reward models that are trained on human preferences,
which are often represented as rankings of responses to prompts. In this paper, we document the
phenomenon of reward collapse, an empirical observation where the prevailing ranking-based
approach results in an identical reward distribution regardless of the prompts during the terminal
phase of training. This outcome is undesirable as open-ended prompts like “write a short story
about your best friend” should yield a continuous range of rewards for their completions, while
specific prompts like “what is the capital of New Zealand” should generate either high or low
rewards. Our theoretical investigation reveals that reward collapse is primarily due to the
insufficiency of the ranking-based objective function to incorporate prompt-related information
during optimization. This insight allows us to derive closed-form expressions for the reward
distribution associated with a set of utility functions in an asymptotic regime. To overcome
reward collapse, we introduce a prompt-aware optimization scheme that provably admits a
prompt-dependent reward distribution within the interpolating regime. Our experimental results
suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse
during the training of reward models.

1 Introduction
A cornerstone of the recent remarkable advancements in the capabilities of large language models
(LLMs) like ChatGPT and GPT-4 is the integration of human feedback [25, 24]. The approach
to leveraging human feedback often begins with the training of a reward model that encapsulates
human preferences, values, and ethical considerations [8, 14, 2, 29, 10]. This is followed by the
fine-tuning of the LLMs using reinforcement learning, guided by the reward model. This process,
often referred to as reinforcement learning from human feedback (RLHF), has proven effective in
aligning LLMs with human intent, substantially enriching the quality of human interaction.
However, developing an effective reward model based on human preferences is challenging [4,
19, 27]. A notable difficulty arises when a human labeler struggles to give a quantitative score
to a response/completion for a specific prompt. Instead, it is much easier for humans to make
pairwise comparisons between completions in terms of their quality, which is indeed employed in the
development of InstructGPT [25]. Specifically, a human labeler is presented with several completions
*
Stanford University
†
Princeton University
‡
University of Pennsylvania. Email: suw@[Link].

1
generated by the LLMs for the same prompt and arranges the responses from the highest to lowest
perceived quality.1 A neural network is then trained to obtain a reward model that assigns rewards
to the responses in an attempt to align as closely as possible with human preferences in the form of
rankings.
Despite some benefits, such as eliminating calibration issues, rankings fall short in reflecting the
varied reward distributions of different prompts. This is because ranking one completion higher than
another does not indicate how much superior the former is compared to the latter. This concern is
especially pertinent in RLHF as some prompts are open-ended or, in other words, are dependent on
the users’ backgrounds, allowing the reward distribution to span a continuous range. Conversely,
some prompts are closed-ended, resulting in a response that should be either highly or lowly scored,
thus generating a roughly two-point mass distribution for the reward distribution. Instances of the
first type of prompts include write a short story about how AI will look like in 100 years and what is
the best cuisine in the world, while examples of the second type are prove the Pythagorean theorem
and is chicken a dinosaur. The reward model may struggle to aid LLMs in accurately calibrating
uncertainty without accounting for the nuances of different prompts.2

Figure 1: Illustration of reward collapse, with rewards assigned to eight responses, arranged from
least to most preferred. One type of prompt is open-ended, which should result in a roughly uniform
distribution of rewards, while the other is closed-ended, which should yield either high or low rewards
(polarized). However, as evidenced in the first three plots, when a common utility function is
employed (see Eq. 1 in Section 2), the two types of prompts result in a strikingly similar reward
distribution. Conversely, when a prompt-aware utility is applied, as seen in the fourth plot, the two
types of prompts exhibit distinct reward distributions. Further details are elaborated in Section 3.
All of our code is publicly available at [Link]

As our first main contribution, this paper documents a surprising phenomenon through a series
of experiments, demonstrating that training a reward model on preference rankings could result in
the same reward distribution regardless of the prompts. We call this phenomenon reward collapse,
1
Through private communication, [25] required human labelers to utilize a drag-and-drop interface to construct
consistent rankings from pairwise comparisons.
2
For instance, we suspect that this is partly accountable for the poor calibration of GPT-4 after RLHF (see page 12
of [24]), although we are unable to verify due to the black-box nature of GPT-4 as well as insufficient computational
resources.

2
which occurs during the terminal phase of training [26]. Intriguingly, our theoretical analysis first
predicted this phenomenon before it was confirmed experimentally. Indeed, we show that the collapse
reward distribution can be numerically deduced from a simple optimization program or, even simpler,
admits a closed-form expression. As demonstrated in Figure 1, our prediction of reward collapse is
in excellent agreement with the empirical results.
Reward collapse is clearly undesirable, as it overlooks the subtle differences among various
prompts, potentially leading to the miscalibration of human preference during the training of LLMs
via reinforcement learning with the reward model. A rudimentary strategy to bypass this issue is to
early stop the training of the reward model [25], which, however, is somewhat arbitrary and can
make it challenging to determine the stopping point.
As our second main contribution, we introduce a principled approach to alleviating reward
collapse, leveraging insights derived from the same optimization program that was instrumental in
predicting this phenomenon. In essence, we propose to use distinct utility functions depending on
prompts in training the reward model, such that the resulting reward distribution can be either
widely dispersed or tightly concentrated, contingent on whether the prompt is open-ended or closed-
ended. A notable advantage of this prompt-aware strategy is that our analysis is analytical, enabling
full control over the shape of the reward distribution as required. As depicted in the right-most
panel of Figure 1 and more results in Section 3, our experiments show that reward collapse can be
substantially mitigated using this prompt-aware methodology.

2 What Is Reward Collapse and How to Mitigate It?

2.1 Reward Collapse
Denote by R(prom, compl) a reward model. Without loss of generality, we assume 0 ≤ R(prom, compl) ≤
1. For a given prompt prom and n completions that are i.i.d. draws from an LLM, a human labeler
ranks the n responses from the most preferred to the least preferred, and the ranking is denoted as
πprom . The reward model is expected to score each completion that is consistent with the human-
provided ranking πprom as much as possible. To this end, we train a neural network that maximizes
the following overall utility:
X
U (Rθ (prom, complw ) − Rθ (prom, compll )) , (1)
(prom,complw ,compll )∈Π

where U is an (increasing) utility function, θ is the weights of the reward neural network, and
Π is the ranking dataset and complw is a preferred completion than compll in the ranking πprom .
ex/σ
In InstructGPT [25], U is set to Uσ (x) = log sigmoid(x/σ) ≡ log ex/σ +1
, which is an increasing
concave function. While maximizing Eq. 1, the reward model learns to not only align with the
human-provided ranking but also distinguish the rewards as much as possible.
To gain insights into how the rewards depend on U , note that the above is equivalent to
X X
max U (Rθ (prom, complw ) − Rθ (prom, compll )) .
prom (complw ,compll )∈πprom

Next, assume that the neural network parameterized by θ is sufficiently overparameterized such that
X
U (Rθ (prom, complw ) − Rθ (prom, compll ))
(complw ,compll )∈πprom

3

is exactly maximized. This is precisely the same as maximizing 1≤i<j≤n U rπprom (i) − rπprom (j)
P

over 0 ≤ r1 , . . . , rn ≤ 1. However, the solution to this optimization program is independent of the

prompt and, indeed, is the same as the solution to
X
max U (ri − rj ) (2)
0≤r1 ,...,rn ≤1
1≤i<j≤n

up to a permutation. That is, the empirical distribution of the rewards is independent of the prompt
itself in the interpolating regime, thereby leading to reward collapse.

2.2 Prompt-Aware Optimization

To avoid having the same reward distribution, one simple strategy is early stopping. While reward
collapse can be avoided via early stopping, early stopping might make the model neglect other
important features. A more principled approach is to change the objective. Our proposal is to let the
utility function U now depend on the prompt. That is, now we consider training a neural network
that maximizes
X
Uprom (Rθ (prom, complw ) − Rθ (prom, compll )) . (3)
(prom,complw ,compll )∈Π

In general, the choice of Uprom should reflect the open-endedness of the prompt prom. An
important feature is that if Uprom is concave, this problem becomes a convex optimization problem
(Lemma 4.1). Given the high flexibility in choosing Uprom , it is generally recommended to let the
practitioners choose these functions to meet their needs. Nonetheless, below we introduce a family
of such functions.
For a strictly increasing utility function U , it can be easily demonstrated that the maximum can
only be attained when r1 ≥ · · · ≥ rn (see Lemma B.1 in the Appendix). As a result, we only need to
consider the problem X
max U (ri − rj ) . (4)
0≤rn ≤...≤r1 ≤1
1≤i<j≤n

Class 1. Let Uγ (x) = xγ , x ∈ [0, 1], for some 0 < γ < 1. This utility function encourages the
reward distribution to take values either near 0 or 1 as γ tends to be large. Some plots showing the
empirical distribution of solutions to (2) is given in Figure 2(a) and (b).

Class 2. Let Uγ (x) = −x−γ , x ∈ (0, 1], for 0 < γ ≤ 1 and U0 (x) = log x for γ = 0. We also let
Uγ (0) = −∞ for 0 ≤ γ ≤ 1. In this case, the reward distribution of Eq. 2 becomes more even as γ
increases from 0 to 1. Some plots are shown in Figure 2(c) and (d).

Class 3. Let Uσ (x) = log sigmoid(x/σ) for σ > 0. The reward distribution becomes more spread
between 0 and 1 as σ becomes smaller. Some plots are shown in Figure 2(e) and (f).

2.3 Asymptotics
In general, we can explicitly evaluate the reward distribution for any n by solving the optimization
program (4). Nevertheless, it is helpful to get a handle on the empirical distribution of the solution

4
1 1 1

0.8 0.8 0.8

0.6 0.6 0.6

0.46 0.45
0.485
0.4 0.4 0.455
0.4 0.445

0.48 0.45 0.44

0.2 0.2 0.445 0.2 0.435

0.475 0.44
0.43
0.4 0.405 0.41 0.415 0.42 0.4 0.405 0.41 0.415 0.42 0.4 0.405 0.41 0.415 0.42
0 0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

(a) U (x) = x0.8 (b) U (x) = x0.2 (c) U (x) = log x

1 1 1

0.8 0.8 0.8

0.6 0.6 0.6

0.43 0.48 0.44

0.4 0.4 0.475
0.4
0.42 0.42
0.47

0.2 0.41 0.2 0.465 0.2 0.4

0.46
0.4 0.38
0.4 0.405 0.41 0.415 0.42 0.36 0.38 0.4 0.36 0.38 0.4
0 0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

(d) U (x) = −x−1 (e) U (x) = log sigmoid(x) (f) U (x) = log sigmoid(4x)

Figure 2: Reward distribution for different utility function.

to this optimization program in the limit n → ∞. The next result gives a closed-form expression of
the reward distribution in the case of a large number of completions.

Theorem 1. Let Uγ (x) = 0 < γ < 1. Then the reward distribution of (4)
xγ for some converges to
1−γ 1−γ − 1+γ 1+γ
the Beta distribution Beta 2 , 2 as n → ∞, which has probability density x 2 (1 − x)− 2
on (0, 1).

The proof of Theorem 1 is defered to Section 4.

Theorem 2. For Uγ (x) = −x−γ for 0 ≤ γ ≤ 1 (as a convention, take U0 (x) = log x). Then. the
reward distribution of (4) converges in distribution to Beta( 1+γ 1+γ
2 , 2 ).

The proof of Theorem 2 can be found in [21, 17]. In the limit γ → 1 in Theorem 2, the Beta
distribution tends to Beta(1, 1), which is the uniform distribution on [0, 1]. This is indeed an example
of the one-dimensional Thomson problem [5], which asks the configuration of n electrons constrained
to a line that repel each other with a force given by Coulomb’s law. This problem was first considered
by Maxwell. Indeed, [21, 11, 1] prove that the reward distribution will converge to the uniform
distribution for Uγ (x) = −x−γ with γ ≥ 1.
For the above two classes, the limiting distribution does not admit a probability mass. However,
probability mass can emerge in the case of a scaled log-sigmoid function.

Theorem 3. If U is strictly increasing and concave, the derivative of the utility function satisfies
U ′ (0) < ∞, U ′ (1) > 0, then the reward distribution of (4) converges in distribution to a probability

5
measure µ∗ that satisfies
µ∗ ({0}) = µ∗ ({1}) ≥ 1
κ+1 ,

where κ = U ′ (0)/U ′ (1).

In general, the reward distribution can be characterized from a variational perspective. This
gives the following theorem.

Theorem 4. If U is bounded, strongly concave, and increasing. There exists a probability measure µ∗
such that the reward distribution of (2) converges in distribution to µ∗ , which is uniquely determined
by the following two properties:

(a) µ∗ maximizes
E iid U (|X − X ′ |)
X,X ′ ∼ µ

over all probability measures µ on [0, 1], and

(b) it is symmetric with respect to 12 in the sense that, for any measurable set A ∈ [0, 1] and
1 − A = {x : 1 − x ∈ A}, µ∗ (A) = µ∗ (1 − A).

3 Experiments
In this section, we conduct experiments to investigate the phenomenon of reward collapse in a
controlled setting and demonstrate that prompt-aware training can prevent reward collapse.

3.1 Experimental Setup

The open-source datasets currently available for RLHF are rather limited. Most of these datasets [22,
3] typically include only a handful of candidate responses (usually a single pair) for each corresponding
prompt question. Moreover, the ranking signals in those datasets are usually noisy, either because
they are sourced from the Internet [9] or because of the inherent subjectivity of the ranking process.
In order to conduct a carefully controlled experiment, we curated our own dataset, focusing on
a single, simplified feature—the length of the response, measured in terms of word count—as the
ground truth reward. A subset of questions was selected from the LongForm dataset [16], a question-
answer dataset characterized by its lengthy answers. To simulate scenarios with open-ended and
concrete problems, we truncated the original answer according to two distinct length distributions,
thereby generating eight responses for each prompt: the first distribution is nearly uniform, ranging
from 10 to 80 words, while the second is a polarized distribution with response lengths primarily
clustered around either 30 or 60 words. Each question was randomly assigned as either open-ended
or concrete. Additionally, the phrases “Write the answer in an open-ended way” and “Write either a
short answer or a long answer” were added to the open-ended and concrete questions, respectively,
to distinguish the question type. Following this process, we constructed a dataset comprising 8192
training questions and 16 test questions.
In our experiments, we focus on the following U functions: x, log x, −1/x, log sigmoid(x), and
the prompt-aware U , which adaptively selects U from x and −1/x. Given that the U function
operates on x in the range [−1, 1], we adjust some U functions with suitable continuous extensions
or scaling. We then train a DeBERTa V3 [12] as the reward model. The training details can be
found in Appendix A.

6
Figure 3: Reward collapse on the test set. This figure follows the same setting as Figure 1 while
the evaluation is on the test set. As we can see from the figure, the reward distributions have similar
collapse phenomenons on the test set, and using prompt-aware loss can mitigate the collapse.

3.2 Experimental Results

Fixed loss function leads to reward collapse. As depicted in Figure 3.2, reward distributions
corresponding to different prompts gradually converge towards a single, prompt-independent dis-
tribution throughout the training process. Specifically, in the context of Figure 3.2, where the U
function is represented by LogSigmoid, the reward distribution exhibits positive probability mass at
reward scores of 0 and 1 (illustrated by the flat segments corresponding to the first two and last two
scores). This observation validates the prediction encapsulated in Theorem 3. Examining other U
functions, Figures 1 and 3 collectively indicate the occurrence of loss collapse on both training and
test datasets. Specifically, employing x as the U function results in a polarized reward distribution,
whereas utilizing −1/x as the U function yields a uniform reward distribution.

Prompt-aware training avoids reward collapse. Figures 1 and 3 show the reward distribution
at the end of training with varying utility functions. The results along with Figure 3.2 reveal that
using a prompt-aware U function effectively prevents reward collapse across both training and test
datasets. This strategy yields a more uniform reward distribution for open-ended prompts while
promoting a more polarized reward distribution for concrete prompts.

4 Proofs
In this section, we present the proofs of the theorems in Section 2. For ease of presentation, we start
by proving Theorem 4. Let
X
S(r1 , · · · , rn ) := U (ri − rj ) and r̂ ≡ (r̂1 , . . . , r̂n ) := arg max S(r1 , · · · , rn ).
0≤r1 ,··· ,rn ≤1
1≤i<j≤n

In addition, for any vector (u1 , · · · , un ) ∈ Rn , we employ boldface notation u to represent the entire
vector. This allows us to write S(r).

7
(a) log sigmoid as utility function (b) Prompt-aware utility function

Figure 4: (Left) Reward collapse when using log sigmoid as utility function [25]. The
reward distribution of different prompts gradually converges into a single distribution during training.
(Right) Prompt-aware training avoids reward collapse. When using the prompt-aware loss
function, the reward distributions of the two different prompts can be gradually separated during
training.

4.1 Proof of Theorem 4

First, when U is concave and strictly increasing, r̂ exhibits the following properties:
Lemma 4.1. If U is strictly concave and strictly increasing, the function S(r) is concave. Therefore,
the optimization problem uniquely determines r̂n . Additionally, the following properties hold: (1)
r̂1 ≥ · · · ≥ r̂n , and (2) 1 − r̂i = r̂n−i+1 for any 1 ≤ i ≤ n.
The proof of Lemma 4.1 is straightforward and is provided in Appendix B.1. Upon further
examination of the function S(r), we discover that if U is strongly concave with parameter µ > 0,
then S also exhibits some kind of strongly concavity, except in the direction (1, 1, · · · , 1). This
property is formulated in the following lemma.
Lemma 4.2. If U is strongly concave with parameter µ > 0, and we consider another vector
u = (u1 , . . . , un ) where u1 ≥ · · · ≥ un , the following inequality holds:
nµ
S(u) − S(r̂) ≤ − ∥ ProjVn (u − r̂)∥2 .
2
Here, Vn ⊂ Rn denotes the subspace orthogonal to (1, 1, · · · , 1), and ∥ · ∥ represents the Euclidean
norm.
The proof of this lemma can be found in Appendix B.2. Our next lemma quantifies the difference
between two symmetric probability measures.
(j)
Lemma 4.3. For two different symmetric probability measure µ1 and µ2 on [0, 1], let ri = 12 inf{t :
n−i
µj ([0, t]) ≥ n−1 } + 12 sup{t : µj ([0, t)) < n−1
n−i
}}), i = 1, 2, · · · , n; j = 1, 2. Then there exists positive
constant c0 such that

∥ ProjVn (r(1) − r(2) )∥2 ≥ c0 n,

for all n.

8
The proof of this lemma is also provided in Appendix B.3. Now, we are ready to prove the
uniqueness part of Theorem 4. Due to the length constraint, we will present it as a separate lemma
and defer the proof to Appendix B.4. In summary, we use Lemma 4.2 and Lemma 4.3 to demonstrate
that for two distinct symmetric measures, their distance is sufficiently large such that at least one of
them is not optimal.
Lemma 4.4. If µ1 and µ2 are two symmetric probability measure which both maximize
E iid U (|X − X ′ |)
X,X ′ ∼ µ

over all probability measures µ on [0, 1]. Then we have µ1 = µ2 .

Now we are ready to prove the convergence part of Theorem 4.
Proof of Theorem 4. Let P̂n := n1 ni=1 δr̂n denote the empirical distribution of r̂n . Note that {P̂n }
P
are probability measures defined on [0, 1], so they are tight. By Prohorov’s theorem, there exists
d iid iid
a sub-sequence {k(n)}n≥1 such that P̂k(n) → µ̂. Let Xn , Xn′ ∼ P̂n and X̂, X̂ ′ ∼ µ̂. By continuous
d
mapping theorem, we also have |Xn − Xn′ | → |X̂ − X̂ ′ |. Moreover, because U is bounded and
continuous, Portmanteau theorem gives
E iid U (|X − X ′ |) → E iid U (|X − X ′ |).
X,X ′ ∼ P̂k(n) X,X ′ ∼ µ̂

d
Let µ be another probability measure on [0, 1]. Let Q̂n = n1 ni=1 δqn,i such that Q̂n → µ. By the
P
same argument before, we also have E ′ iid
U (|X − X ′ |) → E ′ iid
U (|X − X ′ |). Then by
X,X ∼ Q̂k(n) X,X ∼ µ
the optimal assumption of r̂n ,
E iid U (|X − X ′ |) = lim E iid U (|X − X ′ |)
X,X ′ ∼ µ̂ n→∞ X,X ′ ∼ P̂k(n)

≥ lim E iid U (|X − X ′ |) = E iid U (|X − X ′ |).

n→∞ X,X ′ ∼ Q̂k(n) X,X ′ ∼ µ

This means µ̂ maximize E iid U (|X − X ′ |) over all probability measure µ on [0, 1]. From Lemma
X,X ′ ∼ µ
4.1, we know that 1 − r̂i = r̂n−i+1 , so µ̂ is symmetric. If there is another sub-sequence m(n) such
d
that P̂m(n) → ν̂. By the same argument before, ν̂ is also optimal and symmetric. From Lemma 4.4,
µ̂ = ν̂. Thus for every converging sub-sequence of {P̂n }, the limit distribution must be the same. By
d
the tightness of {P̂n }, we have P̂n → µ∗ .

4.2 Proof of Theorem 1

For the utility function Uγ (x) = xγ , having established Theorem 4, our objective is to identify
a symmetric probability measure µ∗ that maximizes E iid Uγ (|X − X |). By employing the
′
X,X ′ ∼ µ
variational principle, we can derive a condition that is necessary for optimality. Notably, this
condition also suffices for optimality.
Lemma 4.5. Let Uγ (x) = xγ for some γ ∈ (0, 1). A probability measure µ on [0, 1] will maximize
E ′ iid
Uγ (|X − X ′ |) if it satisfies the condition that EX∼µ Uγ (|X − c|) is independent of c ∈ [0, 1].
X,X ∼ µ
The proof of Lemma 4.5 is provided in Appendix C.1. Therefore, proving Theorem 1 is reduced
to verifying the condition stated in Lemma 4.5. This verification process is tedious and will be
deferred to Appendix C.2 for brevity.

9
4.3 Proof of Theorem 3
Theorem 3 can be intuitively understood as follows: If the function U satisfies U ′ (0) < ∞ and
U ′ (1) > 0, we can show, by analyzing the first-order optimality condition, that a positive fraction of
r̂ is equal to 1.

Proof of Theorem 3. The derivative of − i<j U (ri − rj ) with respect to rk is given by

P k−1 N
∂ U (ri −rj ) X
′
X
− i<j
∂rk = U (r̂i − r̂k ) − U ′ (r̂k − r̂j ) ≤ (k − 1)U ′ (0) − (n − k)U ′ (1).
r̂1 ,··· ,r̂n
i=1 i=k+1

The inequality follows from the convexity of U . If k ≤ n/(κ+1), we have (k−1)U ′ (0)−(n−k)U ′ (1) ≤ 0.

Hence, we can get r̂k = 1. Otherwise, we could increase r̂k to make i<j U (r̂i − r̂j ) larger. As
P

a result, r̂1 = · · · = r̂[n/(κ+1)] = 1. This gives P̂n ({1}) ≥ [ κ+1

n
]/n. By Theorem 4, we know that
d
there exists a limiting distribution µ∗ such that P̂ → µ∗ and µ∗ ({1}) ≥ 1/(κ + 1). Due to symmetry
proved in Lemma 4.1, we also have µ∗ ({0}) ≥ 1/(κ + 1).

5 Extension to Pairwise Comparisons

Our Prompt-Aware approach can be generalized to accommodate other settings, such as instances
where only pairwise preference data is accessible. Pairwise preference data may include loops, similar
to the rock-paper-scissors scenario, and can be produced from a probabilistic model. Consequently,
the data might simultaneously indicate a preference of A over B and a preference of B over A.
Pairwise preference data is extensively utilized in RLHF [8, 14, 29, 25, 28].
We explore the well-known Bradley-Terry-Luce (BTL) model [7, 20], which assumes the ex-
istence of scores {θi }1≤i≤n for n items such that the preference between item i and item j is
given by P(i is preferred over j) = sigmoid(θi − θj ), where sigmoid denotes the sigmoid function
sigmoid(x) = 1/(1 + exp(−x)). This probabilistic model effectively captures the relative preferences
between items, based on the disparity in their underlying scores.
To illustrate our framework, we consider the following expected version problem:
X
max S(r1 , · · · , rn ), where S(r1 , · · · , rn ) = U (ri − rj )sigmoid(θi − θj ).
0≤r1 ,··· ,rn ≤1
1≤i,j≤n

The function S(r) is similar to a family of log-likelihood functions considered in [23]. We presume
that U is increasing and concave. Then similar to Lemma P 4.1, U is also concave in (r1 , · · · , rn ).
Let r̂ = (r̂1 , . . . , r̂n ) be the vector that maximizes S(r) = 1≤i,j≤n U (ri − rj )sigmoid(θi − θj ). We
present the following consistency result for r̂:
Theorem 5. Assume that U is increasing and µ-strongly concave for µ > 0. Write θmax =
max1≤i≤n |θi |. Then, r̂ keeps the order of {θi }1≤i≤n and satisfies
q
|r̂i − r̂j | ≤ 2 U (1)(1 + eθmax )|θi − θj |/µ.

The proof of these results can be found in Appendix D. Theorem 5 ensures that for any increasing
and strongly concave utility function U , r̂ is a reliable estimate of {θi }1≤i≤n , in the sense that r̂i
and r̂j are close if θi and θj are close.

10
Even though we may not be able to determine the precise limiting distribution of rn in this
extended setting, we can still extract insights from our previous analysis in Section 2. As previously
observed, selecting U (x) = x tends to polarize the reward distribution, while selecting U (x) = −1/x
yields a more uniform reward [Link] phenomenon is also evident in this setting, as observed
in the results presented in Figure 5. More details is given in Appendix D.

1 1

0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2

0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

Figure 5: Reward distribution with different choice of {θ}1≤i≤n when n = 20.

Based on these findings, we can conclude that in this extended setting, we can also employ
a prompt-aware utility function U to mitigate reward collapse and achieve the desired reward
distribution by carefully selecting the form of U . This provides us with flexibility in shaping the
reward distribution according to our specific requirements.

6 Discussion
In this paper, we have introduced an empirical phenomenon known as reward collapse that arises
during reward model training for aligning LLMs using human preference rankings. This phenomenon
results in the same reward distribution regardless of the prompt type. The occurrence of reward
collapse stems from neural network interpolation during the final training phase. To mitigate reward
collapse, we propose utility functions that consider the nature of prompts and an analytical framework
that evaluates reward distribution, yielding closed-form reward expressions. Synthetic experiments
substantiate our findings, presenting a method superior to early stopping to tackle reward collapse.
While our experiments provide valuable insights, it is important to acknowledge their limitations,
primarily due to constrained computational resources available to us. Given abundant resources,
future research can explore the use of a more diverse range of prompts, varying in terms of their
open-endedness. Additionally, it would be interesting to investigate the extent to which the trained
reward model enhances the capabilities of large language models, such as their ability to self-calibrate
uncertainty [18, 15]. Theoretical investigations could focus on finding increasing, concave functions
that precisely match a given discrete reward distribution. On the practical side, developing a method
to choose a utility function based on prompts, perhaps using a parameter such as γ in Section 2.2,
poses an intriguing avenue for further exploration. Furthermore, exploring the potential benefits of
truncated ranking by requiring human labelers to provide partial rankings of acceptable completions
and ignore unacceptable completions could offer valuable insights into improving the training of
reward models.

11
Acknowledgments
We are grateful to Banghua Zhu for helpful discussions. We also thank Long Ouyang and Jan Leike
for clarifications on [25]. This work was supported in part by NSF through CAREER DMS-1847415
and CCF1934876, Analytics at Wharton, and Wharton AI and Analytics for Business.

References
[1] P. Amore and M. Jacobo. Thomson problem in one dimension: Minimal energy configurations
of n charges on a curve. Physica A: Statistical Mechanics and its Applications, 519:256–266,
2019.

[2] D. Bahdanau, F. Hill, J. Leike, E. Hughes, A. Hosseini, P. Kohli, and E. Grefenstette. Learning
to understand goal specifications by modelling reward. arXiv preprint arXiv:1806.01946, 2018.

[3] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli,
T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from
human feedback. arXiv preprint arXiv:2204.05862, 2022.

[4] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirho-
seini, C. McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint
arXiv:2212.08073, 2022.

[5] M. Bowick, A. Cacciuto, D. R. Nelson, and A. Travesset. Crystalline order on a sphere and the
generalized thomson problem. Physical Review Letters, 89(18):185502, 2002.

[6] S. P. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.

[7] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of
paired comparisons. Biometrika, 39(3/4):324–345, 1952.

[8] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement
learning from human preferences. Advances in neural information processing systems, 30, 2017.

[9] K. Ethayarajh, H. Zhang, Y. Wang, and D. Jurafsky. Stanford human preferences dataset, 2023.

[10] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer,
K. Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors,
and lessons learned. arXiv preprint arXiv:2209.07858, 2022.

[11] D. P. Hardin, E. B. Saff, et al. Discretizing manifolds via minimum energy points. Notices of
the AMS, 51(10):1186–1194, 2004.

[12] P. He, J. Gao, and W. Chen. Debertav3: Improving deberta using electra-style pre-training
with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543, 2021.

[13] S. L. ([Link] lee). Expected ab-

solute difference between two iid variables. Mathematics Stack Exchange.
URL:[Link] (version: 2017-11-29).

12
[14] B. Ibarz, J. Leike, T. Pohlen, G. Irving, S. Legg, and D. Amodei. Reward learning from human
preferences and demonstrations in atari. Advances in neural information processing systems, 31,
2018.

[15] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. H. Dodds,

N. DasSarma, E. Tran-Johnson, et al. Language models (mostly) know what they know. arXiv
preprint arXiv:2207.05221, 2022.

[16] A. Köksal, T. Schick, A. Korhonen, and H. Schütze. Longform: Optimizing instruction tuning
for long text generation with corpus extraction, 2023.

[17] N. S. Landkof and N. Landkof. Foundations of modern potential theory, volume 180. Springer,
1972.

[18] S. Lin, J. Hilton, and O. Evans. Teaching models to express their uncertainty in words.
Transactions on Machine Learning Research, 2022.

[19] H. Liu, C. Sferrazza, and P. Abbeel. Chain of hindsight aligns language models with feedback.
arXiv preprint arXiv:2302.02676, 2023.

[20] R. D. Luce. Individual choice behavior: A theoretical analysis. Courier Corporation, 2012.

[21] A. Martinez-Finkelshtein, V. Maymeskul, E. Rakhmanov, and E. Saff. Asymptotics for minimal

discrete riesz energy on curves in Rd . Canadian Journal of Mathematics, 56(3):529–552, 2004.

[22] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju,
W. Saunders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight, B. Chess,
and J. Schulman. Webgpt: Browser-assisted question-answering with human feedback. In arXiv,
2021.

[23] R. Noothigattu, D. Peters, and A. D. Procaccia. Axioms for learning from pairwise comparisons.
Advances in Neural Information Processing Systems, 33:17745–17754, 2020.

[24] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[25] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal,

K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback.
Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

[26] V. Papyan, X. Han, and D. L. Donoho. Prevalence of neural collapse during the terminal phase
of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663,
2020.

[27] Z. Sun, Y. Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox, Y. Yang, and C. Gan. Principle-driven
self-alignment of language models from scratch with minimal human supervision. arXiv preprint
arXiv:2305.03047, 2023.

[28] B. Zhu, J. Jiao, and M. I. Jordan. Principled reinforcement learning with human feedback from
pairwise or k-wise comparisons. arXiv preprint arXiv:2301.11270, 2023.

13
[29] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irv-
ing. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593,
2019.

14
A Training Details
We use the following extension of the utility functions during our training.
(
log(x + ϵ) for x > 0
• log x: U (x) = , where ϵ is set to 0.1.
x + log(ϵ) for x ≤ 0
(
−1/(x + ϵ) for x > 0
• −1/x: U (x) = , where ϵ is also set to 0.1.
x − 1/ϵ for x ≤ 0

• log sigmoid(x): U (x) = log sigmoid(4x). Here, the scaling factor of 4 ensures the output of
log sigmoid spans a sufficient range.
To train the reward model, we adopted the approach used in the OpenAssistant project, which
utilizes the DeBERTaV3 Base model [12]. To constrain the reward output between 0 and 1, a
sigmoidfunction was appended before the final output. The reward model was trained with a batch
size of 224 (comprising eight questions per batch, each with 28 pairs) for a total of 1000 steps,
approximately equivalent to 1 epoch. The maximum learning rate was configured to 1e-5, utilizing
the Adam optimizer and a linear learning rate schedule, inclusive of 10% warmup steps. The reward
model was trained on a single A6000 GPU, with the entire training process concluding in roughly 1
hour.

B Missing Proofs in Section 4.1

B.1 Proof of Lemma 4.1
We break the proof of Lemma 4.1 into two different lemma.
Lemma B.1. If the utility function U (x) is strictly increasing, let r̂ be the solution of optimization
problem 2: X
max U (ri − rj )
0≤r1 ,...,rn ≤1
1≤i<j≤n

Then r̂ satisfies: r̂1 ≥ · · · ≥ r̂n .

Proof. Let S(r) = 1≤i<j≤n U (ri − rj ). Suppose the conclusion is not true, then there exists a
P
k ≥ 0, such that r̂1 ≥ · · · ≥ r̂k and r̂k < r̂k+1 . Let us define

if i ̸= k, k + 1;

r̂i

r̃i = r̂k+1 if i = k;
if i = k + 1.

r̂k


Then
X X
U (r̂i − r̂j ) − U (r̃i − r̃j ) = U (r̂k − r̂k+1 ) − U (r̂k+1 − r̂k ) < 0
1≤i<j≤n 1≤i<j≤n

because U is strictly increasing and r̂k − r̂k+1 < 0. This contracts with the fact that r̂ is the solution
of the optimization problem, and thus the conclusion holds.

15
LemmaPB.2. If the utility function U (x) is strictly increasing and strictly concave, then the function
S(r) = 1≤i<j≤n U (ri − rj ) is concave. Moreover, the solution of optimization problem (2) is unique
and satisfies: 1 − r̂i = r̂n−i+1 for i = 1, 2, · · · , n.
Proof. The concavity of S follows directly from definition:
X
S(r) + S(r′ ) = U (ri − rj ) + U (ri′ − rj′ )
1≤i<j≤n
X ri + ri′ − rj − rj′ r + r′
≤ 2U ( ) = 2S( ).
2 2
1≤i<j≤n

The above inequality is an equality if and only if ri − rj = ri′ − rj′ for all 1 ≤ i < j ≤ n when U (x) is
strictly concave. When U is increasing, the solution r̂ of the optimization problem satisfies r̂1 = 1.
Thus the solution of the optimization problem max1≤r1 ,··· ,rn ≤1 S(r) is unique, otherwise the vector
r1 +r2
2 makes S larger where r1 and r2 are two different solutions.
Finally, let r̂ be the unique solution of the optimization problem. Let us define r̃i = 1 − r̂n−i+1 for
all i = 1, 2, · · · , n. It follows that r̃i − r̃j = r̂n−j+1 − r̂n−i+1 , and we have S(r̂) = S(r̃). Consequently,
the uniqueness of the solution implies r̂ = r̃. This means that r̂i = 1 − r̂n−i+1 for i = 1, · · · , n.

B.2 Proof of Lemma 4.2

Proof of Lemma 4.2. The definition of S(r) is
X
S(r) = U (ri − rj ).
1≤i<j≤n

The value of S does not change if we increase all ri by the same constant. Thus the value of S(r)
only depends on ProjVn (r) where Vn ⊂ Rn denotes the subspace orthogonal to (1, 1, · · · , 1). We can
define a new function on Vn by letting
F (ProjVn (r)) = S(r).
The domain of F is A = {v ∈ Vn |∃r ∈ Rn such that 0 ≤ ri ≤ 1 and v = ProjVn (r)}. First, we can
show that F is nµ-strongly concave.
Because U is µ-strongly concave, U (x) + µ2 x2 is concave. It follows that
µ X
S(r) + (ri − rj )2
2
1≤i<j≤n

is also concave. We can write − rj )2 as

P
1≤i<j≤n (ri

X n
X n
X
(ri − rj )2 = n ri2 − ( ri )2
1≤i<j≤n i=1 i=1

by Lagrange identity. Then note that Vn is the subspace orthogonal to (1, 1, · · · , 1). The projection
onto Vn is given by
n n
1X 1X
ProjVn (r) = (r1 − ri , · · · , rn − ri ).
n n
i=1 i=1

16
As a result,
 2
n n n n
ri − 1 1 X 2 1
X X X X
∥ ProjVn (r)∥2 = rj  = ri2 − ( ri ) = (ri − rj )2 .
n n n
i=1 j=1 i=1 i=1 1≤i<j≤n

µP
From this equation and the concavity of S(r) + 2 1≤i<j≤n (ri − rj )2 , we know that
nµ
S(r) + ∥ ProjVn (r)∥2
2
nµ
is also concave. Consequently, F (ProjVn (r)) + 2 ∥ ProjVn (r)∥
2 is concave, which lead to the strong
concavity of F because
nµ
F (v) + ∥v∥2
2
is concave. Let v̂ be the optimal vector that maximizes F (v), strong concavity implies (See e.g.
Section 9.1.2 in [6])
nµ
F (v) − F (v̂) ≤ − ∥v − v̂∥2 .
2
Therefore, by the definition of F (ProjVn (r)) = S(r), we have
nµ
S(u) − S(r̂) ≤ − ∥ ProjVn (u − r̂)∥2 .
2

B.3 Proof of Lemma 4.3

Proof of Lemma 4.3. Because µj , j = 1, 2 are symmetric, we have

(j) 1 i−1 1 i−1

rn,n−i+1 = inf{t : µj ([0, t]) ≥ } + sup{t : µj ([0, t)) < }})
2 n−1 2 n−1
1 i−1 1 i−1
= (1 − sup{t : µj ([t, 1]) ≥ }) + (1 − inf{t : µj ((t, 1]) < })
2 n−1 2 n−1
1 n−i 1 n−i
= (1 − sup{t : µj ([0, t)) < }) + (1 − inf{t : µj ([0, t]) ≥ }
2 n−1 2 n−1
(j)
= 1 − rn,i .
(1) (2))
So we have ni=1 (rn,i − rn,i ) = 0. Note that Vn ⊂ Rn is the subspace which is orthogonal to
P
(1, 1, · · · , 1), the projection of x = (x1 , · · · , xn ) onto Vn is given by
n n
1X 1X
ProjVn (x) = (x1 − xi , · · · , xn − xi ).
n n
i=1 i=1

Consequently,
n
(1) (2)
X
∥ ProjVn (r(1) (2) 2
n − rn )∥2 = (rn,i − rn,i )2 .
i=1

17
If µ1 and µ2 are two different symmetric probability measure on [0, 1], we can assume that there
exists q1 < q2 ∈ [0, 1] and δ ≥ 0, such that µ1 ([0, q2 ]) < µ2 ([0, q1 ]) − δ. So when n−1
i−1
∈
(1)
(µ1 ([0, q2 ]), µ2 ([0, q1 ]) − δ), we have rn,n−i+1 ≥ q2 because µ1 ([0, q2 ]) < n−1 .
i−1
We also have
(2) (1) (2)
rn,n−i+1 ≤ q1 because µ2 ([0, q1 ]) > As a result,
i−1
n−1 . − ≥ q2 − q1 whenever
rn,n−i+1 rn,n−i+1
(i − 1)/(n − 1) ∈ (µ1 ([0, q2 ]), µ2 ([0, q1 ]) − δ). Because the length of the interval is positive, the number
of such i is larger than c1 n where c1 is a constant independent of n. Then we conclude that
n
(1) (2)
X
∥ ProjVn (r(1) (2) 2
n − rn )∥2 = (rn,i − rn,i )2
i=1
≥ c1 n(q1 − q2 )2 .

Choosing c0 = c1 (q1 − q2 )2 gives the inequality

∥ ProjVn (r(1) (2) 2

n − rn )∥2 ≥ c0 n.

B.4 Proof of Lemma 4.4

Proof of Lemma 4.4. Suppose there exist two different symmetric probability measure µ1 and µ2 ,
they both maximize E ′ iid
U (|X − X ′ |). Let M = E ′ iid
U (|X − X ′ |), j = 1, 2. Now let
X,X ∼ µ X,X ∼ µj
(j)
rn,i = 1
2
i−1 1 i−1
inf{t : µj ([0, t]) ≥ n−1 } + 2 sup{t : µj ([0, t)) < n−1 }}), i = 1, 2, · · · , n; j = 1, 2 as defined in
(j)
Lemma 4.3. Accordingly, let Pn = n1 ni=1 δr(j) . Then we have
P
n,i

d
P(j)
n → µj , j = 1, 2.

This can be proved easily by considering the definition of convergence in distribution. Since G is
bounded, this lead to E ′ iid (j)
U (|X − X ′ |) → M, j = 1, 2 as n → ∞.
X,X ∼ Pn
The expectation E iid (j) U (|X − X ′ |) can be written more precisely as
X,X ′ ∼ Pn

1 X (j) (j)
E iid (j) U (|X − X ′ |) = U (|rn,i − rn,i′ |).
X,X ′ ∼ Pn n2
1≤i,i′ ≤n

By Lemma 4.2, we can bound the difference

1 X (j) (j) 1 X
U (|rn,i − rn,i′ |) − U (|r̂n,i − r̂n,i′ |)
n2 n2
1≤i,i′ ≤n 1≤i≤i′ ≤n
2 X (j) (j) 2 X
= n
U (rn,i − rn,i′ ) − n
U (r̂n,i − r̂n,i′ )
2 1≤i<i′ ≤n 2 1≤i<i′ ≤n
2µ
≤ − ∥ ProjVn (r(j) 2
n − r̂n )∥2 .
n−1
Then apply Lemma 4.3, there exist c0 ≥ 0 such that

2∥ ProjVn (r(1) 2 (2) 2 (1) (2) 2

n − r̂n )∥ + 2∥ ProjVn (rn − r̂n )∥ ≥ ∥ ProjVn (rn − rn )∥ .

18
Here, we uses 2∥x∥22 + 2∥y∥22 ≥ ∥x − y∥22 . So
 
1  X (j) (j)
min U (|rn,i − rn,i′ |) − U (|r̂n,i − r̂n,i′ |)
j=1,2 n2
′ 1≤i,i ≤n
2µ
= − max ∥ ProjVn (r(j) 2
n − r̂n )∥2
n − 1 j=1,2
(1) (2)
2µ ∥ ProjVn (rn − rn )∥2
≤ −
n−1 4
µc0 n c0 µ
≤ − ≤− .
2n − 2 2

Since M = max E U (|X − X ′ |), we know n12 1≤i,i′ ≤n U (|r̂n,i − r̂n,i′ |) ≤ M . As a result,
P
′ iid
X,X ∼ µ

1 X
min E iid (j) U (|X − X ′ |) ≤ U (|r̂n,i − r̂n,i′ |) − µc0 /2 ≤ M − µc0 /2.
j=1,2 X,X ′ ∼ Pn n2
1≤i≤i′ ≤n

This contradicts the assumption that E iid (j) U (|X − X ′ |) → M, j = 1, 2, n → ∞.

X,X ′ ∼ Pn

C Proof of Theorem 1
Given Theorem 4, we only need to find a symmetric probability measure on [0, 1], which maximizes

E iid U (|X − X ′ |).

X,X ′ ∼ µ

The following proof in this section is adapted from [13]. Let M (B([0, 1])) denote the sets of all finite
signed measure on the Borel sigma algebra B([0, 1]). Apparently, P (B([0, 1])) ⊂ M (B([0, 1])). Then
we define the following “inner product” in M (B([0, 1])):
Z
′
⟨µ, ν⟩ = EX∼µ,X ′ ∼ν,independent U (|X − X |) = U (|x − y|)µ(dx)ν(dy).
[0,1]2

We also define I(µ) as I(µ) := ⟨µ, µ⟩. With these notations, the problem becomes

max I(µ).
µ∈P (B([0,1]))

Lemma C.1. For U (x) = xγ with γ ∈ (0, 1). If µ is a signed measure satisfying µ([0, 1]) = 0, then
we have I(µ) ≤ 0. Moreover, I(µ) = 0 if and only if µ(E) = 0 for all E ⊂ [0, 1].
1−cos(xt)
Proof. f (t) = t1+γ
is integrable on (0, ∞). As a result, using change of variables, we have
∞
1 − cos(xt)
Z
γ
|x| = C dt
0 t1+γ

19
for come constant C > 0. Then by Fubini’s theorem, we have
Z
⟨µ, µ⟩ = |x − y|γ µ(dx)µ(dy)
[0,1]2
∞
1 − cos((x − y)t)
Z Z
=C dtµ(dx)µ(dy)
[0,1]2 0 t1+γ
!
∞
1 − cos((x − y)t)
Z Z
=C µ(dx)µ(dy) dt.
0 [0,1]2 t1+γ

Note that cos((x − y)t) = ℜ(eixt−iyt ), we have

1 − cos((x − y)t)
Z
µ(dx)µ(dy)
[0,1]2 t1+γ
!
eixt e−iyt
Z
= −ℜ µ(dx)µ(dy)
[0,1]2 t1+γ
= − ℜ |µ̂(t)|2 ,

where µ̂(t) = [0,1] eitx µ(dx) is the Fourier transform of µ. Then

∞
|µ̂(t)|2
Z
I(µ) = −C dt ≤ 0.
0 t1+γ

Moreover, I(µ) = 0 if and only if µ̂(t) = 0 for all t ∈ [0, ∞) if and only if µ(E) = 0 for all
E ∈ B([0, 1]).

C.1 Proof of Lemma 4.5

We first restate the lemma.

Lemma C.2. Let U (x) = xγ for some γ ∈ (0, 1). If a probability measure µ on [0, 1] maximize
E ′ iid
U (|X − X ′ |) if it satisfies that EX∼µ U (|X − c|) does not depend on c ∈ [0, 1].
X,X ∼ µ

Proof of Lemma 4.5. For two probability measure µ and ν on [0, 1], (µ − ν)([0, 1]) = 0. Suppose µ
satisfies EX∼µ U (|X − c|) = K does not depend on c ∈ [0, 1]. Note that
Z Z ! Z
⟨ν − µ, µ⟩ = |x − y|γ µ(dx) (ν − µ)(dy) = K(ν − µ)(dy) = 0.
[0,1] [0,1] [0,1]

And by lemma C.1, ⟨ν − µ, ν − µ⟩ ≤ 0. Therefore,

⟨ν, ν⟩ = ⟨µ, µ⟩ + 2 ⟨ν − µ, µ⟩ + ⟨ν − µ, ν − µ⟩ ≤ ⟨µ, µ⟩ .

This means that µ maximize E iid U (|X − X ′ |).

X,X ′ ∼ µ

20
C.2 Proof of Theorem 1
Proof of Theorem 1. Let µ be the probability measure induced by Beta( 1−γ 1−γ
2 , 2 ). It has probability
density function
1 − 1+γ − 1+γ
fγ (x) = 2 (1 − x)
1−γ 1−γ x
2 .
B( 2 , 2 )
For any c ∈ [0, 1], EX∼µ U (|X − c|) can be expressed as
Z 1
1 1+γ 1+γ
EX∼µ U (|X − c|) = 1−γ 1−γ |x − c|γ x− 2 (1 − x)− 2 dx
B( 2 , 2 ) 0
Z π
1 2
= 1−γ 1−γ | sin2 θ − c|γ (sin θ)−1−γ (cos θ)−1−γ d sin2 θ
B( 2 , 2 ) 0
π γ
sin2 θ − c
Z
2 2
= 1−γ 1−γ dθ
B( 2 , 2 ) 0 sin θ cos θ
γ
!
∞ Z π/2
sin2 θ − c
Z
2
= 1−γ 1−γ 1 ≥ t dθ dt.
B( 2 , 2 ) 0 0 sin θ cos θ

Because
γ
cos θ + 2c − 1 γ
Z π/2
sin2 θ − c 1 π
Z
1 ≥ t dθ = 1 ≥ t dθ
0 sin θ cos θ 2 0 sin θ
π 1 π n
Z o
= − 1 − cos θ − t1/γ sin θ ≤ 2c − 1 ≤ − cos θ + t1/γ sin θ dθ
2 2 0
( )
π 1 π 2c − 1
Z
= − 1 − cos(θ − ϕ) ≤ p ≤ − cos(θ + ϕ) dθ
2 2 0 1 + t2/γ
π
= − ϕ,
2

where tan ϕ = t1/γ and ϕ ∈ [0, π/2], and the last equation use the fact that c ∈ [0, 1]. As a result,
EX∼µ U (|X − c|) does not depend on c.
Note that Beta distribution
is also
symmetric. It follows from Theorem 4 that the reward
1−γ 1−γ
distribution converges to Beta 2 , 2

D Proof in Section 5
D.1 Proof of Theorem 5
Proof of Theorem 5. First, we prove that r̂ keeps the order of {θi }1≤i≤n . If there exist i and j such
that θi < θj and r̂i > r̂j , we define

if k ̸= i, j;

r̂k

r̃k = r̂j if k = i;
if k = j.

r̂i


21
Then
X X
S(r̂) − S(r̃) = U (r̂i − r̂k )sigmoid(θi − θk ) + U (r̂k − r̂i )sigmoid(θk − θi )
k̸∈{i,j} k̸∈{i,j}
X X
+ U (r̂j − r̂k )sigmoid(θj − θk ) + U (r̂k − r̂j )sigmoid(θk − θj )
k̸∈{i,j} k̸∈{i,j}
X X
− U (r̂j − r̂k )sigmoid(θi − θk ) − U (r̂k − r̂j )sigmoid(θk − θi )
k̸∈{i,j} k̸∈{i,j}
X X
− U (r̂i − r̂k )sigmoid(θj − θk ) − U (r̂k − r̂i )sigmoid(θk − θj )
k̸∈{i,j} k̸∈{i,j}

+ U (r̂i − r̂j )sigmoid(θi − θj ) + U (r̂j − r̂i )sigmoid(θj − θi )

− U (r̂j − r̂i )sigmoid(θi − θj ) + U (r̂i − r̂j )sigmoid(θj − θi ).

Note that for a < b and c < d, we have inequality ad + bc − ac − bd = (a − b)(d − c) < 0. It follows
that from the monotonicity of U and sigmoid function, and our assumption θi < θj and r̂i > r̂j , we
have

U (r̂i − r̂k )sigmoid(θi − θk ) + U (r̂j − r̂k )sigmoid(θj − θk )

− U (r̂j − r̂k )sigmoid(θi − θk ) − U (r̂i − r̂k )sigmoid(θj − θk ) < 0,
U (r̂k − r̂i )sigmoid(θk − θi ) + U (r̂k − r̂j )sigmoid(θk − θj )
− U (r̂k − r̂j )sigmoid(θk − θi ) − U (r̂k − r̂i )sigmoid(θk − θj ) < 0,
U (r̂i − r̂j )sigmoid(θi − θj ) + U (r̂j − r̂i )sigmoid(θj − θi )
− U (r̂j − r̂i )sigmoid(θi − θj ) − U (r̂i − r̂j )sigmoid(θj − θi ) < 0.

As a result, S(r̂) < S(r̃), which contradicts the optimality of r̂. This gives that r̂ keep the order of
{θi }1≤i≤n .
To prove the inequality in Theorem 5, let us consider the case where i = 1 and j = 2 without
loss of generality. We also assume θ1 ≥ θ2 . Then it follows from previous argument that r̂1 ≥ r̂2 .
From the optimality of r̂, we have
r̂1 + r̂2 r̂1 + r̂2
S(r̂) ≥ S( , , r̂3 , · · · , r̂n ).
2 2

22
The difference can be written as
r̂1 + r̂2 r̂1 + r̂2
S( , , r̂3 , · · · , r̂n ) − S(r̂)
2 2
X r̂1 + r̂2
= U( − r̂i )(sigmoid(θ1 − θi ) + sigmoid(θ2 − θi ))
2
i>2
X X
− U (r̂1 − r̂i )sigmoid(θ1 − θi ) − U (r̂2 − r̂i )sigmoid(θ2 − θi )
i>2 i>2
X r̂1 + r̂2
+ U (r̂i − )(sigmoid(θi − θ1 ) + sigmoid(θi − θ2 ))
2
i>2
X X
− U (r̂i − r̂1 )sigmoid(θi − θ1 ) − U (r̂i − r̂2 )sigmoid(θi − θ2 )
i>2 i>2
+ U (0)sigmoid(θ1 − θ2 ) + U (0)sigmoid(θ2 − θ1 )
− U (r̂1 − r̂2 )sigmoid(θ1 − θ2 ) − U (r̂2 − r̂1 )sigmoid(θ2 − θ1 ).

From strong concavity of U , we have

U (x) + U (y) x+y µ(x − y)2

≤ U( )− .
2 2 8
As a result,
r̂1 + r̂2
U( − r̂i )(sigmoid(θ1 − θi ) + sigmoid(θ2 − θi ))
2
r̂1 + r̂2
≥ 2U ( − r̂i )sigmoid(θ2 − θi )
2
µ(r̂1 − r̂2 )2
≥ sigmoid(θ2 − θi )(U (r̂1 − r̂i ) + U (r̂2 − r̂i ) + ).
4
Similarly,
r̂1 + r̂2
U (r̂i − )(sigmoid(θi − θ1 ) + sigmoid(θi − θ2 ))
2
µ(r̂1 − r̂2 )2
≥ sigmoid(θi − θ1 )(U (r̂i − r̂1 ) + U (r̂i − r̂2 ) + ),
4
U (0)sigmoid(θ1 − θ2 ) + U (0)sigmoid(θ2 − θ1 )
≥ sigmoid(θ2 − θ1 )(U (r̂1 − r̂2 ) + U (r̂2 − r̂1 ) + µ(r̂1 − r̂2 )2 ).

As a consequence, letting m = min|x|<θmax sigmoid(x) = 1

1+eθmax
, M = maxx∈[−1,1] U (x) = U (1),

23
and L be the Lipschitz constant of sigmoid(·), which is bounded by 1, we have

r̂1 + r̂2 r̂1 + r̂2

S( , , r̂3 , · · · , r̂n ) − S(r̂)
X 2 2
≥ (sigmoid(θ2 − θi ) − sigmoid(θ1 − θi ))U (r̂1 − r̂i )
i>2
X
+ (sigmoid(θi − θ1 ) − sigmoid(θi − θ2 ))U (r̂i − r̂2 )
i>2
m(2(n − 2) + 4)µ(r̂1 − r̂2 )2
+ (sigmoid(θ2 − θ1 ) − sigmoid(θ1 − θ2 ))U (r̂1 − r̂2 ) +
4
mnµ(r̂1 − r̂2 )2
≥ − (2n − 2)LM |θ1 − θ2 | + .
2
So the optimality of r̂ gives
mnµ(r̂1 − r̂2 )2
≤ 2nLM |θ1 − θ2 |.
2
This yields the inequality
p q
|r̂1 − r̂2 | ≤ 2 LM |θ1 − θ2 |/(mµ) ≤ 2 U (1)(1 + eθmax )|θ1 − θ2 |/µ.

D.2 Choice of θ in Figure 5

The choice of {θi }1≤i≤n in the left is
(
i/20 if i ≤ 15,
θi =
(i + 10)/6 if i > 15.

The choice of {θi }1≤i≤n in the left is

(
i/10 if i ≤ 5,
θi =
(i + 10)/6 if i > 5.

Generalist Reward Models in LLMs
No ratings yet
Generalist Reward Models in LLMs
23 pages
Uncertainty-Aware RLHF for Alignment
No ratings yet
Uncertainty-Aware RLHF for Alignment
25 pages
Causal Reward Modeling for LLM Alignment
No ratings yet
Causal Reward Modeling for LLM Alignment
19 pages
Efficient Reasoning with Conciseness Reward
No ratings yet
Efficient Reasoning with Conciseness Reward
15 pages
Generalizable Reward Models for LLMs
No ratings yet
Generalizable Reward Models for LLMs
25 pages
Evaluating Reward Models with RM-BENCH
No ratings yet
Evaluating Reward Models with RM-BENCH
29 pages
LLM Alignment Techniques Survey
No ratings yet
LLM Alignment Techniques Survey
37 pages
Mitigating Length Bias in Reward Models
No ratings yet
Mitigating Length Bias in Reward Models
17 pages
Implicit Preference Optimization in LLMs
No ratings yet
Implicit Preference Optimization in LLMs
16 pages
Language Models for Reward Design in RL
No ratings yet
Language Models for Reward Design in RL
18 pages
Preference Optimization in LLMs
No ratings yet
Preference Optimization in LLMs
89 pages
Co-Optimizing RL for Language Models
No ratings yet
Co-Optimizing RL for Language Models
17 pages
Reinforcement Learning in Language Models
No ratings yet
Reinforcement Learning in Language Models
4 pages
Challenges in RLHF Feedback Quality
No ratings yet
Challenges in RLHF Feedback Quality
14 pages
Meta-Rewarding for Self-Improving LLMs
No ratings yet
Meta-Rewarding for Self-Improving LLMs
18 pages
Evaluating LLMs for Text Summarization
No ratings yet
Evaluating LLMs for Text Summarization
5 pages
Scaling Laws in Reward Model Overoptimization
No ratings yet
Scaling Laws in Reward Model Overoptimization
28 pages
RLAIF: AI Feedback in RL Scaling
No ratings yet
RLAIF: AI Feedback in RL Scaling
18 pages
RewardBench: Evaluating Reward Models
No ratings yet
RewardBench: Evaluating Reward Models
44 pages
Efficient RFT with Plug-and-Play LLM
No ratings yet
Efficient RFT with Plug-and-Play LLM
25 pages
Chapter 3
No ratings yet
Chapter 3
78 pages
RLAIF: AI Feedback in Reinforcement Learning
No ratings yet
RLAIF: AI Feedback in Reinforcement Learning
31 pages
Meta-Rewarding for LLM Self-Improvement
No ratings yet
Meta-Rewarding for LLM Self-Improvement
18 pages
RLHF with Human Feedback Analysis
No ratings yet
RLHF with Human Feedback Analysis
31 pages
Reward Redistribution in Reinforcement Learning
No ratings yet
Reward Redistribution in Reinforcement Learning
21 pages
Active Reward Modeling for LLM Alignment
No ratings yet
Active Reward Modeling for LLM Alignment
32 pages
Self-Rewarding Language Models Explained
No ratings yet
Self-Rewarding Language Models Explained
23 pages
Reward Design for Faster Learning
No ratings yet
Reward Design for Faster Learning
5 pages
DLMA: Self-Rewarding LLM Alignment
No ratings yet
DLMA: Self-Rewarding LLM Alignment
25 pages
Reinforcement Learning for LLM Explanations
No ratings yet
Reinforcement Learning for LLM Explanations
19 pages
Evaluating Reward Models with REWARDBENCH
No ratings yet
Evaluating Reward Models with REWARDBENCH
38 pages
Reinforcement Learning for LLM Alignment
No ratings yet
Reinforcement Learning for LLM Alignment
41 pages
Chapter 2 Evaluation of LLMs Using RL
No ratings yet
Chapter 2 Evaluation of LLMs Using RL
4 pages
Fine-Tuning Language Models via RL
No ratings yet
Fine-Tuning Language Models via RL
26 pages
Min-Form Credit Assignment in PRMs
No ratings yet
Min-Form Credit Assignment in PRMs
18 pages
Vulnerabilities in LLM-as-Judge Systems
No ratings yet
Vulnerabilities in LLM-as-Judge Systems
26 pages
RLHF and PPO in Language Models
No ratings yet
RLHF and PPO in Language Models
37 pages
L5 Alignment
No ratings yet
L5 Alignment
52 pages
Understanding LLMs and RLHF Techniques
No ratings yet
Understanding LLMs and RLHF Techniques
7 pages
Personalized Rewardbench: Evaluating Reward Models With Human Aligned Personalization
No ratings yet
Personalized Rewardbench: Evaluating Reward Models With Human Aligned Personalization
19 pages
Negative Reinforcement in LLMs
No ratings yet
Negative Reinforcement in LLMs
18 pages
Self-Rewarding Language Models Explained
No ratings yet
Self-Rewarding Language Models Explained
15 pages
Inference-Time Scaling for Reward Models
No ratings yet
Inference-Time Scaling for Reward Models
42 pages
Self-Taught Evaluators for LLMs
No ratings yet
Self-Taught Evaluators for LLMs
16 pages
Self-Taught Evaluators for LLMs
No ratings yet
Self-Taught Evaluators for LLMs
15 pages
Entropy-Regularized Process Reward Model
No ratings yet
Entropy-Regularized Process Reward Model
18 pages
Rethinking Bradley-Terry in Reward Modeling
No ratings yet
Rethinking Bradley-Terry in Reward Modeling
52 pages
WARM: Enhancing Reward Models in RLHF
No ratings yet
WARM: Enhancing Reward Models in RLHF
33 pages
Aligning LLMs with Strategic Feedback
No ratings yet
Aligning LLMs with Strategic Feedback
10 pages
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
No ratings yet
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
18 pages
Mitigating Reward Ambiguity in RL
No ratings yet
Mitigating Reward Ambiguity in RL
2 pages
1456 Sample Efficient Preferen
No ratings yet
1456 Sample Efficient Preferen
27 pages
Deeplearning - Ai Deeplearning - Ai
No ratings yet
Deeplearning - Ai Deeplearning - Ai
154 pages
Reinforcement Learning From Human Feedback: Alignment Stability
No ratings yet
Reinforcement Learning From Human Feedback: Alignment Stability
2 pages
Paper4 RLHF Alignment
No ratings yet
Paper4 RLHF Alignment
3 pages
Direct Reasoning Optimization in LLMs
No ratings yet
Direct Reasoning Optimization in LLMs
17 pages
Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL
No ratings yet
Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL
18 pages
IELTS Advanced Listening Practice Guide
No ratings yet
IELTS Advanced Listening Practice Guide
8 pages
Big Bang: The Start of Everything
No ratings yet
Big Bang: The Start of Everything
3 pages
Mechanisms of Social Cognition
100% (1)
Mechanisms of Social Cognition
30 pages
Ayurveda's Approach to Mental Health Evaluation
No ratings yet
Ayurveda's Approach to Mental Health Evaluation
9 pages
Understanding Past Continuous Tense
No ratings yet
Understanding Past Continuous Tense
4 pages
Statement of Teaching Interests and Teaching Philosophy
No ratings yet
Statement of Teaching Interests and Teaching Philosophy
5 pages
Service Quality Model and Research Insights
No ratings yet
Service Quality Model and Research Insights
11 pages
Seven Key Drivers of Effective Learning
No ratings yet
Seven Key Drivers of Effective Learning
6 pages
Understanding Indian Psychology
No ratings yet
Understanding Indian Psychology
6 pages
Thinking Maps Part 1
No ratings yet
Thinking Maps Part 1
74 pages
Child Growth and Development Overview
No ratings yet
Child Growth and Development Overview
55 pages
Intercultural Communication Essentials
No ratings yet
Intercultural Communication Essentials
54 pages
E-Commerce Analytics Dashboard Guide
No ratings yet
E-Commerce Analytics Dashboard Guide
4 pages
Undertaking Min Eligibility
No ratings yet
Undertaking Min Eligibility
1 page
IELTS Speaking Pronunciation Guide FINAL
No ratings yet
IELTS Speaking Pronunciation Guide FINAL
29 pages
Effective Vocabulary Teaching Methods
No ratings yet
Effective Vocabulary Teaching Methods
4 pages
Smart Education Innovations in Flanders
No ratings yet
Smart Education Innovations in Flanders
8 pages
Salary Structure and CTC Formats
No ratings yet
Salary Structure and CTC Formats
3 pages
Diabetes Symptoms and Management Insights
No ratings yet
Diabetes Symptoms and Management Insights
6 pages
Intention, Sincerity and Truthfulness: Kitāb Al-Niyya Wa'l-Ikhlā Wa'l - Idq
No ratings yet
Intention, Sincerity and Truthfulness: Kitāb Al-Niyya Wa'l-Ikhlā Wa'l - Idq
28 pages
Zeenia - Udvada PHD Thesis - Detailed
No ratings yet
Zeenia - Udvada PHD Thesis - Detailed
195 pages
The Work of Byron Katie
No ratings yet
The Work of Byron Katie
17 pages
Understanding School Curricula Types
No ratings yet
Understanding School Curricula Types
73 pages
Basic Two First Term Scheme 2025-26
No ratings yet
Basic Two First Term Scheme 2025-26
13 pages
Antonyms for Inflammatory Vocabulary
No ratings yet
Antonyms for Inflammatory Vocabulary
6 pages
Indian Adaptation of Vineland Scale
59% (17)
Indian Adaptation of Vineland Scale
7 pages
Understanding Interdisciplinary Researchers
No ratings yet
Understanding Interdisciplinary Researchers
5 pages
Edsoc Annual 2025
No ratings yet
Edsoc Annual 2025
20 pages
Galgotias University Grade Card Summary
No ratings yet
Galgotias University Grade Card Summary
2 pages
5-History of Open Source Movement
No ratings yet
5-History of Open Source Movement
2 pages

Reward Collapse in Aligning Large Language Models

Uploaded by

Reward Collapse in Aligning Large Language Models

Uploaded by

Reward Collapse in Aligning Large Language Models

Ziang Song* Tianle Cai† Jason D. Lee† Weijie J. Su‡

May 25, 2023

2 What Is Reward Collapse and How to Mitigate It?

over 0 ≤ r1 , . . . , rn ≤ 1. However, the solution to this optimization program is independent of the

2.2 Prompt-Aware Optimization

0.8 0.8 0.8

0.6 0.6 0.6

0.48 0.45 0.44

0.2 0.2 0.445 0.2 0.435

(a) U (x) = x0.8 (b) U (x) = x0.2 (c) U (x) = log x

0.8 0.8 0.8

0.6 0.6 0.6

0.43 0.48 0.44

0.2 0.41 0.2 0.465 0.2 0.4

Figure 2: Reward distribution for different utility function.

The proof of Theorem 1 is defered to Section 4.

where κ = U ′ (0)/U ′ (1).

over all probability measures µ on [0, 1], and

3.1 Experimental Setup

3.2 Experimental Results

4.1 Proof of Theorem 4

∥ ProjVn (r(1) − r(2) )∥2 ≥ c0 n,

over all probability measures µ on [0, 1]. Then we have µ1 = µ2 .

≥ lim E iid U (|X − X ′ |) = E iid U (|X − X ′ |).

4.2 Proof of Theorem 1

Proof of Theorem 3. The derivative of − i<j U (ri − rj ) with respect to rk is given by

a result, r̂1 = · · · = r̂[n/(κ+1)] = 1. This gives P̂n ({1}) ≥ [ κ+1

5 Extension to Pairwise Comparisons

Figure 5: Reward distribution with different choice of {θ}1≤i≤n when n = 20.

[13] S. L. ([Link] lee). Expected ab-

[15] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. H. Dodds,

[21] A. Martinez-Finkelshtein, V. Maymeskul, E. Rakhmanov, and E. Saff. Asymptotics for minimal

[24] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[25] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal,

B Missing Proofs in Section 4.1

Then r̂ satisfies: r̂1 ≥ · · · ≥ r̂n .

B.2 Proof of Lemma 4.2

is also concave. We can write − rj )2 as

B.3 Proof of Lemma 4.3

(j) 1 i−1 1 i−1

Choosing c0 = c1 (q1 − q2 )2 gives the inequality

∥ ProjVn (r(1) (2) 2

B.4 Proof of Lemma 4.4

By Lemma 4.2, we can bound the difference

2∥ ProjVn (r(1) 2 (2) 2 (1) (2) 2

This contradicts the assumption that E iid (j) U (|X − X ′ |) → M, j = 1, 2, n → ∞.

E iid U (|X − X ′ |).

Note that cos((x − y)t) = ℜ(eixt−iyt ), we have

where µ̂(t) = [0,1] eitx µ(dx) is the Fourier transform of µ. Then

C.1 Proof of Lemma 4.5

And by lemma C.1, ⟨ν − µ, ν − µ⟩ ≤ 0. Therefore,

⟨ν, ν⟩ = ⟨µ, µ⟩ + 2 ⟨ν − µ, µ⟩ + ⟨ν − µ, ν − µ⟩ ≤ ⟨µ, µ⟩ .

This means that µ maximize E iid U (|X − X ′ |).

+ U (r̂i − r̂j )sigmoid(θi − θj ) + U (r̂j − r̂i )sigmoid(θj − θi )

U (r̂i − r̂k )sigmoid(θi − θk ) + U (r̂j − r̂k )sigmoid(θj − θk )

From strong concavity of U , we have

U (x) + U (y) x+y µ(x − y)2

As a consequence, letting m = min|x|<θmax sigmoid(x) = 1

r̂1 + r̂2 r̂1 + r̂2

D.2 Choice of θ in Figure 5

The choice of {θi }1≤i≤n in the left is

You might also like