Reward Collapse in Aligning Large Language Models
Reward Collapse in Aligning Large Language Models
Abstract
The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4
are in part unleashed by aligning them with reward models that are trained on human preferences,
which are often represented as rankings of responses to prompts. In this paper, we document the
phenomenon of reward collapse, an empirical observation where the prevailing ranking-based
approach results in an identical reward distribution regardless of the prompts during the terminal
phase of training. This outcome is undesirable as open-ended prompts like “write a short story
about your best friend” should yield a continuous range of rewards for their completions, while
specific prompts like “what is the capital of New Zealand” should generate either high or low
rewards. Our theoretical investigation reveals that reward collapse is primarily due to the
insufficiency of the ranking-based objective function to incorporate prompt-related information
during optimization. This insight allows us to derive closed-form expressions for the reward
distribution associated with a set of utility functions in an asymptotic regime. To overcome
reward collapse, we introduce a prompt-aware optimization scheme that provably admits a
prompt-dependent reward distribution within the interpolating regime. Our experimental results
suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse
during the training of reward models.
1 Introduction
A cornerstone of the recent remarkable advancements in the capabilities of large language models
(LLMs) like ChatGPT and GPT-4 is the integration of human feedback [25, 24]. The approach
to leveraging human feedback often begins with the training of a reward model that encapsulates
human preferences, values, and ethical considerations [8, 14, 2, 29, 10]. This is followed by the
fine-tuning of the LLMs using reinforcement learning, guided by the reward model. This process,
often referred to as reinforcement learning from human feedback (RLHF), has proven effective in
aligning LLMs with human intent, substantially enriching the quality of human interaction.
However, developing an effective reward model based on human preferences is challenging [4,
19, 27]. A notable difficulty arises when a human labeler struggles to give a quantitative score
to a response/completion for a specific prompt. Instead, it is much easier for humans to make
pairwise comparisons between completions in terms of their quality, which is indeed employed in the
development of InstructGPT [25]. Specifically, a human labeler is presented with several completions
*
Stanford University
†
Princeton University
‡
University of Pennsylvania. Email: suw@[Link].
1
generated by the LLMs for the same prompt and arranges the responses from the highest to lowest
perceived quality.1 A neural network is then trained to obtain a reward model that assigns rewards
to the responses in an attempt to align as closely as possible with human preferences in the form of
rankings.
Despite some benefits, such as eliminating calibration issues, rankings fall short in reflecting the
varied reward distributions of different prompts. This is because ranking one completion higher than
another does not indicate how much superior the former is compared to the latter. This concern is
especially pertinent in RLHF as some prompts are open-ended or, in other words, are dependent on
the users’ backgrounds, allowing the reward distribution to span a continuous range. Conversely,
some prompts are closed-ended, resulting in a response that should be either highly or lowly scored,
thus generating a roughly two-point mass distribution for the reward distribution. Instances of the
first type of prompts include write a short story about how AI will look like in 100 years and what is
the best cuisine in the world, while examples of the second type are prove the Pythagorean theorem
and is chicken a dinosaur. The reward model may struggle to aid LLMs in accurately calibrating
uncertainty without accounting for the nuances of different prompts.2
Figure 1: Illustration of reward collapse, with rewards assigned to eight responses, arranged from
least to most preferred. One type of prompt is open-ended, which should result in a roughly uniform
distribution of rewards, while the other is closed-ended, which should yield either high or low rewards
(polarized). However, as evidenced in the first three plots, when a common utility function is
employed (see Eq. 1 in Section 2), the two types of prompts result in a strikingly similar reward
distribution. Conversely, when a prompt-aware utility is applied, as seen in the fourth plot, the two
types of prompts exhibit distinct reward distributions. Further details are elaborated in Section 3.
All of our code is publicly available at [Link]
As our first main contribution, this paper documents a surprising phenomenon through a series
of experiments, demonstrating that training a reward model on preference rankings could result in
the same reward distribution regardless of the prompts. We call this phenomenon reward collapse,
1
Through private communication, [25] required human labelers to utilize a drag-and-drop interface to construct
consistent rankings from pairwise comparisons.
2
For instance, we suspect that this is partly accountable for the poor calibration of GPT-4 after RLHF (see page 12
of [24]), although we are unable to verify due to the black-box nature of GPT-4 as well as insufficient computational
resources.
2
which occurs during the terminal phase of training [26]. Intriguingly, our theoretical analysis first
predicted this phenomenon before it was confirmed experimentally. Indeed, we show that the collapse
reward distribution can be numerically deduced from a simple optimization program or, even simpler,
admits a closed-form expression. As demonstrated in Figure 1, our prediction of reward collapse is
in excellent agreement with the empirical results.
Reward collapse is clearly undesirable, as it overlooks the subtle differences among various
prompts, potentially leading to the miscalibration of human preference during the training of LLMs
via reinforcement learning with the reward model. A rudimentary strategy to bypass this issue is to
early stop the training of the reward model [25], which, however, is somewhat arbitrary and can
make it challenging to determine the stopping point.
As our second main contribution, we introduce a principled approach to alleviating reward
collapse, leveraging insights derived from the same optimization program that was instrumental in
predicting this phenomenon. In essence, we propose to use distinct utility functions depending on
prompts in training the reward model, such that the resulting reward distribution can be either
widely dispersed or tightly concentrated, contingent on whether the prompt is open-ended or closed-
ended. A notable advantage of this prompt-aware strategy is that our analysis is analytical, enabling
full control over the shape of the reward distribution as required. As depicted in the right-most
panel of Figure 1 and more results in Section 3, our experiments show that reward collapse can be
substantially mitigated using this prompt-aware methodology.
where U is an (increasing) utility function, θ is the weights of the reward neural network, and
Π is the ranking dataset and complw is a preferred completion than compll in the ranking πprom .
ex/σ
In InstructGPT [25], U is set to Uσ (x) = log sigmoid(x/σ) ≡ log ex/σ +1
, which is an increasing
concave function. While maximizing Eq. 1, the reward model learns to not only align with the
human-provided ranking but also distinguish the rewards as much as possible.
To gain insights into how the rewards depend on U , note that the above is equivalent to
X X
max U (Rθ (prom, complw ) − Rθ (prom, compll )) .
prom (complw ,compll )∈πprom
Next, assume that the neural network parameterized by θ is sufficiently overparameterized such that
X
U (Rθ (prom, complw ) − Rθ (prom, compll ))
(complw ,compll )∈πprom
3
is exactly maximized. This is precisely the same as maximizing 1≤i<j≤n U rπprom (i) − rπprom (j)
P
up to a permutation. That is, the empirical distribution of the rewards is independent of the prompt
itself in the interpolating regime, thereby leading to reward collapse.
In general, the choice of Uprom should reflect the open-endedness of the prompt prom. An
important feature is that if Uprom is concave, this problem becomes a convex optimization problem
(Lemma 4.1). Given the high flexibility in choosing Uprom , it is generally recommended to let the
practitioners choose these functions to meet their needs. Nonetheless, below we introduce a family
of such functions.
For a strictly increasing utility function U , it can be easily demonstrated that the maximum can
only be attained when r1 ≥ · · · ≥ rn (see Lemma B.1 in the Appendix). As a result, we only need to
consider the problem X
max U (ri − rj ) . (4)
0≤rn ≤...≤r1 ≤1
1≤i<j≤n
Class 1. Let Uγ (x) = xγ , x ∈ [0, 1], for some 0 < γ < 1. This utility function encourages the
reward distribution to take values either near 0 or 1 as γ tends to be large. Some plots showing the
empirical distribution of solutions to (2) is given in Figure 2(a) and (b).
Class 2. Let Uγ (x) = −x−γ , x ∈ (0, 1], for 0 < γ ≤ 1 and U0 (x) = log x for γ = 0. We also let
Uγ (0) = −∞ for 0 ≤ γ ≤ 1. In this case, the reward distribution of Eq. 2 becomes more even as γ
increases from 0 to 1. Some plots are shown in Figure 2(c) and (d).
Class 3. Let Uσ (x) = log sigmoid(x/σ) for σ > 0. The reward distribution becomes more spread
between 0 and 1 as σ becomes smaller. Some plots are shown in Figure 2(e) and (f).
2.3 Asymptotics
In general, we can explicitly evaluate the reward distribution for any n by solving the optimization
program (4). Nevertheless, it is helpful to get a handle on the empirical distribution of the solution
4
1 1 1
0.46 0.45
0.485
0.4 0.4 0.455
0.4 0.445
1 1 1
0.46
0.4 0.38
0.4 0.405 0.41 0.415 0.42 0.36 0.38 0.4 0.36 0.38 0.4
0 0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
(d) U (x) = −x−1 (e) U (x) = log sigmoid(x) (f) U (x) = log sigmoid(4x)
to this optimization program in the limit n → ∞. The next result gives a closed-form expression of
the reward distribution in the case of a large number of completions.
Theorem 1. Let Uγ (x) = 0 < γ < 1. Then the reward distribution of (4)
xγ for some converges to
1−γ 1−γ − 1+γ 1+γ
the Beta distribution Beta 2 , 2 as n → ∞, which has probability density x 2 (1 − x)− 2
on (0, 1).
Theorem 2. For Uγ (x) = −x−γ for 0 ≤ γ ≤ 1 (as a convention, take U0 (x) = log x). Then. the
reward distribution of (4) converges in distribution to Beta( 1+γ 1+γ
2 , 2 ).
The proof of Theorem 2 can be found in [21, 17]. In the limit γ → 1 in Theorem 2, the Beta
distribution tends to Beta(1, 1), which is the uniform distribution on [0, 1]. This is indeed an example
of the one-dimensional Thomson problem [5], which asks the configuration of n electrons constrained
to a line that repel each other with a force given by Coulomb’s law. This problem was first considered
by Maxwell. Indeed, [21, 11, 1] prove that the reward distribution will converge to the uniform
distribution for Uγ (x) = −x−γ with γ ≥ 1.
For the above two classes, the limiting distribution does not admit a probability mass. However,
probability mass can emerge in the case of a scaled log-sigmoid function.
Theorem 3. If U is strictly increasing and concave, the derivative of the utility function satisfies
U ′ (0) < ∞, U ′ (1) > 0, then the reward distribution of (4) converges in distribution to a probability
5
measure µ∗ that satisfies
µ∗ ({0}) = µ∗ ({1}) ≥ 1
κ+1 ,
In general, the reward distribution can be characterized from a variational perspective. This
gives the following theorem.
Theorem 4. If U is bounded, strongly concave, and increasing. There exists a probability measure µ∗
such that the reward distribution of (2) converges in distribution to µ∗ , which is uniquely determined
by the following two properties:
(a) µ∗ maximizes
E iid U (|X − X ′ |)
X,X ′ ∼ µ
(b) it is symmetric with respect to 12 in the sense that, for any measurable set A ∈ [0, 1] and
1 − A = {x : 1 − x ∈ A}, µ∗ (A) = µ∗ (1 − A).
3 Experiments
In this section, we conduct experiments to investigate the phenomenon of reward collapse in a
controlled setting and demonstrate that prompt-aware training can prevent reward collapse.
6
Figure 3: Reward collapse on the test set. This figure follows the same setting as Figure 1 while
the evaluation is on the test set. As we can see from the figure, the reward distributions have similar
collapse phenomenons on the test set, and using prompt-aware loss can mitigate the collapse.
Prompt-aware training avoids reward collapse. Figures 1 and 3 show the reward distribution
at the end of training with varying utility functions. The results along with Figure 3.2 reveal that
using a prompt-aware U function effectively prevents reward collapse across both training and test
datasets. This strategy yields a more uniform reward distribution for open-ended prompts while
promoting a more polarized reward distribution for concrete prompts.
4 Proofs
In this section, we present the proofs of the theorems in Section 2. For ease of presentation, we start
by proving Theorem 4. Let
X
S(r1 , · · · , rn ) := U (ri − rj ) and r̂ ≡ (r̂1 , . . . , r̂n ) := arg max S(r1 , · · · , rn ).
0≤r1 ,··· ,rn ≤1
1≤i<j≤n
In addition, for any vector (u1 , · · · , un ) ∈ Rn , we employ boldface notation u to represent the entire
vector. This allows us to write S(r).
7
(a) log sigmoid as utility function (b) Prompt-aware utility function
Figure 4: (Left) Reward collapse when using log sigmoid as utility function [25]. The
reward distribution of different prompts gradually converges into a single distribution during training.
(Right) Prompt-aware training avoids reward collapse. When using the prompt-aware loss
function, the reward distributions of the two different prompts can be gradually separated during
training.
for all n.
8
The proof of this lemma is also provided in Appendix B.3. Now, we are ready to prove the
uniqueness part of Theorem 4. Due to the length constraint, we will present it as a separate lemma
and defer the proof to Appendix B.4. In summary, we use Lemma 4.2 and Lemma 4.3 to demonstrate
that for two distinct symmetric measures, their distance is sufficiently large such that at least one of
them is not optimal.
Lemma 4.4. If µ1 and µ2 are two symmetric probability measure which both maximize
E iid U (|X − X ′ |)
X,X ′ ∼ µ
d
Let µ be another probability measure on [0, 1]. Let Q̂n = n1 ni=1 δqn,i such that Q̂n → µ. By the
P
same argument before, we also have E ′ iid
U (|X − X ′ |) → E ′ iid
U (|X − X ′ |). Then by
X,X ∼ Q̂k(n) X,X ∼ µ
the optimal assumption of r̂n ,
E iid U (|X − X ′ |) = lim E iid U (|X − X ′ |)
X,X ′ ∼ µ̂ n→∞ X,X ′ ∼ P̂k(n)
This means µ̂ maximize E iid U (|X − X ′ |) over all probability measure µ on [0, 1]. From Lemma
X,X ′ ∼ µ
4.1, we know that 1 − r̂i = r̂n−i+1 , so µ̂ is symmetric. If there is another sub-sequence m(n) such
d
that P̂m(n) → ν̂. By the same argument before, ν̂ is also optimal and symmetric. From Lemma 4.4,
µ̂ = ν̂. Thus for every converging sub-sequence of {P̂n }, the limit distribution must be the same. By
d
the tightness of {P̂n }, we have P̂n → µ∗ .
9
4.3 Proof of Theorem 3
Theorem 3 can be intuitively understood as follows: If the function U satisfies U ′ (0) < ∞ and
U ′ (1) > 0, we can show, by analyzing the first-order optimality condition, that a positive fraction of
r̂ is equal to 1.
P k−1 N
∂ U (ri −rj ) X
′
X
− i<j
∂rk = U (r̂i − r̂k ) − U ′ (r̂k − r̂j ) ≤ (k − 1)U ′ (0) − (n − k)U ′ (1).
r̂1 ,··· ,r̂n
i=1 i=k+1
The inequality follows from the convexity of U . If k ≤ n/(κ+1), we have (k−1)U ′ (0)−(n−k)U ′ (1) ≤ 0.
Hence, we can get r̂k = 1. Otherwise, we could increase r̂k to make i<j U (r̂i − r̂j ) larger. As
P
The function S(r) is similar to a family of log-likelihood functions considered in [23]. We presume
that U is increasing and concave. Then similar to Lemma P 4.1, U is also concave in (r1 , · · · , rn ).
Let r̂ = (r̂1 , . . . , r̂n ) be the vector that maximizes S(r) = 1≤i,j≤n U (ri − rj )sigmoid(θi − θj ). We
present the following consistency result for r̂:
Theorem 5. Assume that U is increasing and µ-strongly concave for µ > 0. Write θmax =
max1≤i≤n |θi |. Then, r̂ keeps the order of {θi }1≤i≤n and satisfies
q
|r̂i − r̂j | ≤ 2 U (1)(1 + eθmax )|θi − θj |/µ.
The proof of these results can be found in Appendix D. Theorem 5 ensures that for any increasing
and strongly concave utility function U , r̂ is a reliable estimate of {θi }1≤i≤n , in the sense that r̂i
and r̂j are close if θi and θj are close.
10
Even though we may not be able to determine the precise limiting distribution of rn in this
extended setting, we can still extract insights from our previous analysis in Section 2. As previously
observed, selecting U (x) = x tends to polarize the reward distribution, while selecting U (x) = −1/x
yields a more uniform reward [Link] phenomenon is also evident in this setting, as observed
in the results presented in Figure 5. More details is given in Appendix D.
1 1
0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
Based on these findings, we can conclude that in this extended setting, we can also employ
a prompt-aware utility function U to mitigate reward collapse and achieve the desired reward
distribution by carefully selecting the form of U . This provides us with flexibility in shaping the
reward distribution according to our specific requirements.
6 Discussion
In this paper, we have introduced an empirical phenomenon known as reward collapse that arises
during reward model training for aligning LLMs using human preference rankings. This phenomenon
results in the same reward distribution regardless of the prompt type. The occurrence of reward
collapse stems from neural network interpolation during the final training phase. To mitigate reward
collapse, we propose utility functions that consider the nature of prompts and an analytical framework
that evaluates reward distribution, yielding closed-form reward expressions. Synthetic experiments
substantiate our findings, presenting a method superior to early stopping to tackle reward collapse.
While our experiments provide valuable insights, it is important to acknowledge their limitations,
primarily due to constrained computational resources available to us. Given abundant resources,
future research can explore the use of a more diverse range of prompts, varying in terms of their
open-endedness. Additionally, it would be interesting to investigate the extent to which the trained
reward model enhances the capabilities of large language models, such as their ability to self-calibrate
uncertainty [18, 15]. Theoretical investigations could focus on finding increasing, concave functions
that precisely match a given discrete reward distribution. On the practical side, developing a method
to choose a utility function based on prompts, perhaps using a parameter such as γ in Section 2.2,
poses an intriguing avenue for further exploration. Furthermore, exploring the potential benefits of
truncated ranking by requiring human labelers to provide partial rankings of acceptable completions
and ignore unacceptable completions could offer valuable insights into improving the training of
reward models.
11
Acknowledgments
We are grateful to Banghua Zhu for helpful discussions. We also thank Long Ouyang and Jan Leike
for clarifications on [25]. This work was supported in part by NSF through CAREER DMS-1847415
and CCF1934876, Analytics at Wharton, and Wharton AI and Analytics for Business.
References
[1] P. Amore and M. Jacobo. Thomson problem in one dimension: Minimal energy configurations
of n charges on a curve. Physica A: Statistical Mechanics and its Applications, 519:256–266,
2019.
[2] D. Bahdanau, F. Hill, J. Leike, E. Hughes, A. Hosseini, P. Kohli, and E. Grefenstette. Learning
to understand goal specifications by modelling reward. arXiv preprint arXiv:1806.01946, 2018.
[3] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli,
T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from
human feedback. arXiv preprint arXiv:2204.05862, 2022.
[4] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirho-
seini, C. McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint
arXiv:2212.08073, 2022.
[5] M. Bowick, A. Cacciuto, D. R. Nelson, and A. Travesset. Crystalline order on a sphere and the
generalized thomson problem. Physical Review Letters, 89(18):185502, 2002.
[6] S. P. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
[7] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of
paired comparisons. Biometrika, 39(3/4):324–345, 1952.
[8] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement
learning from human preferences. Advances in neural information processing systems, 30, 2017.
[9] K. Ethayarajh, H. Zhang, Y. Wang, and D. Jurafsky. Stanford human preferences dataset, 2023.
[10] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer,
K. Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors,
and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
[11] D. P. Hardin, E. B. Saff, et al. Discretizing manifolds via minimum energy points. Notices of
the AMS, 51(10):1186–1194, 2004.
[12] P. He, J. Gao, and W. Chen. Debertav3: Improving deberta using electra-style pre-training
with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543, 2021.
12
[14] B. Ibarz, J. Leike, T. Pohlen, G. Irving, S. Legg, and D. Amodei. Reward learning from human
preferences and demonstrations in atari. Advances in neural information processing systems, 31,
2018.
[16] A. Köksal, T. Schick, A. Korhonen, and H. Schütze. Longform: Optimizing instruction tuning
for long text generation with corpus extraction, 2023.
[17] N. S. Landkof and N. Landkof. Foundations of modern potential theory, volume 180. Springer,
1972.
[18] S. Lin, J. Hilton, and O. Evans. Teaching models to express their uncertainty in words.
Transactions on Machine Learning Research, 2022.
[19] H. Liu, C. Sferrazza, and P. Abbeel. Chain of hindsight aligns language models with feedback.
arXiv preprint arXiv:2302.02676, 2023.
[20] R. D. Luce. Individual choice behavior: A theoretical analysis. Courier Corporation, 2012.
[22] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju,
W. Saunders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight, B. Chess,
and J. Schulman. Webgpt: Browser-assisted question-answering with human feedback. In arXiv,
2021.
[23] R. Noothigattu, D. Peters, and A. D. Procaccia. Axioms for learning from pairwise comparisons.
Advances in Neural Information Processing Systems, 33:17745–17754, 2020.
[26] V. Papyan, X. Han, and D. L. Donoho. Prevalence of neural collapse during the terminal phase
of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663,
2020.
[27] Z. Sun, Y. Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox, Y. Yang, and C. Gan. Principle-driven
self-alignment of language models from scratch with minimal human supervision. arXiv preprint
arXiv:2305.03047, 2023.
[28] B. Zhu, J. Jiao, and M. I. Jordan. Principled reinforcement learning with human feedback from
pairwise or k-wise comparisons. arXiv preprint arXiv:2301.11270, 2023.
13
[29] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irv-
ing. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593,
2019.
14
A Training Details
We use the following extension of the utility functions during our training.
(
log(x + ϵ) for x > 0
• log x: U (x) = , where ϵ is set to 0.1.
x + log(ϵ) for x ≤ 0
(
−1/(x + ϵ) for x > 0
• −1/x: U (x) = , where ϵ is also set to 0.1.
x − 1/ϵ for x ≤ 0
• log sigmoid(x): U (x) = log sigmoid(4x). Here, the scaling factor of 4 ensures the output of
log sigmoid spans a sufficient range.
To train the reward model, we adopted the approach used in the OpenAssistant project, which
utilizes the DeBERTaV3 Base model [12]. To constrain the reward output between 0 and 1, a
sigmoidfunction was appended before the final output. The reward model was trained with a batch
size of 224 (comprising eight questions per batch, each with 28 pairs) for a total of 1000 steps,
approximately equivalent to 1 epoch. The maximum learning rate was configured to 1e-5, utilizing
the Adam optimizer and a linear learning rate schedule, inclusive of 10% warmup steps. The reward
model was trained on a single A6000 GPU, with the entire training process concluding in roughly 1
hour.
if i ̸= k, k + 1;
r̂i
r̃i = r̂k+1 if i = k;
if i = k + 1.
r̂k
Then
X X
U (r̂i − r̂j ) − U (r̃i − r̃j ) = U (r̂k − r̂k+1 ) − U (r̂k+1 − r̂k ) < 0
1≤i<j≤n 1≤i<j≤n
because U is strictly increasing and r̂k − r̂k+1 < 0. This contracts with the fact that r̂ is the solution
of the optimization problem, and thus the conclusion holds.
15
LemmaPB.2. If the utility function U (x) is strictly increasing and strictly concave, then the function
S(r) = 1≤i<j≤n U (ri − rj ) is concave. Moreover, the solution of optimization problem (2) is unique
and satisfies: 1 − r̂i = r̂n−i+1 for i = 1, 2, · · · , n.
Proof. The concavity of S follows directly from definition:
X
S(r) + S(r′ ) = U (ri − rj ) + U (ri′ − rj′ )
1≤i<j≤n
X ri + ri′ − rj − rj′ r + r′
≤ 2U ( ) = 2S( ).
2 2
1≤i<j≤n
The above inequality is an equality if and only if ri − rj = ri′ − rj′ for all 1 ≤ i < j ≤ n when U (x) is
strictly concave. When U is increasing, the solution r̂ of the optimization problem satisfies r̂1 = 1.
Thus the solution of the optimization problem max1≤r1 ,··· ,rn ≤1 S(r) is unique, otherwise the vector
r1 +r2
2 makes S larger where r1 and r2 are two different solutions.
Finally, let r̂ be the unique solution of the optimization problem. Let us define r̃i = 1 − r̂n−i+1 for
all i = 1, 2, · · · , n. It follows that r̃i − r̃j = r̂n−j+1 − r̂n−i+1 , and we have S(r̂) = S(r̃). Consequently,
the uniqueness of the solution implies r̂ = r̃. This means that r̂i = 1 − r̂n−i+1 for i = 1, · · · , n.
The value of S does not change if we increase all ri by the same constant. Thus the value of S(r)
only depends on ProjVn (r) where Vn ⊂ Rn denotes the subspace orthogonal to (1, 1, · · · , 1). We can
define a new function on Vn by letting
F (ProjVn (r)) = S(r).
The domain of F is A = {v ∈ Vn |∃r ∈ Rn such that 0 ≤ ri ≤ 1 and v = ProjVn (r)}. First, we can
show that F is nµ-strongly concave.
Because U is µ-strongly concave, U (x) + µ2 x2 is concave. It follows that
µ X
S(r) + (ri − rj )2
2
1≤i<j≤n
X n
X n
X
(ri − rj )2 = n ri2 − ( ri )2
1≤i<j≤n i=1 i=1
by Lagrange identity. Then note that Vn is the subspace orthogonal to (1, 1, · · · , 1). The projection
onto Vn is given by
n n
1X 1X
ProjVn (r) = (r1 − ri , · · · , rn − ri ).
n n
i=1 i=1
16
As a result,
2
n n n n
ri − 1 1 X 2 1
X X X X
∥ ProjVn (r)∥2 = rj = ri2 − ( ri ) = (ri − rj )2 .
n n n
i=1 j=1 i=1 i=1 1≤i<j≤n
µP
From this equation and the concavity of S(r) + 2 1≤i<j≤n (ri − rj )2 , we know that
nµ
S(r) + ∥ ProjVn (r)∥2
2
nµ
is also concave. Consequently, F (ProjVn (r)) + 2 ∥ ProjVn (r)∥
2 is concave, which lead to the strong
concavity of F because
nµ
F (v) + ∥v∥2
2
is concave. Let v̂ be the optimal vector that maximizes F (v), strong concavity implies (See e.g.
Section 9.1.2 in [6])
nµ
F (v) − F (v̂) ≤ − ∥v − v̂∥2 .
2
Therefore, by the definition of F (ProjVn (r)) = S(r), we have
nµ
S(u) − S(r̂) ≤ − ∥ ProjVn (u − r̂)∥2 .
2
Consequently,
n
(1) (2)
X
∥ ProjVn (r(1) (2) 2
n − rn )∥2 = (rn,i − rn,i )2 .
i=1
17
If µ1 and µ2 are two different symmetric probability measure on [0, 1], we can assume that there
exists q1 < q2 ∈ [0, 1] and δ ≥ 0, such that µ1 ([0, q2 ]) < µ2 ([0, q1 ]) − δ. So when n−1
i−1
∈
(1)
(µ1 ([0, q2 ]), µ2 ([0, q1 ]) − δ), we have rn,n−i+1 ≥ q2 because µ1 ([0, q2 ]) < n−1 .
i−1
We also have
(2) (1) (2)
rn,n−i+1 ≤ q1 because µ2 ([0, q1 ]) > As a result,
i−1
n−1 . − ≥ q2 − q1 whenever
rn,n−i+1 rn,n−i+1
(i − 1)/(n − 1) ∈ (µ1 ([0, q2 ]), µ2 ([0, q1 ]) − δ). Because the length of the interval is positive, the number
of such i is larger than c1 n where c1 is a constant independent of n. Then we conclude that
n
(1) (2)
X
∥ ProjVn (r(1) (2) 2
n − rn )∥2 = (rn,i − rn,i )2
i=1
≥ c1 n(q1 − q2 )2 .
d
P(j)
n → µj , j = 1, 2.
This can be proved easily by considering the definition of convergence in distribution. Since G is
bounded, this lead to E ′ iid (j)
U (|X − X ′ |) → M, j = 1, 2 as n → ∞.
X,X ∼ Pn
The expectation E iid (j) U (|X − X ′ |) can be written more precisely as
X,X ′ ∼ Pn
1 X (j) (j)
E iid (j) U (|X − X ′ |) = U (|rn,i − rn,i′ |).
X,X ′ ∼ Pn n2
1≤i,i′ ≤n
18
Here, we uses 2∥x∥22 + 2∥y∥22 ≥ ∥x − y∥22 . So
1 X (j) (j)
min U (|rn,i − rn,i′ |) − U (|r̂n,i − r̂n,i′ |)
j=1,2 n2
′ 1≤i,i ≤n
2µ
= − max ∥ ProjVn (r(j) 2
n − r̂n )∥2
n − 1 j=1,2
(1) (2)
2µ ∥ ProjVn (rn − rn )∥2
≤ −
n−1 4
µc0 n c0 µ
≤ − ≤− .
2n − 2 2
Since M = max E U (|X − X ′ |), we know n12 1≤i,i′ ≤n U (|r̂n,i − r̂n,i′ |) ≤ M . As a result,
P
′ iid
X,X ∼ µ
1 X
min E iid (j) U (|X − X ′ |) ≤ U (|r̂n,i − r̂n,i′ |) − µc0 /2 ≤ M − µc0 /2.
j=1,2 X,X ′ ∼ Pn n2
1≤i≤i′ ≤n
C Proof of Theorem 1
Given Theorem 4, we only need to find a symmetric probability measure on [0, 1], which maximizes
The following proof in this section is adapted from [13]. Let M (B([0, 1])) denote the sets of all finite
signed measure on the Borel sigma algebra B([0, 1]). Apparently, P (B([0, 1])) ⊂ M (B([0, 1])). Then
we define the following “inner product” in M (B([0, 1])):
Z
′
⟨µ, ν⟩ = EX∼µ,X ′ ∼ν,independent U (|X − X |) = U (|x − y|)µ(dx)ν(dy).
[0,1]2
We also define I(µ) as I(µ) := ⟨µ, µ⟩. With these notations, the problem becomes
max I(µ).
µ∈P (B([0,1]))
Lemma C.1. For U (x) = xγ with γ ∈ (0, 1). If µ is a signed measure satisfying µ([0, 1]) = 0, then
we have I(µ) ≤ 0. Moreover, I(µ) = 0 if and only if µ(E) = 0 for all E ⊂ [0, 1].
1−cos(xt)
Proof. f (t) = t1+γ
is integrable on (0, ∞). As a result, using change of variables, we have
∞
1 − cos(xt)
Z
γ
|x| = C dt
0 t1+γ
19
for come constant C > 0. Then by Fubini’s theorem, we have
Z
⟨µ, µ⟩ = |x − y|γ µ(dx)µ(dy)
[0,1]2
∞
1 − cos((x − y)t)
Z Z
=C dtµ(dx)µ(dy)
[0,1]2 0 t1+γ
!
∞
1 − cos((x − y)t)
Z Z
=C µ(dx)µ(dy) dt.
0 [0,1]2 t1+γ
1 − cos((x − y)t)
Z
µ(dx)µ(dy)
[0,1]2 t1+γ
!
eixt e−iyt
Z
= −ℜ µ(dx)µ(dy)
[0,1]2 t1+γ
= − ℜ |µ̂(t)|2 ,
∞
|µ̂(t)|2
Z
I(µ) = −C dt ≤ 0.
0 t1+γ
Moreover, I(µ) = 0 if and only if µ̂(t) = 0 for all t ∈ [0, ∞) if and only if µ(E) = 0 for all
E ∈ B([0, 1]).
Lemma C.2. Let U (x) = xγ for some γ ∈ (0, 1). If a probability measure µ on [0, 1] maximize
E ′ iid
U (|X − X ′ |) if it satisfies that EX∼µ U (|X − c|) does not depend on c ∈ [0, 1].
X,X ∼ µ
Proof of Lemma 4.5. For two probability measure µ and ν on [0, 1], (µ − ν)([0, 1]) = 0. Suppose µ
satisfies EX∼µ U (|X − c|) = K does not depend on c ∈ [0, 1]. Note that
Z Z ! Z
⟨ν − µ, µ⟩ = |x − y|γ µ(dx) (ν − µ)(dy) = K(ν − µ)(dy) = 0.
[0,1] [0,1] [0,1]
20
C.2 Proof of Theorem 1
Proof of Theorem 1. Let µ be the probability measure induced by Beta( 1−γ 1−γ
2 , 2 ). It has probability
density function
1 − 1+γ − 1+γ
fγ (x) = 2 (1 − x)
1−γ 1−γ x
2 .
B( 2 , 2 )
For any c ∈ [0, 1], EX∼µ U (|X − c|) can be expressed as
Z 1
1 1+γ 1+γ
EX∼µ U (|X − c|) = 1−γ 1−γ |x − c|γ x− 2 (1 − x)− 2 dx
B( 2 , 2 ) 0
Z π
1 2
= 1−γ 1−γ | sin2 θ − c|γ (sin θ)−1−γ (cos θ)−1−γ d sin2 θ
B( 2 , 2 ) 0
π γ
sin2 θ − c
Z
2 2
= 1−γ 1−γ dθ
B( 2 , 2 ) 0 sin θ cos θ
γ
!
∞ Z π/2
sin2 θ − c
Z
2
= 1−γ 1−γ 1 ≥ t dθ dt.
B( 2 , 2 ) 0 0 sin θ cos θ
Because
γ
cos θ + 2c − 1 γ
Z π/2
sin2 θ − c 1 π
Z
1 ≥ t dθ = 1 ≥ t dθ
0 sin θ cos θ 2 0 sin θ
π 1 π n
Z o
= − 1 − cos θ − t1/γ sin θ ≤ 2c − 1 ≤ − cos θ + t1/γ sin θ dθ
2 2 0
( )
π 1 π 2c − 1
Z
= − 1 − cos(θ − ϕ) ≤ p ≤ − cos(θ + ϕ) dθ
2 2 0 1 + t2/γ
π
= − ϕ,
2
where tan ϕ = t1/γ and ϕ ∈ [0, π/2], and the last equation use the fact that c ∈ [0, 1]. As a result,
EX∼µ U (|X − c|) does not depend on c.
Note that Beta distribution
is also
symmetric. It follows from Theorem 4 that the reward
1−γ 1−γ
distribution converges to Beta 2 , 2
D Proof in Section 5
D.1 Proof of Theorem 5
Proof of Theorem 5. First, we prove that r̂ keeps the order of {θi }1≤i≤n . If there exist i and j such
that θi < θj and r̂i > r̂j , we define
if k ̸= i, j;
r̂k
r̃k = r̂j if k = i;
if k = j.
r̂i
21
Then
X X
S(r̂) − S(r̃) = U (r̂i − r̂k )sigmoid(θi − θk ) + U (r̂k − r̂i )sigmoid(θk − θi )
k̸∈{i,j} k̸∈{i,j}
X X
+ U (r̂j − r̂k )sigmoid(θj − θk ) + U (r̂k − r̂j )sigmoid(θk − θj )
k̸∈{i,j} k̸∈{i,j}
X X
− U (r̂j − r̂k )sigmoid(θi − θk ) − U (r̂k − r̂j )sigmoid(θk − θi )
k̸∈{i,j} k̸∈{i,j}
X X
− U (r̂i − r̂k )sigmoid(θj − θk ) − U (r̂k − r̂i )sigmoid(θk − θj )
k̸∈{i,j} k̸∈{i,j}
Note that for a < b and c < d, we have inequality ad + bc − ac − bd = (a − b)(d − c) < 0. It follows
that from the monotonicity of U and sigmoid function, and our assumption θi < θj and r̂i > r̂j , we
have
As a result, S(r̂) < S(r̃), which contradicts the optimality of r̂. This gives that r̂ keep the order of
{θi }1≤i≤n .
To prove the inequality in Theorem 5, let us consider the case where i = 1 and j = 2 without
loss of generality. We also assume θ1 ≥ θ2 . Then it follows from previous argument that r̂1 ≥ r̂2 .
From the optimality of r̂, we have
r̂1 + r̂2 r̂1 + r̂2
S(r̂) ≥ S( , , r̂3 , · · · , r̂n ).
2 2
22
The difference can be written as
r̂1 + r̂2 r̂1 + r̂2
S( , , r̂3 , · · · , r̂n ) − S(r̂)
2 2
X r̂1 + r̂2
= U( − r̂i )(sigmoid(θ1 − θi ) + sigmoid(θ2 − θi ))
2
i>2
X X
− U (r̂1 − r̂i )sigmoid(θ1 − θi ) − U (r̂2 − r̂i )sigmoid(θ2 − θi )
i>2 i>2
X r̂1 + r̂2
+ U (r̂i − )(sigmoid(θi − θ1 ) + sigmoid(θi − θ2 ))
2
i>2
X X
− U (r̂i − r̂1 )sigmoid(θi − θ1 ) − U (r̂i − r̂2 )sigmoid(θi − θ2 )
i>2 i>2
+ U (0)sigmoid(θ1 − θ2 ) + U (0)sigmoid(θ2 − θ1 )
− U (r̂1 − r̂2 )sigmoid(θ1 − θ2 ) − U (r̂2 − r̂1 )sigmoid(θ2 − θ1 ).
23
and L be the Lipschitz constant of sigmoid(·), which is bounded by 1, we have
24