
Transformers Are Born Biased

Transformers Are Born Biased: Structural Inductive Biases at Random Initialization and
Their Practical Consequences

Siquan Li∗ lisiquan@[Link]


The Chinese University of Hong Kong, Shenzhen

Yao Tong∗ tongyao@[Link]


National University of Singapore
arXiv:2602.05927v1 [[Link]] 5 Feb 2026

Haonan Wang∗ [Link]@[Link]


National University of Singapore

Tianyang Hu† hutianyang@[Link]


The Chinese University of Hong Kong, Shenzhen

Abstract
Transformers underpin modern large language models (LLMs) and are commonly assumed
to be behaviorally unstructured at random initialization, with all meaningful preferences
emerging only through large-scale training. We challenge this assumption by showing that
randomly initialized transformers already exhibit strong and systematic structural biases.
In particular, untrained models display extreme token preferences: across random input
sequences, certain tokens are predicted with probabilities orders of magnitude larger.
We provide a mechanistic explanation for this phenomenon by dissecting the transformer
architecture at initialization. We show that extreme token preference arises from a
contraction of token representations along a random seed-dependent direction. This contraction is
driven by two interacting forces: (i) asymmetric nonlinear activations in MLP sublayers
induce global (inter-sequence) representation concentration, and (ii) self-attention further
amplifies this effect through local (intra-sequence) aggregation. Together, these mechanisms
align hidden representations along a direction determined solely by the random initialization,
producing highly non-uniform next-token predictions.
Beyond mechanistic insight, we demonstrate that these initialization-induced biases persist
throughout training, forming a stable and intrinsic model identity. Leveraging this property,
we introduce SeedPrint, a fingerprinting method that can reliably distinguish models that
differ only in their random initialization, even after extensive training and under substantial
distribution shift. Finally, we identify a fundamental positional discrepancy inherent to the
attention mechanism’s intra-sequence contraction that is causally linked to the attention-sink
phenomenon. This discovery provides a principled explanation for the emergence of sinks
and offers a pathway for their control.
Keywords: Transformer, Inductive Bias, LLM Fingerprint, Attention Sink

∗. Equal contribution.
†. Correspondence to Tianyang Hu.


Figure 1: Initialized models are not blank slates. (a) When conducting next-token
prediction on random sequences, a randomly initialized transformer exhibits extreme
biases: certain tokens are predicted orders of magnitude more often than others. For
reference, the red dashed line indicates the empirical top-ranked frequencies
observed under uniform random sampling. (b) The token representations from random
transformers are severely contracted toward a common direction, as indicated by
the pairwise cosine similarity of the last-token representations across sequences.

1 Introduction

Large Language Models (LLMs) have become the cornerstone of contemporary AI, driving
advances in language understanding, code generation, reasoning, and beyond (Brown et al.,
2020; Achiam et al., 2023; Wei et al., 2022b; Touvron et al., 2023). With access to astronomical
amounts of data, it is often assumed that a model’s capabilities are entirely shaped by
training—that the vast pre-training and post-training processes alone define its personality,
preference, and knowledge (Kaplan et al., 2020; Hoffmann et al., 2022; Ouyang et al., 2022).
Under this prevailing view, a model at random initialization is merely a blank slate, a white
sheet of parameters awaiting instruction from data.
However, we reveal a surprising phenomenon: a randomly initialized transformer is not
featureless. Even before seeing any data, it exhibits systematic biases in token preferences.
As shown in Figure 1a, when performing next-token prediction on random input sequences,
certain tokens are predicted orders of magnitude more often than others. This counterintuitive
observation challenges the blank-slate assumption and motivates a closer examination of the model’s
initialization regime.
Despite the celebrated success of transformers, the theoretical understanding of these
architectures has lagged behind their empirical progress, and modern LLMs still function largely
as black boxes. Existing studies primarily focus on scaling laws, optimization dynamics, or
emergent behaviors after training (Kaplan et al., 2020; Liu et al., 2020; Noci et al., 2022; Wei
et al., 2022a), while the initialization regime—where no learning has yet occurred—remains
largely unexplored.
To address this gap, we use random token sequences as probes to systematically study
the initialization state of transformers. First, we show that the observed extreme token
preferences are tied to random initialization: different random seeds give rise to distinct
preference patterns. Then, we analyze the internal dynamics from input to output and
attribute the extreme next-token biases to an inherent contraction of token
representations (Figure 1b). Tracing this contraction through the transformer architecture,
we identify
two complementary forces that jointly drive the effect:

• Inter-sequence Concentration from MLP sublayers (Section 4.2): The asymmetric
nonlinear activation (e.g., GELU) in MLP sublayers contracts representations
of random input tokens toward a shared direction among different sequences.

• Intra-sequence Concentration from self-attention sublayers (Sections 4.3
and 4.4): The aggregation of value vectors in self-attention contracts token
representations within the same sequence toward a shared direction, which in turn
amplifies the existing MLP-induced inter-sequence contraction.

These results are conceptually striking: they demonstrate that the transformer architecture
itself, even prior to training, possesses structural biases tied to random initialization. In
Section 5, we further reveal that these initial biases persist after training, forming a hidden
and enduring model identity.
Finally, beyond mechanistic understanding, our findings also have practical implications.

• Biometric-prototype LLM fingerprinting (Section 6.1): Since the observed bias
is model-distinct, seed-specific, and persistent through training, it offers a natural
foundation for birth-to-life model identification (Galton, 1892) that does not rely on
training artifacts. Utilizing the initial token biases, we propose a novel fingerprinting
method that can even differentiate two LLMs trained with identical training pipelines
but initialized with different random seeds.

• Attention-sink Mitigation (Section 6.2): We identify a fundamental but previously
overlooked positional discrepancy inherent to the attention mechanism’s intra-sequence
contraction. We demonstrate that this discrepancy is causally linked to the widely
observed attention-sink phenomenon (Xiao et al., 2023) in pretrained LLMs. Uncovering
this statistical origin provides a principled way to modulate sink strength through
simple architectural adjustments.

The rest of this paper is structured as follows. In Section 2, we provide notation and formally
introduce the structure of transformers. In Section 3, we introduce the next-token extreme
preference phenomenon in detail. A mechanistic understanding of this phenomenon is
provided in Section 4, followed by empirical evidence in Section 5 demonstrating that such
extreme token preference and the representation contraction persist during training. In
Section 6, we exploit this understanding for practical benefit, and we discuss potential
impact in Section 7.

2 Preliminary

Notations Bold uppercase letters $\mathbf{X}$ denote matrices, bold lowercase letters
$\mathbf{x}$ denote vectors, and italic letters $x$ denote scalars. $\|\mathbf{x}\|$ denotes
the $\ell_2$ norm of the vector $\mathbf{x}$. Unless otherwise specified, all vectors are
column vectors, and all nonlinearities (e.g., GELU (Hendrycks and Gimpel, 2016), SiLU
(Ramachandran et al., 2017)) are applied element-wise. We use $\odot$ to denote element-wise
(Hadamard) multiplication and $\oplus$ to denote residual addition. $\mathbf{I}_d$ denotes
the $d$-dimensional identity matrix.

2.1 Decoder-only Transformer Architecture

The standard decoder-only transformer architecture (Vaswani et al., 2017) has become
the dominant backbone for modern LLMs (Touvron et al., 2023; Team, 2024; OpenAI,
2025). Given a sequence of input token indices $\mathbf{i} = [i_1, \ldots, i_T]^\top$ from a
vocabulary $\mathcal{V}$, the transformer first maps these indices into a continuous vector
space via an embedding layer $\mathbf{X}^{(0)} = \mathrm{Embed}(\mathbf{i}) \in \mathbb{R}^{T \times d}$,
where $T$ denotes the sequence length and $d$ represents the hidden dimension. A standard
transformer block then processes these representations through a sequence of interleaved
components, typically consisting of self-attention mechanisms, position-wise Multi-Layer
Perceptron (MLP) networks, and layer normalization modules.

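Self-attention For a single self-attention head, given hidden states
$\mathbf{X} \in \mathbb{R}^{T \times d}$, the query, key and value matrices are computed as

$$\mathbf{Q} = \mathbf{X}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}_V,$$

where $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d \times d_k}$ are learnable
projection matrices and $d_k$ denotes the head dimension. The attention mechanism computes a
weighted sum of values based on the compatibility between queries and keys:

$$\mathrm{Attn}(\mathbf{X}) = \mathrm{Softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}.$$

Multi-head attention (MHA) is constructed by executing $h$ such heads in parallel,
concatenating their outputs, and applying a final linear projection
$\mathbf{W}_O \in \mathbb{R}^{h d_k \times d}$.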
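As an illustration of the single-head formula above, the following is a minimal sketch in plain Python. It implements the displayed expression literally, so it omits the causal mask that a decoder-only model would apply; the helper names (`matmul`, `softmax`, `attention`) and the toy inputs are assumptions made for this example, not part of the paper.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def softmax(row):
    """Numerically stable softmax of one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(X, W_Q, W_K, W_V):
    """Single head: Softmax(Q K^T / sqrt(d_k)) V, with no causal mask."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V)

# Toy check: zero query/key projections give uniform attention weights,
# so every output row is the plain average of the value vectors.
X = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = attention(X, zeros, zeros, identity)
```

With uniform weights the head reduces to averaging value vectors within a sequence, which is the kind of aggregation the paper later associates with intra-sequence contraction.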

Multi-Layer Perceptron The MLP sublayer is applied to each token representation
independently across the sequence. Following the architecture popularized by Radford et al.
(2019), we consider a two-layer architecture with a non-linear activation $\phi$:

$$\mathrm{MLP}(\mathbf{X}) = \phi(\mathbf{X}\mathbf{W}_{\mathrm{up}})\,\mathbf{W}_{\mathrm{down}},$$

where $\mathbf{W}_{\mathrm{up}} \in \mathbb{R}^{d \times d_{\mathrm{MLP}}}$ and
$\mathbf{W}_{\mathrm{down}} \in \mathbb{R}^{d_{\mathrm{MLP}} \times d}$ are learnable weight
matrices, and $d_{\mathrm{MLP}} > d$ is the intermediate MLP hidden dimension. Popular choices
for $\phi$ include GELU and SwiGLU. This sublayer expands the model capacity by projecting
the hidden states into a higher-dimensional space before mapping them back to the original
model dimension $d$.


Normalization Normalization is essential for stabilizing the training of both MHA and
MLP sublayers. The original transformer architecture utilized LayerNorm (Ba et al., 2016),
which normalizes each token representation $\mathbf{x} \in \mathbb{R}^d$ by subtracting its
mean $\mu$ and dividing by its standard deviation $\sigma$, followed by a learned affine
transformation:

$$\mathrm{LayerNorm}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \boldsymbol{\beta},$$

where $\boldsymbol{\gamma}, \boldsymbol{\beta} \in \mathbb{R}^d$ are learnable scale and shift
parameters, and $\epsilon$ is a small constant for numerical stability. Modern LLMs such as
LLaMA-4 (Meta AI, 2025) and Qwen3 (Yang et al., 2025) typically adopt Root Mean Square Layer
Normalization (RMSNorm) (Zhang and Sennrich, 2019), which omits mean-centering (and often
the bias term) and only rescales by the root mean square of the activations:

$$\mathrm{RMSNorm}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x}}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}.$$

Unless otherwise specified, $\mathrm{Norm}(\cdot)$ in our equations denotes LayerNorm.
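The difference between the two normalizations can be seen on a single vector. The sketch below is a plain-Python illustration of the two formulas above; the function names, the default value of `eps`, and the toy input are assumptions chosen for this example.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """LayerNorm: subtract the mean, divide by the standard deviation, then scale and shift."""
    d = len(x)
    gamma = gamma or [1.0] * d
    beta = beta or [0.0] * d
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    return [g * (v - mu) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma=None, eps=1e-5):
    """RMSNorm: divide by the root mean square only; no mean-centering and no shift."""
    d = len(x)
    gamma = gamma or [1.0] * d
    mean_sq = sum(v * v for v in x) / d
    return [g * v / math.sqrt(mean_sq + eps) for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 6.0]
ln = layer_norm(x)   # components average to zero
rn = rms_norm(x)     # mean square is ~1, but the positive mean of x is preserved
```

On this input LayerNorm removes the shared offset while RMSNorm keeps it, which is the practical meaning of "omits mean-centering" in the text.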

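Block Structure and Output Layer Each transformer block comprises an MHA sublayer
followed by a position-wise MLP sublayer. We employ the pre-norm configuration with
residual connections (He et al., 2016):

$$\mathbf{H}^{(l)} = \mathbf{X}^{(l-1)} \oplus \mathrm{MHA}\big(\mathrm{Norm}(\mathbf{X}^{(l-1)})\big),$$

$$\mathbf{X}^{(l)} = \mathbf{H}^{(l)} \oplus \mathrm{MLP}\big(\mathrm{Norm}(\mathbf{H}^{(l)})\big),$$

for layers $l = 1, \ldots, L$. After $L$ decoder blocks, the final hidden states
$\mathbf{X}^{(L)} \in \mathbb{R}^{T \times d}$ undergo a final normalization and an output
projection (the language-model head) to obtain logits
$\mathbf{L} \in \mathbb{R}^{T \times |\mathcal{V}|}$ over the vocabulary:

$$\mathbf{L} = \mathrm{Norm}_{\mathrm{final}}(\mathbf{X}^{(L)})\,\mathbf{W}_U, \tag{1}$$

where $\mathbf{W}_U \in \mathbb{R}^{d \times |\mathcal{V}|}$ is the unembedding matrix. We follow
the common practice of weight tying (Press and Wolf, 2016) and set
$\mathbf{W}_U = \mathbf{W}_E^\top$, where $\mathbf{W}_E$ is the embedding matrix.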

Autoregressive Language Modeling The transformer backbone described above
parameterizes an autoregressive language model. Given a sequence of tokens
$x_1, x_2, \ldots, x_T$, the model assumes the usual factorization

$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1}),$$

where each conditional distribution $P(x_t \mid x_{<t})$ is obtained by applying a softmax to
the logits at position $t$ in Equation (1). In the remainder of the paper, we study how the
architectural choices above, together with standard initialization, induce systematic
preferences over next-token probabilities even before training.
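As a small worked instance of this factorization, the sketch below sums per-step log-probabilities obtained from a softmax over logits. The function names and the toy logits are hypothetical; `logits_per_step[t]` is assumed to be the logit vector that scores `tokens[t]`.

```python
import math

def log_softmax(logits):
    """Log of the softmax distribution, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sequence_log_prob(logits_per_step, tokens):
    """log P(x_1..x_T) as the sum over t of log P(x_t | x_<t)."""
    return sum(log_softmax(step)[tok] for step, tok in zip(logits_per_step, tokens))

# A model with flat logits over a 4-token vocabulary assigns 1/4 to every token,
# so a 3-token sequence has log-probability 3 * log(1/4).
logits_per_step = [[0.0, 0.0, 0.0, 0.0]] * 3
tokens = [2, 0, 3]
log_p = sequence_log_prob(logits_per_step, tokens)
```

Flat logits correspond to the "blank slate" behavior that the rest of the paper argues does not occur in practice at initialization.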

2.2 Current LLM Research Frontline

Modern LLM research predominantly focuses on the capabilities and behaviors that emerge
after substantial training. This includes scaling laws and pretraining efficiency (Hoffmann
et al., 2022; Sun et al., 2024; Kaplan et al., 2020), post-training alignment and reasoning (Lin
et al., 2022; Wang et al., 2022, 2025a,b; Ren and Sutherland, 2024; Guo et al., 2025), and
model interpretability (Elhage et al., 2021; Allen-Zhu and Li, 2023; Dai et al., 2022; Chen
et al., 2024; Rai et al., 2024). Under this prevailing paradigm, a model at random initialization
is often viewed as a “blank slate”—a featureless collection of parameters awaiting instruction
from data (Bommasani, 2021; Mueller et al., 2022).
In contrast, we propose a shift in perspective from “what the model learns” to “what the
model is born with”. Our work investigates the fundamental structural inductive biases
present at the very beginning: the initialization regime. We argue that the transformer
architecture is not a neutral container for data, but rather a structured system with innate
computational tendencies. This fundamental perspective contributes to the community by
demonstrating that initialization-born structures are the mechanistic root of certain model
behaviors.
Specifically, our discovery provides a physical basis for SeedPrint, an LLM fingerprinting
method that enables reliable “birth-to-life” identification by leveraging seed-specific biases
that persist throughout the model’s lifecycle. Furthermore, we demystify the widespread
attention-sink phenomenon, revealing it to be a statistical byproduct of the architecture’s
initial variance discrepancy rather than a learned strategy. By isolating these effects, we offer
a direct pathway for controlling such behaviors through principled architectural adjustments
rather than intensive data-driven tuning.
In the following sections, we move beyond these high-level observations to provide a rigorous
mechanistic account of these biases, beginning with the empirical characterization of the
extreme token preference phenomenon.

3 Extreme Token Preference at Initialization

We use random sequences as a controlled probe to investigate transformers at initialization.
Studying how a randomly initialized transformer processes random input offers a principled
way to reveal the innate computational biases embedded in its architecture and initialization.
Specifically, we conduct our exploratory experiments on randomly initialized nano-scale
LLMs, including RoPE-enhanced GPT-2 (Radford et al., 2019) and LLaMA-2 (Touvron et al.,
2023). To ensure architectural comparability, both models adopt a consistent GPT-style
parameter initialization (see footnote 1). We generate $N = 2{,}000$ input sequences, where
each token is sampled uniformly and independently from the vocabulary $\mathcal{V}$
($|\mathcal{V}| \approx 50{,}000$). For each sequence, we perform a forward pass through the
randomly initialized model and record the token predicted at the final position (i.e., the
token with the maximum logit). Below we present our findings.

1. The specific details of this initialization are provided in Appendix B.1.


Finding 1: Across different random sequences, randomly initialized transformers have
an abnormally strong tendency to predict the same token.

Figure 1(a) illustrates the frequency distribution of the top-ranked predicted tokens derived
from this process. If the initialized model were truly a blank slate or featureless, the predicted
tokens should be distributed roughly uniformly across the vocabulary. This baseline is depicted
by the red dashed line, which represents the empirical top-ranked frequencies observed when
sampling N tokens completely at random from the vocabulary V. However, the actual model
behavior (blue bars) deviates drastically from this baseline. Instead of a uniform spread,
a tiny subset of tokens dominates the predictions. The most preferred token appears 58.5
times more frequently than the random baseline.
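The uniform-sampling baseline described above can be approximated with a few lines of Python. This is only a sketch of the idea behind the red dashed line: the function name, the seed, and the exact vocabulary size of 50,000 are assumptions for illustration, and the paper's own baseline may be computed differently.

```python
import random
from collections import Counter

def uniform_top_frequency(num_samples, vocab_size, seed=0):
    """Draw tokens uniformly at random and report the most frequent one."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(vocab_size) for _ in range(num_samples))
    top_token, top_count = counts.most_common(1)[0]
    return top_token, top_count, top_count / num_samples

# Same scale as the exploratory setup: N = 2,000 draws from a vocabulary of about 50,000.
top_token, top_count, top_freq = uniform_top_frequency(2000, 50000)
```

Because the expected count per token is only N / |V| = 0.04, even the most frequent token under uniform sampling appears just a handful of times, which is why a token predicted for several percent of all sequences stands out so sharply.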
To demonstrate the robustness of this random token prediction bias, we evaluate models across
three distinct model configurations: (a) RoPE-enhanced Nano GPT-2, (b) Nano LLaMA-2, and (c)
larger RoPE-enhanced 1.2B GPT-2. For each model configuration, we conduct independent
trials varying the input sequence length and random initialization seed. Each trial involves
generating $N = 10{,}000$ random input sequences (see footnote 2). For quantitative assessment,
we analyze the statistical significance of the observed token preference by calculating the
p-values under the null hypothesis of a uniform distribution (see footnote 3). The results are
summarized in Table 1, which explicitly reports the most frequent token (top-1 ID), its
frequency, and statistical significance.
We observe three prominent characteristics regarding the phenomenon:

Finding 1.1: The extreme token preference remains highly consistent and robust across
diverse model architectures, scales, and input lengths.

Regardless of whether a RoPE-enhanced GPT-2 or a LLaMA-2 architecture is used, the
model consistently exhibits a non-random preference for specific tokens, confirming that the
bias is a structural property rather than a stochastic artifact.

Finding 1.2: The larger the model and the longer the sequence, the more extreme the token
preference.

The intensity of the bias scales positively with both model scale and input sequence length.
For instance, in the 1.2B GPT-2 model (part (c)) under seed 43, the top-1 frequency increases
from 5.20% at a sequence length of 64 to a striking 10.20% at a length of 1024. Comparing
part (a) with part (c), the 1.2B model consistently exhibits higher bias intensities than its
nano counterpart at the same input length, suggesting that larger scales and longer sequences
further intensify the next-token prediction bias.

2. To ensure the comparability of token identifiers across different random seeds within the same architecture,
we maintain fixed weights for the embedding and language modeling head layers, isolating the source of
variation to the internal transformer blocks.
3. We applied Bonferroni correction to strictly account for multiple hypothesis testing across the vocabulary
size. Specific details of experimental setup are provided in Appendix B.2.

Table 1: Token prediction bias statistics across different models. We report the identity
of the most predicted token (top-1 ID), its frequency in percentage (top-1 Freq.
%), and its statistical significance (p-value). The results are grouped by model
architecture: (a) RoPE-enhanced Nano GPT-2, (b) Nano LLaMA-2, and (c)
RoPE-enhanced 1.2B GPT-2.

Seed   SeqLen   Top-1 ID   Top-1 Freq. (%)   p-value

(a) RoPE-enhanced Nano GPT-2 (12L, 12H, 768d)
42     64       6336       2.60              4.55 × 10^-137
42     256      30425      4.85              6.59 × 10^-285
42     1024     666        5.20              6.75 × 10^-309
43     64       7328       3.00              2.20 × 10^-162
43     256      7328       5.25              2.42 × 10^-312
43     1024     7328       6.75              0.00†

(b) Nano LLaMA-2 (12L, 12H, 768d)
42     64       18219      0.06              3.16 × 10^-2
42     256      19647      0.06              3.16 × 10^-2
42     1024     11048      0.07              1.40 × 10^-3
43     64       6479       0.09              1.90 × 10^-6
43     256      17743      0.06              3.16 × 10^-2
43     1024     21436      0.07              1.40 × 10^-3

(c) RoPE-enhanced 1.2B GPT-2 (24L, 32H, 2048d)
42     64       23062      3.95              2.57 × 10^-224
42     256      23062      6.20              0.00†
42     1024     23062      6.35              0.00†
43     64       20694      5.20              6.75 × 10^-309
43     256      20694      7.45              0.00†
43     1024     20694      10.20             0.00†

† The p-value is reported as 0.00 due to numerical underflow in floating-point precision.
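Underflow of this kind can be avoided by working in log space. The sketch below is a generic illustration and an assumption on our part: it uses a one-sided binomial tail with a Bonferroni factor of |V|, which is not the protocol of Appendix B.2 (not reproduced here) and is not expected to match the values in Table 1. The function names and the vocabulary size of 50,257 are hypothetical choices for the example.

```python
import math

def log_binom_pmf(j, n, p):
    """Natural log of the Binomial(n, p) probability mass at j."""
    return (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            + j * math.log(p) + (n - j) * math.log1p(-p))

def log10_bonferroni_p(top_count, num_samples, vocab_size):
    """log10 of min(1, |V| * P(X >= top_count)) for X ~ Binomial(N, 1/|V|)."""
    p = 1.0 / vocab_size
    terms = [log_binom_pmf(j, num_samples, p) for j in range(top_count, num_samples + 1)]
    m = max(terms)
    # Log-sum-exp keeps the tail probability representable even when it is far below 1e-308.
    log_tail = m + math.log(sum(math.exp(t - m) for t in terms))
    log_corrected = min(0.0, log_tail + math.log(vocab_size))
    return log_corrected / math.log(10)

weak = log10_bonferroni_p(6, 10_000, 50_257)       # top token seen 6 times (0.06%)
strong = log10_bonferroni_p(1020, 10_000, 50_257)  # top token seen 1,020 times (10.20%)
```

Reporting `log10` of the p-value keeps extreme cases finite and comparable instead of collapsing them all to 0.00.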

Finding 1.3: The extreme token preference is tied to the random initialization of
transformers.

Even for models with identical architecture and shared embedding weights, varying the
initialization seed consistently leads to different top-1 IDs. For instance,
in the 1.2B GPT-2 model, while seed 42 consistently favors token 23062 across all sequence
lengths, seed 43 shifts this preference to token 20694. This evidence confirms that while
the phenomenon of prediction bias is a universal structural property, its specific target is
uniquely determined by the random initial weights.


4 Representation Contraction at Initialization

In this section, we elucidate the underlying mechanism leading to the observed extreme token
preference. Tracing back the next-token prediction process, the logits are determined by
the product of the last token representation and the unembedding matrix as in Equation
(1). Since the unembedding matrix (row-wise) is randomly initialized with independent and
identically distributed Gaussian vectors that are approximately orthogonal and of similar
norms, a reasonable suspect responsible for the observed extreme preference is the set of
last-token representations.
To uncover the abnormalities in the last token representations, we conduct an empirical
study using a RoPE-enhanced Nano GPT-2 model. Specifically, we feed the model with
2,000 sequences, each consisting of 1,024 randomly sampled tokens. For each input sequence,
we extract the representations of the last token from the output of every transformer block.
Then, layer by layer, we compute the pairwise cosine similarities among the last-token
representations from different sequences. This metric measures the representational similarity
of the last token across sequences. The results are illustrated in Figure 2, which characterizes
the evolution of the last token representations through the network layers. The line chart in
Figure 2 reveals that the last-token representations of different random input sequences exhibit
substantial cosine similarity, indicating a clear directional contraction in the representation
space across sequences. We term this phenomenon, in which the last-token representations
of different sequences become increasingly similar, inter-sequence contraction. As the
model deepens, this contraction becomes more pronounced, as evidenced by the increasing
trend in the average pairwise cosine similarity of the last token.
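The contraction metric used here is straightforward to compute. The following sketch, with hypothetical function names and toy vectors, averages the cosine similarity over all distinct pairs of last-token representations.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs (self-pairs excluded)."""
    sims = [cosine(u, v) for u, v in combinations(vectors, 2)]
    return sum(sims) / len(sims)

# Vectors clustered around one direction score close to 1; orthogonal vectors score 0.
contracted = mean_pairwise_cosine([[1.0, 0.1], [1.0, -0.1], [1.0, 0.0]])
orthogonal = mean_pairwise_cosine([[1.0, 0.0], [0.0, 1.0]])
```

Applied to the last-token representations after each block, a value drifting from near 0 toward 1 is what the text calls inter-sequence contraction.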

Finding 2.1: The last token representations from a random transformer exhibit severe
inter-sequence contraction.

To further evaluate the relationship between the contracted direction and the model’s token
prediction preference, we revisit the logit formulation in Equation (1). First, we define the
representation contraction direction as the average of the normalized final-layer representations
at the last token position, $\mathrm{Norm}_{\mathrm{final}}(\mathbf{X}^{(L)}_{\mathrm{last}})$.
Then, we treat this direction as the final hidden state and project it via $\mathbf{W}_U$ to
obtain the corresponding logits. We are interested in whether the token with the maximum
logit (most aligned with the representation contraction direction) is identical to the
observed most frequently predicted next token presented in Section 3.

Finding 2.2: The inter-sequence representation contraction direction aligns with the
favorite token.

Specifically, for each of the 100 randomly initialized models, we feed 2,000 sequences of length
1,024 consisting of random tokens. The results demonstrate a 76% overlap between the token
most aligned with the contraction direction and the most frequently predicted token. This
indicates a significant alignment between the contracted direction in the representation
space and the embedding of the most likely token.
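The comparison just described can be sketched as follows. The inputs are assumed to be last-token representations that have already passed through the final normalization, and `W_U` is a toy d × |V| matrix; the function names and all numbers are hypothetical and chosen only to make the example easy to follow.

```python
def contraction_direction(normed_last_token_reps):
    """Average of the (already normalized) final-layer last-token representations."""
    n = len(normed_last_token_reps)
    d = len(normed_last_token_reps[0])
    return [sum(rep[k] for rep in normed_last_token_reps) / n for k in range(d)]

def favorite_token(direction, W_U):
    """Index of the maximum logit when the direction is projected through W_U (d x |V|)."""
    vocab_size = len(W_U[0])
    logits = [sum(direction[k] * W_U[k][j] for k in range(len(direction)))
              for j in range(vocab_size)]
    return max(range(vocab_size), key=logits.__getitem__)

# Three representations contracted around the first coordinate axis.
reps = [[2.0, 0.1, 0.0], [1.8, -0.2, 0.1], [2.1, 0.0, -0.1]]
# Columns are token vectors: token 2 points along the first axis, token 3 against it.
W_U = [[0.0, 0.0, 1.0, -1.0],
       [1.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0]]
direction = contraction_direction(reps)
favorite = favorite_token(direction, W_U)
```

In the experiment above, this maximum-logit token would then be compared with the token that the model predicts most often across the random probe sequences.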

Figure 2: Pairwise cosine similarity of last-token representations between different sequences
and its evolution with increasing transformer blocks.

Combining Findings 2.1 and 2.2, we can draw the following conclusion.

Finding 2: The extreme next-token prediction preference stems from the inter-sequence
representation contraction.

In the next section, we dive deeper into the transformer model architecture to expose the
root cause and underlying mechanism of the representation contraction.

4.1 Architectural Ablation

A transformer consists of an interweaving of self-attention and MLP modules. To attribute
the source of the representation contraction, we analyze the self-attention and MLP modules
in isolation. For simplicity, we omit the embedding layer and directly feed random Gaussian
vectors into the transformer to simulate the embeddings of random sequences. Crucially,
these input vectors are sampled with the same mean and standard deviation as the
initialization of the omitted embedding layer. This simplification prevents token
overlap from random sampling that would complicate the similarity analysis. Unless explicitly
stated otherwise, all subsequent experiments utilize this Gaussian input setting.
We evaluate three architectural configurations:

• full transformer model (self-attention + MLP);

• model with only self-attention blocks (self-attention-only);

• model with only MLP blocks (MLP-only).

Figure 3: Next-token preference induced by self-attention and MLP modules separately
and combined. The self-attention-only model (orange) is flat, aligning with the
empirical random baseline, while the MLP-only (green) and full (blue) models
show strong preference.

Each variant consists of 12 layers and is probed using random sequences of Gaussian vectors to
compare their internal processing dynamics. First, we characterize the next-token preference
for each variant following the procedure detailed in Section 3. The resulting distributions
are shown in Figure 3. Second, we quantify the inter-sequence representation contraction by
computing the pairwise cosine similarities between the last-layer last-token representations
(extracted from the final LayerNorm) across all input sequences. The average value and
statistical significance are reported in Table 2.

Finding 3.1: The extreme token preference and the inter-sequence representation
contraction are due to MLP modules.

Finding 3.2: While attention modules alone introduce neither extreme token preference
nor inter-sequence representation contraction, their pairing with MLP modules boosts
the severity of both abnormalities.

As can be seen in Figure 3, the self-attention-only model (orange) exhibits a flat histogram
that closely aligns with the empirical random baseline, indicating no clear token preference.

Table 2: Architectural ablation for extreme next-token preference by comparing the average pairwise
cosine similarity of the last-token representations across different sequences and their
statistical significance. † The p-value is reported as 0.00 due to numerical underflow in
floating-point precision. Refer to Section B.2 for the detailed calculation protocol.

Model Mode             Avg. Sim         p-value

full model             0.46             0.00†
self-attention-only    1.11 × 10^-4     0.16
MLP-only               0.25             0.00†

The MLP-only model (green) produces a clear spiky preference for specific outlier tokens.
The full model (blue) produces the most extreme preference outliers, far exceeding the other
two. Similar interpretations can also be made from Table 2, where the self-attention-only
model maintains near-orthogonal representations, whereas the MLP-only model exhibits
significant contraction, with the full model displaying the most severe contraction.
These results suggest that the MLP module is the primary driver of this initial representation
contraction. Furthermore, the interaction between the self-attention and MLP modules
creates a significantly more pronounced contraction effect, which in turn causes the strong
next-token preference seen in the full model. In the following sections, we conduct deeper
investigations to uncover the underlying mechanism.

4.2 MLP-Induced Contraction of Inter-Sequence Representations

Our ablation study on the model architecture reveals that MLP blocks alone are sufficient
to induce severe token preference bias and representation contraction. In this section, we
investigate how such bias originates and propagates through the network.
We first zoom in on an MLP-only architecture and analyze its layer-wise behavior to pinpoint
which components within the MLP are responsible for this contraction. Following the same
pipeline as before, we generate $N = 2{,}000$ distinct input sequences sampled from a Gaussian
distribution. These sequences are processed through a stack of $L = 12$ randomly initialized
MLP blocks, following the standard GPT-style initialization (details in Appendix B.1).
After each block, we compute the average and standard deviation of the pairwise cosine
similarity across all N last-token representations to track the evolution of inter-sequence
contraction. Figure 4 shows that the average of pairwise cosine similarity steadily increases
as the representations pass through successive MLP blocks.
To understand this phenomenon, we examine the internal structure of the MLP block. In the
MLP-only architecture, each MLP block consists of the pre-MLP LayerNorm, the two-layer
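perceptron with GELU activation, and the subsequent residual connection. Specifically,

$$\mathbf{Y} = \mathrm{MLP}\big(\mathrm{Norm}(\mathbf{X})\big) \oplus \mathbf{X},$$

$$\mathrm{MLP}(\mathbf{X}) = \mathrm{GELU}(\mathbf{X}\mathbf{W}_{\mathrm{up}})\,\mathbf{W}_{\mathrm{down}}.$$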


Figure 4: Average and standard deviation of the pairwise cosine similarity between the
last-token representations of different sequences, measured after each successive
MLP block.

Since randomly initialized linear projections are unlikely to introduce significant representation
contraction, we hypothesize that the contraction is caused by the asymmetric activation
function in the MLP block.

Finding 4: Asymmetric nonlinear activation in MLP can cause inter-sequence
representation contraction.

To isolate the role of nonlinear activation from the effects of normalization and residual
connections, we first consider a simplified MLP setting with both components removed.

Definition 1 (MLP$_0$ block) The simplified MLP block, denoted as MLP$_0$, processes the
input through a single feedforward pass without residual paths or normalization. The output
of the MLP$_0$ block is $\mathbf{Y} = \phi(\mathbf{X}\mathbf{W}_{\mathrm{up}})\,\mathbf{W}_{\mathrm{down}}$,
where $\phi$ denotes the activation function.

For convenience, we mainly consider two choices for ϕ: (1) ReLU, serving as an asymmetric
nonlinear baseline; and (2) tanh, serving as a symmetric nonlinear baseline. For each of
these MLP0 block variants, as with the standard model, we calculate the pairwise cosine
similarities between the last-token representations and track their progression as we add
more blocks.⁴ Figure 5 summarizes the results. For comparison, the standard MLP block is
also included.

4. Crucially, to ensure a controlled comparison, both baselines are initialized with identical random weights.

Figure 5: Average pairwise cosine similarity analysis. We compare the asymmetric ReLU
(blue) against the symmetric tanh (red) in a simplified setting without LayerNorm
or residuals to isolate the effect of activation symmetry. The full ReLU based
MLP block with residual connections and LayerNorm (green) is included for
reference, demonstrating that the contraction phenomenon persists in the standard
architecture but is moderated by the residual mechanism.

A stark contrast is observed in Figure 5. While the symmetric tanh baseline (red dashed line)
maintains an average pairwise cosine similarity of approximately zero, the asymmetric ReLU
baseline (blue line) exhibits a rapid and significant increase in similarity.⁵ The standard
MLP-only model (green line), which incorporates ReLU activation alongside residual connections
and LayerNorm, also demonstrates a steady increase in similarity, albeit at a more moderate
rate compared to the standalone MLP0. This difference is attributable to the residual
connections, which preserve input diversity and dampen the rapid contraction observed in the
residual-free setting. Together, these empirical results support our hypothesis that activation
asymmetry is the primary driver of inter-sequence representation contraction.
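The ReLU versus tanh comparison can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the experimental code: the widths, depth, sample count, and the 1/fan-in Gaussian weight scaling are all assumptions chosen for speed, and each "sequence" is reduced to a single Gaussian vector. Both activations reuse the same seed, so they see identical weights, in the spirit of footnote 4.

```python
import numpy as np

def mean_pairwise_cosine(reps):
    """Average cosine similarity over all distinct pairs of rows."""
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    gram = unit @ unit.T
    n = len(reps)
    return float((gram.sum() - np.trace(gram)) / (n * (n - 1)))

def contraction_curve(activation, depth=6, n=200, d=256, hidden=1024, seed=0):
    """Mean pairwise cosine after each residual-free, norm-free block.

    All sizes and the 1/fan-in weight scale are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))            # one Gaussian vector per input
    curve = []
    for _ in range(depth):
        w_up = rng.standard_normal((d, hidden)) / np.sqrt(d)
        w_down = rng.standard_normal((hidden, d)) / np.sqrt(hidden)
        x = activation(x @ w_up) @ w_down      # Y = phi(X W_up) W_down
        curve.append(mean_pairwise_cosine(x))
    return curve

# Same seed for both runs, so both activations share identical weights.
relu_curve = contraction_curve(lambda z: np.maximum(z, 0.0))
tanh_curve = contraction_curve(np.tanh)
print("ReLU:", np.round(relu_curve, 3))
print("tanh:", np.round(tanh_curve, 3))
```

Under these assumptions the ReLU curve should start near 1/π and climb with depth, while the tanh curve should stay close to zero.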
To ground these observations, we provide theoretical evidence demonstrating that asymmetric
activation functions such as ReLU provably induce extreme inter-sequence representation
contraction in the MLP0 setting.

Proposition 2 Let $X_1$ and $X_2$ be two independent Gaussian vectors with mean zero and
covariance $\sigma^2 I_d$. Denote $Z_l^{\mathrm{ReLU}}(X) = \mathrm{MLP}_0^{(l)} \circ \cdots \circ \mathrm{MLP}_0^{(1)}(X)$ as the output after $l$ layers
of independent MLP mappings with ReLU activation. Then, the expected cosine similarity,
$\bar\rho_l^{\mathrm{ReLU}} = \mathbb{E}\big[\rho\big(Z_l^{\mathrm{ReLU}}(X_1), Z_l^{\mathrm{ReLU}}(X_2)\big)\big]$, satisfies:

• $\bar\rho_1^{\mathrm{ReLU}} = 1/\pi$;

• $\bar\rho_l^{\mathrm{ReLU}}$ is monotonically increasing with $l$;

• $\lim_{l\to\infty} \bar\rho_l^{\mathrm{ReLU}} = 1$.

In contrast, if ReLU is substituted by tanh, $\mathbb{E}(\bar\rho_l^{\tanh}) = 0$ for any $l$.

5. The drop in deeper layers for the ReLU model may be attributed to the representations exponentially
collapsing towards the zero vector due to vanishing variance, a known phenomenon in deep networks that
lack LayerNorm and residual connections (Glorot and Bengio, 2010; He et al., 2015).

Proposition 2 establishes that asymmetric activations such as ReLU inherently drive inter-
sequence similarity toward total collapse (ρ = 1), whereas symmetric activations such as
tanh maintain orthogonality. The formal proof, where the mathematical role of symmetry is
central, is provided in Section A.1.
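The three properties in Proposition 2 can be reproduced numerically with a closed-form recursion. The sketch below assumes the wide-layer limit and uses the degree-1 arc-cosine kernel identity for ReLU applied to jointly Gaussian pre-activations. This identity is our choice for illustration and is not necessarily the route taken by the proof in Section A.1; the function names are hypothetical.

```python
import math

def relu_cosine_map(rho):
    """Expected cosine after one ReLU layer, given input cosine rho.

    Uses the degree-1 arc-cosine kernel (wide-layer limit, an assumption).
    """
    rho = max(-1.0, min(1.0, rho))
    return (math.sqrt(1.0 - rho * rho) + rho * (math.pi - math.acos(rho))) / math.pi

def relu_similarity_by_depth(num_layers):
    """Iterate the map starting from independent inputs (expected cosine 0)."""
    rho, out = 0.0, []
    for _ in range(num_layers):
        rho = relu_cosine_map(rho)
        out.append(rho)
    return out

sims = relu_similarity_by_depth(50)
print([round(s, 3) for s in sims[:4]], "...", round(sims[-1], 3))
```

The first iterate equals 1/π and the sequence increases toward 1, although the approach to 1 is slow in this idealized recursion.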
While the analysis in Proposition 2 establishes the fundamental role of asymmetry in a
simplified setting, in practice, we find that LayerNorm also plays an important role by
regulating the degree of asymmetry engaged in the GELU activation.

Finding 4.1: (Interplay between GELU and LayerNorm) LayerNorm activates the
inherent asymmetry of GELU by increasing input variance, which in turn causes inter-
sequence representation contraction.

In practice, standard transformer MLP blocks typically adopt smooth asymmetric activation
functions such as GELU (Hendrycks and Gimpel, 2023; Devlin et al., 2019) or gated variants such as
SwiGLU (Shazeer, 2020; Bai et al., 2023). This introduces an additional layer of complexity:
the interplay between LayerNorm and asymmetric activation. Taking GELU as a representative
example, GELU(x) = xΦ(x), where Φ(x) is the cumulative distribution function (CDF) of
the standard normal distribution.
As visualized in Figure 6, although GELU is fundamentally asymmetric, this property becomes
effective only when the inputs span a sufficiently wide range of values. LayerNorm dictates
whether this asymmetry is “engaged” by regulating the input variance. With LayerNorm, the
inputs exhibit a relatively large standard deviation (e.g., σ ≈ 0.55), forcing them to cover the
asymmetric region of GELU. In contrast, without LayerNorm, the inputs remain concentrated
near the origin (e.g., σ ≈ 0.011), where GELU behaves approximately symmetrically.
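To make the role of input scale concrete, the sketch below estimates the expected cosine similarity between two independent, high-dimensional Gaussian inputs after an elementwise GELU. For i.i.d. coordinates this is approximately E[g]² / E[g²] with g = GELU(a), a ~ N(0, σ²), which is a high-dimensional approximation we assume here. The two σ values are the illustrative ones quoted above; the Monte Carlo sample size is arbitrary.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))

def expected_cosine_after_gelu(sigma, n=200_000, seed=0):
    """Monte Carlo estimate of E[g]^2 / E[g^2] for g = GELU(N(0, sigma^2)).

    This ratio approximates the cosine between two independent inputs in
    high dimension (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    g = gelu(rng.normal(0.0, sigma, size=n))
    return float(g.mean() ** 2 / np.mean(g ** 2))

for sigma in (0.011, 0.55):
    print(f"sigma={sigma}: expected cosine ~ {expected_cosine_after_gelu(sigma):.5f}")
```

With inputs concentrated near the origin the ratio is close to zero, because GELU(x) = x/2 + (x/2)·erf(x/√2) is dominated by its odd part x/2 there; with the larger σ the even part contributes a clearly positive shared mean.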

4.3 Interplay between MLP and Self-attention

Although our previous results show that self-attention-only transformers do not inherently
cause representation contraction (Figure 4), the contraction induced by the MLP is substan-
tially amplified by the attention component, thereby strengthening the token-preference bias.
In this section, we analyze the mechanism underlying this synergistic amplification.
First, we extend the experimental setting of Table 2 to a more fine-grained, layer-by-layer
setup to better evaluate the interplay between self-attention and MLP. We quantify the
inter-sequence representation contraction across layers by computing the pairwise cosine
similarity between last-token representations of different input sequences. Figure 7 illustrates

Figure 6: Interplay between GELU and LayerNorm. Inputs with LayerNorm (Norm) are
pushed into the asymmetric regime of GELU, whereas inputs without LayerNorm
(NoNorm) stay near zero where GELU is approximately symmetric.

the layer-wise evolution of this similarity across the full, MLP-only, and self-attention-only
models.

Finding 3.2a: A self-attention layer can amplify the inter-sequence representation
contraction initiated by the MLP, even more so than having an extra MLP layer.

Figure 7 reveals two key divergence points in the layer-wise evolution of representation
contraction:

• Initial divergence (Layer 1): A clear separation emerges between the self-attention-
only model and the other two configurations. While the full and MLP-only models
show an immediate increase in inter-sequence cosine similarity, the self-attention-only
model remains nearly flat throughout, confirming that self-attention alone does not
initiate contraction.
• Amplification divergence (Layer 2): A second divergence point emerges between
the full and MLP-only models. Starting from layer 2, the full model—which alternates
between attention and MLP sublayers—exhibits substantially stronger contraction
than the MLP-only model. This gap widens progressively with depth, supporting the
hypothesis that self-attention acts as a powerful amplifier of the contraction effect
initially seeded by the MLP blocks.

To explicitly test the amplification hypothesis, we isolate the interaction between modules
by comparing minimal two-block configurations: an MLP-MLP structure versus an MLP-


Figure 7: Pairwise cosine similarity of the last-token representation across different sequences,
measured over the output of each block. The self-attention-only model (blue)
remains perfectly orthogonal. The MLP-only model (green) shows a steady increase
in similarity. The full GPT model (red) shows a faster and more severe contraction.

attention structure. For analytical tractability, we retain the simplified ReLU MLP0 block
without residual connections and LayerNorm (Definition 1). We further introduce a simplified
version of self-attention as follows.

Definition 3 (Attn0 block) Define the Attn0 block as a simplified attention module that
only outputs the average of previous input vectors:
$$\mathrm{Attn}_0(x_1, \cdots, x_T) = \frac{1}{T} \sum_{i=1}^{T} x_i.$$

This definition distills the standard self-attention mechanism into two core assumptions:
attention allocation and value aggregation. Definition 3 can be thought of as making the
following simplification assumptions.

• Uniform attention weights: We assume attention weights to be uniform, i.e.,
attending equally to all previous tokens. This assumption is reasonable under random
initialization, since the query-key dot products are centered at zero with minimal
variance due to the small initialization scale. As a result, the attention weights after
the softmax function tend to be close to uniform.

• Identity Wv and Wo: We assume the value and output projection matrices are
identity matrices. This is also reasonable since at initialization, the value matrix Wv
and the output projection matrix Wo are both Gaussian random matrices, which do
not inherently introduce representation contraction.

Table 3: Analysis of the amplification effect by comparing the average pairwise cosine
similarity of the last-token representations across different sequences and their
statistical significance. † The p-value is reported as 0.00 due to numerical underflow
in floating-point precision.

Experiment setting     Avg. Sim   p-value
h = MLP0(x)            0.31       0.00†
yMLP = MLP0(h)         0.49       0.00†
yattn = Attn0(h)       0.98       0.00†
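The uniform-attention assumption can be checked with a small experiment. The sketch below is illustrative only: the initialization standard deviation of 0.02 (a GPT-2-style value), the single head, and the dimensions are assumptions, not settings taken from Appendix B.1.

```python
import numpy as np

def causal_attention_weights(T=128, d=64, init_std=0.02, seed=0):
    """Softmax attention weights of one randomly initialized causal head.

    init_std, T and d are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((T, d))                    # Gaussian token inputs
    w_q = rng.normal(0.0, init_std, size=(d, d))
    w_k = rng.normal(0.0, init_std, size=(d, d))
    scores = (x @ w_q) @ (x @ w_k).T / np.sqrt(d)      # scaled dot products
    mask = np.tril(np.ones((T, T), dtype=bool))        # attend to previous tokens only
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)        # numerically stable softmax
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

w = causal_attention_weights()
last = w[-1]                                           # weights used by the last token
deviation = float(np.abs(last * len(last) - 1.0).max())
print("max relative deviation from uniform:", round(deviation, 4))
```

Because the query-key logits are tiny at this scale, every weight of the last token should stay within a modest relative margin of 1/T.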

Under the simplified attention setting, we compare the following two-block configurations,
which share a “pre-contracted” first MLP0 layer h = MLP0(x).

• Attn0 ◦ MLP0 (x): h is further processed by a simplified Attn0 layer with output
yattn = Attn0 (h).

• MLP0 ◦ MLP0 (x): h is further processed by a second MLP0 layer with output
yMLP = MLP0 (h).

In the experimental setup described above, a common first-layer MLP0 is employed to induce
an initial, shared representation contraction. If self-attention acts as an amplifier of this
effect, the Attn0 ◦ MLP0 (x) configuration should exhibit a stronger contraction than the
MLP0 ◦ MLP0 (x) counterpart. To quantify the contraction, we follow similar procedures
as in Table 2 and calculate the pairwise cosine similarities between the last-layer last-token
representations across all input sequences.
Results are summarized in Table 3, which confirms this hypothesized trend. Starting from
an initial contraction of 0.31 after the first MLP0 layer, the Attn0 module amplifies the
inter-sequence similarity to 0.98, far exceeding the contraction achieved by the MLP-only
baseline (0.49).
The simplifications introduced in Definition 1 and Definition 3 allow for a rigorous theoretical
analysis of this amplification effect. We characterize this relationship in the following
proposition:

Proposition 4 (self-attention as a contraction amplifier) Let the sequence length be
$T$. The expected inter-sequence cosine similarity $\bar\rho$ for the $\mathrm{MLP}_0 \circ \mathrm{MLP}_0$ case is approximately
0.49, while that for the $\mathrm{Attn}_0 \circ \mathrm{MLP}_0$ case is $\frac{T}{T+\pi-1}$, which converges to 1 as $T$ increases.

The proof is provided in Appendix A.2. Proposition 4 illustrates that if the inputs are already
partially contracted (e.g., due to the asymmetric activations in the MLP0), the attention’s
value aggregation process reinforces the shared directional bias while averaging out the unique,
uncorrelated components. Consequently, the output representations will be significantly more
contracted (i.e., exhibit higher inter-sequence cosine similarity). This analytical derivation
aligns remarkably well with the empirical results in Table 3: the MLP0 ◦ MLP0 configuration
yields $\bar\rho \approx 0.49$, while the Attn0 ◦ MLP0 configuration gives $\bar\rho = \frac{T}{T+\pi-1} \approx 0.98$ when
evaluated at sequence length T = 128.
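A small simulation in the spirit of Table 3 illustrates the amplification. It is a sketch under stated assumptions: ReLU MLP0 blocks with 1/fan-in Gaussian weights, Attn0 as a uniform average over all T tokens for the last position, and hypothetical sizes chosen for speed. The exact numbers will differ from Table 3 because of finite width and sample size.

```python
import numpy as np

def mean_pairwise_cosine(reps):
    """Average cosine similarity over all distinct pairs of rows."""
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    gram = unit @ unit.T
    n = len(reps)
    return float((gram.sum() - np.trace(gram)) / (n * (n - 1)))

def relu_mlp0(x, rng, hidden):
    """Residual-free, norm-free ReLU block with fresh random weights."""
    d = x.shape[-1]
    w_up = rng.standard_normal((d, hidden)) / np.sqrt(d)
    w_down = rng.standard_normal((hidden, d)) / np.sqrt(hidden)
    return np.maximum(x @ w_up, 0.0) @ w_down

def amplification_experiment(n=32, T=128, d=256, hidden=512, seed=0):
    """Compare MLP0 -> MLP0 with MLP0 -> Attn0 on the last token (sizes assumed)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, T, d))          # n sequences of T Gaussian tokens
    h = relu_mlp0(x, rng, hidden)               # shared first layer, applied per token
    y_mlp = relu_mlp0(h[:, -1], rng, hidden)    # second MLP0 on the last token
    y_attn = h.mean(axis=1)                     # Attn0: uniform average over T tokens
    return {
        "h": mean_pairwise_cosine(h[:, -1]),
        "mlp": mean_pairwise_cosine(y_mlp),
        "attn": mean_pairwise_cosine(y_attn),
        "attn_theory": T / (T + np.pi - 1.0),
    }

res = amplification_experiment()
print({k: round(v, 3) for k, v in res.items()})
```

Averaging T partially aligned vectors keeps their shared component while shrinking the uncorrelated parts by roughly a factor of T in squared norm, which is why the Attn0 branch ends up far more contracted than a second MLP0.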

4.4 Attention-Induced Intra-Sequence Representation Contraction

Unlike MLP, which applies an identical operation to every position, the self-attention
mechanism inherently differentiates between positions by aggregating information across
the entire sequence. To understand how this mechanism amplifies global, inter-sequence
contraction, we must zoom in on its effect on token representations within a single sequence.
We demonstrate that self-attention causes the hidden representations of distinct tokens to
become increasingly similar, a phenomenon we term intra-sequence contraction. This
effect provides a mechanistic explanation for the amplification of inter-sequence contraction:
when representations across sequences are already partially aligned toward a common direction,
seeded by the MLP blocks, self-attention further reinforces this shared direction inside each
sequence through local aggregation, thereby amplifying inter-sequence contraction.
To empirically characterize this intra-sequence contraction effect, we sample a single sequence
of isotropic Gaussian random vectors (T = 512 tokens) and feed it through L = 12 successive
standard self-attention blocks. After each block, we compute the pairwise cosine similarity
among token hidden representations within the sequence. We repeat this procedure for
N = 2000 independent sequences and report the average and standard deviation. We further
examine how intra-sequence contraction varies as a function of sequence length by evaluating
shorter sequences (T = 16). The results are shown in Figure 8.

Finding 5: The self-attention mechanism can induce intra-sequence representation
contraction, pushing hidden representations of different tokens within a single sequence toward
a shared direction. The more self-attention layers, the more severe the contraction.

The solid curves in Figure 8 reveal two clear trends. First, the intra-sequence cosine similarity
increases more rapidly for shorter sequences, indicating that representation contraction is
more severe for smaller sequence lengths T . Second, for both sequence lengths, the similarity
grows monotonically with layer depth and converges toward 1. This demonstrates that
token hidden representations progressively contract toward a common direction as more
self-attention layers are stacked, even when the initial inputs are drawn from an isotropic
Gaussian distribution.
For a formal theoretical analysis, we leverage the simplified Attn0 block defined in Definition 3.
We first empirically verify that this simplified model properly preserves the behavior of
standard self-attention. As illustrated by the dashed curves in Figure 8, Attn0 trajectories
closely match the solid curves corresponding to the full attention module for both evaluated
sequence lengths. This strong agreement validates our simplification and suggests that, under
Gaussian initialization, self-attention effectively behaves as a uniform aggregation operator.

Figure 8: Comparison of intra-sequence cosine similarity between the standard self-attention
(solid lines) and the simplified Attn0 block (dashed lines). Red and blue colors
correspond to sequence lengths T = 16 and T = 512, respectively. The color-shaded
regions indicate standard deviations of the average intra-sequence cosine similarity
computed over 2,000 sequences. The close overlap between the simplified and
standard models justifies the use of Attn0 for analyzing the contraction mechanism.

Using this validated framework, we next theoretically characterize how sequence length T
and layer depth L influence the expected intra-sequence cosine similarity.

Proposition 5 Consider the simplified Attn0 block setting. Denote $T$ as the sequence length.
After $L$ layers of Attn0, the expected intra-sequence cosine similarity $\bar\rho'$ can be characterized
by
$$\mathbb{E}\big[\bar\rho'(T, L)\big] \approx 1 - \frac{1}{L^2}\Big(1 - \frac{1}{T}\Big),$$
which becomes accurate for $L \ge 3$.

The formal proof is provided in Section A.3. Proposition 5 establishes that for a fixed
sequence length T , the expected intra-sequence cosine similarity increases monotonically with
the number of attention layers L and converges to 1 as L → ∞. In other words, self-attention
layers will progressively contract token representations within the same sequence toward a
common direction. The sequence length T influences the convergence rate via the factor
1 − 1/T: larger T leads to weaker contraction, while smaller T yields a stronger effect. This
theoretical result aligns with the empirical trends observed in Figure 8.
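The qualitative trends can also be examined without sampling. The sketch below reads Attn0 causally, so that position i averages positions 1 through i (our interpretation of "previous input vectors"), and works in the large-width limit, where inner products between tokens concentrate around those of the mixing weights. It does not evaluate the closed form of Proposition 5; it only tracks the mean pairwise cosine across depth.

```python
import numpy as np

def causal_mean_operator(T):
    """Row i averages positions 1..i uniformly (causal reading of Attn0)."""
    return np.tril(np.ones((T, T))) / np.arange(1, T + 1)[:, None]

def intra_sequence_similarity(T, num_layers):
    """Mean pairwise cosine between token positions after each Attn0 layer.

    Assumes i.i.d. isotropic inputs and the large-width limit, so the Gram
    matrix of the tokens is proportional to mix @ mix.T.
    """
    a = causal_mean_operator(T)
    mix = np.eye(T)
    off_diag = ~np.eye(T, dtype=bool)
    curve = []
    for _ in range(num_layers):
        mix = a @ mix                       # token i = sum_j mix[i, j] * x_j
        gram = mix @ mix.T
        norms = np.sqrt(np.diag(gram))
        cos = gram / np.outer(norms, norms)
        curve.append(float(cos[off_diag].mean()))
    return curve

short = intra_sequence_similarity(T=16, num_layers=12)
long_ = intra_sequence_similarity(T=128, num_layers=12)
print("T=16 :", np.round(short[:4], 3), "...", round(short[-1], 4))
print("T=128:", np.round(long_[:4], 3), "...", round(long_[-1], 4))
```

In this idealized setting the similarity rises with depth for both lengths and is higher for the shorter sequence at a fixed depth.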
While our previous experiments establish that self-attention drives representations toward a
common direction within a single sequence, this same aggregation process inevitably introduces a
structural positional discrepancy. Because the mechanism functions analogously to a uniform
aggregation operator at initialization, it creates a position-dependent variance decay in the
representations. This positional variance discrepancy substantiates our next finding.

Figure 9: Verification of positional variance decay in a randomly initialized self-attention
block (T = 32). The red dashed line represents the empirical standard deviation of
the output representation at each token position, while the blue dashed line denotes
the theoretical decay curve ($\propto 1/\sqrt{i}$) derived from the uniform attention assumption.
The near-perfect overlap confirms that the variance of the output representations decays
as the token position increases.

Finding 6: The self-attention mechanism induces a systematic token disparity within
each sequence. Specifically, the first token maintains a unique statistical profile, acting
as a structural outlier relative to all subsequent positions.

To formalize this, we approximate the value vectors $v_j$ at initialization as zero-mean,
uncorrelated variables with identical variance $\sigma^2$. Under the uniform aggregation assumption, the
output representation at position $i$, denoted as $o_i$, satisfies
$$o_i \approx \frac{1}{i}\sum_{j=1}^{i} v_j \;\Longrightarrow\; \mathrm{Var}(o_i) \approx \frac{1}{i^2}\sum_{j=1}^{i}\mathrm{Var}(v_j) = \frac{\sigma^2}{i}. \tag{2}$$

The variance additivity here stems from the zero cross-position covariance at initialization.
Consequently, the first token (i = 1) preserves the full initial variance σ 2 , while subsequent
tokens (i > 1) experience a rapid 1/i variance shrinkage.
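The 1/i variance decay in Eq. (2) is easy to check by simulation. The sketch below replaces the attention block with exact causal uniform averaging of i.i.d. Gaussian value vectors, which is the idealization behind Eq. (2); the width, the number of sequences, and σ = 1 are illustrative assumptions.

```python
import numpy as np

def positional_std(T=32, d=64, n_seq=500, sigma=1.0, seed=0):
    """Empirical vs. theoretical std of o_i under causal uniform averaging."""
    rng = np.random.default_rng(seed)
    v = rng.normal(0.0, sigma, size=(n_seq, T, d))         # value vectors v_j
    positions = np.arange(1, T + 1)
    o = np.cumsum(v, axis=1) / positions[None, :, None]    # o_i = (1/i) * sum_{j<=i} v_j
    empirical = o.std(axis=(0, 2))                         # pooled over sequences and dims
    theory = sigma / np.sqrt(positions)                    # sigma / sqrt(i)
    return empirical, theory

empirical, theory = positional_std()
print(np.round(empirical[:4], 3), np.round(theory[:4], 3))
```

The first position keeps the full standard deviation σ, while later positions shrink as σ/√i.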

Figure 10: Extreme token preference occurs for both the maximum-logit tokens (top, red)
and the maximum-hidden-representation dimensions in the final layer (bottom,
blue) across random inputs. The dashed line denotes the expected frequency
under a uniform token distribution. The arrows in the top panel indicate a broken
x-axis that omits the low-frequency tail ranks.

To validate this phenomenon, we feed a sequence of T = 32 Gaussian vectors into a randomly
initialized self-attention block. We then extract the output o following the value aggregation
and calculate the standard deviation of the representation at each position. As illustrated in
Figure 9, the empirical results (red dashed line) nearly perfectly overlap with the theoretical
$1/\sqrt{i}$ decay curve (blue dashed line).
This inherent peculiarity of the first token is remarkably reminiscent of the attention sink
phenomenon observed in LLMs (Xiao et al., 2023), where pretrained models assign dispro-
portionately high attention scores to initial tokens (even if semantically irrelevant) to offload
probability mass. Our results suggest that this behavior is not merely learned from data,
but is rooted in a fundamental positional bias present at “birth”. This heterogeneity may
even effectively compete with explicit positional schemes like RoPE (Su et al., 2024). In
Section 6.2, we leverage this insight to mitigate the initialization-induced variance discrepancy
and establish its causal link to the attention sink.

5 Persistence of Seed-specific Structural Biases After Training

In previous sections, we have investigated randomly initialized transformers and uncovered
the underlying mechanism leading to the systematic biases tied to the random initialization.
A natural question arises: to what extent do these initial structural properties survive the
LLM training process? We demonstrate that while a model’s specific knowledge evolves
through data-driven learning, the underlying structural biases inherited from initialization
remain detectable, forming a persistent and idiosyncratic model identity.


To investigate this persistence with minimal interference from the training objective, we shift
our focus from the final next-token predictions to the internal representations of the final
hidden layer. This one-layer rollback helps mitigate the direct influence of training objectives—
since the model is optimized for token prediction rather than hidden representations—and
thus provides a cleaner view of the initialization-induced bias. This shift in focus is motivated
by our earlier findings in Section 4.4, where we observed that representation contractions
also arise in the later hidden layers. We therefore expect that extreme bias patterns should
also appear in internal representations. In Figure 10, we plot the frequency with which each
token is predicted (upper) and the frequency with which each hidden dimension attains
the maximum value at the final layer (lower) over 10,000 random input sequences. The
pronounced non-uniform shape in the lower plot indicates that similar extreme bias patterns
are also present in the hidden representations of the final layer.

Finding 1 (Extended): Across diverse random sequences, randomly initialized
transformers consistently exhibit an abnormal preference for which hidden dimensions to
assign the maximum value in each layer.

We next show that a model trained from a given initialization preserves a weak but
non-negligible correlated bias pattern. Finding 2.2 reveals an alignment between the model’s
predictive preference and the contracted directions reflected in the averaged outputs over
random inputs; that is, for some position $j$, the initialized model $f$ exhibits an unusually large
expected response $\mathbb{E}_{x\sim\mathcal{D}_{\mathrm{rand}}}[f(x)_j]$ compared to a uniform baseline. We can thus interpret
the outputs of a trained model $f'$ through a simple decomposition on an $N \times d_{\mathrm{out}}$ response
matrix: for each dimension $j$,
$$f'(x)_j = b_j(x) + \epsilon_j(x),$$
where $b_j(x)$ represents the initialization-induced bias and $\epsilon_j(x)$ denotes training-specific
variation. Therefore, if such bias $b_j(x)$ persists, we would expect to observe, on certain
dimensions $j$, that the output distributions of two models over random inputs satisfy
$\mathrm{Corr}_{x\sim\mathcal{D}_{\mathrm{rand}}}\big(f(x)_j, f'(x)_j\big) > 0$.
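One way to see why a persistent bias implies a positive correlation is the following short calculation. It rests on assumptions that are ours rather than the paper's: the initialized model is written as $f(x)_j = b_j(x) + \eta_j(x)$ with its own residual $\eta_j$ (a hypothetical symbol), and the residuals $\eta_j$, $\epsilon_j$ are uncorrelated with $b_j$ and with each other over $x \sim \mathcal{D}_{\mathrm{rand}}$.

```latex
\begin{aligned}
f(x)_j &= b_j(x) + \eta_j(x), \qquad f'(x)_j = b_j(x) + \epsilon_j(x), \\
\mathrm{Cov}\big(f(x)_j,\, f'(x)_j\big)
  &= \mathrm{Var}(b_j) + \mathrm{Cov}(b_j, \epsilon_j) + \mathrm{Cov}(\eta_j, b_j) + \mathrm{Cov}(\eta_j, \epsilon_j)
   = \mathrm{Var}(b_j), \\
\mathrm{Corr}\big(f(x)_j,\, f'(x)_j\big)
  &= \frac{\mathrm{Var}(b_j)}
          {\sqrt{\big(\mathrm{Var}(b_j) + \mathrm{Var}(\eta_j)\big)\big(\mathrm{Var}(b_j) + \mathrm{Var}(\epsilon_j)\big)}}
   \;>\; 0 \quad \text{whenever } \mathrm{Var}(b_j) > 0 .
\end{aligned}
```

Under these assumptions the correlation is positive but shrinks as the training-specific variance $\mathrm{Var}(\epsilon_j)$ grows, which is consistent with a weak yet systematic signal.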
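To test this, we focus on the most biased dimensions $\mathcal{M}$, i.e., the top-$m$ ranked dimensions
in the contraction direction $\bar f := \frac{1}{n}\sum_{i=1}^{n} f(x_i) \in \mathbb{R}^{d_{\mathrm{out}}}$. Define
$$\mathcal{M} = \arg\max_{J \subseteq \{1,\dots,d_{\mathrm{out}}\},\; |J| = m} \; \sum_{j \in J} \bar f_j .$$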

Following the setup of Table 1, we generate 10,000 random sequences of length 1024. Starting
from a randomly initialized model with seed 42, we continuously train this model on the
OpenWebText dataset (Gokaslan et al., 2019) for one epoch and evaluate intermediate
training checkpoints as target models. For each checkpoint, we compute the correlation
between the output distribution at each selected dimension and that of the initialized model.
To establish an uncorrelated baseline, we additionally compute the correlation between the
initialized model and models trained from a different initialization (seed 123). We report the
mean correlation and standard deviation over the top-m selected dimensions with m = 50.

Figure 11: Continuously trained models exhibit weakly correlated bias profiles across training
checkpoints, consistently shifted upward relative to the uncorrelated baseline.

Finding 7: The extreme next-token prediction preference persists during training, which
inspires novel LLM fingerprinting methods.

As illustrated in Figure 11, although the correlation between each checkpoint and the
corresponding initialized model is small in absolute value, it is systematically shifted upward
relative to the uncorrelated baseline. The two distributions exhibit overlap, but the persistent
upward shift supports the presence of initialization-induced bias throughout training. Such
persistent bias is therefore exploitable—in Section 6.1, we show that these bias profiles can
serve as statistically reliable and idiosyncratic model fingerprints.

6 Practical Implications

6.1 Seed-Specific Outliers as a Biological-Metaphor LLM Fingerprint

LLM fingerprinting aims to provide identifiers for trained models, serving as a foundation for
model attribution, lineage tracing, and misuse detection. However, existing fingerprinting
approaches (Pasquini et al., 2024; Xu et al., 2024; Yoon et al., 2025; Zhang et al., 2024; Zeng
et al., 2024; Zhang, 2025; Luan et al., 2025; Tsai et al., 2025; Alhazbi et al., 2025) primarily
rely on semantic behaviors or parameter similarities that emerge only after substantial
training, and therefore are usually unavailable at early pretraining stages. In this section,
we show that the observed seed-dependent token preferences can act as Biological-Metaphor
fingerprints for LLMs, SeedPrints, that are uniquely determined at models’ birth and
detectable at any time of the subsequent training.


6.1.1 Algorithm
Figure 11 reveals a key observation: bias profiles between models from the same lineage
are weakly correlated, and their correlation scores are distributed above the uncorrelated
baselines. However, three challenges remain: (1) the existing bias direction relies on access
to the initialization model and its top-m ranked dimensions, which are often unavailable
in realistic scenarios; (2) although the same-lineage correlations are distributionally higher,
the two distributions overlap, making statistics based solely on mean or standard deviation
less powerful; and (3) empirically constructing a reliable uncorrelated baseline involves a
non-trivial trade-off between computational cost and estimation accuracy. We next show
how our proposed method addresses these three issues.

Resolve (1): Extract identity dimensions between any two models. While the
top-m ranked dimensions can be directly obtained from the initialized model, such dimensions
are unavailable when only two trained models are provided (e.g., an early checkpoint and
a late checkpoint). To address this, we consider the high-preference dimensions shared by the
two models, with the expectation that if both models inherit bias from the same initialization,
their high-preference dimensions will exhibit non-trivial overlap and are more likely to align
with a shared bias direction. Conversely, if the two models show little or no overlap, they are likely to be
unrelated. We therefore extract dimensions that are jointly prominent in both models and
use this intersection as a proxy for the latent initialization-induced identity dimensions.
Formally, let $X = \{x_i\}_{i=1}^{n}$, where each $x_i \in \mathbb{R}^{\ell\times d}$ denotes a random sequence of length $\ell$,
obtained by independently sampling $\ell$ random vectors from a $d$-dimensional isotropic Gaussian
distribution. For any model $g$, define the mean output vector $\bar g := \frac{1}{n}\sum_{i=1}^{n} g(x_i) \in \mathbb{R}^{d_{\mathrm{out}}}$,
where $g$ is either the base model $f$ or the suspect model $f'$, and $d_{\mathrm{out}}$ denotes the output
dimensionality (vocabulary size for logits, or hidden size for the final hidden representation).⁶
For each model, we identify its high-preference dimensions as
$$\mathcal{M}_g = \arg\max_{J \subseteq \{1,\dots,d_{\mathrm{out}}\},\; |J| = m} \; \sum_{j \in J} \bar g_j .$$

Let $S := \mathcal{M}_f \cap \mathcal{M}_{f'} = \{s_1, \dots, s_{|S|}\}$ denote the intersection of the two sets. We refer to $S$
as the identity dimensions.
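As a reference point for what counts as non-trivial overlap, consider the idealized case in which the two top-$m$ sets are independent, uniformly random $m$-subsets of $\{1,\dots,d_{\mathrm{out}}\}$. This independence model is our assumption for illustration, and the value $d_{\mathrm{out}} = 768$ below is a hypothetical hidden size. By linearity of expectation:

```latex
\mathbb{E}\,|S|
  \;=\; \sum_{j=1}^{d_{\mathrm{out}}} \Pr\big[j \in \mathcal{M}_f\big]\,\Pr\big[j \in \mathcal{M}_{f'}\big]
  \;=\; d_{\mathrm{out}} \cdot \frac{m}{d_{\mathrm{out}}} \cdot \frac{m}{d_{\mathrm{out}}}
  \;=\; \frac{m^2}{d_{\mathrm{out}}},
\qquad \text{e.g. } m = 50,\; d_{\mathrm{out}} = 768 \;\Rightarrow\; \mathbb{E}\,|S| \approx 3.3 .
```

An intersection much larger than $m^2/d_{\mathrm{out}}$ is therefore already suggestive of shared structure, before any correlation is computed.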

Resolve (2): Hypothesis test on distribution of correlation statistics. As in
Section 5, we obtain |S| empirical Kendall–Tau correlations forming a correlation distribution.
We then perform a hypothesis test (e.g., a one-sided t-test or the Mann–Whitney U -test
in Section 6.1.2) to evaluate whether this distribution is significantly higher than a null
distribution of no association, constructed from an uncorrelated baseline. The distributional
test avoids relying on a fixed summary statistic (e.g., mean or standard deviation), which can
be unstable when the two distributions exhibit substantial overlap, and thus provides stronger
discrimination in such cases. If the null hypothesis is rejected, we deem f ′ to be derived from
f . We declare significance at p < 0.01; further details are provided in Algorithm 1.
6. Using the final hidden representation instead of logits avoids detokenization noise and is more robust to
the rare case that a random sequence appears in the training data.

Algorithm 1 Distribution Correlation Test on Identity Dimensions
Require: base model $f$, suspicious model $f'$; random inputs $X = \{x_i\}_{i=1}^{n}$; fingerprint size $m$;
significance level $\alpha$
▶ Step 1: Localize biased dimensions
1: Compute average outputs $\bar f$, $\bar f'$ over $X$
2: $\mathcal{M}(f), \mathcal{M}(f') \leftarrow$ top-$m$ dimensions of $\bar f$, $\bar f'$
3: $S \leftarrow \mathcal{M}(f) \cap \mathcal{M}(f')$ (identity dimensions)
▶ Step 2: Form correlation distribution
4: for each $s_j \in S$ do
5:     $\tau_j \leftarrow \mathrm{KendallTau}\big(\{f(x_i)_{s_j}\}_{i=1}^{n},\ \{f'(x_i)_{s_j}\}_{i=1}^{n}\big)$
6: end for
7: $\mathcal{T} \leftarrow \{\tau_j\}_{j=1}^{|\mathcal{T}|}$
▶ Step 3: Hypothesis test against null
8: Construct $\mathcal{T}_{\mathrm{null}}$ by applying the same pipeline to two Gaussian matrices $Y^{(1)}, Y^{(2)} \sim \mathcal{N}(0, I)^{N \times d_{\mathrm{out}}}$
9: Test $H_0: \mathcal{T} = \mathcal{T}_{\mathrm{null}}$ vs. $H_1: \mathcal{T} > \mathcal{T}_{\mathrm{null}}$
10: Return SameLineage $\leftarrow \mathbb{1}(p\text{-value} < \alpha)$
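The three steps of Algorithm 1 can be sketched in NumPy as follows. This is a hedged illustration, not the released implementation: the function names are hypothetical, Kendall's tau is computed as a simple O(n²) tau-a, and the one-sided Mann–Whitney U test uses a normal approximation without tie correction in place of a statistics-library routine. The demo data are synthetic stand-ins for model outputs.

```python
import math
import numpy as np

def top_m(outputs, m):
    """Indices of the m largest entries of the mean output vector."""
    return set(np.argsort(outputs.mean(axis=0))[-m:].tolist())

def kendall_tau(a, b):
    """Kendall tau-a via pairwise sign agreement (O(n^2), fine for small n)."""
    da = np.sign(a[:, None] - a[None, :])
    db = np.sign(b[:, None] - b[None, :])
    n = len(a)
    return float((da * db).sum() / (n * (n - 1)))

def correlation_distribution(out_a, out_b, m):
    """Steps 1-2: identity dimensions, then one tau per shared dimension."""
    shared = sorted(top_m(out_a, m) & top_m(out_b, m))
    return np.array([kendall_tau(out_a[:, j], out_b[:, j]) for j in shared])

def one_sided_u_test(sample, null):
    """Mann-Whitney U p-value for 'sample > null' (normal approximation)."""
    n1, n2 = len(sample), len(null)
    ranks = np.argsort(np.argsort(np.concatenate([sample, null]))) + 1
    u = ranks[:n1].sum() - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 0.5 * math.erfc(z / math.sqrt(2))

def same_lineage(out_f, out_g, m=64, alpha=0.01, seed=0):
    """Step 3: compare against a null built from two Gaussian matrices."""
    rng = np.random.default_rng(seed)
    taus = correlation_distribution(out_f, out_g, m)
    null = correlation_distribution(rng.standard_normal(out_f.shape),
                                    rng.standard_normal(out_f.shape), m)
    if len(taus) == 0 or len(null) == 0:
        return False, 1.0                      # no shared dimensions to test
    p = one_sided_u_test(taus, null)
    return p < alpha, p

# Synthetic demo: 'child' shares per-input variation with 'base', 'stranger' does not.
rng = np.random.default_rng(1)
bias = 2.0 * rng.standard_normal(128)
z = rng.standard_normal((200, 128))
base = bias + z
child = bias + 0.8 * z + 0.6 * rng.standard_normal((200, 128))
stranger = 2.0 * rng.standard_normal(128) + rng.standard_normal((200, 128))

child_flag, child_p = same_lineage(base, child)
stranger_flag, stranger_p = same_lineage(base, stranger)
print("child:", child_flag, child_p, "| stranger:", stranger_flag, stranger_p)
```

In practice, out_f and out_g would be the n × d_out matrices of logits or final hidden representations collected from the two models on the same random inputs.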

Resolve (3): Constructing a practical null distribution. A final challenge is how
to obtain a reliable null baseline without relying on costly empirical comparisons between
thousands of independently initialized models. Empirically, the correlation distribution
between independently initialized models closely follows a Gaussian centered around zero
(see Section B.3.1 for more details). Therefore, we construct the null by applying the
same dimension-selection and correlation computation pipeline to two independent Gaussian
matrices of the same size as the model outputs, as an approximation to unrelated models. This
provides a lightweight surrogate for two unrelated models while preserving the dependency
structure introduced by selecting identity dimensions.

6.1.2 Experiments: Birth-to-Lifecycle Biological-Metaphor fingerprinting

In this section, we demonstrate that our method is a genuine fingerprint: (i) it enables birth
verification at the seed level; and (ii) it remains verifiable throughout the full training lifecycle,
even for 7B-scale models. We test with both the one-sided t-test (t-test) and Mann–Whitney
U test (U test) to demonstrate the robustness to the choice of hypothesis test.
For experiments requiring training from scratch, we use 12-layer, 12-head LLaMA-style
models with RoPE (Touvron et al., 2023; Su et al., 2021) and Qwen-style models (Team,
2024). We further evaluate fingerprinting in the fine-tuning stage using 7B pretrained variants
(see Table 6). Because our random baseline is stochastic, we report p-values averaged over
10 independent trials and adopt a significance level of α = 0.01. Importantly, the absolute
magnitude of extremely small p-values is not meaningful: once p falls below numerical and
sampling noise (e.g., $< 10^{-20}$), values like $10^{-260}$ should not be interpreted as stronger
evidence than $10^{-20}$; both decisively reject the null. In the main paper, we present results
for LLaMA-style models and defer those for Qwen-style models to Section B.3.3. The overall
conclusions are consistent across the two.
We compare against four state-of-the-art passive fingerprinting methods: Intrinsic Finger-
print (Yoon et al., 2025), REEF (Zhang et al., 2024), and the two HuRef variants—PCS

26
Transformers Are Born Biased

(4a) Comparison of fingerprint (4b) Trained models share the (4c) The same dataset and train-
behaviors between models ini- same fingerprint behaviors as ing order do not shape finger-
tialized with different seeds. their initialization (p-value < print behaviors to be identical
0.01). across different initializations.
Seed Pair t-test U -test Model Pair t-test U -test Model Pair t-test U -test
s42 vs. s2000 0.357 0.532
42 vs. s42
sinit 2.20e-8 6.28e-8 123 vs. s1000 0.385 0.486
base sinit base
s123 vs. s42 0.678 0.565
123 vs. s123
sinit 7.09e-6 1.37e-5 1000 vs. s2000 0.035 0.096
base sinit base
s1000 vs. s123 0.363 0.335 s1000 vs. sbase
init 5.58e-4 2.81e-3 s42 vs. s123
init base 0.426 0.337
1000
s2000 vs. s1000 0.434 0.481
2000 vs. s2000
sinit 4.00e-10 1.27e-9 2000 vs. s42 0.388 0.287
base sinit base

Table 5: Fingerprint persistence under continual training on diverse datasets (base model:
seed 1000, corpus openwebtext). U-test refers to the Mann–Whitney U test.

                                Ours                    Baselines
Continual corpus (seed)         t-test     U-test       Intrinsic   REEF      PCS       ICS
TinyStoriesV2_cleaned (1000)    0✓         7.77e-89✓    1.000✓      0.759×    0.999✓    0.996✓
TinyStoriesV2_cleaned (123)     0.943✓     0.902✓       0.950×      0.658✓    0.332✓    0.012✓
the_stack (1000)                0✓         3.09e-69✓    0.489×      0.557×    0.585×    0.123×
the_stack (123)                 0.732✓     0.831✓       0.445✓      0.580✓    0.301✓    0.026✓

and ICS (Zeng et al., 2024). Additional details are in Section B.3.2. In all experiment
tables, cell colors indicate lineage: green denotes models from the same source, and
red denotes models from different sources. For example, $s^{42}_{\text{init}}$ vs. $s^{42}_{\text{base}}$
compares a model initialized with seed 42 and its continued-pretrained counterpart, hence
green. By contrast, $s^{2000}_{\text{init}}$ vs. $s^{42}_{\text{base}}$ compares a seed-2000 initialization
with a model trained from a seed-42 initialization, hence red. Additionally, ✓ denotes a
correct detection, while × denotes an error.

Different initialization seeds produce distinct fingerprints. Table 4a reports p-values
from our correlation tests between pairs of models initialized with different random seeds
(42, 123, 1000, and 2000). All p-values are above the significance level of 0.01, so the test
finds no significant correlation for any cross-seed pair and our method reliably distinguishes
models initialized from different seeds. This shows that distinct seeds yield distinct
fingerprint behaviors, allowing models to be separated “at birth.”

Training preserves the initialization fingerprint. Table 4b compares each initialization
model $s_{\text{init}}$ with its descendant $s_{\text{base}}$ trained on the OpenWebText dataset (Gokaslan et al.,
2019) (≈10B tokens). Across all seed–model pairs, p-values are consistently < 0.01, indicating
that their bias profiles remain strongly correlated and thus share a common lineage. In short,
the trained model inherits the same fingerprint as its initialization. We also evaluate baseline
methods (results in Appendix 8); without exception, they fail to distinguish across seeds,
which in turn suggests their separability stems from training-induced artifacts rather than
properties inherent to a specific model instantiation.

Identical data and order do not make fingerprints converge. In Table 4c, all four
“suspicious” models $s^{i}_{\text{base}}$ for $i \in \{42, 123, 1000, 2000\}$ are trained on exactly the same corpus

Table 6: Fingerprinting results vs. LLaMA-2-7B. Each row compares a target model against
LLaMA-2-7B. U-test p reports the p-value from our hidden-state correlation
test (< 0.01 indicates a strong signal). Intrinsic, REEF, PCS, and ICS report
similarity scores (higher = better).

Model                                  # Tokens   U-test p (< 0.01)   Intrinsic (↑)   REEF (↑)   PCS (↑)    ICS (↑)
Llama-2-finance-7B (Heenan, 2023)      5M         1.34e-41✓           1.0000✓         0.9950✓    0.9979✓    0.9952✓
Vicuna-1.5-7B (Chiang et al., 2023)    370M       1.49e-96✓           1.0000✓         0.9985✓    0.9985✓    0.9949✓
Wizardmath-7B (Luo et al., 2023)       1.8B       4.09e-100✓          1.0000✓         0.9979✓    1.0000✓    0.9994✓
Meditron-7B (Chen et al., 2023)        48B        5.212e-4✓           0.9990✓         0.9978✓    1.0000✓    0.9817✓
CodeLlama-7B (Meta AI, 2024)           500B       2.008e-3✓           0.9480✓         0.9947✓    0.6863×    0.3369×

(OpenWebText) and in the same data order (we fix the training seed to lock the data order);
the only difference lies in their initialization seeds i. Across all cross-seed pairs, p-values
remain consistently > 0.01, in sharp contrast to the near-zero values in Table 4b. That is,
fingerprints remain seed-specific even under identical data and curriculum.

Continual training on diverse datasets does not confound the fingerprint. From
a copyright perspective, the true weakness of existing fingerprinting methods is their fragility
to distribution shifts, which prevents reliable lineage attribution under continual training.
In Table 5, we continue training a base model (seed 1000, pretrained on OpenWebText) on
two very different datasets: TinyStories (Eldan and Li, 2023) (synthetic children’s stories)
and The Stack (Kocetkov et al., 2022) (permissively licensed GitHub code). We compare (i)
true descendants trained from the base, versus (ii) distractors derived from a different base
model (initialized with seed 123 and trained with a different data order on OpenWebText),
then continued training on the same corpus. The question is whether attribution methods
can identify which descendant truly shares lineage with the base.
We find that prior baselines all fail under the code setting (The Stack), misclassifying
true descendants as distractors. This indicates that they largely track domain similarity
rather than lineage identity: TinyStories is closer in distribution to the pretraining corpus
(OpenWebText), while The Stack diverges sharply; such a large distribution shift can easily
bypass detection. In contrast, our method correctly attributes lineage across both corpora.
Hence, our fingerprint is not a proxy for data distribution: it survives substantial domain
shift and persists beyond the initial pretraining stage.

From early training to fine-tuning. We further compare our method with existing
baselines under standard evaluations on pretrained models. In particular, we test suspect
models fine-tuned from Llama-2-7b (base model) with data volumes ranging from 5 million
to 500 billion tokens. The suspects include diverse downstream variants such as
Llama-2-finance-7b (Heenan, 2023), Vicuna-1.5-7b (Chiang et al., 2023), WizardMath-7b (Luo et al.,
2023), Meditron-7b (Chen et al., 2023), and Code-Llama-7b (Meta AI, 2024). Their
fine-tuning data volumes are 5M, 370M, 1.8B, 48B, and 500B tokens, respectively. As shown
in Table 6, our method consistently maintains p < 0.01 across all settings.


6.2 Connection Between Positional Variance Discrepancy and Attention Sinks

Our analysis of the self-attention mechanism identifies a fundamental positional variance
disparity that emerges early in the sequence. As formalized in Equation 2, the “running
average” nature of causal attention induces a systematic decay: the variance of a token’s
representation scales inversely with its position index. Consequently, the initial token—exempt
from this averaging effect—maintains a significantly higher variance than subsequent tokens,
manifesting as a structural and statistical outlier from the onset of training.
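
The inverse scaling can be checked numerically. The following is a minimal sketch rather than the paper's experimental code: it assumes uniform causal attention ($A_{i,j} = 1/i$) and i.i.d. standard-normal scalar values, and the function name `running_average_variance` is hypothetical.

```python
import random
import statistics


def running_average_variance(seq_len, n_trials=20000, seed=0):
    """Estimate Var(o_i) for o_i = (1/i) * sum_{j<=i} v_j with v_j ~ N(0, 1).

    Uniform causal attention and scalar i.i.d. values are simplifying
    assumptions made only for this illustration.
    """
    rng = random.Random(seed)
    samples = [[] for _ in range(seq_len)]
    for _ in range(n_trials):
        running_sum = 0.0
        for i in range(1, seq_len + 1):
            running_sum += rng.gauss(0.0, 1.0)
            samples[i - 1].append(running_sum / i)  # causal running average
    return [statistics.pvariance(s) for s in samples]


variances = running_average_variance(seq_len=8)
for i, var in enumerate(variances, start=1):
    print(f"position {i}: empirical var = {var:.3f}, 1/i = {1 / i:.3f}")
```

Under these assumptions the empirical variance at position $i$ tracks $1/i$, so only the first position keeps the full variance of a single value vector.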
This uniqueness of the initial token is a likely precursor to the attention sink phenomenon,
where the first token disproportionately monopolizes attention scores regardless of semantic
context (Gu et al., 2024). This behavior has been characterized as a “massive activation”
phenomenon (Sun et al., 2024), wherein initial tokens develop exceptionally large magnitudes
to serve as a numerical bias that absorbs residual attention probability. Historically, this has
been interpreted as a “null attention” (Vig and Belinkov, 2019) or “no-op” mechanism (Clark
et al., 2019), and its preservation is known to be critical for the stability of sliding-window
inference (Xiao et al., 2023).
While recent studies have proposed various mechanistic drivers, including spectral subspaces
(Cancedda, 2024), positional “waiver” distributions (Yan et al., 2024), and high-norm bands
in Query/Key projections (Barbero et al., 2024), these works largely focus on the properties
of fully trained models. In contrast, we hypothesize that the initial positional variance
disparity acts as a statistical anchor at initialization. This inherent singularity biases the
optimization landscape, incentivizing the model to “latch” onto the first token as a stable
reference point, thereby inducing the emergence of the attention sink.

6.2.1 Intervention of Attention Sink

To investigate this causal link, we design an explicit intervention to rectify this inherent
variance decay. Specifically, we apply a variance calibration operation to the aggregated
representation $o_i$ (where $o_i = \sum_{j=1}^{i} A_{i,j} v_j$) before it enters the subsequent projection layers.
We implement two strategies to enforce variance consistency across positions:

• Positional Amplification: We amplify the aggregated output by a factor of $\sqrt{i}$ to
directly counteract the theoretical $1/\sqrt{i}$ standard deviation decay:

$$\tilde{o}_i = \sqrt{i} \cdot o_i = \sqrt{i} \sum_{j=1}^{i} A_{i,j} v_j.$$

• Positional Attenuation: To ensure numerical stability while equalizing variance,
we propose a normalized variant that scales the output by $\sqrt{i/T}$ (where $T$ denotes the
maximum context length):

$$\tilde{o}_i = \sqrt{\frac{i}{T}} \cdot o_i = \sqrt{\frac{i}{T}} \sum_{j=1}^{i} A_{i,j} v_j.$$

Both methods effectively neutralize the positional bias at initialization, ensuring that repre-
sentations across all sequence indices contribute with comparable variance to the subsequent
layers.
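
The effect of the two scalings can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' implementation: attention is taken to be uniform and causal, values are i.i.d. standard-normal scalars, and `calibrated_variances` and its `mode` argument are hypothetical names.

```python
import math
import random
import statistics


def calibrated_variances(seq_len, max_len, mode, n_trials=20000, seed=0):
    """Empirical variance of the (optionally rescaled) output at each position.

    mode = "none":      o_i
    mode = "amplify":   sqrt(i) * o_i        (Positional Amplification)
    mode = "attenuate": sqrt(i / T) * o_i    (Positional Attenuation, T = max_len)
    """
    scale = {
        "none": lambda i: 1.0,
        "amplify": lambda i: math.sqrt(i),
        "attenuate": lambda i: math.sqrt(i / max_len),
    }[mode]
    rng = random.Random(seed)
    samples = [[] for _ in range(seq_len)]
    for _ in range(n_trials):
        running_sum = 0.0
        for i in range(1, seq_len + 1):
            running_sum += rng.gauss(0.0, 1.0)
            o_i = running_sum / i  # uniform causal attention (assumption)
            samples[i - 1].append(scale(i) * o_i)
    return [statistics.pvariance(s) for s in samples]


T = 8
for mode in ("none", "amplify", "attenuate"):
    v = calibrated_variances(seq_len=T, max_len=T, mode=mode)
    print(mode, [round(x, 3) for x in v])
```

In this toy setting, amplification holds the variance near 1 at every position, while attenuation holds it near $1/T$; both remove the positional dependence, differing only in the common scale.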
To empirically validate these strategies, we conducted controlled pre-training experiments
from scratch, where we adopt the Nano Llama-2 architecture (Touvron et al., 2023) (dmodel =
768, 12 layers, 12 heads, 2048 context window, RoPE). The models were trained on the
OpenWebText dataset (Gokaslan et al., 2019) for 200k steps using the AdamW optimizer
(full details in Appendix B.4). To quantify the attention sink phenomenon, we adopt the
threshold-based metric proposed by Gu et al. (2024). Let $\alpha^{1}_{l,h}$ denote the importance score of
the first token (index 1) in the $h$-th head of the $l$-th layer, calculated as the average attention
weight it receives:

$$\alpha^{1}_{l,h} = \frac{1}{T} \sum_{i=1}^{T} A^{i,1}_{l,h},$$

where $A^{i,1}_{l,h}$ is the attention weight from query position $i$ to the first token. A head is
classified as a “sink head” if $\alpha^{1}_{l,h} > \epsilon$. The model-wide sink rate is defined as
the proportion of such heads:

$$\mathrm{Sink}_{\epsilon} = \frac{1}{L \cdot H} \sum_{l=1}^{L} \sum_{h=1}^{H} \mathbb{I}\left(\alpha^{1}_{l,h} > \epsilon\right).$$

We set ϵ = 0.25 and evaluate this metric on real-world text from the WikiText-2 (Merity
et al., 2016) test set, averaging the Sink Rate over 100 randomly selected sequences across
lengths ranging from 32 to 512 tokens.
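
The metric can be sketched in a few lines. The code below is an illustration only: it assumes attention weights are available as a nested list indexed `attn[layer][head][query][key]` (0-indexed), and the helper names `sink_rate`, `uniform_causal` and `first_token_sink` are hypothetical.

```python
def sink_rate(attn, eps=0.25):
    """Fraction of heads whose mean attention to the first token exceeds eps.

    attn[l][h][i][j] is the weight from query position i to key position j.
    """
    n_layers = len(attn)
    n_heads = len(attn[0])
    sink_heads = 0
    for layer in attn:
        for head in layer:
            seq_len = len(head)
            # Average attention the first token (column 0) receives.
            alpha = sum(row[0] for row in head) / seq_len
            if alpha > eps:
                sink_heads += 1
    return sink_heads / (n_layers * n_heads)


def uniform_causal(seq_len):
    """Toy head that spreads attention evenly over all visible positions."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(seq_len)]
            for i in range(seq_len)]


def first_token_sink(seq_len):
    """Toy head that puts all attention on the first token."""
    return [[1.0 if j == 0 else 0.0 for j in range(seq_len)]
            for _ in range(seq_len)]


# Toy model: 2 layers x 2 heads, three sink heads and one uniform head.
T = 32
toy_attn = [
    [first_token_sink(T), first_token_sink(T)],
    [first_token_sink(T), uniform_causal(T)],
]
print(sink_rate(toy_attn, eps=0.25))
```

For $T = 32$ the uniform head gives the first token an average weight of about 0.13, below the 0.25 threshold, so only the three degenerate heads count as sinks in this toy example.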

Finding 7: The initial representation contraction is causally related to the attention sink
that appears after pre-training.

As illustrated in Figure 12, the Baseline model consistently exhibits a significant sink rate
across all sequence lengths, confirming the prevalence of the attention sink phenomenon in
the pretraining stage. In stark contrast, the model initialized with the Positional Attenuation
strategy maintains a near-zero sink rate regardless of sequence length. Meanwhile, the
Positional Amplification strategy also demonstrates effectiveness, yielding a significantly
lower sink rate compared to the baseline, although it does not eliminate the phenomenon as
completely as the attenuation approach. This robust quantitative evidence demonstrates that
correcting the initial variance disparity effectively mitigates the formation of attention sinks,
supporting our hypothesis that they are learned artifacts driven by statistical anomalies.

6.2.2 Impact on LLM Pretraining

We further verify that our structural interventions do not compromise the model’s pretraining
process. We compare the validation perplexity (PPL) of the fully converged Nano Llama-2
models trained with the two standard deviation compensation strategies against the Baseline
on two standard benchmarks: WikiText-2 (Merity et al., 2016) and C4 (Raffel et al., 2020).
To ensure a rigorous comparison, we conduct the evaluation on 256 sequences, each with a
fixed context length of 2048 tokens.
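
For reference, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below assumes token-level averaging pooled across all evaluation sequences; the exact aggregation used in the experiments is not stated here, and the function name `perplexity` is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """PPL = exp(-mean log p) over all tokens of all sequences.

    token_log_probs: list of sequences, each a list of natural-log
    probabilities assigned to the observed tokens.
    """
    total = sum(sum(seq) for seq in token_log_probs)
    count = sum(len(seq) for seq in token_log_probs)
    return math.exp(-total / count)


# Toy check: a model that assigns probability 1/4 to every token
# should have a perplexity of approximately 4.
toy = [[math.log(0.25)] * 5, [math.log(0.25)] * 3]
print(perplexity(toy))
```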


Figure 12: Quantitative Comparison of Attention Sink Rate on WikiText-2. We
report the average Sink Rate across varying sequence lengths (N=100 samples
per point). While the Baseline model shows a persistently high sink rate, both
Positional Amplification and Attenuation strategies substantially reduce it
across all lengths on real text inputs, with Attenuation driving it to near zero.

Table 7: Analysis of language modeling performance under positional compensation (Nano LLaMA-2),
measured by perplexity (PPL; lower is better) on the WikiText-2 and C4 validation sets.
Both compensation strategies achieve competitive performance, with Positional Attenuation
yielding lower PPL on WikiText-2 and Positional Amplification yielding lower PPL on C4.

Model                       WikiText-2 (PPL) ↓    C4 (PPL) ↓
Baseline                    21.33                 30.36
Positional Amplification    21.95                 28.70
Positional Attenuation      21.02                 30.48

As presented in Table 7, our methods maintain robust performance. The Positional
Attenuation strategy achieves the best performance on WikiText-2 (21.02 vs. 21.33), while
the Positional Amplification strategy demonstrates superior generalization on the larger C4
dataset (28.70 vs. 30.36). Each strategy is marginally worse than the Baseline on the other
benchmark (21.95 vs. 21.33 on WikiText-2 for Amplification; 30.48 vs. 30.36 on C4 for
Attenuation). Overall, both strategies achieve performance comparable to or
better than the Baseline, confirming that eliminating attention sinks via variance alignment
resolves structural pathologies without sacrificing the model’s ability to capture semantic
dependencies.

7 Discussion

The investigation presented in this study reveals that transformers are not “blank slates” at
birth; rather, they are born with an innate structural identity determined by their random
initialization seed. This identity manifests as extreme next-token preferences. Our mechanistic
dissection reveals that this phenomenon is driven by the interaction of asymmetric nonlinear
activations in MLP sublayers, which induce inter-sequence representation concentration,
and self-attention sublayers, which amplify this effect through intra-sequence aggregation.
By simplifying these components into ReLU and uniform attention approximations, we
provided a rigorous theoretical framework that accurately predicts the rate of representation
contraction observed in our empirical trials.
Beyond providing mechanistic insight, our findings have significant practical implications for
model security and architectural design. On one hand, we have shown that these initialization-
induced biases are persistent throughout the training process. This persistence enables the
development of SeedPrint, a novel fingerprinting method that can reliably distinguish models
sharing identical architectures and training data but differing only in their initial random
seeds. Unlike existing passive fingerprinting techniques that often track domain similarity
or training artifacts, SeedPrint remains robust under substantial distribution shifts and
extensive fine-tuning, making it a powerful tool for lineage tracing and copyright auditing.
On the other hand, our analysis identifies a causal link between the structural biases of the
initialization regime and the widely observed attention-sink phenomenon, where the value
aggregation process of causal attention induces a positional variance decay that makes the
first token a statistical outlier. By introducing architectural interventions such as Positional
Amplification or Attenuation to equalize variance, we demonstrated that attention sinks can
be effectively mitigated or eliminated without compromising the model’s capture of semantic
dependencies.
These discoveries hold the potential for a profound shift in the LLM research paradigm—moving
from a focus on “what the model learns” to an understanding of “what the model is born
with”. For years, the community has treated the transformer as a neutral container for data,
yet our work suggests that the architecture’s birth state mechanistically roots behaviors
previously thought to be learned strategies. By demystifying phenomena like attention
sinks as architectural byproducts rather than data-driven artifacts, we provide practitioners
with a direct pathway to control model stability and identity through principled structural
adjustments. This perspective fosters a more transparent and accountable ecosystem for
large-scale model development, where model lineage and copyright can be verified from the
first epoch of training.
However, this work is not without limitations. Our current analysis primarily focuses on
the structural state at initialization and its empirical persistence, without directly modeling
the complex dynamics of the training process itself. While we have shown that initial
biases remain detectable after training, a more rigorous theoretical treatment is needed to
understand how these structural tendencies interact with gradient-based optimization and
the acquisition of semantic knowledge. Furthermore, while we evaluated models up to the 7B
parameter scale, the behavior of these biases in ultra-large-scale regimes remains an area for
further empirical verification. Future research should explore how to intentionally manipulate
the initial representational geometry to improve model alignment or enable more efficient
pre-training, ultimately demystifying the transition from a structured initialization to a fully
capable language model.

References
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Al-
tenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint
arXiv:2303.08774, 2023.

S. Alhazbi, A. Hussain, G. Oligeri, and P. Papadimitratos. Llms have rhythm: Fingerprinting
large language models using inter-token times and network traffic analysis. IEEE Open
Journal of the Communications Society, 2025.

Z. Allen-Zhu and Y. Li. Physics of language models: Part 3.1, knowledge storage and
extraction. arXiv preprint arXiv:2309.14316, 2023.

J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450,
2016.

J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui,
L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren,
C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang,
H. Yang, J. Yang, S. Yang, Y. Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y. Zhang,
Z. Zhang, C. Zhou, J. Zhou, X. Zhou, and T. Zhu. Qwen technical report, 2023. URL
[Link]

F. Barbero, A. Vitvitskyi, C. Perivolaropoulos, R. Pascanu, and P. Veličković. Round
and round we go! what makes rotary positional encodings useful? arXiv preprint
arXiv:2410.06205, 2024.

R. Bommasani. On the opportunities and risks of foundation models. arXiv preprint
arXiv:2108.07258, 2021.

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in
neural information processing systems, 33:1877–1901, 2020.

N. Cancedda. Spectral filters, dark signals, and attention sinks. arXiv preprint
arXiv:2402.09221, 2024.

N. Carlini, S. Chien, M. Nasr, S. Song, A. Terzis, and F. Tramer. Membership inference
attacks from first principles. In 2022 IEEE symposium on security and privacy (SP), pages
1897–1914. IEEE, 2022.

B. K. Chen, T. Hu, H. Jin, H. K. Lee, and K. Kawaguchi. Exact conversion of in-context learn-
ing to model weights in linearized-attention transformers. arXiv preprint arXiv:2406.02847,
2024.

Z. Chen, A. Hernández-Cano, A. Romanou, A. Bonnet, K. Matoba, F. Salvi, M. Pagliardini,
S. Fan, A. Köpf, A. Mohtashami, A. Sallinen, A. Sakhaeirad, V. Swamy, I. Krawczuk,
D. Bayazit, A. Marmet, S. Montariol, M.-A. Hartley, M. Jaggi, and A. Bosselut. Meditron-
70b: Scaling medical pretraining for large language models, 2023.


W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E.
Gonzalez, I. Stoica, and E. P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with
90%* chatgpt quality, March 2023. URL [Link]

K. Clark, U. Khandelwal, O. Levy, and C. D. Manning. What does bert look at? an analysis
of bert’s attention. arXiv preprint arXiv:1906.04341, 2019.

D. Dai, L. Dong, Y. Hao, Z. Sui, B. Chang, and F. Wei. Knowledge neurons in pre-
trained transformers. In Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 8493–8502, 2022.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional
transformers for language understanding. In Proceedings of the 2019 conference of the
North American chapter of the association for computational linguistics: human language
technologies, volume 1 (long and short papers), pages 4171–4186, 2019.

R. Eldan and Y. Li. Tinystories: How small can language models be and still speak coherent
english? arXiv preprint arXiv:2305.07759, 2023.

N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen,
T. Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits
Thread, 1(1):12, 2021.

F. Galton. Finger prints. Number 57490-57492. Cosimo Classics, 1892.

X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural
networks. In Y. W. Teh and M. Titterington, editors, Proceedings of the Thirteenth
International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings
of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15
May 2010. PMLR. URL [Link]

A. Gokaslan, V. Cohen, E. Pavlick, and S. Tellex. Openwebtext corpus. [Link]
[Link]/OpenWebTextCorpus, 2019.

X. Gu, T. Pang, C. Du, Q. Liu, F. Zhang, C. Du, Y. Wang, and M. Lin. When attention sink
emerges in language models: An empirical view. arXiv preprint arXiv:2410.10781, 2024.

S. Gugger, L. Debut, T. Wolf, P. Schmid, Z. Mueller, S. Mangrulkar, M. Sun, and B. Bossan.
Accelerate: Training and inference at scale made simple, efficient and adaptable. https:
//[Link]/huggingface/accelerate, 2022.

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al.
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv
preprint arXiv:2501.12948, 2025.

K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level
performance on imagenet classification. In Proceedings of the IEEE international conference
on computer vision, pages 1026–1034, 2015.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages
770–778, 2016.
C. Heenan. Llama2-7b-finance (hugging face model). [Link]
Llama2-7b-Finance, 2023. Fine-tuned from NousResearch/Llama-2-7b-hf; MIT License;
accessed 2025-09-02.
D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus). In arXiv preprint
arXiv:1606.08415, 2016. URL [Link]
D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus), 2023. URL https:
//[Link]/abs/1606.08415.
J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L.
Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language
models. arXiv preprint arXiv:2203.15556, 2022.
J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Rad-
ford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint
arXiv:2001.08361, 2020.
D. Kocetkov, R. Li, L. Ben Allal, J. Li, C. Mou, C. Muñoz Ferrandis, Y. Jernite, M. Mitchell,
S. Hughes, T. Wolf, D. Bahdanau, L. von Werra, and H. de Vries. The stack: 3 tb of
permissively licensed source code. Transactions on Machine Learning Research (TMLR),
Preprint, 2022.
J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington.
Wide neural networks of any depth evolve as linear models under gradient descent. Advances
in neural information processing systems, 32, 2019.
S. Lin, J. Hilton, and O. Evans. Truthfulqa: Measuring how models mimic human falsehoods.
In Proceedings of the 60th annual meeting of the association for computational linguistics
(volume 1: long papers), pages 3214–3252, 2022.
L. Liu, X. Liu, J. Gao, W. Chen, and J. Han. Understanding the difficulty of training
transformers. arXiv preprint arXiv:2004.08249, 2020.
X. Luan, Z. Wei, Y. Zhang, and M. Sun. Robust and efficient watermarking of large language
models using error correction codes. Proceedings on Privacy Enhancing Technologies, 2025.
H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang.
Wizardmath: Empowering mathematical reasoning for large language models via reinforced
evol-instruct. arXiv preprint arXiv:2308.09583, 2023.
S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models, 2016.
Meta AI. Codellama-7b-hf: Code llama base 7b model (hugging face). [Link]
co/codellama/CodeLlama-7b-hf, 2024. Base 7 billion-parameter Code Llama model for
code synthesis and understanding; trained between January and July 2023; licensed under
Meta Llama 2 license; accessed 2025-09-02.


Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation.
[Link] 2025. Accessed: 2025-10-06.

A. Mueller, R. Frank, T. Linzen, L. Wang, and S. Schuster. Coloring the blank slate:
Pre-training imparts a hierarchical inductive bias to sequence-to-sequence models. arXiv
preprint arXiv:2203.09397, 2022.

L. Noci, S. Anagnostidis, L. Biggio, A. Orvieto, S. P. Singh, and A. Lucchi. Signal propagation
in transformers: Theoretical perspectives and the role of rank collapse. Advances in Neural
Information Processing Systems, 35:27198–27211, 2022.

OpenAI. Gpt-5 system card. [Link] August
2025. Accessed: 2025-10-06.

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal,
K. Slama, A. Ray, et al. Training language models to follow instructions with human
feedback. Advances in neural information processing systems, 35:27730–27744, 2022.

D. Pasquini, E. M. Kornaropoulos, and G. Ateniese. Llmmap: Fingerprinting for large
language models. arXiv preprint arXiv:2407.15847, 2024.

O. Press and L. Wolf. Using the output embedding to improve language models. arXiv
preprint arXiv:1608.05859, 2016.

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are
unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and
P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.
Journal of machine learning research, 21(140):1–67, 2020.

D. Rai, Y. Zhou, S. Feng, A. Saparov, and Z. Yao. A practical review of mechanistic
interpretability for transformer-based language models. arXiv preprint arXiv:2407.02646,
2024.

P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. arXiv preprint
arXiv:1710.05941, 2017.

Y. Ren and D. J. Sutherland. Learning dynamics of llm finetuning. arXiv preprint
arXiv:2407.10490, 2024.

N. Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.

J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu. Roformer: Enhanced transformer with rotary
position embedding. In Proceedings of the 59th Annual Meeting of the Association for Com-
putational Linguistics (ACL), pages 115–124. Association for Computational Linguistics,
2021.

J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with
rotary position embedding. Neurocomputing, 568:127063, 2024.

M. Sun, X. Chen, J. Z. Kolter, and Z. Liu. Massive activations in large language models.
arXiv preprint arXiv:2402.17762, 2024.

Q. Team. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2, 2024.

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière,
N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language
models. arXiv preprint arXiv:2302.13971, 2023.

Y.-Y. Tsai, C. Guo, J. Yang, and L. van der Maaten. Rofl: Robust fingerprinting of language
models. arXiv preprint arXiv:2505.12682, 2025.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and
I. Polosukhin. Attention is all you need. Advances in neural information processing systems,
30, 2017.

J. Vig and Y. Belinkov. Analyzing the structure of attention in a transformer language
model. arXiv preprint arXiv:1906.04284, 2019.

H. Wang, B. Chen, S. Li, X. Liang, H. K. Lee, K. Kawaguchi, and T. Hu. Prefix-tuning+:
Modernizing prefix-tuning by decoupling the prefix from attention. arXiv e-prints, pages
arXiv–2506, 2025a.

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou.
Self-consistency improves chain of thought reasoning in language models. arXiv preprint
arXiv:2203.11171, 2022.

Z. Wang, Y. Wang, Z. Zhang, Z. Zhou, H. Jin, T. Hu, J. Sun, Z. Li, Y. Zhang, and Z.-Q. J.
Xu. Understanding the language model to solve the symbolic multi-step reasoning problem
from the perspective of buffer mechanism. In Findings of the Association for Computational
Linguistics: EMNLP 2025, pages 16446–16474, 2025b.

J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma,
D. Zhou, D. Metzler, et al. Emergent abilities of large language models. arXiv preprint
arXiv:2206.07682, 2022a.

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al.
Chain-of-thought prompting elicits reasoning in large language models. Advances in neural
information processing systems, 35:24824–24837, 2022b.

T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L.
Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. Transformers: State-of-the-art
natural language processing. In Proceedings of the 2020 Conference on Empirical Methods
in Natural Language Processing: System Demonstrations, pages 38–45, Online, Oct. 2020.
Association for Computational Linguistics. URL [Link]
[Link]-demos.6.

G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with
attention sinks. arXiv preprint arXiv:2309.17453, 2023.


J. Xu, F. Wang, M. D. Ma, P. W. Koh, C. Xiao, and M. Chen. Instructional fingerprinting
of large language models. arXiv preprint arXiv:2401.12255, 2024.

R. Yan, X. Du, H. Deng, L. Zheng, Q. Sun, J. Hu, Y. Shao, P. Jiang, J. Jiang, and L. Zhao.
Unveiling and controlling anomalous attention distribution in transformers. arXiv preprint
arXiv:2407.01601, 2024.

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al.
Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

D.-h. Yoon, M. Chun, T. Allen, H. Müller, M. Wang, and R. Sharma. Intrinsic finger-
print of llms: Continue training is not all you need to steal a model! arXiv preprint
arXiv:2507.03014, 2025.

B. Zeng, L. Wang, Y. Hu, Y. Xu, C. Zhou, X. Wang, Y. Yu, and Z. Lin. Huref: Human-
readable fingerprint for large language models. Advances in Neural Information Processing
Systems, 37:126332–126362, 2024.

B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in neural
information processing systems, 32, 2019.

J. Zhang, D. Liu, C. Qian, L. Zhang, Y. Liu, Y. Qiao, and J. Shao. Reef: Representation
encoding fingerprints for large language models. arXiv preprint arXiv:2410.14273, 2024.

R. Zhang. Matrix-driven instant review: Confident detection and reconstruction of llm
plagiarism on pc. arXiv preprint arXiv:2508.06309, 2025.

Appendix Contents

A Technical Details
A.1 Proof of Proposition 2
A.2 Proof of Proposition 4
A.3 Proof of Proposition 5

B Experimental Details
B.1 Initialization Details
B.2 Consistent Next-Token Preference Protocol
B.3 LLM Fingerprinting
B.3.1 Validation of the Gaussian Null Distribution
B.3.2 Baselines Setup
B.3.3 Qwen Results
B.4 Experimental Setup for Pre-training
B.4.1 Model Architecture
B.4.2 Initialization and Optimization
B.4.3 Infrastructure
B.4.4 Dataset


Appendix A. Technical Details

A.1 Proof of Proposition 2

Proof. We analyze the evolution of the expected cosine similarity through the layers by
leveraging the infinite-width limit, where the pre-activation distributions converge to a
Gaussian process.

The ReLU Recurrence Relation. Let $\bar\rho_{l-1}$ denote the expected cosine similarity between
the representations of two inputs at layer $l-1$. In the limit of infinite width, the
pre-activations at layer $l$, denoted by $U_1^{(l)}$ and $U_2^{(l)}$, follow a zero-mean bivariate
Gaussian distribution with correlation $\bar\rho_{l-1}$. For the ReLU activation
$\phi(u) = \max(0, u)$, the expected product of the activations (the unnormalized covariance)
is given by the arc-cosine kernel of degree 1:
\[
  E\big[\phi(U_1^{(l)})\,\phi(U_2^{(l)})\big]
    = \frac{\sigma_l^2}{2\pi}\big(\sin\theta_{l-1} + (\pi - \theta_{l-1})\cos\theta_{l-1}\big),
\]
where $\sigma_l^2$ is the pre-activation variance and $\theta_{l-1} = \arccos(\bar\rho_{l-1})$ is the
angle between the pre-activation vectors. The variance at the output is obtained by setting
$\theta_{l-1} = 0$ (perfect correlation):
\[
  E\big[\phi(U_1^{(l)})^2\big] = \frac{\sigma_l^2}{2\pi}\,(0 + \pi \cdot 1) = \frac{\sigma_l^2}{2}.
\]
The expected cosine similarity at layer $l$ is the ratio of the covariance to the variance:
\[
  \bar\rho_l = \frac{E\big[\phi(U_1^{(l)})\,\phi(U_2^{(l)})\big]}{E\big[\phi(U_1^{(l)})^2\big]}
             = \frac{1}{\pi}\big(\sin\theta_{l-1} + (\pi - \theta_{l-1})\cos\theta_{l-1}\big).
\]

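By substituting $\sin(\arccos(\rho)) = \sqrt{1 - \rho^2}$ and $\cos(\arccos(\rho)) = \rho$, we obtain
the recurrence map $g(\rho)$:
\[
  \bar\rho_l = g(\bar\rho_{l-1})
    = \frac{1}{\pi}\Big(\sqrt{1 - \bar\rho_{l-1}^2} + \big(\pi - \arccos(\bar\rho_{l-1})\big)\,\bar\rho_{l-1}\Big). \tag{3}
\]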
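Base case ($l = 1$): For $l = 1$, the inputs $X_1$ and $X_2$ are independent isotropic Gaussian
vectors, implying $\bar\rho_0 = 0$. Substituting $\bar\rho_0 = 0$ into Equation (3):
\[
  \bar\rho_1 = \frac{1}{\pi}\Big(\sqrt{1 - 0} + \big(\pi - \tfrac{\pi}{2}\big)\cdot 0\Big) = \frac{1}{\pi}.
\]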
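Monotonicity: The function $g(\rho)$ is continuous and differentiable on $[0, 1]$, with
$g(0) = 1/\pi > 0$ and $g(1) = 1$. Differentiating gives $g'(\rho) = 1 - \arccos(\rho)/\pi$, so
$g'(\rho) < 1$ on $[0, 1)$ and $g'(1) = 1$. Hence $g(\rho) - \rho$ is strictly decreasing on $[0, 1]$
and vanishes at $\rho = 1$, which gives $g(\rho) > \rho$ for all $\rho \in [0, 1)$. Thus, the similarity
is monotonically increasing with $l$.
Limiting case ($l \to \infty$): Since the sequence $\bar\rho_l$ is bounded above by 1 and is
monotonically increasing, it must converge to a fixed point $\bar\rho = g(\bar\rho)$. Because
$g(\rho) > \rho$ on $[0, 1)$, the only fixed point in the interval $[0, 1]$ is $\rho^* = 1$.
Therefore, $\lim_{l \to \infty} \bar\rho_l^{\mathrm{ReLU}} = 1$.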
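
The recurrence in Equation (3) can be iterated numerically to see both the base case and the
slow approach to 1. The following is a minimal Python sketch, not code from the paper; the
function names `relu_similarity_map` and `similarity_by_layer` are illustrative choices.

```python
import math


def relu_similarity_map(rho: float) -> float:
    """Recurrence map g(rho) of Equation (3) for ReLU in the infinite-width limit."""
    rho = max(-1.0, min(1.0, rho))  # guard sqrt/acos against rounding drift
    return (math.sqrt(1.0 - rho * rho) + (math.pi - math.acos(rho)) * rho) / math.pi


def similarity_by_layer(num_layers: int, rho0: float = 0.0) -> list:
    """Return [rho_0, rho_1, ..., rho_num_layers] obtained by iterating g."""
    values = [rho0]
    for _ in range(num_layers):
        values.append(relu_similarity_map(values[-1]))
    return values
```

Calling `similarity_by_layer(2)` returns the starting value 0, then 1/π, then a value close to
0.49; longer runs increase at every step while staying below 1.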
The tanh case. Consider $\phi(u) = \tanh(u)$. Since the inputs $X_1, X_2$ are independent
zero-mean Gaussians ($\bar\rho_0 = 0$), the pre-activations $U_1^{(1)}$ and $U_2^{(1)}$ are
independent. Let $f_{\mathrm{odd}}$ be any odd function. Since the marginal distributions of $U$
are symmetric about zero:
\[
  E[f_{\mathrm{odd}}(U)] = 0 \implies
  E[f_{\mathrm{odd}}(U_1) f_{\mathrm{odd}}(U_2)] = E[f_{\mathrm{odd}}(U_1)]\,E[f_{\mathrm{odd}}(U_2)] = 0.
\]
For $\phi = \tanh$, this implies the output correlation $\bar\rho_1^{\tanh} = 0$. By induction, if
$\bar\rho_{l-1} = 0$, the pre-activations at layer $l$ remain independent, yielding $\bar\rho_l = 0$
for all $l \ge 1$.
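
The contrast between the two activations at the first layer can be checked with a small
simulation. This is a sketch under a simplifying assumption: instead of multiplying inputs by a
weight matrix, it samples the two pre-activation vectors directly as independent standard
normals, which is what the infinite-width limit with $\bar\rho_0 = 0$ prescribes. The helper names
and the width of 50,000 coordinates are illustrative.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


def relu(u):
    return max(0.0, u)


tanh_activation = math.tanh


def first_layer_similarity(activation, width=50000, seed=0):
    """Cosine similarity of phi(U1) and phi(U2) for independent N(0, 1) pre-activations."""
    rng = random.Random(seed)
    u1 = [rng.gauss(0.0, 1.0) for _ in range(width)]
    u2 = [rng.gauss(0.0, 1.0) for _ in range(width)]
    return cosine([activation(u) for u in u1], [activation(u) for u in u2])
```

With ReLU the estimate lands near 1/π, because the activations share a positive mean; with
tanh it lands near 0, because the odd activation keeps the outputs centered.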

A.2 Proof of Proposition 4

Proof. We analyze the cosine similarity between the representations of two independent
sequences, denoted as $X^{(1)}$ and $X^{(2)}$, passed through the two different architectures. Let
the sequence length be $T$. We assume the infinite-width limit where the pre-activations follow a
Gaussian process.

Correlation after the First MLP Layer: Let $x$ and $x'$ be any two independent input
tokens (either from the same sequence or different sequences). From Proposition 2, the
expected cosine similarity between their representations after the first ReLU MLP layer,
$h = \mathrm{MLP0}(x)$ and $h' = \mathrm{MLP0}(x')$, is $\bar\rho_1^{\mathrm{ReLU}} = 1/\pi$.
Since the inputs $X_i^{(1)}$ and $X_j^{(2)}$ are all mutually independent standard normal vectors,
the hidden representations $h_i^{(1)}$ and $h_j^{(2)}$ share the same pairwise correlation
$\bar\rho_1$ for all $i, j$ (and for distinct indices within the same sequence). For the normalized
representations, we have:
\[
  E[h_i^\top h_j] = \bar\rho_1^{\mathrm{ReLU}} = \frac{1}{\pi} \quad \text{for } i \neq j,
  \qquad \text{and} \qquad E[\|h_i\|^2] = 1.
\]

Case I: MLP0-MLP0. This architecture applies a second MLP layer element-wise. The
inter-sequence similarity is defined as the expected cosine similarity between a token processed
by the second layer from sequence 1 and sequence 2. This is equivalent to the propagation of
correlation through two ReLU layers starting from zero correlation. Using the recurrence
relation from Proposition 2 with $\bar\rho_1 = 1/\pi$:
\[
  \bar\rho_2^{\mathrm{ReLU}}
    = \frac{1}{\pi}\Big(\sqrt{1 - \bar\rho_1^2} + \big(\pi - \arccos(\bar\rho_1)\big)\,\bar\rho_1\Big)
    = \frac{1}{\pi}\left(\sqrt{1 - \frac{1}{\pi^2}} + \Big(\pi - \arccos\frac{1}{\pi}\Big)\frac{1}{\pi}\right).
\]
Substituting values ($\pi \approx 3.14159$, $1/\pi \approx 0.3183$):
\[
  \bar\rho_2^{\mathrm{ReLU}} \approx \frac{1}{3.1416}\big(0.948 + (1.895)(0.318)\big) \approx 0.493.
\]
This confirms the value is approximately 0.49.
Case II: MLP0-Attn0. The simplified attention aggregates the values into a single vector
(conceptually the sequence embedding). Let $y^{(1)}$ and $y^{(2)}$ be the outputs for the two
sequences:
\[
  y^{(1)} = \frac{1}{T}\sum_{i=1}^{T} h_i^{(1)}, \qquad y^{(2)} = \frac{1}{T}\sum_{j=1}^{T} h_j^{(2)}.
\]
We calculate the expected inter-sequence pairwise cosine similarity
\[
  \bar\rho = \frac{E[y^{(1)\top} y^{(2)}]}{\sqrt{E[\|y^{(1)}\|^2]\,E[\|y^{(2)}\|^2]}}.
\]
For the numerator (cross-correlation): Since all pairs across sequences are distinct,
$E[h_i^{(1)\top} h_j^{(2)}] = \bar\rho_1$, so
\[
  E[y^{(1)\top} y^{(2)}] = \frac{1}{T^2}\sum_{i=1}^{T}\sum_{j=1}^{T} E[h_i^{(1)\top} h_j^{(2)}]
    = \frac{1}{T^2}\,(T^2 \bar\rho_1) = \bar\rho_1.
\]
For the denominator (variance): For a single sequence, the variance of the mean of $T$
variables with uniform pairwise correlation $\bar\rho_1$ and unit variance is:
\[
  E[\|y^{(1)}\|^2] = \mathrm{Var}\Big(\frac{1}{T}\sum_i h_i\Big)
    = \frac{1}{T^2}\Big(\sum_i \mathrm{Var}(h_i) + \sum_{i \neq j} \mathrm{Cov}(h_i, h_j)\Big),
\]
\[
  E[\|y^{(1)}\|^2] = \frac{1}{T^2}\big(T \cdot 1 + T(T-1)\bar\rho_1\big)
    = \frac{1 + (T-1)\bar\rho_1}{T} = \bar\rho_1 + \frac{1 - \bar\rho_1}{T}.
\]
Resulting similarity:
\[
  \bar\rho = \frac{\bar\rho_1}{\bar\rho_1 + \frac{1 - \bar\rho_1}{T}} = \frac{T\bar\rho_1}{T\bar\rho_1 + 1 - \bar\rho_1}.
\]
Substituting $\bar\rho_1 = 1/\pi$:
\[
  \bar\rho = \frac{T(1/\pi)}{T(1/\pi) + 1 - 1/\pi} = \frac{T/\pi}{(T + \pi - 1)/\pi} = \frac{T}{T + \pi - 1}.
\]
As T → ∞, this term approaches 1. This demonstrates that while the MLP layer only
moderately correlates the sequences (≈ 0.49), the attention mechanism acts as an amplifier,
driving the correlation to 1 by suppressing the independent noise components relative to the
shared bias induced by the ReLU.
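
The Case II result can be evaluated for a few sequence lengths to see how quickly averaging
drives the similarity toward 1. The sketch below is illustrative only; the function name
`mean_pooled_similarity` is not from the paper, and it simply combines the numerator and
denominator derived above.

```python
import math


def mean_pooled_similarity(seq_len: int, rho1: float = 1.0 / math.pi) -> float:
    """Inter-sequence cosine similarity after uniform averaging over seq_len tokens."""
    cross = rho1                              # E[y1^T y2]: every cross pair contributes rho1
    norm_sq = rho1 + (1.0 - rho1) / seq_len   # E[||y||^2]: shared part plus 1/T noise part
    return cross / norm_sq                    # both sequences have the same expected norm


def closed_form(seq_len: int) -> float:
    """The simplified expression T / (T + pi - 1) obtained for rho1 = 1/pi."""
    return seq_len / (seq_len + math.pi - 1.0)
```

For a single token (`seq_len=1`) the value is 1/π, the first-layer similarity; the shared
component stays fixed while the independent component shrinks as 1/T.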

A.3 Proof of Proposition 5

Proof. We derive the expression for the mean intra-sequence pairwise cosine similarity
$\bar\rho'(T, L)$ by decomposing the problem into a finite-size correction term and an asymptotic
integral limit.

Decomposition into Diagonal and Off-Diagonal Terms. Let $Z \in \mathbb{R}^{T \times d}$ denote the
sequence representation after $L$ layers. The mean intra-sequence pairwise cosine similarity is
defined as:
\[
  \bar\rho'(T, L) = \frac{1}{T^2}\sum_{i=1}^{T}\sum_{j=1}^{T} E[\rho(z_i, z_j)].
\]
The summation consists of $T$ diagonal terms where $i = j$ (with $\rho = 1$) and $T(T-1)$
off-diagonal terms ($i \neq j$). Let $\bar\rho'(L) \equiv \lim_{T \to \infty} \bar\rho'(T, L)$ denote the
expected similarity between any two distinct tokens in the continuous limit. Assuming the
sequence is exchangeable or stationary in the limit, the expectation for off-diagonal terms
converges to $\bar\rho'(L)$. Thus:
\[
  \bar\rho'(T, L) = \frac{1}{T^2}\big(T \cdot 1 + T(T-1)\,\bar\rho'(L)\big)
                 = \frac{1}{T} + \frac{T-1}{T}\,\bar\rho'(L)
                 = \bar\rho'(L) + \frac{1}{T}\big(1 - \bar\rho'(L)\big).
\]
Derivation of the Infinite Limit $\bar\rho'(L)$. In the limit $T \to \infty$, the discrete cumulative
averaging operation at layer $k$, $x_t^{(k)} = \frac{1}{t}\sum_{s=1}^{t} x_s^{(k-1)}$, converges to
the integral operator $(\mathcal{T}f)(t) = \frac{1}{t}\int_0^t f(s)\,ds$. For a white-noise input (the
increments of a standard Brownian motion), applying this operator $N = L - 1$ times yields a
Gaussian process with covariance kernel $K(s, t)$. For ordered positions $s < t$, the correlation
depends only on the ratio $r = s/t$:
\[
  \rho_L(r) = \sqrt{r}\,\sum_{k=0}^{L-2} C_{N,k}\,\Big(\ln\frac{1}{r}\Big)^{k},
\]
where $C_{N,k}$ are coefficients derived from the Gamma distribution integrals associated with
the iterated kernels. The mean similarity is the expectation over all pairs $(s, t)$. For a uniform
distribution on the triangle $0 \le s \le t \le 1$, the ratio $r = s/t$ satisfies $P(s/t \le r) = r$,
so it is uniformly distributed on $[0, 1]$ with density $f(r) = 1$. Integrating the correlation
over this domain:
\[
  \bar\rho'(L) = \int_0^1 \rho_L(r)\,dr.
\]
Solving this integral using the substitution $u = \ln(1/r)$ transforms the terms into Gamma
functions $\Gamma(k)$. The exact closed-form solution is given by:
\[
  \bar\rho'(L) = \frac{(L-2)!}{(2L-4)!}\sum_{k=0}^{L-2}\frac{(L-2+k)!}{k!}\Big(\frac{2}{3}\Big)^{L-1-k}.
\]
This formula captures the precise rate at which the representation contracts toward the
global mean as the network depth L increases.
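
The closed form and the finite-length correction from the start of this proof can be evaluated
exactly with rational arithmetic. The sketch below is an illustration rather than the authors'
code; `limit_similarity` and `finite_length_similarity` are hypothetical names, and the formula
is taken to apply for L ≥ 2, where the factorials are defined.

```python
from fractions import Fraction
from math import factorial


def limit_similarity(num_layers: int) -> Fraction:
    """Closed-form limit rho'(L) as T -> infinity, evaluated exactly for L >= 2."""
    if num_layers < 2:
        raise ValueError("the closed form is evaluated here only for L >= 2")
    n = num_layers - 2
    total = sum(
        Fraction(factorial(n + k), factorial(k)) * Fraction(2, 3) ** (num_layers - 1 - k)
        for k in range(n + 1)
    )
    return Fraction(factorial(n), factorial(2 * n)) * total


def finite_length_similarity(seq_len: int, num_layers: int) -> Fraction:
    """rho'(T, L) = rho'(L) + (1 - rho'(L)) / T, the finite-size decomposition."""
    limit = limit_similarity(num_layers)
    return limit + (1 - limit) / seq_len
```

For example, `limit_similarity(2)` evaluates to 2/3 and `limit_similarity(3)` to 8/9, and
`finite_length_similarity(1, L)` is 1 for any valid L because a single token is only compared
with itself.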

Appendix B. Experimental Details

B.1 Initialization Details

We initialize the embedding layer and all standard linear layers from a zero-mean normal
distribution with standard deviation 0.02. For the residual projection layers, specifically the
attention output projection ($W_O$) and the MLP down-projection ($W_{\mathrm{down}}$), we instead
adopt the scaled initialization scheme from GPT-2 (Radford et al., 2019): weights are initialized
with a standard deviation of $0.02/\sqrt{2L}$, where $L$ denotes the total number of transformer
layers. This scaling factor mitigates the accumulation of variance along the residual path,
thereby ensuring optimization stability in deep transformer architectures.
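
As a concrete illustration of the two scales, the sketch below computes the standard deviation
for each kind of layer and draws a small weight matrix. It assumes that 0.02 is a standard
deviation (as in the GPT-2 scheme); `init_std` and `init_weight` are illustrative names, and
plain Python lists stand in for framework tensors.

```python
import math
import random


def init_std(is_residual_projection: bool, num_layers: int, base_std: float = 0.02) -> float:
    """Standard deviation for a layer: base_std, or base_std / sqrt(2L) on the residual path."""
    if is_residual_projection:
        return base_std / math.sqrt(2 * num_layers)
    return base_std


def init_weight(rows: int, cols: int, std: float, rng: random.Random) -> list:
    """Draw a rows x cols matrix with i.i.d. N(0, std^2) entries."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]
```

With 12 layers, the residual projections use a standard deviation of 0.02/√24, roughly 0.0041,
about a fifth of the scale used elsewhere.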


B.2 Consistent Next-Token Preference Protocol

To rigorously evaluate intrinsic model biases, all experiments adhere to the following protocol
regarding model configuration, input generation, and statistical evaluation.

Model and Weighting Setup. To ensure the comparability of token ID preferences across
different random initializations, we fix the weights of the embedding layer and the LM
head across trials for both the RoPE-enhanced GPT-2 (with weight tying) and LLaMA-
2 (without weight tying). By enforcing identical initialization for these token-interaction
layers—regardless of this architectural discrepancy—we isolate the impact of the internal
transformer block dynamics on token preference.

Input Sequences. We evaluate the models using a fixed set of N randomly generated
token sequences. By utilizing unstructured random inputs devoid of semantic meaning, we
ensure that any observed output preference arises solely from the model’s inductive biases
(i.e., architecture and initialization) rather than input context.

Evaluation Metrics. We quantify prediction bias using two primary metrics:

• Top-1 Token ID: For every input sequence, we identify the token with the maximum
logit (arg max). The Top-1 Token ID is defined as the single token ID that appears
most frequently as the predicted next token across all N sequences.
• Statistical Significance (p-value): To validate that the dominance of the most
frequent token is not a result of random chance, we perform a hypothesis test with
a Bonferroni correction. Let $k_{\max}$ be the observed count of the Top-1 token. Under
the null hypothesis of a uniform distribution ($p_0 = 1/V$), the nominal probability is
calculated via the upper tail of the Binomial distribution:
\[
  p_{\mathrm{nominal}} = P(X \ge k_{\max}) = \sum_{x=k_{\max}}^{N} \binom{N}{x}\, p_0^{x}\,(1 - p_0)^{N-x},
  \qquad \text{where } X \sim \mathrm{Binomial}(N, p_0).
\]
To correct for the multiple comparisons inherent in checking a vocabulary of size $V$, the
corrected p-value is:
\[
  p\text{-value} = \min(1.0,\; V \times p_{\mathrm{nominal}}).
\]
A small p-value (e.g., < 0.05) indicates that the observed preference is statistically
significant.
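
The two metrics above can be sketched in a few lines of Python. This is an illustrative
implementation, not the authors' code: the function names are hypothetical, the input is
assumed to be the list of arg-max token IDs (one per sequence), and the binomial tail is summed
in log space so that large N and V do not underflow prematurely.

```python
from collections import Counter
from math import exp, lgamma, log, log1p


def top1_token(argmax_ids):
    """Most frequent arg-max token ID across all sequences, with its count k_max."""
    token_id, k_max = Counter(argmax_ids).most_common(1)[0]
    return token_id, k_max


def _log_binom_pmf(n, x, p):
    """log of C(n, x) * p^x * (1 - p)^(n - x)."""
    return (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
            + x * log(p) + (n - x) * log1p(-p))


def bonferroni_p_value(k_max, num_sequences, vocab_size):
    """min(1, V * P(X >= k_max)) for X ~ Binomial(N, 1/V)."""
    if vocab_size < 2 or not 0 <= k_max <= num_sequences:
        raise ValueError("need vocab_size >= 2 and 0 <= k_max <= num_sequences")
    p0 = 1.0 / vocab_size
    log_terms = [_log_binom_pmf(num_sequences, x, p0)
                 for x in range(k_max, num_sequences + 1)]
    peak = max(log_terms)  # factor out the largest term for numerical stability
    p_nominal = exp(peak) * sum(exp(t - peak) for t in log_terms)
    return min(1.0, vocab_size * p_nominal)
```

A typical call would be `token_id, k = top1_token(ids)` followed by
`bonferroni_p_value(k, len(ids), vocab_size)`.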

B.3 LLM Fingerprinting

Implementation Details and Licenses. We train all models with the Hugging Face
transformers Trainer (Wolf et al., 2020), using Accelerate (Gugger et al., 2022) for distributed
runs. All open-source models are loaded from their official Hugging Face releases and used
under their original licenses: Llama models under the Meta Llama Community License, and
other models under Apache-2.0. All datasets are downloaded via the Hugging Face Datasets
library (the library is Apache-2.0); dataset content follows each dataset’s stated license.

Correlation Metric. We measure the alignment between two rankings using Kendall's Tau
($\tau$) correlation coefficient. The metric is computed over the intersection of the top-m biased
dimensions from both models, to focus on the most salient features.
Let $C$ be the number of concordant pairs and $D$ the number of discordant pairs, and let $n$
be the total number of tokens in the intersection. Kendall's $\tau$ is defined as the difference
between the numbers of concordant and discordant pairs relative to the total number of
pairs:
\[
  \tau = \frac{C - D}{\tfrac{1}{2}\,n(n-1)}.
\]
A value close to 1 indicates that the tokens favored by the model's behavior are ordered
similarly to those favored by the geometric contraction mechanism.
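
A direct implementation of this definition is short. The sketch below is a hedged illustration:
it assumes each model is summarized by one score per dimension (for instance an average
output), selects each model's top-m indices, and counts concordant and discordant pairs on the
intersection. The names `top_m` and `kendall_tau_on_intersection` are hypothetical, and ties
are counted as neither concordant nor discordant.

```python
from itertools import combinations


def top_m(scores, m):
    """Set of indices of the m largest entries of scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:m])


def kendall_tau_on_intersection(scores_a, scores_b, m):
    """tau = (C - D) / (n (n - 1) / 2) over dimensions in both models' top-m sets."""
    shared = sorted(top_m(scores_a, m) & top_m(scores_b, m))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared dimensions")
    concordant = discordant = 0
    for i, j in combinations(shared, 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (0.5 * n * (n - 1))
```

Two score vectors with the same ordering give 1.0 and exactly reversed orderings give -1.0.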

B.3.1 Validation of the Gaussian Null Distribution

Our hypothesis test evaluates whether two models share lineage by comparing the distribution
of their correlation statistics over random inputs against a null distribution. The null
distribution aims to model the correlation distribution between two independent models when
evaluated on random inputs.

Empirical Null Distribution from Initialized Models. The cleanest way to instantiate
complete independence is to compare two independently initialized models (with different
random seeds and no training). We estimate this empirical null by evaluating ∼ 2500 such
pairs and computing Kendall's Tau correlations on 10,000 random inputs. Figure 13 shows
that the empirical null distribution closely matches a Gaussian distribution.

Constructing the Null Distribution in SeedPrints. Approximating the output of
neural networks as Gaussian distributions is a widely adopted and effective practice in
the auditing literature (Carlini et al., 2022). This practice is supported by both empirical
evidence and theoretical analyses, which show that model outputs (e.g., logits) often
exhibit Gaussian-like behavior across diverse model architectures, data types, and tasks (Lee
et al., 2019). Therefore, and based on the Gaussian-like null distribution observed above
(Figure 13), we use a Gaussian as a proxy for the model output.
Because two randomly initialized models have independent parameter matrices, their
outputs are independent conditional on the same inputs. Sampling two independent
Gaussian matrices is therefore a natural surrogate for two independently initialized models
under the null. This provides a straightforward way to construct the null:

1. Sample two Gaussian matrices in $\mathbb{R}^{N \times d_{\mathrm{out}}}$,

2. Extract the most biased dimensions of each by selecting the top-m dimensions ranked by
their average value,

3. Take the intersection set S and compute correlations on it.


[Figure 13 plot: "Initial Model Correlation Distribution". x-axis: Kendall Tau Correlation;
y-axis: Density; series: Init Model Correlation, Gaussian Fit.]

Figure 13: Empirical correlation distribution over ∼ 2500 pairs of independently initialized
models, evaluated on random inputs, closely matches a Gaussian distribution
centered at zero.

Intuitively, selecting S introduces dependencies among dimensions. However, applying the
same selection procedure to Gaussian samples preserves this dependency structure under the
raw null.
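
The three-step null construction can be sketched as follows. This is a minimal illustration
under stated assumptions, not the SeedPrints implementation: the matrix entries are i.i.d.
standard normal, "most biased" is read as the largest per-dimension average, and the
correlation on S is taken to be Kendall's τ between the two vectors of per-dimension averages.
All function names and the small default sizes are hypothetical.

```python
import random
from itertools import combinations


def column_means(num_inputs, num_dims, rng):
    """Per-dimension average of an N x d_out matrix with i.i.d. N(0, 1) entries."""
    sums = [0.0] * num_dims
    for _ in range(num_inputs):
        for d in range(num_dims):
            sums[d] += rng.gauss(0.0, 1.0)
    return [s / num_inputs for s in sums]


def top_indices(values, m):
    """Set of indices of the m largest values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:m])


def null_correlation(num_inputs, num_dims, m, rng):
    """One draw of the null statistic; returns None if fewer than two dimensions are shared."""
    means_a = column_means(num_inputs, num_dims, rng)   # step 1, first matrix
    means_b = column_means(num_inputs, num_dims, rng)   # step 1, second matrix
    shared = sorted(top_indices(means_a, m) & top_indices(means_b, m))  # steps 2 and 3
    n = len(shared)
    if n < 2:
        return None
    score = 0
    for i, j in combinations(shared, 2):
        prod = (means_a[i] - means_a[j]) * (means_b[i] - means_b[j])
        score += (prod > 0) - (prod < 0)   # +1 concordant, -1 discordant
    return score / (0.5 * n * (n - 1))
```

Repeating `null_correlation` many times with a shared `random.Random` instance yields an
empirical null sample against which an observed correlation can be compared.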

B.3.2 Baselines Setup

We mainly consider four passive fingerprinting baselines (weight- or representation-based).
Intrinsic fingerprint (Yoon et al., 2025) (or PDF in some papers) compares models via the
similarity of the layerwise standard-deviation profiles of attention parameters. REEF (Zhang
et al., 2024) computes centered kernel alignment (CKA) similarity between feature
representations from the same samples across two models. PCS and ICS (Zeng et al., 2024)
(or collectively as HuRef in some papers) are weight-similarity methods: PCS flattens all
parameters and measures cosine similarity; ICS forms invariant terms from the weights and
measures cosine similarity on those invariants. Following Zhang et al. (2024), we use a 0.8
similarity threshold for binary decisions.
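
To make the simplest baseline concrete, the sketch below flattens two sets of weights and
compares them by cosine similarity, then applies the 0.8 threshold. It follows only the
one-line description of PCS given above and is not the HuRef implementation; nested Python
lists stand in for tensors, and treating a similarity of exactly 0.8 as a match is an
assumption.

```python
import math


def flatten(params):
    """Concatenate an iterable of (possibly nested) weight lists into one flat vector."""
    flat = []
    for tensor in params:
        stack = [tensor]
        while stack:
            item = stack.pop()
            if isinstance(item, (list, tuple)):
                stack.extend(reversed(item))  # reversed so elements pop in original order
            else:
                flat.append(float(item))
    return flat


def parameter_cosine_similarity(params_a, params_b):
    """Cosine similarity between the flattened parameter vectors of two models."""
    a, b = flatten(params_a), flatten(params_b)
    if len(a) != len(b):
        raise ValueError("models must have the same number of parameters")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


def same_lineage(similarity, threshold=0.8):
    """Binary decision: similarity at or above the threshold counts as shared lineage."""
    return similarity >= threshold
```

Under this rule, the PCS scores of about 0.58 reported in Table 8 fall below the threshold.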

B.3.3 Qwen Results

Different initialization seeds produce distinct fingerprints. Table 9 presents the
p-values from correlation tests on Qwen-style models initialized with different random seeds
(42, 123, 1000, 2000), evaluated under both the t-test and U-test. In all cases, the p-values
remain above 0.01, confirming that our approach can consistently tell apart models trained
from different seeds.

Table 8: Full results of methods evaluating fingerprints at initialization and after subsequent
training.

                                    Ours                       Baselines
Model Pair                    t-test      U-test      Intrinsic  REEF    PCS     ICS
s^init_42 vs. s^base_42       2.20e-8✓    6.28e-8✓    -0.021×    0.375×  0.580×  0.196×
s^init_123 vs. s^base_123     7.09e-6✓    1.37e-5✓    0.149×     0.369×  0.581×  0.188×
s^init_1000 vs. s^base_1000   5.58e-4✓    2.81e-3✓    -0.252×    0.381×  0.581×  0.188×
s^init_2000 vs. s^base_2000   4.00e-10✓   1.27e-9✓    -0.337×    0.331×  0.581×  0.188×

Table 9: Comparison of fingerprint behaviors between models initialized with different seeds
for Qwen-style models.

                       Hidden State
Seed Pair            t-test   U-test
s123 vs. s1000       0.094    0.074
s1000 vs. s123       0.125    0.094
s42 vs. s2000        0.451    0.529
s2000 vs. s42        0.130    0.095

Training preserves the initialization fingerprint. We also compare each initialization
model s^init with its corresponding trained version s^base, trained on the OpenWebText
dataset (Gokaslan et al., 2019) (≈10B tokens), for Qwen; the results are shown in Table 10.
Consistent with the LLaMA-style results, all seed–model pairs yield p-values below 0.01. This
indicates that training does not erase the initialization fingerprint; instead, the signature is
preserved in the descendant model.

Table 10: Trained models share the same fingerprint behaviors as their initialization models
(p-value < 0.01).

                                  Hidden State
Model Pair                    t-test      U-test
s^init_123 vs. s^base_123     7.36e-15    3.38e-13
s^init_1000 vs. s^base_1000   4.41e-13    2.01e-11
s^init_42 vs. s^base_42       1.06e-24    2.05e-19
s^init_2000 vs. s^base_2000   4.87e-24    1.92e-20

Identical data and order do not make fingerprints converge. Following the setting
in Table 44c, we also test whether fingerprint behavior would be erased or confounded by
identical data and order for Qwen-style models. Table 11 shows that fingerprints remain
seed-specific even under identical data and curriculum.

Table 11: The same dataset and training order do not shape fingerprint behaviors to be
identical across different initializations.

                                  Hidden State
Model Pair                    t-test   U-test
s^init_123 vs. s^base_1000    0.286    0.254
s^init_1000 vs. s^base_123    0.036    0.040
s^init_42 vs. s^base_2000     0.026    0.043
s^init_2000 vs. s^base_42     0.112    0.123

Continual training on diverse datasets does not confound the fingerprint. Table 12
reports, for Qwen, the same tests as Table 5. Our method remains stable and reliable
across all cases (even under severe training distribution shifts), whereas several baselines
make incorrect decisions.

Table 12: Fingerprint persistence under continual training on diverse datasets (base model:
seed 1000, corpus openwebtext). U-test refers to the Mann–Whitney U test.

Setting                            Ours                       Baselines
Continual corpus (seed)     t-test       U-test      Intrinsic  REEF    PCS     ICS
TinyStories (1000)          8.49e-214✓   5.09e-71✓   1.000✓     0.957✓  0.999✓  0.996✓
TinyStories (123)           0.256✓       0.065✓      0.913×     0.199✓  0.328✓  0.039✓
the_stack (1000)            1.16e-211✓   2.30e-76✓   0.999✓     0.313×  0.995✓  0.976✓
the_stack (123)             0.610✓       0.491✓      0.916×     0.255✓  0.328✓  0.038✓

All-stage verifiable fingerprints. Our fingerprinting method on Qwen also demonstrates
verifiability at all training stages (Figure 14). Across all variants, the suspect model is
consistently recognized as belonging to the same lineage, with p-values remaining below the
0.01 threshold.

[Figure 14 plot. x-axis: Checkpoint Index (4K to 32K); y-axis: P-value (log scale);
series: Hidden State - T-test, Hidden State - Mann-Whitney U, Logits - T-test,
Logits - Mann-Whitney U; reference line at 0.01.]

Figure 14: Fingerprint verifies lineage at every checkpoint (p-values < 0.01) for the
Qwen-style model.

B.4 Experimental Setup for Pre-training

To facilitate reproducibility, we provide the detailed configurations used for our pre-training
experiments.

B.4.1 Model Architecture

Our baseline models adopt the standard Llama 2 architecture (Touvron et al., 2023), featuring
RMSNorm for pre-normalization, SwiGLU activation functions, and Rotary Positional
Embeddings (RoPE). The specific architectural hyperparameters for the models used in our
main comparisons are listed in Table 13.

Table 13: Model Architecture Configurations.

Hyperparameter               Value
Hidden Size (d_model)        768
Intermediate Size (d_ffn)    2048
Number of Layers (L)         12
Number of Heads (H)          12
Head Dimension (d_k)         64
Vocabulary Size              32,000
Normalization                RMSNorm
Activation Function          SwiGLU
Position Embedding           RoPE
Context Window               2048


B.4.2 Initialization and Optimization

Models are initialized following the standard GPT-2 initialization scheme (Radford et al.,
2019). Please refer to Section B.1 for detailed specifications regarding the standard deviation
and residual weight scaling.
We employ the AdamW optimizer with β1 = 0.9, β2 = 0.95. Training follows a cosine
learning rate schedule with a linear warmup phase. Detailed optimization hyperparameters
are provided in Table 14.
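
For reference, a cosine schedule with linear warmup can be written as a single function of the
iteration index. The sketch below plugs in the values listed in Table 14; the exact warmup
convention (here the rate reaches its peak on the last warmup step) and the behavior after the
final iteration are assumptions, since the text does not specify them.

```python
import math


def learning_rate(step: int,
                  peak_lr: float = 6.0e-4,
                  min_lr: float = 6.0e-5,
                  warmup_steps: int = 2000,
                  max_steps: int = 200000) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr at max_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps       # assumed warmup convention
    if step >= max_steps:
        return min_lr                                    # assumed: hold at the floor
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + cosine * (peak_lr - min_lr)
```

Halfway through the decay phase (iteration 101,000) the rate is the midpoint of the peak and
minimum values, 3.3 × 10^-4.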

B.4.3 Infrastructure

All experiments were conducted on a single computing node equipped with 2× NVIDIA A100
(80GB) GPUs, utilizing the Distributed Data Parallel (DDP) strategy. The total training
duration was approximately 48 hours.

B.4.4 Dataset

The models are pre-trained on the OpenWebText dataset (Gokaslan et al., 2019), an open-
source reproduction of the WebText corpus. Data tokenization is performed using the
standard Llama 2 tokenizer. Given the target training volume, we train for multiple epochs
over the dataset (repeated sampling) to reach the total iteration count.
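
The batch configuration and token volume implied by Table 14 can be checked with simple
arithmetic. The sketch below makes several assumptions that the text does not state: the micro
batch size is per GPU, every training sequence fills the full 2048-token context window, and
the corpus holds roughly 10B tokens (the figure quoted for OpenWebText in Section B.3.3, which
may differ under the Llama 2 tokenizer). The function names are illustrative.

```python
def global_batch_size(micro_batch: int = 32, grad_accum_steps: int = 2, num_gpus: int = 2) -> int:
    """Sequences per optimizer step under data parallelism with gradient accumulation."""
    return micro_batch * grad_accum_steps * num_gpus


def token_budget(global_batch: int = 128,
                 context_window: int = 2048,
                 max_iterations: int = 200_000,
                 corpus_tokens: int = 10_000_000_000):
    """Return (tokens per iteration, total tokens, approximate passes over the corpus)."""
    tokens_per_iteration = global_batch * context_window
    total_tokens = tokens_per_iteration * max_iterations
    return tokens_per_iteration, total_tokens, total_tokens / corpus_tokens
```

Under these assumptions, 32 × 2 × 2 reproduces the global batch size of 128, each iteration
processes 262,144 tokens, and the full run covers about 52B tokens, or roughly five passes over
a 10B-token corpus.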

Table 14: Pre-training Hyperparameters.

Hyperparameter                 Value
Peak Learning Rate             6.0 × 10^-4
Min Learning Rate              6.0 × 10^-5
Warmup Iterations              2,000
Max Iterations                 200,000
Global Batch Size              128
Micro Batch Size               32
Gradient Accumulation Steps    2
Weight Decay                   0.1
Gradient Clipping              1.0
Precision                      bfloat16
Optimizer                      AdamW
