Neurons in Large Language Models: Dead, N-gram, Positional

Elena Voita¹, Javier Ferrando²∗, Christoforos Nalmpantis¹

¹ Meta AI
² TALP Research Center, Universitat Politècnica de Catalunya
{lenavoita, christoforos}@[Link], [Link]@[Link]

∗ Work done as part of internship at Meta AI.

arXiv:2309.04827v1 [[Link]] 9 Sep 2023

Abstract

We analyze a family of large language models in a manner lightweight enough to be done on a single GPU. Specifically, we focus on the OPT family of models ranging from 125m to 66b parameters and rely only on whether an FFN neuron is activated or not. First, we find that the early part of the network is sparse and represents many discrete features. Here, many neurons (more than 70% in some layers of the 66b model) are “dead”, i.e. they never activate on a large collection of diverse data. At the same time, many of the alive neurons are reserved for discrete features and act as token and n-gram detectors. Interestingly, their corresponding FFN updates not only promote next token candidates as could be expected, but also explicitly focus on removing the information about the tokens triggering them, i.e., the current input. To the best of our knowledge, this is the first example of mechanisms specialized at removing (rather than adding) information from the residual stream. With scale, models become more sparse in the sense that they have more dead neurons and token detectors. Finally, some neurons are positional: whether they are activated depends largely (or solely) on position and less so (or not at all) on textual data. We find that smaller models have sets of neurons acting as position range indicators while larger models operate in a less explicit manner.

1 Introduction

The range of capabilities of language models expands with scale, and at larger scales models become so strong and versatile that a single model can be integrated into various applications and decision-making processes (Brown et al., 2020; Kaplan et al., 2020; Wei et al., 2022; Ouyang et al., 2022; OpenAI, 2023; Anil et al., 2023). This increases the interest in, and the importance of, understanding the internal workings of these large language models (LLMs) and, specifically, their evolution with scale. Unfortunately, scaling also increases the entry threshold for interpretability researchers since dealing with large models requires a lot of computational resources. In this work, we analyze a family of OPT models up to 66b parameters and deliberately keep our analysis very lightweight so that it could be done using a single GPU.

We focus on neurons inside FFNs, i.e. individual activations in the representation between the two linear layers of the Transformer feedforward blocks (FFNs). Differently from e.g. neurons in the residual stream, FFN neurons are more likely to represent meaningful features: the elementwise nonlinearity breaks the rotational invariance of this representation and encourages features to align with the basis dimensions (Elhage et al., 2021). When such a neuron is activated, it updates the residual stream by pulling out the corresponding row of the second FFN layer; when it is not activated, it does not update the residual stream (Figure 6).¹ Therefore, we can interpret functions of these FFN neurons in two ways: (i) by understanding when they are activated, and (ii) by interpreting the corresponding updates coming to the residual stream.

¹ Since OPT models have the ReLU activation function, the notion of “activated” or “not activated” is trivial and means non-zero vs zero.

First, we find that in the first half of the network, many neurons are “dead”, i.e. they never activate on a large collection of diverse data. Larger models are more sparse in this sense: for example, in the 66b model more than 70% of the neurons in some layers are dead. At the same time, many of the alive neurons in this early part of the network are reserved for discrete features and act as indicator functions for tokens and n-grams: they activate if and only if the input is a certain token or an n-gram. The function of the updates coming from these token detectors to the residual stream is also very
surprising: at the same time as they promote concepts related to the potential next token candidate (which is to be expected according to Geva et al. (2021, 2022)), they are explicitly targeted at removing information about the current input, i.e. their triggers. This means that in the bottom-up processing where a representation of the current input token gets gradually transformed into a representation for the next token, current token identity is removed by the model explicitly (rather than ending up implicitly “buried” as a result of additive updates useful for the next token). To the best of our knowledge, this is the first example of mechanisms specialized at removing (rather than adding) information from the residual stream.

Finally, we find that some neurons are responsible for encoding positional information regardless of textual patterns. Similarly to token and n-gram detectors, many of these neurons act as indicator functions of position ranges, i.e. activate for positions within certain ranges and do not activate otherwise. Interestingly, these neurons often collaborate. For example, the second layer of the 125m model has 10 positional neurons whose indicated positional ranges are in agreement: together, they efficiently cover all possible positions and no neuron is redundant. In a broader picture, positional neurons question the key-value memory view of the FFN layers stating that “each key correlates with textual patterns in the training data and each value induces a distribution over the output vocabulary” (Geva et al., 2021, 2022). Neurons that rely on position regardless of textual pattern indicate that FFN layers can be used by the model in ways that do not fit the key-value memory view. Overall, we argue that the roles played by these layers are still poorly understood.

Overall, we find neurons that:

• are “dead”, i.e. never activate on a large diverse collection of data;
• act as token- and n-gram detectors that, in addition to promoting next token candidates, explicitly remove current token information;
• encode position regardless of textual content, which indicates that the role of FFN layers extends beyond the key-value memory view.

With scale, models have more dead neurons and token detectors and are less focused on absolute position.

2 Data and Setting

Models. We use OPT (Zhang et al., 2022), a suite of decoder-only pre-trained transformers that are publicly available. We use model sizes ranging from 125M to 66B parameters and take model weights from the HuggingFace model hub.²

Data. We use data from diverse sources containing development splits of the datasets used in OPT training as well as several additional datasets. Overall, we used (i) subsets of the validation and test part of the Pile (Gao et al., 2020) including Wikipedia, DM Mathematics, HackerNews, (ii) Reddit³ (Baumgartner et al., 2020; Roller et al., 2021), (iii) code data from Codeparrot⁴.

For the experiments in Section 3 when talking about dead neurons, we use several times more data. Specifically, we add more data from Wikipedia, DM Mathematics and Codeparrot, as well as add new domains from the Pile⁵: EuroParl, FreeLaw, PubMed abstracts, Stackexchange.

Overall, the data used in Section 3 has over 20M tokens; the data used in the rest of the paper has over 5M tokens.

Single-GPU processing. We use only sets of neuron values for some data, i.e. we run only forward passes of the full model or its several first layers. Since large models do not fit in a single GPU, we load one layer at a time keeping the rest of the layers on CPU. This allows us to record neuron activations for large models: all the main experiments in this paper were done on a single GPU.

² [Link]
³ [Link] Reddit dataset is a previously existing dataset extracted and obtained by a third party that contains preprocessed comments posted on the social network Reddit and hosted by [Link].
⁴ [Link] codeparrot-clean
⁵ [Link] pile

3 Dead Neurons

Let us start from simple statistics such as neuron activation frequency (Figure 1).

Many neurons are “dead”. First, we find that many neurons never activate on our diverse data, i.e. they can be seen as “dead”. Figure 1a shows that the proportion of dead neurons is very substantial: e.g., for the 66b model, the proportion of dead neurons in some layers is above 70%. We also see that larger models are more sparse because (i) they
have more dead neurons and (ii) the ones that are alive activate less frequently (Figure 1b).

Figure 1: (a) Percentage of “dead” neurons; (b) average neuron activation frequency among non-dead neurons.
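
The two statistics plotted in Figure 1 can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the FFN activations of one layer have already been recorded as a list of per-token rows, and the function name `activation_stats` and the toy data are hypothetical. Following the ReLU convention used in this paper, a neuron counts as activated at a token when its value is non-zero.

```python
def activation_stats(activations):
    """Return (percentage of dead neurons, mean activation frequency of alive neurons).

    `activations` is a list of rows, one row per token position in the data,
    each row holding the post-ReLU value of every FFN neuron in one layer.
    """
    num_tokens = len(activations)
    num_neurons = len(activations[0])

    # Activation frequency: fraction of tokens on which the neuron is non-zero.
    freqs = []
    for n in range(num_neurons):
        fired = sum(1 for row in activations if row[n] != 0)
        freqs.append(fired / num_tokens)

    # A neuron is "dead" (on this data) if it never fires.
    alive = [f for f in freqs if f > 0]
    dead_pct = 100.0 * (num_neurons - len(alive)) / num_neurons
    mean_alive_freq = sum(alive) / len(alive) if alive else 0.0
    return dead_pct, mean_alive_freq


# Toy data (hypothetical): 4 tokens, 4 neurons; neurons 0 and 2 never fire.
toy_activations = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.7],
    [0.0, 0.5, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.9],
]
dead_pct, mean_alive_freq = activation_stats(toy_activations)
print(dead_pct, mean_alive_freq)  # 50.0 0.75
```

Note that "dead" here is always relative to the data that was run through the model: a neuron with zero observed frequency may still fire on inputs outside the sample.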

Only first half of the model is sparse. Next, we notice that this kind of sparsity is specific only to early layers. This leads to a clear distinction between the first and the second halves of the network: while the first half contains a solid proportion of dead neurons, the second half is fully “alive”. Additionally, layers with most dead neurons are the ones where alive neurons activate most rarely.

Packing concepts into neurons. This difference in sparsity across layers might be explained by the “concept-to-neuron” ratio being much smaller in the early layers than in the higher layers. Intuitively, the model has to represent the set of concepts encoded in a layer by “spreading” them across available neurons. In the early layers, encoded concepts are largely shallow and are likely to be discrete (e.g., lexical) while at the higher layers, networks learn high-level semantics and reasoning (Peters et al., 2018; Liu et al., 2019; Jawahar et al., 2019; Tenney et al., 2019; Geva et al., 2021). Since the number of possible shallow patterns is not large and, potentially, enumerable, in the early layers the model can (and, as we will see later, does) assign dedicated neurons to some features. The more neurons are available to the model, the easier it is to do so; this agrees with the results in Figure 1 showing that larger models are more sparse. Differently, the space of fine-grained semantic concepts is too large compared to the number of available neurons, which makes it hard to reserve many dedicated neuron-concept pairs.⁶

⁶ There can, however, be a few specialized neurons in the higher layers. For example, BERT has neurons responsible for relational facts (Dai et al., 2022).

Are dead neurons completely dead? Note that the results in Figure 1a can mean one of two things: (i) these neurons can never be activated (i.e. they are “completely dead”) or (ii) they correspond to patterns so rare that we never encountered them in our large diverse collection of data. While the latter is possible, note that this does not change the above discussion about sparsity and types of encoded concepts. On the contrary: it further supports the hypothesis of models assigning dedicated neurons to specific concepts.

Figure 2: Neurons categorized by the number of unigrams (i.e., tokens) able to trigger them. First half of the network, alive neurons only.

4 N-gram-Detecting Neurons

Now, let us look more closely into the patterns encoded in the lower half of the models and try to understand the nature of the sparsity observed above. Specifically, we analyze how neuron activations depend on an input n-gram. For each input text with tokens x_1, x_2, ..., x_S, we record neuron activations at each position and if a neuron is activated (i.e., non-zero) at position k, we say that the n-gram (x_{k−n+1}, ..., x_k) triggered this neuron.

In Sections 4.1-4.4 we talk about unigrams (i.e., tokens) and come to larger n-grams in Section 4.5.

4.1 Number of N-grams Triggering a Neuron

First, let us see how many n-grams are able to trigger each neuron. For each neuron we evaluate the number of n-grams that cover at least 95% of the neuron’s activations. For the bottom half of the network, Figure 2 shows how neurons in each layer are categorized by the number of n-grams covering them (we show unigrams here and larger
n-grams in Appendix A).

We see that, as anticipated, neurons in larger models are covered by fewer n-grams. Also, the largest models have a substantial proportion of neurons that are covered by as few as 1 to 5 tokens. This agrees with our hypothesis in the previous section: the model spreads discrete shallow patterns across specifically dedicated neurons.⁷

⁷ Note that the 350m model does not follow the same pattern as all the rest: we will discuss this model in Section 6.

4.2 Token-Detecting Neurons

The presence of neurons that can be triggered by only a few (e.g., 1-5) tokens points to the possibility that some neurons act as token detectors, i.e. activate if and only if the input is one of the corresponding tokens, regardless of the previous context. To find such neurons, we (1) pick neurons that can be triggered by only 1-5 tokens, (2) gather tokens that are covered by this neuron (a token is covered if the neuron activates at least 95% of the time the token is present), (3) keep the neuron if, altogether, these covered tokens are responsible for at least 95% of neuron activations.⁸

⁸ We exclude the begin-of-sentence token from these computations because for many neurons, this token is responsible for the majority of the activations.

Figure 3a shows that there are indeed a lot of token-detecting neurons. As expected, larger models have more such neurons and the 66b model has overall 5351 token detectors. Note that each token detector is responsible for a group of several tokens that, in most of the cases, are variants of the same word (e.g., with differences only in capitalization, presence of the space-before-word special symbol, morphological form, etc.). Figure 5 (top) shows examples of groups of tokens detected by token-detecting neurons.

Figure 3: (a) Number of token-detecting neurons; (b) number of tokens that have a neuron detecting them: solid line is per layer, dashed line is cumulative over layers.

Interestingly, the behavior of the largest models (starting from 13b parameters) differs from that of the rest. While for smaller models the number of token detectors increases then goes down, larger models operate in three monotonic stages and start having many token-detecting neurons from the very first layer (Figure 3). This already shows qualitative differences between the models: with more capacity, larger models perform more complicated reasoning with more distinct stages.

4.3 Ensemble-Like Behaviour of the Layers

Now, let us look at “detected” tokens, i.e. tokens that have a specialized neuron detecting them. Figure 3b shows the number of detected tokens in each layer as well as the number of detected tokens cumulative over layers. We see that, e.g., the 66b model focuses on no more than 1.5k tokens in each layer but over 10k tokens overall. This means that across layers, token-detecting neurons are responsible for largely differing tokens. Indeed, Figure 4 shows that in each following layer, detected tokens mostly differ from all the tokens covered by the layers below. All in all, this points to an ensemble-like (as opposed to sequential) behavior of the layers: layers collaborate so that token-detecting neurons cover largely different tokens in different layers. This divide-and-conquer-style strategy allows larger models to cover many tokens overall and use their capacity more effectively.

Figure 4: Number of tokens covered in each layer, indicating the tokens that are (i) new overall, and (ii) new compared to the previous layer.

Originally, such an ensemble-like behavior of deep residual networks was observed in computer vision models (Veit et al., 2016). For transformers, previous evidence includes simple experiments showing that e.g. dropping or reordering layers does not influence performance much (Fan et al., 2020; Zhao et al., 2021).

4.4 Token Detectors Suppress Their Triggers

Now let us try to understand the role of token-detecting neurons in the model by interpreting how
they update the residual stream. Throughout the layers, token representation in the residual stream gets transformed from the token embedding for the current input token⁹ to the representation that encodes a distribution for the next token. This transformation happens via additive updates coming from attention and FFN blocks in each layer. Whenever an FFN neuron is activated, the corresponding row of the second FFN layer (multiplied by this neuron’s value) is added to the residual stream (see illustration in Figure 6). By projecting this FFN row onto vocabulary, we can get an interpretation of this update (and, thus, the role of this neuron) in terms of its influence on the output distribution encoded in the residual stream.

⁹ For OPT models, along with an absolute positional embedding.

Figure 5: Examples of the top promoted and suppressed tokens for token-detecting neurons (Ġ is a special symbol denoting the space before word; in the OPT tokenizers, it is part of a word); OPT-66b model.

Figure 6: Intuition behind concept suppression: we look not only at the top projections of an FFN update on vocabulary but also at the bottom. The concepts that are added with a negative value are suppressed.

Current token suppression: implicit or explicit? Previously, this influence was understood only in terms of the top projections, i.e. tokens that are promoted (Geva et al., 2021, 2022). This reflects an existing view supporting implicit rather than explicit loss of the current token identity over the course of layers. Namely, the view that the current identity gets “buried” as a result of updates useful for the next token as opposed to being removed by the model explicitly. In contrast, we look not only at the top projections but also at the bottom: if these projections are negative, the corresponding tokens are suppressed by the model (Figure 6).

Explicit token suppression in the model. We find that often token-detecting neurons deliberately suppress the tokens they detect. Figure 5 shows several examples of token-detecting neurons along with the top promoted and suppressed concepts. While the top promoted concepts are in line with previous work (they are potential next token candidates which agrees with Geva et al. (2021, 2022)), the top suppressed concepts are rather unexpected: they are exactly the tokens triggering this neuron. This means that vector updates corresponding to these neurons point in the direction of the next token candidates at the same time as they point away from the tokens triggering the neuron. Note that this is not trivial since these updates play two very different roles at the same time. Overall, for over 80% of token-detecting neurons their corresponding updates point in the negative direction from the tokens triggering them (although the triggering tokens are not always at the very top of the suppressed concepts as in the examples in Figure 6).

Overall, we argue that models can have mechanisms that are targeted at removing information from the residual stream which can be explored further in future work.

4.5 Beyond Unigrams

In Appendix A, we show results for bigrams and trigrams that mirror our observations for unigrams: (i) larger models have more specialized neurons, (ii) in each layer, models cover mostly new n-grams. Interestingly, for larger n-grams we see a more drastic gap between larger and smaller models.

5 Positional Neurons

When analyzing dead neurons (Section 3), we also noticed some neurons that, consistently across diverse data, never activate except for a few first token positions. This motivates us to look further into how position is encoded in the model and, specifically, whether some neurons are responsible for encoding positional information.

5.1 Identifying Positional Neurons

Intuitively, we want to find neurons whose activation patterns are defined by or, at least, strongly depend on token position. Formally, we identify neurons whose activations have high mutual information with position. For each neuron, we evaluate mutual information between two random variables:

• act: neuron is activated or not ({Y, N}),
• pos: token position ({1, 2, ..., T}).

Formal setting. We gather neuron activations for full-length data (i.e., T = 2048 tokens) for Wikipedia, DM Mathematics and Codeparrot. Let $fr_n^{(pos)}$ be the activation frequency of neuron n at position pos and $fr_n$ be the total activation frequency of this neuron. Then the desired mutual information is as follows:¹⁰

\[
I(act, pos) = \frac{1}{T} \cdot \sum_{pos=1}^{T} \left[ fr_n^{(pos)} \cdot \log \frac{fr_n^{(pos)}}{fr_n} + \left(1 - fr_n^{(pos)}\right) \cdot \log \frac{1 - fr_n^{(pos)}}{1 - fr_n} \right].
\]

¹⁰ For more details, see appendix B.1.

Choosing the neurons. We pick neurons with I(act, pos) > 0.05, i.e. high mutual information with position; this gives neurons whose activation frequency depends on position rather than content. Indeed, if e.g. a neuron is always activated within a certain position range regardless of data domain, we can treat this neuron as responsible for position; at least, to a certain extent.

Figure 7: Types of positional neurons. Top row: “strong” pattern, bottom row: “weak” pattern.

5.2 Types of Positional Neurons

After selecting positional neurons, we categorize them according to their activation pattern, i.e. activation frequency depending on position (Figure 7).

Oscillatory. These neurons are shown in purple in Figure 7. When such a pattern is strong (top row), the activation pattern is an indicator function of position ranges. In other words, such a neuron is activated if and only if the position falls into a certain set. Note that since the activation pattern does not change across data domains, it is defined solely by position and not the presence of some lexical or semantic information.

Both types of activation extremes. These are the neurons whose activation pattern is not oscillatory but still has intervals where activation frequency reaches both “activation extremes”: 0 (never activated) and 1 (always activated). Most frequently, such a neuron is activated only for positions less than or greater than some value and not activated otherwise. Similarly to oscillatory neurons, when
such a pattern is strong (Figure 7, top row), it is also (almost) an indicator function.

Only one type of activation extremes. Differently from the previous two types, activation patterns for these neurons can reach only one of the extreme values 0 or 1 (Figure 7, green). While this means that they never behave as indicator functions, there are position ranges where a neuron being activated or not depends solely on token position.

Other. Finally, these are the neurons whose activation patterns strongly depend on position but do not have intervals where activation frequency stays 0 or 1 (Figure 7, yellow). Typically, these activation patterns have lower mutual information with position than the previous three types.

Strong vs weak pattern. We also distinguish “strong” and “weak” versions of each type which we will further denote with color intensity (Figure 7, top vs bottom rows). For the first three types of positional neurons, the difference between strong and weak patterns lies in whether, on the corresponding position ranges, activation frequency equals 0 (or 1) or is close, but not equal, to 0 (or 1). For the last type, this difference lies in how well we can predict activation frequency at a certain position knowing this value for the neighboring positions (informally, “thin” vs “thick” graph).

5.3 Positional Neurons Across the Models

For each of the models, Figure 8 illustrates the positional neurons across layers.

Figure 8: Positional neurons in each of the models. Each circle corresponds to a single neuron, colors and their intensity correspond to the types of patterns shown in Figure 7.

Small models encode position more explicitly. First, we notice that smaller models rely substantially on oscillatory neurons: this is the most frequent type of positional neurons for models smaller than 6.7b parameters. In combination with many “red” neurons acting as indicator functions for wider position ranges, the model is able to derive a token’s absolute position rather accurately. Interestingly, larger models do not have oscillatory neurons and rely on more generic patterns shown with red- and green-colored circles. We can also see that from 13b to 66b, the model loses two-sided red neurons and uses the one-sided green ones more. This hints at one of the qualitative differences between smaller and larger models: while the former encode absolute position more accurately, the latter ones are likely to rely on something more meaningful than absolute position. This complements recent work showing that absolute position encoding is harmful for length generalization in reasoning tasks (Kazemnejad et al., 2023). Differently from their experiments with the same model size but various positional encodings, we track changes with scale. We see that, despite all models being trained with absolute positional encodings, stronger models tend to abstract away from absolute position.

Positional neurons work in teams. Interestingly, positional neurons seem to collaborate to cover the full set of positions together. For example, let us look more closely at the 10 strongly oscillatory neurons in the second layer of the 125m model (shown with dark purple circles in Figure 8). Since they act as indicator functions, we can plot position ranges
indicated by each of these neurons. Figure 9 shows that (i) indicated position ranges for these neurons are similar up to a shift, (ii) the shifts are organized in a “perfect” order in the sense that altogether, these ten neurons efficiently cover all positions such that none of these neurons is redundant.

Figure 9: Position ranges indicated by strong oscillatory neurons in the second layer of the 125m model.

The two stages within the model. Finally, Figure 8 reveals two stages of ups and downs of positional information within the model: roughly, the first third of the model and the rest. Interestingly, preferences in positional patterns also change between the stages: e.g., preference for “red” neurons changes to oscillatory purple patterns for the 1.3b and 2.7b models, and “red” patterns become less important in the upper stage for the 13b and 30b models. Note that the first third of the model corresponds to the sparse stage with the dead neurons and n-gram detectors (Sections 3, 4). Therefore, we can hypothesize that in these two stages, positional information is first used locally to detect shallow patterns, and then more globally to use longer contexts and help encode semantic information.

Previously, the distinct bottom-up stages of processing inside language models were observed in Voita et al. (2019a). The authors explained that the way representations gain and lose information throughout the layers is defined by the training objective and why, among other things, positional information should (and does) get lost. This agrees with our results in this work: we can see that while there are many positional patterns in the second stage, they are weaker than in the first stage.

5.4 Positional Neurons are Learned Even Without Positional Encoding

Recently, it turned out that even without positional encoding, autoregressive language models still learn positional information (Haviv et al., 2022). We hypothesize that the mechanism these “NoPos” models use to encode position is positional neurons. To confirm this, we train two versions of the 125m model, with and without positional encodings, and compare the types of their positional neurons.

Setup. We trained 125m models with the standard OPT setup but a smaller training dataset: we used the OpenWebText corpus (Gokaslan and Cohen, 2019), an open clone of the GPT-2 training data (Radford et al., 2019). This dataset contains 3B tokens (compared to 180B for OPT).

Positional neurons without positional encoding. Figure 10 shows positional neurons in two 125m models: trained with and without positional encoding. We see that, indeed, the model without positional encoding also has many strong positional patterns. Note, however, that the NoPos model does not have oscillatory neurons which, in combination with other positional neurons, allow encoding absolute position rather accurately. This means that the NoPos model relies on more generic patterns, e.g. “red” neurons encoding whether a position is greater/less than some value.

Oscillatory neurons require longer training. Finally, we found that oscillatory patterns appear only with long training. Figure 11 shows positional patterns learned by the baseline 125m model trained for 50k, 150k and 300k training batches. We see that all models have very strong positional patterns, but only the last of them has oscillatory neurons. Apparently, learning absolute position requires longer training time.

5.5 Doubting FFNs as Key-Value Memories

The current widely held belief is that feed-forward layers in transformer-based language models operate as key-value memories. Specifically, “each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary” (Geva et al. (2021, 2022); Dai et al. (2022); Meng et al. (2022); Ferrando et al. (2023), among others). While in Section 4.4 we confirmed that this is true for some of the neurons, results in this section reveal that FFN layers can be used by the model in ways that do not fit the key-value memory view. In particular, activations of strong positional neurons are defined by position regardless of textual content, and the corresponding values do not seem to encode meaningful distributions over vocabulary. This means that the role of these neurons is different from matching textual patterns to sets of the next token candidates. In a broader context, this means that the roles played
by Transformer feed-forward layers are still poorly understood.

Figure 10: Positional neurons in 125m models: baseline vs model without positional encoding. Both models were trained for 300k batches.

Figure 11: Positional neurons in the base 125m model trained with 50k, 150k and 300k batches.

6 The 350m Model: The Odd One Out

As we already mentioned above, the 350m model does not follow the same pattern as the rest of the models. Specifically, it does not have dead neurons (Section 3) and its neuron activations do not seem to be sparse with respect to the n-grams triggering them as we saw for all the other models in Figure 2.¹¹

¹¹ There are, however, positional neurons; see Figure 16 in Appendix B.2.

Modeling bits affect interpretability. This becomes less surprising when noticing that the 350m model is implemented differently from all the rest: it applies LayerNorm after attention and feed-forward blocks, while all the other models apply it before.¹² Apparently, such seemingly minor implementation details can affect interpretability of model components rather significantly. Indeed, previous work also tried choosing certain modeling aspects to encourage interpretability. Examples of such work include choosing an activation function to increase the number of interpretable neurons (Elhage et al., 2022), a large body of work on sparse softmax variants to make output distributions or attention more interpretable (Martins and Astudillo (2016); Niculae and Blondel (2017); Peters et al. (2019); Correia et al. (2019); Martins et al. (2020), among others), or more extreme approaches with explicit modular structure that is aimed to be interpretable by construction (Andreas et al. (2016); Hu et al. (2018); Kirsch et al. (2018); Khot et al. (2021), to name a few). Intuitively, choosing the ReLU activation function as done in the OPT models can be seen as having the same motivation as developing sparse softmax variants: exact zeros in the model are inherently interpretable.

7 Additional Related Work

Historically, neurons have been a basic unit of analysis. Early works started from convolutional networks first for images (Krizhevsky et al., 2012) and later for convolutional text classifiers (Jacovi et al., 2018). Similar to our work, Jacovi et al. (2018) also find n-gram detectors; although, for small convolutional text classifiers this is an almost trivial observation compared to large Transformer-based language models as in our work. For recurrent networks, interpretable neurons include simple patterns such as line lengths, brackets and quotes (Karpathy et al., 2015), a sentiment neuron (Radford et al., 2017) and various neurons in machine translation models, such as tracking brackets, quotes, etc, as well as neurons correlated with higher-level concepts e.g. verb tense (Bau et al., 2019). For Transformer-based BERT, Dai et al. (2022) find that some neurons inside feed-forward blocks are responsible for storing factual knowledge. Larger units of analysis include attention blocks (Voita et al. (2018, 2019b); Clark et al. (2019); Kovaleva et al. (2019); Baan et al. (2019); Correia et al. (2019), etc), feed-forward layers (Geva et al., 2021, 2022) and circuits responsible for certain tasks (Wang et al., 2022; Geva et al., 2023; Hanna et al., 2023).

Acknowledgements

The authors thank Nicola Cancedda, Yihong Chen, Igor Tufanov and FAIR London team for fruitful
12
[Link]
blob/main/src/transformers/models/opt/modeling_ discussions and helpful feedback.
[Link]
References

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. 2023. PaLM 2 technical report.

Joris Baan, Maartje ter Hoeve, Marlies van der Wees, Anne Schuth, and Maarten de Rijke. 2019. Understanding multi-head attention in abstractive summarization.

Anthony Bau, Yonatan Belinkov, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2019. Identifying and controlling important neurons in neural machine translation. In International Conference on Learning Representations, New Orleans.

Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The Pushshift Reddit dataset.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively sparse transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2174–2184, Hong Kong, China. Association for Computational Linguistics.

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493–8502, Dublin, Ireland. Association for Computational Linguistics.

Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer ElShowk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda Askell, Kamal Ndousse, Jones, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli, Liane Lovitt, Zac Hatfield-Dodds, Jackson Kernion, Tom Conerly, Shauna Kravec, Stanislav Fort, Saurav Kadavath, Josh Jacobson, Eli Tran-Johnson, Jared Kaplan, Jack Clark, Tom Brown, Sam McCandlish, Dario Amodei, and Christopher Olah. 2022. Softmax linear units. Https://transformer-[Link]/2022/solu/[Link].

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread.

Angela Fan, Edouard Grave, and Armand Joulin. 2020. Reducing transformer depth on demand with structured dropout. In International Conference on Learning Representations.

Javier Ferrando, Gerard I. Gállego, Ioannis Tsiamas, and Marta R. Costa-jussà. 2023. Explaining how transformers use context to build predictions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5486–5513, Toronto, Canada. Association for Computational Linguistics.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800GB dataset of diverse text for language modeling.

Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models.

Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 30–45, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Aaron Gokaslan and Vanya Cohen. 2019. OpenWebText corpus.

Michael Hanna, Ollie Liu, and Alexandre Variengien. 2023. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model.

Adi Haviv, Ori Ram, Ofir Press, Peter Izsak, and Omer Levy. 2022. Transformer language models without positional encodings still learn positional information. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1382–1390, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Ronghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko. 2018. Explainable neural computation via stack neural module networks. In Proceedings of the European Conference on Computer Vision (ECCV).

Alon Jacovi, Oren Sar Shalom, and Yoav Goldberg. 2018. Understanding convolutional neural networks for text classification. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 56–65, Brussels, Belgium. Association for Computational Linguistics.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models.

Andrej Karpathy, Justin Johnson, and Li Fei-Fei. 2015. Visualizing and understanding recurrent networks.

Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. 2023. The impact of positional encoding on length generalization in transformers.

Tushar Khot, Daniel Khashabi, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2021. Text modular networks: Learning to decompose tasks in the language of existing models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1264–1279, Online. Association for Computational Linguistics.

Louis Kirsch, Julius Kunze, and David Barber. 2018. Modular networks: Learning to decompose neural computation. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis, Minnesota. Association for Computational Linguistics.

André F. T. Martins and Ramón F. Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML’16, pages 1614–1623. [Link].

Pedro Henrique Martins, Zita Marinho, and André F. T. Martins. 2020. Sparse text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4252–4273, Online. Association for Computational Linguistics.

Kevin Meng, David Bau, Alex J Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems.

Vlad Niculae and Mathieu Blondel. 2017. A regularized framework for sparse and structured neural attention. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

OpenAI. 2023. GPT-4 technical report.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.

Ben Peters, Vlad Niculae, and André F. T. Martins. 2019. Sparse sequence-to-sequence models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1504–1519, Florence, Italy. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to generate reviews and discovering sentiment.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. 2021. Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 300–325, Online. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Andreas Veit, Michael J Wilber, and Serge Belongie. 2016. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.

Elena Voita, Rico Sennrich, and Ivan Titov. 2019a. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4396–4406, Hong Kong, China. Association for Computational Linguistics.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1264–1274, Melbourne, Australia. Association for Computational Linguistics.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019b. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2022. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent abilities of large language models. Transactions on Machine Learning Research. Survey Certification.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open pre-trained transformer language models.

Sumu Zhao, Damian Pascual, Gino Brunner, and Roger Wattenhofer. 2021. Of non-linearity and commutativity in BERT.

A N-gram-Detecting Neurons

A.1 Number of N-grams Triggering a Neuron

Figure 12 shows how neurons in each layer are categorized by the number of bigrams covering them, and Figure 13 shows the same for trigrams. As expected, neurons in larger models are covered by fewer n-grams.

Figure 12: Neurons categorized by the number of bigrams able to trigger them. First half of the network, alive neurons only.

Figure 13: Neurons categorized by the number of trigrams able to trigger them. First half of the network, alive neurons only.

A.2 Trigram-Detecting Neurons

Similarly to token-detecting neurons in Section 4.2, we also find neurons that are specialized on 3-grams. Specifically, we (1) pick neurons that are covered by only 1-50 trigrams, (2) gather the trigrams that are covered by this neuron (a trigram counts as covered if the neuron is activated at least 95% of the time the trigram is present), and (3) treat the neuron as trigram-detecting if, altogether, these covered trigrams are responsible for at least 95% of the neuron's activations.

Figure 14 shows the results. Overall, the results further support our main observations: larger models have more neurons responsible for n-grams. Interestingly, when looking at trigrams rather than tokens, at 30b parameters we see a drastic jump in the number of covered n-grams. This indicates that one of the qualitative differences between larger and smaller models lies in the expansion of the families of features they are able to represent.

Figure 14: (a) Number of trigram-detecting neurons; (b) number of trigrams that have a detecting neuron. Solid line: per layer; dashed line: cumulative over layers.

A.3 Ensemble-Like Layer Behavior

Figure 15 shows the number of covered trigrams in each layer. We see that in each layer, models cover largely new trigrams.

Figure 15: Number of trigrams covered in each layer, with indicated (i) new overall, and (ii) new compared to the previous layer.

B Positional Neurons

B.1 Mutual Information

For each neuron, we evaluate mutual information between two random variables:

• act: whether the neuron is activated or not ({Y, N}),
• pos: token position ({1, 2, ..., T}).

Formal setting. We gather neuron activations for full-length data (i.e., T = 2048 tokens) for Wikipedia, DM Mathematics and Codeparrot. Let $\mathit{fr}_n^{(pos)}$ be the activation frequency of neuron n at position pos and $\mathit{fr}_n$ be the total activation frequency of this neuron.

Then the desired mutual information is as follows:

\[
I(act, pos) = \sum_{act} \sum_{pos=1}^{T} p(pos) \cdot p(act \mid pos) \cdot \log \frac{p(act \mid pos)}{p(act)}.
\]

Since we only feed full-length texts, all positions appear with the same frequency, p(pos) = 1/T, and therefore

\[
\begin{aligned}
I(act, pos)
&= \frac{1}{T} \cdot \sum_{act \in \{Y, N\}} \sum_{pos=1}^{T} p(act \mid pos) \cdot \log \frac{p(act \mid pos)}{p(act)} \\
&= \frac{1}{T} \cdot \sum_{pos=1}^{T} \left[ p(act = Y \mid pos) \cdot \log \frac{p(act = Y \mid pos)}{p(act = Y)} + \bigl(1 - p(act = Y \mid pos)\bigr) \cdot \log \frac{1 - p(act = Y \mid pos)}{1 - p(act = Y)} \right] \\
&= \frac{1}{T} \cdot \sum_{pos=1}^{T} \left[ \mathit{fr}_n^{(pos)} \cdot \log \frac{\mathit{fr}_n^{(pos)}}{\mathit{fr}_n} + \bigl(1 - \mathit{fr}_n^{(pos)}\bigr) \cdot \log \frac{1 - \mathit{fr}_n^{(pos)}}{1 - \mathit{fr}_n} \right].
\end{aligned}
\]

Figure 16: Positional neurons in the 350m model. Each circle corresponds to a single neuron, colors and their intensity correspond to the types of patterns shown in Figure 7.
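To make the final expression of Appendix B.1 concrete, the following is a minimal sketch of how it could be evaluated for a single neuron. It is not the authors' code. The function name `positional_mutual_information` is hypothetical, the use of NumPy and of the natural logarithm are assumptions (the text does not state a log base), and the total frequency is taken as the mean of the per-position frequencies, which follows from p(pos) = 1/T. Terms of the form 0 · log 0 are treated as 0.

```python
import numpy as np


def positional_mutual_information(fr_pos):
    """Estimate I(act, pos) for one neuron.

    fr_pos: per-position activation frequencies, one value in [0, 1] for
            each of the T positions (hypothetical input format).
    """
    fr_pos = np.asarray(fr_pos, dtype=np.float64)
    # With uniform p(pos) = 1/T, the total activation frequency is the mean.
    fr = fr_pos.mean()

    # A neuron that never (or always) fires carries no positional information.
    if fr == 0.0 or fr == 1.0:
        return 0.0

    def term(p, q):
        # Element-wise p * log(p / q), using the convention 0 * log(0) = 0.
        out = np.zeros_like(p)
        mask = p > 0
        out[mask] = p[mask] * np.log(p[mask] / q)
        return out

    # Average over positions of the "activated" and "not activated" terms.
    per_position = term(fr_pos, fr) + term(1.0 - fr_pos, 1.0 - fr)
    return float(per_position.mean())


if __name__ == "__main__":
    # Activation independent of position: mutual information is (close to) 0.
    print(positional_mutual_information([0.3] * 8))
    # Activation fully determined by position: equals log 2, about 0.693 nats.
    print(positional_mutual_information([1, 1, 0, 0]))
```

In this sketch a neuron whose activation frequency is the same at every position scores zero, while a neuron that fires on exactly half of the positions and never elsewhere reaches the maximum of log 2 for a binary variable.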

B.2 Positional Neurons for the 350m Model

The results are shown in Figure 16.
