KNN-LM and the Memorization Dilemma

The document discusses the 'Memorization Dilemma' faced by language models (LMs), highlighting their ability to recall inappropriate information while failing to retrieve relevant data. It addresses issues such as dataset contamination, copyright risks, and the phenomenon of hallucinations in LM outputs. The presentation also explores solutions like retrieval-augmented LMs (RAGs) that aim to mitigate these challenges by integrating external knowledge during inference.


Memorization Dilemma of Language Models

Knowing What They Shouldn't, Yet Missing What They Should

Weijia Shi

Slides adapted from Akari Asai’s tutorial on Retrieval-augmented Language Models (ACL 2023)
1
Human-level intelligence?
Dataset Contamination
(Unintentionally training on a subset of the test data)
Copyright and Privacy Risks
This Talk
Memorization Dilemma of Language Models

Knowing what they shouldn’t

5
How do such parametric LMs work?
P(x_n | x_1, x_2, ⋯, x_{n−1})

[Figure: a language model (Transformers), trained on a large-scale pretraining corpus (e.g., 1T tokens), reads the context "The capital city of Ontario is" (x_1 x_2 ⋯ x_{n−1}) and outputs a distribution over the next token.]

Next token   Probability
Toronto      0.52
Ottawa       0.31
Vancouver    0.13
Montreal     0.03
Calgary      0.01
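To make the next-token distribution on this slide concrete, here is a minimal sketch of how raw model scores become P(x_n | x_1, ⋯, x_{n−1}). The logits are hypothetical: they are simply the logarithms of the probabilities shown on the slide, so the softmax recovers the same numbers. A real Transformer would compute the logits from the context.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for the context "The capital city of Ontario is".
# They are the logs of the slide's probabilities (an assumption for illustration).
slide_probs = {"Toronto": 0.52, "Ottawa": 0.31, "Vancouver": 0.13,
               "Montreal": 0.03, "Calgary": 0.01}
logits = {tok: math.log(p) for tok, p in slide_probs.items()}

probs = softmax(logits)
prediction = max(probs, key=probs.get)  # greedy choice of the next token
print(prediction, round(probs[prediction], 2))
```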
6
How do such parametric LMs work?

Training time: the LM is trained on "The capital city of Ontario is Toronto"
Test time: the LM completes "The capital city of Ontario is _____"

7
LMs know what they shouldn’t?
Pretraining corpus (used at training time): public data; copyrighted and private data; benchmark data
Resulting risks in the LM: dataset contamination, copyright infringement, privacy risk

8
This Talk
Memorization Dilemma of Language Models

Knowing what they shouldn’t Yet missing what they should

10
Hallucinations in LM outputs
Catastrophic Errors as Results of LM Hallucinations

12
This Talk
Memorization Dilemma of Language Models

Knowing what they shouldn’t Yet missing what they should

Detecting when it happens

Solving the dilemma

14
Detecting when LMs know what they should not know

Detecting Pretraining Data from Large Language Models


Shi et al., ICLR 2024

16
LMs know what they shouldn’t?
Pretraining corpus (used at training time): other data; copyrighted and private data; benchmark data
Resulting risks in the LM: dataset contamination, copyright infringement, privacy risk

17
Detect Pretraining Data from LLMs

Membership* Inference Attack (MIA)
*Member: has been used in training

Given a piece of text and black-box access to an LLM (only output logits), can we determine if the model was pretrained on the provided text?
Metrics: loss / entropy, calibrated likelihood, …

Is GPT pretrained on
X?

GPT

🕵 Detecting copyrighted or private data from black-box LMs


Min-K% Prob

19
[Figure: token probabilities of a member text vs. a non-member text]
A non-member tends to include outlier words with low probability
Our Method: Min-K% Prob
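The observation above (non-members tend to contain a few very unlikely tokens) can be turned into a score. The following is a minimal sketch of the idea behind Min-K% Prob as described in Shi et al.: average the log-probabilities of the k% least likely tokens. The per-token probabilities, the value of k, and the decision threshold are all made up for illustration; in practice the probabilities would come from the LM being audited.

```python
import math

def min_k_percent_prob(token_log_probs, k=0.2):
    """Average log-probability of the k fraction of lowest-probability tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    n = max(1, int(len(token_log_probs) * k))
    lowest = sorted(token_log_probs)[:n]  # the n least likely tokens
    return sum(lowest) / n

# Hypothetical per-token probabilities assigned by an LM to two texts.
member = [math.log(p) for p in
          [0.9, 0.8, 0.7, 0.85, 0.6, 0.9, 0.75, 0.8, 0.7, 0.65]]
non_member = [math.log(p) for p in
              [0.9, 0.8, 0.001, 0.85, 0.6, 0.9, 0.002, 0.8, 0.7, 0.65]]

score_member = min_k_percent_prob(member)
score_non_member = min_k_percent_prob(non_member)

THRESHOLD = -2.0  # hypothetical; would be tuned on held-out data
for name, score in [("member", score_member), ("non-member", score_non_member)]:
    verdict = "likely seen in pretraining" if score > THRESHOLD else "likely unseen"
    print(f"{name}: score={score:.2f} -> {verdict}")
```

Only the lowest-probability tokens enter the score, so a handful of outlier words is enough to pull a non-member's score down even when most of its tokens are predictable.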

21
Detecting Copyrighted Books in Pretraining Data
This Talk
Memorization Dilemma of Language Models

Knowing what they shouldn’t Yet missing what they should

Detecting when it happens

Solving the dilemma

23
Memorize knowledge in its parameters

[Figure: an LM is trained on a corpus. Given "Harry felt Greenback collapse against … on the floor as a jet of", it scores candidate next tokens: green, red, light, water, enemy, liquid.]
24
Hallucinations in LM outputs
Memorize knowledge in its parameters
+ external knowledge during inference

[Figure: an LM is trained on a corpus and, at inference, also reads from a datastore. Given "Harry felt Greenback collapse against … on the floor as a jet of", it scores candidate next tokens: green, red, light, water, enemy, liquid.]

Retrieval-augmented LM (RAGs)
26
Retrieval-augmented LM overview
[Figure: given the test context x = "Jobs is the CEO of _", the retriever performs document retrieval and returns a retrieved document d_i = "Jobs cofounded Apple in his parents' garage". After input reformulation, the LM predicts "Apple".]

27
How do RAGs solve the memorization dilemma?
Retrieval-based LM: the LM is trained on a corpus and reads from a datastore at inference.

Hallucinations: look up the datastore
Privacy and copyright risks: store the sensitive data in the datastore
31
[Figure: data split into "copyright and private" and "permissively-licensed"; training maps x to y]

32
Training: permissively-licensed data
Very low legal risk, but poor performance (small-size data, domain shift)

Inference: datastore with copyrighted and private data
Can trace inherent attribution
• Likely a fair-use defense
• Provide copyright notice
• Allow credits (or payment) to data creators
Can modify the datastore at any time
• Support removal of data at any time
Min, Gururangan, Wallace, Shi et al. ICLR 2024, Spotlight. "SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore"
33
Retrieval-augmented LMs (RAGs)
1. Why do we need RAGs?
2. Architectures of RAGs
3. Training of the retriever
4. Training of the LMs
[Figure: retrieval-based LM with corpus, LM, and datastore]
34
Categorization of retrieval-augmented LMs

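What to retrieve? Text chunks (passages)? Tokens?
How to use retrieval? At the LM's input? At the LM's output?
[Figure: a query drives retrieval; the retrieved content enters the LM at its input or its output]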

35
Two representative architectures

What          How      Approach               Examples
Text chunks   Input    Input augmentation     REALM (Guu et al., 2020), REPLUG (Shi et al., 2023)
Tokens        Output   Output interpolation   kNN-LM (Khandelwal et al., 2020), kNN-Prompt (Shi et al., 2022)

36
REPLUG (Shi et al., 2023)

[Figure: given the test context x = "Jobs is the CEO of _", the retriever performs document retrieval and returns a retrieved document d_i = "Jobs cofounded Apple in his parents' garage". After input reformulation, a black-box LM predicts "Apple".]
37
REPLUG inference: input reformulation

1. Document retrieval: for the test context x = "Jobs is the CEO of _", the retriever returns several documents d_i:
   • "Jobs cofounded Apple in his parents' garage"
   • "Jobs was raised by adopted…"
   • "Steve Jobs passed away…"
2. Input reformulation: each retrieved document is prepended to x separately, giving one input per document.
3. Ensemble: the black-box LM is run on every input and produces one next-token distribution each (over tokens such as "apple", "pear", "not", ...). The distributions are ensembled into the final prediction, "Apple".
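The prepend-and-ensemble procedure on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `toy_lm` is a hypothetical stand-in for the black-box LM, the retrieval scores are invented, and weighting each distribution by a softmax over the retrieval scores is one way to form the ensemble (the slide itself only says "Ensemble").

```python
import math

def ensemble(distributions, scores):
    """Weighted average of next-token distributions; weights = softmax(scores)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    vocab = set().union(*distributions)
    return {tok: sum(w * d.get(tok, 0.0) for w, d in zip(weights, distributions))
            for tok in vocab}

def toy_lm(prompt):
    """Hypothetical stand-in for a black-box LM: returns a made-up distribution."""
    if "cofounded Apple" in prompt:
        return {"apple": 0.8, "pear": 0.1, "not": 0.1}
    return {"apple": 0.4, "pear": 0.3, "not": 0.3}

x = "Jobs is the CEO of _"
docs = [
    "Jobs cofounded Apple in his parents' garage",
    "Jobs was raised by adopted…",
    "Steve Jobs passed away…",
]
scores = [2.0, 0.5, 0.5]  # hypothetical retriever similarity scores

# One LM call per retrieved document: the document is prepended to the context.
dists = [toy_lm(d + " " + x) for d in docs]
final = ensemble(dists, scores)
print(max(final, key=final.get))
```

Because the LM is only called through its inputs and output probabilities, nothing here requires access to its parameters, which is why the slide labels it a black box.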

41
kNN-LM (Khandelwal et al. 2020)

A different way of using retrieval, where the LM outputs a nonparametric distribution over every token in the data.

Can be seen as an incorporation in the “output” layer

43
Khandelwal et al. Generalization through Memorization: Nearest Neighbor Language Models. ICLR 2020.
kNN-LM (Khandelwal et al. 2020)

Parametric distribution

Datastore text:
… Obama was senator for Illinois from 1997 to 2005, …. Barack is married to Michelle and their first daughter, … Obama was born in Hawaii, and graduated from Columbia University. … Obama is a native of Hawaii, ….

Which tokens in a datastore are close to the next token?
Which vectors in a datastore are close to the vector we have?

Final distribution: interpolate the parametric distribution with the nonparametric distribution
λ: hyperparameter
P_{kNN-LM}(y | x) = (1 − λ) P_{LM}(y | x) + λ P_{kNN}(y | x)
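The interpolation above can be worked through on a toy example. This sketch assumes a tiny hand-made datastore of (context vector, next token) pairs and a kNN distribution that is a softmax over negative squared distances of the k nearest keys, following the kNN-LM paper. The vectors, the parametric distribution, k, and λ are all hypothetical.

```python
import math
from collections import defaultdict

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def knn_distribution(query, datastore, k=3):
    """P_kNN: softmax over negative distances of the k nearest keys,
    with probability mass summed per stored next token."""
    nearest = sorted(datastore, key=lambda kv: sq_dist(query, kv[0]))[:k]
    weights = [math.exp(-sq_dist(query, key)) for key, _ in nearest]
    z = sum(weights)
    p = defaultdict(float)
    for w, (_, tok) in zip(weights, nearest):
        p[tok] += w / z
    return dict(p)

def interpolate(p_lm, p_knn, lam):
    """P_kNN-LM(y|x) = (1 - lam) * P_LM(y|x) + lam * P_kNN(y|x)."""
    vocab = set(p_lm) | set(p_knn)
    return {y: (1 - lam) * p_lm.get(y, 0.0) + lam * p_knn.get(y, 0.0)
            for y in vocab}

# Hypothetical datastore: key = vector of a context, value = the token that followed.
datastore = [
    ((1.0, 0.0), "Hawaii"),    # "... Obama was born in"
    ((0.9, 0.1), "Hawaii"),    # "... Obama is a native of"
    ((0.0, 1.0), "Illinois"),  # "... Obama was senator for"
    ((0.1, 0.9), "Columbia"),  # "... graduated from"
]
query = (1.0, 0.1)  # hypothetical vector of the test context
p_lm = {"Hawaii": 0.2, "Illinois": 0.5, "Columbia": 0.3}  # made-up parametric distribution

p_knn = knn_distribution(query, datastore, k=3)
p = interpolate(p_lm, p_knn, lam=0.5)
print(max(p, key=p.get))
```

In this toy setup the parametric LM alone prefers "Illinois", while the nearest stored contexts were followed by "Hawaii", so the interpolated distribution shifts toward the datastore's answer.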

50
Training of the retriever

One Embedder, Any Task 👨🏫: Instruction-Finetuned Text Embeddings


Su*, Shi* et al., 2023

53
Dense retriever overview

Query: x = "Harry felt Greenback collapse… on the floor as a jet of"

Datastore passages (each passed through the encoder):
• Voldemort cried, “Avada Kedavra!” A jet of green light issued … from …
• Voldemort’s wand just as a jet of red light …
• “The Boy Who Lived.” He saw the mouth move and a flash of green …

Encoding into a shared vector space: z = Encoder(z), x = Encoder(x)
Fast nearest neighbor search: ẑ = argmax_{z ∈ 𝒵} sim(x, z), where 𝒵 is the datastore
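The retrieval rule argmax over z of sim(x, z) can be sketched with cosine similarity. The embeddings below are hand-picked 3-dimensional vectors standing in for encoder outputs (an assumption: a real dense retriever would produce them with a trained encoder), and the search is brute force rather than the fast approximate nearest neighbor search mentioned on the slide.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve(x_vec, datastore, top_k=1):
    """Return the top_k (text, vector) entries most similar to x_vec."""
    ranked = sorted(datastore, key=lambda item: cosine(x_vec, item[1]), reverse=True)
    return ranked[:top_k]

# Hypothetical embeddings for the three passages on the slide.
datastore = [
    ("Voldemort cried, 'Avada Kedavra!' A jet of green light issued ...", (0.9, 0.1, 0.2)),
    ("Voldemort's wand just as a jet of red light ...", (0.2, 0.9, 0.1)),
    ("'The Boy Who Lived.' He saw the mouth move and a flash of green ...", (0.7, 0.2, 0.6)),
]
x_vec = (1.0, 0.0, 0.1)  # hypothetical embedding of the query x

best_text, best_vec = retrieve(x_vec, datastore)[0]
print(best_text)
```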
Source: Retrieval-based Language Models Tutorial
54
Previous task-specific retrievers

Input X: Who sings the song “Love Story”?

Task 1: Retrieve supporting documents from Wiki
  DPR (Karpukhin et al., 2020): "Love Story" is a song by singer Taylor Swift …
Task 2: Retrieve similar questions
  SimCSE (Gao et al., 2021): Who is the singer of the song “Love Story”?

Problem
Existing task-specific retrievers struggle to generalize to new tasks

55
Our customizable retriever: Instructor 👨🏫

Problem
Existing task-specific retrievers struggle to generalize to new tasks

Our approach
INSTRUCTOR: a customizable retriever tailored to any task without further training

Input X: Who sings the song “Love Story”?
Task 1: Retrieve similar questions: Who is the singer of the song “Love Story”?
Task 2: Retrieve docs from Wiki: "Love Story" is a song by singer Taylor Swift …
Task 3: Classify the question: News / Music / Sports

56
Instructor inference
Simply write an instruction

Query (encoded by INSTRUCTOR)
  Instruction: Encode the Wiki question for retrieving supporting docs
  Text: Who sings the song Love Story?

Doc 1 (encoded by INSTRUCTOR), cos sim: 0.8
  "Love Story" is a song by singer Taylor Swift …

Doc 2 (encoded by INSTRUCTOR), cos sim: 0.1
  Love Story is a 1970 American romantic drama film written …

57
Instructor benefits

Efficient and simple: task-aware embeddings without any further training by simply providing the task instruction

58
Training
INSTRUCTOR is trained on 330 datasets, for example:

Text Similarity
Measuring the similarity between sentences:
How can I be a good geologist?
What should I do to be a great geologist?

Question Answering
Retrieve documents that can help answer the question:
Why do rockets look white?

Fact Checking
Find documents that can help verify the fact:
The Ten Commandments is an epic film.

Sentiment Analysis
Classify the sentiment of the sentence:
You should see their decadent dessert menu

...

59
Instruction format
Our instruction format contains three elements: text type, task objective, domain

Dataset               Instruction
Natural Questions     Encode the Wikipedia question to retrieve supporting documents
SummEval              Encode the Biomedical summary to retrieve duplicate summaries
IMDB Classification   Encode the movie review to classify emotions as positive or negative
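The three-element format can be expressed as a simple template. The helper below is hypothetical (it is not part of any INSTRUCTOR release) and only shows how domain, text type, and task objective combine into the instructions listed in the table.

```python
def build_instruction(domain, text_type, objective):
    """Compose an instruction from the three elements named on the slide."""
    return f"Encode the {domain} {text_type} to {objective}"

# The three rows of the table, rebuilt from their elements.
examples = {
    "Natural Questions": build_instruction(
        "Wikipedia", "question", "retrieve supporting documents"),
    "SummEval": build_instruction(
        "Biomedical", "summary", "retrieve duplicate summaries"),
    "IMDB Classification": build_instruction(
        "movie", "review", "classify emotions as positive or negative"),
}

for dataset, instruction in examples.items():
    print(f"{dataset}: {instruction}")
```

Changing only the objective (for example from retrieval to classification) yields a different embedding task for the same text, which is what lets one model serve many tasks.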

61
MEDI: large-scale instruction-finetuning datasets
330 datasets, for example:

Text Similarity
Measuring the similarity between sentences:
How can I be a good geologist?
What should I do to be a great geologist?

Question Answering
Retrieve documents that can help answer the question:
Why do rockets look white?

Fact Checking
Find documents that can help verify the fact:
The Ten Commandments is an epic film.

Sentiment Analysis
Classify the sentiment of the sentence:
You should see their decadent dessert menu

...

62
Training the Retriever with MEDI
Query x (encoded by INSTRUCTOR)
  Represent the Wiki question for retrieving supporting docs
  Who sings the song Love Story?

Doc 1: y⁺ (encoded by INSTRUCTOR; pulled close to x)
  "Love Story" is a song by singer Taylor Swift …

Doc 2: y⁻ (encoded by INSTRUCTOR; pushed far from x)
  Love Story is a 1970 American romantic drama film written …

L = e^{s(x, y⁺)/γ} / Σ_{y ∈ {y⁺, y⁻, ...}} e^{s(x, y)/γ}
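A small numeric sketch of the contrastive objective on this slide: the ratio of the exponentiated positive similarity to the sum over the positive and the negatives. The similarity values and the temperature γ are illustrative assumptions, and taking the negative log of the ratio as the quantity to minimize is the usual convention rather than something stated on the slide.

```python
import math

def contrastive_ratio(sim_pos, sim_negs, gamma):
    """e^{s(x,y+)/gamma} divided by the sum over y+ and all negatives."""
    scores = [sim_pos / gamma] + [s / gamma for s in sim_negs]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[0] / sum(exps)

GAMMA = 0.1  # hypothetical temperature

# Well-separated pair: the positive is close to x, the negative is far.
good = contrastive_ratio(sim_pos=0.8, sim_negs=[0.1], gamma=GAMMA)
# Badly ordered pair: the negative is closer to x than the positive.
bad = contrastive_ratio(sim_pos=0.1, sim_negs=[0.8], gamma=GAMMA)

print(f"ratio (good) = {good:.4f}, loss = {-math.log(good):.4f}")
print(f"ratio (bad)  = {bad:.4f}, loss = {-math.log(bad):.4f}")
```

The ratio approaches 1 when the positive document is much more similar to the query than the negatives, so training pushes y⁺ closer and y⁻ farther in the embedding space.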

63
Evaluation
Trained on 330 datasets, evaluated on 70 datasets: ⬆ 7% compared with the best baseline

Evaluation task       Datasets
Text Evaluation       4
Prompt Retrieval      11
Classification        12
Retrieval             15
Semantic Similarity   10
Reranking             4
Pair Classification   3
Clustering            11
64
Training of LMs for RAG

In-Context Pretraining: Language Modeling Beyond Document Boundaries


Shi et al., ICLR 2024 Spotlight

66
Problem
LMs fail to use information in the context (Liu et al., 2023)
[Figure: LM example with the answer "Ramon y Cajal"]
67
Lack of context understanding
It impacts:
1. Retrieval augmentation
2. In-context learning
3. Multidocument reasoning
69
Challenges of retrieval-based LMs …

Problem
LMs fail to understand information in the context (Liu et al., 2023)
Why does it happen?
[Figure: LM example with the answer "Ramon y Cajal"]
70
How are LMs pretrained?
• Objective: Predict the next token based on the prior input context
• Input contexts: Concatenate random documents in the same context window

Standard pretraining, one 10K context window:
Doc 1: Paris is bisected by the River Seine, which flows …
Doc 2: For 2022, FIFA set the prize money at $42m, [next tokens to predict: "the highest so far"]

The prior docs provide no signal for predicting the next doc.
(In the figure, different colors indicate different doc. topics.)
72
No training signals from prior documents
• Lack of long documents during pretraining
~13% of CommonCrawl documents contain > 1K tokens
[Figure: CommonCrawl sequence length distribution, from the blog “In the long (context) run”]
73


Proposed: In-Context Pretraining
Place related docs in the same context

Standard:
Doc 1: Paris is bisected by the River Seine, which flows …
Doc 2: For 2022, FIFA set the prize money at $42m, [next tokens to predict: "the highest so far"]

In-Context Pretraining:
Doc 3: World Cup never awarded > $10M before 2022 …
Doc 2: For 2022, FIFA set the prize money at $42m, [next tokens to predict: "the highest so far"]

Encourage LMs to read and reason across document boundaries
76
Pretraining documents, grouped by topic:
World Cup: World Cup never award …; For 2022, FIFA set the …; Messi scored seven …
Paris: Paris is bisected by …; Paris, France's capital …
…

In-Context Pretraining input context:
World Cup never awarded > $10M before 2022 … For 2022, FIFA set the prize money at $42m, [next tokens to predict: "the highest so far"]

Standard input context:
Paris is bisected by the River Seine, which flows … For 2022, FIFA set the prize money at $42m, [next tokens to predict: "the highest so far"]
77
In-Context Pretraining overview
It only changes the document ordering during pretraining

[Figure: pretraining documents (Doc 0, Doc 1, Doc 3, Doc 8, Doc 9, …; different colors indicate different doc. topics) are arranged into a document path (Doc 2, Doc 0, Doc 9, Doc 1, Doc 3, …) that is fed to the LM for pretraining.]

Step 1: Find Related Docs
Step 2: Create Input Contexts
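The two steps can be sketched on toy data. This is a simplified illustration, not the procedure from the paper: it assumes each document already has an embedding (hand-picked here), orders documents with a greedy nearest-neighbor walk, and cuts the resulting path into contexts holding a fixed number of documents instead of a fixed number of tokens.

```python
def similarity(u, v):
    """Dot-product similarity between two document embeddings."""
    return sum(a * b for a, b in zip(u, v))

def order_documents(embeddings, start=0):
    """Step 1 (greedy sketch): walk from each doc to its most similar unvisited doc."""
    path = [start]
    unvisited = set(range(len(embeddings))) - {start}
    while unvisited:
        current = embeddings[path[-1]]
        # sorted() makes tie-breaking deterministic (lowest index wins)
        nxt = max(sorted(unvisited), key=lambda j: similarity(current, embeddings[j]))
        path.append(nxt)
        unvisited.remove(nxt)
    return path

def create_input_contexts(path, docs_per_context):
    """Step 2 (sketch): cut the path into consecutive groups of documents."""
    return [path[i:i + docs_per_context] for i in range(0, len(path), docs_per_context)]

# Hypothetical 2-d topic embeddings: first axis ~ "World Cup", second ~ "Paris".
embeddings = [
    (1.0, 0.0),  # Doc 0: World Cup
    (0.0, 1.0),  # Doc 1: Paris
    (0.9, 0.1),  # Doc 2: World Cup
    (0.1, 0.9),  # Doc 3: Paris
]

path = order_documents(embeddings)
contexts = create_input_contexts(path, docs_per_context=2)
print(path, contexts)
```

Only the order in which documents are concatenated changes; the documents themselves and the next-token objective stay the same.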


80
Summary

Task                      Standard   In-Context Pretraining   Improvement
Retrieval Augmentation    38%        42%                      10.5%
In-Context Learning       66%        71%                      7.5%
Reading Comprehension     37%        43%                      14.0%
Knowledge conflicts       44%        51%                      15.9%
Long Document Reasoning   32%        34%                      7.5%

23 datasets in total
81
Q&A
Thank you for listening!

Slides adapted from Akari Asai’s tutorial on Retrieval-augmented Language Models (ACL 2023)

83
