KNN-LM and the Memorization Dilemma
Weijia Shi
Slides adapted from Akari Asai’s tutorial on Retrieval-augmented Language Models (ACL 2023)
Human-level intelligence?
Dataset Contamination
(Train on test subset unintentionally)
Copyright and Privacy Risks
This Talk
Memorization Dilemma of Language Models
How do such parametric LMs work?
P(x_n \mid x_1, x_2, \dots, x_{n-1})
Training: the LM is trained on a large-scale pre-training corpus (e.g., 1T tokens) containing text such as "The capital city of Ontario is Toronto".
Inference: given "The capital city of Ontario is _____", the LM assigns a probability to each candidate next token:
Toronto 0.52
Ottawa 0.31
Vancouver 0.13
Montreal 0.03
Calgary 0.01
…
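As a minimal sketch of the next-token step above: a parametric LM scores every vocabulary item and normalizes the scores with a softmax. The logits below are hypothetical, back-computed from the probabilities on the slide so that the output matches them; a real model would produce logits from its parameters.

```python
import math

def softmax(logits):
    """Normalize raw scores into a probability distribution."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

# Probabilities shown on the slide for "The capital city of Ontario is _____".
probs_on_slide = {"Toronto": 0.52, "Ottawa": 0.31, "Vancouver": 0.13,
                  "Montreal": 0.03, "Calgary": 0.01}

# Hypothetical logits (assumption): log-probabilities stand in for model scores.
logits = {tok: math.log(p) for tok, p in probs_on_slide.items()}

next_token_probs = softmax(logits)
prediction = max(next_token_probs, key=next_token_probs.get)
print(prediction, round(next_token_probs[prediction], 2))
```

Greedy decoding simply takes the highest-probability token; sampling would draw from `next_token_probs` instead.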
LMs know what they shouldn’t?
Training time: Pretraining Corpus → LM
Risks:
• Dataset contamination
• Copyright infringement
• Privacy risk
Hallucinations in LM outputs
Catastrophic errors as a result of LM hallucinations
Detecting when LMs know what they should not know
Detecting Pretraining Data from LLMs
Question: Is GPT pretrained on text X?
Answer: Member or Non-member
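The slides do not spell out how the member / non-member decision is made. One published score for this task is Min-K% Prob, which averages the log-probabilities of the least likely tokens in X: text seen during pretraining tends to contain fewer very surprising tokens. The sketch below assumes per-token log-probabilities are already available from the model; the example values, `k`, and `threshold` are all hypothetical.

```python
def min_k_prob(token_logprobs, k=0.2):
    """Average log-probability of the k fraction of least likely tokens."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

def is_member(token_logprobs, threshold=-4.0, k=0.2):
    """Label the text as a member if its score exceeds a tuned threshold.

    The threshold is an assumption; in practice it is chosen on held-out data.
    """
    return min_k_prob(token_logprobs, k) > threshold

# Hypothetical per-token log-probabilities for two texts.
seen_text = [-0.1, -0.3, -0.2, -1.0, -0.5, -0.4, -0.2, -0.6, -0.1, -0.9]
unseen_text = [-0.2, -6.5, -0.4, -7.1, -0.3, -5.0, -0.8, -0.2, -1.1, -0.6]

print(is_member(seen_text), is_member(unseen_text))
```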
Detecting Copyrighted Books in Pretraining Data
Memorize knowledge in its parameters
Training: Corpus → LM
Example context: "Harry felt Greenback collapse against … on the floor as a jet of"
Candidate next tokens: green, red, light, water, enemy, liquid, …
Hallucinations in LM outputs
Memorize knowledge in its parameters + external knowledge during inference
Training: Corpus → LM
Inference: LM + Datastore
Example context: "Harry felt Greenback collapse against … on the floor as a jet of"
Candidate next tokens: green, red, light, water, enemy, liquid, …
Retrieval-augmented LMs (RAGs)
Retrieval-augmented LM overview
Test context x: "Jobs is the CEO of _"
Retriever → retrieved document d_i: "… in his parents' garage"
LM (given d_i and x) → "Apple"
How do RAGs solve the memorization dilemma?
Retrieval-based LM
Training: Corpus → LM
Inference: LM + Datastore
Hallucinations: look up the datastore
Two kinds of data: copyrighted and private data vs. permissively-licensed data
Training (x → y): permissively-licensed data only
• Very low legal risk, but poor performance (small-size data, domain shift)
Datastore (used at inference): copyrighted and private data
• Can trace inherent attribution
  • Likely supports a fair-use defense
  • Provide copyright notice
  • Allow credits (or payment) to data creators
• Can modify the datastore at any time
  • Support removal of data at any time
Min, Gururangan, Wallace, Shi et al. ICLR 2024, Spotlight. "SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore"
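The two datastore properties listed for SILO, attribution and removal at any time, can be illustrated with a toy structure. This is only a sketch under stated assumptions: the `Datastore` class, its method names, and the word-overlap lookup are hypothetical and stand in for a real nonparametric index.

```python
from dataclasses import dataclass, field

@dataclass
class Datastore:
    # doc_id -> (text, owner); the owner field is what makes attribution possible
    entries: dict = field(default_factory=dict)

    def add(self, doc_id, text, owner):
        self.entries[doc_id] = (text, owner)

    def remove_owner(self, owner):
        """Drop every entry from one data creator, e.g. after an opt-out request."""
        self.entries = {i: e for i, e in self.entries.items() if e[1] != owner}

    def lookup(self, query):
        """Toy retrieval by shared words; returns (doc_id, owner) for each hit."""
        q = set(query.lower().split())
        return [(doc_id, owner) for doc_id, (text, owner) in self.entries.items()
                if q & set(text.lower().split())]

store = Datastore()
store.add("d1", "a jet of green light", owner="Author A")
store.add("d2", "Toronto is in Ontario", owner="Author B")

hits_before = store.lookup("green light")   # attribution travels with each hit
store.remove_owner("Author A")              # no retraining of the LM is needed
hits_after = store.lookup("green light")
print(hits_before, hits_after)
```

Because the LM's parameters never saw the datastore text, removing an entry takes effect at the next inference call.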
Retrieval-augmented LMs (RAGs)
1. Why do we need RAGs?
2. Architectures of RAGs
Categorization of retrieval-augmented LMs
Query
Input
LM
Two representative architectures
What to retrieve: tokens
How to use the retrieval: output interpolation
Tokens + output interpolation: kNN-LM (Khandelwal et al., 2020), kNN-Prompt (Shi et al., 2022)
REPLUG (Shi et al., 2023)
Test context x: "Jobs is the CEO of _"
Retriever → retrieved document d_i: "… in his parents' garage"
Black-box LM (given d_i and x) → "Apple"
REPLUG inference: input reformulation
1. Document retrieval: for the test context x = "Jobs is the CEO of _", the retriever returns documents d_i, e.g. "Jobs cofounded Apple in his parents' garage", "Jobs was raised by adopted…", "Steve Jobs passed away…".
2. Input reformulation: each retrieved document d_i is prepended to x separately, and the LM is run once per (d_i, x) pair.
3. Each run yields its own next-token distribution (apple, pear, not, ...); the distributions are combined to predict "Apple".
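A minimal sketch of the combination step, assuming (as in the REPLUG paper) that each document's distribution is weighted by a softmax over its retrieval score. The scores and per-document distributions below are hypothetical numbers, not model outputs.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def replug_ensemble(retrieval_scores, per_doc_probs):
    """Weighted average of the LM's next-token distributions, one per document."""
    weights = softmax(retrieval_scores)
    vocab = per_doc_probs[0].keys()
    return {tok: sum(w * p[tok] for w, p in zip(weights, per_doc_probs))
            for tok in vocab}

# Hypothetical retriever scores for the three documents d_i.
retrieval_scores = [2.0, 1.0, 0.5]

# Hypothetical LM outputs for "d_i + Jobs is the CEO of _", one per document.
per_doc_probs = [
    {"apple": 0.8, "pear": 0.1, "not": 0.1},
    {"apple": 0.4, "pear": 0.3, "not": 0.3},
    {"apple": 0.5, "pear": 0.2, "not": 0.3},
]

combined = replug_ensemble(retrieval_scores, per_doc_probs)
print(max(combined, key=combined.get))
```

Since the LM is only called on reformulated inputs and read at its output, it can stay a black box.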
kNN-LM (Khandelwal et al. 2020)
Parametric distribution: P_{LM}(y \mid x)
Nonparametric distribution: P_{kNN}(y \mid x)
\lambda: interpolation hyperparameter
P_{kNN-LM}(y \mid x) = (1 - \lambda) P_{LM}(y \mid x) + \lambda P_{kNN}(y \mid x)
Khandelwal et al. Generalization through Memorization: Nearest Neighbor Language Models. ICLR 2020.
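The interpolation can be sketched in a few lines. The datastore here is a toy: keys are hypothetical 2-d context vectors and values are the tokens that followed those contexts. The nonparametric distribution is taken as a softmax over negative squared distances to the k nearest keys, which follows the kNN-LM paper; `k`, `temperature`, `lam`, and all numbers are assumptions for illustration.

```python
import math
from collections import defaultdict

def knn_distribution(query, keys, values, k=2, temperature=1.0):
    """P_kNN: aggregate softmax(-distance) weights of the k nearest keys by token."""
    dists = [(sum((q - c) ** 2 for q, c in zip(query, key)), tok)
             for key, tok in zip(keys, values)]
    nearest = sorted(dists, key=lambda t: t[0])[:k]
    weights = [math.exp(-d / temperature) for d, _ in nearest]
    z = sum(weights)
    p = defaultdict(float)
    for w, (_, tok) in zip(weights, nearest):
        p[tok] += w / z
    return dict(p)

def knn_lm(p_lm, p_knn, lam=0.25):
    """P_kNN-LM = (1 - lam) * P_LM + lam * P_kNN."""
    vocab = set(p_lm) | set(p_knn)
    return {t: (1 - lam) * p_lm.get(t, 0.0) + lam * p_knn.get(t, 0.0)
            for t in vocab}

# Toy datastore: (context vector, next token) pairs.
keys = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
values = ["Toronto", "Toronto", "Ottawa"]

p_lm = {"Toronto": 0.4, "Ottawa": 0.6}   # hypothetical parametric distribution
p_knn = knn_distribution([0.0, 0.1], keys, values, k=2)
mixed = knn_lm(p_lm, p_knn, lam=0.5)
print(max(mixed, key=mixed.get))
```

In this toy case the parametric LM alone prefers "Ottawa", while the retrieved neighbors shift the interpolated prediction to "Toronto".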
Training of the retriever
Dense retriever overview
• DPR (Karpukhin et al., 2020)
• SimCSE (Gao et al., 2021)
Our customizable retriever: Instructor 👨🏫
Problem: existing task-specific retrievers struggle to generalize to new tasks
Input X: Who sings the song “Love Story”?
• Task 1: Retrieve similar questions
• Task 2: Retrieve docs from Wiki
• Task 3: Classify the question
Our approach: a single retriever, customized for each task by a written instruction
Instructor inference
Simply write an instruction
Query: "Encode the Wiki question for retrieving supporting docs" + "Who sings the song Love Story?"
Doc 1: "Love Story" is a song by singer Taylor Swift … (cos sim: 0.8)
Doc 2: Love Story is a 1970 American romantic drama film written … (cos sim: 0.1)
The query and each document are encoded by INSTRUCTOR and compared by cosine similarity.
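The ranking step can be sketched as follows. The embeddings are hypothetical 2-d vectors chosen so that the cosine similarities match the 0.8 and 0.1 on the slide; in practice they would come from encoding the instruction plus text with the retriever.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank(query_vec, doc_vecs):
    """Return (similarity, doc name) pairs, most similar first."""
    scored = [(cosine(query_vec, vec), name) for name, vec in doc_vecs.items()]
    return sorted(scored, reverse=True)

# Hypothetical embeddings (assumption), not real model outputs.
query_vec = [1.0, 0.0]
doc_vecs = {
    "Doc 1": [0.8, 0.6],               # the Taylor Swift song
    "Doc 2": [0.1, math.sqrt(0.99)],   # the 1970 film
}

ranked = rank(query_vec, doc_vecs)
for sim, name in ranked:
    print(name, round(sim, 2))
```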
Instructor benefits
Training
Train INSTRUCTOR on 330 datasets, each paired with an instruction, for example:
• Text Similarity: "Measuring the similarity between sentences:" How can I be a good geologist? / What should I do to be a great geologist?
• Question Answering: "Retrieve documents that can help answer the question:" Why do rockets look white?
• Fact Checking: "Find documents that can help verify the fact:" The Ten Commandments is an epic film.
• Sentiment Analysis: "Classify the sentiment of the sentence:" You should see their decadent dessert menu
• ...
Dataset: Natural Questions. Instruction: Encode the Wikipedia question to retrieve supporting documents
Instruction format
Our instruction format contains three elements: text type, task objective, domain
Dataset: Natural Questions. Instruction: Encode the Wikipedia question to retrieve supporting documents
Dataset: IMDB Classification. Instruction: Encode the movie review to classify emotions as positive or negative
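A small sketch of how the three elements could be assembled into an instruction string. The template "Encode the {domain} {text type} to {task objective}" is inferred from the two examples on the slide and is an assumption, as is the helper name `build_instruction`.

```python
def build_instruction(text_type, task_objective, domain=""):
    """Compose an instruction from text type, task objective, and optional domain."""
    subject = f"{domain} {text_type}".strip()
    return f"Encode the {subject} to {task_objective}"

nq = build_instruction("question", "retrieve supporting documents", "Wikipedia")
imdb = build_instruction("review", "classify emotions as positive or negative", "movie")
print(nq)
print(imdb)
```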
MEDI: large-scale instruction finetuning datasets
330 datasets, each paired with an instruction, covering tasks such as:
• Text Similarity ("Measuring the similarity between sentences:")
• Question Answering ("Retrieve documents that can help answer the question:")
• Fact Checking ("Find documents that can help verify the fact:")
• Sentiment Analysis ("Classify the sentiment of the sentence:")
• ...
Training the Retriever with MEDI
Query x: "Represent the Wiki question for retrieving supporting docs" + "Who sings the song Love Story?"
Doc 1 (y^+, pulled close to x): "Love Story" is a song by singer Taylor Swift …
Doc 2 (y^-, pushed far from x): Love Story is a 1970 American romantic drama film written …
The query and both documents are encoded by INSTRUCTOR.
L = \frac{e^{s(x, y^+)/\gamma}}{\sum_{y \in \{y^+, y^-, \dots\}} e^{s(x, y)/\gamma}}
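A minimal sketch of this contrastive objective. The slide shows the softmax ratio itself; the code minimizes its negative logarithm, which is the usual way such an objective is optimized and is an assumption here, as are the similarity values and the temperature `gamma=0.1`.

```python
import math

def contrastive_loss(sim_pos, sim_negs, gamma=0.1):
    """-log( e^{s(x,y+)/gamma} / sum_y e^{s(x,y)/gamma} ), computed stably."""
    logits = [sim_pos / gamma] + [s / gamma for s in sim_negs]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_z)

# Hypothetical similarities: positive doc close (0.8), negative doc far (0.1).
good = contrastive_loss(sim_pos=0.8, sim_negs=[0.1])
# Swapped: the retriever prefers the wrong document, so the loss is large.
bad = contrastive_loss(sim_pos=0.1, sim_negs=[0.8])
print(good < bad)
```

Lowering the loss raises s(x, y^+) relative to s(x, y^-), which is the "close" and "far" behavior described above.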
Evaluation
Train: 330 datasets. Eval: 70 datasets.
Evaluation task categories:
• Text Evaluation: 4 datasets
• Prompt Retrieval: 11 datasets
• Classification: 12 datasets
• Retrieval: 15 datasets
• Semantic Similarity: 10 datasets
• Reranking: 4 datasets
• Pair Classification: 3 datasets
• Clustering: 11 datasets
Training of LMs for RAG
Problem
LMs fail to use information in the context (Liu et al., 2023)
(Figure: example whose surviving label is "Ramon y Cajal")
Lack of context understanding
It impacts:
1. Retrieval augmentation
2. In-context learning
3. Multidocument reasoning
Challenges of retrieval-based LMs
Problem: LMs fail to understand information in the context (Liu et al., 2023)
(Figure: example whose surviving label is "Ramon y Cajal")
Why does it happen?
How are LMs pretrained?
• Objective: Predict the next token based on the prior input context
• Input contexts: Concatenate random documents in the same context window
• Consequence: the prior docs provide no signal for predicting the next doc
Proposed: In-Context Pretraining
Place related docs in the same context
• Doc 1: Paris is bisected by the River Seine, which flows …
• Doc 2: For 2022, FIFA set the prize money at $42m, the highest so far
• Doc 3: World Cup never awarded > $10M before 2022 …
Unrelated pairing: Doc 1 followed by Doc 2.
Related pairing (In-Context Pretraining): Doc 3 followed by Doc 2.
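The slides do not say how related documents are found or ordered. As a rough sketch only, the code below substitutes word-overlap (Jaccard) similarity and a greedy nearest-neighbor chain, then packs consecutive documents into the same context; the function names, the paraphrased documents, and `docs_per_context` are all hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity, a stand-in for a learned retriever."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def order_related(docs):
    """Greedy chain: always append the unvisited doc most similar to the last one."""
    remaining = list(range(len(docs)))
    order = [remaining.pop(0)]
    while remaining:
        last = docs[order[-1]]
        nxt = max(remaining, key=lambda i: jaccard(last, docs[i]))
        remaining.remove(nxt)
        order.append(nxt)
    return order

def pack(docs, order, docs_per_context=2):
    """Split the ordered documents into consecutive context windows."""
    return [[docs[i] for i in order[j:j + docs_per_context]]
            for j in range(0, len(order), docs_per_context)]

docs = [
    "World Cup never awarded more than $10M before 2022",
    "Paris is bisected by the River Seine",
    "For 2022 FIFA set the World Cup prize money at $42m",
    "The Seine flows through Paris to the English Channel",
]

order = order_related(docs)
contexts = pack(docs, order)
for ctx in contexts:
    print(ctx)
```

With this ordering the two World Cup documents share one context and the two Seine documents share another, so earlier text in a window carries signal for predicting later text.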
In-Context Pretraining overview
• In-Context Learning: 66% → 71% (7.5% gain)
• Knowledge conflicts: 44% → 51% (15.9% gain)
23 datasets in total
Q&A
Thank you for listening!