Information Retrieval and
Web Analytics
Andreas Kaltenbrunner & Ricardo Baeza-Yates
Recommender systems
(Part I)
References
• Mining of Massive Datasets (Chapter 9).
[Link]
• Recommender Systems Handbook
Francesco Ricci, Lior Rokach, Bracha Shapira
Springer
Introduction
• Recommender System is a form of information filtering
system which seeks to predict user ratings of items.
• Recommender System finds application in music,
movies, news, books, research articles and many other.
Application areas
In the Social Web
Why using Recommender Systems?
• Value for the customer
– Find things that are interesting
– Narrow down the set of choices
– Help to explore the space of options
– Discover new things
– Entertainment
– …
• Value for the provider
– Offer unique personalized service for the customer
– Increase trust and customer loyalty
– Increase sales, click through rates, conversion, etc.
– Opportunities for promotion, persuasion
– Obtain more knowledge about customers
– …
Jeff Bezos
• “If I have 3 million
customers on the Web,
I should have 3 million
stores on the Web”
– Jeff Bezos, CEO of
[Link]
– Degree in Computer
Science
– US$83.6 billion
estimated (November
2020), ranked no. 1 in
the Forbes list of
world’s billionaires
Long-Tails Graph in RS
Distribution of popularity among items or products in a
marketplace.
Problem domain
• Recommendation systems (RS) help to match users with items
– Ease information overload
– Sales assistance (guidance, advisory, persuasion, …)
RS are software agents that elicit the interests and preferences of
individual consumers […] and make recommendations accordingly.
They have the potential to support and improve the quality of the
decisions consumers make while searching for and selecting
products online.
• [Xiao & Benbasat, MISQ, 2007]
• Different system designs / paradigms
– Based on availability of exploitable data
– Implicit and explicit user feedback
– Domain characteristics
Recommender systems
• RS seen as a function
• Given:
– User model (e.g., ratings, preferences, demographics,
situational context)
– Items (with or without description of item
characteristics)
• Find:
– Relevance score. Used for ranking.
• Finally:
– Recommend items that are assumed to be relevant
• But:
– Remember that relevance might be context-dependent
– Characteristics of the list itself might be important
(diversity)
Paradigms of recommender systems
Collaborative: "Tell me what's
popular among my peers"
Collaborative filtering models can recommend an item to
user A based on the interests of a similar user B.
Paradigms of recommender systems
Content-based: "Show me
more of the same I've liked"
Content-based filtering uses item features to recommend
other items similar to what the user likes, based on their
previous actions or explicit feedback.
Paradigms of recommender systems
Content-based: "Show me more of the same I've liked"
Content-based filtering uses item features to recommend
other items similar to what the user likes, based on their
previous actions or explicit feedback.
Paradigms of recommender systems
Knowledge-based: "Tell me
what fits based on my needs"
Paradigms of recommender systems
Hybrid: combinations of various
inputs and/or composition of
different mechanism
Recommender systems: basic techniques
Pros Cons
Collaborative No knowledge-engineering Requires some form of rating
effort, serendipity of results, feedback, cold start for new
learns market segments users and new items
Content-based No community required, Content descriptions
comparison between items necessary, cold start for
possible new users, no surprises
Knowledge-based Deterministic Knowledge engineering
recommendations, assured effort to bootstrap, basically
quality, no cold-start, can static, does not react to
resemble sales dialogue short-term trends
COLLABORATIVE FILTERING
Collaborative Filtering (CF)
• The most prominent approach to generate recommendations
– used by large, commercial e-commerce sites
– well-understood, various algorithms and variations exist
– applicable in many domains (book, movies, DVDs, ..)
• Approach
– use the "wisdom of the crowd" to recommend items
• Basic assumption and idea
– Users give ratings to catalog items (implicitly or explicitly)
Explicit
• users specify how much they liked for example a particular movie by
providing a numerical rating.
• this approach usually does not scale because only a small fraction
of users leave ratings and reviews
Implicit:
• E.g., a user watches a movie => system infers that she/he likes it.
– Customers who had similar tastes in the past, will have similar
tastes in the future
User-based nearest-neighbor collaborative
filtering (1)
• The basic technique:
– Given an "active user" (Alice) and an item I not yet
seen by Alice
– Goal is to estimate Alice's rating for item I, e.g., by
• find a set of users (peers) who liked the same items as
Alice in the past and who have rated item I
• use, e.g., the average of their ratings to predict, if Alice
will like item I
• do this for all items Alice has not seen and
recommend the best-rated
Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 3
User2 4 3 4 3 5
User3 3 3 1 5 4
User4 1 5 5 2 1
User-based nearest-neighbor collaborative
filtering (2)
• Some first questions
– How do we measure similarity?
– How many neighbors should we consider?
– How do we generate a prediction from the
neighbors' ratings?
Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 3
User2 4 3 4 3 5
User3 3 3 1 5 4
User4 1 5 5 2 1
Measuring user similarity
Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 3 sim = 0,85
User2 4 3 4 3 5 sim = 0,71
User3 3 3 1 5 4 sim = 0,00
User4 1 5 5 2 1 sim = -0,79
Making predictions
• is the average rating of target user a.
• N is the neighborhood of most similar users.
• d is the average rating of neighbour user b.
• is the similarity between user a and b.
• is the rating assigned by user b to item p.
Item-based collaborative filtering
• Basic idea:
– Use the similarity between items (and not
users) to make predictions
• Example:
– Look for items that are similar to Item5
– Take Alice's ratings for these items to
predict the rating for Item5
Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 3
User2 4 3 4 3 5
User3 3 3 1 5 4
User4 1 5 5 2 1
Item-based Collaborative Filtering
sij: similarity of items i and j
rxj: rating of user x on item j
N(i;x): set items rated by x similar to i
Item-based CF (|N|=2)
users
movies
- unknown rating - rating between 1 to 5
Item-based CF (|N|=2)
movies
- estimate rating of movie 1 by user 5
Item-based CF (|N|=2)
Here we use Pearson correlation as similarity:
Neighbor selection: 1) Subtract mean rating mi from each movie i
Identify movies similar to m1 = (1+3+5+5+4)/5 = 3.6
row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
movie 1, rated by user 5 2) Compute Pearson correlation (i.e., cosine similarity if avg
Item-based CF (|N|=2)
Compute similarity weights:
s1,3=0.41, s1,6=0.59
Item-based CF (|N|=2)
Predict by taking weighted average:
r1.5 = (0.41*2 + 0.59*3) / (0.41+0.59) = 2.6
Before:
CF: Common Practice
• Define similarity sij of items i and j
• Select k nearest neighbors N(i; x)
– Items most similar to i, that were rated by x
• Estimate rating rxi as the weighted average:
baseline estimate for rxi ◼ μ = overall mean movie rating
◼ bx = rating deviation of user x
= (avg. rating of user x) – μ
◼ bi = rating deviation of movie i
Item-Item vs. User-User
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
◼ In practice, it has been observed that item-item often works
better than user-user
◼ Why? Items are simpler, users have multiple tastes
Example CF: Item vs Item
ITEMS
1 2 3 4
1 5 1 3
b3=(1+1+4+3)/4-μ=2.25-3= -0.75
U b4=(3+2+3+5)/4-μ=3.25-3= 0.25
2 1 2
S b63=μ+b6+b3=3+1-0.75= 3.25
E b64=μ+b6+b4=3+1+0.25= 4.25
R 3
r61= b61+(s13·(r63-b63)
S +s14·(r64-b64)) /(s13+s14)
4 1 r61=3.5+(0.48·(3-3.25)
+0.32·(5-4.25))/(0.48+0.32)
r61=3.68
5 4 4 3
6 ? 4 3 5
Average ratings: s12=0 s13=0.48 s14=0.32
μ = (5+1+3+2+1+4+3+5+1+4+3+4)/12 = 36/12 = 3
bx= b6=(4+3+5)/3-μ = 12/3-3 = 1
bi = b1=(1+4)/2-μ = 2.5-3 = -0.5
bxi = b61=μ+bx+bi = 3+1-0.5 = 3.5
Memory-based and model-based
approaches
• User-based CF is said to be "memory-based"
– the rating matrix is directly used to find neighbors /
make predictions
– does not scale for most real-world scenarios
– large e-commerce sites have tens of millions of
customers and millions of items
• Model-based approaches
– based on an offline pre-processing or "model-learning"
phase
– at run-time, only the learned model is used to make
predictions
– models are updated / re-trained periodically
– large variety of techniques used
– model-building and updating can be computationally
expensive
Model-based approaches
• Plethora of different techniques proposed in the last
years, e.g.,
– Matrix factorization techniques, statistics
• singular value decomposition, principal component
analysis
– Association rule mining
• compare: shopping basket analysis
– Probabilistic models
• clustering models, Bayesian networks, probabilistic
Latent Semantic Analysis
– Various other machine learning approaches
• Costs of pre-processing
– Usually not discussed
– Incremental updates possible?
The Netflix Prize
•
The Netflix Utility Matrix R
480,000 users
Matrix R 1 3 4
3 5 5
4 5 5
3
17,700
movies 3
2 2 2
5
2 1 1
3 3
1
The Netflix Utility Matrix R
480,000 users
Matrix R 1 3 4
3 5 5
4 5 5
3
17,700
movies 3
Training Data Set 2 ? ? Test Data Set
?
2 1 ?
3 ? True rating of
user x on item i
1
Predicted rating
BellKor Recommender System
• The winner of the Netflix Challenge!
• Multi-scale modeling of the data: Global effects
Combine top level, “regional”
modeling of the data, with
a refined, local view:
– Global: Factorization
• Overall deviations of users/movies
– Factorization:
• Addressing “regional” effects Collaborative
filtering
– Collaborative filtering:
• Extract local patterns
Modeling Local & Global Effects
• Global:
– Mean movie rating: 3.7 stars
– The Sixth Sense is 0.5 stars above avg.
– Joe rates 0.2 stars below avg.
⇒ Baseline estimation:
Joe will rate The Sixth Sense 4 stars
• Local neighborhood (CF/NN):
– Joe didn’t like related movie Signs
– ⇒ Final estimate:
Joe will rate The Sixth Sense 3.8 stars
Recap: Collaborative Filtering (CF)
• Earliest and most popular collaborative
filtering method
• Derive unknown ratings from those of “similar”
movies (item-item variant)
• Define similarity measure sij of items i and j
• Select k-nearest neighbors, compute the rating
– N(i; x): items most similar to i that were rated
by x
sij… similarity of items i and j
rxj…rating of user x on item j
N(i;x)… set of items similar to
item i that were rated by x
Recap: deviations in classic CF
• In practice we get better estimates if we
model deviations:
baseline estimate for rxi
Problems/Issues:
1) Similarity measures are “arbitrary”
2) Pairwise similarities neglect
interdependencies among users
μ = overall mean rating 3) Taking a weighted average can be
bx = rating deviation of user x restricting
= (avg. rating of user x) – μ Solution: Instead of sij use wij that we
bi = (avg. rating of movie i) – μ estimate directly from data
Recommendations via Optimization 1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
• Goal: Make good recommendations 3 3
1
– Quantify goodness using RMSE:
Lower RMSE ⇒ better recommendations
– Want to make good recommendations on
items that user has not yet seen.
– Let’s set build a system such that it works
well on known (user, item) ratings
And hope the system will also predict well the
unknown ratings
Recommendations via Optimization
SSE = sum of squared errors
Predicted rating True
rating
Latent Factor Models
• “SVD” on Netflix data: R ≈ Q · PT SVD: A = U Σ VT
users factors
users
items
factors
≈items
PT
R Q
• For now, let’s assume we can approximate the rating
matrix R as a product of “thin” Q · PT
– R has missing entries but let’s ignore that for now!
• Basically, we will want the reconstruction error to be small on
known ratings and we don’t care about the values on the missing
ones
Latent Factor Models (e.g., SVD)
Serious Braveheart
The Color Amadeus
Purple
Lethal Weapon
Sense and
Sensibility Ocean’s 11
Geared Geared
towards towards
females males
The Lion King
The Princess Independence
Diaries Day
Dumb and
Funny Dumber
Ratings as Products of Factors
• How to estimate the missing rating of
user x for item i ?
qi = row i of Q
px = column x of PT
Ratings as Products of Factors
• How to estimate the missing rating of
user x for item i ?
qi = row i of Q
px = column x of PT
Ratings as Products of Factors
• How to estimate the missing rating of
user x for item i ?
qi = row i of Q
px = column x of PT
ȓxi=-0.5·(-2)+0.6·0.3+0.5·2.4=2.38
Latent Factor Models
Serious Braveheart
The Color Amadeus
Purple
Lethal Weapon
Sense and
Sensibility Ocean’s 11
Geared Factor 1Geared
towards towards
females males
The Lion King
Factor 2
The Princess Independence
Diaries Day
Dumb and
Funn Dumber
y
Latent Factor Models
Serious Braveheart
The Color Amadeus
Purple
Lethal Weapon
Sense and
Sensibility Ocean’s 11
Geared Factor 1Geared
towards towards
females males
The Lion King
Factor 2
The Princess Independence
Diaries Day
Dumb and
Funn Dumber
y
Recap: SVD
n n
• Remember SVD:
– A: Input data matrix
V T
Σ
– U: Left singular vecs m
– V: Right singular vecs
A ≈m
– Σ: Singular values
– U,V are orthonormal. U
We do not need this here, we do not care, we just
want something that works
• In our case:
“SVD” on Netflix data: R ≈ Q · PT
A = R, Q = U, PT = Σ VT
Latent Factor Models
•
FINDING THE LATENT
FACTORS
Back to Our Problem
• Want to minimize SSE for unseen test data
• Idea: Minimize SSE on training data
– Want large k (# of factors) to capture all the
signals
– But SSE on test data begins to rise for k > 2
1 3 4
3 5 5
• This is a classical example of overfitting:
4 5 5
3
3
2 ? ?
– With too much freedom (too many free 2 1
3
?
?
?
parameters) the model starts fitting noise
1
• I.e., it fits too well the training data and thus not
generalizing well to unseen test data
Detour: Minimizing a function
•
Dealing with Missing Entries
1 3 4
3 5 5
4 5 5
3
3
• To solve overfitting, we introduce 2 ?
2 1
?
?
regularization:
3 ?
1
– Allow rich model where there is sufficient
data
– Shrink aggressively where data is scarce
“error” “length”
λ1, λ2 … user set regularization parameters
Note: We do not care about the “raw” value of the objective function,
but we care in P,Q that achieve the minimum of the objective
Summary: Recommender Systems
Main paradigms:
• Collaborative filtering: based on user or
item similarity.
• Content-based filtering: relies on item
features and user profiles.
• Knowledge-based and hybrid
approaches: combine structured knowledge
and multiple methods.
Summary: Recommender Systems
• Collaborative filtering (CF):recommending
items based on similar users or items.
• Two types: user-based and item-based CF.
• CF enables serendipity and captures
evolving community preferences.
•Model-based methods, such as matrix
factorization (SVD), have advanced scalability
and accuracy.
•Key challenges remain:
•Cold-start problem
•Data sparsity
•Balancing accuracy, novelty, and diversity