Machine Learning Labs Overview
LAB 01 – Overview of Machine Learning, Tool & Library
LAB 02 – Machine Learning Project
LAB 03 – Regression Techniques
Arguments:
x_new - a starting value of x that will get updated based on the
learning rate
x_prev - the previous value of x that is getting updated to the new one
y_list.append(function(x_new))
x_min = x_new
print ("Local minimum occurs at: "+ str(x_min))
print ("Number of steps: " + str(len(x_list)))
[Link](x, function(x))
[Link](x_list,y_list,c="r")
[Link](0, color='y', alpha = 0.7)
[Link](0, color='y', alpha = 0.7)
[Link]()
[Link]("Function (x ** 3)-(3 *(x ** 2))+7 ")
[Link]()
[Link](x_list,y_list,c="g")
[Link](x_list,y_list,c="g")
[Link](x,function(x), c="r")
[Link]([1.0,2.1])
[Link]("Zoomed in Gradient descent to Key Area")
[Link]()
[Link]()
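The fragment above is the tail of a gradient-descent routine on f(x) = x ** 3 - 3 * x ** 2 + 7 whose header did not survive. The following is a minimal sketch of such a routine, not the lab's original code: the names `function`, `deriv` and `step`, the `max_steps` guard, and the starting point, learning rate and precision in the demo call are all assumptions chosen for illustration.

```python
def function(x):
    # The function named in the plot title above
    return x ** 3 - 3 * x ** 2 + 7


def deriv(x):
    # Analytic derivative of function(x)
    return 3 * x ** 2 - 6 * x


def step(x_new, x_prev, precision, l_r, max_steps=10_000):
    """Gradient descent: update x until successive values differ by <= precision."""
    x_list, y_list = [x_new], [function(x_new)]
    while abs(x_new - x_prev) > precision and len(x_list) < max_steps:
        x_prev = x_new
        x_new = x_prev - l_r * deriv(x_prev)  # move against the gradient
        x_list.append(x_new)
        y_list.append(function(x_new))
    return x_new, x_list, y_list


# Hypothetical settings: start at 0.5, precision 0.001, learning rate 0.05
x_min, x_list, y_list = step(0.5, 0.0, 0.001, 0.05)
print("Local minimum occurs at: " + str(x_min))
print("Number of steps: " + str(len(x_list)))
```

With these settings the iterates approach the local minimum at x = 2 from the left. A larger learning rate or a start below x = 0 can diverge, because the cubic is unbounded below.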
def meansquarelogerror(ytrue, ypred):
    trainingexamples = ytrue.shape[0]
    print("Total no of training examples is " + str(trainingexamples))
    errorvector = [Link]([Link](ypred))-[Link]([Link](ytrue))
    print("Shape of error vector " + str(errorvector.shape))
    error = (1/trainingexamples) * np.sum(errorvector**2)
    return error

def sigmoid(arrays):
    return 1/(1 + np.exp((-1)*arrays))

def crossentropy(ytrue, ypred):
    trainingexamples = ytrue.shape[0]
    errorvector = -1 * ytrue * np.log(ypred)
    return (1/trainingexamples) * np.sum(errorvector)
)
trainingsets = [Link](100,1)
trainingsets_pred = [Link](100,1)
error = meanabsoluteerror(trainingsets,trainingsets_pred)
print(f"Mean Absolute Error is {error}\n")
error = meansquarelogerror(trainingsets,trainingsets_pred)
print(f"Mean Square Log Error is {error}\n")
sig_train=sigmoid(trainingsets)
sig_pred=sigmoid(trainingsets_pred)
print(trainingsets[:5])
print(sig_train[:5])
error = crossentropy(sig_train,sig_pred)
print(f"Cross Entropy Loss is {error}\n")
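The driver code above calls `meanabsoluteerror`, but its definition is not in the surviving text. A minimal sketch under the usual definition (the mean of |ypred - ytrue|) could look like this. The body is an assumption, not the lab's original, and the three-element demo arrays are made up for illustration.

```python
import numpy as np


def meanabsoluteerror(ytrue, ypred):
    # Mean of the absolute element-wise differences
    ytrue = np.asarray(ytrue, dtype=float)
    ypred = np.asarray(ypred, dtype=float)
    return float(np.mean(np.abs(ypred - ytrue)))


# Hypothetical demo values: errors are 0.5, 0.0 and 1.0
example_error = meanabsoluteerror([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(f"Mean Absolute Error is {example_error}")
```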
LAB 04 – Regression Techniques (cont)
[Link]()
# Sigmoid (Logistic Function)
x = np.linspace(-10, 10, 50)
y_elu_manually, y_elu_library = [ELU(i) for i in x], [Link](x)
PlotActivationFuction(x, y_elu_manually, y_elu_library, "ELU Relu Activation Manually and Library")
In [76]:
x = np.arange(-5.0, 5.0, 0.01)
dx = x[1]-x[0]
y_sigmoid = [Link](x)
dydx_sigmoid = np.gradient(y_sigmoid, dx)
PlotDerivative(x, y_sigmoid, dydx_sigmoid, title = "Sigmoid Functions and Its Derivatives")
y_relu = [Link](x)
dydx_relu = np.gradient(y_relu, dx)
PlotDerivative(x, y_relu, dydx_relu, title = "Relu Functions and Its Derivatives")
y_tanh = [Link](x)
dydx_tanh = np.gradient(y_tanh, dx)
PlotDerivative(x, y_tanh, dydx_tanh, title = "Tanh Functions and Its Derivatives")
y_swish = [Link](x)
dydx_swish = np.gradient(y_swish, dx)
PlotDerivative(x, y_swish, dydx_swish, title = "Swish Functions and Its Derivatives")
y_elu = [Link](x)
dydx_elu = np.gradient(y_elu, dx)
PlotDerivative(x, y_elu, dydx_elu, title = "ELU Functions and Its Derivatives")
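The library calls that produce the activation values above were lost in extraction. As a stand-in, here is a NumPy-only sketch of the same activations with a numerical derivative. The function names `sigmoid`, `relu`, `swish` and `elu` and the default `alpha=1.0` are assumptions, not the calls the lab used.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def relu(x):
    return np.maximum(0.0, x)


def swish(x):
    # Swish: x * sigmoid(x)
    return x * sigmoid(x)


def elu(x, alpha=1.0):
    # ELU: x for x > 0, alpha * (exp(x) - 1) otherwise
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))


x = np.arange(-5.0, 5.0, 0.01)
dx = x[1] - x[0]

# Numerical derivative of the sigmoid via finite differences
dydx_sigmoid = np.gradient(sigmoid(x), dx)

# Compare against the closed form sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
analytic = sigmoid(x) * (1.0 - sigmoid(x))
max_gap = np.max(np.abs(dydx_sigmoid - analytic))
print("Largest gap between numerical and analytic derivative:", max_gap)
```

Comparing against the closed form is a quick sanity check that the step `dx` is small enough for the plots to be trustworthy.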
LAB 05 – Classification Techniques
import csv
import math
import numpy as np

class KNeighborsClassifier:
    def __init__(self, k=3):
        self.k = k

def loadDataTest(path):
    f = open(path, "r")
    data = csv.reader(f)  # CSV format
    data = np.array(list(data))  # convert to a matrix
    data = np.delete(data, 0, 0)  # delete the header row
    data = np.delete(data, 0, 1)  # delete the index column
    np.random.shuffle(data)  # shuffle the rows
    f.close()
    testSet = data[:]  # use every row as test data
    return testSet

def calcDistancs(pointA, pointB, numOfFeature=4):
    # Euclidean distance over the first numOfFeature features
    tmp = 0
    for i in range(numOfFeature):
        tmp += (float(pointA[i]) - float(pointB[i])) ** 2
    return math.sqrt(tmp)

def kNearestNeighbor(trainSet, point, k):
    # Return the labels of the k training items closest to point
    distances = []
    for item in trainSet:
        distances.append({
            "label": item[-1],
            "value": calcDistancs(item, point)
        })
    distances.sort(key=lambda x: x["value"])
    labels = [item["label"] for item in distances]
    return labels[:k]

def findMostOccur(arr):
    # Majority vote: return the label that occurs most often in arr
    labels = set(arr)  # distinct labels
    ans = ""
    maxOccur = 0
    for label in labels:
        num = arr.count(label)
        if num > maxOccur:
            maxOccur = num
            ans = label
    return ans
if __name__ == "__main__":
    trainSet, testSet = loadData("[Link]")
    testSet1 = loadDataTest("Iris_classification.csv")
    for item in testSet:
        knn = kNearestNeighbor(trainSet, item, 5)
        answer = findMostOccur(knn)
        print("label: {} -> predicted: {}".format(item[-1], answer))
    # for item in testSet1:
    #     knn = kNearestNeighbor(trainSet, item, 11)
    #     answer = findMostOccur(knn)
    #     print("label: {} -> predicted: {}".format(item[-1], answer))
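The same nearest-neighbour vote can be written more compactly with NumPy. The sketch below is an alternative formulation for comparison, not the lab's solution: the function name `knn_predict` and the six-point toy data set are made up for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, point, k=3):
    # Euclidean distance from point to every training row
    distances = np.sqrt(((train_X - point) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest labels
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups
train_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
train_y = ["a", "a", "a", "b", "b", "b"]

prediction_near_a = knn_predict(train_X, train_y, np.array([1.1, 1.0]), k=3)
prediction_near_b = knn_predict(train_X, train_y, np.array([5.0, 5.1]), k=3)
print(prediction_near_a, prediction_near_b)
```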
LAB 06 – Classification Techniques (cont)
Naive Bayes
In this question you will implement a Naive Bayes classifier for a text classification problem. You will
be given a collection of text articles, each coming from either the serious European magazine The
Economist, or from the not-so-serious American magazine The Onion. The goal is to learn a classifier
that can distinguish between articles from each magazine. We have pre-processed the articles so that
they are easier to use in your experiments. We extracted the set of all words that occur in any of the
articles. This set is called the vocabulary and we let V be the number of words in the vocabulary. For
each article, we produced a feature vector X = ⟨X_1, ..., X_V⟩, where X_i is equal to 1 if the i-th word
appears in the article and 0 otherwise.
Each article is also accompanied by a class label of either 1 for The Economist or 2 for The Onion.
Later in the question we give instructions for loading this data into Octave. When we apply the Naive
Bayes classification algorithm, we make two assumptions about the data: first, we assume that our
data is drawn iid from a joint probability distribution over the possible feature vectors X and the
corresponding class labels Y; second, we assume for each pair of features X_i and X_j with i ≠ j that X_i
is conditionally independent of X_j given the class label Y (this is the Naive Bayes assumption). Under
these assumptions, a natural classification rule is as follows: Given a new input X, predict the most
probable class label Ŷ given X. Formally,
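A standard way to write this rule under the two assumptions above is the following. It is a reconstruction in the notation of this section (V features, class label Y), not necessarily the exact form used in the original handout.

```latex
\hat{Y} = \arg\max_{y} P(Y = y \mid X)
        = \arg\max_{y} \frac{P(X \mid Y = y)\, P(Y = y)}{P(X)}
        = \arg\max_{y} P(Y = y) \prod_{i=1}^{V} P(X_i \mid Y = y)
```

The second equality is Bayes' rule. The third drops P(X), which does not depend on y, and factorises P(X | Y = y) using the conditional-independence assumption.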
Decision Trees
Question 1. How many unique, perfect binary trees of depth 3 can be drawn if we have 5 attributes?
By depth, we mean the depth of the splits, not including the nodes that only contain a label. So a tree
that checks just one attribute is a depth 1 tree. By perfect binary tree, we mean every node has either
0 or 2 children, and every leaf is at the same depth. Note also that trees with the same attributes but
organized at different depths are considered “unique”. Do not include trees that test the same
attribute along the same path in the tree.
Question 2: Consider the following dataset for this problem. Given the five attributes on the left, we
want to predict if the student got an A in the course.
Create 2 decision trees for this dataset. For the first, only go to depth 1. For the second go to depth 2.
For all trees, use the ID3 entropy algorithm from class. For each node of the tree, show the decision,
the number of positive and negative examples and show the entropy at that node.
Hint: There are a lot of calculations here. You may want to do this programmatically.
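One possible starting point for doing the calculations programmatically is sketched below. It computes the entropy of a label list and the information gain of a split, which is what ID3 compares at each node. The function names and the four-row toy data set (attributes `early` and `hard`) are hypothetical and are not the dataset from the question.

```python
import math
from collections import Counter


def entropy(labels):
    # H = -sum_c p_c * log2(p_c) over the classes present in labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    # Entropy before the split minus the weighted entropy after it
    total = len(labels)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels)
                  if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: "early" predicts the label perfectly, "hard" not at all
rows = [{"early": 1, "hard": 0}, {"early": 1, "hard": 1},
        {"early": 0, "hard": 0}, {"early": 0, "hard": 1}]
labels = [1, 1, 0, 0]

gain_early = information_gain(rows, labels, "early")
gain_hard = information_gain(rows, labels, "hard")
print(gain_early, gain_hard)
```

ID3 would split on the attribute with the larger gain and recurse on each branch until the depth limit is reached.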
(Bonus) Make one more decision tree. Use the same procedure as in (b), but make it depth 3. Now,
given these three trees, which would you prefer if you wanted to predict the grades of 10 new
students who are not included in this dataset? Justify your choice.
Question 3. Recall the definition of the “realizable case.” For some fixed concept class C, such as
decision trees, a realizable case is one where the algorithm gets a sample consistent with some
concept c ∈ C. In other words, for decision trees, a case is realizable if there is some tree that
perfectly classifies the dataset. If the number of attributes A is sufficiently large, under what condition
would a dataset not be realizable for decision trees of no fixed depth? Prove that the dataset is
unrealizable if and only if that condition is true.
LAB 07 – Classification Techniques (cont)
Question 2
Given a binary data set:
Plot the points. Sketch the support vectors and the decision boundary for a linear SVM
classifier with maximum margin for this data set.
Question 3
Given the binary classification problem:
a) Sketch the points in a scatterplot (preferably with different colors for the different classes).
b) In the plot, sketch the mean values and the decision boundary you would get with a
Gaussian classifier with covariance matrix Σ = σ²I, where I is the identity matrix.
c) What is the error rate of the Gaussian classifier on the training data set?
d) Sketch on the plot the decision boundary you would get using an SVM with linear kernel
and a high cost of misclassifying training data. Indicate the support vectors and the decision
boundary on the plot.
e) What is the error rate of the linear SVM on the training data set?
Question 4
a) Download the two datasets [Link] and [Link].
You can use a library for SVM, e.g. scikit-learn in python.
Hint:
b) Load [Link]. Stick with the linear SVM, but change the C-parameter.
Rerun the experiments a couple of times, and visualize the data using something like the following:
import numpy as np
import matplotlib.pyplot as plt

def make_meshgrid(X, h=0.02):  # header reconstructed from the call below; the default h is an assumption
    """
    Args:
    X: numpy array with shape [n, 2] containing 2d feature vectors.
    h: parameter controlling the resolution of the meshgrid
    """
    x = X[:, 0]
    y = X[:, 1]
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy
def scatter(X, Y, xx, yy, Z):  # header reconstructed from the call below
    """
    Args:
    X: numpy array of shape [n, 2] where n is the total number of datapoints
    Y: numpy array of shape [n] containing the labels {1, 2, 3, ...} of X
    xx: meshgrid x
    yy: meshgrid y
    Z: The result of applying some prediction function on all points in xx and
    yy
    """
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    [Link]()
    # Color class regions
    plt.gca().contourf(xx, yy, Z, alpha=0.7)
    # Data points
    plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=[Link], marker='o',
                edgecolors='k')
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.gca().set_aspect('equal')
    [Link]()
    plt.tight_layout()
    plt.show()
# Given X and Y as explained above, we can display the data and classification
# boundaries as
from sklearn import svm

classifier = svm.SVC(C=100.0, kernel='linear')
classifier.fit(X, Y)
xx, yy = make_meshgrid(X)
Z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
scatter(X, Y, xx, yy, Z)
How do the support vectors and the boundary change with the parameter?
c) Try to remove some of the non-support-vectors and rerun. Does the solution change?
d) Load [Link]. Try various values of the C−parameter with a linear SVM. Can the linear SVM
classifier make a good separation of the feature space?
e) Change the kernel to an RBF (Radial Basis Function) and rerun. Try changing the σ-parameter.
f) Implement a grid search over the C- and σ-parameters based on 10-fold cross-validation of the training data
(the A-dataset). Find the best values of C and σ, retrain on the entire A-dataset, and then test on the B-dataset.
Does the average 10-fold cross-validation estimate of the overall classification error match the result we get
when testing on the independent B-dataset?
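A minimal sketch of such a grid search with scikit-learn follows. Since the A- and B-datasets are not reproduced here, it uses two synthetic `make_moons` samples as stand-ins, and the C and σ ranges are illustrative assumptions rather than the ranges from the lecture slides. scikit-learn parameterises the RBF kernel as exp(-gamma * ||x - x'||²), so a Gaussian width σ corresponds to gamma = 1 / (2σ²).

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-ins for the A (training) and B (independent test) datasets
X_a, y_a = make_moons(n_samples=200, noise=0.2, random_state=0)
X_b, y_b = make_moons(n_samples=200, noise=0.2, random_state=1)

# Hypothetical parameter ranges; sigma is converted to scikit-learn's gamma
sigmas = np.array([0.1, 0.3, 1.0, 3.0])
param_grid = {
    "C": [0.1, 1.0, 10.0, 100.0],
    "gamma": list(1.0 / (2.0 * sigmas ** 2)),
}

# 10-fold cross-validation on A; refit=True (the default) retrains the
# best configuration on all of A afterwards
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10, scoring="accuracy")
search.fit(X_a, y_a)

cv_error = 1.0 - search.best_score_        # mean 10-fold CV error of the best setting
test_error = 1.0 - search.score(X_b, y_b)  # error on the independent set
print(search.best_params_, cv_error, test_error)
```

Comparing `cv_error` with `test_error` answers the last part of the question for whatever data is loaded in place of the stand-ins.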
You can for example use the parameter ranges suggested in the lecture slides:
LAB 09 – Clustering Techniques
1. K-Means
Question 1: Describe K-Means by implementing it by hand
Solution:
import numpy as np
import scipy
from sklearn.cluster import KMeans
"""
The input matrix X has shape (m_sample, n_feature), meaning that each row of X
represents one data point.
If X does not satisfy this condition, it must be reshaped first (mandatory).
"""
def init_centers(X, k):
    # Pick and return k random points from X
    return X[np.random.choice(X.shape[0], k, replace=False)]

def kmeans(X, K):  # header lost in extraction; the name kmeans is an assumption
    centers = [init_centers(X, K)]
    labels = []
    n_iter = 0  # number of iterations
    while True:
        # Assign each data point to a cluster
        labels.append(assign_labels(X, centers[-1]))
        # Update the centers after the assignment step
        new_center = update_centers(X, labels[-1], K)
        # Stop if the new and old centers are identical after the update
        if has_converged(centers[-1], new_center):
            break
        # Otherwise, continue with the new centers
        centers.append(new_center)
        n_iter += 1  # increment the iteration count
    # Return the centers and labels obtained in the last iteration
    return centers[-1], labels[-1], n_iter
means = [[2, 2], [8, 3], [3, 6]]
cov = [[1, 0], [0, 1]]
N = 500
X0 = np.random.multivariate_normal(means[0], cov, N)
X1 = np.random.multivariate_normal(means[1], cov, N)
X2 = np.random.multivariate_normal(means[2], cov, N)
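The main loop above relies on three helpers, `assign_labels`, `update_centers` and `has_converged`, whose definitions are not in the surviving text. The sketch below shows one common way to write them. The bodies are assumptions, and `update_centers` assumes no cluster ends up empty (an empty cluster would produce NaN centers).

```python
import numpy as np


def assign_labels(X, centers):
    # Distance from every point to every center, shape (m, K)
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # Index of the closest center for each point
    return np.argmin(distances, axis=1)


def update_centers(X, labels, K):
    # New center k is the mean of the points currently assigned to cluster k
    centers = np.zeros((K, X.shape[1]))
    for k in range(K):
        centers[k] = X[labels == k].mean(axis=0)
    return centers


def has_converged(centers, new_centers):
    # Converged when the two sets of centers are identical
    return set(map(tuple, centers)) == set(map(tuple, new_centers))


# Hypothetical four-point demo with two obvious clusters
demo_X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
demo_centers = np.array([[0.0, 0.0], [10.0, 10.0]])
demo_labels = assign_labels(demo_X, demo_centers)
demo_new_centers = update_centers(demo_X, demo_labels, 2)
print(demo_labels, demo_new_centers)
```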
2. DBSCAN
Question 1: Implementing DBSCAN
These exercises walk you through a Python implementation of an algorithm for clustering
called DBSCAN, which is short for density-based spatial clustering of applications with noise.
It addresses a limitation of k-means clustering, as described below.
Although there are existing implementations for Python, in this notebook we
are asking you to build it from scratch, albeit using a lot of scaffolding that we have provided.
Setup
Here are the modules you will need for this problem.
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
[Link] (x=x, y=y, data=df, fit_reg=False)
We will work with a synthetic data set that is an especially bad case for k-means.
It's sometimes called the "crater" data because of its shape. The data points
are stored in a file named [Link].
The data files you'll need should be provided in this repository. To load them, you'll
need to wrap the filenames using the function fn(f), defined below. If you are
working locally, you can obtain a copy at the following URL and may need to
modify fn(f) accordingly: [Link]
def fn(fn_base, dirname='./dbscan/'): # `dirname` set by default to its location in our repository
return '{}{}'.format(dirname, fn_base)
# Demo:
fn('[Link]')
Exercise 0 (1 point). Start by reading the data into a Pandas data frame. The data is
stored locally within this assignment in a file called [Link]. Store your data
frame in a variable named crater.
[]
###
### YOUR CODE HERE
###
[]
print ("\n(Passed!)")
The testing code plots these data points, which are 2-D. The colors show the clusters
computed by k-means for k=2. Notice that the "natural" structure is, arguably, a dense ball in
the middle and a ring (donut) on the outside. However, k-means instead split the points
about an arbitrary line that cuts through the middle of the points.
Indeed, this fact is one of the limitations of k-means: it works well when you know the value
of k and the k clusters come from Gaussian distributions of similar shape and size (and,
therefore, density). However, if you don't know k or there is non-uniform shape and density
among the clusters---or some other grouping, as above---then k-means does not work well
(qualitatively).
Elements of DBSCAN
The DBSCAN algorithm takes a different approach. Rather than having to provide the number
of clusters, k, you define parameters related to neighborhoods and target density. Let's see
how DBSCAN works by building it from the ground up.
Neighborhoods
Exercise 1 (1 point). Implement a function that computes the ϵ-neighborhood of p for
a data matrix of points, X, defined by our usual convention as
X = \begin{pmatrix} x_0^T \\ x_1^T \\ \vdots \\ x_{m-1}^T \end{pmatrix}.
In particular, complete the function named region_query(p, eps, X) below. Its
inputs are:
It should return a boolean Numpy array adj[:m] with one entry per point (i.e., per row
of X). The entry adj[i] should be True only if X[i, :] lies within the eps-sized ball
centered at p.
Hint: There is a one-line solution of the form, return (boolean array
expression).
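For reference, one possible one-line form is sketched below. It assumes Euclidean distance and a closed ball (points exactly at distance `eps` count as inside). The exercise text does not state either choice, and the three-point demo array is made up.

```python
import numpy as np


def region_query(p, eps, X):
    # True for each row of X whose Euclidean distance to p is at most eps
    return np.linalg.norm(X - p, axis=1) <= eps


# Hypothetical demo: two points inside the unit ball around the origin, one outside
X_demo = np.array([[0.0, 0.0], [0.5, 0.0], [2.0, 0.0]])
in_ball = region_query(np.array([0.0, 0.0]), 1.0, X_demo)
print(in_ball)
```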
[]
###
### YOUR CODE HERE
###
[]
X = crater[['x_1', 'x_2']].values
p = [Link] ([-0.5, 1.2])
in_region = region_query (p, 1.0, X)
crater_ball = [Link] ()
crater_ball['label'] = in_region
make_scatter_plot (crater_ball, centers=p[[Link]])
print ("\n(Passed!)")
[]
###
### YOUR CODE HERE
###
[]
y_test = [Link] ([True, False, False, True, False, True, True, True, False])
i_soln = set ([0, 3, 5, 6, 7])
print ("\n(Passed!)")
Exercise 3 (1 point). Given a value for ϵ and a data matrix X of points, complete the
function below so that it determines the neighborhood of each point.
Your function,
def find_neighbors(eps, X[:m, :]):
...
should return a Python list neighbors[:m] such that neighbors[i] is the index set
of neighbors of point X[i, :].
[]
[]
print ("\n(Passed!)")
Density
The next important concept in DBSCAN is that of the density of a neighborhood.
Intuitively, the DBSCAN algorithm will try to "grow" clusters around points whose
neighborhoods are sufficiently dense.
Let's make this idea more precise.
Definition: core points. A point p is a core point if its ϵ-neighborhood has at
least s points.
In other words, the algorithm now has two user-defined parameters: the
neighborhood size, ϵ, and the minimum density, specified using a threshold s on the
number of points in such a neighborhood.
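The definition translates almost directly into code. The sketch below assumes `neighbors` is the list of index sets described for `find_neighbors()`. The function name `find_core_points` and the demo list are assumptions made for illustration.

```python
def find_core_points(s, neighbors):
    # A point i is a core point if its neighborhood holds at least s points
    return {i for i, n_i in enumerate(neighbors) if len(n_i) >= s}


# Hypothetical neighborhoods for four points
neighbors_demo = [{0, 1}, {0, 1, 3, 7}, {2}, {1, 3, 4}]
core_demo = find_core_points(3, neighbors_demo)
print(core_demo)
```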
[]
core_set = set ()
###
### YOUR CODE HERE
###
return core_set
[]
print ("\n(Passed!)")
6. If q is also a core point, then add all of its neighbors to the reachable set, per the
definition of "reachability" above.
7. If q is not yet assigned to any cluster, then add it to p's cluster.
Notice how this procedure explores the points reachable from p (Step 6). Intuitively, it
is trying to join all neighborhoods whose core points are mutually contained.
In this picture, suppose the minimum density parameter is s=3 points. Thus, only
the ϵ-neighborhoods centered at 1, 3, and 6 are core points, since these are the only
points that include at least s=3 points. For instance, Nϵ(1)={0,1,3,7}, making it a
core point since its neighborhood has four (4) points, whereas Nϵ(4)={3,4} is not a
core point since its neighborhood has just two (2) points.
• p is the index of a starting core point. The caller must guarantee that it is indeed a
core point, and furthermore, that it has been assigned to some cluster. (See below.)
• neighbors[:] is a list of ϵ-neighborhoods, given as Python sets. For
instance, neighbors[p] is a set of indices of all points in the neighborhood of p. It will
have been computed from find_neighbors() above.
• core_set is a Python set containing the indices of all core points. That is, the
expression, i in core_set, is true only if i is indeed a core point.
• visited is another Python set containing the indices of all points that have already
been visited. That is, the expression i in visited should be True only if i has been
visited. Thus, your expand_cluster() function should update this set when visiting
any previously unvisited point.
• assignment is a Python dictionary. The key is the index of a point; the value is the
cluster label to which that point has been assigned. Consequently, if a point i does
not yet have any cluster assignment, then the expression, i in assignment, will
be False. Your expand_cluster() function should update cluster assignments by
updating this dictionary.
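Putting the argument descriptions together with steps 6 and 7, one way `expand_cluster()` could be written is sketched below. The earlier steps of the procedure are not in the surviving text, so the handling of the reachable set here is an assumption, and the five-point demo is made up.

```python
def expand_cluster(p, neighbors, core_set, visited, assignment):
    # Start from everything in p's neighborhood
    reachable = set(neighbors[p])
    while reachable:
        q = reachable.pop()
        if q not in visited:
            visited.add(q)  # mark q as visited
            if q in core_set:
                # q is a core point: its neighbors become reachable too (step 6)
                reachable |= neighbors[q]
        if q not in assignment:
            # q has no cluster yet: add it to p's cluster (step 7)
            assignment[q] = assignment[p]


# Hypothetical demo: points 0-3 form a chain, point 4 is isolated
demo_neighbors = [{0, 1}, {0, 1, 2}, {1, 2, 3}, {2, 3}, {4}]
demo_core = {1, 2}        # neighborhoods with at least 3 points
demo_visited = {1}        # the caller has already visited the start point
demo_assignment = {1: 0}  # ...and assigned it to cluster 0
expand_cluster(1, demo_neighbors, demo_core, demo_visited, demo_assignment)
print(demo_assignment)
```

The loop terminates because new neighbors are added only when a point is visited for the first time, and each point can be visited once.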
[]
[]
print ("\n(Passed!)")
[]
assignment = {}
next_cluster_id = 0
visited = set ()
for i in core_set: # for each core point i
    if i not in visited:
        visited.add (i) # Mark i as visited
        assignment[i] = next_cluster_id
        expand_cluster (i, neighbors, core_set,
                        visited, assignment)
        next_cluster_id += 1
[]
Fin! If you've reached this point and all tests above pass, you are ready to submit your
solution to this problem. Don't forget to save your work prior to submitting.
Bagging and Pasting
Random Forests
Solution
Exercise: Load the MNIST dataset (introduced in chapter 3) and split it into a training set and a test set
(take the first 60,000 instances for training, and the remaining 10,000 for testing).
The MNIST dataset was loaded earlier.
In [73]:
X_train = mnist['data'][:60000]
y_train = mnist['target'][:60000]
X_test = mnist['data'][60000:]
y_test = mnist['target'][60000:]
Exercise: Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the
resulting model on the test set.
In [74]:
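from sklearn.ensemble import RandomForestClassifier
import time
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)  # assumed to mirror rnd_clf2 below
t0 = time.time()
rnd_clf.fit(X_train, y_train)
t1 = time.time()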
In [76]:
print("Training took {:.2f}s".format(t1 - t0))
Training took 35.27s
In [77]:
from sklearn.metrics import accuracy_score
y_pred = rnd_clf.predict(X_test)
accuracy_score(y_test, y_pred)
Out[77]:
0.9705
Exercise: Next, use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%.
In [78]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
X_train_reduced = pca.fit_transform(X_train)
Exercise: Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was
training much faster?
In [79]:
rnd_clf2 = RandomForestClassifier(n_estimators=100, random_state=42)
t0 = time.time()
rnd_clf2.fit(X_train_reduced, y_train)
t1 = time.time()
In [80]:
print("Training took {:.2f}s".format(t1 - t0))
Training took 81.03s
X_test_reduced = pca.transform(X_test)
y_pred = rnd_clf2.predict(X_test_reduced)
accuracy_score(y_test, y_pred)
Out[81]:
0.9481
In [82]:
from sklearn.linear_model import LogisticRegression
2. Use t-SNE to reduce the MNIST dataset down to two dimensions and plot the
result using Matplotlib. You can use a scatterplot using 10 different colors to represent
each image’s target class. Alternatively, you can write colored digits at the
location of each instance, or even plot scaled-down versions of the digit images
themselves (if you plot all digits, the visualization will be too cluttered, so you
should either draw a random sample or plot an instance only if no other instance
has already been plotted at a close distance). You should get a nice visualization
with well-separated clusters of digits. Try using other dimensionality reduction
algorithms such as PCA, LLE, or MDS and compare the resulting visualizations.
Solution
np.random.seed(42)
m = 10000
idx = np.random.permutation(60000)[:m]
X = mnist['data'][idx]
y = mnist['target'][idx]
plt.figure(figsize=(9,9))
cmap = [Link].get_cmap("jet")
for digit in (2, 3, 5):
    plt.scatter(X_reduced[y == digit, 0], X_reduced[y == digit, 1],
                c=[cmap(digit / 9)])
plt.axis('off')
plt.show()
idx = (y == 2) | (y == 3) | (y == 5)
X_subset = X[idx]
y_subset = y[idx]
from sklearn.preprocessing import MinMaxScaler
from matplotlib.offsetbox import AnnotationBbox, OffsetImage
plot_digits(X_reduced, y)
Let's start with PCA. We will also time how long it takes:
In [98]:
from sklearn.decomposition import PCA
import time
t0 = time.time()
X_pca_reduced = PCA(n_components=2, random_state=42).fit_transform(X)
t1 = time.time()
print("PCA took {:.1f}s.".format(t1 - t0))
plot_digits(X_pca_reduced, y)
plt.show()
from sklearn.manifold import LocallyLinearEmbedding, MDS
from sklearn.pipeline import Pipeline

t0 = time.time()
X_lle_reduced = LocallyLinearEmbedding(n_components=2,
                                       random_state=42).fit_transform(X)
t1 = time.time()
print("LLE took {:.1f}s.".format(t1 - t0))
plot_digits(X_lle_reduced, y)
plt.show()
pca_lle = Pipeline([
    ("pca", PCA(n_components=0.95, random_state=42)),
    ("lle", LocallyLinearEmbedding(n_components=2, random_state=42)),
])
t0 = time.time()
X_pca_lle_reduced = pca_lle.fit_transform(X)
t1 = time.time()
print("PCA+LLE took {:.1f}s.".format(t1 - t0))
plot_digits(X_pca_lle_reduced, y)
plt.show()
m = 2000
t0 = time.time()
X_mds_reduced = MDS(n_components=2, random_state=42).fit_transform(X[:m])
t1 = time.time()
print("MDS took {:.1f}s (on just 2,000 MNIST images instead of 10,000).".format(t1 - t0))
plot_digits(X_mds_reduced, y[:m])
plt.show()
LAB 12 – Neural Networks and Deep Learning
Perceptron learning algorithm
LAB 14 – Recommender Systems
Question 1: Reading Movie-Lens data
Solution:
# Import libraries
import numpy as np
import pandas as pd
In [5]:
ratings.head()
Out[5]:
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291
Also let's count the number of unique users and movies.
In [6]:
n_users = ratings.user_id.unique().shape[0]
n_movies = ratings.movie_id.unique().shape[0]
print('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_movies))
Number of users = 6040 | Number of movies = 3706
Now I want the format of my ratings matrix to be one row per user and one column per movie. To
do so, I'll pivot ratings to get that and call the new variable Ratings (with a capital R).
In [7]:
Ratings = ratings.pivot(index = 'user_id', columns ='movie_id', values = 'rating').fillna(0)
Ratings.head()
Out[7]:
.
movi 1 39 39 39 39 39 39 39 39 39 39
1 2 3 4 5 6 7 8 9 .
e_id 0 43 44 45 46 47 48 49 50 51 52
.
user_
id
.
5. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
1 .
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.
.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
2 .
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.
.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
3 .
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.
38
.
movi 1 39 39 39 39 39 39 39 39 39 39
1 2 3 4 5 6 7 8 9 .
e_id 0 43 44 45 46 47 48 49 50 51 52
.
user_
id
.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
4 .
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.
.
0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
5 .
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.
In [15]:
links_small = pd.read_csv('../input/links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')
In [16]:
md = md.drop([19730, 29503, 35587])
In [17]:
#Check EDA Notebook for how and why I got these indices.
md['id'] = md['id'].astype('int')
In [18]:
smd = md[md['id'].isin(links_small)]
smd.shape
Out[18]:
(9099, 25)
We have 9099 movies available in our small movies metadata dataset, which is about 5 times smaller
than our original dataset of 45000 movies.
smd['tagline'] = smd['tagline'].fillna('')
smd['description'] = smd['overview'] + smd['tagline']
smd['description'] = smd['description'].fillna('')
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0,
stop_words='english')
tfidf_matrix = tf.fit_transform(smd['description'])
tfidf_matrix.shape
Out[21]:
(9099, 268124)
Cosine Similarity
I will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity
between two movies. Mathematically, it is defined as follows:
cosine(x, y) = \frac{x \cdot y^\top}{\|x\| \, \|y\|}
Since TfidfVectorizer L2-normalizes each row by default, the dot product of two rows directly gives
their Cosine Similarity Score. Therefore, we will use sklearn's linear_kernel instead of
cosine_similarity since it is faster.
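The equivalence is easy to check on a small example. The sketch below uses three made-up one-line descriptions (not rows from the movie metadata) and relies on `TfidfVectorizer`'s default `norm='l2'` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical descriptions, for illustration only
docs = [
    "a mob boss hands the family business to his son",
    "a masked vigilante protects the city",
    "the son of a mob boss takes over the family",
]

# Rows are L2-normalized by default, so every row has unit length
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

dot_products = linear_kernel(tfidf, tfidf)
cosines = cosine_similarity(tfidf, tfidf)
same = np.allclose(dot_products, cosines)
print(same)
```

If the vectorizer were created with `norm=None`, the two matrices would generally differ and `cosine_similarity` would be the one to use.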
In [22]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
In [23]:
cosine_sim[0]
Out[23]:
array([ 1. , 0.00680476, 0. , ..., 0. ,
0.00344913, 0. ])
We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is
to write a function that returns the 30 most similar movies based on the cosine similarity score.
In [24]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])
In [25]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]  # skip the movie itself, keep the 30 most similar
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]
We're all set. Let us now try and get the top recommendations for a few movies and see how
good the recommendations are.
In [26]:
get_recommendations('The Godfather').head(10)
Out[26]:
973 The Godfather: Part II
8387 The Family
3509 Made
4196 Johnny Dangerously
29 Shanghai Triad
5667 Fury
2412 American Movie
1582 The Godfather: Part III
4221 8 Women
2159 Summer of Sam
Name: title, dtype: object
In [27]:
get_recommendations('The Dark Knight').head(10)
Out[27]:
7931 The Dark Knight Rises
132 Batman Forever
1113 Batman Returns
8227 Batman: The Dark Knight Returns, Part 2
7565 Batman: Under the Red Hood
524 Batman
7901 Batman: Year One
2579 Batman: Mask of the Phantasm
2696 JFK
8165 Batman: The Dark Knight Returns, Part 1
Name: title, dtype: object
import pandas as pd
import numpy as np
from zipfile import ZipFile
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from pathlib import Path
import matplotlib.pyplot as plt
# Only extract the data the first time the script is run.
if not movielens_dir.exists():
    with ZipFile(movielens_zipped_file, "r") as zip:
        # Extract files
        print("Extracting all the files now...")
        zip.extractall(path=keras_datasets_path)
        print("Done!")
user_ids = df["userId"].unique().tolist()
user2user_encoded = {x: i for i, x in enumerate(user_ids)}
userencoded2user = {i: x for i, x in enumerate(user_ids)}
movie_ids = df["movieId"].unique().tolist()
movie2movie_encoded = {x: i for i, x in enumerate(movie_ids)}
movie_encoded2movie = {i: x for i, x in enumerate(movie_ids)}
df["user"] = df["userId"].map(user2user_encoded)
df["movie"] = df["movieId"].map(movie2movie_encoded)
num_users = len(user2user_encoded)
num_movies = len(movie_encoded2movie)
df["rating"] = df["rating"].values.astype(np.float32)
# min and max ratings will be used to normalize the ratings later
min_rating = min(df["rating"])
max_rating = max(df["rating"])
print(
    "Number of users: {}, Number of Movies: {}, Min rating: {}, Max rating: {}".format(
        num_users, num_movies, min_rating, max_rating
    )
)
Number of users: 610, Number of Movies: 9724, Min rating: 0.5, Max rating: 5.0
The model computes a match score between user and movie embeddings via a dot
product, and adds a per-movie and per-user bias. The match score is scaled to
the [0, 1] interval via a sigmoid (since our ratings are normalized to this range).
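The scoring rule described above can be illustrated without TensorFlow. The NumPy sketch below is a simplified stand-in for the model's forward pass, not the Keras code itself. The function names are assumptions, and the default minimum and maximum ratings are taken from the printed output above (0.5 and 5.0).

```python
import numpy as np


def match_score(user_vec, movie_vec, user_bias, movie_bias):
    # Dot product of the two embeddings plus both biases, squashed to (0, 1)
    logit = np.dot(user_vec, movie_vec) + user_bias + movie_bias
    return 1.0 / (1.0 + np.exp(-logit))


def normalize_rating(rating, min_rating=0.5, max_rating=5.0):
    # Map a raw rating onto [0, 1] so it is comparable with the sigmoid output
    return (rating - min_rating) / (max_rating - min_rating)


# Hypothetical embeddings of size 3, for illustration only
user_vec = np.array([0.2, -0.1, 0.4])
movie_vec = np.array([0.3, 0.0, 0.5])
score = match_score(user_vec, movie_vec, user_bias=0.1, movie_bias=-0.05)
print(score, normalize_rating(5.0), normalize_rating(0.5))
```

During training, the model's sigmoid output is compared against the normalized rating, which is why both must live on the same [0, 1] scale.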
EMBEDDING_SIZE = 50
class RecommenderNet(keras.Model):
    def __init__(self, num_users, num_movies, embedding_size, **kwargs):
        super(RecommenderNet, self).__init__(**kwargs)
        self.num_users = num_users
        self.num_movies = num_movies
        self.embedding_size = embedding_size
        self.user_embedding = layers.Embedding(
            num_users,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        self.user_bias = layers.Embedding(num_users, 1)
        self.movie_embedding = layers.Embedding(
            num_movies,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        self.movie_bias = layers.Embedding(num_movies, 1)
Show top 10 movie recommendations to a user
movie_df = pd.read_csv(movielens_dir / "[Link]")
print("----" * 8)
print("Top 10 movie recommendations")
print("----" * 8)
recommended_movies = movie_df[movie_df["movieId"].isin(recommended_movie_ids)]
for row in recommended_movies.itertuples():
    print(row.title, ":", row.genres)
302/302 [==============================] - 0s 800us/step
Showing recommendations for user: 213
====================================
Movies with high ratings from user
--------------------------------
Terminator 2: Judgment Day (1991) : Action|Sci-Fi
Rocky (1976) : Drama
Big Fish (2003) : Drama|Fantasy|Romance
Shrek 2 (2004) : Adventure|Animation|Children|Comedy|Musical|Romance
13 Assassins (Jûsan-nin no shikaku) (2010) : Action
--------------------------------
Top 10 movie recommendations
--------------------------------
Usual Suspects, The (1995) : Crime|Mystery|Thriller
Star Wars: Episode IV - A New Hope (1977) : Action|Adventure|Sci-Fi
Shawshank Redemption, The (1994) : Crime|Drama
Schindler's List (1993) : Drama|War
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964) : Comedy|War
Godfather: Part II, The (1974) : Crime|Drama
Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001) : Comedy|Romance
28 Days Later (2002) : Action|Horror|Sci-Fi
Little Miss Sunshine (2006) : Adventure|Comedy|Drama
Hurt Locker, The (2008) : Action|Drama|Thriller|War