0% found this document useful (0 votes)
33 views46 pages

Machine Learning Labs Overview

Uploaded by

Sơn Trịnh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
33 views46 pages

Machine Learning Labs Overview

Uploaded by

Sơn Trịnh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Machine Learning Practices

LAB 01 – Overview of Machine Learning, Tool & Library ..................................... 2


LAB 02 – Machine Learning Project ..................................................................... 3
LAB 03 – Regression Techniques ......................................................................... 4
LAB 04 – Regression Techniques (cont) ............................................................... 4
LAB 05 – Classification Techniques ...................................................................... 7
LAB 06 – Classification Techniques (cont)............................................................ 9
LAB 07 – Classification Techniques (cont).......................................................... 12
LAB 08 – Evaluation Labs ................................................................................... 14
LAB 09 – Clustering Techniques......................................................................... 18
LAB 10 – Ensemble Learning and Random Forests ............................................ 18
LAB 11 – Dimensionality Reduction ................................................................... 31
LAB 12 – Neural Networks and Deep Learning .................................................. 36
LAB 13 – Neural Networks and Deep Learning (cont) ........................................ 36
LAB 14 – Recommender Systems ...................................................................... 37
LAB 15 – Evaluation Labs ................................................................................... 37

1
LAB 01 – Overview of Machine Learning, Tool & Library

1. How would you define Machine Learning?


2. Can you name four types of problems where it shines?
3. What is a labeled training set?
4. What are the two most common supervised tasks?
5. Can you name four common unsupervised tasks?
6. What type of Machine Learning algorithm would you use to allow a robot to walk
in various unknown terrains?
7. What type of algorithm would you use to segment your customers into multiple
groups?
8. Would you frame the problem of spam detection as a supervised learning problem
or an unsupervised learning problem?
9. What is an online learning system?
10. What is out-of-core learning?
11. What type of learning algorithm relies on a similarity measure to make predictions?
12. What is the difference between a model parameter and a learning algorithm’s
hyperparameter?
13. What do model-based learning algorithms search for? What is the most common
strategy they use to succeed? How do they make predictions?
14. Can you name four of the main challenges in Machine Learning?
15. If your model performs great on the training data but generalizes poorly to new
instances, what is happening? Can you name three possible solutions?
16. What is a test set and why would you want to use it?
17. What is the purpose of a validation set?
18. What can go wrong if you tune hyperparameters using the test set?
19. What is repeated cross-validation and why would you prefer it to using a single
validation set?

2
LAB 02 – Machine Learning Project

1. Try a Support Vector Machine regressor ([Link]), with various


hypeparameters such as kernel="Linear" (with various values for the C
hyperparameter) or kernel="rbf" (with various values for the C and gamma
hyperparameters). How does the best SVR predictor perform?
2. Try replacing GridSearchCV with RandomizedSearchCV.
3. Try adding a transformer in the preparation pipeline to select only the most important
attributes.
4. Try creating a single pipeline that does the full data preparation plus the final
prediction.
5. Automatically explore some preparation options using GridSearchCV.

3
LAB 03 – Regression Techniques

Question 1: Apply Gradient Descent to find minimum of a function


Solution:
function = lambda x: (x ** 3)-(3 *(x ** 2))+7
x = [Link](-1,3,500)
[Link](x, function(x))
[Link](0, color='y', alpha = 0.7)
[Link](0, color='y', alpha = 0.7)
[Link]()
[Link]("Function (x ** 3)-(3 *(x ** 2))+7 ")
[Link]()
pip install numdifftools
import numdifftools as nd
dydx = lambda x : float([Link](function)([x]))
x_sample = 0.1
print(f"Derivatives of function at {x_sample} is {dydx(0.1)}")
def step(x_new, x_prev, precision, l_r):
'''
Description: This function takes in an initial or previous value for x,
updates it based on
steps taken via the learning rate and outputs the most minimum value of
x that reaches
the precision satisfaction.

Arguments:
x_new - a starting value of x that will get updated based on the
learning rate

x_prev - the previous value of x that is getting updated to the new one

precision - a precision that determines the stop of the stepwise


descent

l_r - the learning rate (size of each descent step)


'''
# create empty lists where the updated values of x and y wil be
appended during each iteration
x_list, y_list = [x_new], [function(x_new)]
# keep looping until your desired precision
while abs(x_new - x_prev) > precision:
# change the value of x
x_prev = x_new
# get the derivation of the old value of x
d_x = - dydx(x_prev)
# get your new value of x by adding the previous, the
multiplication of the derivative and the learning rate
x_new = x_prev + (l_r * d_x)
# append the new value of x to a list of all x-s for later
visualization of path
x_list.append(x_new)
# append the new value of y to a list of all y-s for later
visualization of path

4
y_list.append(function(x_new))
x_min = x_new
print ("Local minimum occurs at: "+ str(x_min))
print ("Number of steps: " + str(len(x_list)))

return x_list, y_list, x_min


#Implement gradient descent (all the arguments are arbitrarily chosen)
x_list, y_list, x_min = step(x_new = 0.5, x_prev = 0, precision = 0.001,
l_r = 0.05)

[Link](x, function(x))
[Link](x_list,y_list,c="r")
[Link](0, color='y', alpha = 0.7)
[Link](0, color='y', alpha = 0.7)
[Link]()
[Link]("Function (x ** 3)-(3 *(x ** 2))+7 ")
[Link]()
[Link](x_list,y_list,c="g")
[Link](x_list,y_list,c="g")
[Link](x,function(x), c="r")
[Link]([1.0,2.1])
[Link]("Zoomed in Gradient descent to Key Area")
[Link]()
[Link]()

Question 2: Draw all the loss functions


range_min = -4
range_max = 4
stepsize = 0.1
m = 1.0

x = [Link](range_min, range_max + stepsize, stepsize)


y_hinge = [Link](0,m-x)
y_log = [Link](1 + [Link](-x))
y_exp = [Link](-x)

fig, ax = [Link](figsize=(15, 8))


[Link](x, y_hinge, c='r', linewidth=4, linestyle=(0,(1,1)), label= "hinge:
max(0,m-z)", alpha=0.7)
[Link](x, y_log, c='g', linewidth=4, linestyle='dashdot', label="log :
log(1+\exp(-z))", alpha=0.7)
[Link](x, y_exp, c='orange', linewidth=4, linestyle=(0,(4,1,4,1,1,1)),
label="exp: exp(-z)", alpha=0.7)
[Link]()
[Link]("z = y * xf(x)")
[Link]("Loss")
[Link](-0.5, 5)
[Link](0, color='black', alpha = 0.7)
[Link](0, color='black', alpha = 0.7)
[Link]()
[Link](
def meanabsoluteerror(ytrue,ypred):
trainingexamples = [Link][0]
print("Total no of training examples is "+str(trainingexamples))
errorvector = ypred-ytrue
print("Shape of error vector "+str([Link]))
5
error = (1/trainingexamples)*[Link]([Link](errorvector))
return error

def meansquarelogerror(ytrue,ypred):
trainingexamples = [Link][0]
print("Total no of training examples is "+str(trainingexamples))
errorvector = [Link]([Link](ypred))-[Link]([Link](ytrue))
print("Shape of error vector "+str([Link]))
error = (1/trainingexamples)*[Link](errorvector**2)
return error

def sigmoid(arrays):
return 1/(1+[Link]((-1)*arrays))

def crossentropy(ytrue,ypred):
trainingexamples = [Link][0]
errorvector = -1 * ytrue * [Link](ypred)
return (1/trainingexamples)*[Link](errorvector)
)
trainingsets = [Link](100,1)
trainingsets_pred = [Link](100,1)
error = meanabsoluteerror(trainingsets,trainingsets_pred)
print(f"Mean Absolute Error is {error}\n")

error = meansquarelogerror(trainingsets,trainingsets_pred)
print(f"Mean Square Log Error is {error}\n")

sig_train=sigmoid(trainingsets)
sig_pred=sigmoid(trainingsets_pred)
print(trainingsets[:5])
print(sig_train[:5])
error = crossentropy(sig_train,sig_pred)
print(f"Cross Entropy Loss is {error}\n")

6
LAB 04 – Regression Techniques (cont)

Question 1: Draw all the activation functions


Solution:
import [Link] as plt
import numpy as np
import math
import tensorflow as tf
def PlotActivationFuction(x, y1, y2, title = "Activation y1 and y2"):
fig, (ax1, ax2) = [Link](2, 1, figsize=(15,7))
[Link](title)

[Link](x, y1, '-ro')


[Link](xlabel='x', ylabel='y1')
[Link]()
[Link](x, y2, '-go')
[Link](xlabel='x', ylabel='y2')
[Link]()

[Link]()
# Sigmoid (Logistic Function)
x = [Link](-10, 10, 50)

sigmoid = lambda z: 1/(1 + [Link](-z))


relu = lambda z: max(0.0, z)
leakyrelu = lambda z, alpha : z if z > 0 else alpha*z
tanh = lambda z : ([Link](z) - [Link](-z))/ ([Link](z) + [Link](-z))
swish = lambda z : z* 1/(1 + [Link](-z))
ELU = lambda z : z if z > 0 else (1*([Link](z)-1))

y_sigmoid_manually, y_sigmoid_library = sigmoid(x),


[Link](x)
PlotActivationFuction(x, y_sigmoid_manually, y_sigmoid_library, "Sigmoid
Activation Manually and Library")

y_relu_manually, y_relu_library = [relu(i) for i in x],


[Link](x)
PlotActivationFuction(x, y_relu_manually, y_relu_library, "Relu Activation
Manually and Library")

y_leakyrelu_manually, y_leakyrelu_library = [leakyrelu(i, 0.1) for i in x]


\
, [Link](x,
alpha= 0.1)
PlotActivationFuction(x, y_leakyrelu_manually, y_leakyrelu_library, "Leaky
Relu Activation Manually and Library")

y_tanh_manually, y_tanh_library = tanh(x), [Link](x)


PlotActivationFuction(x, y_tanh_manually, y_tanh_library, "Tanh Relu
Activation Manually and Library")

y_swish_manually, y_swish_library = swish(x), [Link](x)


PlotActivationFuction(x, y_swish_manually, y_swish_library, "Swish Relu
Activation Manually and Library")

7
y_elu_manually, y_elu_library = [ELU(i) for i in x] ,
[Link](x)
PlotActivationFuction(x, y_elu_manually, y_elu_library, "ELU Relu
Activation Manually and Library")

Question 2: Draw all the derivatives of activation functions


Solution:

def PlotDerivative(x, y, dydx, title = "Functions and Its Derivatives"):


fig, ax = [Link](figsize=(15, 3))
[Link](x,y, color= 'red', linewidth=3, label="function")
[Link](x,dydx, color= 'green', linewidth=3, label="derivative")
[Link](loc="upper right")
[Link](0, color='y', alpha = 0.7)
[Link](0, color='y', alpha = 0.7)
[Link]()
[Link](title)
[Link]()

In [76]:
x = [Link](-5.0, 5.0, 0.01)
dx = x[1]-x[0]

y_sigmoid = [Link](x)
dydx_sigmoid = [Link](y_sigmoid, dx)
PlotDerivative(x, y_sigmoid, dydx_sigmoid, title = "Sigmoid Functions and Its
Derivatives")

y_relu = [Link](x)
dydx_relu = [Link](y_relu, dx)
PlotDerivative(x, y_relu, dydx_relu, title = "Relu Functions and Its Derivatives")

y_leakyrelu = [Link](x, alpha= 0.3)


dydx_leakyrelu = [Link](y_leakyrelu, dx)
PlotDerivative(x, y_leakyrelu, dydx_leakyrelu, title = "Leaky Relu Functions and Its
Derivatives")

y_tanh = [Link](x)
dydx_tanh = [Link](y_tanh, dx)
PlotDerivative(x, y_tanh, dydx_tanh, title = "Tanh Functions and Its Derivatives")
y_swish = [Link](x)
dydx_swish = [Link](y_swish, dx)
PlotDerivative(x, y_swish, dydx_swish, title = "Swish Functions and Its
Derivatives")
y_elu = [Link](x)
dydx_elu = [Link](y_elu, dx)
PlotDerivative(x, y_elu, dydx_elu, title = "ELU Functions and Its Derivatives")

8
LAB 05 – Classification Techniques

Question 1: Descibe KNN by handling


Solution
import numpy as np
from collections import Counter

def euclidian_distance(x1, x2): # x1, x2 là 2 vector.


return [Link](sum((x1 - x2) ** 2))

class KNeighborsClassifier:
def __init__(self, k=3):
self.k = k

def fit(self, X, y): # Store sample and label


self.X_train = X
self.y_train = y

def predict(self, X):


predicted_label = [self._predict(x) for x in X]
return [Link](predicted_label)

def _predict(self, x):


# Compute distances
distances = [euclidian_distance(x, x_train) for x_train in self.X_t
rain]
# Trả về index của mảng BAN ĐẦU sau khi được xắp xếp
k_indices = [Link](distances)[0:self.k]
# Lấy ra label của những sample có khảng cách gần nhất đến điểm đan
g xet
k_nearest_label = [self.y_train[i] for i in k_indices]
# Trả về tupple có dạng (label xuất hiện nhiều nhất, số lần xuất hi
ện)
most_common = Counter(k_nearest_label).most_common(1)
# Trả về label có số lần xuất hiện nhiều nhất
return most_common[0][0]

Question 2: Using K-NN with [Link] dâta


import csv
import numpy as np
import math
9
def loadData(path):
f = open(path, "r")
data = [Link](f) #csv format
data = [Link](list(data))# covert to matrix
data = [Link](data, 0, 0)# delete header
data = [Link](data, 0, 1) # delete index
[Link](data) # shuffle data
[Link]()
trainSet = data[:145] #training data from 1->145
testSet = data[145:]# the others is testing data
return trainSet, testSet

def loadDataTest(path):
f = open(path, "r")
data = [Link](f) #csv format
data = [Link](list(data))# covert to matrix
data = [Link](data, 0, 0)# delete header
data = [Link](data, 0, 1) # delete index
[Link](data) # shuffle data
[Link]()
testSet = data[:]# the others is testing data
return testSet
def calcDistancs(pointA, pointB, numOfFeature=4):
tmp = 0
for i in range(numOfFeature):
tmp += (float(pointA[i]) - float(pointB[i])) ** 2
return [Link](tmp)
def kNearestNeighbor(trainSet, point, k):
distances = []
for item in trainSet:
[Link]({
"label": item[-1],
"value": calcDistancs(item, point)
})
[Link](key=lambda x: x["value"])
labels = [item["label"] for item in distances]
return labels[:k]
def findMostOccur(arr):
labels = set(arr) # set label
ans = ""
maxOccur = 0
for label in labels:
num = [Link](label)
if num > maxOccur:
maxOccur = num
ans = label
return ans

10
if __name__ == "__main__":
trainSet, testSet = loadData("[Link]")
testSet1 = loadDataTest("Iris_classification.csv")
for item in testSet:
knn = kNearestNeighbor(trainSet, item, 5)
answer = findMostOccur(knn)
print("label: {} -> predicted: {}".format(item[-1], answer))
# for item in testSet1:
# knn = kNearestNeighbor(trainSet, item, 11)
# answer = findMostOccur(knn)
# print("label: {} -> predicted: {}".format(item[-1], answer))

11
LAB 06 – Classification Techniques (cont)

[Link] Bayes
In this question you will implement a Naive Bayes classifier for a text classification problem. You will
be given a collection of text articles, each coming from either the serious European magazine The
Economist, or from the not-so-serious American magazine The Onion. The goal is to learn a classifier
that can distinguish between articles from each magazine. We have pre-processed the articles so that
they are easier to use in your experiments. We extracted the set of all words that occur in any of the
articles. This set is called the vocabulary and we let V be the number of words in the vocabulary. For
each article, we produced a feature vector X = hX1, . . . , XV i, where Xi is equal to 1 if the i th word
appears in the article and 0 otherwise.
Each article is also accompanied by a class label of either 1 for The Economist or 2 for The Onion.
Later in the question we give instructions for loading this data into Octave. When we apply the Naive
Bayes classification algorithm, we make two assumptions about the data: first, we assume that our
data is drawn iid from a joint probability distribution over the possible feature vectors X and the
corresponding class labels Y ; second, we assume for each pair of features Xi and Xj with i 6= j that Xi
is conditionally independent of Xj given the class label Y (this is the Naive Bayes assumption). Under
these assumptions, a natural classification rule is as follows: Given a new input X, predict the most
probable class label Yˆ given X. Formally,

[Link] Trees
Question 1. How many unique, perfect binary trees of depth 3 can be drawn if we have 5 attributes.
By depth, we mean depth of the splits, not including the nodes that only contain a label. So a tree
that checks just one attribute is a depth 1 tree. By perfect binary tree, we mean every node has either
0 or 2 children, and every leaf is at the same depth. Note also that a tree with the same attributes but
organized at different depths are considered “unique”. Do not include trees that test the same
attribute along the same path in the tree.

Question 2: Consider the following dataset for this problem. Given the five attributes on the left, we
want to predict if the student got an A in the course.

12
Create 2 decision trees for this dataset. For the first, only go to depth 1. For the second go to depth 2.
For all trees, use the ID3 entropy algorithm from class. For each node of the tree, show the decision,
the number of positive and negative examples and show the entropy at that node.

Hint: There are a lot of calculations here. You may want to do this programatically.
(Bonus)Make one more decision tree. Use the same procedure as in (b), but make it depth 3. Now,
given these three trees, which would you prefer if you wanted to predict the grades of 10 new
students who are not included in this dataset? Justify your choice.

Question 3. Recall the definition of the “realizable case.” For some fixed concept class C, such as
decision trees, a realizable case is one where the algorithm gets a sample consistent with some
concept c ∈ C. In other words, for decision trees, a case is realizable if there is some tree that
perfectly classifies the dataset. If the number of attributes A is sufficiently large, under what condition
would a dataset not be realizable for decision trees of no fixed depth? Prove that the dataset is
unrealizable if and only if that condition is true.

13
LAB 07 – Classification Techniques (cont)

[Link] Vector Machines


Question 1
Show that the criterion
corresponds to correct classification for all n samples in a binary classification problem where

and ℒ𝑘 denotes the index set of class k.

Question 2
Given a binary data set:

Plot the points. Sketch the support vectors and the decision boundary for a linear SVM
classifier with maximum margin for this data set.

Question 3
Given the binary classification problem:

14
a) Sketch the points in a scatterplot (preferably with different colors for the different classes).
b) In the plot, sketch the mean values and the decision boundary you would get with a
Gaussian classifier with covariance matrix Σ=σ2I, where I is the identity matrix.
c) What is the error rate of the Gaussian classifier on the training data set?
d) Sketch on the plot the decision boundary you would get using a SVM with linear kernel
and a high cost of misclassifying training data. Indicate the support vectors and the decision
boundary on the plot.
e) What is the error rate of the linear SVM on the training data set?
Question 4
a) Download the two datasets [Link] and [Link].
You can use a library for SVM, e.g. scikit-learn in python.
Hint:

from sklearn import svm

# In this case we assume 2d features.


# X: numpy array with shape (n, 2), all n data points.
# Y: numpy array of shape (n), labels in {0, 1, 2, 3, ...} corresponding to X

classifier = [Link](C=10, kernel='linear')


[Link](X, Y)

print("The feature [1, 2] is classified as:", [Link]([[1, 2]]))

Familiarize yourself with the data sets by studying scatterplots.

b) Load [Link]. Stick with the linear SVM, but change the C-parameter.

Rerun the experiments a couple of times, and visualize the data using something like the following:

import numpy as np

15
import [Link] as plt

def make_meshgrid(X, h=.02):


"""Make a meshgrid covering the range of X. This will be used to draw
classification regions

Args:
X: numpy array with shape [n, 2] containing 2d feature vectors.
h: parameter controlling the resolution of the meshgrid
"""
x = X[:, 0]
y = X[: ,1]
x_min, x_max = [Link]() - 1, [Link]() + 1
y_min, y_max = [Link]() - 1, [Link]() + 1
xx, yy = [Link]([Link](x_min, x_max, h), [Link](y_min, y_max, h))
return xx, yy

def scatter(X, Y, xx, yy, Z):


"""
Scatter plot with classification regions

Args:
X: numpy array of shape [n, 2] where n is the total number of datapoints
Y: numpy array of shape [n] containing the labels {1, 2, 3, ...} of X
xx: meshgrid x
yy: meshgrid y
Z: The result of applying some prediction function on all points in xx and
yy
"""
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

[Link]()
# Color class regions
[Link]().contourf(xx, yy, Z, alpha=0.7)
# Data points
[Link](X[:, 0], X[:, 1], c=Y, cmap=[Link], marker='o',
edgecolors='k')
[Link](x_min, x_max)
[Link](y_min, y_max)
[Link]().set_aspect('equal')
[Link]()
plt.tight_layout()

[Link]()

# Given X and Y as explained above, we can display the data and classification
boundaries as
classifier = [Link](C=100.0, kernel='linear')
[Link](X, Y)

xx, yy = make_meshgrid(X)
Z = [Link](np.c_[[Link](), [Link]()])
Z = [Link]([Link])
scatter(X, Y, xx, yy, Z)

How does the support vectors and the boundary change with the parameter?

16
c) Try to remove some of the non-support-vectors and rerun. Does the solution change?

d) Load [Link]. Try various values of the C−parameter with a linear SVM. Can the linear SVM
classifier make a good separation of the feature space?

e) Change kernel to a RBF (Radial Basis Function), and rerun. Try changing the σ-parameter.

In the lecture we defined the rbf kernel as

In sklearn, σ is expressed through a γ parameter, and the rbf kernel is given as

f) Implement a grid search of the C- and σ- parameters based on 10-fold cross-validation of the training data
(the A-dataset). Find the best values of C and σ, retrain on the entire A–dataset, and the test on the B-dataset.
Does the average 10-fold cross-validation estimate of the overall classification error match the result we get
when testing on the independent B-dataset?

You can for example use the parameter ranges suggested in the lecture slides:

LAB 08 – Evaluation Labs

17
LAB 09 – Clustering Techniques

1.K-Mean
Question 1: Describle K-Means by handling
Solution:
import numpy as np
import scipy
from [Link] import KMeans
"""
Ma trận đầu vào X có dạng(m_sample, n_feature) có nghĩa là mỗi hàng của X b
iểu diễn 1 điểm dữ liệu
Nếu X chưa thỏa mãn điều kiện trên thì phải reshape lại => Bắt buộc
"""
def init_centers(X, k):
# Chọn ra và trả về k điểm ngẫu nhiên trong X
return X[[Link]([Link][0], k, replace=False)]

def assign_labels(X, centers):


# Tính khoảng cách từ điểm dữ liệu đang xét
Distances = [Link](X, centers)
# Phân nó vào cụm gần nhất
labels = [Link](Distances, axis=1)
[Link]([Link][0], 1)
return labels # Vector cột mà mỗi hàng của nó là label ứng với hàng tr
ong X
def update_centers(X, labels, K):
# Tính trung bình của các điểm dữ liệu trong X có cùng label
centers = [Link]((K, [Link][1]))
for k in range(K):
# Thu thập tất cả các điểm dữ liệu được phân vào cụm thứ k
Xk = X[labels == k, :]
# Tính trung bình của cụm để ra centers
centers[k, :] = [Link](Xk, axis=0)
return centers
def has_converged(centers, new_centers):
# So sánh giữa centers và new_centers, nếu giống nhau thì trả về True,
khác thì False
return set([tuple(a) for a in centers]) == set([tuple(a) for a in new_c
enters])
def kmeans(X, K):
# Khởi tạo center ngẫu nhiên

18
centers = [init_centers(X, K)]
labels = []
n_iter = 0 # Số vòng lặp
while True:
# Phân cụm dữ liệu
[Link](assign_labels(X, centers[-1]))
# Update lại center sau bước phân cụm
new_center = update_centers(X, labels[-1], K)
# Nếu center với và cũ giống nhau sau khi update thì dừng
if has_converged(centers[-1], new_center):
break
# Nếu center mới và cũ khác nhau cập nhập lại theo center mới
[Link](new_center)
n_iter += 1 # Số vòng lặp cộng thêm 1
# Trả về giá trị của center và label thu được ở vòng lặp cuối
return centers[-1], labels[-1], n_iter
means = [[2, 2], [8, 3], [3, 6]]
cov = [[1, 0], [0, 1]]
N = 500
X0 = [Link].multivariate_normal(means[0], cov, N)
X1 = [Link].multivariate_normal(means[1], cov, N)
X2 = [Link].multivariate_normal(means[2], cov, N)

X = [Link]((X0, X1, X2), axis=0)


K = 3
original_label = [Link]([0] * N + [1] * N + [2] * N).T
center, label, n_iter = kmeans(X, K)
kmean = KMeans(n_clusters=3, random_state=0)
[Link](X)
pred_label = [Link](X)
print("Center found by our algorithm: ")
print(center)
print()
print("Center found by sklearn: ")
print(kmean.cluster_centers_)

[Link]
Question 1: Implementing DBSCAN
These exercises walk you through a Python implementation of an algorithm for clustering
called DBSCAN, which is short for density-based spatial clustering for applications with noise.
It addresses a limitation of k-means clustering, as described below.
Although there are existing implementations for Python (e.g., see here), in this notebook we
are asking you to build it from scratch, albeit using a lot of scaffolding that we have provided.

19
Setup
Here are the modules you will need for this problem.

from [Link] import display

import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
import [Link] as plt

# Adapted from: [Link]


def make_crater (inner_rad=4, outer_rad=4.5, donut_len=2, inner_pts=1000, outer_pts=500, label=False):
# Make the inner core
radius_core = inner_rad*[Link] (inner_pts)
direction_core = 2*[Link]*[Link] (size=inner_pts)

# Simulate inner core points


core_x = radius_core*[Link] (direction_core)
core_y = radius_core*[Link] (direction_core)
crater_core = [Link] ({'x_1': core_x, 'x_2': core_y})
if label: crater_core['label'] = 0

# Make the outer ring


radius_ring = outer_rad + donut_len*[Link](outer_pts)
direction_ring = 2*[Link]*[Link](size = outer_pts)

# Simulate ring points


ring_x = radius_ring*[Link](direction_ring)
ring_y = radius_ring*[Link](direction_ring)
crater_ring = [Link] ({'x_1': ring_x, 'x_2': ring_y})
if label: crater_ring['label'] = 1

return [Link] ([crater_core, crater_ring])

def make_scatter_plot (df, x="x_1", y="x_2", hue="label",


palette={0: "red", 1: "olive", 2: "blue", 3: "green"},
size=5,
centers=None):
if (hue is not None) and (hue in [Link]):
[Link] (x=x, y=y, hue=hue, data=df, palette=palette,
fit_reg=False)
else:

20
[Link] (x=x, y=y, data=df, fit_reg=False)

if centers is not None:


[Link] (centers[:,0], centers[:,1],
marker=u'*', s=500,
c=[palette[0], palette[1]])

def make_scatter_plot2 (df, x="x_1", y="x_2", hue="label", size=5):


if (hue is not None) and (hue in [Link]):
[Link] (x=x, y=y, hue=hue, data=df,
fit_reg=False)
else:
[Link] (x=x, y=y, data=df, fit_reg=False)

Loading the data

We will work work with a synthetic data set that is an especially bad case for k-
means. It's sometimes called the "crater" data because of its shape. The data points
are stored in a file named [Link].
The data files you'll need should be provided in this repository. To load them, you'll
need to wrap the filenames using the function fn(f), defined below. If you are
working locally, you can obtain a copy at the following URL and may need to
modify fn(f) accordingly: [Link]

def fn(fn_base, dirname='./dbscan/'): # `dirname` set by default to its location in our repository
return '{}{}'.format(dirname, fn_base)

# Demo:
fn('[Link]')

Exercise 0 (1 point). Start by reading the data into a Pandas data frame. The data is
stored locally within this assignment in a file called [Link]. Store your data
frame in a variable named crater.

[]

###
### YOUR CODE HERE
###

display ([Link] (3))


print ("...")
display ([Link] (3))

21
[]

assert len (crater) == 1500


assert set ([Link]) == set (['x_1', 'x_2', 'kmeans_label'])

with open (fn('crater_counts.txt'), 'rt') as fp:


true_counts = [int (c) for c in [Link] ().split (',')]
assert sum (crater['kmeans_label'] == 0) == true_counts[0]
assert sum (crater['kmeans_label'] == 1) == true_counts[1]

make_scatter_plot (crater, hue='kmeans_label')

print ("\n(Passed!)")

The testing code plots these data points, which are 2-D. The colors show the clusters
computed by k-means for k=2. Notice that the "natural" structure is, arguably, a dense ball in
the middle and a ring (donut) on the outside. However, k-means instead split the points
about an arbitrary line that cuts through the middle of the points.
Indeed, this fact is one of the limitations of k-means: it works well when you know the value
of k and the k clusters come from Gaussian distributions of similar shape and size (and,
therefore, density). However, if you don't know k or there is non-uniform shape and density
among the clusters---or some other grouping, as above---then k-means does not work well
(qualitatively).

Elements of DBSCAN
The DBSCAN algorithm takes a different approach. Rather than having to provide the number
of clusters, k, you define parameters related to neighborhoods and target density. Let's see
how DBSCAN works by building it from the ground up.

Neighborhoods

The first important concept in DBSCAN is that of an ϵ-neighborhood.


Consider any point p. The ϵ-neighborhood of p is the set of all points within a
distance of ϵ from p. That is, if {x^0,x^1,…,x^m−1} is a collection of m data points,
then the ϵ-neighborhood centered at p is
Nϵ(p)={x^i:∥x^i−p∥2≤ϵ},
where the measure of distance is Euclidean (i.e., the two-norm). Notice that this
definition would include the point p if p is one of the data points.

22
Exercise 1 (1 point). Implement a function that computes the ϵ-neighborhood of p for
a data matrix of points, X, defined by our usual convention as
X=⎛⎝⎜⎜⎜⎜⎜x^T0x^T1⋮x^Tm−1⎞⎠⎟⎟⎟⎟⎟.
In particular, complete the function named region_query(p, eps, X) below. Its
inputs are:

• p[:d]: The query point, of dimension d.


• eps: The size of the ball around p to search.
• X[:m, :d]: The set of points, i.e., data matrix.

It should return a boolean Numpy array adj[:m] with one entry per point (i.e., per row
of X). The entry adj[i] should be True only if X[i, :] lies within the eps-sized ball
centered at p.
Hint: There is a one-line solution of the form, return (boolean array
expression).

[]

def region_query (p, eps, X):


# These lines check that the inputs `p` and `X` have
# the right shape.
_, dim = [Link]
assert ([Link] == (dim,)) or ([Link] == (1, dim)) or ([Link] == (dim, 1))

###
### YOUR CODE HERE
###

Here is the test code for region_query(). In addition to sanity-checking your


solution, it plots the original points, a query point (marked by a red star), and
highlights all points in an ϵ-neighborhood computed by your function so you can
visually verify the result. (In this test, p=(−0.5,1.2) and ϵ=1.0.)

[]

X = crater[['x_1', 'x_2']].values
p = [Link] ([-0.5, 1.2])
in_region = region_query (p, 1.0, X)

crater_ball = [Link] ()
crater_ball['label'] = in_region
make_scatter_plot (crater_ball, centers=p[[Link]])

with open (fn('region_query_soln.txt'), 'rt') as fp:


23
assert int ([Link] ()) == sum (in_region)

print ("\n(Passed!)")

Exercise 2 (1 point). Suppose you are given a vector y[:] of boolean


(True and False) values, such as the one computed above. Write a function
named index_set(y) that returns the index locations of all
of y's True elements. Your function must return these index values as a Python set.

[]

def index_set (y):


"""
Given a boolean vector, this function returns
the indices of all True elements.
"""
assert len ([Link]) == 1

###
### YOUR CODE HERE
###

[]

y_test = [Link] ([True, False, False, True, False, True, True, True, False])
i_soln = set ([0, 3, 5, 6, 7])

i_test = index_set (y_test)


assert type (i_test) is set
assert len (i_test) == len (i_soln)
assert i_test == i_soln

print ("\n(Passed!)")

Exercise 3 (1 point). Given a value for ϵ and a data matrix X of points, complete the
function below so that it determines the neighborhood of each point.
Your function,
def find_neighbors(eps, X[:m, :]):
...

should return a Python list neighbors[:m] such that neighbors[i] is the index set
of neighbors of point X[i, :].

24
[]

def find_neighbors (eps, X):


m, d = [Link]
neighbors = [] # Empty list to start
###
### YOUR CODE HERE
###
assert len (neighbors) == m
return neighbors

[]

with open (fn('find_neighbors_soln.csv'), 'rt') as fp:


neighbors = find_neighbors (0.25, X)
for i, n_i in enumerate (neighbors):
j_, c_j_ = [Link] ().split (',')
assert (i == int (j_)) and (len (n_i) == int (c_j_))

print ("\n(Passed!)")

Density
The next important concept in DBSCAN is that of the density of a neighborhood.
Intuitively, the DBSCAN algorithm will try to "grow" clusters around points whose
neighborhoods are sufficiently dense.
Let's make this idea more precise.
Definition: core points. A point p is a core point if its ϵ-neighborhood has at
least s points.
In other words, the algorithm now has two user-defined parameters: the
neighborhood size, ϵ, and the minimum density, specified using a threshold s on the
number of points in such a neighborhood.

Exercise 4 (2 points). Complete the function, find_core_points(s, neighbors),


below. It takes as input a minimum-points threshold, s, and a list of point
neighborhoods, neighbors[:], such that neighbors[i] is the (index) set of
neighbors of point i. It should return a Python set, core_set, such that an index j in
core_set only if the size of the neighborhood at j is at least s.

[]

def find_core_points (s, neighbors):


assert type (neighbors) is list
assert all ([type (n) is set for n in neighbors])

25
core_set = set ()
###
### YOUR CODE HERE
###
return core_set

[]

core_set = find_core_points (5, neighbors)


print ("Found {} core points.".format (len (core_set)))

def plot_core_set (df, core_set):


df_labeled = [Link] ()
df_labeled['label'] = False
df_labeled.loc[list (core_set), 'label'] = True
make_scatter_plot (df_labeled)

plot_core_set (crater, core_set)

with open (fn('find_core_points_soln.txt'), 'rt') as fp:


core_set_soln = set ([int (i) for i in [Link] ().split ()])
assert core_set == core_set_soln

print ("\n(Passed!)")

Growing clusters via "reachable" points


The last concept needed for DBSCAN is the idea of growing a cluster around a core
point. It depends on the notion of reachability.
Definition: Reachability. A point q is reachable from another point p if there exists a
sequence of points p=p1,p2,…,pk=q such that every pi is a core point, possibly
except for pk=q, and pi∈Nϵ(pi−1) for all 1<i<k.
This procedure works as follows.
"Expand Cluster" procedure:

1. Consider any point p that is not yet assigned to a cluster.


2. If p is a core point, then start a new cluster for it.
3. Maintain a "reachable" set, which will be used to hold points that are reachable
from p as they are encountered. Initially, the reachable points are just p's ϵ-
neighbors.
4. Remove any point q from the reachable set.
5. If q has not yet been visited, then mark it as being visited.

26
6. If q is also a core point, then add all of its neighbors to the reachable set, per the
definition of "reachability" above.
7. If q is not yet assigned to any cluster, then add it to p's cluster.

Notice how this procedure explores the points reachable from p (Step 6). Intuitively, it
is trying to join all neighborhoods whose core points are mutually contained.

Here is a brief illustration of these concepts:

In this picture, suppose the minimum density parameter is s=3 points. Thus, only
the ϵ-neighborhoods centered at 1, 3, and 6 are core points, since these are the only
points that include at least s=3 points. For instance, Nϵ(1)={0,1,3,7}, making it a
core point since its neighborhood has four (4) points, whereas Nϵ(4)={3,4} is not a
core point since its neighborhood has just two (2) points.

Exercise 5 (4 points). Implement the "cluster growth" procedure described above in


the function, expand_cluster(), below.
To simplify your task, we will give you all the lines of code that you need. However,
you need to figure out in what order these lines must execute, as well as how to
indent them!
The signature of the function is:
def expand_cluster (p, neighbors, core_set, visited, assignment):
...

Its parameters are:

• p is the index of a starting core point. The caller must guarantee that it is indeed a
core point, and furthermore, that it has been assigned to some cluster. (See below.)

27
• neighbors[:] is a list of ϵ-neighborhoods, given as Python sets. For
instance, neighbors[p] is a set of indices of all points in the neighborhood of p. It will
have been computed from find_neighbors() above.
• core_set is a Python set containing the indices of all core points. That is, the
expression, i in core_set, is true only if i is indeed a core point.
• visited is another Python set containing the indices of all points that have already
been visited. That is, the expression i in visited should be True only if i has been
visited. Thus, your expand_cluster() function should update this set when visiting
any previously unvisited point.
• assignment is a Python dictionary. The key is the index of a point; the value is the
cluster label to which that point has been assigned. Consequently, if a point i does
not yet have any cluster assignment, then the expression, i in assignment, will
be False. Your expand_cluster() function should update cluster assignments by
updating this dictionary.

The skeleton of expand_cluster() does everything up to and including Step 4 of the


procedure above. It first initializes the reachable set as a Python set, reachable,
containing the neighbors of p. It then removes one of those reachable points, storing
it in q. You just need to perform steps 5-7. In fact, we will even give you all of the lines
of code that you need! But you have to to incorporate them into the skeleton, ordered
and indented correctly.
assignment[q] = assignment[p]
if q in core_set:
if q not in assignment:
if q not in visited:
reachable |= neighbors[q]
[Link] (q)

[]

def expand_cluster (p, neighbors, core_set, visited, assignment):


# Assume the caller performs Steps 1 and 2 of the procedure.
# That means 'p' must be a core point that is part of a cluster.
assert (p in core_set) and (p in visited) and (p in assignment)

reachable = set (neighbors[p]) # Step 3


while reachable:
q = [Link] () # Step 4

# Put your reordered and correctly indented statements here:


###
### YOUR CODE HERE
###

# This procedure does not return anything


# except via updates to `visited` and
# `assignment`.

28
[]

# This test is based on the illustration above.


p_test = 3
neighbors_test = [set ([0, 1]),
set ([0, 1, 3, 7]),
set ([2, 3]),
set ([1, 2, 3, 4, 6]),
set ([3, 4]),
set ([5]),
set ([3, 6, 7]),
set ([1, 7])
]
core_set_test = set ([1, 3, 6])
visited_test = set ([p_test])
assignment_test = {p_test: 0}
expand_cluster (p_test, neighbors_test, core_set_test,
visited_test, assignment_test)
assert visited_test == set ([0, 1, 2, 3, 4, 6, 7]) # All but 5
assert set (assignment_test.keys ()) == visited_test
assert set (assignment_test.values ()) == set ([0])

print ("\n(Passed!)")

Putting it all together


If you've successfully completed all steps above, then you have everything you need
to run the final DBSCAN algorithm, which we've provided below. The second code cell
below shows a picture of the clusters discovered for a particular setting of
neighborhood size, ϵ, and density threshold, s.
And there is no additional code for you to write below! However, you should make
sure the remaining cells execute without error.

[]

def dbscan (eps, s, X):


clusters = []
point_to_cluster = {}

neighbors = find_neighbors (eps, X)


core_set = find_core_points (s, neighbors)

assignment = {}
next_cluster_id = 0

29
visited = set ()
for i in core_set: # for each core point i
if i not in visited:
[Link] (i) # Mark i as visited
assignment[i] = next_cluster_id
expand_cluster (i, neighbors, core_set,
visited, assignment)
next_cluster_id += 1

return assignment, core_set

[]

assignment, core_set = dbscan (0.5, 10, X)

print ("Number of core points:", len (core_set))


print ("Number of clusters:", max ([Link] ()))
print ("Number of unclassified points:", len (X) - len (assignment))

def plot_labels (df, labels):


df_labeled = [Link] ()
df_labeled['label'] = labels
make_scatter_plot2 (df_labeled)

labels = [-1] * len (X)


for i, c in [Link] ():
labels[i] = c
plot_labels (crater, labels)

with open (fn('dbscan_soln.csv'), 'rt') as fp:


num_cores, num_clusters, num_outliers = [Link] ().split (',')
assert int (num_cores) == len (core_set)
assert int (num_clusters) == max ([Link] ())
assert int (num_outliers) == (len (X) - len (assignment))
print ("\n(Passed!)")

Fin! If you've reached this point and all tests above pass, you are ready to submit your
solution to this problem. Don't forget to save you work prior to submitting.

Phát hiện bất thường dùng hỗn hợp Gauss

LAB 10 – Ensemble Learning and Random Forests


Bộ phân loại Biểu quyết

30
Bagging và Pasting
Rừng Ngẫu nhiên

LAB 11 – Dimensionality Reduction


PCA, PCA hạt nhân
1. Load the MNIST dataset and split it into a training set
and a test set (take the first 60,000 instances for training, and the remaining
10,000 for testing). Train a Random Forest classifier on the dataset and time how
long it takes, then evaluate the resulting model on the test set. Next, use PCA to
reduce the dataset’s dimensionality, with an explained variance ratio of 95%.
Train a new Random Forest classifier on the reduced dataset and see how long it
takes. Was training much faster? Next evaluate the classifier on the test set: how
does it compare to the previous classifier?

Solution
Exercise: Load the MNIST dataset (introduced in chapter 3) and split it into a training set and a test set
(take the first 60,000 instances for training, and the remaining 10,000 for testing).
The MNIST dataset was loaded earlier.
In [73]:
X_train = mnist['data'][:60000]
y_train = mnist['target'][:60000]

X_test = mnist['data'][60000:]
y_test = mnist['target'][60000:]
Exercise: Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the
resulting model on the test set.
In [74]:
from [Link] import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)


In [75]:
import time

t0 = [Link]()
rnd_clf.fit(X_train, y_train)
t1 = [Link]()
In [76]:
print("Training took {:.2f}s".format(t1 - t0))
Training took 35.27s
In [77]:
from [Link] import accuracy_score

y_pred = rnd_clf.predict(X_test)
accuracy_score(y_test, y_pred)
Out[77]:
0.9705
Exercise: Next, use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%.

31
In [78]:
from [Link] import PCA

pca = PCA(n_components=0.95)
X_train_reduced = pca.fit_transform(X_train)
Exercise: Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was
training much faster?
In [79]:
rnd_clf2 = RandomForestClassifier(n_estimators=100, random_state=42)
t0 = [Link]()
rnd_clf2.fit(X_train_reduced, y_train)
t1 = [Link]()
In [80]:
print("Training took {:.2f}s".format(t1 - t0))
Training took 81.03s

X_test_reduced = [Link](X_test)

y_pred = rnd_clf2.predict(X_test_reduced)
accuracy_score(y_test, y_pred)
Out[81]:
0.9481
In [82]:
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression(multi_class="multinomial", solver="lbfgs",


random_state=42)
t0 = [Link]()
log_clf.fit(X_train, y_train)
t1 = [Link]()

print("Training took {:.2f}s".format(t1 - t0))


Training took 6.94s
Nice! Reducing dimensionality led to over 2× speedup. :) Let's check the model's accuracy:
In [87]:
y_pred = log_clf2.predict(X_test_reduced)
accuracy_score(y_test, y_pred)
Out[87]:
0.9201
A very slight drop in performance, which might be a reasonable price to pay for a 2× speedup, depending on
the application.
So there you have it: PCA can give you a formidable speedup... but not always!

2. Use t-SNE to reduce the MNIST dataset down to two dimensions and plot the
result using Matplotlib. You can use a scatterplot using 10 different colors to rep‐
resent each image’s target class. Alternatively, you can write colored digits at the
location of each instance, or even plot scaled-down versions of the digit images
themselves (if you plot all digits, the visualization will be too cluttered, so you
32
should either draw a random sample or plot an instance only if no other instance
has already been plotted at a close distance). You should get a nice visualization
with well-separated clusters of digits. Try using other dimensionality reduction
algorithms such as PCA, LLE, or MDS and compare the resulting visualizations.

Solution

[Link](42)

m = 10000
idx = [Link](60000)[:m]

X = mnist['data'][idx]
y = mnist['target'][idx]

from [Link] import TSNE

tsne = TSNE(n_components=2, random_state=42)


X_reduced = tsne.fit_transform(X)
Now let's use Matplotlib's scatter() function to plot a scatterplot, using a different color for each digit:
In [90]:
[Link](figsize=(13,10))
[Link](X_reduced[:, 0], X_reduced[:, 1], c=y, cmap="jet")
[Link]('off')
[Link]()
[Link]()

[Link](figsize=(9,9))
cmap = [Link].get_cmap("jet")
for digit in (2, 3, 5):
[Link](X_reduced[y == digit, 0], X_reduced[y == digit, 1],
c=[cmap(digit / 9)])
[Link]('off')
[Link]()

idx = (y == 2) | (y == 3) | (y == 5)
X_subset = X[idx]
y_subset = y[idx]

tsne_subset = TSNE(n_components=2, random_state=42)


X_subset_reduced = tsne_subset.fit_transform(X_subset)
In [93]:
[Link](figsize=(9,9))
for digit in (2, 3, 5):
[Link](X_subset_reduced[y_subset == digit, 0],
X_subset_reduced[y_subset == digit, 1], c=[cmap(digit / 9)])
[Link]('off')
[Link]()

33
from [Link] import MinMaxScaler
from [Link] import AnnotationBbox, OffsetImage

def plot_digits(X, y, min_distance=0.05, images=None, figsize=(13, 10)):


# Let's scale the input features so that they range from 0 to 1
X_normalized = MinMaxScaler().fit_transform(X)
# Now we create the list of coordinates of the digits plotted so far.
# We pretend that one is already plotted far away at the start, to
# avoid `if` statements in the loop below
neighbors = [Link]([[10., 10.]])
# The rest should be self-explanatory
[Link](figsize=figsize)
cmap = [Link].get_cmap("jet")
digits = [Link](y)
for digit in digits:
[Link](X_normalized[y == digit, 0], X_normalized[y == digit,
1], c=[cmap(digit / 9)])
[Link]("off")
ax = [Link]().gca() # get current axes in current figure
for index, image_coord in enumerate(X_normalized):
closest_distance = [Link](neighbors - image_coord,
axis=1).min()
if closest_distance > min_distance:
neighbors = np.r_[neighbors, [image_coord]]
if images is None:
[Link](image_coord[0], image_coord[1],
str(int(y[index])),
color=cmap(y[index] / 9), fontdict={"weight":
"bold", "size": 16})
else:
image = images[index].reshape(28, 28)
imagebox = AnnotationBbox(OffsetImage(image,
cmap="binary"), image_coord)
ax.add_artist(imagebox)

plot_digits(X_reduced, y)

plot_digits(X_reduced, y, images=X, figsize=(35, 25))

plot_digits(X_subset_reduced, y_subset, images=X_subset, figsize=(22, 22))

Let's start with PCA. We will also time how long it takes:
In [98]:
from [Link] import PCA
import time

t0 = [Link]()
X_pca_reduced = PCA(n_components=2, random_state=42).fit_transform(X)
t1 = [Link]()
print("PCA took {:.1f}s.".format(t1 - t0))
plot_digits(X_pca_reduced, y)

34
[Link]()

from [Link] import LocallyLinearEmbedding

t0 = [Link]()
X_lle_reduced = LocallyLinearEmbedding(n_components=2,
random_state=42).fit_transform(X)
t1 = [Link]()
print("LLE took {:.1f}s.".format(t1 - t0))
plot_digits(X_lle_reduced, y)
[Link]()

from [Link] import Pipeline

pca_lle = Pipeline([
("pca", PCA(n_components=0.95, random_state=42)),
("lle", LocallyLinearEmbedding(n_components=2, random_state=42)),
])
t0 = [Link]()
X_pca_lle_reduced = pca_lle.fit_transform(X)
t1 = [Link]()
print("PCA+LLE took {:.1f}s.".format(t1 - t0))
plot_digits(X_pca_lle_reduced, y)
[Link]()

from [Link] import MDS

m = 2000
t0 = [Link]()
X_mds_reduced = MDS(n_components=2, random_state=42).fit_transform(X[:m])
t1 = [Link]()
print("MDS took {:.1f}s (on just 2,000 MNIST images instead of
10,000).".format(t1 - t0))
plot_digits(X_mds_reduced, y[:m])
[Link]()

35
LAB 12 – Neural Networks and Deep Learning
Perceptron learning algorithm

LAB 13 – Neural Networks and Deep Learning (cont)


Multilayer neural network

36
LAB 14 – Recommender Systems
Question 1: Reading Movie-Lens data
Solution:
# Import libraries
import numpy as np
import pandas as pd

# Reading ratings file


ratings = pd.read_csv(path + '[Link]', sep='\t', encoding='latin-1',
usecols=['user_id', 'movie_id', 'rating', 'timestamp'])

# Reading users file


users = pd.read_csv(path + '[Link]', sep='\t', encoding='latin-1',
usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])

# Reading movies file


movies = pd.read_csv(path + '[Link]', sep='\t', encoding='latin-1',
usecols=['movie_id', 'title', 'genres'])
Let's take a look at the movies and ratings dataframes.
In [4]:
[Link]()
Out[4]:

movie_id title genres

0 1 Toy Story (1995) Animation|Children's|Comedy

1 2 Jumanji (1995) Adventure|Children's|Fantasy

2 3 Grumpier Old Men (1995) Comedy|Romance

3 4 Waiting to Exhale (1995) Comedy|Drama

4 5 Father of the Bride Part II (1995) Comedy

In [5]:
[Link]()
Out[5]:

user_id movie_id rating timestamp

37
user_id movie_id rating timestamp

0 1 1193 5 978300760

1 1 661 3 978302109

2 1 914 3 978301968

3 1 3408 4 978300275

4 1 2355 5 978824291
Also let's count the number of unique users and movies.
In [6]:
n_users = ratings.user_id.unique().shape[0]
n_movies = ratings.movie_id.unique().shape[0]
print 'Number of users = ' + str(n_users) + ' | Number of movies = ' +
str(n_movies)
Number of users = 6040 | Number of movies = 3706
Now I want the format of my ratings matrix to be one row per user and one column per movie. To
do so, I'll pivot ratings to get that and call the new variable Ratings (with a capital *R).
In [7]:
Ratings = [Link](index = 'user_id', columns ='movie_id', values =
'rating').fillna(0)
[Link]()
Out[7]:

.
movi 1 39 39 39 39 39 39 39 39 39 39
1 2 3 4 5 6 7 8 9 .
e_id 0 43 44 45 46 47 48 49 50 51 52
.

user_
id

.
5. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
1 .
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.

.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
2 .
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.

.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
3 .
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.

38
.
movi 1 39 39 39 39 39 39 39 39 39 39
1 2 3 4 5 6 7 8 9 .
e_id 0 43 44 45 46 47 48 49 50 51 52
.

user_
id

.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
4 .
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.

.
0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
5 .
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.

5 rows × 3706 columns


Last but not least, I need to de-normalize the data (normalize by each users mean) and convert it
from a dataframe to a numpy array.
In [8]:
R = Ratings.as_matrix()
user_ratings_mean = [Link](R, axis = 1)
Ratings_demeaned = R - user_ratings_mean.reshape(-1, 1)
/usr/local/lib/python2.7/dist-packages/ipykernel_launcher.py:1:
FutureWarning: Method .as_matrix will be removed in a future version.
Use .values instead.
"""Entry point for launching an IPython kernel.

Hệ khuyến nghị dựa trên nội dung


Build two Content Based Recommenders based on:

• Movie Overviews and Taglines


• Movie Cast, Crew, Keywords and Genre
Also, as mentioned in the introduction, I will be using a subset of all the movies available to us due
to limiting computing power available to me.

In [15]:
links_small = pd.read_csv('../input/links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')
In [16]:
md = [Link]([19730, 29503, 35587])
In [17]:
#Check EDA Notebook for how and why I got these indices.
md['id'] = md['id'].astype('int')
In [18]:
39
smd = md[md['id'].isin(links_small)]
[Link]
Out[18]:
(9099, 25)
We have 9099 movies avaiable in our small movies metadata dataset which is 5 times smaller
than our original dataset of 45000 movies.

Movie Description Based Recommender


Let us first try to build a recommender using movie descriptions and taglines. We do not have a
quantitative metric to judge our machine's performance so this will have to be done qualitatively.

smd['tagline'] = smd['tagline'].fillna('')
smd['description'] = smd['overview'] + smd['tagline']
smd['description'] = smd['description'].fillna('')
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0,
stop_words='english')
tfidf_matrix = tf.fit_transform(smd['description'])
tfidf_matrix.shape
Out[21]:
(9099, 268124)
Cosine Similarity
I will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity
between two movies. Mathematically, it is defined as follows:

cosine(x,y)=x.y⊺||x||.||y||cosine(x,y)=x.y⊺||x||.||y||
Since we have used the TF-IDF Vectorizer, calculating the Dot Product will directly give us the
Cosine Similarity Score. Therefore, we will use sklearn's linear_kernel instead of
cosine_similarities since it is much faster.

In [22]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
In [23]:
cosine_sim[0]
Out[23]:
array([ 1. , 0.00680476, 0. , ..., 0. ,
0.00344913, 0. ])
We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is
to write a function that returns the 30 most similar movies based on the cosine similarity score.

In [24]:
smd = smd.reset_index()
titles = smd['title']
indices = [Link]([Link], index=smd['title'])
In [25]:
def get_recommendations(title):
idx = indices[title]
sim_scores = list(enumerate(cosine_sim[idx]))
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
sim_scores = sim_scores[1:31]
40
movie_indices = [i[0] for i in sim_scores]
return [Link][movie_indices]
We're all set. Let us now try and get the top recommendations for a few movies and see how
good the recommendations are.

In [26]:
get_recommendations('The Godfather').head(10)
Out[26]:
973 The Godfather: Part II
8387 The Family
3509 Made
4196 Johnny Dangerously
29 Shanghai Triad
5667 Fury
2412 American Movie
1582 The Godfather: Part III
4221 8 Women
2159 Summer of Sam
Name: title, dtype: object
In [27]:
get_recommendations('The Dark Knight').head(10)
Out[27]:
7931 The Dark Knight Rises
132 Batman Forever
1113 Batman Returns
8227 Batman: The Dark Knight Returns, Part 2
7565 Batman: Under the Red Hood
524 Batman
7901 Batman: Year One
2579 Batman: Mask of the Phantasm
2696 JFK
8165 Batman: The Dark Knight Returns, Part 1
Name: title, dtype: object

Hệ khuyến nghị dựa trên lọc cộng tác


Introduction

This example demonstrates Collaborative filtering using the Movielens dataset to


recommend movies to users. The MovieLens ratings dataset lists the ratings given by
a set of users to a set of movies. Our goal is to be able to predict ratings for movies a
user has not yet watched. The movies with the highest predicted ratings can then be
recommended to the user.

The steps in the model are as follows:

1. Map user ID to a "user vector" via an embedding matrix


2. Map movie ID to a "movie vector" via an embedding matrix
41
3. Compute the dot product between the user vector and movie vector, to obtain
the a match score between the user and the movie (predicted rating).
4. Train the embeddings via gradient descent using all known user-movie pairs.

import pandas as pd
import numpy as np
from zipfile import ZipFile
import tensorflow as tf
from tensorflow import keras
from [Link] import layers
from pathlib import Path
import [Link] as plt

First, load the data and apply preprocessing


# Download the actual data from [Link]
[Link]"
# Use the [Link] file
movielens_data_file_url = (
"[Link]
)
movielens_zipped_file = [Link].get_file(
"[Link]", movielens_data_file_url, extract=False
)
keras_datasets_path = Path(movielens_zipped_file).parents[0]
movielens_dir = keras_datasets_path / "ml-latest-small"

# Only extract the data the first time the script is run.
if not movielens_dir.exists():
with ZipFile(movielens_zipped_file, "r") as zip:
# Extract files
print("Extracting all the files now...")
[Link](path=keras_datasets_path)
print("Done!")

ratings_file = movielens_dir / "[Link]"


df = pd.read_csv(ratings_file)
First, need to perform some preprocessing to encode users and movies as integer
indices.

user_ids = df["userId"].unique().tolist()
user2user_encoded = {x: i for i, x in enumerate(user_ids)}
userencoded2user = {i: x for i, x in enumerate(user_ids)}
movie_ids = df["movieId"].unique().tolist()
movie2movie_encoded = {x: i for i, x in enumerate(movie_ids)}
movie_encoded2movie = {i: x for i, x in enumerate(movie_ids)}
df["user"] = df["userId"].map(user2user_encoded)
df["movie"] = df["movieId"].map(movie2movie_encoded)

num_users = len(user2user_encoded)
num_movies = len(movie_encoded2movie)
df["rating"] = df["rating"].[Link](np.float32)
# min and max ratings will be used to normalize the ratings later
min_rating = min(df["rating"])
max_rating = max(df["rating"])
42
print(
"Number of users: {}, Number of Movies: {}, Min rating: {}, Max rating:
{}".format(
num_users, num_movies, min_rating, max_rating
)
)
Number of users: 610, Number of Movies: 9724, Min rating: 0.5, Max rating: 5.0

Prepare training and validation data


df = [Link](frac=1, random_state=42)
x = df[["user", "movie"]].values
# Normalize the targets between 0 and 1. Makes it easy to train.
y = df["rating"].apply(lambda x: (x - min_rating) / (max_rating -
min_rating)).values
# Assuming training on 90% of the data and validating on 10%.
train_indices = int(0.9 * [Link][0])
x_train, x_val, y_train, y_val = (
x[:train_indices],
x[train_indices:],
y[:train_indices],
y[train_indices:],
)

Create the model

We embed both users and movies in to 50-dimensional vectors.

The model computes a match score between user and movie embeddings via a dot
product, and adds a per-movie and per-user bias. The match score is scaled to
the [0, 1] interval via a sigmoid (since our ratings are normalized to this range).

EMBEDDING_SIZE = 50

class RecommenderNet([Link]):
def __init__(self, num_users, num_movies, embedding_size, **kwargs):
super(RecommenderNet, self).__init__(**kwargs)
self.num_users = num_users
self.num_movies = num_movies
self.embedding_size = embedding_size
self.user_embedding = [Link](
num_users,
embedding_size,
embeddings_initializer="he_normal",
embeddings_regularizer=[Link].l2(1e-6),
)
self.user_bias = [Link](num_users, 1)
self.movie_embedding = [Link](
num_movies,
embedding_size,
embeddings_initializer="he_normal",
embeddings_regularizer=[Link].l2(1e-6),

43
)
self.movie_bias = [Link](num_movies, 1)

def call(self, inputs):


user_vector = self.user_embedding(inputs[:, 0])
user_bias = self.user_bias(inputs[:, 0])
movie_vector = self.movie_embedding(inputs[:, 1])
movie_bias = self.movie_bias(inputs[:, 1])
dot_user_movie = [Link](user_vector, movie_vector, 2)
# Add all the components (including bias)
x = dot_user_movie + user_bias + movie_bias
# The sigmoid activation forces the rating to between 0 and 1
return [Link](x)

model = RecommenderNet(num_users, num_movies, EMBEDDING_SIZE)


[Link](
loss=[Link](),
optimizer=[Link](learning_rate=0.001),
)

Train the model based on the data split


history = [Link](
x=x_train,
y=y_train,
batch_size=64,
epochs=5,
verbose=1,
validation_data=(x_val, y_val),
)
Epoch 1/5
1418/1418 [==============================] - 9s 6ms/step - loss: 0.6358 -
val_loss: 0.6200
Epoch 2/5
1418/1418 [==============================] - 11s 8ms/step - loss: 0.6134 -
val_loss: 0.6173
Epoch 3/5
1418/1418 [==============================] - 14s 10ms/step - loss: 0.6082 -
val_loss: 0.6156
Epoch 4/5
1418/1418 [==============================] - 15s 10ms/step - loss: 0.6072 -
val_loss: 0.6144
Epoch 5/5
1418/1418 [==============================] - 13s 9ms/step - loss: 0.6074 -
val_loss: 0.6147

Plot training and validation loss


[Link]([Link]["loss"])
[Link]([Link]["val_loss"])
[Link]("model loss")
[Link]("loss")
[Link]("epoch")
[Link](["train", "test"], loc="upper left")
[Link]()

44
Show top 10 movie recommendations to a user
movie_df = pd.read_csv(movielens_dir / "[Link]")

# Let us get a user and see the top recommendations.


user_id = [Link](1).iloc[0]
movies_watched_by_user = df[[Link] == user_id]
movies_not_watched = movie_df[
~movie_df["movieId"].isin(movies_watched_by_user.[Link])
]["movieId"]
movies_not_watched = list(
set(movies_not_watched).intersection(set(movie2movie_encoded.keys()))
)
movies_not_watched = [[movie2movie_encoded.get(x)] for x in movies_not_watched]
user_encoder = user2user_encoded.get(user_id)
user_movie_array = [Link](
([[user_encoder]] * len(movies_not_watched), movies_not_watched)
)
ratings = [Link](user_movie_array).flatten()
top_ratings_indices = [Link]()[-10:][::-1]
recommended_movie_ids = [
movie_encoded2movie.get(movies_not_watched[x][0]) for x in top_ratings_indices
]

print("Showing recommendations for user: {}".format(user_id))


print("====" * 9)
print("Movies with high ratings from user")
print("----" * 8)
top_movies_user = (
movies_watched_by_user.sort_values(by="rating", ascending=False)
.head(5)
.[Link]
)
movie_df_rows = movie_df[movie_df["movieId"].isin(top_movies_user)]
for row in movie_df_rows.itertuples():
print([Link], ":", [Link])

print("----" * 8)
45
print("Top 10 movie recommendations")
print("----" * 8)
recommended_movies = movie_df[movie_df["movieId"].isin(recommended_movie_ids)]
for row in recommended_movies.itertuples():
print([Link], ":", [Link])
302/302 [==============================] - 0s 800us/step
Showing recommendations for user: 213
====================================
Movies with high ratings from user
--------------------------------
Terminator 2: Judgment Day (1991) : Action|Sci-Fi
Rocky (1976) : Drama
Big Fish (2003) : Drama|Fantasy|Romance
Shrek 2 (2004) : Adventure|Animation|Children|Comedy|Musical|Romance
13 Assassins (Jûsan-nin no shikaku) (2010) : Action
--------------------------------
Top 10 movie recommendations
--------------------------------
Usual Suspects, The (1995) : Crime|Mystery|Thriller
Star Wars: Episode IV - A New Hope (1977) : Action|Adventure|Sci-Fi
Shawshank Redemption, The (1994) : Crime|Drama
Schindler's List (1993) : Drama|War
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964) :
Comedy|War
Godfather: Part II, The (1974) : Crime|Drama
Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001) : Comedy|Romance
28 Days Later (2002) : Action|Horror|Sci-Fi
Little Miss Sunshine (2006) : Adventure|Comedy|Drama
Hurt Locker, The (2008) : Action|Drama|Thriller|War

LAB 15 – Evaluation Labs


===== // =====

46

You might also like