0% found this document useful (0 votes)

33 views46 pages

Machine Learning Labs Overview

Uploaded by

Sơn Trịnh

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

33 views46 pages

Machine Learning Labs Overview

Uploaded by

Sơn Trịnh

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Machine Learning Practices

LAB 01 – Overview of Machine Learning, Tool & Library ..................................... 2

LAB 02 – Machine Learning Project ..................................................................... 3
LAB 03 – Regression Techniques ......................................................................... 4
LAB 04 – Regression Techniques (cont) ............................................................... 4
LAB 05 – Classification Techniques ...................................................................... 7
LAB 06 – Classification Techniques (cont)............................................................ 9
LAB 07 – Classification Techniques (cont).......................................................... 12
LAB 08 – Evaluation Labs ................................................................................... 14
LAB 09 – Clustering Techniques......................................................................... 18
LAB 10 – Ensemble Learning and Random Forests ............................................ 18
LAB 11 – Dimensionality Reduction ................................................................... 31
LAB 12 – Neural Networks and Deep Learning .................................................. 36
LAB 13 – Neural Networks and Deep Learning (cont) ........................................ 36
LAB 14 – Recommender Systems ...................................................................... 37
LAB 15 – Evaluation Labs ................................................................................... 37

1
LAB 01 – Overview of Machine Learning, Tool & Library

1. How would you define Machine Learning?

2. Can you name four types of problems where it shines?
3. What is a labeled training set?
4. What are the two most common supervised tasks?
5. Can you name four common unsupervised tasks?
6. What type of Machine Learning algorithm would you use to allow a robot to walk
in various unknown terrains?
7. What type of algorithm would you use to segment your customers into multiple
groups?
8. Would you frame the problem of spam detection as a supervised learning problem
or an unsupervised learning problem?
9. What is an online learning system?
10. What is out-of-core learning?
11. What type of learning algorithm relies on a similarity measure to make predictions?
12. What is the difference between a model parameter and a learning algorithm’s
hyperparameter?
13. What do model-based learning algorithms search for? What is the most common
strategy they use to succeed? How do they make predictions?
14. Can you name four of the main challenges in Machine Learning?
15. If your model performs great on the training data but generalizes poorly to new
instances, what is happening? Can you name three possible solutions?
16. What is a test set and why would you want to use it?
17. What is the purpose of a validation set?
18. What can go wrong if you tune hyperparameters using the test set?
19. What is repeated cross-validation and why would you prefer it to using a single
validation set?

2
LAB 02 – Machine Learning Project

1. Try a Support Vector Machine regressor ([Link]), with various

hypeparameters such as kernel="Linear" (with various values for the C
hyperparameter) or kernel="rbf" (with various values for the C and gamma
hyperparameters). How does the best SVR predictor perform?
2. Try replacing GridSearchCV with RandomizedSearchCV.
3. Try adding a transformer in the preparation pipeline to select only the most important
attributes.
4. Try creating a single pipeline that does the full data preparation plus the final
prediction.
5. Automatically explore some preparation options using GridSearchCV.

3
LAB 03 – Regression Techniques

Question 1: Apply Gradient Descent to find minimum of a function

Solution:
function = lambda x: (x ** 3)-(3 *(x ** 2))+7
x = [Link](-1,3,500)
[Link](x, function(x))
[Link](0, color='y', alpha = 0.7)
[Link](0, color='y', alpha = 0.7)
[Link]()
[Link]("Function (x ** 3)-(3 *(x ** 2))+7 ")
[Link]()
pip install numdifftools
import numdifftools as nd
dydx = lambda x : float([Link](function)([x]))
x_sample = 0.1
print(f"Derivatives of function at {x_sample} is {dydx(0.1)}")
def step(x_new, x_prev, precision, l_r):
'''
Description: This function takes in an initial or previous value for x,
updates it based on
steps taken via the learning rate and outputs the most minimum value of
x that reaches
the precision satisfaction.

Arguments:
x_new - a starting value of x that will get updated based on the
learning rate

x_prev - the previous value of x that is getting updated to the new one

precision - a precision that determines the stop of the stepwise

descent

l_r - the learning rate (size of each descent step)

'''
# create empty lists where the updated values of x and y wil be
appended during each iteration
x_list, y_list = [x_new], [function(x_new)]
# keep looping until your desired precision
while abs(x_new - x_prev) > precision:
# change the value of x
x_prev = x_new
# get the derivation of the old value of x
d_x = - dydx(x_prev)
# get your new value of x by adding the previous, the
multiplication of the derivative and the learning rate
x_new = x_prev + (l_r * d_x)
# append the new value of x to a list of all x-s for later
visualization of path
x_list.append(x_new)
# append the new value of y to a list of all y-s for later
visualization of path

4
y_list.append(function(x_new))
x_min = x_new
print ("Local minimum occurs at: "+ str(x_min))
print ("Number of steps: " + str(len(x_list)))

return x_list, y_list, x_min

#Implement gradient descent (all the arguments are arbitrarily chosen)
x_list, y_list, x_min = step(x_new = 0.5, x_prev = 0, precision = 0.001,
l_r = 0.05)

[Link](x, function(x))
[Link](x_list,y_list,c="r")
[Link](0, color='y', alpha = 0.7)
[Link](0, color='y', alpha = 0.7)
[Link]()
[Link]("Function (x ** 3)-(3 *(x ** 2))+7 ")
[Link]()
[Link](x_list,y_list,c="g")
[Link](x_list,y_list,c="g")
[Link](x,function(x), c="r")
[Link]([1.0,2.1])
[Link]("Zoomed in Gradient descent to Key Area")
[Link]()
[Link]()

Question 2: Draw all the loss functions

range_min = -4
range_max = 4
stepsize = 0.1
m = 1.0

x = [Link](range_min, range_max + stepsize, stepsize)

y_hinge = [Link](0,m-x)
y_log = [Link](1 + [Link](-x))
y_exp = [Link](-x)

fig, ax = [Link](figsize=(15, 8))

[Link](x, y_hinge, c='r', linewidth=4, linestyle=(0,(1,1)), label= "hinge:
max(0,m-z)", alpha=0.7)
[Link](x, y_log, c='g', linewidth=4, linestyle='dashdot', label="log :
log(1+\exp(-z))", alpha=0.7)
[Link](x, y_exp, c='orange', linewidth=4, linestyle=(0,(4,1,4,1,1,1)),
label="exp: exp(-z)", alpha=0.7)
[Link]()
[Link]("z = y * xf(x)")
[Link]("Loss")
[Link](-0.5, 5)
[Link](0, color='black', alpha = 0.7)
[Link](0, color='black', alpha = 0.7)
[Link]()
[Link](
def meanabsoluteerror(ytrue,ypred):
trainingexamples = [Link][0]
print("Total no of training examples is "+str(trainingexamples))
errorvector = ypred-ytrue
print("Shape of error vector "+str([Link]))
5
error = (1/trainingexamples)*[Link]([Link](errorvector))
return error

def meansquarelogerror(ytrue,ypred):
trainingexamples = [Link][0]
print("Total no of training examples is "+str(trainingexamples))
errorvector = [Link]([Link](ypred))-[Link]([Link](ytrue))
print("Shape of error vector "+str([Link]))
error = (1/trainingexamples)*[Link](errorvector**2)
return error

def sigmoid(arrays):
return 1/(1+[Link]((-1)*arrays))

def crossentropy(ytrue,ypred):
trainingexamples = [Link][0]
errorvector = -1 * ytrue * [Link](ypred)
return (1/trainingexamples)*[Link](errorvector)
)
trainingsets = [Link](100,1)
trainingsets_pred = [Link](100,1)
error = meanabsoluteerror(trainingsets,trainingsets_pred)
print(f"Mean Absolute Error is {error}\n")

error = meansquarelogerror(trainingsets,trainingsets_pred)
print(f"Mean Square Log Error is {error}\n")

sig_train=sigmoid(trainingsets)
sig_pred=sigmoid(trainingsets_pred)
print(trainingsets[:5])
print(sig_train[:5])
error = crossentropy(sig_train,sig_pred)
print(f"Cross Entropy Loss is {error}\n")

6
LAB 04 – Regression Techniques (cont)

Question 1: Draw all the activation functions

Solution:
import [Link] as plt
import numpy as np
import math
import tensorflow as tf
def PlotActivationFuction(x, y1, y2, title = "Activation y1 and y2"):
fig, (ax1, ax2) = [Link](2, 1, figsize=(15,7))
[Link](title)

[Link](x, y1, '-ro')

[Link](xlabel='x', ylabel='y1')
[Link]()
[Link](x, y2, '-go')
[Link](xlabel='x', ylabel='y2')
[Link]()

[Link]()
# Sigmoid (Logistic Function)
x = [Link](-10, 10, 50)

sigmoid = lambda z: 1/(1 + [Link](-z))

relu = lambda z: max(0.0, z)
leakyrelu = lambda z, alpha : z if z > 0 else alpha*z
tanh = lambda z : ([Link](z) - [Link](-z))/ ([Link](z) + [Link](-z))
swish = lambda z : z* 1/(1 + [Link](-z))
ELU = lambda z : z if z > 0 else (1*([Link](z)-1))

y_sigmoid_manually, y_sigmoid_library = sigmoid(x),

[Link](x)
PlotActivationFuction(x, y_sigmoid_manually, y_sigmoid_library, "Sigmoid
Activation Manually and Library")

y_relu_manually, y_relu_library = [relu(i) for i in x],

[Link](x)
PlotActivationFuction(x, y_relu_manually, y_relu_library, "Relu Activation
Manually and Library")

y_leakyrelu_manually, y_leakyrelu_library = [leakyrelu(i, 0.1) for i in x]

\
, [Link](x,
alpha= 0.1)
PlotActivationFuction(x, y_leakyrelu_manually, y_leakyrelu_library, "Leaky
Relu Activation Manually and Library")

y_tanh_manually, y_tanh_library = tanh(x), [Link](x)

PlotActivationFuction(x, y_tanh_manually, y_tanh_library, "Tanh Relu
Activation Manually and Library")

y_swish_manually, y_swish_library = swish(x), [Link](x)

PlotActivationFuction(x, y_swish_manually, y_swish_library, "Swish Relu
Activation Manually and Library")

7
y_elu_manually, y_elu_library = [ELU(i) for i in x] ,
[Link](x)
PlotActivationFuction(x, y_elu_manually, y_elu_library, "ELU Relu
Activation Manually and Library")

Question 2: Draw all the derivatives of activation functions

Solution:

def PlotDerivative(x, y, dydx, title = "Functions and Its Derivatives"):

fig, ax = [Link](figsize=(15, 3))
[Link](x,y, color= 'red', linewidth=3, label="function")
[Link](x,dydx, color= 'green', linewidth=3, label="derivative")
[Link](loc="upper right")
[Link](0, color='y', alpha = 0.7)
[Link](0, color='y', alpha = 0.7)
[Link]()
[Link](title)
[Link]()

In [76]:
x = [Link](-5.0, 5.0, 0.01)
dx = x[1]-x[0]

y_sigmoid = [Link](x)
dydx_sigmoid = [Link](y_sigmoid, dx)
PlotDerivative(x, y_sigmoid, dydx_sigmoid, title = "Sigmoid Functions and Its
Derivatives")

y_relu = [Link](x)
dydx_relu = [Link](y_relu, dx)
PlotDerivative(x, y_relu, dydx_relu, title = "Relu Functions and Its Derivatives")

y_leakyrelu = [Link](x, alpha= 0.3)

dydx_leakyrelu = [Link](y_leakyrelu, dx)
PlotDerivative(x, y_leakyrelu, dydx_leakyrelu, title = "Leaky Relu Functions and Its
Derivatives")

y_tanh = [Link](x)
dydx_tanh = [Link](y_tanh, dx)
PlotDerivative(x, y_tanh, dydx_tanh, title = "Tanh Functions and Its Derivatives")
y_swish = [Link](x)
dydx_swish = [Link](y_swish, dx)
PlotDerivative(x, y_swish, dydx_swish, title = "Swish Functions and Its
Derivatives")
y_elu = [Link](x)
dydx_elu = [Link](y_elu, dx)
PlotDerivative(x, y_elu, dydx_elu, title = "ELU Functions and Its Derivatives")

8
LAB 05 – Classification Techniques

Question 1: Descibe KNN by handling

Solution
import numpy as np
from collections import Counter

def euclidian_distance(x1, x2): # x1, x2 là 2 vector.

return [Link](sum((x1 - x2) ** 2))

class KNeighborsClassifier:
def __init__(self, k=3):
self.k = k

def fit(self, X, y): # Store sample and label

self.X_train = X
self.y_train = y

def predict(self, X):

predicted_label = [self._predict(x) for x in X]
return [Link](predicted_label)

def _predict(self, x):

# Compute distances
distances = [euclidian_distance(x, x_train) for x_train in self.X_t
rain]
# Trả về index của mảng BAN ĐẦU sau khi được xắp xếp
k_indices = [Link](distances)[0:self.k]
# Lấy ra label của những sample có khảng cách gần nhất đến điểm đan
g xet
k_nearest_label = [self.y_train[i] for i in k_indices]
# Trả về tupple có dạng (label xuất hiện nhiều nhất, số lần xuất hi
ện)
most_common = Counter(k_nearest_label).most_common(1)
# Trả về label có số lần xuất hiện nhiều nhất
return most_common[0][0]

def loadDataTest(path):
f = open(path, "r")
data = [Link](f) #csv format
data = [Link](list(data))# covert to matrix
data = [Link](data, 0, 0)# delete header
data = [Link](data, 0, 1) # delete index
[Link](data) # shuffle data
[Link]()
testSet = data[:]# the others is testing data
return testSet
def calcDistancs(pointA, pointB, numOfFeature=4):
tmp = 0
for i in range(numOfFeature):
tmp += (float(pointA[i]) - float(pointB[i])) ** 2
return [Link](tmp)
def kNearestNeighbor(trainSet, point, k):
distances = []
for item in trainSet:
[Link]({
"label": item[-1],
"value": calcDistancs(item, point)
})
[Link](key=lambda x: x["value"])
labels = [item["label"] for item in distances]
return labels[:k]
def findMostOccur(arr):
labels = set(arr) # set label
ans = ""
maxOccur = 0
for label in labels:
num = [Link](label)
if num > maxOccur:
maxOccur = num
ans = label
return ans

10
if __name__ == "__main__":
trainSet, testSet = loadData("[Link]")
testSet1 = loadDataTest("Iris_classification.csv")
for item in testSet:
knn = kNearestNeighbor(trainSet, item, 5)
answer = findMostOccur(knn)
print("label: {} -> predicted: {}".format(item[-1], answer))
# for item in testSet1:
# knn = kNearestNeighbor(trainSet, item, 11)
# answer = findMostOccur(knn)
# print("label: {} -> predicted: {}".format(item[-1], answer))

11
LAB 06 – Classification Techniques (cont)

[Link] Bayes
In this question you will implement a Naive Bayes classifier for a text classification problem. You will
be given a collection of text articles, each coming from either the serious European magazine The
Economist, or from the not-so-serious American magazine The Onion. The goal is to learn a classifier
that can distinguish between articles from each magazine. We have pre-processed the articles so that
they are easier to use in your experiments. We extracted the set of all words that occur in any of the
articles. This set is called the vocabulary and we let V be the number of words in the vocabulary. For
each article, we produced a feature vector X = hX1, . . . , XV i, where Xi is equal to 1 if the i th word
appears in the article and 0 otherwise.
Each article is also accompanied by a class label of either 1 for The Economist or 2 for The Onion.
Later in the question we give instructions for loading this data into Octave. When we apply the Naive
Bayes classification algorithm, we make two assumptions about the data: first, we assume that our
data is drawn iid from a joint probability distribution over the possible feature vectors X and the
corresponding class labels Y ; second, we assume for each pair of features Xi and Xj with i 6= j that Xi
is conditionally independent of Xj given the class label Y (this is the Naive Bayes assumption). Under
these assumptions, a natural classification rule is as follows: Given a new input X, predict the most
probable class label Yˆ given X. Formally,

[Link] Trees
Question 1. How many unique, perfect binary trees of depth 3 can be drawn if we have 5 attributes.
By depth, we mean depth of the splits, not including the nodes that only contain a label. So a tree
that checks just one attribute is a depth 1 tree. By perfect binary tree, we mean every node has either
0 or 2 children, and every leaf is at the same depth. Note also that a tree with the same attributes but
organized at different depths are considered “unique”. Do not include trees that test the same
attribute along the same path in the tree.

Question 2: Consider the following dataset for this problem. Given the five attributes on the left, we
want to predict if the student got an A in the course.

12
Create 2 decision trees for this dataset. For the first, only go to depth 1. For the second go to depth 2.
For all trees, use the ID3 entropy algorithm from class. For each node of the tree, show the decision,
the number of positive and negative examples and show the entropy at that node.

Hint: There are a lot of calculations here. You may want to do this programatically.
(Bonus)Make one more decision tree. Use the same procedure as in (b), but make it depth 3. Now,
given these three trees, which would you prefer if you wanted to predict the grades of 10 new
students who are not included in this dataset? Justify your choice.

Question 3. Recall the definition of the “realizable case.” For some fixed concept class C, such as
decision trees, a realizable case is one where the algorithm gets a sample consistent with some
concept c ∈ C. In other words, for decision trees, a case is realizable if there is some tree that
perfectly classifies the dataset. If the number of attributes A is sufficiently large, under what condition
would a dataset not be realizable for decision trees of no fixed depth? Prove that the dataset is
unrealizable if and only if that condition is true.

13
LAB 07 – Classification Techniques (cont)

[Link] Vector Machines

Question 1
Show that the criterion
corresponds to correct classification for all n samples in a binary classification problem where

and ℒ𝑘 denotes the index set of class k.

Question 2
Given a binary data set:

Plot the points. Sketch the support vectors and the decision boundary for a linear SVM
classifier with maximum margin for this data set.

Question 3
Given the binary classification problem:

14
a) Sketch the points in a scatterplot (preferably with different colors for the different classes).
b) In the plot, sketch the mean values and the decision boundary you would get with a
Gaussian classifier with covariance matrix Σ=σ2I, where I is the identity matrix.
c) What is the error rate of the Gaussian classifier on the training data set?
d) Sketch on the plot the decision boundary you would get using a SVM with linear kernel
and a high cost of misclassifying training data. Indicate the support vectors and the decision
boundary on the plot.
e) What is the error rate of the linear SVM on the training data set?
Question 4
a) Download the two datasets [Link] and [Link].
You can use a library for SVM, e.g. scikit-learn in python.
Hint:

from sklearn import svm

# In this case we assume 2d features.

# X: numpy array with shape (n, 2), all n data points.
# Y: numpy array of shape (n), labels in {0, 1, 2, 3, ...} corresponding to X

classifier = [Link](C=10, kernel='linear')

[Link](X, Y)

print("The feature [1, 2] is classified as:", [Link]([[1, 2]]))

Familiarize yourself with the data sets by studying scatterplots.

b) Load [Link]. Stick with the linear SVM, but change the C-parameter.

Rerun the experiments a couple of times, and visualize the data using something like the following:

import numpy as np

15
import [Link] as plt

def make_meshgrid(X, h=.02):

"""Make a meshgrid covering the range of X. This will be used to draw
classification regions

Args:
X: numpy array with shape [n, 2] containing 2d feature vectors.
h: parameter controlling the resolution of the meshgrid
"""
x = X[:, 0]
y = X[: ,1]
x_min, x_max = [Link]() - 1, [Link]() + 1
y_min, y_max = [Link]() - 1, [Link]() + 1
xx, yy = [Link]([Link](x_min, x_max, h), [Link](y_min, y_max, h))
return xx, yy

def scatter(X, Y, xx, yy, Z):

"""
Scatter plot with classification regions

Args:
X: numpy array of shape [n, 2] where n is the total number of datapoints
Y: numpy array of shape [n] containing the labels {1, 2, 3, ...} of X
xx: meshgrid x
yy: meshgrid y
Z: The result of applying some prediction function on all points in xx and
yy
"""
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

[Link]()
# Color class regions
[Link]().contourf(xx, yy, Z, alpha=0.7)
# Data points
[Link](X[:, 0], X[:, 1], c=Y, cmap=[Link], marker='o',
edgecolors='k')
[Link](x_min, x_max)
[Link](y_min, y_max)
[Link]().set_aspect('equal')
[Link]()
plt.tight_layout()

[Link]()

# Given X and Y as explained above, we can display the data and classification
boundaries as
classifier = [Link](C=100.0, kernel='linear')
[Link](X, Y)

xx, yy = make_meshgrid(X)
Z = [Link](np.c_[[Link](), [Link]()])
Z = [Link]([Link])
scatter(X, Y, xx, yy, Z)

How does the support vectors and the boundary change with the parameter?

16
c) Try to remove some of the non-support-vectors and rerun. Does the solution change?

d) Load [Link]. Try various values of the C−parameter with a linear SVM. Can the linear SVM
classifier make a good separation of the feature space?

e) Change kernel to a RBF (Radial Basis Function), and rerun. Try changing the σ-parameter.

In the lecture we defined the rbf kernel as

In sklearn, σ is expressed through a γ parameter, and the rbf kernel is given as

f) Implement a grid search of the C- and σ- parameters based on 10-fold cross-validation of the training data
(the A-dataset). Find the best values of C and σ, retrain on the entire A–dataset, and the test on the B-dataset.
Does the average 10-fold cross-validation estimate of the overall classification error match the result we get
when testing on the independent B-dataset?

You can for example use the parameter ranges suggested in the lecture slides:

LAB 08 – Evaluation Labs

17
LAB 09 – Clustering Techniques

1.K-Mean
Question 1: Describle K-Means by handling
Solution:
import numpy as np
import scipy
from [Link] import KMeans
"""
Ma trận đầu vào X có dạng(m_sample, n_feature) có nghĩa là mỗi hàng của X b
iểu diễn 1 điểm dữ liệu
Nếu X chưa thỏa mãn điều kiện trên thì phải reshape lại => Bắt buộc
"""
def init_centers(X, k):
# Chọn ra và trả về k điểm ngẫu nhiên trong X
return X[[Link]([Link][0], k, replace=False)]

def assign_labels(X, centers):

# Tính khoảng cách từ điểm dữ liệu đang xét
Distances = [Link](X, centers)
# Phân nó vào cụm gần nhất
labels = [Link](Distances, axis=1)
[Link]([Link][0], 1)
return labels # Vector cột mà mỗi hàng của nó là label ứng với hàng tr
ong X
def update_centers(X, labels, K):
# Tính trung bình của các điểm dữ liệu trong X có cùng label
centers = [Link]((K, [Link][1]))
for k in range(K):
# Thu thập tất cả các điểm dữ liệu được phân vào cụm thứ k
Xk = X[labels == k, :]
# Tính trung bình của cụm để ra centers
centers[k, :] = [Link](Xk, axis=0)
return centers
def has_converged(centers, new_centers):
# So sánh giữa centers và new_centers, nếu giống nhau thì trả về True,
khác thì False
return set([tuple(a) for a in centers]) == set([tuple(a) for a in new_c
enters])
def kmeans(X, K):
# Khởi tạo center ngẫu nhiên

18
centers = [init_centers(X, K)]
labels = []
n_iter = 0 # Số vòng lặp
while True:
# Phân cụm dữ liệu
[Link](assign_labels(X, centers[-1]))
# Update lại center sau bước phân cụm
new_center = update_centers(X, labels[-1], K)
# Nếu center với và cũ giống nhau sau khi update thì dừng
if has_converged(centers[-1], new_center):
break
# Nếu center mới và cũ khác nhau cập nhập lại theo center mới
[Link](new_center)
n_iter += 1 # Số vòng lặp cộng thêm 1
# Trả về giá trị của center và label thu được ở vòng lặp cuối
return centers[-1], labels[-1], n_iter
means = [[2, 2], [8, 3], [3, 6]]
cov = [[1, 0], [0, 1]]
N = 500
X0 = [Link].multivariate_normal(means[0], cov, N)
X1 = [Link].multivariate_normal(means[1], cov, N)
X2 = [Link].multivariate_normal(means[2], cov, N)

[Link]
Question 1: Implementing DBSCAN
These exercises walk you through a Python implementation of an algorithm for clustering
called DBSCAN, which is short for density-based spatial clustering for applications with noise.
It addresses a limitation of k-means clustering, as described below.
Although there are existing implementations for Python (e.g., see here), in this notebook we
are asking you to build it from scratch, albeit using a lot of scaffolding that we have provided.

19
Setup
Here are the modules you will need for this problem.

from [Link] import display

import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
import [Link] as plt

# Adapted from: [Link]

def make_crater (inner_rad=4, outer_rad=4.5, donut_len=2, inner_pts=1000, outer_pts=500, label=False):
# Make the inner core
radius_core = inner_rad*[Link] (inner_pts)
direction_core = 2*[Link]*[Link] (size=inner_pts)

# Simulate inner core points

core_x = radius_core*[Link] (direction_core)
core_y = radius_core*[Link] (direction_core)
crater_core = [Link] ({'x_1': core_x, 'x_2': core_y})
if label: crater_core['label'] = 0

# Make the outer ring

radius_ring = outer_rad + donut_len*[Link](outer_pts)
direction_ring = 2*[Link]*[Link](size = outer_pts)

# Simulate ring points

ring_x = radius_ring*[Link](direction_ring)
ring_y = radius_ring*[Link](direction_ring)
crater_ring = [Link] ({'x_1': ring_x, 'x_2': ring_y})
if label: crater_ring['label'] = 1

return [Link] ([crater_core, crater_ring])

def make_scatter_plot (df, x="x_1", y="x_2", hue="label",

palette={0: "red", 1: "olive", 2: "blue", 3: "green"},
size=5,
centers=None):
if (hue is not None) and (hue in [Link]):
[Link] (x=x, y=y, hue=hue, data=df, palette=palette,
fit_reg=False)
else:

20
[Link] (x=x, y=y, data=df, fit_reg=False)

if centers is not None:

[Link] (centers[:,0], centers[:,1],
marker=u'*', s=500,
c=[palette[0], palette[1]])

def make_scatter_plot2 (df, x="x_1", y="x_2", hue="label", size=5):

if (hue is not None) and (hue in [Link]):
[Link] (x=x, y=y, hue=hue, data=df,
fit_reg=False)
else:
[Link] (x=x, y=y, data=df, fit_reg=False)

Loading the data

We will work work with a synthetic data set that is an especially bad case for k-
means. It's sometimes called the "crater" data because of its shape. The data points
are stored in a file named [Link].
The data files you'll need should be provided in this repository. To load them, you'll
need to wrap the filenames using the function fn(f), defined below. If you are
working locally, you can obtain a copy at the following URL and may need to
modify fn(f) accordingly: [Link]

def fn(fn_base, dirname='./dbscan/'): # `dirname` set by default to its location in our repository
return '{}{}'.format(dirname, fn_base)

# Demo:
fn('[Link]')

Exercise 0 (1 point). Start by reading the data into a Pandas data frame. The data is
stored locally within this assignment in a file called [Link]. Store your data
frame in a variable named crater.

[]

###
### YOUR CODE HERE
###

display ([Link] (3))

print ("...")
display ([Link] (3))

21
[]

assert len (crater) == 1500

assert set ([Link]) == set (['x_1', 'x_2', 'kmeans_label'])

with open (fn('crater_counts.txt'), 'rt') as fp:

true_counts = [int (c) for c in [Link] ().split (',')]
assert sum (crater['kmeans_label'] == 0) == true_counts[0]
assert sum (crater['kmeans_label'] == 1) == true_counts[1]

make_scatter_plot (crater, hue='kmeans_label')

print ("\n(Passed!)")

The testing code plots these data points, which are 2-D. The colors show the clusters
computed by k-means for k=2. Notice that the "natural" structure is, arguably, a dense ball in
the middle and a ring (donut) on the outside. However, k-means instead split the points
about an arbitrary line that cuts through the middle of the points.
Indeed, this fact is one of the limitations of k-means: it works well when you know the value
of k and the k clusters come from Gaussian distributions of similar shape and size (and,
therefore, density). However, if you don't know k or there is non-uniform shape and density
among the clusters---or some other grouping, as above---then k-means does not work well
(qualitatively).

Elements of DBSCAN
The DBSCAN algorithm takes a different approach. Rather than having to provide the number
of clusters, k, you define parameters related to neighborhoods and target density. Let's see
how DBSCAN works by building it from the ground up.

Neighborhoods

The first important concept in DBSCAN is that of an ϵ-neighborhood.

Consider any point p. The ϵ-neighborhood of p is the set of all points within a
distance of ϵ from p. That is, if {x^0,x^1,…,x^m−1} is a collection of m data points,
then the ϵ-neighborhood centered at p is
Nϵ(p)={x^i:∥x^i−p∥2≤ϵ},
where the measure of distance is Euclidean (i.e., the two-norm). Notice that this
definition would include the point p if p is one of the data points.

22
Exercise 1 (1 point). Implement a function that computes the ϵ-neighborhood of p for
a data matrix of points, X, defined by our usual convention as
X=⎛⎝⎜⎜⎜⎜⎜x^T0x^T1⋮x^Tm−1⎞⎠⎟⎟⎟⎟⎟.
In particular, complete the function named region_query(p, eps, X) below. Its
inputs are:

• p[:d]: The query point, of dimension d.

• eps: The size of the ball around p to search.
• X[:m, :d]: The set of points, i.e., data matrix.

It should return a boolean Numpy array adj[:m] with one entry per point (i.e., per row
of X). The entry adj[i] should be True only if X[i, :] lies within the eps-sized ball
centered at p.
Hint: There is a one-line solution of the form, return (boolean array
expression).

[]

def region_query (p, eps, X):

# These lines check that the inputs `p` and `X` have
# the right shape.
_, dim = [Link]
assert ([Link] == (dim,)) or ([Link] == (1, dim)) or ([Link] == (dim, 1))

###
### YOUR CODE HERE
###

Here is the test code for region_query(). In addition to sanity-checking your

solution, it plots the original points, a query point (marked by a red star), and
highlights all points in an ϵ-neighborhood computed by your function so you can
visually verify the result. (In this test, p=(−0.5,1.2) and ϵ=1.0.)

[]

X = crater[['x_1', 'x_2']].values
p = [Link] ([-0.5, 1.2])
in_region = region_query (p, 1.0, X)

crater_ball = [Link] ()
crater_ball['label'] = in_region
make_scatter_plot (crater_ball, centers=p[[Link]])

with open (fn('region_query_soln.txt'), 'rt') as fp:

23
assert int ([Link] ()) == sum (in_region)

print ("\n(Passed!)")

Exercise 2 (1 point). Suppose you are given a vector y[:] of boolean

(True and False) values, such as the one computed above. Write a function
named index_set(y) that returns the index locations of all
of y's True elements. Your function must return these index values as a Python set.

[]

def index_set (y):

"""
Given a boolean vector, this function returns
the indices of all True elements.
"""
assert len ([Link]) == 1

###
### YOUR CODE HERE
###

[]

y_test = [Link] ([True, False, False, True, False, True, True, True, False])
i_soln = set ([0, 3, 5, 6, 7])

i_test = index_set (y_test)

assert type (i_test) is set
assert len (i_test) == len (i_soln)
assert i_test == i_soln

print ("\n(Passed!)")

Exercise 3 (1 point). Given a value for ϵ and a data matrix X of points, complete the
function below so that it determines the neighborhood of each point.
Your function,
def find_neighbors(eps, X[:m, :]):
...

should return a Python list neighbors[:m] such that neighbors[i] is the index set
of neighbors of point X[i, :].

24
[]

def find_neighbors (eps, X):

m, d = [Link]
neighbors = [] # Empty list to start
###
### YOUR CODE HERE
###
assert len (neighbors) == m
return neighbors

[]

with open (fn('find_neighbors_soln.csv'), 'rt') as fp:

neighbors = find_neighbors (0.25, X)
for i, n_i in enumerate (neighbors):
j_, c_j_ = [Link] ().split (',')
assert (i == int (j_)) and (len (n_i) == int (c_j_))

print ("\n(Passed!)")

Density
The next important concept in DBSCAN is that of the density of a neighborhood.
Intuitively, the DBSCAN algorithm will try to "grow" clusters around points whose
neighborhoods are sufficiently dense.
Let's make this idea more precise.
Definition: core points. A point p is a core point if its ϵ-neighborhood has at
least s points.
In other words, the algorithm now has two user-defined parameters: the
neighborhood size, ϵ, and the minimum density, specified using a threshold s on the
number of points in such a neighborhood.

Exercise 4 (2 points). Complete the function, find_core_points(s, neighbors),

below. It takes as input a minimum-points threshold, s, and a list of point
neighborhoods, neighbors[:], such that neighbors[i] is the (index) set of
neighbors of point i. It should return a Python set, core_set, such that an index j in
core_set only if the size of the neighborhood at j is at least s.

[]

def find_core_points (s, neighbors):

assert type (neighbors) is list
assert all ([type (n) is set for n in neighbors])

25
core_set = set ()
###
### YOUR CODE HERE
###
return core_set

[]

core_set = find_core_points (5, neighbors)

print ("Found {} core points.".format (len (core_set)))

def plot_core_set (df, core_set):

df_labeled = [Link] ()
df_labeled['label'] = False
df_labeled.loc[list (core_set), 'label'] = True
make_scatter_plot (df_labeled)

plot_core_set (crater, core_set)

with open (fn('find_core_points_soln.txt'), 'rt') as fp:

core_set_soln = set ([int (i) for i in [Link] ().split ()])
assert core_set == core_set_soln

print ("\n(Passed!)")

Growing clusters via "reachable" points

The last concept needed for DBSCAN is the idea of growing a cluster around a core
point. It depends on the notion of reachability.
Definition: Reachability. A point q is reachable from another point p if there exists a
sequence of points p=p1,p2,…,pk=q such that every pi is a core point, possibly
except for pk=q, and pi∈Nϵ(pi−1) for all 1<i<k.
This procedure works as follows.
"Expand Cluster" procedure:

1. Consider any point p that is not yet assigned to a cluster.

2. If p is a core point, then start a new cluster for it.
3. Maintain a "reachable" set, which will be used to hold points that are reachable
from p as they are encountered. Initially, the reachable points are just p's ϵ-
neighbors.
4. Remove any point q from the reachable set.
5. If q has not yet been visited, then mark it as being visited.

26
6. If q is also a core point, then add all of its neighbors to the reachable set, per the
definition of "reachability" above.
7. If q is not yet assigned to any cluster, then add it to p's cluster.

Notice how this procedure explores the points reachable from p (Step 6). Intuitively, it
is trying to join all neighborhoods whose core points are mutually contained.

Here is a brief illustration of these concepts:

In this picture, suppose the minimum density parameter is s=3 points. Thus, only
the ϵ-neighborhoods centered at 1, 3, and 6 are core points, since these are the only
points that include at least s=3 points. For instance, Nϵ(1)={0,1,3,7}, making it a
core point since its neighborhood has four (4) points, whereas Nϵ(4)={3,4} is not a
core point since its neighborhood has just two (2) points.

Exercise 5 (4 points). Implement the "cluster growth" procedure described above in

the function, expand_cluster(), below.
To simplify your task, we will give you all the lines of code that you need. However,
you need to figure out in what order these lines must execute, as well as how to
indent them!
The signature of the function is:
def expand_cluster (p, neighbors, core_set, visited, assignment):
...

Its parameters are:

• p is the index of a starting core point. The caller must guarantee that it is indeed a
core point, and furthermore, that it has been assigned to some cluster. (See below.)

27
• neighbors[:] is a list of ϵ-neighborhoods, given as Python sets. For
instance, neighbors[p] is a set of indices of all points in the neighborhood of p. It will
have been computed from find_neighbors() above.
• core_set is a Python set containing the indices of all core points. That is, the
expression, i in core_set, is true only if i is indeed a core point.
• visited is another Python set containing the indices of all points that have already
been visited. That is, the expression i in visited should be True only if i has been
visited. Thus, your expand_cluster() function should update this set when visiting
any previously unvisited point.
• assignment is a Python dictionary. The key is the index of a point; the value is the
cluster label to which that point has been assigned. Consequently, if a point i does
not yet have any cluster assignment, then the expression, i in assignment, will
be False. Your expand_cluster() function should update cluster assignments by
updating this dictionary.

The skeleton of expand_cluster() does everything up to and including Step 4 of the

procedure above. It first initializes the reachable set as a Python set, reachable,
containing the neighbors of p. It then removes one of those reachable points, storing
it in q. You just need to perform steps 5-7. In fact, we will even give you all of the lines
of code that you need! But you have to to incorporate them into the skeleton, ordered
and indented correctly.
assignment[q] = assignment[p]
if q in core_set:
if q not in assignment:
if q not in visited:
reachable |= neighbors[q]
[Link] (q)

[]

def expand_cluster (p, neighbors, core_set, visited, assignment):

# Assume the caller performs Steps 1 and 2 of the procedure.
# That means 'p' must be a core point that is part of a cluster.
assert (p in core_set) and (p in visited) and (p in assignment)

reachable = set (neighbors[p]) # Step 3

while reachable:
q = [Link] () # Step 4

# Put your reordered and correctly indented statements here:

###
### YOUR CODE HERE
###

# This procedure does not return anything

# except via updates to `visited` and
# `assignment`.

28
[]

# This test is based on the illustration above.

p_test = 3
neighbors_test = [set ([0, 1]),
set ([0, 1, 3, 7]),
set ([2, 3]),
set ([1, 2, 3, 4, 6]),
set ([3, 4]),
set ([5]),
set ([3, 6, 7]),
set ([1, 7])
]
core_set_test = set ([1, 3, 6])
visited_test = set ([p_test])
assignment_test = {p_test: 0}
expand_cluster (p_test, neighbors_test, core_set_test,
visited_test, assignment_test)
assert visited_test == set ([0, 1, 2, 3, 4, 6, 7]) # All but 5
assert set (assignment_test.keys ()) == visited_test
assert set (assignment_test.values ()) == set ([0])

print ("\n(Passed!)")

Putting it all together

If you've successfully completed all steps above, then you have everything you need
to run the final DBSCAN algorithm, which we've provided below. The second code cell
below shows a picture of the clusters discovered for a particular setting of
neighborhood size, ϵ, and density threshold, s.
And there is no additional code for you to write below! However, you should make
sure the remaining cells execute without error.

[]

def dbscan (eps, s, X):

clusters = []
point_to_cluster = {}

neighbors = find_neighbors (eps, X)

core_set = find_core_points (s, neighbors)

assignment = {}
next_cluster_id = 0

29
visited = set ()
for i in core_set: # for each core point i
if i not in visited:
[Link] (i) # Mark i as visited
assignment[i] = next_cluster_id
expand_cluster (i, neighbors, core_set,
visited, assignment)
next_cluster_id += 1

return assignment, core_set

[]

assignment, core_set = dbscan (0.5, 10, X)

print ("Number of core points:", len (core_set))

print ("Number of clusters:", max ([Link] ()))
print ("Number of unclassified points:", len (X) - len (assignment))

def plot_labels (df, labels):

df_labeled = [Link] ()
df_labeled['label'] = labels
make_scatter_plot2 (df_labeled)

labels = [-1] * len (X)

for i, c in [Link] ():
labels[i] = c
plot_labels (crater, labels)

with open (fn('dbscan_soln.csv'), 'rt') as fp:

num_cores, num_clusters, num_outliers = [Link] ().split (',')
assert int (num_cores) == len (core_set)
assert int (num_clusters) == max ([Link] ())
assert int (num_outliers) == (len (X) - len (assignment))
print ("\n(Passed!)")

Fin! If you've reached this point and all tests above pass, you are ready to submit your
solution to this problem. Don't forget to save you work prior to submitting.

Phát hiện bất thường dùng hỗn hợp Gauss

LAB 10 – Ensemble Learning and Random Forests

Bộ phân loại Biểu quyết

30
Bagging và Pasting
Rừng Ngẫu nhiên

LAB 11 – Dimensionality Reduction

PCA, PCA hạt nhân
1. Load the MNIST dataset and split it into a training set
and a test set (take the first 60,000 instances for training, and the remaining
10,000 for testing). Train a Random Forest classifier on the dataset and time how
long it takes, then evaluate the resulting model on the test set. Next, use PCA to
reduce the dataset’s dimensionality, with an explained variance ratio of 95%.
Train a new Random Forest classifier on the reduced dataset and see how long it
takes. Was training much faster? Next evaluate the classifier on the test set: how
does it compare to the previous classifier?

Solution
Exercise: Load the MNIST dataset (introduced in chapter 3) and split it into a training set and a test set
(take the first 60,000 instances for training, and the remaining 10,000 for testing).
The MNIST dataset was loaded earlier.
In [73]:
X_train = mnist['data'][:60000]
y_train = mnist['target'][:60000]

X_test = mnist['data'][60000:]
y_test = mnist['target'][60000:]
Exercise: Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the
resulting model on the test set.
In [74]:
from [Link] import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)

In [75]:
import time

t0 = [Link]()
rnd_clf.fit(X_train, y_train)
t1 = [Link]()
In [76]:
print("Training took {:.2f}s".format(t1 - t0))
Training took 35.27s
In [77]:
from [Link] import accuracy_score

y_pred = rnd_clf.predict(X_test)
accuracy_score(y_test, y_pred)
Out[77]:
0.9705
Exercise: Next, use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%.

31
In [78]:
from [Link] import PCA

pca = PCA(n_components=0.95)
X_train_reduced = pca.fit_transform(X_train)
Exercise: Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was
training much faster?
In [79]:
rnd_clf2 = RandomForestClassifier(n_estimators=100, random_state=42)
t0 = [Link]()
rnd_clf2.fit(X_train_reduced, y_train)
t1 = [Link]()
In [80]:
print("Training took {:.2f}s".format(t1 - t0))
Training took 81.03s

X_test_reduced = [Link](X_test)

y_pred = rnd_clf2.predict(X_test_reduced)
accuracy_score(y_test, y_pred)
Out[81]:
0.9481
In [82]:
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression(multi_class="multinomial", solver="lbfgs",

random_state=42)
t0 = [Link]()
log_clf.fit(X_train, y_train)
t1 = [Link]()

print("Training took {:.2f}s".format(t1 - t0))

Training took 6.94s
Nice! Reducing dimensionality led to over 2× speedup. :) Let's check the model's accuracy:
In [87]:
y_pred = log_clf2.predict(X_test_reduced)
accuracy_score(y_test, y_pred)
Out[87]:
0.9201
A very slight drop in performance, which might be a reasonable price to pay for a 2× speedup, depending on
the application.
So there you have it: PCA can give you a formidable speedup... but not always!

2. Use t-SNE to reduce the MNIST dataset down to two dimensions and plot the
result using Matplotlib. You can use a scatterplot using 10 different colors to rep‐
resent each image’s target class. Alternatively, you can write colored digits at the
location of each instance, or even plot scaled-down versions of the digit images
themselves (if you plot all digits, the visualization will be too cluttered, so you
32
should either draw a random sample or plot an instance only if no other instance
has already been plotted at a close distance). You should get a nice visualization
with well-separated clusters of digits. Try using other dimensionality reduction
algorithms such as PCA, LLE, or MDS and compare the resulting visualizations.

Solution

[Link](42)

m = 10000
idx = [Link](60000)[:m]

X = mnist['data'][idx]
y = mnist['target'][idx]

from [Link] import TSNE

tsne = TSNE(n_components=2, random_state=42)

X_reduced = tsne.fit_transform(X)
Now let's use Matplotlib's scatter() function to plot a scatterplot, using a different color for each digit:
In [90]:
[Link](figsize=(13,10))
[Link](X_reduced[:, 0], X_reduced[:, 1], c=y, cmap="jet")
[Link]('off')
[Link]()
[Link]()

[Link](figsize=(9,9))
cmap = [Link].get_cmap("jet")
for digit in (2, 3, 5):
[Link](X_reduced[y == digit, 0], X_reduced[y == digit, 1],
c=[cmap(digit / 9)])
[Link]('off')
[Link]()

idx = (y == 2) | (y == 3) | (y == 5)
X_subset = X[idx]
y_subset = y[idx]

tsne_subset = TSNE(n_components=2, random_state=42)

X_subset_reduced = tsne_subset.fit_transform(X_subset)
In [93]:
[Link](figsize=(9,9))
for digit in (2, 3, 5):
[Link](X_subset_reduced[y_subset == digit, 0],
X_subset_reduced[y_subset == digit, 1], c=[cmap(digit / 9)])
[Link]('off')
[Link]()

33
from [Link] import MinMaxScaler
from [Link] import AnnotationBbox, OffsetImage

def plot_digits(X, y, min_distance=0.05, images=None, figsize=(13, 10)):

# Let's scale the input features so that they range from 0 to 1
X_normalized = MinMaxScaler().fit_transform(X)
# Now we create the list of coordinates of the digits plotted so far.
# We pretend that one is already plotted far away at the start, to
# avoid `if` statements in the loop below
neighbors = [Link]([[10., 10.]])
# The rest should be self-explanatory
[Link](figsize=figsize)
cmap = [Link].get_cmap("jet")
digits = [Link](y)
for digit in digits:
[Link](X_normalized[y == digit, 0], X_normalized[y == digit,
1], c=[cmap(digit / 9)])
[Link]("off")
ax = [Link]().gca() # get current axes in current figure
for index, image_coord in enumerate(X_normalized):
closest_distance = [Link](neighbors - image_coord,
axis=1).min()
if closest_distance > min_distance:
neighbors = np.r_[neighbors, [image_coord]]
if images is None:
[Link](image_coord[0], image_coord[1],
str(int(y[index])),
color=cmap(y[index] / 9), fontdict={"weight":
"bold", "size": 16})
else:
image = images[index].reshape(28, 28)
imagebox = AnnotationBbox(OffsetImage(image,
cmap="binary"), image_coord)
ax.add_artist(imagebox)

plot_digits(X_reduced, y)

plot_digits(X_reduced, y, images=X, figsize=(35, 25))

plot_digits(X_subset_reduced, y_subset, images=X_subset, figsize=(22, 22))

Let's start with PCA. We will also time how long it takes:
In [98]:
from [Link] import PCA
import time

t0 = [Link]()
X_pca_reduced = PCA(n_components=2, random_state=42).fit_transform(X)
t1 = [Link]()
print("PCA took {:.1f}s.".format(t1 - t0))
plot_digits(X_pca_reduced, y)

34
[Link]()

from [Link] import LocallyLinearEmbedding

t0 = [Link]()
X_lle_reduced = LocallyLinearEmbedding(n_components=2,
random_state=42).fit_transform(X)
t1 = [Link]()
print("LLE took {:.1f}s.".format(t1 - t0))
plot_digits(X_lle_reduced, y)
[Link]()

from [Link] import Pipeline

pca_lle = Pipeline([
("pca", PCA(n_components=0.95, random_state=42)),
("lle", LocallyLinearEmbedding(n_components=2, random_state=42)),
])
t0 = [Link]()
X_pca_lle_reduced = pca_lle.fit_transform(X)
t1 = [Link]()
print("PCA+LLE took {:.1f}s.".format(t1 - t0))
plot_digits(X_pca_lle_reduced, y)
[Link]()

from [Link] import MDS

m = 2000
t0 = [Link]()
X_mds_reduced = MDS(n_components=2, random_state=42).fit_transform(X[:m])
t1 = [Link]()
print("MDS took {:.1f}s (on just 2,000 MNIST images instead of
10,000).".format(t1 - t0))
plot_digits(X_mds_reduced, y[:m])
[Link]()

35
LAB 12 – Neural Networks and Deep Learning
Perceptron learning algorithm

LAB 13 – Neural Networks and Deep Learning (cont)

Multilayer neural network

36
LAB 14 – Recommender Systems
Question 1: Reading Movie-Lens data
Solution:
# Import libraries
import numpy as np
import pandas as pd

# Reading ratings file

ratings = pd.read_csv(path + '[Link]', sep='\t', encoding='latin-1',
usecols=['user_id', 'movie_id', 'rating', 'timestamp'])

# Reading users file

users = pd.read_csv(path + '[Link]', sep='\t', encoding='latin-1',
usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])

# Reading movies file

movies = pd.read_csv(path + '[Link]', sep='\t', encoding='latin-1',
usecols=['movie_id', 'title', 'genres'])
Let's take a look at the movies and ratings dataframes.
In [4]:
[Link]()
Out[4]:

movie_id title genres

0 1 Toy Story (1995) Animation|Children's|Comedy

1 2 Jumanji (1995) Adventure|Children's|Fantasy

2 3 Grumpier Old Men (1995) Comedy|Romance

3 4 Waiting to Exhale (1995) Comedy|Drama

4 5 Father of the Bride Part II (1995) Comedy

In [5]:
[Link]()
Out[5]:

user_id movie_id rating timestamp

37
user_id movie_id rating timestamp

0 1 1193 5 978300760

1 1 661 3 978302109

2 1 914 3 978301968

3 1 3408 4 978300275

4 1 2355 5 978824291
Also let's count the number of unique users and movies.
In [6]:
n_users = ratings.user_id.unique().shape[0]
n_movies = ratings.movie_id.unique().shape[0]
print 'Number of users = ' + str(n_users) + ' | Number of movies = ' +
str(n_movies)
Number of users = 6040 | Number of movies = 3706
Now I want the format of my ratings matrix to be one row per user and one column per movie. To
do so, I'll pivot ratings to get that and call the new variable Ratings (with a capital *R).
In [7]:
Ratings = [Link](index = 'user_id', columns ='movie_id', values =
'rating').fillna(0)
[Link]()
Out[7]:

.
movi 1 39 39 39 39 39 39 39 39 39 39
1 2 3 4 5 6 7 8 9 .
e_id 0 43 44 45 46 47 48 49 50 51 52
.

user_
id

.
5. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
1 .
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.

.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
2 .
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.

.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
3 .
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.

38
.
movi 1 39 39 39 39 39 39 39 39 39 39
1 2 3 4 5 6 7 8 9 .
e_id 0 43 44 45 46 47 48 49 50 51 52
.

user_
id

.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
4 .
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.

.
0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
5 .
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.

5 rows × 3706 columns

Last but not least, I need to de-normalize the data (normalize by each users mean) and convert it
from a dataframe to a numpy array.
In [8]:
R = Ratings.as_matrix()
user_ratings_mean = [Link](R, axis = 1)
Ratings_demeaned = R - user_ratings_mean.reshape(-1, 1)
/usr/local/lib/python2.7/dist-packages/ipykernel_launcher.py:1:
FutureWarning: Method .as_matrix will be removed in a future version.
Use .values instead.
"""Entry point for launching an IPython kernel.

Hệ khuyến nghị dựa trên nội dung

Build two Content Based Recommenders based on:

• Movie Overviews and Taglines

• Movie Cast, Crew, Keywords and Genre
Also, as mentioned in the introduction, I will be using a subset of all the movies available to us due
to limiting computing power available to me.

In [15]:
links_small = pd.read_csv('../input/links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')
In [16]:
md = [Link]([19730, 29503, 35587])
In [17]:
#Check EDA Notebook for how and why I got these indices.
md['id'] = md['id'].astype('int')
In [18]:
39
smd = md[md['id'].isin(links_small)]
[Link]
Out[18]:
(9099, 25)
We have 9099 movies avaiable in our small movies metadata dataset which is 5 times smaller
than our original dataset of 45000 movies.

Movie Description Based Recommender

Let us first try to build a recommender using movie descriptions and taglines. We do not have a
quantitative metric to judge our machine's performance so this will have to be done qualitatively.

smd['tagline'] = smd['tagline'].fillna('')
smd['description'] = smd['overview'] + smd['tagline']
smd['description'] = smd['description'].fillna('')
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0,
stop_words='english')
tfidf_matrix = tf.fit_transform(smd['description'])
tfidf_matrix.shape
Out[21]:
(9099, 268124)
Cosine Similarity
I will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity
between two movies. Mathematically, it is defined as follows:

cosine(x,y)=x.y⊺||x||.||y||cosine(x,y)=x.y⊺||x||.||y||
Since we have used the TF-IDF Vectorizer, calculating the Dot Product will directly give us the
Cosine Similarity Score. Therefore, we will use sklearn's linear_kernel instead of
cosine_similarities since it is much faster.

In [22]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
In [23]:
cosine_sim[0]
Out[23]:
array([ 1. , 0.00680476, 0. , ..., 0. ,
0.00344913, 0. ])
We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is
to write a function that returns the 30 most similar movies based on the cosine similarity score.

In [24]:
smd = smd.reset_index()
titles = smd['title']
indices = [Link]([Link], index=smd['title'])
In [25]:
def get_recommendations(title):
idx = indices[title]
sim_scores = list(enumerate(cosine_sim[idx]))
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
sim_scores = sim_scores[1:31]
40
movie_indices = [i[0] for i in sim_scores]
return [Link][movie_indices]
We're all set. Let us now try and get the top recommendations for a few movies and see how
good the recommendations are.

In [26]:
get_recommendations('The Godfather').head(10)
Out[26]:
973 The Godfather: Part II
8387 The Family
3509 Made
4196 Johnny Dangerously
29 Shanghai Triad
5667 Fury
2412 American Movie
1582 The Godfather: Part III
4221 8 Women
2159 Summer of Sam
Name: title, dtype: object
In [27]:
get_recommendations('The Dark Knight').head(10)
Out[27]:
7931 The Dark Knight Rises
132 Batman Forever
1113 Batman Returns
8227 Batman: The Dark Knight Returns, Part 2
7565 Batman: Under the Red Hood
524 Batman
7901 Batman: Year One
2579 Batman: Mask of the Phantasm
2696 JFK
8165 Batman: The Dark Knight Returns, Part 1
Name: title, dtype: object

Hệ khuyến nghị dựa trên lọc cộng tác

Introduction

This example demonstrates Collaborative filtering using the Movielens dataset to

recommend movies to users. The MovieLens ratings dataset lists the ratings given by
a set of users to a set of movies. Our goal is to be able to predict ratings for movies a
user has not yet watched. The movies with the highest predicted ratings can then be
recommended to the user.

The steps in the model are as follows:

1. Map user ID to a "user vector" via an embedding matrix

2. Map movie ID to a "movie vector" via an embedding matrix
41
3. Compute the dot product between the user vector and movie vector, to obtain
the a match score between the user and the movie (predicted rating).
4. Train the embeddings via gradient descent using all known user-movie pairs.

import pandas as pd
import numpy as np
from zipfile import ZipFile
import tensorflow as tf
from tensorflow import keras
from [Link] import layers
from pathlib import Path
import [Link] as plt

First, load the data and apply preprocessing

# Download the actual data from [Link]
[Link]"
# Use the [Link] file
movielens_data_file_url = (
"[Link]
)
movielens_zipped_file = [Link].get_file(
"[Link]", movielens_data_file_url, extract=False
)
keras_datasets_path = Path(movielens_zipped_file).parents[0]
movielens_dir = keras_datasets_path / "ml-latest-small"

# Only extract the data the first time the script is run.
if not movielens_dir.exists():
with ZipFile(movielens_zipped_file, "r") as zip:
# Extract files
print("Extracting all the files now...")
[Link](path=keras_datasets_path)
print("Done!")

ratings_file = movielens_dir / "[Link]"

df = pd.read_csv(ratings_file)
First, need to perform some preprocessing to encode users and movies as integer
indices.

user_ids = df["userId"].unique().tolist()
user2user_encoded = {x: i for i, x in enumerate(user_ids)}
userencoded2user = {i: x for i, x in enumerate(user_ids)}
movie_ids = df["movieId"].unique().tolist()
movie2movie_encoded = {x: i for i, x in enumerate(movie_ids)}
movie_encoded2movie = {i: x for i, x in enumerate(movie_ids)}
df["user"] = df["userId"].map(user2user_encoded)
df["movie"] = df["movieId"].map(movie2movie_encoded)

num_users = len(user2user_encoded)
num_movies = len(movie_encoded2movie)
df["rating"] = df["rating"].[Link](np.float32)
# min and max ratings will be used to normalize the ratings later
min_rating = min(df["rating"])
max_rating = max(df["rating"])
42
print(
"Number of users: {}, Number of Movies: {}, Min rating: {}, Max rating:
{}".format(
num_users, num_movies, min_rating, max_rating
)
)
Number of users: 610, Number of Movies: 9724, Min rating: 0.5, Max rating: 5.0

Prepare training and validation data

df = [Link](frac=1, random_state=42)
x = df[["user", "movie"]].values
# Normalize the targets between 0 and 1. Makes it easy to train.
y = df["rating"].apply(lambda x: (x - min_rating) / (max_rating -
min_rating)).values
# Assuming training on 90% of the data and validating on 10%.
train_indices = int(0.9 * [Link][0])
x_train, x_val, y_train, y_val = (
x[:train_indices],
x[train_indices:],
y[:train_indices],
y[train_indices:],
)

Create the model

We embed both users and movies in to 50-dimensional vectors.

The model computes a match score between user and movie embeddings via a dot
product, and adds a per-movie and per-user bias. The match score is scaled to
the [0, 1] interval via a sigmoid (since our ratings are normalized to this range).

EMBEDDING_SIZE = 50

class RecommenderNet([Link]):
def __init__(self, num_users, num_movies, embedding_size, **kwargs):
super(RecommenderNet, self).__init__(**kwargs)
self.num_users = num_users
self.num_movies = num_movies
self.embedding_size = embedding_size
self.user_embedding = [Link](
num_users,
embedding_size,
embeddings_initializer="he_normal",
embeddings_regularizer=[Link].l2(1e-6),
)
self.user_bias = [Link](num_users, 1)
self.movie_embedding = [Link](
num_movies,
embedding_size,
embeddings_initializer="he_normal",
embeddings_regularizer=[Link].l2(1e-6),

43
)
self.movie_bias = [Link](num_movies, 1)

def call(self, inputs):

user_vector = self.user_embedding(inputs[:, 0])
user_bias = self.user_bias(inputs[:, 0])
movie_vector = self.movie_embedding(inputs[:, 1])
movie_bias = self.movie_bias(inputs[:, 1])
dot_user_movie = [Link](user_vector, movie_vector, 2)
# Add all the components (including bias)
x = dot_user_movie + user_bias + movie_bias
# The sigmoid activation forces the rating to between 0 and 1
return [Link](x)

model = RecommenderNet(num_users, num_movies, EMBEDDING_SIZE)

[Link](
loss=[Link](),
optimizer=[Link](learning_rate=0.001),
)

Train the model based on the data split

history = [Link](
x=x_train,
y=y_train,
batch_size=64,
epochs=5,
verbose=1,
validation_data=(x_val, y_val),
)
Epoch 1/5
1418/1418 [==============================] - 9s 6ms/step - loss: 0.6358 -
val_loss: 0.6200
Epoch 2/5
1418/1418 [==============================] - 11s 8ms/step - loss: 0.6134 -
val_loss: 0.6173
Epoch 3/5
1418/1418 [==============================] - 14s 10ms/step - loss: 0.6082 -
val_loss: 0.6156
Epoch 4/5
1418/1418 [==============================] - 15s 10ms/step - loss: 0.6072 -
val_loss: 0.6144
Epoch 5/5
1418/1418 [==============================] - 13s 9ms/step - loss: 0.6074 -
val_loss: 0.6147

Plot training and validation loss

[Link]([Link]["loss"])
[Link]([Link]["val_loss"])
[Link]("model loss")
[Link]("loss")
[Link]("epoch")
[Link](["train", "test"], loc="upper left")
[Link]()

44
Show top 10 movie recommendations to a user
movie_df = pd.read_csv(movielens_dir / "[Link]")

# Let us get a user and see the top recommendations.

user_id = [Link](1).iloc[0]
movies_watched_by_user = df[[Link] == user_id]
movies_not_watched = movie_df[
~movie_df["movieId"].isin(movies_watched_by_user.[Link])
]["movieId"]
movies_not_watched = list(
set(movies_not_watched).intersection(set(movie2movie_encoded.keys()))
)
movies_not_watched = [[movie2movie_encoded.get(x)] for x in movies_not_watched]
user_encoder = user2user_encoded.get(user_id)
user_movie_array = [Link](
([[user_encoder]] * len(movies_not_watched), movies_not_watched)
)
ratings = [Link](user_movie_array).flatten()
top_ratings_indices = [Link]()[-10:][::-1]
recommended_movie_ids = [
movie_encoded2movie.get(movies_not_watched[x][0]) for x in top_ratings_indices
]

print("Showing recommendations for user: {}".format(user_id))

print("====" * 9)
print("Movies with high ratings from user")
print("----" * 8)
top_movies_user = (
movies_watched_by_user.sort_values(by="rating", ascending=False)
.head(5)
.[Link]
)
movie_df_rows = movie_df[movie_df["movieId"].isin(top_movies_user)]
for row in movie_df_rows.itertuples():
print([Link], ":", [Link])

print("----" * 8)
45
print("Top 10 movie recommendations")
print("----" * 8)
recommended_movies = movie_df[movie_df["movieId"].isin(recommended_movie_ids)]
for row in recommended_movies.itertuples():
print([Link], ":", [Link])
302/302 [==============================] - 0s 800us/step
Showing recommendations for user: 213
====================================
Movies with high ratings from user
--------------------------------
Terminator 2: Judgment Day (1991) : Action|Sci-Fi
Rocky (1976) : Drama
Big Fish (2003) : Drama|Fantasy|Romance
Shrek 2 (2004) : Adventure|Animation|Children|Comedy|Musical|Romance
13 Assassins (Jûsan-nin no shikaku) (2010) : Action
--------------------------------
Top 10 movie recommendations
--------------------------------
Usual Suspects, The (1995) : Crime|Mystery|Thriller
Star Wars: Episode IV - A New Hope (1977) : Action|Adventure|Sci-Fi
Shawshank Redemption, The (1994) : Crime|Drama
Schindler's List (1993) : Drama|War
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964) :
Comedy|War
Godfather: Part II, The (1974) : Crime|Drama
Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001) : Comedy|Romance
28 Days Later (2002) : Action|Horror|Sci-Fi
Little Miss Sunshine (2006) : Adventure|Comedy|Drama
Hurt Locker, The (2008) : Action|Drama|Thriller|War

LAB 15 – Evaluation Labs

===== // =====

Machine Learning Basics and Applications
No ratings yet
Machine Learning Basics and Applications
21 pages
Machine Learning Algorithms Overview
No ratings yet
Machine Learning Algorithms Overview
59 pages
Machine Learning Techniques in PyTorch
100% (1)
Machine Learning Techniques in PyTorch
48 pages
Supervised Machine Learning: Linear Regression
No ratings yet
Supervised Machine Learning: Linear Regression
36 pages
Supervised Learning in Machine Learning
No ratings yet
Supervised Learning in Machine Learning
19 pages
Understanding Machine Learning Basics
No ratings yet
Understanding Machine Learning Basics
127 pages
Supervised Learning & Logistic Regression Guide
No ratings yet
Supervised Learning & Logistic Regression Guide
10 pages
Advanced Neural Network Techniques
No ratings yet
Advanced Neural Network Techniques
16 pages
Introduction To Machine Learning
No ratings yet
Introduction To Machine Learning
22 pages
Machine Learning Fundamentals Explained
No ratings yet
Machine Learning Fundamentals Explained
52 pages
Regression Analysis Techniques Overview
No ratings yet
Regression Analysis Techniques Overview
28 pages
Machine Learning Overview and Techniques
No ratings yet
Machine Learning Overview and Techniques
34 pages
Introduction to Deep Learning Concepts
No ratings yet
Introduction to Deep Learning Concepts
64 pages
Supervised Learning: Linear Regression Guide
No ratings yet
Supervised Learning: Linear Regression Guide
47 pages
SLIntro 14
No ratings yet
SLIntro 14
32 pages
Machine Learning Fundamentals Overview
No ratings yet
Machine Learning Fundamentals Overview
4 pages
Supervised Learning Regression: Course Description
No ratings yet
Supervised Learning Regression: Course Description
2 pages
Questions
No ratings yet
Questions
6 pages
Supervised Learning in Machine Learning
No ratings yet
Supervised Learning in Machine Learning
20 pages
Machine Learning Parameter Estimation Guide
No ratings yet
Machine Learning Parameter Estimation Guide
25 pages
Machine Learning: Model Building Complete Study Notes
No ratings yet
Machine Learning: Model Building Complete Study Notes
18 pages
Machine Learning 7
No ratings yet
Machine Learning 7
28 pages
Advanced Machine Learning Course Outline
No ratings yet
Advanced Machine Learning Course Outline
70 pages
AI & ML Programming Experiments Guide
No ratings yet
AI & ML Programming Experiments Guide
30 pages
Linear Regression with Gradient Descent
No ratings yet
Linear Regression with Gradient Descent
11 pages
04 Exercise Gradient Descent
No ratings yet
04 Exercise Gradient Descent
24 pages
Unit2 - Lecturenotes
No ratings yet
Unit2 - Lecturenotes
33 pages
HCIA-AI Machine Learning Lab Guide
No ratings yet
HCIA-AI Machine Learning Lab Guide
82 pages
Programming Assignment 1
No ratings yet
Programming Assignment 1
12 pages
Implementing Machine Learning Algorithms
No ratings yet
Implementing Machine Learning Algorithms
20 pages
Unit 1
No ratings yet
Unit 1
82 pages
Data Science Engineering Overview
No ratings yet
Data Science Engineering Overview
53 pages
MATLAB ANN for Heat Transfer Prediction
No ratings yet
MATLAB ANN for Heat Transfer Prediction
13 pages
ML Important Topic With Answers
No ratings yet
ML Important Topic With Answers
29 pages
Machine Learning Basics with Python
No ratings yet
Machine Learning Basics with Python
46 pages
Learning Algorithm Anatomy and Gradient Descent
No ratings yet
Learning Algorithm Anatomy and Gradient Descent
43 pages
Understanding Machine Learning Basics
No ratings yet
Understanding Machine Learning Basics
48 pages
Supervised vs Unsupervised Learning Explained
No ratings yet
Supervised vs Unsupervised Learning Explained
11 pages
Optimizing Feedforward Networks: Computational Systems Biology Deep Learning in The Life Sciences
No ratings yet
Optimizing Feedforward Networks: Computational Systems Biology Deep Learning in The Life Sciences
168 pages
Machine Learning Lab Manual for MCA
No ratings yet
Machine Learning Lab Manual for MCA
27 pages
2nd Unit
No ratings yet
2nd Unit
126 pages
Essential Machine Learning Algorithms in Python & R
100% (5)
Essential Machine Learning Algorithms in Python & R
46 pages
Python Libraries for Data Science Experiments
No ratings yet
Python Libraries for Data Science Experiments
24 pages
Machine Learning Assignment No 3
No ratings yet
Machine Learning Assignment No 3
14 pages
Machine Learning Exercises Overview
No ratings yet
Machine Learning Exercises Overview
43 pages
Supervised Machine Learning1 (AutoRecovered)
No ratings yet
Supervised Machine Learning1 (AutoRecovered)
53 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
7 pages
Supervised vs Unsupervised Learning Guide
No ratings yet
Supervised vs Unsupervised Learning Guide
63 pages
Overview of Machine Learning Types
No ratings yet
Overview of Machine Learning Types
8 pages
Common Machine Learning Algorithms Guide
No ratings yet
Common Machine Learning Algorithms Guide
38 pages
60% Design Package Deliverables Guide
No ratings yet
60% Design Package Deliverables Guide
5 pages
Brush Holders and Voltage Regulators
No ratings yet
Brush Holders and Voltage Regulators
709 pages
Wa0023.
No ratings yet
Wa0023.
3 pages
Configuring Security in ION Devices Using ION Setup
No ratings yet
Configuring Security in ION Devices Using ION Setup
10 pages
Structural Civil and Pipe Drafting Testbank
No ratings yet
Structural Civil and Pipe Drafting Testbank
6 pages
Next-Gen Monopiles for Offshore Wind
100% (1)
Next-Gen Monopiles for Offshore Wind
20 pages
Fiberglass FVP Plane Assembly Guide
No ratings yet
Fiberglass FVP Plane Assembly Guide
4 pages
Bank Albilad Transaction Details 2021
No ratings yet
Bank Albilad Transaction Details 2021
1 page
Accessible Interior Design for the Visually Impaired
No ratings yet
Accessible Interior Design for the Visually Impaired
26 pages
316 SS Valve Trim Material Standards
No ratings yet
316 SS Valve Trim Material Standards
2 pages
Grade 9 Computer Servicing Test Results
No ratings yet
Grade 9 Computer Servicing Test Results
27 pages
Essentials of Effective Pentest Reports
No ratings yet
Essentials of Effective Pentest Reports
11 pages
Data Leadership Course Overview at MIT
No ratings yet
Data Leadership Course Overview at MIT
8 pages
M322C Excavator Cab Riser Hydraulic System
No ratings yet
M322C Excavator Cab Riser Hydraulic System
6 pages
Grade IX Computer Science Assessment Framework
No ratings yet
Grade IX Computer Science Assessment Framework
31 pages
Heat Exchanger Design Principles
No ratings yet
Heat Exchanger Design Principles
31 pages
React Beginner Setup and Essentials Guide
No ratings yet
React Beginner Setup and Essentials Guide
5 pages
Long vs Short Use Cases Explained
No ratings yet
Long vs Short Use Cases Explained
6 pages
Third Harmonic Relay for Generator Faults
No ratings yet
Third Harmonic Relay for Generator Faults
4 pages
SCTP Configuration and Status Updates
100% (1)
SCTP Configuration and Status Updates
18 pages
Non-Destructive Examination Summary
No ratings yet
Non-Destructive Examination Summary
4 pages
Cessup 09 04 Pap LG
No ratings yet
Cessup 09 04 Pap LG
72 pages
Apartment Maintenance Expense Report
No ratings yet
Apartment Maintenance Expense Report
32 pages
Laptop Inventory Overview
No ratings yet
Laptop Inventory Overview
5 pages
Software Requirement Engineering Insights
No ratings yet
Software Requirement Engineering Insights
3 pages
3D Reflection Techniques in Graphics
No ratings yet
3D Reflection Techniques in Graphics
11 pages
High QoS in SDN: Bandwidth Guarantee
No ratings yet
High QoS in SDN: Bandwidth Guarantee
62 pages
Cardio7 Operation Manual Overview
No ratings yet
Cardio7 Operation Manual Overview
81 pages
Long Island N.V. Terms and Conditions
No ratings yet
Long Island N.V. Terms and Conditions
17 pages
High School Calculus: Limits and Derivatives
No ratings yet
High School Calculus: Limits and Derivatives
3 pages

Machine Learning Labs Overview

Uploaded by

Machine Learning Labs Overview

Uploaded by

Machine Learning Practices

LAB 01 – Overview of Machine Learning, Tool & Library ..................................... 2

1. How would you define Machine Learning?

1. Try a Support Vector Machine regressor ([Link]), with various

Question 1: Apply Gradient Descent to find minimum of a function

precision - a precision that determines the stop of the stepwise

l_r - the learning rate (size of each descent step)

return x_list, y_list, x_min

Question 2: Draw all the loss functions

x = [Link](range_min, range_max + stepsize, stepsize)

fig, ax = [Link](figsize=(15, 8))

Question 1: Draw all the activation functions

[Link](x, y1, '-ro')

sigmoid = lambda z: 1/(1 + [Link](-z))

y_sigmoid_manually, y_sigmoid_library = sigmoid(x),

y_relu_manually, y_relu_library = [relu(i) for i in x],

y_leakyrelu_manually, y_leakyrelu_library = [leakyrelu(i, 0.1) for i in x]

y_tanh_manually, y_tanh_library = tanh(x), [Link](x)

y_swish_manually, y_swish_library = swish(x), [Link](x)

Question 2: Draw all the derivatives of activation functions

def PlotDerivative(x, y, dydx, title = "Functions and Its Derivatives"):

y_leakyrelu = [Link](x, alpha= 0.3)

Question 1: Descibe KNN by handling

def euclidian_distance(x1, x2): # x1, x2 là 2 vector.

def fit(self, X, y): # Store sample and label

def predict(self, X):

def _predict(self, x):

Question 2: Using K-NN with [Link] dâta

[Link] Vector Machines

and ℒ𝑘 denotes the index set of class k.

from sklearn import svm

# In this case we assume 2d features.

classifier = [Link](C=10, kernel='linear')

print("The feature [1, 2] is classified as:", [Link]([[1, 2]]))

Familiarize yourself with the data sets by studying scatterplots.

def make_meshgrid(X, h=.02):

def scatter(X, Y, xx, yy, Z):

In the lecture we defined the rbf kernel as

In sklearn, σ is expressed through a γ parameter, and the rbf kernel is given as

LAB 08 – Evaluation Labs

def assign_labels(X, centers):

X = [Link]((X0, X1, X2), axis=0)

from [Link] import display

# Adapted from: [Link]

# Simulate inner core points

# Make the outer ring

# Simulate ring points

return [Link] ([crater_core, crater_ring])

def make_scatter_plot (df, x="x_1", y="x_2", hue="label",

if centers is not None:

def make_scatter_plot2 (df, x="x_1", y="x_2", hue="label", size=5):

Loading the data

display ([Link] (3))

assert len (crater) == 1500

with open (fn('crater_counts.txt'), 'rt') as fp:

make_scatter_plot (crater, hue='kmeans_label')

The first important concept in DBSCAN is that of an ϵ-neighborhood.

• p[:d]: The query point, of dimension d.

def region_query (p, eps, X):

Here is the test code for region_query(). In addition to sanity-checking your

with open (fn('region_query_soln.txt'), 'rt') as fp:

Exercise 2 (1 point). Suppose you are given a vector y[:] of boolean

def index_set (y):

i_test = index_set (y_test)

def find_neighbors (eps, X):

with open (fn('find_neighbors_soln.csv'), 'rt') as fp:

Exercise 4 (2 points). Complete the function, find_core_points(s, neighbors),

def find_core_points (s, neighbors):

core_set = find_core_points (5, neighbors)

def plot_core_set (df, core_set):

plot_core_set (crater, core_set)

with open (fn('find_core_points_soln.txt'), 'rt') as fp:

Growing clusters via "reachable" points

1. Consider any point p that is not yet assigned to a cluster.

Here is a brief illustration of these concepts:

Exercise 5 (4 points). Implement the "cluster growth" procedure described above in