Recommender Systems Course Overview
Subject Code & Name : CCS360 & RECOMMENDER SYSTEMS
Branch : CSE
Year/Semester : III / VI
Course Outcomes :
On the successful completion of the course, the students will be able to
COs   Knowledge Level   Course Outcomes
CO1   K1   Understand the basic concepts of recommender systems.
CO2   K3   Implement machine-learning and data-mining algorithms in recommender systems data sets.
CO3   K3   Implement Collaborative Filtering and carry out performance evaluation of recommender systems based on various metrics.
CO4   K3   Design and implement a simple recommender system.
CO5   K4   Learn about advanced topics of recommender systems.
COLLEGE VISION & MISSION STATEMENT
Vision
To incubate value-based technical education and produce outstanding women graduates
to compete with the technological challenges with the right attitude towards social
empowerment.
Mission
To equip necessary resources and to establish sufficient infrastructure for a beneficial
process of learning that paves the way for making ideal technocrats.
To educate and make the students efficient with necessary skills and to make them
industry ready engineers.
To establish high-level learning and research skills to confront technological
scenarios.
To provide valuable resources for social empowerment and lifelong learning process.
1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems
and design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data, and
synthesis of the information to provide valid conclusions.
5. Modern tool usage:
Create, select, and apply appropriate techniques, resources, and modern engineering and IT
tools including prediction and modeling to complex engineering activities with an
understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need
for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and
write effective reports and design documentation, make effective presentations, and give and
receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological change.
PROGRAM SPECIFIC OUTCOMES (PSOs)
Able to solve problems in the broad area of programming concepts, appraise
environmental and social issues with ethics and manage different projects.
Apply the acquired knowledge to design and develop the computer software and
hardware.
Create solutions by adapting emerging technologies for real time applications of
industry.
Python 12
2    Implement dimension reduction techniques for recommender systems           CO3?
INSTRUCTIONS TO STUDENTS
DOs:
Always sit at the assigned computer.
Enter the laboratory on time and work quietly.
Use the computer properly to keep it in good working condition.
Wear ID cards and lab coats before entering the laboratory.
Report any problems identified in the computer to the staff in charge.
Shut down the computer properly before leaving the lab.
DON’Ts:
INDEX
Sl. No.   Date   Topic   Page No.   Signature
Ex No: 1 DATA SIMILARITY MEASURES USING PYTHON
Date:
Aim:
To implement a Python program to calculate data similarity measures.
Similarity Measures
The similarity measure is the measure of how much alike two data objects are. In a data
mining or machine learning context, a similarity measure is a distance with dimensions
representing features of the objects. If the distance is small, the objects have a high
degree of similarity, whereas a large distance indicates a low degree of similarity.
Similarity measures are used heavily in text preprocessing, and similarity concepts
also underlie advanced word embedding techniques. For example, two fruits
are similar because of color, size, or taste.
Generally, similarities are measured in the range 0 to 1, written [0, 1]. In the machine learning
world, this score is called the similarity score. The two extreme cases
of similarity are:
Similarity = 1 if X = Y (where X, Y are two objects)
Similarity = 0 if X ≠ Y and the two objects have nothing in common
With that background, let us move on to the five most popular similarity and distance measures.
1. Euclidean distance
Euclidean distance is the most commonly used distance measure. In most cases, when
people talk about distance, they mean Euclidean distance, which is why it is also
known simply as distance. The Euclidean distance between two points is the length of the
straight line segment connecting them. The Pythagorean theorem gives this distance between two points.
Program:
from math import sqrt
def euclidean_distance(x, y):
    # Square root of the sum of squared coordinate differences
    return sqrt(sum(pow(a - b, 2) for a, b in zip(x, y)))
print(euclidean_distance([0, 3, 4, 5], [7, 6, 3, -1]))
Output:
9.746794344808963
2. Manhattan distance:
Manhattan distance is a metric in which the distance between two points is calculated as the
sum of the absolute differences of their Cartesian coordinates. Put simply, it is the sum of
the absolute differences between corresponding coordinates, such as the x-coordinates and the y-coordinates.
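The definition above can be sketched in a few lines. This is a minimal illustration rather than the manual's own listing; the function name manhattan_distance is chosen here for illustration, and the input vectors are the ones used in the Euclidean example.

```python
def manhattan_distance(x, y):
    # Sum of the absolute differences of corresponding coordinates
    return sum(abs(a - b) for a, b in zip(x, y))

# |0-7| + |3-6| + |4-3| + |5-(-1)| = 7 + 3 + 1 + 6
print(manhattan_distance([0, 3, 4, 5], [7, 6, 3, -1]))  # 17
```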
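3. Minkowski distance
The Minkowski distance generalizes the Euclidean and Manhattan distances:
$$ d^{MKD}(i, j) = \left( \sum_{k=1}^{n} \left| y_{i,k} - y_{j,k} \right|^{\lambda} \right)^{1/\lambda} $$
In the equation, d^MKD is the Minkowski distance between the data records i and j, k is the
index of a variable, n is the total number of variables in y, and λ is the order of the Minkowski
metric. Although it is defined for any λ > 0, it is rarely used for values other than 1, 2, and ∞.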
Synonyms of Minkowski
Different names for the Minkowski distance or Minkowski metric arise from the order:
λ = 1 is the Manhattan distance. Synonyms are L1-Norm, Taxicab, or City-Block
distance. For two vectors of ranked ordinal variables, the Manhattan distance is
sometimes called Foot-ruler distance.
λ = 2 is the Euclidean distance. Synonyms are L2-Norm or Ruler distance. For two
vectors of ranked ordinal variables, the Euclidean distance is sometimes called
Spearman distance.
λ = ∞ is the Chebyshev distance. Synonyms are Lmax-Norm or Chessboard distance.
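The three special cases listed above can be checked numerically. The sketch below is an illustration under stated assumptions: the helper names minkowski and chebyshev are hypothetical, and the λ = ∞ case is computed directly as the largest absolute coordinate difference, which is the limit of the Minkowski distance as λ grows.

```python
def minkowski(x, y, lam):
    # Minkowski distance of order lam (lam > 0)
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1.0 / lam)

def chebyshev(x, y):
    # Limit of the Minkowski distance as lam grows without bound
    return max(abs(a - b) for a, b in zip(x, y))

x, y = [0, 3, 4, 5], [7, 6, 3, -1]
print(minkowski(x, y, 1))  # Manhattan distance
print(minkowski(x, y, 2))  # Euclidean distance
print(chebyshev(x, y))     # Chebyshev distance
```

Raising the order, for example minkowski(x, y, 50), moves the result towards the Chebyshev value.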
Program:
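from decimal import Decimal
def nth_root(value, n_root):
    # n-th root of value, rounded to three decimal places
    root_value = 1 / float(n_root)
    return round(Decimal(value) ** Decimal(root_value), 3)
def minkowski_distance(x, y, p_value):
    return nth_root(sum(pow(abs(a - b), p_value) for a, b in zip(x, y)), p_value)
print(minkowski_distance([0, 3, 4, 5], [7, 6, 3, -1], 3))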
Output:
8.373
4. Cosine Similarity
The cosine similarity metric finds the normalized dot product of the two attributes. By
determining the cosine similarity, we would effectively try to find the cosine of the angle
between the two objects. The cosine of 0° is 1, and it is less than 1 for any other angle.
It is thus a judgment of orientation and not magnitude. Two vectors with the same
orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and
two diametrically opposed vectors have a similarity of -1, independent of their magnitude.
Cosine similarity is particularly used in positive space, where the outcome is neatly
bounded in [0,1]. One of the reasons for the popularity of cosine similarity is that it is very
efficient to evaluate, especially for sparse vectors.
Program:
from math import sqrt
def square_rooted(x):
    # Euclidean norm of x, rounded to three decimal places
    return round(sqrt(sum([a * a for a in x])), 3)
def cosine_similarity(x, y):
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = square_rooted(x) * square_rooted(y)
    return round(numerator / float(denominator), 3)
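The orientation property described for cosine similarity can be checked with a short sketch. The function cosine below is a hypothetical, unrounded variant written for this illustration, and the vectors are made-up examples.

```python
from math import sqrt

def cosine(x, y):
    # Normalized dot product, without intermediate rounding
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

print(cosine([1, 2], [2, 4]))    # same orientation, different magnitude
print(cosine([1, 0], [0, 1]))    # vectors at 90 degrees
print(cosine([1, 2], [-1, -2]))  # diametrically opposed vectors
```

Up to floating-point error, the three calls give 1, 0 and -1, which shows that scaling a vector does not change the result.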
5. Jaccard Similarity
Intersection - The intersection between two sets A and B is denoted A ∩ B and reveals all
items which are in both sets A, B.
Union - The union between two sets A and B is denoted A ∪ B and reveals all items which
are in either set.
The Jaccard similarity measures the similarity between finite sample sets and is
defined as the cardinality of the intersection of the sets divided by the cardinality of their union.
= 2 / 7 = 0.286
Program:
def jaccard_similarity(x, y):
    # |A ∩ B| / |A ∪ B| on the sets built from x and y
    intersection_cardinality = len(set.intersection(*[set(x), set(y)]))
    union_cardinality = len(set.union(*[set(x), set(y)]))
    return intersection_cardinality / float(union_cardinality)
print (jaccard_similarity([0,1,2,5,6],[0,2,3,5,7,9]))
Output:
0.375
Reference: [Link]
in-python/
Result:
Thus, the Python program to calculate data similarity measures was implemented
successfully.
Ex No: 2 DIMENSIONALITY REDUCTION
Date:
Aim:
To implement Python programs to demonstrate dimensionality reduction techniques.
Missing Value Ratio
What if we have too many missing values (say more than 50%)? Should we impute
the missing values or drop the variable? A better option is to drop the variable since it will not
have much information. However, this isn’t set in stone. We can set a threshold value and if
the percentage of missing values in any variable is more than that threshold, we will drop the
variable.
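The thresholding idea can be sketched on a small made-up table before applying it to the real data. Everything below is an assumption for illustration: the DataFrame toy, its column names, and the 50% cut-off.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is 60% missing, column "c" is 20% missing
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [np.nan, np.nan, np.nan, 4.0, 5.0],
    "c": [1.0, np.nan, 3.0, 4.0, 5.0],
})

threshold = 50  # percent; an assumed cut-off
missing_pct = toy.isnull().mean() * 100       # percentage of missing values per column
kept = toy.loc[:, missing_pct <= threshold]   # keep columns at or below the threshold

print(missing_pct)
print(list(kept.columns))  # ['a', 'c']
```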
First download the csv file from the link given below and then upload the csv file into
the “Colab Notebook” folder in google drive.
[Link]
Train_UWu5bXk.csv
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('/content/gdrive')
train = pd.read_csv("/content/gdrive/MyDrive/Colab Notebooks/Train_UWu5bXk.csv")
# checking the percentage of missing values in each variable
a = train.isnull().sum() / len(train) * 100
print(a)
Output:
Item_Identifier               0.000000
Item_Weight                  17.165317
Item_Fat_Content              0.000000
Item_Visibility               0.000000
Item_Type                     0.000000
Item_MRP                      0.000000
Outlet_Identifier             0.000000
Outlet_Establishment_Year     0.000000
Outlet_Size                  28.276428
Outlet_Location_Type          0.000000
Outlet_Type                   0.000000
Item_Outlet_Sales             0.000000
dtype: float64
As you can see in the above table, there aren’t too many missing values (just 2
variables have them actually). We can impute the values using appropriate methods, or we
can set a threshold of, say 20%, and remove the variable having more than 20% missing
values.
Program:
# saving in 'variable' the columns that have at most 20% of their data missing
variables = train.columns
variable = []
for i in range(len(variables)):
    if a.iloc[i] <= 20:  # setting the threshold as 20%
        variable.append(variables[i])
print(variables)
print(variable)
Output:
Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
'Item_Type', 'Item_MRP', 'Outlet_Identifier',
'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type',
'Outlet_Type', 'Item_Outlet_Sales'],
dtype='object')
Low Variance Filter
As the variance output below shows, the variance of Item_Visibility is very small compared with
that of the other variables, so we can safely drop this column. This is how we apply a low variance filter.
Program:
numeric = train[['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']]
var = numeric.var()
print("Variance : \n", var)
numeric = numeric.columns
variable = []
for i in range(len(var)):
    if var.iloc[i] >= 10:  # setting the variance threshold as 10
        variable.append(numeric[i])
print(numeric)
print(variable)
Output:
Variance:
Item_Weight 17.869561
Item_Visibility 0.002662
Item_MRP 3878.183909
Outlet_Establishment_Year 70.086372
dtype: float64
Index(['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year'],
dtype='object')
['Item_Weight', 'Item_MRP', 'Outlet_Establishment_Year']
Random Forest
Random Forest is one of the most widely used algorithms for feature selection. We
need to convert the data into numeric form by applying one hot encoding, as Random Forest
(Scikit-Learn Implementation) takes only numeric inputs. We can pick the top-most three
features to reduce the dimensionality in our dataset.
Program:
from sklearn.ensemble import RandomForestRegressor
# Drop the columns with missing values and string data types
df = train[['Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']]
# Assumed: the model-fitting step is missing in the source; Item_Outlet_Sales is taken as the target
model = RandomForestRegressor(random_state=1, max_depth=10)
model.fit(df, train['Item_Outlet_Sales'])
features = df.columns
importances = model.feature_importances_
indices = np.argsort(importances)[-3:]  # top 3 features
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Output:
Reference: [Link]
techniques-python/
Factor Analysis
Factor analysis is a dimensionality reduction technique commonly used in statistics. It
is an unsupervised machine-learning technique. This exercise uses a user-created bioChemist dataset
and performs a factor analysis (FA) with two components. There are two types of
factor analysis:
1. Exploratory Factor Analysis - It is used to find structures among a set of attributes.
The number of factors/components is not specified on hand by the researchers or the
scientists. The overall values need to be derived as well.
2. Confirmatory Factor Analysis - It is used for ground-level hypotheses and is based
on existing theories or concepts. Here, the researchers already have an expected
(hypothesized) structure of the data. So the purpose of CFA is to determine the extent
to which the proven data fits the expected data.
Applications of Factor Analysis
1. To reduce the number of variables used to analyze data
2. To detect the structure of the relationship between two sets of variables.
First create the bioChemist dataset in MS Excel with 15 rows and upload it to Google Drive.
index art sex mar kids phd mentor
1 0 women single 0 2 6
2 0 women single 0 4 6
3 0 men married 1 2 3
4 0 women single 0 4 26
5 0 women married 2 4 2
6 0 women married 0 4 3
7 0 men married 2 4 4
8 0 men single 0 3 6
9 0 women married 0 5 0
10 0 men single 0 2 14
11 0 women single 0 3 13
12 0 women married 1 1 3
13 0 women single 0 4 4
14 0 men married 0 4 0
15 0 women single 2 2 2
Program:
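import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis
import matplotlib.pyplot as plt
df = pd.read_csv("/content/gdrive/MyDrive/Colab Notebooks/[Link]")
df = df.iloc[1:15]
print(df)
# Assumed: the feature selection and label encoding are missing in the source;
# column names are taken from the table above
x = df[['art', 'kids', 'phd', 'mentor']]
z = (df['mar'] == 'married').astype(int).to_numpy()  # single -> 0, married -> 1
print(z)
fact_2c = FactorAnalysis(n_components=2)
x_factor = fact_2c.fit_transform(x)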
Output:
[0 1 0 1 1 1 0 1 0 0 1 0 1 0]
Program:
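# Assumed from the plot title: index 0 (single) -> blue, index 1 (married) -> purple
colors = np.array(['blue', 'purple'])
plt.title('Marital Status: Single - Blue & Married - Purple')
plt.xlabel("Factor 1")
plt.ylabel("Factor 2")
plt.scatter(x_factor[:, 0], x_factor[:, 1], c=colors[z])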
Output:
Reference: [Link]
Principal Component Analysis (PCA)
In this exercise, we will cluster the wine dataset with K-Means Clustering and
visualize the clusters after dimensionality reduction with PCA. K-Means Clustering is an
unsupervised learning algorithm that tries to group data into K clusters based on
their similarity.
In k means clustering, we specify the number of clusters we want the data to be
grouped into. The algorithm randomly assigns each observation to a set and finds the centroid
of each set. Then, the algorithm iterates through two steps: Reassign data points to the cluster
whose centroid is closest. Calculate the new centroid of each cluster. These two steps are
repeated until the within-cluster variation cannot be reduced further. The within-cluster
deviation is calculated as the sum of the Euclidean distance between the data points and their
respective cluster centroids.
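The two alternating steps described above can be sketched with NumPy. This is a minimal illustration, not the scikit-learn implementation used later in this exercise. It assumes made-up toy data, and it initializes the centroids from k randomly chosen observations rather than from a random assignment of points to sets; the function name kmeans_sketch is hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct observations chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: reassign every point to the cluster whose centroid is closest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```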
The wine data set is the result of a chemical analysis of wines grown in the same
region in Italy but derived from three different cultivars. The analysis determined the
quantities of 13 constituents found in each of the three types of wines.
Program:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
df = load_wine(as_frame=True)
df = df.frame
df.drop('target', axis=1, inplace=True)
df.head()
Output:
alcohol malic_acid ash alcalinity magnesium phenols flavanoid nonflavanoid pro intensity hue od280 proline
14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0
13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0
13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0
14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0
13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0
Program:
scaler = StandardScaler()
scaler.fit(df)
features = scaler.transform(df)
# Convert to pandas DataFrame
scaled_df = pd.DataFrame(features, columns=df.columns)
# Print the scaled data
scaled_df.head(2)
X = scaled_df.values
# Within-cluster sum of squares for k = 1..10
wcss = {}
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss[i] = kmeans.inertia_
# Assumed: refit with three clusters (one per cultivar); this step is missing in the source
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
kmeans.fit(X)
pca = PCA(n_components=2)
reduced_X = pd.DataFrame(data=pca.fit_transform(X), columns=['PCA1', 'PCA2'])
reduced_X.head()
centers = pca.transform(kmeans.cluster_centers_)
plt.figure(figsize=(7, 5))
# Scatter plot
plt.scatter(reduced_X['PCA1'], reduced_X['PCA2'], c=kmeans.labels_)
plt.scatter(centers[:, 0], centers[:, 1], marker='x', s=100, c='red')
plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.title('Wine Cluster')
plt.tight_layout()
Output:
Reference: [Link]
Singular Value Decomposition (SVD)
Singular Value Decomposition is widely used in computational
engineering and machine learning for feature extraction, linear regression problems with least
squares, dimension reduction, etc.
According to Wikipedia, "In linear algebra, the singular value decomposition (SVD) is
a factorization of a real or complex matrix. It generalizes the eigendecomposition of a square
normal matrix with an orthonormal eigenbasis to any matrix. It is related to the polar
decomposition".
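Before applying SVD to an image, the factorization can be tried on a small matrix. The sketch below is an illustration under stated assumptions: the 4 x 3 matrix A is made up, and numpy.linalg.svd with full_matrices=False is used to obtain the reduced decomposition.

```python
import numpy as np

# Hypothetical 4 x 3 matrix used only for illustration
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0],
              [2.0, 0.0, 4.0],
              [1.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Using all singular values reconstructs A
A_full = U @ np.diag(s) @ Vt

# Keeping only the largest singular value gives a rank-1 approximation
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.allclose(A, A_full))    # True
print(np.linalg.norm(A - A_k))   # Frobenius error of the rank-1 approximation
```

The approximation error equals the square root of the sum of the squared discarded singular values, which is why dropping the smallest singular values loses the least information.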
Program:
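import requests
import cv2
import numpy as np
import matplotlib.pyplot as plt
from skimage.io import imread, imshow
from google.colab import drive
drive.mount('/content/gdrive')
gray_image = imread("/content/gdrive/MyDrive/Colab Notebooks/[Link]", as_gray=True)
# Assumed: the decomposition step is missing in the source; the shapes printed below imply a reduced SVD
u, s, v = np.linalg.svd(gray_image, full_matrices=False)
print(f'u.shape:{u.shape},s.shape:{s.shape},v.shape:{v.shape}')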
Output:
u.shape:(180, 180),s.shape:(180,),v.shape:(180, 240)
Program:
# plot images with different number of components
comps = [180, 1, 5, 10, 15, 20]
plt.figure(figsize=(12, 6))
for i in range(len(comps)):
    # Rank-k reconstruction from the first comps[i] singular values
    low_rank = u[:, :comps[i]] @ np.diag(s[:comps[i]]) @ v[:comps[i], :]
    plt.subplot(2, 3, i + 1)
    plt.imshow(low_rank, cmap='gray')
    if i == 0:
        plt.title(f'Actual Image with n_components = {comps[i]}')
    else:
        plt.title(f'n_components = {comps[i]}')
References:
[Link]
[Link]
Result:
Thus the Python programs to demonstrate Dimensionality reduction techniques were
implemented successfully.
Ex No: 3 USER PROFILE LEARNING
Date:
Aim:
To write Python programs to implement different User Profile Learning techniques.
Naive Bayes Classifier
Naive Bayes is a statistical classification technique based on Bayes' Theorem. It is one
of the simplest supervised learning algorithms. The Naive Bayes classifier is fast and, on
many problems, accurate and reliable, and it scales well to large datasets.
Naive Bayes classifier assumes that the effect of a particular feature in a class is
independent of other features. For example, a loan applicant is desirable or not depending on
his/her income, previous loan and transaction history, age, and location. Even if these features
are interdependent, these features are still considered independently. This assumption
simplifies computation, and that's why it is considered as naive. This assumption is called
class conditional independence.
Naive Bayes classifier calculates the probability of an event in the following steps:
Step 1: Calculate the prior probability for given class labels
Step 2: Find Likelihood probability with each attribute for each class
Step 3: Put this value in Bayes Formula and calculate posterior probability.
Step 4: Compare the posterior probabilities and assign the input to the class with the
highest probability.
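The four steps above can be traced by hand on a tiny categorical example. The sketch below is an illustration only: the records, the feature meanings and the function name posterior are all made up, and no smoothing is applied, so a feature value never seen with a class gives that class a probability of zero.

```python
from collections import Counter

# Hypothetical toy data: (income level, has previous loan, repaid?)
records = [
    ("high", "no", "yes"), ("high", "yes", "yes"), ("low", "no", "yes"),
    ("low", "yes", "no"), ("low", "yes", "no"), ("high", "yes", "no"),
]

def posterior(income, prev_loan):
    labels = [r[2] for r in records]
    scores = {}
    for c, n_c in Counter(labels).items():
        prior = n_c / len(records)                           # Step 1: prior
        rows = [r for r in records if r[2] == c]
        p_income = sum(r[0] == income for r in rows) / n_c   # Step 2: likelihoods
        p_loan = sum(r[1] == prev_loan for r in rows) / n_c
        scores[c] = prior * p_income * p_loan                # Step 3: Bayes numerator
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}         # normalized posterior

post = posterior("high", "yes")
print(post)
print(max(post, key=post.get))                               # Step 4: pick the larger posterior
```

For this query the posterior is 0.4 for "yes" and 0.6 for "no", so the predicted class is "no".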
Download loan_data.csv dataset from [Link]
data and upload in Google drive.
Program:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/loan_data.csv')
df.head()
Output:
credit  purpose             int.rate  install  log        dti    fico  days         bal    util  inq  delinq  pub.rec  not.fully.paid
1       debt_consolidation  0.1189    829.10   11.350407  19.48  737   5639.958333  28854  52.1  0    0       0        0
1       credit_card         0.1071    228.22   11.082143  14.29  707   2760.000000  33623  76.7  0    0       0        0
1       debt_consolidation  0.1357    366.86   10.373491  11.63  682   4710.000000  3511   25.6  1    0       0        0
1       debt_consolidation  0.1008    162.34   11.350407  8.10   712   2699.958333  33667  73.2  1    0       0        0
1       credit_card         0.1426    102.92   11.299732  14.97  667   4066.000000  4740   39.5  0    1       0        0
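sns.countplot(data=df, x='purpose', hue='not.fully.paid')
plt.xticks(rotation=45, ha='right');
pre_df = pd.get_dummies(df, columns=['purpose'], drop_first=True)
pre_df.head()
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score
X = pre_df.drop('not.fully.paid', axis=1)
y = pre_df['not.fully.paid']
# Assumed: the split and metric steps are missing in the source; the split parameters are illustrative
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=125)
model = GaussianNB()
model.fit(X_train, y_train);
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")
print("Accuracy:", accuracy)
print("F1 Score:", f1)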
Output:
Reference: [Link]
Rule Based Classifier
Rule-based classifiers are another type of classifier that makes the class decision
using various “if..else” rules. These rules are easily interpretable, and thus these
classifiers are generally used to generate descriptive models. The condition used with “if” is
called the antecedent, and the predicted class of each rule is called the consequent.
Properties of rule-based classifiers:
1. Coverage: the percentage of records which satisfy the antecedent conditions of a particular rule.
2. The rules generated by the rule-based classifiers are generally not mutually exclusive,
i.e. many rules can cover the same record.
3. The rules generated by the rule-based classifiers may not be exhaustive, i.e. there may
be some records which are not covered by any of the rules.
4. The decision boundaries created by them are linear, but these can be much more
complex than those of a decision tree because many rules can be triggered for the same
record.
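The antecedent and consequent structure and the properties listed above can be illustrated with a few hand-written rules. The records, rules and function names below are hypothetical and are not taken from the dataset used in this exercise.

```python
# Hypothetical records and rules, for illustration only
records = [
    {"age": 17, "source": "android"},
    {"age": 25, "source": "ios"},
    {"age": 45, "source": "android"},
    {"age": 30, "source": "web"},
]

# Each rule is (antecedent, consequent); rules are checked in order
rules = [
    (lambda r: r["age"] < 18, "teen"),
    (lambda r: r["source"] == "ios", "ios_user"),
    (lambda r: r["source"] == "android", "android_user"),
]

def classify(record, default="unknown"):
    for antecedent, consequent in rules:
        if antecedent(record):
            return consequent
    return default  # the rules are not exhaustive, so a fallback class is needed

def coverage(antecedent):
    # Fraction of records that satisfy the antecedent of a rule
    return sum(antecedent(r) for r in records) / len(records)

print([classify(r) for r in records])
print([coverage(a) for a, _ in rules])
```

The first record satisfies both the first and the third rule, which shows that the rules are not mutually exclusive; checking them in a fixed order is one simple way to resolve such conflicts.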
You can get [Link] data from this link [Link]
classification/tree/main/data
Program:
import pandas as pd
from google.colab import drive
drive.mount('/content/gdrive')
df = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/[Link]')
df.head()
Output:
PRICE SOURCE SEX COUNTRY AGE
0 39 android male bra 17
1 39 android male bra 17
2 49 android male bra 17
3 29 android male tur 17
4 49 android male tur 17
Program:
df.isnull().values.any()  # True if any value is missing in the DataFrame
df.isnull().sum()
Output:
PRICE 0
SOURCE 0
SEX 0
COUNTRY 0
AGE 0
dtype: int64
Program:
df["SOURCE"].nunique() # Count number of distinct SOURCE elements
df["SOURCE"].value_counts()# Returns counts of SOURCE rows
df["COUNTRY"].value_counts() # Returns counts of COUNTRY rows
Output:
usa 2065
bra 1496
deu 455
tur 451
fra 303
can 230
Name: COUNTRY, dtype: int64
Program:
# Country breakdown of income averages
df.groupby("COUNTRY")["PRICE"].agg(["mean"])
Output:
              mean
COUNTRY
bra      34.327540
can      33.608696
deu      34.032967
fra      33.587459
tur      34.787140
usa      34.007264
Program:
# Country and Source breakdown of income averages
df.groupby(["COUNTRY", 'SOURCE'])["PRICE"].mean()
Output:
COUNTRY  SOURCE
bra      android    34.387029
         ios        34.222222
can      android    33.330709
         ios        33.951456
deu      android    33.869888
         ios        34.268817
fra      android    34.312500
         ios        32.776224
tur      android    36.229437
         ios        33.272727
usa      android    33.760357
         ios        34.371703
Name: PRICE, dtype: float64
Program:
# Average income on the basis of variables
agg_df = df.groupby(["COUNTRY", 'SOURCE', "SEX", "AGE"])["PRICE"].mean().sort_values(ascending=False)
agg_df.head()
Output:
COUNTRY  SOURCE   SEX     AGE
bra      android  male    46     59.0
usa      android  male    36     59.0
fra      android  female  24     59.0
usa      ios      male    32     54.0
deu      android  female  36     49.0
Name: PRICE, dtype: float64
Program:
# Convert the index names to variable names
agg_df = agg_df.reset_index()
agg_df.head()
Output:
   COUNTRY  SOURCE   SEX     AGE  PRICE
0  bra      android  male    46   59.0
1  usa      android  male    36   59.0
2  fra      android  female  24   59.0
3  usa      ios      male    32   54.0
4  deu      android  female  36   49.0
Program:
# Convert AGE variable to categorical variable and adding it to agg_df
my_labels = ['0_18', '19_23', '24_30', '31_40', '41_70']
agg_df["AGE_CUT"] = pd.cut(x=agg_df["AGE"], bins=[0, 18, 23, 30, 40, 70], labels=my_labels)
agg_df.tail(10)
Output:
COUNTRY SOURCE SEX AGE PRICE AGE_CUT
338 bra android male 23 21.5 19_23
339 tur android male 21 19.0 19_23
340 tur ios male 47 19.0 41_70
341 bra ios female 34 19.0 31_40
342 bra ios male 47 19.0 41_70
343 usa ios female 38 19.0 31_40
344 usa ios female 30 19.0 24_30
345 can android female 27 19.0 24_30
346 fra android male 18 19.0 0_18
347 deu android male 26 9.0 24_30
Program:
# Identify new level-based customers (Personas)
agg_df["customers_level_based"] = [f"{i[0]}_{i[1]}_{i[2]}_{i[-1]}" for i in agg_df.values]
agg_df["customers_level_based"].head()
Output:
0    bra_android_male_41_70
1    usa_android_male_31_40
2    fra_android_female_24_30
3    usa_ios_male_31_40
4    deu_android_female_31_40
Name: customers_level_based, dtype: object
Program:
# Segment new customers (Personas)
agg_df["SEGMENT"] = pd.qcut(agg_df["PRICE"], 4, labels=["D", "C", "B", "A"])
agg_df.head()
Output:
COU  SOURCE   SEX     AGE  PRICE  AGE_CUT  customers_level           SEGMENT
bra  android  male    46   59.0   41_70    bra_android_male_41_70    A
usa  android  male    36   59.0   31_40    usa_android_male_31_40    A
fra  android  female  24   59.0   24_30    fra_android_female_24_30  A
usa  ios      male    32   54.0   31_40    usa_ios_male_31_40        A
deu  android  female  36   49.0   31_40    deu_android_female_31_40  A
Program:
# Describe the segments and especially "C"
agg_df.groupby(["SEGMENT"]).agg({"PRICE": ["mean", "max", "sum"]})
agg_df[agg_df["SEGMENT"] == "C"].describe()
Output:
             AGE      PRICE
count  95.000000  95.000000
mean   26.663158  32.933339
std    10.075893   0.877933
min    15.000000  31.173913
25%    19.000000  32.333333
50%    24.000000  32.913043
75%    32.000000  33.861004
max    54.000000  34.000000
Program:
new_user = "fra_android_male_24_30"
print(agg_df[agg_df["customers_level_based"] == new_user])
Output:
COU SOURCE SEX AGE PRICE AGE_CUT customers_level SEGMENT
fra android male 25 33.0 24_30 fra_android_male_24_30 C
Reference: [Link]
classification-problem-6088c0e405d4
Result:
The Python programs to implement different User Profile Learning techniques were
implemented successfully.
Ex No: 4 CONTENT-BASED RECOMMENDATION SYSTEM
Date:
Aim:
To implement a Content based Recommender System in Python.
Python Recommendation Systems
Recommendation systems in Python employ a data-driven methodology to offer
customers tailored recommendations. They use user data and algorithms to forecast and suggest
goods, services, or content that a user is likely to find interesting.
Recommender systems are of different types:
Content-Based Recommendation: It is supervised machine learning used to induce a
classifier to discriminate between interesting and uninteresting items for the user.
Collaborative Filtering: Collaborative Filtering recommends items based on similarity
measures between users and/or items. The basic assumption behind the algorithm is
that users with similar interests have common preferences.
Content-Based Recommendation System
Content-based systems recommend items to the customer that are similar to items the same
customer previously rated highly. They use the features and properties of the items. From these
properties, the similarity between the items can be calculated.
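The idea of comparing items through their properties can be sketched with a few hand-made feature vectors. The item names, the three genre features and the helper functions below are hypothetical; cosine similarity is used as the item-to-item measure.

```python
import numpy as np

# Hypothetical item features: columns are [comedy, action, romance]
items = {
    "Movie A": np.array([1.0, 0.0, 1.0]),
    "Movie B": np.array([1.0, 0.0, 0.0]),
    "Movie C": np.array([0.0, 1.0, 0.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def similar_items(liked, k=2):
    # Rank the other items by feature similarity to the item the user liked
    scores = {name: cosine(items[liked], vec)
              for name, vec in items.items() if name != liked}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(similar_items("Movie A"))  # ['Movie B', 'Movie C']
```

Only item features are used here; no ratings from other users are needed, which is what distinguishes this approach from collaborative filtering.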
Program:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import NearestNeighbors
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
Output:
userId movieId rating timestamp
0 1 1 4.0 964982703
1 1 3 4.0 964981247
2 1 6 4.0 964982224
3 1 47 5.0 964983815
4 1 50 5.0 964982931
Program:
# loading movie dataset
movies = pd.read_csv("[Link]
tutorial/[Link]")
print(movies.head())
Output:
Program (Optional):
n_ratings = len(ratings)
n_movies = len(ratings['movieId'].unique())
n_users = len(ratings['userId'].unique())
Output:
Number of ratings: 100836
Number of unique movieId's: 9724
Number of unique users: 610
Average ratings per user: 165.3
Average ratings per movie: 10.37
Program (Optional):
user_freq = ratings[['userId', 'movieId']].groupby(
'userId').count().reset_index()
user_freq.columns = ['userId', 'n_ratings']
print(user_freq.head())
Output:
userId n_ratings
0 1 232
1 2 29
2 3 39
3 4 216
4 5 44
Program (Optional):
# Find lowest and highest rated movies:
mean_rating = ratings.groupby('movieId')[['rating']].mean()
# Lowest rated movie
lowest_rated = mean_rating['rating'].idxmin()
movies.loc[movies['movieId'] == lowest_rated]
# Highest rated movie
highest_rated = mean_rating['rating'].idxmax()
movies.loc[movies['movieId'] == highest_rated]
# show the ratings given to the highest rated movie
ratings[ratings['movieId'] == highest_rated]
# show the ratings given to the lowest rated movie
ratings[ratings['movieId'] == lowest_rated]
# the above movies have very few ratings, so a Bayesian average is a better measure
movie_stats = ratings.groupby('movieId')[['rating']].agg(['count', 'mean'])
movie_stats.columns = movie_stats.columns.droplevel()
Program:
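# Now, we create user-item matrix using scipy csr matrix
from scipy.sparse import csr_matrix
def create_matrix(df):
    N = len(df['userId'].unique())
    M = len(df['movieId'].unique())
    # Assumed completion: the rest of this function is missing in the source
    # Map IDs to matrix indices and back
    user_mapper = dict(zip(np.unique(df['userId']), range(N)))
    movie_mapper = dict(zip(np.unique(df['movieId']), range(M)))
    user_inv_mapper = {v: k for k, v in user_mapper.items()}
    movie_inv_mapper = {v: k for k, v in movie_mapper.items()}
    user_index = [user_mapper[i] for i in df['userId']]
    movie_index = [movie_mapper[i] for i in df['movieId']]
    # Rows are movies, columns are users
    X = csr_matrix((df['rating'], (movie_index, user_index)), shape=(M, N))
    return X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper
X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper = create_matrix(ratings)
"""
Find similar movies using KNN
"""
def find_similar_movies(movie_id, X, k, metric='cosine', show_distance=False):
    neighbour_ids = []
    movie_ind = movie_mapper[movie_id]
    movie_vec = X[movie_ind]
    k += 1  # the query movie itself comes back as its own nearest neighbour
    kNN = NearestNeighbors(n_neighbors=k, algorithm="brute", metric=metric)
    kNN.fit(X)
    movie_vec = movie_vec.reshape(1, -1)
    neighbour = kNN.kneighbors(movie_vec, return_distance=show_distance)
    for i in range(0, k):
        n = neighbour.item(i)
        neighbour_ids.append(movie_inv_mapper[n])
    neighbour_ids.pop(0)
    return neighbour_ids
# Assumed usage, missing in the source: list the neighbours of one movie by its movieId
movie_titles = dict(zip(movies['movieId'], movies['title']))
movie_id = 3
similar_ids = find_similar_movies(movie_id, X, k=10)
print(f"Since you watched {movie_titles[movie_id]}")
for i in similar_ids:
    print(movie_titles[i])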
Output:
Since you watched Grumpier Old Men (1995)
Grumpy Old Men (1993)
Striptease (1996)
Nutty Professor, The (1996)
Twister (1996)
Father of the Bride Part II (1995)
Broken Arrow (1996)
Bio-Dome (1996)
Truth About Cats & Dogs, The (1996)
Sabrina (1995)
Birdcage, The (1996)
Program:
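def recommend_movies(user_id, X, user_mapper, movie_mapper, movie_inv_mapper, k=10):
    df1 = ratings[ratings['userId'] == user_id]
    if df1.empty:
        print(f"User with ID {user_id} does not exist.")
        return
    # Assumed completion (missing in the source): start from the user's highest-rated movie
    movie_id = df1[df1['rating'] == max(df1['rating'])]['movieId'].iloc[0]
    movie_titles = dict(zip(movies['movieId'], movies['title']))
    similar_ids = find_similar_movies(movie_id, X, k)
    movie_title = movie_titles.get(movie_id, "Not found")
    if movie_title == "Not found":
        print(f"Movie with ID {movie_id} not found.")
        return
    print(f"Since you watched {movie_title}, you might also like:")
    for i in similar_ids:
        print(movie_titles.get(i, "Not found"))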
Output:
Since you watched Twelve Monkeys (a.k.a. 12 Monkeys) (1995), you might also like:
Pulp Fiction (1994)
Terminator 2: Judgment Day (1991)
Independence Day (a.k.a. ID4) (1996)
Seven (a.k.a. Se7en) (1995)
Fargo (1996)
Fugitive, The (1993)
Usual Suspects, The (1995)
Jurassic Park (1993)
Star Wars: Episode IV - A New Hope (1977)
Heat (1995)
Reference: [Link]
Result:
Thus the Python program to implement a Content based Recommender System was
executed successfully.
Ex No: 5 COLLABORATIVE FILTERING TECHNIQUES
Date:
Aim:
To implement different Collaborative Filtering Techniques in Python.
Python Recommendation Systems
A recommendation system uses a data-driven methodology to offer customers tailored
recommendations. It uses user data and algorithms to predict and suggest goods, services,
or content that a user is likely to find interesting. Recommender systems are of
different types:
Content-Based Recommendation: A supervised machine learning approach that induces a
classifier to discriminate between interesting and uninteresting items for the user.
Collaborative Filtering: Collaborative filtering recommends items based on similarity
measures between users and/or items. The basic assumption behind the algorithm is
that users with similar interests have common preferences.
In user-based collaborative filtering, similar users are assumed to give similar ratings to
the same item. Therefore, if Alice and Bob have rated movies in a similar way in the past,
then one can use Alice’s observed rating of the movie Terminator to predict Bob’s
unobserved rating of this movie. Use the following
[Link] file.
user_0 user_1 user_2 user_3 user_4 user_5 user_6 user_7 user_8 user_9
movie_0 0 0 3 4 2 1 2 0 5 1
movie_1 3 0 1 3 0 0 0 0 0 0
movie_2 0 3 0 4 0 2 0 0 0 2
movie_3 5 2 3 2 0 4 3 3 0 0
movie_4 0 5 5 0 0 0 0 0 5 4
movie_5 0 0 0 0 4 0 4 2 3 0
movie_6 4 4 0 0 4 4 3 4 0 4
movie_7 5 0 4 2 3 0 3 3 3 3
movie_8 0 3 0 0 5 5 0 4 0 0
movie_9 2 0 0 0 0 0 0 0 4 0
Program:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
df = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/[Link]')
df.head()
Output:
Unnamed: 0 user_0 user_1 user_2 user_3 user_4 user_5 user_6 user_7 user_8 user_9
0 movie_0 0 0 3 4 2 1 2 0 5 1
1 movie_1 3 0 1 3 0 0 0 0 0 0
2 movie_2 0 3 0 4 0 2 0 0 0 2
3 movie_3 5 2 3 2 0 4 3 3 0 0
4 movie_4 0 5 5 0 0 0 0 0 5 4
Program:
df.drop(df.columns[0], axis=1, inplace=True)
matrix = df[0:10].to_numpy()
item_similarity = cosine_similarity(matrix.T)
Output:
  0 1 2 3 4 5 6 7 8 9
0 5 1 7 0 4 9 3 6 8 2
1 4 1 8 5 7 9 6 2 3 0
2 8 6 4 0 7 2 3 5 9 1
3 8 3 9 2 1 4 5 6 0 7
4 5 4 3 7 0 6 1 8 2 9
5 3 1 2 5 9 0 8 7 4 6
6 8 3 2 0 9 1 6 4 5 7
7 3 5 1 8 4 9 7 2 0 6
8 8 3 2 0 9 6 1 4 5 7
9 5 1 7 4 3 0 9 6 2 8
Program:
# Display recommendations
user_id = int(input("Enter user id as integer : "))
print("Top 5 Items recommended for user_", user_id)
for i in range(5):
    print("Recommendation ", i+1, " : movie_", recommended_items[i][user_id])
Output:
Enter user id as integer : 4
Top 5 Items recommended for user_ 4
Recommendation 1 : movie_ 5
Recommendation 2 : movie_ 4
Recommendation 3 : movie_ 3
Recommendation 4 : movie_ 7
Recommendation 5 : movie_ 0
To make recommendations for a target item B, the first step is to determine a set S of
items that are most similar to item B. Then, to predict the rating of any user A for item B,
the ratings that A has given to the items in S are collected. The weighted average of these
ratings, weighted by each item's similarity to B, is the predicted rating of user A for
item B. Use the following
[Link] file.
user_0 user_1 user_2 user_3 user_4 user_5 user_6 user_7 user_8 user_9 description
movie_0 0 0 3 4 2 1 2 0 5 1 Adventure|Animation|Children|Comedy
movie_1 3 0 1 3 0 0 0 0 0 0 Adventure|Children|Fantasy
movie_2 0 3 0 4 0 2 0 0 0 2 Comedy|Romance
movie_3 5 2 3 2 0 4 3 3 0 0 Comedy|Drama|Romance
movie_4 0 5 5 0 0 0 0 0 5 4 Comedy
movie_5 0 0 0 0 4 0 4 2 3 0 Adventure|Comedy|Fantasy
movie_6 4 4 0 0 4 4 3 4 0 4 Animation|Children|Fantasy
movie_7 5 0 4 2 3 0 3 3 3 3 Children|Comedy
movie_8 0 3 0 0 5 5 0 4 0 0 Animation|Romance
movie_9 2 0 0 0 0 0 0 0 4 0 Comedy|Drama
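The weighted-average prediction described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the helper name `predict_item_based` and the neighbourhood size `k=2` are hypothetical, cosine similarity is assumed as the similarity measure, and only the first four movie rows of the table are used.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = movies, columns = users, 0 = not rated (movie_0 .. movie_3 from the table)
R = np.array([
    [0, 0, 3, 4, 2, 1, 2, 0, 5, 1],
    [3, 0, 1, 3, 0, 0, 0, 0, 0, 0],
    [0, 3, 0, 4, 0, 2, 0, 0, 0, 2],
    [5, 2, 3, 2, 0, 4, 3, 3, 0, 0],
], dtype=float)

def predict_item_based(R, user, item, k=2):
    """Predict R[item, user] as a similarity-weighted average of the user's own ratings."""
    sim = cosine_similarity(R)               # item-item similarity matrix
    rated = np.flatnonzero(R[:, user])       # items this user has already rated
    rated = rated[rated != item]
    # S: the k rated items most similar to the target item
    top = rated[np.argsort(sim[item, rated])[::-1][:k]]
    weights = sim[item, top]
    if weights.sum() == 0:
        return 0.0                           # no similar rated item to draw on
    return float(weights @ R[top, user] / weights.sum())

pred = predict_item_based(R, user=0, item=0)
print(round(pred, 2))
```

Here user_0 has rated movie_1 (3) and movie_3 (5), so the predicted rating for movie_0 falls between those two values, closer to the rating of whichever movie is more similar to movie_0.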
Program:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/[Link]')
df.head()
Output:
Unnamed: 0 user_0 user_1 user_2 user_3 user_4 user_5 user_6 user_7 user_8 user_9 description
0 movie_0 0 0 3 4 2 1 2 0 5 1 Adventure|Animation|Children|Comedy|Fantasy
1 movie_1 3 0 1 3 0 0 0 0 0 0 Adventure|Children|Fantasy
2 movie_2 0 3 0 4 0 2 0 0 0 2 Comedy|Romance
3 movie_3 5 2 3 2 0 4 3 3 0 0 Comedy|Drama|Romance
4 movie_4 0 5 5 0 0 0 0 0 5 4 Comedy
Program:
# Extract features from text descriptions
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['description'])
user_profile = np.zeros(tfidf_matrix.shape[1])
# Drop the movie-name column and the description column, keeping only the ratings
df.drop(df.columns[0], axis=1, inplace=True)
df.drop(df.columns[10], axis=1, inplace=True)
matrix = df[0:10].to_numpy()
print(matrix)
Output:
[[0 0 3 4 2 1 2 0 5 1]
 [3 0 1 3 0 0 0 0 0 0]
 [0 3 0 4 0 2 0 0 0 2]
 [5 2 3 2 0 4 3 3 0 0]
 [0 5 5 0 0 0 0 0 5 4]
 [0 0 0 0 4 0 4 2 3 0]
 [4 4 0 0 4 4 3 4 0 4]
 [5 0 4 2 3 0 3 3 3 3]
 [0 3 0 0 5 5 0 4 0 0]
 [2 0 0 0 0 0 0 0 4 0]]
Program:
rating = [Link](axis = 1)
for i in range(10):
    user_profile += tfidf_matrix[i].toarray()[0] * rating[i]
# Calculate cosine similarity between the user profile and item features
similarities = cosine_similarity([user_profile], tfidf_matrix)
# Get recommended item IDs
# Note: np.argsort sorts in ascending order of similarity
recommended_items = np.argsort(similarities)[:10]
print(recommended_items)
Output:
[[9 8 3 1 4 5 2 6 7 0]]
Program:
# Display recommendations
print("Top 5 Recommended Items:")
for i in range(5):
    print("Recommendation ", i+1, " : movie_", recommended_items[0][i])
Output:
Top 5 Recommended Items:
Recommendation 1 : movie_ 9
Recommendation 2 : movie_ 8
Recommendation 3 : movie_ 3
Recommendation 4 : movie_ 1
Recommendation 5 : movie_ 4
References:
[Link]
[Link]
recommendation-systems-836e5e2fe152
[Link]
item-collaborative-filtering-in-python-3baae5179c52
Result:
Thus the Python programs to implement different Collaborative Filtering Techniques were
executed successfully.
Ex No: 6 RECEIVER OPERATING CHARACTERISTIC CURVES
Date:
Aim:
To implement Receiver Operating Characteristic curves in Python.
Receiver Operating Characteristic curves
The AUC-ROC curve, or Area Under the Receiver Operating Characteristic curve, is a
graphical representation of the performance of a binary classification model at various
classification thresholds. It is commonly used in machine learning to assess the ability of a
model to distinguish between two classes, typically the positive class (e.g., presence of a
disease) and the negative class (e.g., absence of a disease).
ROC Curve
ROC stands for Receiver Operating Characteristic, and the ROC curve is a graphical
representation of the effectiveness of a binary classification model. It plots the true positive
rate (TPR) against the false positive rate (FPR) at different classification thresholds.
AUC:
AUC stands for Area Under the Curve, and it is the area under the ROC curve. It measures
the overall performance of the binary classification model. As both TPR and FPR range from
0 to 1, the area always lies between 0 and 1, and a greater value of AUC denotes better
model performance. A model that ranks no better than chance has an AUC of about 0.5. Our
main goal is to maximize this area in order to have the highest TPR and lowest FPR at a
given threshold. The AUC equals the probability that the model assigns a randomly chosen
positive instance a higher predicted probability than a randomly chosen negative instance.
It therefore represents how well the model is able to distinguish between the two classes
present in the target.
The ROC curve is a graph that shows the performance of a classification model at all
possible thresholds (a threshold is a particular value beyond which a point is assigned to a
particular class). The curve is plotted between two parameters:
TPR (True Positive Rate) = TP / (TP + FN)
FPR (False Positive Rate) = FP / (FP + TN)
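To make the two rates concrete, the sketch below computes the ROC points by hand for a four-sample example and integrates them with the trapezoidal rule. The scores, labels, and the helper name `roc_points` are hypothetical and chosen only for illustration; in practice `roc_curve` and `roc_auc_score` from scikit-learn do this work.

```python
import numpy as np

# Hypothetical labels and predicted scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

def roc_points(y_true, y_score, thresholds):
    """Return one (FPR, TPR) pair per threshold; a score >= threshold counts as positive."""
    points = []
    for t in thresholds:
        y_pred = y_score >= t
        tp = np.sum(y_pred & (y_true == 1))
        fp = np.sum(y_pred & (y_true == 0))
        fn = np.sum(~y_pred & (y_true == 1))
        tn = np.sum(~y_pred & (y_true == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Thresholds in decreasing order, so the curve runs from (0, 0) to (1, 1)
points = roc_points(y_true, y_score, thresholds=[0.9, 0.8, 0.4, 0.35, 0.1])

# Area under the piecewise-linear curve (trapezoidal rule)
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points[:-1], points[1:]))
print(points, auc)
```

Lowering the threshold can only add positive predictions, so both rates grow monotonically from 0 to 1 as the threshold falls.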
Program:
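import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
# NOTE: the data set-up lines are missing from this listing; the next lines are an
# assumed reconstruction (sample size, split ratio and seeds are illustrative only)
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# "no skill" baseline: the same score for every test sample
ns_probs = [0 for _ in range(len(testy))]
model = LogisticRegression()
# fit the model
model.fit(trainX, trainy)
# predict probabilities
lr_probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# calculate scores
ns_auc = roc_auc_score(testy, ns_probs)
lr_auc = roc_auc_score(testy, lr_probs)
# summarize scores
print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Logistic: ROC AUC=%.3f' % (lr_auc))
# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(testy, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(testy, lr_probs)
# plot the roc curve for the model
plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()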
Output:
Reference: [Link]
classification-in-python/
Result:
Thus the Python program to implement Receiver Operating Characteristic curves was
executed successfully.
Ex No: 7 ATTACKS ON RECOMMENDER SYSTEM
Date:
Aim:
To implement attacks on a Recommender System in Python.
Attacks on Recommender Systems
Recommender systems have been shown vulnerable to adversarial attacks that force
the models to produce misleading recommendations.
The person making the attack on the recommender system is also referred to as the
adversary.
A fake profile refers to a set of ratings corresponding to a fake user created by the
adversary. The number of injected profiles may depend on the specific
recommendation algorithm being attacked, and the approach used to attack it.
Major Classification of Attacks
An attack that requires only a small number of injected profiles is referred to as an
efficient attack; such attacks are often difficult to detect.
On the other hand, if an attack requires a large number of injected profiles, then it is an
inefficient attack, because most systems should be able to detect a sudden
injection of a large number of ratings about a small number of items.
Attacks can also be classified by the amount of knowledge required to mount them successfully.
Some attacks require only limited knowledge about the ratings distribution. Such
attacks are referred to as low-knowledge attacks.
On the other hand, attacks that require a large amount of knowledge about the ratings
distribution are referred to as high-knowledge attacks.
Example of recommender system attacks:
Amazon product reviews have been distorted by thousands of fake ones. False reviews were
helping unknown brands dominate searches for popular items. Hundreds of unverified
five-star reviews were being posted on product pages in a single day. Many product pages
also included positive reviews for completely different items.
Push Attack
The manufacturer of an item or the author of a book might submit fake positive reviews
on Amazon in order to maximize sales. Such attacks are referred to as product push attacks.
Program:
import numpy as np
import pandas as pd
df=pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/[Link]')
print(df)
Output
users title genre rating
0 user_0 Leo Action | Comedy | Romance 4
1 user_1 Mark Antony Comedy 5
2 user_2 Por Thozhil Action | Detective 4
3 user_3 PS2 Action | Romance 4
4 user_0 Dada Romance 4
5 user_3 Thalaivi Biography 5
6 user_0 PS2 Action | Romance 3
7 user_4 Dada Romance 5
8 user_5 Mark Antony Comedy 5
9 user_6 Mark Antony Comedy 5
10 user_7 Mark Antony Comedy 5
11 user_0 Mark Antony Comedy 5
12 user_7 Leo Action | Comedy | Romance 3
Program:
#Naive Push Attack
import csv
Output:
['users', 'title', 'genre', 'rating']
['user_0', 'Leo', 'Action | Comedy | Romance', '4']
['user_1', 'Mark Antony', 'Comedy', '5']
['user_2', 'Por Thozhil', 'Action | Detective', '4']
['user_3', 'PS2', 'Action | Romance', '4']
['user_0', 'Dada', 'Romance', '4']
['user_3', 'Thalaivi', 'Biography', '5']
['user_0', 'PS2', 'Action | Romance', '3']
['user_4', 'Dada', 'Romance', '5']
['user_5', 'Mark Antony', 'Comedy', '5']
['user_6', 'Mark Antony', 'Comedy', '5']
['user_7', 'Mark Antony', 'Comedy', '5']
['user_0', 'Mark Antony', 'Comedy', '5']
['user_7', 'Leo', 'Action | Comedy | Romance', '3']
['user_1', 'Jawan', 'Action | Romance', 5]
['user_2', 'Jawan', 'Action | Romance', 5]
['user_3', 'Jawan', 'Action | Romance', 5]
['user_4', 'Jawan', 'Action | Romance', 5]
['user_5', 'Jawan', 'Action | Romance', 5]
['user_6', 'Jawan', 'Action | Romance', 5]
['user_7', 'Jawan', 'Action | Romance', 5]
Nuke Attack
The competitor of an item manufacturer might submit malicious reviews about the item.
Such attacks are referred to as Nuke attacks.
Program:
#Naive Nuke Attack
import csv
Output:
['users', 'title', 'genre', 'rating']
['user_0', 'Leo', 'Action | Comedy | Romance', '4']
['user_1', 'Mark Antony', 'Comedy', '5']
['user_2', 'Por Thozhil', 'Action | Detective', '4']
['user_3', 'PS2', 'Action | Romance', '4']
['user_0', 'Dada', 'Romance', '4']
['user_3', 'Thalaivi', 'Biography', '5']
['user_0', 'PS2', 'Action | Romance', '3']
['user_4', 'Dada', 'Romance', '5']
['user_5', 'Mark Antony', 'Comedy', '5']
['user_6', 'Mark Antony', 'Comedy', '5']
['user_7', 'Mark Antony', 'Comedy', '5']
['user_0', 'Mark Antony', 'Comedy', '5']
['user_7', 'Leo', 'Action | Comedy | Romance', '3']
['user_1', 'Jawan', 'Action | Romance', 0]
['user_2', 'Jawan', 'Action | Romance', 0]
['user_3', 'Jawan', 'Action | Romance', 0]
['user_4', 'Jawan', 'Action | Romance', 0]
['user_5', 'Jawan', 'Action | Romance', 0]
['user_6', 'Jawan', 'Action | Romance', 0]
['user_7', 'Jawan', 'Action | Romance', 0]
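The push and nuke listings above are abbreviated, so the following is a minimal sketch of how such naive attacks could be simulated with pandas rather than the csv module. The helper name `inject_profiles` and the three-row `ratings` frame are assumptions for illustration; the injected users (user_1 to user_7), the target movie, and the ratings 5 and 0 mirror the outputs shown above.

```python
import pandas as pd

# A small slice of the ratings file shown earlier
ratings = pd.DataFrame({
    "users": ["user_0", "user_1", "user_2"],
    "title": ["Leo", "Mark Antony", "Por Thozhil"],
    "genre": ["Action | Comedy | Romance", "Comedy", "Action | Detective"],
    "rating": [4, 5, 4],
})

def inject_profiles(df, title, genre, rating, n_fake):
    """Append n_fake fake ratings of one target item (hypothetical helper)."""
    fake = pd.DataFrame({
        "users": [f"user_{i}" for i in range(1, n_fake + 1)],
        "title": title,
        "genre": genre,
        "rating": rating,
    })
    return pd.concat([df, fake], ignore_index=True)

# Push attack: every fake profile gives the target the maximum rating
pushed = inject_profiles(ratings, "Jawan", "Action | Romance", 5, n_fake=7)
# Nuke attack: every fake profile gives the target the minimum rating
nuked = inject_profiles(ratings, "Jawan", "Action | Romance", 0, n_fake=7)

print(pushed[pushed["title"] == "Jawan"]["rating"].mean())
print(nuked[nuked["title"] == "Jawan"]["rating"].mean())
```

The two attacks differ only in the rating value that is injected, which is why a single helper covers both.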
Bandwagon Attack
The basic idea of the bandwagon attack is to leverage the fact that a small number of
items are very popular in terms of the number of ratings they receive. For example, a
blockbuster movie or a widely used textbook might receive many ratings. Therefore, if these
items are always rated in the fake user profile, it increases the chance of a fake user profile
being similar to the target user.
Program:
#Bandwagon Attack
import csv
for i in range(1,8):
user = "user_"+str(i)
List1 = [user] + List
writer_obj.writerow(List1)
print(List1)
[Link]()
[Link]()
Output:
Highest rated Movie : Mark Antony
['user_7', 'Mark Antony', 'Comedy', 5]
['user_0', 'Leo', 'Action | Comedy | Romance', '4']
['user_1', 'Mark Antony', 'Comedy', '5']
['user_2', 'Por Thozhil', 'Action | Detective', '4']
['user_3', 'PS2', 'Action | Romance', '4']
['user_0', 'Dada', 'Romance', '4']
['user_3', 'Thalaivi', 'Biography', '5']
['user_0', 'PS2', 'Action | Romance', '3']
['user_4', 'Dada', 'Romance', '5']
['user_5', 'Mark Antony', 'Comedy', '5']
['user_6', 'Mark Antony', 'Comedy', '5']
['user_7', 'Mark Antony', 'Comedy', '5']
['user_0', 'Mark Antony', 'Comedy', '5']
['user_7', 'Leo', 'Action | Comedy | Romance', '3']
['user_1', 'Mark Antony', 'Comedy', 5]
['user_2', 'Mark Antony', 'Comedy', 5]
['user_3', 'Mark Antony', 'Comedy', 5]
['user_4', 'Mark Antony', 'Comedy', 5]
['user_5', 'Mark Antony', 'Comedy', 5]
['user_6', 'Mark Antony', 'Comedy', 5]
['user_7', 'Mark Antony', 'Comedy', 5]
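The bandwagon listing is also abbreviated. A possible sketch of the idea is given below: find the most frequently rated item, then create fake profiles that rate it highly alongside the item being promoted. The fake user names, the target title "Jawan", and the choice of "most frequently rated" as the popularity measure are assumptions for this example.

```python
import pandas as pd

# Hypothetical slice of the ratings data
ratings = pd.DataFrame({
    "users": ["user_0", "user_1", "user_5", "user_6", "user_4"],
    "title": ["Leo", "Mark Antony", "Mark Antony", "Mark Antony", "Dada"],
    "rating": [4, 5, 5, 5, 5],
})

# The most popular item, measured here by the number of ratings it received
popular = ratings["title"].value_counts().idxmax()

fake_rows = []
for i in range(1, 4):
    fake_user = f"fake_{i}"
    # Rating the popular item makes the fake profile resemble many real users
    fake_rows.append({"users": fake_user, "title": popular, "rating": 5})
    # The item the adversary actually wants to promote (assumed target)
    fake_rows.append({"users": fake_user, "title": "Jawan", "rating": 5})

attacked = pd.concat([ratings, pd.DataFrame(fake_rows)], ignore_index=True)
print("Most popular item:", popular)
print(attacked.tail(6))
```

Because many genuine users have rated the popular item, the fake profiles overlap with them and are more likely to be picked as neighbours by a similarity-based recommender.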
Result:
Thus the Python program to implement simple attacks on Recommender System was
executed successfully.
ADVANCED EXPERIMENTS
Aim:
To implement a program in Python to build a Movie Recommendation system using
NumPy and Pandas.
Movie Recommendation:
Our recommendation system functions based on the similarities between movies. More
specifically, it will recommend movies to you that other users with similar taste have enjoyed.
To demonstrate this, we'll select two movies from the data set:
Toy Story (1995)
Return of the Jedi (1983)
The first thing we need to do is create matrices that contain the user ratings for each movie in
the data set. These movie matrices will allow you to see how each user rated every movie in
the data set. Let's examine what's stored in the toy_story_user_ratings and
star_wars_user_ratings variables.
A value of NaN is stored if a specific user has not provided a rating for the Toy Story
(1995) movie. The user ID of the user who provided the rating is stored as the index of the
Series. Next, we will use the corrwith method to calculate the correlation between the
toy_story_user_ratings and star_wars_user_ratings data sets. This will allow us to see if the
movies are similar, since their ratings distribution among users will be highly correlated if so!
First, a pandas Series is created using ratings_matrix.corrwith(toy_story_user_ratings)
that shows the correlation of user ratings between the Toy Story (1995) movie and every
other movie in the data set. Next, the specific correlation for Return of the Jedi (1983) is
pulled from the data structure by passing in the name of the movie in square brackets.
Let's try and find a movie that is highly similar to the Return of the Jedi (1983) movie.
To do this, let's build a pandas DataFrame that stores the correlation of every movie's user
ratings with the Return of the Jedi (1983) user ratings.
The first line of code creates a pandas DataFrame with a single column that shows
the correlation of every movie's user ratings with the user ratings of Return of the
Jedi (1983)
The dropna method removes null values from the DataFrame
The sort_values method combined with the arguments 0 and ascending = False
modifies the DataFrame so the most similar movies are shown at the top
The head(15) method shows only the 15 entries at the top of the DataFrame.
Let's filter out movies that have fewer than 50 ratings to improve the basic recommendation
system that we have built in this tutorial so far. To start this process, we'll want to add the
number of ratings from each movie to our ratings_matrix data structure.
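The corrwith step described above can be tried on a tiny, made-up ratings matrix before running it on the full data set. The movie names and ratings below are hypothetical; only the pandas calls (`corrwith`, `drop`, `sort_values`) are the ones the text refers to.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie ratings matrix; NaN means "not rated"
ratings_matrix = pd.DataFrame(
    {
        "Movie A": [5, 4, 1, 2, np.nan],
        "Movie B": [4, 5, 2, 1, 3],
        "Movie C": [1, 2, 5, 4, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

# Correlate every movie's ratings column with Movie A's column;
# users who did not rate both movies are ignored pairwise
corr_with_a = ratings_matrix.corrwith(ratings_matrix["Movie A"])

# Most similar movies first, excluding Movie A itself
ranked = corr_with_a.drop("Movie A").sort_values(ascending=False)
print(ranked)
```

Movie B is rated much like Movie A by the same users and gets a high positive correlation, while Movie C is rated in the opposite way and gets a correlation of -1.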
Program:
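import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#Create a DataFrame and add the number of ratings to it using a count method
ratings_data = pd.DataFrame(merged_data.groupby('title')['rating'].mean())
ratings_data['# of ratings'] = merged_data.groupby('title')['rating'].count()
#Create the ratings matrix and get user ratings for `Return of the Jedi (1983)` and `Toy Story (1995)`
ratings_matrix = merged_data.pivot_table(index='user_id',columns='title',values='rating')
star_wars_user_ratings = ratings_matrix['Return of the Jedi (1983)']
toy_story_user_ratings = ratings_matrix['Toy Story (1995)']
ratings_matrix.corrwith(toy_story_user_ratings)['Return of the Jedi (1983)']
#Correlation of every movie with `Return of the Jedi (1983)`
#(these lines are missing from the listing; they are reconstructed from the description
#in the text and should be treated as an assumption)
correlation_with_star_wars = pd.DataFrame(ratings_matrix.corrwith(star_wars_user_ratings))
correlation_with_star_wars.dropna().sort_values(0, ascending = False).head(15)
correlation_with_star_wars.columns = ['Corr. With SW Ratings']
correlation_with_star_wars = correlation_with_star_wars.join(ratings_data['# of ratings'])
#Get new recommendations from movies that have more than 50 ratings
correlation_with_star_wars[correlation_with_star_wars['# of ratings'] > 50].sort_values('Corr. With SW Ratings', ascending = False).head(10)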
Output:
Result:
Ex No: 9
RESTAURANT RECOMMENDATION SYSTEM
Date:
Aim:
To implement a program in Python to build a Restaurant Recommendation system using
NumPy and Pandas.
Restaurant Recommendation:
Recommendation systems are active information filtering systems that personalize the
information provided to a user based on their interests, the relevance of the information,
etc. They are widely used to recommend movies, restaurants, places to visit, items to buy,
etc. There are two main types of recommendation systems:
1. Content-based filtering
2. Collaborative filtering
Start the task of Restaurant Recommendation System by importing the necessary Python
Libraries. Before that download dataset from
[Link]
ml/input?select=[Link]
In [1]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings('always')
warnings.filterwarnings('ignore')
import re
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
Now, I will load and read the dataset:
In [2]:
zomato_real=pd.read_csv("../input/zomato-bangalore-dataset/[Link]")
zomato_real.head() # prints the first 5 rows of the dataset
Out[2]:
Now the next step is data cleaning and feature engineering. For this step we need to do a
lot of work on the data, such as:
1. Deleting Unnecessary Columns
2. Removing the Duplicates
3. Remove the NaN values from the dataset
4. Changing the column names
5. Data Transformations
6. Data Cleaning
7. Adjust the column names
Now, let’s perform all the above steps on our data:
In [3]:
#Deleting Unnecessary Columns
#Dropping the columns "dish_liked", "phone", "url" and saving the new dataset as "zomato"
zomato=zomato_real.drop(['url','dish_liked','phone'],axis=1)
#Some Transformations
zomato['cost'] = zomato['cost'].astype(str) #Changing the cost to string
zomato['cost'] = zomato['cost'].apply(lambda x: x.replace(',','.')) #Using lambda function to replace ',' in cost
zomato['cost'] = zomato['cost'].astype(float)
#Removing '/5' from Rates
zomato = zomato.loc[zomato.rate !='NEW']
zomato = zomato.loc[zomato.rate !='-'].reset_index(drop=True)
remove_slash = lambda x: x.replace('/5', '') if isinstance(x, str) else x
zomato.rate = zomato.rate.apply(remove_slash).str.strip().astype('float')
#Computing the mean rating of each restaurant
for i in range(len(restaurants)):
    zomato.loc[zomato['name'] == restaurants[i], 'Mean Rating'] = zomato['rate'][zomato['name'] == restaurants[i]].mean()
## Removal of Punctuations
import string
PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))
## Removal of Stopwords
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
zomato["reviews_list"] = zomato["reviews_list"].apply(lambda text: remove_stopwords(text))
## Removal of URLS
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)
zomato[['reviews_list', 'cuisines']].sample(5)
Out[4]:
       reviews_list                                         cuisines
19691  rated 40 ratedn hi allnni visited place friend...    South Indian, North Indian, Chinese, Street Food
35018  rated 40 ratedn got friday nightnot crowded go...    Mediterranean, Italian, Asian
22624  rated 10 ratedn bad experience air conditionin...    Mexican, Continental, Italian, Chinese
32489  rated 10 ratedn packed drinage food delivered ...    North Indian, Biryani, Chinese
38093  rated 40 ratedn hello regular adda week visit ...    Cafe
In [5]:
# RESTAURANT NAMES:
restaurant_names = list(zomato['name'].unique())
def get_top_words(column, top_nu_of_words, nu_of_word):
    vec = CountVectorizer(ngram_range= nu_of_word, stop_words='english')
    bag_of_words = vec.fit_transform(column)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:top_nu_of_words]
In [6]:
df_percent.set_index('name', inplace=True)
indices = pd.Series(df_percent.index)
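The step that produces cosine_similarities does not appear in this listing. Given the TfidfVectorizer and linear_kernel imports at the top of the exercise, it plausibly looks like the sketch below; the three review strings and the variable names `reviews` and `most_similar_to_first` are hypothetical, and the vectorizer settings are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical cleaned review text, one string per restaurant
reviews = [
    "crispy dosa and filter coffee",
    "great dosa and strong filter coffee",
    "wood fired pizza and pasta",
]

# TF-IDF rows are L2-normalised by default, so the plain dot product
# computed by linear_kernel equals the cosine similarity
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(reviews)
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

# Index of the restaurant most similar to restaurant 0 (position 0 is itself)
most_similar_to_first = int(cosine_similarities[0].argsort()[::-1][1])
print(most_similar_to_first)
```

Using linear_kernel instead of cosine_similarity avoids re-normalising vectors that are already unit length, which saves time on a large review matrix.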
In [7]:
def recommend(name, cosine_similarities = cosine_similarities):
    # Find the restaurants with a similar cosine-sim value and order them from the biggest number
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending=False)
    df_new = pd.DataFrame(columns=['cuisines', 'Mean Rating', 'cost'])
    # Drop the same named restaurants and sort only the top 10 by the highest rating
    df_new = df_new.drop_duplicates(subset=['cuisines','Mean Rating', 'cost'], keep=False)
    df_new = df_new.sort_values(by='Mean Rating', ascending=False).head(10)
    return df_new
recommend('Pai Vihar')
TOP 10 RESTAURANTS LIKE Pai Vihar WITH SIMILAR REVIEWS:
Out[7]:
                            cuisines                                           Mean Rating  cost
Atithi                      North Indian, Chinese, Street Food                 3.63         800.0
Atithi                      North Indian                                       3.63         750.0
Samosa Singh                Street Food, Fast Food, Rolls, Desserts            3.60         200.0
Magix'S Parattha Roll       Fast Food, North Indian, Chinese, Mughlai, Rolls   3.52         400.0
Prasiddhi Food Corner       Fast Food, North Indian, South Indian              3.45         200.0
Shrusti Coffee              Cafe, South Indian                                 3.45         150.0
Shanthi Sagar               South Indian, North Indian, Chinese, Street Fo...  3.44         400.0
Mayura Sagar                Chinese, North Indian, South Indian                3.32         250.0
Vasanth Vihar - Since 1965  South Indian, Street Food                          3.32         150.0
Marwa Restaurant            North Indian, Chinese, Fast Food, BBQ              3.19         600.0
Result:
Thus the Machine Learning project on Restaurant Recommendation system with Python
programming language was executed successfully.
ADDITIONAL EXPERIMENTS
Aim:
To implement a program in Python on data preprocessing using Python, NumPy and
Pandas.
Data Preprocessing
For machine learning algorithms to work, it’s necessary to convert raw data into a
clean data set, which means we must convert the data set to numeric data. We do this by
encoding all the categorical labels as column vectors with binary values. Missing values, or
NaNs (not a number), in the data set are a common problem. You have to either drop the
rows with missing values or fill them in with a mean or interpolated value. Kaggle provides
two data sets: training data and results data. Both data sets must have the same dimensions
for the model to produce accurate results.
Load Data in Pandas
To work on the data, you can either load the CSV in Excel or in Pandas. For the purposes of
this tutorial, we’ll load the CSV data in Pandas.
Program:
import pandas as pd
df = pd.read_csv('C:\\Users\\ADMIN\\Downloads\\[Link]')
df.info()
Output:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
After dropping rows with missing values, we find the data set is reduced to 712 rows
from 891, which means we are wasting data. Machine learning models need data to train and
perform well. So, let’s preserve the data and make use of it as much as we can.
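The trade-off between dropping and filling missing values can be seen on a small frame. The data below is made up, and filling Age with the mean and Fare with the median is just one reasonable choice, not the manual's prescribed method.

```python
import numpy as np
import pandas as pd

# Hypothetical passenger data with gaps in Age and Fare
df_demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 53.10],
    "Sex": ["male", "female", "female", "female"],
})

# Option 1: drop every row that has at least one missing value
dropped = df_demo.dropna()

# Option 2: keep all rows and fill the gaps column by column
filled = df_demo.fillna({
    "Age": df_demo["Age"].mean(),
    "Fare": df_demo["Fare"].median(),
})

print(len(df_demo), len(dropped), len(filled))
```

Dropping loses half of the rows in this toy example, while filling keeps every row at the cost of introducing estimated values.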
Creating Dummy Variables
Instead of wasting our data, let’s convert the Pclass, Sex and Embarked columns to dummy
(indicator) columns in Pandas and drop the original columns after conversion.
Program:
df = pd.read_csv('C:\\Users\\ADMIN\\Downloads\\[Link]')
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Program:
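dummies = []
cols = ['Pclass', 'Sex', 'Embarked']
for col in cols:
    dummies.append(pd.get_dummies(df[col]))
titanic_dummies = pd.concat(dummies, axis=1)
# assumed step (not shown in this listing): attach the dummy columns to df
df = pd.concat((df, titanic_dummies), axis=1)
df.info()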
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
12 1 891 non-null bool
13 2 891 non-null bool
14 3 891 non-null bool
15 female 891 non-null bool
16 male 891 non-null bool
17 C 891 non-null bool
18 Q 891 non-null bool
19 S 891 non-null bool
20 1 891 non-null bool
21 2 891 non-null bool
22 3 891 non-null bool
23 female 891 non-null bool
24 male 891 non-null bool
25 C 891 non-null bool
26 Q 891 non-null bool
27 S 891 non-null bool
dtypes: bool(16), float64(2), int64(5), object(5)
memory usage: 97.6+ KB
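In the output above the dummy columns carry bare names such as 1, 2, 3, C, Q and S, and they appear twice, which makes the frame hard to read. A possible alternative, shown here on made-up rows, is to let get_dummies add a prefix and to drop the source columns in the same step. The `df_demo` frame and the variable name `encoded` are assumptions for this sketch.

```python
import pandas as pd

# Hypothetical rows with the three categorical columns from the exercise
df_demo = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Fare": [7.25, 71.28, 7.92],
})

cols = ["Pclass", "Sex", "Embarked"]

# One indicator column per category, named <column>_<category>
dummies = pd.get_dummies(df_demo[cols], columns=cols, prefix=cols)

# Replace the original categorical columns with their indicator columns
encoded = pd.concat([df_demo.drop(columns=cols), dummies], axis=1)
print(list(encoded.columns))
```

Prefixed names make it clear which original column each indicator came from, and building `encoded` from a fresh frame avoids attaching the same dummy columns twice when a notebook cell is re-run.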
11 3 891 non-null bool
12 female 891 non-null bool
13 male 891 non-null bool
14 C 891 non-null bool
15 Q 891 non-null bool
16 S 891 non-null bool
17 1 891 non-null bool
18 2 891 non-null bool
19 3 891 non-null bool
20 female 891 non-null bool
21 male 891 non-null bool
22 C 891 non-null bool
23 Q 891 non-null bool
24 S 891 non-null bool
dtypes: bool(16), float64(2), int64(4), object(3)
memory usage: 76.7+ KB
20 female 891 non-null bool
21 male 891 non-null bool
22 C 891 non-null bool
23 Q 891 non-null bool
24 S 891 non-null bool
dtypes: bool(16), float64(2), int64(4), object(3)
memory usage: 76.7+ KB
Program:
import numpy as np
X = np.delete(X, 1, axis=1)
Divide the Data Set Into Training Data and Test Data
Now that we’re ready with X and y, let's split the data set: we’ll allocate 70 percent for
training and 30 percent for tests using scikit model_selection.
Program:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
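The effect of test_size=0.3 can be checked on a small synthetic array. The arrays `X_demo` and `y_demo` below are placeholders, not the Titanic features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 samples with 2 features each, and a binary label
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, so repeated runs of the experiment train and test on the same rows.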
Result:
Now you can preprocess data on your own. Go on and try it for yourself to start building
your own models and making predictions.
Ex No: 11 VISUALIZING THE RATINGS IN THE DATA SET
Date:
Aim:
To implement a program in Python to visualize data using Python, NumPy and Pandas.
This tutorial will make use of a number of open-source Python libraries, including
NumPy, pandas, and matplotlib. We'll import these libraries now. To start, open a Jupyter
Notebook in the directory you'd like to work in. Here are the imports that we will start our
Python script with:
Program:
#Data imports
import pandas as pd
import numpy as np
#Visualization imports
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Now that our imports have been executed, we can move on to importing our movie database.
Movie_Id_Titles, [Link], [Link]
Move these files into the directory that you'd like to work in for this tutorial. This needs to be
the same folder that you opened your Jupyter Notebook in earlier. Then, you'll need to import
the data into a pandas DataFrame.
The actual data for our movie database lies within the [Link] file. Here is the command
required to import the data into a DataFrame:
Program:
raw_data = pd.read_csv('[Link]', sep = '\t', names = ['user_id', 'item_id', 'rating', 'timestamp'])
You will notice that this DataFrame has four columns and none of them contains the title
of the movie. This data lies in a separate file that we downloaded previously, named
Movie_Id_Titles. You will need to import this data and merge it with our existing raw_data
DataFrame before proceeding. First, let's import the movie title data. Then let's merge the
two DataFrames together into one DataFrame by merging them on the item_id column.
Program:
movie_titles_data = pd.read_csv('Movie_Id_Titles')
merged_data = pd.merge(raw_data, movie_titles_data, on='item_id')
You can get a sense of what the new DataFrame contains by running merged_data.columns,
which returns:
Output:
Index(['user_id', 'item_id', 'rating', 'timestamp', 'title'], dtype='object')
Exploratory data analysis is the process of learning more about a data set by calculating
aggregate statistics or creating visualizations. Let's dig in to our merged movies data set
before building our recommendation system later in this tutorial.
For every movie in our data set, there are a number of different ratings that are submitted by
the different users of the database. Let's start by calculating the average rating for every
movie in the database with the following command.
Program:
merged_data.groupby('title')['rating'].mean().sort_values(ascending = False)
This will return a pandas Series that orders the movies from the highest average rating to the
lowest average rating. It will look something like this.
Calculating The Movies With The Most Ratings
You can list the movies in order of their number of ratings with the following command.
Program:
merged_data.groupby('title')['rating'].count().sort_values(ascending = False)
Now visualize the distribution of movie ratings in our data set. It will be helpful to store our
ratings in a simpler data structure first. Accordingly, let's quickly create a pandas DataFrame
that contains the average rating and the number of ratings for every movie in the data set.
Let's start the DataFrame with just the average rating by movie with the following statement.
Program:
ratings_data = pd.DataFrame(merged_data.groupby('title')['rating'].mean())
Next let's add another column to this DataFrame that contains the number of ratings for every
movie in the data set.
Program:
ratings_data['# of ratings'] = merged_data.groupby('title')['rating'].count()
We can now use this DataFrame to create some nice visualizations. First, let's visualize the
distribution of number of ratings by movie using seaborn's distplot function.
sns.distplot(ratings_data['# of ratings'])  # sns is assumed to be the seaborn alias; distplot is deprecated in newer seaborn releases
As you can see, most movies have very few ratings. This makes sense, because
very few movies have the mass appeal to receive many ratings from watchers. Let's create a
similar visualization for the actual ratings assigned to the movies.
sns.distplot(ratings_data['rating'])
As you can tell, most movies seem to be distributed around a rating of 3 or so, with peaks
at 1, 2, 4, and 5 - which are presumably movies with only one rating.
Let's create one last visualization that explores the relationship between a movie's average
rating and its number of ratings. The seaborn jointplot is a nice visualization for this. We can
create our jointplot with the following command.
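Program:
sns.jointplot(x = 'rating', y = '# of ratings', data = ratings_data)  # sns is assumed to be the seaborn alias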
There appears to be a positive relationship between the number of ratings and
the average rating. Said differently, movies with high average ratings tend to have more
ratings, and vice versa.
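The visual impression of a positive relationship can also be checked numerically. The following is a minimal sketch on hypothetical toy data (the titles and ratings are invented for illustration). It builds the same two columns, average rating and number of ratings, and computes their Pearson correlation with the pandas `Series.corr` method.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from the course data set.
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
    'rating': [5, 4, 5, 4, 3, 4, 2],
})

# One row per title: average rating and number of ratings.
summary = toy.groupby('title')['rating'].agg(['mean', 'count'])
summary.columns = ['rating', '# of ratings']

# Pearson correlation between average rating and number of ratings.
correlation = summary['rating'].corr(summary['# of ratings'])
print(summary)
print(correlation)
```

A value close to 1 indicates a strong positive linear relationship. On the real data the same two lines can be applied to ratings_data directly.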
Result:
We have now spent some time on exploratory data analysis, which ensures that we have a
good sense of the structure of our data before building our recommendation system.
Collaborative filtering techniques in recommendation systems include user-based and item-based collaborative filtering. User-based collaborative filtering suggests items by finding users with similar rating patterns and using their preferences to predict another user's interests. Item-based collaborative filtering, on the other hand, involves identifying a set of items similar to the target item, then using their known ratings to predict a user's potential rating of a target item based on a weighted average. The fundamental difference is that user-based focuses on similarities between users, while item-based focuses on similarities between items.
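The item-based weighted average can be sketched in a few lines of plain Python. This is an illustrative example under stated assumptions, not a reference implementation: the ratings dictionary is hypothetical, cosine similarity over co-rating users is one common choice of similarity measure, and the helper names (`item_vector`, `cosine`, `predict_item_based`) are invented for this sketch.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}.
ratings = {
    'u1': {'A': 5, 'B': 4, 'C': 1},
    'u2': {'A': 4, 'B': 5, 'C': 2},
    'u3': {'A': 1, 'B': 2, 'C': 5},
    'u4': {'A': 5, 'C': 1},  # u4 has not rated item B
}

def item_vector(item):
    """Ratings an item received, keyed by user."""
    return {user: r[item] for user, r in ratings.items() if item in r}

def cosine(a, b):
    """Cosine similarity over the users who rated both items."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(a[u] ** 2 for u in common))
    norm_b = sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def predict_item_based(user, target):
    """Similarity-weighted average of the user's ratings for other items."""
    target_vec = item_vector(target)
    numerator = denominator = 0.0
    for item, rating in ratings[user].items():
        if item == target:
            continue
        sim = cosine(target_vec, item_vector(item))
        numerator += sim * rating
        denominator += abs(sim)
    return numerator / denominator if denominator else None

sim_ba = cosine(item_vector('B'), item_vector('A'))
sim_bc = cosine(item_vector('B'), item_vector('C'))
prediction = predict_item_based('u4', 'B')
print(sim_ba, sim_bc, prediction)
```

In this toy data item B is rated more like item A than like item C, so u4's high rating of A carries more weight in the prediction. A user-based variant would swap the roles of users and items: compute similarities between users, then average the neighbours' ratings of the target item.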
Content-based filtering recommends items to a user based on the features of the items and on the user's previous interactions with similar items. The approach typically uses supervised machine learning to induce a classifier that separates items the user finds interesting from those the user does not. By modeling both user preferences and item characteristics, the system can generate recommendations based on content similarity.
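The idea of matching item features to a user's history can be illustrated with a small sketch. Note that it swaps the trained classifier described above for a simpler technique: a user profile built as the mean feature vector of liked items, scored against candidates with cosine similarity. The movie names, genre flags, and the `liked` list are all hypothetical.

```python
from math import sqrt

# Hypothetical item features: [action, comedy, romance] genre flags.
item_features = {
    'Movie A': [1, 0, 0],
    'Movie B': [1, 1, 0],
    'Movie C': [0, 0, 1],
    'Movie D': [1, 0, 1],
}
liked = ['Movie A', 'Movie B']  # items the user interacted with positively

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# User profile: mean feature vector of the liked items.
liked_vectors = [item_features[title] for title in liked]
profile = [sum(column) / len(liked) for column in zip(*liked_vectors)]

# Score every item the user has not seen, then rank by similarity.
scores = {
    title: cosine(profile, features)
    for title, features in item_features.items()
    if title not in liked
}
recommended = sorted(scores, key=scores.get, reverse=True)
print(recommended)
```

Because the profile is dominated by the action flag, the unseen item that shares that feature ranks first. A classifier-based system would replace the cosine score with a model trained on the user's labeled items.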