Mohammedia School of Engineering Computer Engineering
Computer Science Department 3rd Year
Artificial Intelligence
Practical Work in Python
Asmae EL KASSIRI – Driss NAMLY - Karim BOUZOUBAA 2022-2023
Table of contents
1 Technical environment
1.1 Python
1.2 Development environment
1.3 Execution of a Python program
1.3.1 The code editor
1.3.2 Add the display package
2 Uninformed search: DFS and BFS
2.1 Problem of the 8-puzzle
2.2 DFS Strategy
2.2.1 Algorithm
2.2.2 Implementation
2.2.3 Output
2.3 BFS Strategy
2.3.1 Algorithm
2.3.2 Implementation
2.3.3 Output
3 Informed search
3.1 BestFirst Strategy
3.1.1 Algorithm
3.1.2 Implementation
3.1.3 Output
3.2 A* Strategy
3.2.1 Algorithm
3.2.2 Implementation
3.2.3 Output
4 Machine Learning: Linear Regression
4.1 Global Objective
4.2 Provided Functions
4.2.1 Display Function
4.2.2 Call code
4.3 Functions to implement
4.3.1 Arithmetic Mean
4.3.2 Covariance
4.3.3 Variance
4.3.4 Calculation of coefficients
4.3.5 Calculation of RMSE
4.3.6 Linear regression
4.4 Outputs
5 Machine Learning: Classification
5.1 Global Objective
5.2 Call Program
5.3 Functions to implement
5.3.1 Load the data
5.3.2 Calculate the average
5.3.3 Calculate the standard deviation
5.3.4 Summarize the dataset
5.3.5 Model the dataset
5.3.6 Calculate the probability distribution of x
5.3.7 Calculate the probability for a row
5.3.8 Make the prediction
5.4 Outputs
6 Machine Learning: Scikit-Learn
6.1 Global Objective
6.1.1 Classification
6.1.2 Regression
6.1.3 Clustering
6.2 Work Environment
6.2.1 Scikit-learn
6.2.2 Pandas
6.2.3 Six and IPython
6.3 Classification
6.3.1 Logistic Regression (Logistic_R.py)
6.3.2 SVM ([Link])
6.3.3 Naive Bayes ([Link])
6.3.4 Decision Tree ([Link])
6.3.5 Logistic Regression with Cross validation (Logistic_R-[Link])
6.4 Linear Regression ([Link])
6.5 Clustering
6.5.1 K-means ([Link])
6.5.2 Mean-shift ([Link])
7 NLP: Natural Language Processing
7.1 Pipeline
7.2 Rule-based sentiment analysis
7.3 Fake-news detection
1 Technical Environment
1.1 Python
Python is a portable, dynamic, extensible, and free language that allows (without imposing) a
modular and object-oriented approach to programming. It has been developed since 1989
by Guido van Rossum and many volunteer contributors. It is distributed under an open-source,
GPL-compatible license (the Python Software Foundation License), is free of charge, runs on all
major platforms (Windows, Linux, macOS, etc.), and was designed to be readable. Accordingly,
comments are introduced by the hash character (#), blocks are delimited by indentation, and the
language has a small number of keywords and built-in object types (int, float, bool, tuple, list, str, ...).
1.2 Development Environment
We need to install Python and a code editor, then add the necessary packages
as the labs progress.
1. Download the latest version of Python from its official site:
[Link], then start the installation.
2. After completing the installation, add Python to your environment variables.
On Windows: in the "Start" menu, search for "advanced system settings".
In the "Advanced" tab, click the "Environment Variables" button,
then edit the "Path" variable to add the Python installation
directory, which may be "C:\Program Files (x86)\Pythonx" for a 32-bit
installation or "C:\Program Files\Pythonx" for a 64-bit installation (where x depends on the
Python version that you have installed).
On Linux: edit the environment variables from your terminal:
export PYTHONPATH=$HOME/test:$PYTHONPATH
export PATH=$HOME/test/bin:$PATH
3. Check that Python has been installed successfully, for example by running 'python --version' in a terminal.
1.3 Executing a Python Program
After writing and saving your program with the ".py" extension, you can
run it with the command 'python program_name.py'.
1.3.1 The code editor
Use a simple text editor such as Notepad or Notepad++.
1.3.2 Add the display package
For the first sessions, it is recommended to install a display package that makes it possible
to visualize the path found. For this we use the Graphviz package, available at:
[Link]
2 Uninformed search: DFS and BFS
2.1 8-puzzle Problem
The 8-puzzle problem is modeled by a matrix of 9 cells (3 rows and 3 columns). Starting
from any initial state, an artificial intelligence algorithm is used to reach
a final state. The algorithm must return the sequence of steps that leads to this solution.
The base program consists of:
The Node class:
o __init__: the constructor, which initializes the problem by defining the initial
state, the final state, an action to take, and the depth limit;
o __repr__: returns a string that represents the current state of the
problem;
The possible_moves function: generates the possible actions for the state passed
as a parameter;
The generate_state function: generates the next state from the previous state
and an action (move), both passed as parameters;
The create_node function: creates a node from a given state, an
action, and a depth, all passed as parameters;
The expand_node function: generates the possible nodes from the node passed
as a parameter by reusing the preceding functions;
The display function: generates a PDF file containing the path
produced by the solving algorithm.
Execution example:
Initial state: [2, 8, 3, 1, 6, 4, 7, 0, 5]
Final state: [1, 2, 3, 8, 0, 4, 7, 6, 5]
Execution tree: the result depends on the algorithm used.
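As a rough illustration of two of the helper functions listed above, the sketch below represents a state as a flat list of 9 integers (0 is the blank cell), as in the execution example. Only the names possible_moves and generate_state come from the lab description; the move names ("up", "down", "left", "right") and the function bodies are assumptions made for this example.

```python
# Hypothetical sketch: a state is a flat list of 9 ints, 0 marks the blank cell.
def possible_moves(state):
    """Return the moves of the blank that stay inside the 3x3 grid."""
    row, col = divmod(state.index(0), 3)
    moves = []
    if row > 0:
        moves.append("up")
    if row < 2:
        moves.append("down")
    if col > 0:
        moves.append("left")
    if col < 2:
        moves.append("right")
    return moves


def generate_state(state, move):
    """Return a new state with the blank swapped in the given direction."""
    offsets = {"up": -3, "down": 3, "left": -1, "right": 1}
    blank = state.index(0)
    target = blank + offsets[move]
    new_state = list(state)  # copy, so the input state is left unchanged
    new_state[blank], new_state[target] = new_state[target], new_state[blank]
    return new_state


initial = [2, 8, 3, 1, 6, 4, 7, 0, 5]
for move in possible_moves(initial):
    print(move, generate_state(initial, move))
```

generate_state assumes it is only called with a move returned by possible_moves for the same state; it does not re-check the grid borders.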
2.2 DFS Strategy
2.2.1 Algorithm
This function implements the DFS (depth-first search) algorithm, which can be written in
recursive or iterative form.
Recursive form:
inDepth_resolution(Current_State, History, Mvts):
    if is_goal(Current_State): stop
    for each Mvt in legal_Mouvement:
        New_State = apply_operator(Current_State, Mvt)
        if legal_state(New_State) and not(member(New_State, History)):
            add_member(New_State, History)
            update(Current_State, Mvt, New_State)
            inDepth_resolution(New_State, History, Mvts)
Form with an explicit list of visited nodes:
DFS(node):
    Visited // History
    if goal is true:
        stop
    else:
        if node not in Visited:
            Visited.add(node)
            if node is goal:
                goal = true
                stop
            for each Mvt in legal_Mouvement:
                New_State = apply_operator(node, Mvt)
                if not(member(New_State, Visited)):
                    update(node, Mvt, New_State)
                    DFS(New_State)
2.2.2 Implementation
Declaration of variables (visited plays the role of History):
def dfs(node):
    # initial and goal are assumed to be defined globally; the attribute
    # names node.state and node.depth are assumed from the Node class
    stack = []
    visited = []
    visited_str = []
    depth_limit = 5
    stack.append(create_node(initial, "283164705", None, 0))
Stopping condition (the stack is empty):
    while len(stack) > 0:
Test whether the node already exists in the history: not(member(New_State, History)),
        node = stack.pop(0)
        if node.state in visited_str:
            continue
        visited.append(node)
        visited_str.append(node.state)
Test whether the current state is final: is_goal(Current_State),
        if node.state == goal:
            return visited
Apply the operator: apply_operator(Current_State, Mvt),
check that the state is legal: legal_state(New_State) and not(member(New_State, History)),
and update the state: update(Current_State, Mvt, New_State):
        if node.depth < depth_limit:
            expanded_nodes = expand_node(node)
            # children go in front of the stack, so they are explored first
            expanded_nodes.extend(stack)
            stack = expanded_nodes
    return None
2.2.3 Output
2.3 BFS Strategy
2.3.1 Algorithm
This function must implement the BFS (breadth-first search) algorithm. This is an
assignment to be submitted.
BFS(G, s):
    let Q be a queue
    Q.enqueue(s)
    mark s as visited
    while (Q is not empty)
        v = Q.dequeue()
        for all neighbors w of v in Graph G
            if w is not visited
                Q.enqueue(w)
                mark w as visited
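The pseudocode above can be sketched on a small generic graph. The graph below, stored as a dictionary of adjacency lists, is invented for illustration; it is not the 8-puzzle and not the expected submission.

```python
from collections import deque


def bfs(graph, start):
    """Return the nodes reachable from start, in breadth-first order."""
    visited = {start}            # "mark s as visited"
    order = []
    queue = deque([start])       # Q.enqueue(s)
    while queue:
        v = queue.popleft()      # v = Q.dequeue()
        order.append(v)
        for w in graph[v]:
            if w not in visited:
                visited.add(w)
                queue.append(w)
    return order


# Hypothetical example graph
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs(graph, "A"))
```

Marking a node as visited when it is enqueued, rather than when it is dequeued, keeps the same node from entering the queue twice.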
2.3.2 Implementation
Work to be submitted
2.3.3 Output
3 Informed search
Informed search uses information about the nodes or the arcs or both.
3.1 BestFirst Strategy
3.1.1 Algorithm
The BestFirst algorithm can be implemented as follows:
1. Create 2 empty lists: OPEN and CLOSED.
2. Start from the initial node (say N) and put it in the 'ordered' OPEN list.
3. Repeat the next steps until the GOAL node is reached:
a. If the OPEN list is empty, then exit the loop returning 'False'.
b. Select the first/top node (say N) in the OPEN list and move it to the
CLOSED list. Also record the information of the parent node.
c. If N is a GOAL node, then move the node to the CLOSED list and exit
the loop returning 'True'. The solution can be found by backtracking the
path.
d. If N is not the GOAL node, expand node N to generate the 'immediate'
next nodes linked to node N and add all those to the OPEN list.
e. Reorder the nodes in the OPEN list in ascending order according to an
evaluation function f(n).
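A minimal sketch of these steps on a generic graph is given below. It assumes that f(n) is simply a heuristic value h(n) stored in a dictionary, and it uses a heap instead of re-sorting the OPEN list at every iteration; the graph and the heuristic values are invented for the example.

```python
import heapq


def best_first_search(graph, h, start, goal):
    """Greedy best-first search; returns a path as a list, or None."""
    open_list = [(h[start], start)]   # OPEN, ordered by f(n) = h(n)
    parent = {start: None}            # parent information for backtracking
    closed = set()                    # CLOSED
    while open_list:
        _, node = heapq.heappop(open_list)
        if node in closed:
            continue
        closed.add(node)
        if node == goal:
            path = []
            while node is not None:   # backtrack through the parents
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in parent:     # not yet generated
                parent[nxt] = node
                heapq.heappush(open_list, (h[nxt], nxt))
    return None


# Hypothetical graph and heuristic values
graph = {"S": ["A", "B"], "A": ["G"], "B": ["G"], "G": []}
h = {"S": 3, "A": 2, "B": 1, "G": 0}
print(best_first_search(graph, h, "S", "G"))
```

Because only h(n) guides the search, the path found is not guaranteed to be the shortest one.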
3.1.2 Implementation
Work to be done
3.1.3 Output
3.2 A* Strategy
3.2.1 Algorithm
The A* algorithm (A Star search) can be implemented as follows:
1 Put node_start in the OPEN list with f(node_start) = h(node_start) (initialization)
2 while the OPEN list is not empty {
3   Take from the OPEN list the node node_current with the lowest
4     f(node_current) = g(node_current) + h(node_current)
5   if node_current is node_goal we have found the solution; break
6   Generate each state node_successor that comes after node_current
7   for each node_successor of node_current {
8     Set successor_current_cost = g(node_current) + w(node_current, node_successor)
9     if node_successor is in the OPEN list {
10      if g(node_successor) ≤ successor_current_cost continue (to line 20)
11    } else if node_successor is in the CLOSED list {
12      if g(node_successor) ≤ successor_current_cost continue (to line 20)
13      Move node_successor from the CLOSED list to the OPEN list
14    } else {
15      Add node_successor to the OPEN list
16      Set h(node_successor) to be the heuristic distance to node_goal
17    }
18    Set g(node_successor) = successor_current_cost
19    Set the parent of node_successor to node_current
20  }
21  Add node_current to the CLOSED list
22 }
23 if(node_current != node_goal) exit with error (the OPEN list is empty)
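The pseudocode above can be sketched in Python as follows. This is a variant, not a line-by-line transcription: the OPEN list is kept in a heap, and outdated heap entries are skipped instead of being updated in place. The weighted graph and the heuristic values are invented for the example, with h chosen so that it never overestimates the remaining cost.

```python
import heapq


def a_star(graph, h, start, goal):
    """graph[u] is a dict {v: w(u, v)}; h[n] estimates the cost from n to goal."""
    g = {start: 0}
    parent = {start: None}
    open_heap = [(h[start], start)]       # entries are (f, node)
    closed = set()
    while open_heap:
        _, current = heapq.heappop(open_heap)
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1], g[goal]
        if current in closed:
            continue                      # outdated heap entry
        closed.add(current)
        for succ, weight in graph[current].items():
            cost = g[current] + weight    # successor_current_cost
            if succ in g and g[succ] <= cost:
                continue                  # the known path is at least as good
            g[succ] = cost
            parent[succ] = current
            closed.discard(succ)          # reopen the node if it was closed
            heapq.heappush(open_heap, (cost + h[succ], succ))
    return None, float("inf")


# Hypothetical weighted graph and heuristic
graph = {"S": {"A": 1, "B": 4}, "A": {"B": 2, "G": 5}, "B": {"G": 1}, "G": {}}
h = {"S": 3, "A": 2, "B": 1, "G": 0}
print(a_star(graph, h, "S", "G"))
```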
3.2.2 Implementation
Work to be done
3.2.3 Output
4 Machine Learning: Linear Regression
4.1 Global Objective
The overall objective of this lab is to implement linear regression. The main linear
regression function reuses a set of functions to be implemented. These functions are
applied to a use case that consists of explaining the revenue of a
company by its advertising expenses over 12 months.
The task is to implement simple linear regression by implementing the following
functions: mean, covariance, variance, regression coefficients, and rmse_metric (for the
calculation of the root mean squared error).
Linear regression is one of the most commonly used statistical methods. One usually distinguishes
simple regression (one explanatory variable) from multiple regression (several explanatory
variables), although the conceptual framework and calculation methods are identical.
The principle of linear regression is to model a quantitative dependent variable Y
through a linear combination of p quantitative explanatory variables X1, X2, …, Xp.
The deterministic model (not taking randomness into account) is written for an observation i:
$y_i = a_1 x_{i1} + a_2 x_{i2} + \dots + a_p x_{ip} + e_i$
where $y_i$ is the observed value of the dependent variable for observation i, $x_{ij}$ is the value
taken by variable j for observation i, and $e_i$ is the model's error.
4.2 Provided Functions
4.2.1 Display Function
This function allows you to generate a graph.
import matplotlib.pyplot as plt

def plot_graph(x, y, predicted):
    # Observed points in red, predicted values in blue
    plt.scatter(x, y, c = 'red')
    plt.plot(x, predicted, marker = 'o', color = 'blue')
    plt.show()
4.2.2 Call code
We declare two series on which we perform linear regression, display the
regression graph, and then calculate the RMSE (Root Mean
Square Error).
# Advertising costs (*1000 DH)
x = [25, 17, 18, 28, 22, 20, 19, 22, 30, 30, 27, 24]
# Revenue (*1000 DH)
y = [280, 250, 255, 292.5, 265, 260, 262.5, 280, 285, 296, 285, 270]
yp = simple_linear_regression(x, y)
plot_graph(x, y, yp)
rmse = rmse_metric(y, yp)
print('RMSE: %.3f' % (rmse))
4.3 Functions to implement
4.3.1 Arithmetic Mean
The arithmetic mean of a series of data is its average value:
the sum of the values divided by the number of values.
def mean(values):
    #insert code
4.3.2 Covariance
The covariance between two random variables is a number that quantifies their
joint deviations from their respective expectations.
# Calculate covariance between x and y
def covariance(x, mean_x, y, mean_y):
    #insert code
4.3.3 Variance
Variance is a measure of the dispersion of the values of a sample or of a probability
distribution. It is the average of the squared deviations from the mean, which is also equal to the
difference between the average of the squares of the values and the square of the average.
# Calculate the variance of a list of numbers
def variance(values, mean):
    #insert code
4.3.4 Calculation of coefficients
This function calculates the coefficients $b_0$ and $b_1$ of the linear equation
$y = b_0 + b_1 x$
with
$b_1 = \mathrm{cov}(x, y) / \mathrm{var}(x)$ and $b_0 = \bar{y} - b_1 \bar{x}$
# Calculate coefficients
def coefficients(x, y):
    #insert code
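One possible way to fill in these functions is sketched below. It assumes the usual least-squares formulas for simple linear regression, b1 = cov(x, y) / var(x) and b0 = mean(y) - b1 * mean(x), and it divides both the covariance and the variance by the number of values n (the choice between n and n - 1 does not change b1, since the factor cancels). The function signatures follow the stubs in this section, except that the second parameter of variance is renamed mean_value so that it does not hide the mean function.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(x, mean_x, y, mean_y):
    # Average product of the deviations of x and y from their means
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / len(x)


def variance(values, mean_value):
    # Average squared deviation from the mean
    return sum((v - mean_value) ** 2 for v in values) / len(values)


def coefficients(x, y):
    mean_x, mean_y = mean(x), mean(y)
    b1 = covariance(x, mean_x, y, mean_y) / variance(x, mean_x)
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Data from the call code of this lab
x = [25, 17, 18, 28, 22, 20, 19, 22, 30, 30, 27, 24]
y = [280, 250, 255, 292.5, 265, 260, 262.5, 280, 285, 296, 285, 270]
b0, b1 = coefficients(x, y)
print('b0 = %.3f, b1 = %.3f' % (b0, b1))
```

A quick sanity check of a least-squares fit is that the residuals y - (b0 + b1 * x) sum to approximately zero.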
4.3.5 Calculation of RMSE
This function calculates the RMSE (Root Mean Square Error).
# Calculate root mean squared error
def rmse_metric(actual, predicted):
    #insert code
4.3.6 Linear regression
This function implements linear regression using the calculated coefficients:
# Simple linear regression algorithm
def simple_linear_regression(x, y):
    #insert code
4.4 Outputs
Example of result:
5 Machine Learning: Classification
5.1 Global Objective
In this lab, we are interested in predicting the class (name) of a flower based on its
characteristics.
The dataset consists of 150 rows, where each row gives the characteristics and the class
(name) of a flower. There are 3 classes of flowers (names), and the characteristics are (in
cm) the sepal length, the sepal width, the petal length, and the petal
width. Class 0 corresponds to the orchid flower, 1 to lavender, and 2 to tulip.
Naive Bayesian classification is a simple type of probabilistic Bayesian classification
based on Bayes' theorem with strong (so-called naive) independence assumptions.
In simple terms, a naive Bayesian classifier assumes that the existence of a feature
for a class is independent of the existence of other features. A fruit can be
considered an apple if it is red, round, and about ten centimeters across. Even
if these characteristics are related in reality, a naive Bayesian classifier will determine that
the fruit is an apple by considering these characteristics of color,
shape, and size independently.
The probabilistic model for a classifier is the conditional model $p(C \mid F_1, \dots, F_n)$, where C
is a dependent class variable whose instances or classes are few in number,
conditioned on several feature variables $F_1, \dots, F_n$.
5.2 Call Program
We provide the path of the file containing the data, perform the training, then
predict the class of a record (row).
# Make a prediction with Naive Bayes on the flower dataset
filename = '[Link]'
dataset = load_csv(filename)
# fit model
model = summarize_by_class(dataset)
# define a new record
row = [5.6,2.8,4.3,1.4]
# predict the label
label = predict(model, row)
print('Data=%s, Predicted: %s' % (row, label))
5.3 Functions to implement
The work consists of implementing Naive Bayes through the following functions:
5.3.1 Load the data
This is the function that loads the data from the CSV file and groups the rows by class:
from csv import reader

def load_csv(filename):
    separated = dict()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue  # skip empty lines
            rowF = [0]*len(row)
            # all columns except the last are numeric features
            for i in range(len(row)-1):
                rowF[i] = float(row[i].strip())
            # the last column is the integer class label
            rowF[-1] = int(row[-1].strip())
            class_value = rowF[-1]
            if (class_value not in separated):
                separated[class_value] = list()
            separated[class_value].append(rowF)
    return separated
5.3.2 Calculate the average
This function implements the arithmetic mean of the numbers provided as parameters:
def mean(numbers):
    #insert code
5.3.3 Calculate the standard deviation
This function computes the standard deviation using the following formula:
def stdev(numbers):
    #insert code
5.3.4 Summarize the dataset
Compute the mean, the standard deviation, and the number of rows for each column of
data:
def summarize_dataset(dataset):
    #insert code
5.3.5 Model the dataset
Organize the records by class and calculate the statistics using the
summarize_dataset function:
def summarize_by_class(dataset):
    #insert code
5.3.6 Calculate the probability distribution of x
Calculate the probabilities:
def calculate_probability(x, mean, stdev):
    #insert code
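A common choice for this function, assumed here because the lab text does not state the distribution, is the Gaussian (normal) probability density: each feature is modeled per class by its mean and standard deviation. The sketch below is one possible implementation under that assumption.

```python
from math import exp, pi, sqrt


def calculate_probability(x, mean, stdev):
    """Gaussian density of x for the given mean and standard deviation."""
    exponent = exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (sqrt(2 * pi) * stdev)


# The density is highest at the mean and symmetric around it
print(calculate_probability(1.0, 1.0, 1.0))
print(calculate_probability(0.0, 1.0, 1.0), calculate_probability(2.0, 1.0, 1.0))
```

The function assumes stdev is strictly positive; a feature that is constant within a class would need separate handling.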
5.3.7 Calculate the probability for a row
Calculate the prediction probabilities of each class for a given row:
def calculate_class_probabilities(summaries, row):
    #insert code
5.3.8 Make the prediction
Predict the class for a row:
def predict(summaries, row):
    #insert code
5.4 Outputs
Example of result:
6 Machine Learning: Scikit-Learn
6.1 Global Objective
In this practical session, we will exploit the potential of Scikit-Learn to perform
classification, linear regression, and clustering.
For any ML process, we divide the dataset into two parts:
The training set: the part that allows the algorithm to learn.
The test set: the part that allows you to verify the effectiveness of the learning.
6.1.1 Classification
Classification assigns each data item to one of several classes.
6.1.1.1 Confusion matrix
A confusion matrix, or contingency table, is a summary of the prediction results
of a classification problem. Correct and incorrect predictions are counted
and broken down by class, so that the results are compared with the actual values. It shows
in what way the classification model is confused when it makes
predictions.
Calculation of the confusion matrix:
Based on the predictions of the trained model, the matrix indicates
the number of correct and incorrect predictions for each class. Each row of the
table corresponds to a predicted class, and each column corresponds to an actual class.
Each cell records the number of predictions for that (predicted, actual) pair. A
prediction can be a correct positive prediction (true positive),
a correct negative prediction (true negative),
an incorrect positive prediction (false positive), or an
incorrect negative prediction (false negative):
o TP (True Positive): cases where the prediction is positive and the actual value is
indeed positive. Example: the doctor informs you that you are pregnant, and
you are indeed pregnant.
o TN (True Negative): cases where the prediction is negative and the actual value is
indeed negative. Example: the doctor tells you that you are not
pregnant, and you are indeed not pregnant.
o FP (False Positive): cases where the prediction is positive but the actual value is
negative. Example: the doctor informs you that you are pregnant, but you are
not pregnant.
o FN (False Negative): cases where the prediction is negative but the actual value is
positive. Example: the doctor tells you that you are not pregnant, but
you are pregnant.
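The four counts can be computed directly from two lists of labels. The sketch below is a plain Python illustration for a binary problem; the function name confusion_counts and the example labels are invented, and the positive class is assumed to be encoded as 1.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p != positive and a != positive:
            tn += 1
        elif p == positive and a != positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


# Hypothetical labels: 1 = positive, 0 = negative
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
print(confusion_counts(actual, predicted))
```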
6.1.1.2 Performance measures
From the confusion matrix, we can derive a whole set of performance criteria.
Here are some commonly used performance metrics:
Recall, or sensitivity, is the rate of true
positives, that is, the proportion of actual positives that were correctly identified: TP / (TP + FN);
Precision is the proportion of correct predictions among the points
predicted positive: TP / (TP + FP);
The F-measure evaluates a trade-off between recall and precision; it is their
harmonic mean;
Specificity is the rate of true negatives: TN / (TN + FP). It is a
measure complementary to sensitivity.
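These measures follow directly from the four counts of the confusion matrix. The sketch below uses invented counts and assumes that none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Recall, precision, F-measure and specificity from confusion counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    # Harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return {
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
        "specificity": specificity,
    }


# Hypothetical counts
metrics = classification_metrics(tp=8, tn=5, fp=2, fn=4)
for name, value in metrics.items():
    print('%s: %.3f' % (name, value))
```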
6.1.2 Regression
Regression in machine learning involves mathematical methods that
allow scientists to predict a continuous outcome (y) based on the value of one or
several predictor variables (x). Linear regression is probably the most
popular form of regression analysis due to its ease of use for prediction and
forecasting.
To evaluate a regression model:
We can calculate the distance between predicted values and true values. This gives us:
the residual sum of squares (RSS);
the average of this sum (MSE);
the square root of this average (RMSE).
Alternatively, we can calculate the correlation between predicted values and true values:
the relative squared error (RSE);
the coefficient of determination (R2).
6.1.2.1 Performance measures: RSS and MSE
RSS stands for Residual Sum of Squares. For each point $x_i$ of the test set, we compute
the distance between its label $y_i$ and the predicted value $f(x_i)$, then sum
the squares:
$RSS = \sum_{i=1}^{n} (f(x_i) - y_i)^2$
The problem with RSS is that it grows with the amount of data. For this
reason, it is normalized by the number n of points in the test set, which gives
the mean squared error (MSE):
$MSE = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2$
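As a small worked example, RSS is the sum of the squared differences between predictions and labels, and MSE is RSS divided by the number of points. The labels and predictions below are invented.

```python
def rss(actual, predicted):
    """Residual sum of squares."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted))


def mse(actual, predicted):
    """Mean squared error: RSS normalized by the number of points."""
    return rss(actual, predicted) / len(actual)


# Hypothetical labels and predictions
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 8.0, 9.5]
print(rss(actual, predicted), mse(actual, predicted))
```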
6.1.2.2 RMSE and RMSLE
The RMSE, or Root Mean Squared Error, is the square root of the MSE, which brings the error back to the unit of y:
$RMSE = \sqrt{MSE}$
The RMSLE (Root Mean Squared Log Error) is the same quantity computed on the logarithms of the values.
The RMSE does not behave well when the labels can take values
that span several orders of magnitude. Imagine making an error of 100 units on a
label worth 4; the corresponding squared term in the RMSE is worth $100^2 = 10000$. This is exactly
the same as for an error of 100 units on a label worth 8000. To
take this into account, we can apply the logarithm to the predicted values and the true values before
calculating the RMSE:
$RMSLE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\log(f(x_i) + 1) - \log(y_i + 1))^2}$
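The example from the text can be checked numerically. The sketch below assumes the usual definition of the RMSLE, in which 1 is added to each value before taking the natural logarithm, so values must be greater than -1.

```python
from math import log, sqrt


def rmse(actual, predicted):
    n = len(actual)
    return sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)


def rmsle(actual, predicted):
    # Same computation as rmse, applied to log(value + 1)
    n = len(actual)
    return sqrt(sum((log(p + 1) - log(a + 1)) ** 2
                    for a, p in zip(actual, predicted)) / n)


# Same absolute error of 100 on a small label and on a large label
print(rmse([4], [104]), rmsle([4], [104]))
print(rmse([8000], [8100]), rmsle([8000], [8100]))
```

The RMSE is identical in both cases, while the RMSLE is much larger for the small label, where the error is large relative to the value.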
6.1.2.3 Performance measures: RSE and R2
The relative squared error (RSE) is the RSS normalized by the sum of the squared distances between
the labels and their mean $\bar{y}$:
$RSE = \frac{\sum_{i=1}^{n} (f(x_i) - y_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
It is the complement to 1 of the coefficient of determination, $R^2 = 1 - RSE$, which is the
square of the Pearson correlation between predicted and true values.
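The relationship between the two measures can be illustrated as follows, with the RSE computed as the RSS divided by the sum of squared deviations of the labels from their mean, and R2 taken as 1 - RSE. The data are invented.

```python
def rse(actual, predicted):
    """Relative squared error: RSS over the total sum of squares."""
    mean_y = sum(actual) / len(actual)
    residual = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    total = sum((a - mean_y) ** 2 for a in actual)
    return residual / total


def r2(actual, predicted):
    """Coefficient of determination, as the complement to 1 of the RSE."""
    return 1 - rse(actual, predicted)


# Hypothetical labels and predictions
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 8.0, 9.5]
print(rse(actual, predicted), r2(actual, predicted))
```

The function assumes the labels are not all equal; otherwise the denominator is zero.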
6.1.3 Clustering
Clustering is a machine learning method that consists of grouping data
points by similarity or distance. It is an unsupervised learning method and
a popular technique for statistical data analysis. For a given set of points,
you can use clustering algorithms to assign these data points
to specific groups.
There are different functions with which we can evaluate the performance of
clustering algorithms.
6.1.3.1 Adjusted Rand Index
The Rand index is a function that calculates a measure of similarity between two clusterings.
For this calculation, the Rand index considers all pairs of samples and counts the pairs
that are assigned to the same or to different clusters in both the predicted and the true clustering.
The raw Rand index score is then adjusted for chance to give the Adjusted Rand Index.
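The pair counting behind the raw (unadjusted) Rand index can be sketched in plain Python. The function below is an illustration only: it does not apply the adjustment for chance, and the label lists are invented.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two clusterings agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:   # both together, or both apart
            agree += 1
        total += 1
    return agree / total


# Identical partitions with renamed clusters, then an unrelated partition
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The first call shows that the index ignores the names of the clusters: only the grouping of the samples matters.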
6.1.3.2 Score based on mutual information
Mutual information is a function that measures the agreement between two assignments. It ignores
permutations. The following versions are available:
– Normalized Mutual Information (NMI): Scikit-learn
provides the sklearn.metrics.normalized_mutual_info_score function.
– Adjusted Mutual Information (AMI): Scikit-learn
provides the sklearn.metrics.adjusted_mutual_info_score function.
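6.1.3.3 Fowlkes-Mallows Score
The Fowlkes-Mallows function measures the similarity of two clusterings of a set of
points. It can be defined as the geometric mean of the pairwise precision and
recall.
Mathematically,
$FMI = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}$
where TP is the number of pairs of points that are in the same cluster in both clusterings, FP the number of pairs that are in the same cluster in the predicted clustering only, and FN the number of pairs that are in the same cluster in the true clustering only.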
6.1.3.4 Silhouette Coefficient
The Silhouette function calculates the average silhouette coefficient of all samples,
using for each sample the mean intra-cluster distance and the mean distance to the
nearest cluster.
Mathematically,
$S = \frac{b - a}{\max(a, b)}$
Here, a is the mean intra-cluster distance, and b is the mean distance to the nearest cluster.
6.1.3.5 Contingency matrix
This matrix gives the cardinality of the intersection for each (true,
predicted) cluster pair. The confusion matrix for classification problems is a square
contingency matrix.
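As an illustration, a contingency matrix can be built by counting how many samples fall in each (true, predicted) pair. The sketch below is a plain Python version with invented labels; rows correspond to true classes and columns to predicted clusters, both in sorted order, which is an assumption of this example.

```python
from collections import Counter


def contingency_matrix(labels_true, labels_pred):
    """Rows: true classes; columns: predicted clusters (both sorted)."""
    true_classes = sorted(set(labels_true))
    pred_classes = sorted(set(labels_pred))
    counts = Counter(zip(labels_true, labels_pred))
    return [[counts[(t, p)] for p in pred_classes] for t in true_classes]


# Hypothetical labels: 2 true classes, 2 predicted clusters
labels_true = ["a", "a", "b", "b", "b"]
labels_pred = [0, 0, 0, 1, 1]
print(contingency_matrix(labels_true, labels_pred))
```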
6.2 Work Environment
To carry out these labs, the following libraries must be installed:
6.2.1 Scikit-learn
Scikit-learn is a free Python library for machine learning. It is developed
by many contributors, notably in academia, by French
higher education and research institutes such as Inria. It is written in Python, with some
essential algorithms written in Cython to optimize performance.
The installation of Scikit-learn involves the following steps:
1. Installing pip:
a. Download [Link] to a folder on your computer.
b. Open the command prompt and navigate to the folder containing the
installer [Link].
c. Run the following command: python [Link]
d. pip is now installed. You can check that pip was installed
correctly by opening the command prompt and entering the following command:
pip -V
2. Installing Scikit-Learn:
pip install -U scikit-learn
python -m pip show scikit-learn # to see which version of scikit-learn is installed and where
python -m pip freeze # to see all packages installed in the active virtualenv
python -c "import sklearn; sklearn.show_versions()"
6.2.2 Pandas
Pandas is a library written in Python for data manipulation and analysis.
In particular, it offers data structures and operations for
manipulating numerical tables and time series.
It allows you to:
Manipulate data tables with labels for the variables (columns) and
for the individuals (rows);
These tables are called DataFrames;
Easily read and write these DataFrames from or to a tab-delimited file;
Easily plot graphs from these DataFrames using matplotlib.
Install Pandas:
pip install pandas
or:
py -m pip install pandas
6.2.3 Six and IPython
IPython is an interactive terminal, or shell, for the Python programming language that
offers features such as introspection, additional syntax, completion,
and a rich history.
Installation: pip install ipython
Six is a Python 2 and 3 compatibility library. It provides utility functions that
bridge the differences between the two Python versions, so that Python code can be
written to run on both.
Installation: pip install six
6.3 Classification
6.3.1 Logistic Regression (Logistic_R.py)
Logistic regression is a binomial regression model. It aims to fit a simple mathematical
model as closely as possible to numerous real observations. In other words, it
associates with a vector of random variables (x1, x2, …, xk) a binomial random variable
generically noted y.
The goal is to execute the program without adding code. The program uses a
dataset available in scikit-learn through the load_digits() function ([Link]
[Link]/stable/modules/generated/[Link].load_digits.html), which loads
8x8 images of handwritten digits. Then, we test the model and display the confusion matrix and the
accuracy.
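The workflow described above could look like the following. This is a hedged sketch and not the content of Logistic_R.py: the split ratio, random_state and max_iter values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Load the handwritten digits dataset (8x8 images flattened to 64 features).
digits = load_digits()

# Assumed split: 75% for training, 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# max_iter is raised so that the solver has room to converge on unscaled pixels.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Test the model, then display the confusion matrix and the accuracy.
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print(cm)
print("Accuracy:", accuracy)
```

The digits dataset has ten classes; scikit-learn's LogisticRegression handles this multiclass case without extra code.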
6.3.2 SVM ([Link])
In short, Support Vector Machines (SVM) are a set of supervised learning
techniques aimed at finding, in a space of dimension N>1,
the hyperplane that best divides a dataset into two. SVMs are linear
separators, that is to say that the boundary separating the classes is a straight line
(a hyperplane in higher dimensions).
The objective is to run the program without adding any code. The program uses a
dataset available in scikit-learn through the function datasets.load_iris() ([Link]
[Link]/stable/modules/generated/[Link].load_iris.html), which represents the problem
of classifying flowers into 3 classes (as seen in the Machine Learning lab:
Classification).
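A minimal sketch of a linear SVM on the iris dataset is given below. It is not the provided program: the linear kernel, the stratified 70/30 split and random_state are assumptions made for the example.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the iris dataset: 150 flowers, 4 features, 3 classes.
iris = datasets.load_iris()

# Assumed split, stratified so that each class is represented in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# A linear kernel keeps the decision boundaries as hyperplanes.
classifier = SVC(kernel="linear")
classifier.fit(X_train, y_train)

svm_accuracy = accuracy_score(y_test, classifier.predict(X_test))
print("Accuracy:", svm_accuracy)
```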
6.3.3 Naive Bayes ([Link])
The Naive Bayes classifier is a popular algorithm in Machine Learning. It is a
supervised classification algorithm. It is particularly useful for text classification
problems. An example of using Naive Bayes is the anti-spam filter.
The goal is to execute the program after adding code. The program uses a dataset
available in scikit-learn through the function datasets.load_breast_cancer() ([Link]
[Link]/stable/modules/generated/[Link].load_breast_cancer.html), which represents
the classification problem of a breast cancer dataset from Wisconsin,
USA. The goal is to implement the split of the dataset into training and test sets, make the prediction,
and calculate the accuracy and the confusion matrix.
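One possible way to write the requested steps is sketched below. GaussianNB is an assumption (the lab does not say which Naive Bayes variant is expected), as are the split ratio and random_state.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the Wisconsin breast cancer dataset (binary classification).
cancer = datasets.load_breast_cancer()

# Step 1: split the dataset into training and test sets (assumed 70/30).
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=0
)

# Step 2: train the classifier and make the prediction.
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_pred = nb_model.predict(X_test)

# Step 3: calculate the accuracy and the confusion matrix.
nb_accuracy = accuracy_score(y_test, nb_pred)
nb_cm = confusion_matrix(y_test, nb_pred)
print("Accuracy:", nb_accuracy)
print(nb_cm)
```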
6.3.4 Decision Tree ([Link])
A decision tree is a visual representation of an algorithm that classifies
data according to different criteria, which we call decisions (or nodes).
The objective is to execute the program without adding code. The program uses the provided
dataset [Link]. The goal is to load and split the data, perform training and testing,
then evaluate the model using the confusion matrix and the classification report, which
displays precision, recall, F1, support, and accuracy.
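To clarify what the classification report contains, the sketch below recomputes its per-class metrics by hand. It is a plain-Python illustration, not scikit-learn's classification_report; the function name per_class_report and the labels are hypothetical.

```python
def per_class_report(y_true, y_pred):
    """Return precision, recall, F1 and support for each class."""
    report = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        # Support is the number of true samples of this class.
        report[label] = {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}
    return report


# Hypothetical true and predicted labels for a two-class problem.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
report = per_class_report(y_true, y_pred)
overall_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(report)
print("Accuracy:", overall_accuracy)
```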
6.3.5 Logistic Regression with Cross validation (Logistic_R-[Link])
The objective is to execute an example of Logistic Regression using K-Fold
cross-validation. Cross-validation helps in the evaluation of machine learning models.
This statistical method helps to compare and select models in applied machine
learning. The dataset is divided into K folds, and the process is repeated
until each fold has been used as the test set.
Let's understand the concept using 5-fold cross-validation (K=5): the method divides
the dataset into five folds. In the first iteration, the model uses the first fold
for testing and the remaining folds for training. In the
second iteration, the second fold is used for testing and the other folds for
training. The same process repeats until each of the
five folds has been used as the test set.
We will therefore execute the program without adding any code. The code reuses the provided dataset
[Link].
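The fold rotation can be illustrated without any library. The sketch below only generates the train/test indices for each iteration; the function name k_fold_indices is hypothetical, and no shuffling is applied, which is a simplifying assumption.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k iterations."""
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 10 samples and K=5: each fold holds 2 samples and is the test set exactly once.
folds = list(k_fold_indices(10, 5))
for iteration, (train, test) in enumerate(folds, start=1):
    print(iteration, "test:", test, "train:", train)
```

In scikit-learn, the same rotation is provided by sklearn.model_selection.KFold, usually combined with cross_val_score to train and evaluate a model on each iteration.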
6.4 Linear Regression ([Link])
The linear regression algorithm is a supervised learning algorithm, that is to say
that the model aims to predict the target variable, or variable to be explained (Y),
using so-called explanatory or predictive variables (X) (see Machine
Learning: Linear Regression).
The goal is to run the program with added code. The program uses a dataset
available in scikit-learn through the function datasets.load_diabetes() ([Link]
[Link]/stable/modules/generated/[Link].load_diabetes.html), which represents the
problem of predicting the progression of diabetes one year later based on 10
features. The objective is to implement the split of the dataset into training and test sets, make the
prediction, and display the result in the form of a graph.
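A possible outline of the split and prediction steps is shown below. It is a sketch under assumptions (80/20 split, random_state, use of all 10 features); the graph is left out to keep the example free of a plotting dependency, but the y_test and y_pred arrays are what a matplotlib scatter plot would display.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the diabetes dataset: 10 features, target = disease progression after one year.
diabetes = datasets.load_diabetes()

# Split the dataset into training and test sets (assumed 80/20).
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=0
)

# Fit the model on the training set and predict on the test set.
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("Coefficients:", regressor.coef_)
print("Mean squared error:", mse)
```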
6.5 Clustering
Clustering is an unsupervised learning method. It is applied to data in which each
line represents an individual (an observation). At the end of the clustering, we
retrieve this data grouped by similarity. Clustering groups individuals/objects into several
families (clusters) based on their characteristics. Thus, individuals
found in the same cluster are similar, while those found in another cluster
are not.
There are two types of clustering:
Hierarchical clustering
Non-hierarchical clustering (partitioning)
The goal is to execute the programs without adding code.
6.5.1 K-means ([Link])
K-means is a non-hierarchical unsupervised clustering algorithm. It
groups the observations of the dataset into K distinct clusters, so that similar
data end up in the same cluster. Furthermore, an observation can be
found in only one cluster at a time (exclusive membership). The same observation
therefore cannot belong to two different clusters.
The goal is to run the program without adding code. The program uses a dataset
available in scikit-learn through the function datasets.make_blobs() ([Link]
[Link]/stable/modules/generated/[Link].make_blobs.html), which generates
isotropic Gaussian blobs for clustering. We load the data, display it
initially, apply K-Means training, test the prediction, and finally move on to
the graphical display of the cluster centers.
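The steps above, without the graphical display, could be sketched as follows. The number of samples, the number of centers, cluster_std and random_state are assumptions chosen for the example, not values taken from the lab program.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate isotropic Gaussian blobs (assumed: 300 points around 4 centers).
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Apply K-Means training and predict the cluster of each observation.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Each observation receives exactly one label (exclusive membership).
centers = kmeans.cluster_centers_
print(centers)
```

The centers array has one row per cluster and is what would be plotted on top of the data points.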
6.5.2 Mean-shift ([Link])
Mean Shift, also known as the mode-seeking algorithm, is based on Kernel
Density Estimation (KDE). It assigns data points to clusters by
shifting the points toward the area of highest density. The highest density of data
points in a region is called the mode of that region. It is widely used in the
field of computer vision and image segmentation.
KDE is a method for estimating the distribution of data points. It works by
placing a kernel on each data point. The kernel, in mathematical terms, is a
weighting function that applies weights to individual data points.
Summing all the individual kernels produces the probability density estimate.
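The idea of summing one kernel per data point can be shown in one dimension. This is a minimal sketch assuming a Gaussian kernel and a fixed bandwidth; the function names, the samples and the bandwidth value are hypothetical.

```python
from math import exp, pi, sqrt


def gaussian_kernel(u):
    """Standard normal density, used here as the weighting function."""
    return exp(-0.5 * u * u) / sqrt(2 * pi)


def kde(x, samples, bandwidth):
    """Density estimate at x: sum of one kernel per sample, normalized."""
    total = sum(gaussian_kernel((x - s) / bandwidth) for s in samples)
    return total / (len(samples) * bandwidth)


# Hypothetical 1-D data with two groups, around 1 and around 5.
samples = [1.0, 1.2, 0.8, 5.0, 5.1]
density_near_group = kde(1.0, samples, bandwidth=0.5)
density_in_gap = kde(3.0, samples, bandwidth=0.5)
print(density_near_group, density_in_gap)
```

The estimate is high near the groups and close to zero between them; Mean Shift moves each point uphill on this surface until it reaches a mode.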
The objective is to run the program without adding code. We reuse the same dataset as in the
previous lab, datasets.make_blobs(). We import the display style and the dataset, create
the dataset by defining the initial centers and generating clusters around them,
display the initial dataset, start the training, then display the
cluster centers and finally the clusters.
7 NLP: Natural Language Processing
7.1 Pipeline
A Machine Learning pipeline is used to help automate ML workflows. It
works by chaining a sequence of data transformations together
into a model that can be tested and evaluated to obtain a result, whether positive or
negative.
Machine learning pipelines consist of several steps to train a
model. Machine learning pipelines are iterative because each step is repeated
to continuously improve the model's accuracy and obtain an efficient algorithm. To
create better machine learning models and make the most of them,
accessible, scalable, and sustainable storage solutions are imperative, paving the way for
on-site object storage.
7.2 Rule-based Sentiment Analysis
The lab consists of executing a program that detects the sentiment of a sentence
(positive/negative).
The pipeline consists of tokenization, normalization, removal of stop words,
stemming, lemmatization, word occurrence calculation, calculation of the number and
percentages of positive and negative words, and a decision on whether the sentence is positive or negative.
The positive and negative words are counted using a dictionary of positive words
and a dictionary of negative words.
The program consists of:
1. Declare a text string to analyze: new_text
2. Transform the text into lowercase and divide it into tokens.
3. Normalize the text: remove non-alphanumeric symbols.
4. Stemming: reduce each word to its stem or root form.
5. Lemmatization: reduce words to a normalized form.
6. Count the positive words and the negative words.
7. Calculate the percentages of positive and negative words.
8. Decide whether the text to be analyzed is positive or negative.
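The steps above can be sketched in plain Python as follows. This is a simplified illustration, not the lab program: the three word sets are tiny hypothetical dictionaries, stop-word removal is included as described in the pipeline, and stemming and lemmatization are left out because they would normally rely on an NLP library.

```python
import re

# Hypothetical miniature dictionaries, for illustration only.
POSITIVE_WORDS = {"good", "great", "happy", "love", "excellent"}
NEGATIVE_WORDS = {"bad", "sad", "hate", "terrible", "poor"}
STOP_WORDS = {"the", "a", "is", "and", "but", "this", "i", "it", "was"}


def analyze_sentiment(text):
    # Lowercase and tokenize on whitespace.
    tokens = text.lower().split()
    # Normalize: strip non-alphanumeric symbols from each token.
    tokens = [re.sub(r"[^a-z0-9]", "", token) for token in tokens]
    # Remove empty tokens and stop words.
    tokens = [token for token in tokens if token and token not in STOP_WORDS]
    # Count positive and negative words.
    n_pos = sum(token in POSITIVE_WORDS for token in tokens)
    n_neg = sum(token in NEGATIVE_WORDS for token in tokens)
    # Percentages among the sentiment-bearing words.
    total = n_pos + n_neg
    pct_pos = 100 * n_pos / total if total else 0.0
    pct_neg = 100 * n_neg / total if total else 0.0
    # Decision (ties are reported as neutral in this sketch).
    if n_pos > n_neg:
        label = "positive"
    elif n_neg > n_pos:
        label = "negative"
    else:
        label = "neutral"
    return {"positive": n_pos, "negative": n_neg,
            "pct_positive": pct_pos, "pct_negative": pct_neg, "label": label}


new_text = "I love this movie, the acting is great but the ending was sad."
result = analyze_sentiment(new_text)
print(result)
```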
7.3 Fake-News detection
This lab consists of detecting whether a news item is fake or true.
The pipeline consists of reading a dataset '[Link]' of fake news and another '[Link]' of real
news. The news is then preprocessed: flag the data, concatenate the dataframes, delete the
date and title, convert to lowercase, remove stopwords, and plot some statistics.
The data is then split into train/test sets and modeled using five models.
Random Forest is already implemented using Scikit-learn, while you need to implement (also
using Scikit-learn) Naive Bayes, Logistic Regression, Decision Tree and SVM. Finally, the
five models are compared to select the most accurate.
The objective is to program Naive Bayes, Logistic Regression, Decision Tree, and SVM.
The program consists of:
1. Read the dataset
2. Declare the Fake and True flags
3. Concatenate the data from the dataset
4. Shuffle the data to reduce variance and ensure that the
models remain general and overfit less
5. Display the data
6. Remove the date and the title from the data
7. Convert to lowercase
8. Remove the punctuation
9. Remove stop words
10. Calculate the number of articles per subject
11. Calculate the number of fake and real articles.
12. Count the frequent words
13. Count the frequent words in fake news and in real news
14. Flatten the confusion matrix
15. Divide the data
16. Code to insert
17. Display the result of the Random Forest
18. Code to insert
19. Display the graph
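The preprocessing part of this list (steps 2 to 9) might be sketched with pandas as shown below. The two small in-memory DataFrames stand in for the real files, and the column names (title, text, subject, date), the target flag name and the stop-word set are assumptions made for illustration.

```python
import string

import pandas as pd

# Hypothetical stand-ins for the fake and true news files.
fake = pd.DataFrame({
    "title": ["Shock", "Cure"],
    "text": ["Aliens LANDED in Paris!", "A miracle cure was found."],
    "subject": ["News", "News"],
    "date": ["2017-01-01", "2017-01-02"],
})
true = pd.DataFrame({
    "title": ["Budget", "Vote"],
    "text": ["The parliament adopted the budget.", "Voters went to the polls on Sunday."],
    "subject": ["politicsNews", "politicsNews"],
    "date": ["2017-01-03", "2017-01-04"],
})

# Declare the fake and true flags, then concatenate and shuffle.
fake["target"] = "fake"
true["target"] = "true"
data = pd.concat([fake, true]).reset_index(drop=True)
data = data.sample(frac=1, random_state=0).reset_index(drop=True)

# Remove the date and the title.
data = data.drop(columns=["date", "title"])

# Convert to lowercase and remove the punctuation.
punctuation_table = str.maketrans("", "", string.punctuation)
data["text"] = data["text"].str.lower().apply(lambda s: s.translate(punctuation_table))

# Remove stop words (tiny assumed list).
STOP_WORDS = {"the", "a", "in", "on", "to", "was"}
data["text"] = data["text"].apply(
    lambda s: " ".join(word for word in s.split() if word not in STOP_WORDS)
)

# Number of fake and real articles.
counts = data["target"].value_counts()
print(data)
print(counts)
```

The cleaned text column would then be vectorized and passed to the classifiers after the train/test split.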