Date: 01.09.2020
What is machine learning?
It is the process of teaching a machine to recognize patterns in images, pictures or other kinds of data.
Combination of data
Categories
Machine learning
Supervised: we have the input as well as the correct output, which is used to correct the prediction
input
output
The difference between predicted and actual output is called error margin
Two types of algorithms (the big main process is divided into smaller processes)
1. Classification – to differentiate/distinguish between things (e.g. the difference between a cat and a dog)
Examples
o K-NN
o Logistic regression
2. Regression – where we have to predict a continuous numeric value (e.g. a salary)
o Simple linear
o Multiple linear
o Polynomial
Unsupervised- we only have input.
We mostly use clustering (grouping) in unsupervised
Algorithms:
1. Clustering(grouping)
Example
o K mean clustering
2. Anomaly detection
To understand the pattern
Reinforcement- reward based learning (deep learning concept).
Algorithms:
1. Monte Carlo
2. Q learning
Pre-processing: Python pandas library
Tools: PyCharm, Jupyter Notebook, IDLE
Python should be recognised in cmd
Python 3.7.4
Installing Python packages
In cmd type: pip install jupyter
After that create a new folder and open cmd from that path
Shift+Enter runs a cell in Jupyter
NumPy
In cmd: pip install numpy
In the notebook: import numpy / pandas / matplotlib
Commands in cmd
1. python (hit Enter)
2. exit()
3. pip --version
4. pip install jupyter
5. pip install numpy
6. pip install pandas
7. pip install matplotlib
Date: 03.09.2020
Numpy
It is a Python library used for mathematical and scientific calculations.
In Jupyter Notebook:
import numpy as np
print(dir(np))
help(np)
For guidance, search Google for:
NumPy documentation for Python 3
Array (collection of same data type)
An array is different from a matrix because the operations performed on them are different.
# 1D array
A_1 = np.array([1,2,3,4,5])   # '()' calls the numpy array method, '[]' defines the array items
print(A_1)
Output
[1 2 3 4 5]   # shape of this array is (5,); A_1.size gives the number of elements, 5
# 2D array
A_2 = np.array([
    [1,2,3],   # 1st row
    [4,5,6],   # 2nd row; shape of this array is (3, 3)
    [7,8,9]    # 3rd row
])
print(A_2)
Output
[[1 2 3]
 [4 5 6]
 [7 8 9]]
# 3D array, shape 3x3x3: 3 blocks, each with 3 rows and 3 columns
A_3 = np.array([
    [[1,2,3],[4,5,6],[7,8,9]],
    [[10,11,12],[13,14,15],[16,17,18]],
    [[19,20,21],[22,23,24],[25,26,27]],
])
print(A_3)
Output:
[[[ 1  2  3]
  [ 4  5  6]
  [ 7  8  9]]

 [[10 11 12]
  [13 14 15]
  [16 17 18]]

 [[19 20 21]
  [22 23 24]
  [25 26 27]]]
Range function
Has three parameters (start, stop, step: increment/decrement)
Gives integer type of data
size = range(10)
print(size)
type(size)
for i in size:
    print(i)
    print(type(i))
arange method (works almost the same as the range function)
It generates array type of data
np.arange(p1, p2, p3)
p1 = starting position
p2 = ending position (not included)
p3 = increment/decrement
arr = np.arange(1,10)
print(arr)
print(type(arr))
Output: [1 2 3 4 5 6 7 8 9]
<class 'numpy.ndarray'>
nd in ndarray means n-dimensional array
ar_1 = np.arange(10,51,2)
print(ar_1)
Output: [10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50]
How to create an array of zeros or ones?
np.zeros((no_of_rows,no_of_cols))
np.ones((no_of_rows,no_of_cols))
zeros = np.zeros((4,3))
zeros
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])
zeros_1 = np.zeros((3,4),dtype=np.int16)
zeros_1
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int16)
np.ones works the same way for an array of ones
Attributes and methods (numpy)
ndim (gives the number of dimensions of the array)
    print(arr_1.ndim)
size (gives the number of elements in the array)
    print(arr_1.size)
itemsize (gives the number of bytes a single element consumes)
    print(arr_1.itemsize)
dtype (gives the data type of the elements)
    print(arr_1.dtype)
shape (gives the number of rows and columns)
    print(arr_1.shape)   # output: (8,) for a 1D array of 8 elements
reshape (converts a 1D array to 2D or 3D, or vice versa)
    temp_arr = arr_1.reshape(2,4)
    print(temp_arr)
    temp_arr.ndim
flatten (converts a 2D or 3D array to 1D)
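A small sketch of reshape and flatten together, since flatten has no example in the notes. The 8-element array built with np.arange is an assumption chosen to match the (8,) shape mentioned above.

```python
import numpy as np

arr_1 = np.arange(8)            # assumed 1D array with 8 elements
temp_arr = arr_1.reshape(2, 4)  # 2 rows, 4 columns
flat = temp_arr.flatten()       # back to a 1D copy

print(arr_1.shape, temp_arr.shape, flat.shape)
```

The total number of elements must stay the same, so an 8-element array can become 2x4 or 4x2 but not 3x3.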
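An array consumes less memory than a list, which is one reason why an array is faster than a list.
import numpy as np
import time
import sys
s = range(1000)
print(sys.getsizeof(5)*len(s))
14000
d = np.arange(1000)
print(d.size*d.itemsize)
4000
(The numbers were recorded in class; the exact values depend on the Python build and platform.)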
Matrix multiplication, matrix addition and matrix division were explained theoretically
Date: 04.09.2020
zip is used to pair up the elements of two lists
An array is faster than a list
Program
import numpy as np
import time
size = 10000
# list
l1 = range(size)
l2 = range(size)
start = time.time()
print(start)
# result = l1+l2 would concatenate two lists
result = [(x,y) for x,y in zip(l1,l2)]   # zip pairs up the elements of the two sequences
print((time.time()-start)*1000)
# array
a1 = np.arange(size)
a2 = np.arange(size)
start = time.time()
result = a1+a2
print((time.time()-start)*1000)
What is line?
A line is a combination of infinitely many points.
Linspace function
linspace is used to get evenly spaced points between two points of a line
Syntax: np.linspace(a,b,c)
a - starting point
b - ending point (included by default)
c - number of points you want
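A short example of the syntax above. The values 0, 1 and 5 are only illustrative.

```python
import numpy as np

# 5 evenly spaced points from 0 to 1, both ends included
points = np.linspace(0, 1, 5)
print(points)   # [0.   0.25 0.5  0.75 1.  ]
```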
to get minimum value from an array
array_name.min()
to get maximum value from an array
array_name.max()
to get sum of array element
array_name.sum()
arr_1 =
 8  9
10 11
12 13
axis=0 works down the rows, so it gives one result per column
axis=1 works across the columns, so it gives one result per row
# used mostly in pre-processing
arr_1 = np.array([
    [8,9],
    [10,11],
    [12,13]
])
arr_1.shape
print(arr_1.sum())
print(arr_1.sum(axis=0))   # gives the sum of each column: [30 33]
print(arr_1.sum(axis=1))   # gives the sum of each row: [17 21 25]
Square Root of array
It gives the element-wise square root
Syntax: np.sqrt(array_name)
Example:
arr_1 = np.array([
    [8,9],
    [10,11],
    [12,13]])
print(np.sqrt(arr_1))
Output:
[[2.82842712 3.        ]
 [3.16227766 3.31662479]
 [3.46410162 3.60555128]]
Addition, subtraction, multiplication and division of array
Program
arr_1 = np.array([[1,2,3],[3,4,5]])
arr_2 = np.array([[1,2,3],[3,4,5]])
print("addition \n",arr_1+arr_2)
print("subtraction \n",arr_1-arr_2)
print("multiplication \n",arr_1*arr_2)
print("division\n",arr_1/arr_2)
output
addition
[[ 2 4 6]
[ 6 8 10]]
subtraction
[[0 0 0]
[0 0 0]]
multiplication
[[ 1 4 9]
[ 9 16 25]]
division
[[1. 1. 1.]
[1. 1. 1.]]
Horizontal stacking and vertical stacking
The arrays must be passed as a tuple
Syntax: np.vstack(tuple)
np.hstack(tuple)
Program
a = np.array([[1,2,3],[3,4,5]])
b = np.array([[1,2,3],[3,4,5]])
# vertical stacking
print(np.vstack((a,b)))
# horizontal stacking
print(np.hstack((a,b)))
Matrix
# convert array to matrix
m_1 = np.matrix(a)
print(type(m_1))
# creating a 3x3 matrix
m_2 = np.matrix("1 2 3; 4 5 6; 7 8 9")
print(m_2)
print([Link](m_2))
print([Link](m_2))
print([Link](m_2))
print([Link](m_2))
PANDAS
Python has 4 types of number system
Decimal
Hexadecimal
Octal
Binary
Dictionary
A key is usually one of two data types, either numeric (int) or string
Dataframe (mostly used in data analysis)
It is a table-like structure which looks like an Excel sheet
Each key works as a column name in the dataframe
Program
# creation of dataframe
import pandas as pd
weather_data = {
    "Day": ['1/1/2020','22/1/2020','3/2/2020','12/4/2020','25/5/2020'],
    "Temp": [31,29,22,35,19],
    "Wind_speed": [7,9,4,5,6],
    "Event": ["sunny","sunny","rain","fog","sunny"]
}
print(weather_data)
# conversion of dictionary to dataframe
df = pd.DataFrame(weather_data)
df
Date: 05.09.2020
Shape: gives the dimensions of the table, that is the number of rows and columns
ROW
Head
Syntax: dataframe_name.head(no_of_rows)
Displays the first rows of the dataframe
By default: 5
Tail
Syntax: dataframe_name.tail(no_of_rows)
Displays the last rows of the dataframe
By default: 5
Slicing:
It is used to create a sublist
Syntax:
list_name[starting_index:ending_index]
Indexing/slicing in dataframe:
dataframe_name[starting_index:ending_index]
COLUMN
dataframe_name.columns
Gives the names of the columns
To get the data of a particular column
dataframe_name["column_name"]
Day = df["Day"].values
print(Day)
.values is used to get the values as an array
Program to get the values of two columns
df[["Day","Temp"]].values
METHODS
Temp_col = df["Temp"].values
Temp_col.max()
Program
df[df["Temp"] > 32]   # query: rows where Temp is above 32
df[df["Temp"] == df["Temp"].max()]   # row(s) with the highest Temp
df.describe()   # works on the numeric columns; gives count, mean, standard deviation and many more things
Using a list of tuples
weather_data = [("12/2/2020",32,8,"rain"),
    ("29/3/2020",22,8,"rain"),
    ("28/5/2020",19,8,"rain"),
    ("22/7/2020",23,8,"rain")]
df = pd.DataFrame(data=weather_data, columns=["day","temp","wind_speed","events"])
Date: 07.09.2020
CSV file
How to import data from a csv file
Copy the csv data path
read_csv()
It is a method that reads a csv file into a dataframe
# csv data
df = pd.read_csv(r"path")
# XLS
df = pd.read_excel(r"path")
Pandas is fast because of the dataframe
Set first column as index
df.set_index("column_name", inplace=True)   # inplace=True makes the operation permanent on df
[Link]
------
df.loc["index_name"]
Gives the data of the index mentioned
READ and WRITE data operations
CSV and Excel
The skiprows attribute is used to skip the upper rows
Syntax
df = pd.read_csv(r"C:\Users\K\Documents\stock_csv_data.csv", skiprows=no_of_rows)
header
Makes the mentioned row the header
It is index-wise (counting starts at 0)
If we put 1 then it skips the 1st row and makes the 2nd row the header
nrows stands for the number of rows to read
na_values replaces the listed placeholder values in the table with NaN
so that it becomes easy to perform operations
Example:
df = pd.read_csv(r"C:\Users\K\Documents\stock_csv_data.csv", na_values=["n.a.","not available"])
df
Write csv
to_csv(): writes the dataframe to a new csv file
Example: df1.to_csv("[Link]",index=False)
Write excel:
df_1.to_excel("[Link]",sheet_name="stock",index=False,startrow=2,startcol=1,header=False)
to_excel(): writes the dataframe to a new Excel file
# write two dataframes to two separate sheets in Excel
Program
import pandas as pd
df_stock = pd.DataFrame({
    "tikers": ["Google","WMT","MSFT"],
    "price": [30,40,10],
    "pe": [20.5,65.10,35.2],
    "eps": [20.5,20.5,56.1]
})
df_stock
df_weather = pd.DataFrame({
    "day": ["monday","tuesday","friday"],
    "temp": [30,40,10],
    "event": ["rain","sunny","rain"],
    "humidity": [20.5,20.5,56.1]
})
df_weather
# openpyxl is the library used to write an Excel file
with pd.ExcelWriter("stock_weather.xlsx") as writer:
    df_stock.to_excel(writer, sheet_name="stock_file")
    df_weather.to_excel(writer, sheet_name="weather_file")
Handle missing values
df.fillna(value)   # used to fill the missing data for the whole dataframe
The value can be anything you want to replace the missing values with
Handling the empty values column-wise
new_df = df.fillna({
    "day": "no date",
    "temperature": 0,
    "windspeed": 0,
    "event": "no events"})
Date:09.09.2020
Handling missing data
Methods are not performed on the index
parse_dates=['day']
Used to change the day/date string values into date format (timestamps)
Example: df=pd.read_csv("weather_data.csv",parse_dates=['day'])
fillna(value)
Fills all the NaN values of the table with the value passed
Fill methods
ffill (forward fill)
Fills a missing value with the previous value (the cell above)
bfill (backward fill)
Fills a missing value with the next value (the cell below)
interpolate
Fills a missing value with a meaningful estimate computed from its neighbours. By default it uses linear interpolation.
dropna
Used to drop the rows which have missing data
thresh
Sets the minimum number of non-NaN values a row needs in order to be kept
For example, with thresh=1 it keeps all the rows which have at least 1 non-NaN value
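The fill and drop options above can be compared side by side on a tiny table. This is only a sketch: the two columns and their values are made up for illustration, and it uses the DataFrame methods ffill(), bfill(), interpolate() and dropna(thresh=...).

```python
import numpy as np
import pandas as pd

# hypothetical weather table with gaps
df = pd.DataFrame({
    "temperature": [32.0, np.nan, np.nan, 28.0],
    "windspeed": [6.0, 9.0, np.nan, 7.0],
})

filled_zero = df.fillna(0)      # every NaN becomes 0
forward = df.ffill()            # copy the value from the row above
backward = df.bfill()           # copy the value from the row below
linear = df.interpolate()       # linear estimate between neighbours
kept = df.dropna(thresh=2)      # keep rows with at least 2 non-NaN values

print(forward)
print(linear)
print(kept)
```

With thresh=2 only the first and last rows survive here, because the two middle rows have fewer than two real values.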
# insert dates: used to add the missing dates into the table
rg = pd.date_range("2017-01-01","2017-01-11")
index = pd.DatetimeIndex(rg)
df = df.reindex(index)
# replace using a regular expression (regex)
df2 = df.replace({
    "temperature": '[a-z]',
    "windspeed": '[a-z]'
}, '', regex=True)
df2
Grouping / clustering
Used when we have data in which a column has repeated values of different types (categories)
Date: 10.09.2020
Concatenation:
pd.concat is used for basic concatenation
# concatenate with keys
df = pd.concat([india_weather,us_weather],keys=["india","us"])
df
# with the index
df = pd.concat([india_weather,us_weather])
df
# ignore the index
df = pd.concat([india_weather,us_weather],ignore_index=True)
df
Merge
merge is used to merge two dataframes
Syntax: pd.merge(dataframe1, dataframe2, on="column_name", how="inner"/"outer"/"left"/"right", indicator=True/False)
indicator is False by default
"on": it merges the dataframes on the basis of the column name mentioned.
Merge has the same concept as joins in DBMS
Inner join: common rows only (intersection)
Outer join: the whole data (union)
Left join: all the left data (with the common things from the right)
Right join: all the right data (with the common things from the left)
indicator flag: adds a _merge column that shows which dataframe each row came from
Ex: df3=pd.merge(df1,df2,on="city",how="outer",indicator=True)
suffixes: added to the overlapping column names so that the values of both dataframes are kept
Ex:
df3=pd.merge(df1,df2,on="city",how="outer",suffixes=("_first","_second"),indicator=True)
df3
Output:
   city       temp_first  humidity_first  temp_second  humidity_second  _merge
0  new york   22.0        55.0            18.0         68.0             both
1  chicago    15.0        85.0            23.0         65.0             both
2  orlando    35.0        76.0            NaN          NaN              left_only
3  baltimore  40.0        68.0            NaN          NaN              left_only
4  san diego  NaN         NaN             35.0         71.0             right_only
MATPLOTLIB
import matplotlib.pyplot as plt
%matplotlib inline   (used in Jupyter only)
To plot: plt.plot(x,y)
plt.title("Weather Graph")
plt.xlabel("Days")
plt.ylabel("Temp")
plt.plot(x,y,color="blue", linewidth=2,linestyle="dotted",marker="*")
Attributes of plot
color
linewidth
linestyle
marker: style of the data points on the plot
alpha: controls opacity
In IDLE, to see the graph you have to write
plt.show()
String format
Refer: [Link]
Date: 11.09.2020
To get the temperature information of three cities in one graph
Program:
plt.title("weather")
plt.xlabel("day")
plt.ylabel("temp")
plt.plot(day,mumbai,"g*-",label="mumbai")
plt.plot(day,delhi,"ro-",label="delhi")
plt.plot(day,pune,"b^-",label="pune")
plt.legend(loc="upper right")   # places the legend box; the default value is "best", which fits it into the empty space
Attributes of legend:
loc, fontsize ("large", "small"), shadow (gives a shadow to the legend box)
plt.show()
Bar Chart:
company=["Reliance","Indian Oil","State Bank Of India","TATA"]
revenue=[82,77,47,65]
plt.bar(company,revenue,color="red")
# if an error comes about integer values
company_position=np.arange(len(company))
plt.bar(company_position,revenue,color="green")
plt.xticks(company_position,company)
plt.show()
# multiple bars, vertical (profit is a second list with one value per company)
plt.bar(company_position-0.2,revenue,width=0.4,label="revenue")
plt.bar(company_position+0.2,profit,width=0.4,label="profit")
plt.legend(fontsize="large")
plt.xticks(company_position,company)
plt.show()
# multiple bars, horizontal
plt.barh(company_position-0.2,revenue,height=0.4,label="revenue")
plt.barh(company_position+0.2,profit,height=0.4,label="profit")
plt.legend(fontsize="large")
plt.yticks(company_position,company)
plt.show()
Histogram
It can also be generated with only one parameter
The x axis carries the variable
The y axis gives the frequency accordingly
people_ages=[12,45,18,8,3,85,75,65,15,95,35,23,44,66,58,62,73,84,92,110]
age_group=[1,10,20,30,40,50,60,70,80,90,100,110]
plt.hist(people_ages,age_group,rwidth=0.8)
[Link](range(0,7))
plt.show()
Pie Chart
exp=[1400,600,300,410,250]
exp_label=["bike","food","phone","internet","others"]
plt.pie(exp,labels=exp_label,shadow=True,autopct="%1.5f%%",explode=[0,0,0,0.4,0])
plt.show()
# plt.axis("equal")   # when you get an oval-shaped pie chart, this makes it circular
To save the chart
plt.savefig("[Link]")
# never write plt.show() before savefig, otherwise the saved file will not contain the graph
Date: 14.09.2020
ALGORITHMS
Simple linear Regression
An algorithm implementation in ML is called a model
When the value to predict is a number, we can use regression
Independent variable: data that can be controlled directly
Dependent variable: data that cannot be controlled directly
Used to find the best fit line with the minimum error margin
x (Y_e: years of experience)   Y (Sal: salary)
1   20
2   40
3   50
4   40
5   50
[Graph: salary (y axis, 10 to 50) plotted against years of experience (x axis, 1 to 5)]
Formula: y = b0 + b1*x, also written as y = c + m*x
b0 (or c): the intercept (constant)
b1 (or m): the slope coefficient
Main formula to find the best fit line: y = b0 + b1*x
b1 = sum((x - x_mean)*(y - y_mean)) / sum((x - x_mean)^2)
b0 = y_mean - b1*x_mean
y: dependent, x: independent
Required things to find the best fit line (x' and y' are the means)
X   Y    x-x'   y-y'   (x-x')^2   (x-x')*(y-y')
1   20   -2     -20    4          40
2   40   -1     0      1          0
3   50   0      10     0          0
4   40   1      0      1          0
5   50   2      10     4          20
Xmean=3, Ymean=40; sums of the last four columns: 0, 0, 10, 60
m = sum((x-x')*(y-y')) / sum((x-x')^2) = 60/10 = 6
To find the constant
y = c + m*x
40 = c + 6*3 = c + 18
c = 22
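The hand calculation can be checked with a few lines of plain Python. This is a sketch that applies the mean-deviation formulas to the experience/salary data from the table; the variable names are illustrative.

```python
x = [1, 2, 3, 4, 5]          # years of experience
y = [20, 40, 50, 40, 50]     # salary

x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)

# sum of (x - x')*(y - y') and sum of (x - x')^2
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)

m = s_xy / s_xx              # slope
c = y_mean - m * x_mean      # intercept
print(m, c)   # 6.0 22.0
```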
Simple linear regression code in Python
import numpy as np
import matplotlib.pyplot as plt

def coe(x,y):
    global m_x,m_y,c
    n=np.size(x)
    print(n)
    # mean
    m_x=np.mean(x)
    m_y=np.mean(y)
    # calculating cross deviation about x
    ss_xy=np.sum(y*x)-(n*m_y*m_x)   # for (x-x')*(y-y')
    ss_xx=np.sum(x*x)-(n*m_x*m_x)   # for (x-x')^2
    # calculating regression coefficients
    m=ss_xy/ss_xx
    c=m_y - m*m_x
    return (c,m)

def plotting_regeression_line(x,y,b):
    # plotting data points
    global m_x,m_y,c
    plt.scatter(x,y)
    # predicted value
    y_pred=b[0]+(b[1]*x)
    print(y_pred)
    # plotting regression line
    plt.plot(x,y_pred,label="regression line")
    # labels
    plt.xlabel("year of experience")
    plt.ylabel("Salary")
    plt.legend()
    plt.show()

def main():
    x=np.array([1,2,3,4,5])
    y=np.array([20,40,50,40,50])
    # call function coe
    b=coe(x,y)
    print(b)
    plotting_regeression_line(x,y,b)

main()
sklearn is used to make things easy
Code (Jupyter)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
dataframe = pd.read_csv("emp_data.csv")
dataframe
[Link]()
# isnull gives the number of null values in the data
dataframe.isnull().sum()
X = dataframe["Year of Experience"].values
Y = dataframe["salary"].values
# reshaping X and Y because sklearn only takes 2D arrays of values
X = X.reshape(-1,1)
X.shape
Y = Y.reshape(-1,1)
lr = LinearRegression()
# training the machine according to the data
lr.fit(X,Y)   # used to fit the values in the machine
y_pred = lr.predict(X)
y_pred
# plotting
plt.scatter(X,Y,color="red")
plt.plot(X,y_pred,color="blue")
plt.show()
Date: 15.09.2020
Cross validation for simple linear regression
Main formula to find the best fit line:
Y = b0 + b1*x
b0 = (sum(y)*sum(x**2) - sum(x)*sum(xy)) / (n*sum(x**2) - (sum(x))**2) = (200*55 - 15*660) / (5*55 - 15**2) = 22
b1 = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x**2) - (sum(x))**2) = (5*660 - 15*200) / (5*55 - 15**2) = 6
x    y     xy    x**2
1    20    20    1
2    40    80    4
3    50    150   9
4    40    160   16
5    50    250   25
15   200   660   55   (sums)
Y = 22 + 6*5 = 52 (to get the predicted value for x = 5)
Separate x and y using the iloc method
date: 16.09.2020(absent)
Multiple Linear Regression:
X1      X2      Y (output)
21      31      44
22      36      45
23      32      46
24      35      47
28      34      48
27      38      49
24.16   34.33   46.5   (means)
Simple linear:
Y = b0 + b1x
Multiple linear:
Y_pred = b0 + b1x1 + b2x2
b1 = sum[(x1 – x1mean) * (y – ymean)] / sum[(x1 – x1mean)**2]
b2 = sum[(x2 – x2mean) * (y – ymean)] / sum[(x2 – x2mean)**2]
x1_mean = 24.16
x2_mean = 34.33
y_mean = 46.5
With d1 = x1 - x1mean, d2 = x2 - x2mean and dy = y - ymean:
x1      x2      y      d1      d2      dy     d1^2      d2^2      d1*dy   d2*dy
21      31      44     -3.16   -3.33   -2.5   9.9856    11.0889   7.9     8.325
22      36      45     -2.16   1.67    -1.5   4.6656    2.7889    3.24    -2.505
23      32      46     -1.16   -2.33   -0.5   1.3456    5.4289    0.58    1.165
24      35      47     -0.16   0.67    0.5    0.0256    0.4489    -0.08   0.335
28      34      48     3.84    -0.33   1.5    14.7456   0.1089    5.76    -0.495
27      38      49     2.84    3.67    2.5    8.0656    13.4689   7.1     9.175
24.16   34.33   46.5   (means)         sums:  38.8336   33.3334   24.5    16
b1 = sum[(x1 – x1mean) * (y – ymean)] / sum[(x1 – x1mean)**2]
b1 = 24.5 / 38.8336
b1 = 0.63089
b2 = sum[(x2 – x2mean) * (y – ymean)] / sum[(x2 – x2mean)**2]
b2 = 16 / 33.3334
b2 = 0.48
y_mean = b1*x1_mean + b2*x2_mean + b0
b0 = y_mean – b1*x1_mean – b2*x2_mean
b0 = 46.5 – (0.63089)*(24.16) – (0.48)*(34.33)
b0 = 46.5 – 15.24 – 16.4784
b0 = 14.78
Check the prediction:
y_pred = b1*x1 + b2*x2 + b0
y_pred = 0.6309*22 + 0.48*36 + 14.78
y_pred = 13.88 + 17.28 + 14.78
y_pred = 45.94
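The same steps can be reproduced in plain Python. This sketch follows the per-feature formulas used in these notes, where each slope is computed separately; a full least-squares fit (for example sklearn's LinearRegression) solves for all coefficients together and will generally give different values when the inputs are correlated. The helper names mean and slope are illustrative. Because the code uses unrounded means, the results differ slightly in the second decimal from the hand calculation.

```python
x1 = [21, 22, 23, 24, 28, 27]
x2 = [31, 36, 32, 35, 34, 38]
y = [44, 45, 46, 47, 48, 49]

def mean(values):
    return sum(values) / len(values)

def slope(x, y):
    # sum((x - x_mean)*(y - y_mean)) / sum((x - x_mean)**2)
    xm, ym = mean(x), mean(y)
    num = sum((a - xm) * (b - ym) for a, b in zip(x, y))
    den = sum((a - xm) ** 2 for a in x)
    return num / den

b1 = slope(x1, y)
b2 = slope(x2, y)
b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)

# check the prediction for the second row (x1=22, x2=36)
y_pred = b0 + b1 * 22 + b2 * 36
print(round(b1, 4), round(b2, 4), round(b0, 2), round(y_pred, 2))
```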
Date: 18.09.2020
Simple                                   Multiple
Single input: x, y                       Multiple inputs: x1, x2, x3, y
Columns: x | y                           Columns: X1 | X2 | X3 | y
reshape(-1,1) to fit column-wise         reshape(1,-1) to fit row-wise
Get dummies
It converts categorical string data into 0/1 integer columns, one column per category
Refer to the get dummies Excel sheet
New York   California   Florida
1          0            0
0          1            0
We can't plot multiple linear regression on a simple 2D graph because there are multiple inputs and one output
score (uses the R squared method)
Date. 21.09.2020
Underfitting model (does not create much of a problem)
The score of the model is very poor (50 or 60)
The model does not fully fit the data
Training gives only 50-60% accuracy
The actual value and the predicted value have a large difference, which means testing is also not accurate
Overfitting model
The model tries to cover every point of the training data
Type of data
Training data: is wholly memorised and close to 100% accurate
Testing data: gives bad predictions
In this case the model is 99% accurate on the training data but not that accurate on the testing data
The underfitting problem is less severe and the overfitting problem is more severe.
Generalized model
It covers almost all the points and gives as low an error as possible
We should get a low error in training as well as in testing
This is the ideal model
R squared method
R2 = 1 - RSS/TSS
Residual = y - y_pred
T = y - y_mean
RSS = sum((y - y_pred)^2)
TSS = sum((y - y_mean)^2)
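As a worked example of the R squared formula, the sketch below reuses the experience/salary data and the line y = 22 + 6x from the simple linear regression section; each difference is squared before it is summed.

```python
x = [1, 2, 3, 4, 5]
y = [20, 40, 50, 40, 50]
y_pred = [22 + 6 * xi for xi in x]     # predictions from the fitted line
y_mean = sum(y) / len(y)

rss = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))   # residual sum of squares
tss = sum((yi - y_mean) ** 2 for yi in y)                # total sum of squares
r2 = 1 - rss / tss
print(rss, tss, round(r2, 2))   # 240 600.0 0.6
```

An R2 of 0.6 means the line explains about 60% of the variation in salary for this small data set.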
Polynomial regression
Simple linear regression (one input, one output):
Y = b0 + b1x
Polynomial regression:
Y = b0 + b1x + b2x^2 + b3x^3
The more the data curves, the more the degree is increased
Like the other regression algorithms, it is used where we have to predict continuous numbers
General form (degree 2), a matrix equation solved for B0, B1 and B2:
[ n         Sum(x)    Sum(x^2) ]   [ B0 ]   [ Sum(y)    ]
[ Sum(x)    Sum(x^2)  Sum(x^3) ] * [ B1 ] = [ Sum(yx)   ]
[ Sum(x^2)  Sum(x^3)  Sum(x^4) ]   [ B2 ]   [ Sum(yx^2) ]
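The general form can be solved directly with numpy. This is a minimal sketch for degree 2; the data is hypothetical and chosen to lie exactly on y = 1 + 2x + 3x^2 so the recovered coefficients are easy to check.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2 * x + 3 * x ** 2        # hypothetical data on an exact quadratic

n = len(x)
# left-hand matrix of sums, as in the general form
A = np.array([
    [n,              x.sum(),        (x ** 2).sum()],
    [x.sum(),        (x ** 2).sum(), (x ** 3).sum()],
    [(x ** 2).sum(), (x ** 3).sum(), (x ** 4).sum()],
])
# right-hand side: Sum(y), Sum(yx), Sum(yx^2)
rhs = np.array([y.sum(), (y * x).sum(), (y * x ** 2).sum()])

b0, b1, b2 = np.linalg.solve(A, rhs)
print(b0, b1, b2)
```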
Date: 22.09.2020
Date: 23.09.2020
Regression:- regression means predict the continuous value.
Classification:- categorising the data , it means it predicts output from multiple classes
(Categorical data) Animal data example
Categorical data
Class1 is cat, class2 is dog
Cat
Dog
Dog
Cat
Cat
Cat
Dog
Dog
Logistic Regression:
(It is a classification algorithm); it is similar to linear regression.
For binary classification, logistic regression is a commonly used algorithm
Sigmoid curve
The output lies between 0 and 1 and is then assigned to class 0 or 1
It works on probability
Formula of the sigmoid curve, simple (one input):
y = e^(b0+b1x) / (1 + e^(b0+b1x))
or
y = 1 / (1 + e^-(b0+b1x))
Multiple inputs:
y = e^(b0+b1x1+b2x2+...+bnxn) / (1 + e^(b0+b1x1+b2x2+...+bnxn))
or
y = 1 / (1 + e^-(b0+b1x1+b2x2+...+bnxn))
Example:
x      y     x-x_mean   y-y_mean   (x-x_mean)**2   (x-x_mean)*(y-y_mean)
21     1     -1.5       0.5        2.25            -0.75
22     0     -0.5       -0.5       0.25            0.25
23     0     0.5        -0.5       0.25            -0.25
24     1     1.5        0.5        2.25            0.75
22.5   0.5   0          0          5               0       (means for x and y, sums for the rest)
b1 = 0/5 = 0
b0 = y_mean - b1*x_mean = 0.5
If we have a new value x=46, predict y
b0 + b1*x = 0.5 + 0*46 = 0.5
y = 1 / (1 + e^-(b0+b1x)) = 1 / (1 + e^-0.5)
y = 0.622459
This is above 0.5, so it goes in the 1's category
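The last step of the example can be checked in plain Python. This sketch plugs b0 = 0.5 and b1 = 0 into the sigmoid formula and applies a 0.5 cut-off, which is the usual default threshold and an assumption here.

```python
import math

def sigmoid(z):
    # 1 / (1 + e^-z)
    return 1 / (1 + math.exp(-z))

b0, b1 = 0.5, 0.0
x_new = 46

p = sigmoid(b0 + b1 * x_new)
label = 1 if p >= 0.5 else 0     # assumed 0.5 threshold
print(round(p, 6), label)   # 0.622459 1
```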
Code of logistic regression with sklearn for a single variable, i.e. single input
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("insurance_data.csv")
df.isnull().sum()
X = df["age"].values
y = df["bought_insuarance"].values
# plotting the actual data
plt.scatter(X, y)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# creating a model
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()
# sklearn expects 2D inputs, so the x arrays are reshaped; the y arrays stay 1D
x_train = x_train.reshape(-1,1)
x_test = x_test.reshape(-1,1)
log_model.fit(x_train, y_train)
y_pred = log_model.predict(x_test)
log_model.score(x_test, y_test)*100
# prediction for an external value
test = np.array([[25]])
log_model.predict(test)
plt.scatter(X, y)
y_pred = log_model.predict(X.reshape(-1,1))
plt.plot(X, y_pred)
Date:24.09.2020
Logistic regression (titanic project)
First analyse the data: see what information and columns are given and remove the unnecessary data.
If there is categorical data, use get_dummies to separate it into columns.
Life cycle of a model, steps:
1. Collecting data
2. Analysing the data: which variables have the most correlation with the label, i.e. which columns to remove and which to keep
3. Data wrangling: cleaning the data, i.e. removing unwanted data and filling or removing null values as needed.
4. Training and testing: we separate the training and testing data and train the model
5. Validation: whether the model gives the right output or not, i.e. checking the accuracy of the model
Confusion matrix:
                          Pred 0 (not survived)   Pred 1 (survived)
Actual 0 (not survived)   74                      8
Actual 1 (survived)       18                      43
Used to check how many predictions are right and how many are wrong
The correct predictions are always seen on the diagonal (74 and 43)
The off-diagonal values (8 and 18) are the errors of the model
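The accuracy can be read straight off the confusion matrix: correct predictions on the diagonal divided by all predictions. A small sketch with the Titanic numbers above:

```python
# rows: actual 0, actual 1; columns: predicted 0, predicted 1
cm = [[74, 8],
      [18, 43]]

correct = cm[0][0] + cm[1][1]              # diagonal
total = sum(sum(row) for row in cm)
accuracy = correct / total
print(correct, total, round(accuracy * 100, 2))
```

Here 117 of 143 passengers are classified correctly, which is roughly 82%.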
Date: 25.09.2020
K Nearest Neighbour
We calculate the distance of all the points from the new data point entered.
Euclidean distance formula:
root((x1-x2)^2+(y1-y2)^2)
male=0, female=1
Name   Age   Gender   Sports     Distance
A      32    0        Football   27.02
M      40    0        Neither    35.01
S      16    1        Cricket    11
Z      34    1        Cricket    29
S      55    0        Neither    50.01
R      40    0        Cricket    35.01
A      20    1        Neither    15
A      15    0        Cricket    10.05
P      55    1        Football   50
A      15    0        Football   10.05
New    5     1        ? = Cricket (2 of the 3 nearest neighbours play cricket)
k=3     # k is usually an odd number, to avoid ties
x1=32   # existing data point (age)
y1=0    # existing data point (gender)
x2=5    # new data point (age)
y2=1    # new data point (gender)
np.sqrt((x1-x2)**2+(y1-y2)**2)
A use case means solving the data manually, without sklearn
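The manual use case can be written out in plain Python. This sketch uses the table above with k=3 and a simple majority vote; the tuple layout (age, gender, sport) is an illustrative choice.

```python
import math
from collections import Counter

# (age, gender, sport) with male=0, female=1
people = [
    (32, 0, "Football"), (40, 0, "Neither"), (16, 1, "Cricket"),
    (34, 1, "Cricket"), (55, 0, "Neither"), (40, 0, "Cricket"),
    (20, 1, "Neither"), (15, 0, "Cricket"), (55, 1, "Football"),
    (15, 0, "Football"),
]
new_point = (5, 1)   # age 5, female
k = 3

def distance(age, gender):
    # Euclidean distance to the new data point
    return math.sqrt((age - new_point[0]) ** 2 + (gender - new_point[1]) ** 2)

# the k nearest neighbours, then a majority vote on their sport
nearest = sorted(people, key=lambda p: distance(p[0], p[1]))[:k]
votes = Counter(sport for _, _, sport in nearest)
prediction = votes.most_common(1)[0][0]
print(nearest, prediction)
```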
Date: 28.09.2020
KNN is mostly used for recommending things, for example on YouTube and in online shopping apps and websites
Date. 30.09.2020
SVM (Support Vector Machine)
A hyperplane is the decision boundary used for classification; in 2D it is simply a straight line
Support vectors are the data points which are nearest to the hyperplane
The opposite data points lie at -d and +d from the hyperplane
The margin is created by using the support vectors
Maximum margin hyperplane
On this basis we choose how our hyperplane will be formed.
How to decide which margin we have to pick
In the below example we choose margin 2 because it has more distance or width
There are two types of SVM
1. Linear support vector machine
We can easily separate the data with a straight line
2. Non-linear support vector machine
Kernel: converts low-dimensional data to high-dimensional data
I.e. if the data is in 2D then it converts it into 3D
There are 4 types of kernel
1. Linear
2. Polynomial
3. RBF - radial basis function (best for vector) (non-linear)
4. Sigmoid
Example: maths sum
How the hyperplane is drawn mathematically
Q. Draw the hyperplane for the data given
(1,1) (2,1) (1,-1) (2,-1) (4,0) (5,1) (5,-1) (6,0)
Step 1: plot the graph and pick the support vectors
s1 = (2, 1, 1)   s2 = (2, -1, 1)   s3 = (4, 0, 1)
The last added 1 is the bias
We will use the linear method and vector separation
α1 s1·s1 + α2 s2·s1 + α3 s3·s1 = -1   (s1 is constant)
α1 s1·s2 + α2 s2·s2 + α3 s3·s2 = -1   (s2 is constant)
α1 s1·s3 + α2 s2·s3 + α3 s3·s3 = 1    (s3 is constant)
Solving the above equations
For vector s1
6α1 + 4α2 + 9α3 = -1
Similarly, for s2
4α1 + 6α2 + 9α3 = -1
For s3
9α1 + 9α2 + 17α3 = 1
α1 = -3.25   α2 = -3.25   α3 = 3.5
w = ∑ αi si
w = α1*s1 + α2*s2 + α3*s3
w = (1, 0, -3)
Equation of the hyperplane
y = w·x + b
w = (1, 0)
b = -3 (offset/bias)
So the hyperplane is the vertical line x = 3
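The three equations can be solved with numpy as a check on the hand calculation. This is a sketch of this specific worked example only, not a general SVM solver: it builds the matrix of dot products between the augmented support vectors and solves for the alphas.

```python
import numpy as np

# support vectors with the bias 1 appended
S = np.array([[2, 1, 1],
              [2, -1, 1],
              [4, 0, 1]], dtype=float)
targets = np.array([-1, -1, 1], dtype=float)   # class of s1, s2, s3

G = S @ S.T                          # matrix of dot products si . sj
alphas = np.linalg.solve(G, targets)
w = alphas @ S                       # w = sum(alpha_i * s_i)
print(G)
print(alphas, w)
```

The first two entries of w give the direction (1, 0) and the last entry is the bias -3, so the boundary is x - 3 = 0.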
Date: 01.10.2020
In sklearn the default kernel is RBF
Benefit
We can do parameter tuning to increase accuracy
Disadvantage of SVM
SVM cannot handle very large-scale data well
because its training time becomes very large
To see all the built-in datasets in sklearn
from sklearn import datasets
dir(datasets)
Homework: cancer dataset
Convert it to a dataframe
Check the 0 and 1 classes
Date: 02.10.2020
Decision tree
Helps to take decisions such as yes or no, profit or loss.
Example of a salesman for a loan company
This is called a decision tree
It can do either classification or regression; it is mostly used in classification
Target: to sell the loan
The rectangular boxes are nodes
Two nodes are very important
1. Starting node: root node
The start of the decision
2. Ending node: leaf node
Where the decision has been taken
Step one: find the target
Step two: find the root node
Old = 50 and above, Mid = 20 to 50, New = below 20
Age Competition Type Profit
Old Yes Software Down
Old No Software Down
Old No Hardware Down
Mid Yes Software Down
Mid Yes Hardware Down
Mid No Hardware Up
Mid No Software Up
New Yes Software Up
New No Hardware Up
New No software Up
Step 1: find the target attribute
Target attribute: profit
Step 2: find the information gain of the target attribute
IG (information gain)
If either of the two values is 0 then the IG is 0, and if both values are the same then the IG is 1
Formula:
IG = -(P/(P+N)) log2(P/(P+N)) - (N/(P+N)) log2(N/(P+N))
P = no. of up, i.e. profit gain
N = no. of down, i.e. profit loss
P=5
N=5
IG=1
Step 3: find the gain of each feature attribute
1. Age:
Information gain (IG) for each of old, mid, new
Age   Down (N)   Up (P)
Old   3          0
Mid   2          2
New   0          3
For old
IG = -(0/(0+3)) log2(0/(0+3)) - (3/(0+3)) log2(3/(0+3))   (the 0*log2(0) term is taken as 0)
IG = 0
Similarly
For mid IG=1
For new IG=0
Find the entropy: the weighted sum of the IGs
E(A) = ∑ ((Pi+Ni)/(P+N)) * IG(Pi,Ni)
E(A) = ((3+0)/10)*0 + ((2+2)/10)*1 + ((0+3)/10)*0 = 0.4
Gain
Gain = IG(target) - E(A) = 1 - 0.4
Gain = 0.6
2. Competition
Information gain (IG) for yes and no
Comp   Down (N)   Up (P)
Yes    3          1
No     2          4
For yes
IG = -(P/(P+N)) log2(P/(P+N)) - (N/(P+N)) log2(N/(P+N))
IG(yes) = 0.81127
IG(no) = 0.91829
E(C) = ((1+3)/10)*0.81127 + ((4+2)/10)*0.91829
E(C) = 0.87548
Gain = 1 - 0.87548 = 0.12452
3. Type
Type       Up   Down
Software   3    3
Hardware   2    2
IG(s)=1
IG(h)=1
E(T) = (6/10)*1 + (4/10)*1 = 1
Gain = 1 - 1 = 0
Now compare the gains
Age=0.6
Competition=0.12
Type=0
As the gain of age is the highest, age is the root node
Decision tree
ID3 algorithm
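The three gains can be recomputed from the table in plain Python. This sketch keeps the naming used in these notes (the two-class formula is called IG here; most textbooks call that quantity entropy). The function names info and gain are illustrative.

```python
import math

# (age, competition, type, profit) rows from the table
rows = [
    ("Old", "Yes", "Software", "Down"), ("Old", "No", "Software", "Down"),
    ("Old", "No", "Hardware", "Down"), ("Mid", "Yes", "Software", "Down"),
    ("Mid", "Yes", "Hardware", "Down"), ("Mid", "No", "Hardware", "Up"),
    ("Mid", "No", "Software", "Up"), ("New", "Yes", "Software", "Up"),
    ("New", "No", "Hardware", "Up"), ("New", "No", "Software", "Up"),
]

def info(p, n):
    # -(P/(P+N))log2(P/(P+N)) - (N/(P+N))log2(N/(P+N)), with 0*log2(0) taken as 0
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:
            frac = count / total
            result -= frac * math.log2(frac)
    return result

def gain(column):
    p = sum(1 for r in rows if r[3] == "Up")
    n = len(rows) - p
    weighted = 0.0
    for value in {r[column] for r in rows}:
        subset = [r for r in rows if r[column] == value]
        up = sum(1 for r in subset if r[3] == "Up")
        weighted += len(subset) / len(rows) * info(up, len(subset) - up)
    return info(p, n) - weighted

gains = {"Age": gain(0), "Competition": gain(1), "Type": gain(2)}
root = max(gains, key=gains.get)
print(gains, root)
```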
Date: 06.10.2020
[Diagram: DATASET -> SVM -> prediction]
Ensemble learning: mobile in ss
Random Forest Algorithm
Uses ensemble learning to choose the final prediction
It uses only decision trees
Like multiple decision trees working together
Original dataset
SR A B C Target
1 Y
2 N
3 Y
4 Y
5 N
Bootstrap dataset: duplicates are allowed
Sr A B C Target
2 N
1 Y
1 Y
3 Y
4 Y
Date: 07.10.2020
Absent
Date:08.10.2020
Naive Bayes algorithm
We find the probability
Used in predicting spam messages
It uses conditional probability.
It is classification algorithm but can also perform regression
It has different variants
In total it has three variants
1. Bernoulli naive Bayes
For binary classification
Best for success and failure, yes/no, true/false
2. Multinomial naive Bayes
For multiple classes
3. Gaussian naive Bayes
For continuous feature values
Mathematical approach
Sample of fruits
Fruits (target)   Yellow (feature)   Sweet (feature)   Long (feature)   Total
Mango             350                450               0                650
Banana            400                300               350              400
Others            50                 100               50               150
Total             800                850               400              1200
Predict: (yellow, sweet, long) = which fruit is this?
Formula to find the probability:
p(A|B) = probability of A when B is true
p(A|B) = p(B|A)*p(A)/p(B)
Step 1: find the probability of mango
Probability for mango:
(yellow, sweet, long) = x (the feature considered, which changes accordingly)
1. Probability of yellow given mango, x=yellow
P(yellow|mango) = p(mango|yellow)*p(yellow)/p(mango)
= (350/800)*(800/1200)/(650/1200)
= (0.4375)*(0.6667)/(0.5417)
= 0.5385
2. x=sweet
p(sweet|mango) = (450/850)*(850/1200)/(650/1200)
= 0.692
3. x=long
p(long|mango) = (0/400)*(400/1200)/(650/1200)
= 0
Product = 0.5385*0.692*0 = 0
Probability for banana:
1. Probability of yellow given banana, x=yellow
p(yellow|banana) = (400/800)*(800/1200)/(400/1200) = 1
2. x=sweet
p(sweet|banana) = (300/850)*(850/1200)/(400/1200) = 0.75
3. x=long
p(long|banana) = (350/400)*(400/1200)/(400/1200) = 0.875
Product: 1*0.75*0.875 = 0.6562
Probability for others:
1. Probability of yellow given others, x=yellow
p(yellow|others) = (50/800)*(800/1200)/(150/1200) = 0.333
2. x=sweet
p(sweet|others) = (100/850)*(850/1200)/(150/1200) = 0.667
3. x=long
p(long|others) = (50/400)*(400/1200)/(150/1200) = 0.333
Product: 0.333*0.667*0.333 = 0.0741
Therefore the predicted fruit is banana
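The fruit example can be reproduced in plain Python. Each Bayes expression in the notes simplifies to count/total for that fruit, so this sketch multiplies those ratios directly. It follows the notes in comparing only the products of the conditional probabilities; a full naive Bayes classifier would also multiply by the prior p(fruit).

```python
counts = {
    "Mango":  {"yellow": 350, "sweet": 450, "long": 0,   "total": 650},
    "Banana": {"yellow": 400, "sweet": 300, "long": 350, "total": 400},
    "Others": {"yellow": 50,  "sweet": 100, "long": 50,  "total": 150},
}
features = ["yellow", "sweet", "long"]

scores = {}
for fruit, row in counts.items():
    score = 1.0
    for feature in features:
        # p(feature | fruit) = count of that feature for the fruit / total of the fruit
        score *= row[feature] / row["total"]
    scores[fruit] = score

prediction = max(scores, key=scores.get)
print(scores, prediction)
```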
Example 2
colour type origin Stolen
Red Sports Domestic Y
Red Sports Domestic N
Red Sports Domestic Y
Yellow Sports Domestic N
Yellow Sports Imported Y
Yellow SUV Imported N
Yellow SUV Imported Y
Yellow SUV Domestic N
Red SUV Imported N
Red sports Imported Y
Red=5, Yellow=5, total=10
Sports=6, SUV=4, total=10
Domestic=5, Imported=5, total=10
Y=5, N=5, total=10
(Red, SUV, Domestic) = is it stolen or not?
Prob(red|yes) = (3/5)(5/10)/(5/10) = 0.6
Prob(red|no) = (2/5)(5/10)/(5/10) = 0.4
Prob(yellow|yes) = (2/5)(5/10)/(5/10) = 0.4
Prob(yellow|no) = (3/5)(5/10)/(5/10) = 0.6
For colour yes: 0.24 no: 0.24
Prob(sports|yes) = (4/6)(6/10)/(5/10) = 0.8
Prob(sports|no) = (2/6)(6/10)/(5/10) = 0.4
Prob(suv|yes) = (1/4)(4/10)/(5/10) = 0.2
Date: 13.10.2020
K mean Clustering
Unsupervised machine learning
Example: Amazon clusters its products into categories such as Home, Electronics, Appliances, Sports and Swim
Example 1:
Data={2,3,4,10,11,12,20,25,30 }
K=2
Form 2 clusters
Step 1: pick any two random values as the starting means
m1=4 (mid/mean value for c1)   m2=12 (mid/mean value for c2)
Step 2: assign each value to the nearest mean. For example, take the value 10 and subtract it from each mean: |10-4|=6 and |12-10|=2, therefore 10 belongs to c2. Always take the positive (absolute) difference.
c1={2,3,4}
c2={10,11,12,20,25,30}
Step 3: find the actual mean of c1 and c2
m1=(2+3+4)/3=3
m2=(10+11+12+20+25+30)/6=18
The same steps are performed again with the newly calculated m1 and m2
c1={2,3,4,10}
c2={11,12,20,25,30}
m1=4.75 m2=19.6
c1={2,3,4,10,11,12}
c2={20,25,30}
m1=7 m2=25
c1={2,3,4,10,11,12}
c2={20,25,30}
These are the final clusters, as the clusters (and so m1 and m2) are the same as before
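The loop in Example 1 can be written in plain Python. This is a minimal sketch for one-dimensional data and K=2 with the same starting means; it stops when the means no longer change.

```python
data = [2, 3, 4, 10, 11, 12, 20, 25, 30]
m1, m2 = 4, 12    # starting means picked in step 1

while True:
    # assign each value to the nearest mean (absolute difference)
    c1 = [v for v in data if abs(v - m1) <= abs(v - m2)]
    c2 = [v for v in data if abs(v - m1) > abs(v - m2)]
    new_m1 = sum(c1) / len(c1)
    new_m2 = sum(c2) / len(c2)
    if (new_m1, new_m2) == (m1, m2):   # means unchanged: clusters are final
        break
    m1, m2 = new_m1, new_m2

print(c1, c2, m1, m2)
```

With this data the means move through (3, 18), (4.75, 19.6) and (7, 25) before settling.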
Euclidean formula
√((XH-H1)**2 + (XW-W1)**2)
H -> height, W -> weight
Sr no   Height   Weight
1 185 72
2 170 56
3 168 60
4 179 68
5 182 72
6 188 77
7 180 71
8 180 70
9 183 84
10 180 88
11 180 67
12 177 76
Form 2 clusters
height weight Centroid value
C1 185 72 (185,72)
C2 170 56 (170,56)
C1={1}
C2={2}
Row 3: (168,60)
Distance to C1:
√((168-185)**2+(60-72)**2)=20.8086
Distance to C2:
√((168-170)**2+(60-56)**2)=4.47
Row 3 is nearer to C2, so it joins C2.
height weight Centroid value
C1 185 72 (185,72)
C2 170 56 (169,58)
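C1={1}
C2={2,3}
Updated centroid of C2: mean of rows 2 and 3 = ((170+168)/2, (56+60)/2) = (169,58)
Row 4: (179,68)
Distance to C1:
√((179-185)**2+(68-72)**2)=7.211
Distance to C2:
√((179-169)**2+(68-58)**2)=14.14
Row 4 is nearer to C1, so it joins C1; the C1 centroid becomes ((185+179)/2, (72+68)/2) = (182,70).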
height weight Centroid value
C1 185 72 (182,70)
C2 170 56 (169,58)
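C1={1,4}
C2={2,3}
Row 5: (182,72)
Distance to C1:
√((182-182)**2+(72-70)**2)=2
Distance to C2:
√((182-169)**2+(72-58)**2)=19.105
Row 5 is nearer to C1, so it joins C1; the C1 centroid becomes ((182+182)/2, (70+72)/2) = (182,71).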
height weight Centroid value
C1 185 72 (182,71)
C2 170 56 (169,58)
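C1={1,4,5}
C2={2,3}
Row 6: (188,77)
Distance to C1:
√((188-182)**2+(77-71)**2)=8.485
Distance to C2:
√((188-169)**2+(77-58)**2)=26.870
Row 6 is nearer to C1, so it joins C1; the C1 centroid becomes ((182+188)/2, (71+77)/2) = (185,74).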
height weight Centroid value
C1 185 72 (185,74)
C2 170 56 (169,58)
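C1={1,4,5,6}
C2={2,3}
Row 7: (180,71)
Distance to C1:
√((180-185)**2+(71-74)**2)=5.831
Distance to C2:
√((180-169)**2+(71-58)**2)=17.029
Row 7 is nearer to C1, so it joins C1; the C1 centroid becomes ((185+180)/2, (74+71)/2) = (182.5,72.5).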
height weight Centroid value
C1 185 72 (182.5,72.5)
C2 170 56 (169,58)
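C1={1,4,5,6,7}
C2={2,3}
Row 8: (180,70)
Distance to C1:
√((180-182.5)**2+(70-72.5)**2)=3.5355
Distance to C2:
√((180-169)**2+(70-58)**2)=16.2788
Row 8 is nearer to C1, so it joins C1; the C1 centroid becomes ((182.5+180)/2, (72.5+70)/2) = (181.25,71.25).
height weight Centroid value
C1 185 72 (181.25,71.25)
C2 170 56 (169,58)
C1={1,4,5,6,7,8}
C2={2,3}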
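The row-by-row procedure above can be sketched in Python as follows. The update rule is an assumption read off the worked numbers: each time a row joins a cluster, the new centroid is taken as the midpoint of the old centroid and the new row. Standard k-means would instead use the mean of all members, which gives slightly different centroids. Only rows 1 to 8 are used, matching the part of the table worked through in these notes.

```python
import math

# Rows 1-8 of the height/weight table: serial no -> (height, weight)
points = {1: (185, 72), 2: (170, 56), 3: (168, 60), 4: (179, 68),
          5: (182, 72), 6: (188, 77), 7: (180, 71), 8: (180, 70)}


def euclidean(p, q):
    """Euclidean distance between two (height, weight) pairs."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)


# Rows 1 and 2 seed the two clusters, as in the walk-through.
centroids = {"C1": points[1], "C2": points[2]}
members = {"C1": [1], "C2": [2]}

for sr in range(3, 9):
    p = points[sr]
    # Pick the cluster whose centroid is nearest to this row.
    nearest = min(centroids, key=lambda c: euclidean(p, centroids[c]))
    members[nearest].append(sr)
    # Assumed update rule: midpoint of the old centroid and the new row.
    old = centroids[nearest]
    centroids[nearest] = ((old[0] + p[0]) / 2, (old[1] + p[1]) / 2)

print(members, centroids)
```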
Date: 14.10.2020
Date: 15.10.2020
NLP : Natural Language Processing
NLP is a part of computer science, machine learning and artificial intelligence which deals with human language. It often uses deep learning.
Tokenization
Types
1. Bigram
2. Trigram
3. N-gram
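A bigram is a run of 2 consecutive tokens, a trigram a run of 3, and an n-gram a run of n. A small pure-Python sketch is shown here; the helper name ngrams and the sample sentence are only for illustration, and splitting on spaces is a simplification of real tokenization.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "she is a nice girl".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```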
Library: NLTK (Natural Language Toolkit)
Helps in text analysis
Applications:
1. Sentiment analysis: it analyses the mood, wording or politeness of a text
2. Voice assistants and chatbots (Alexa, Siri)
NLP has two parts
1. NLU (Natural language understanding)
It maps the input text or sentence to matching entries in the database
2. NLG (Natural language generation)
It answers with a meaningful sentence
Ambiguity is a common cause of errors in NLP
NLU ambiguity
1. Lexical ambiguity
A word that has two meanings
Ex: "She is looking for a match"
Here "match" has two meanings: a sports match or a partner
2. Syntactic ambiguity
A sentence that can be read in two different ways because of its grammar or structure
Ex: "chicken ready to eat" (the chicken is ready to be eaten, or the chicken is ready to eat something)
3. Referential ambiguity
A sentence in which it is unclear what a word refers to
Ex: "this is that and that is this"
Step 1: pip install nltk
Step 2: import nltk in the Python IDLE shell
Step 3: nltk.download()
An NLTK downloader window opens
Date: 16.10.2020
Tokenization
Divides a paragraph into small tokens. Stop words are then removed so that only the important words remain.
Stemming
Reduces words which are similar to each other to a common stem. Stemming is faster, but there is a possibility that the stem it generates does not have any meaning.
Ex: history, historical -> histori
finally, finalization -> final
Lemmatization
It is similar to stemming but it always gives meaningful words. It takes more time to generate words.
Ex: history, historical -> history
Stopwords -> (I, me, your, of, them, for, on, to, ...)
Words which are not useful for the analysis
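Tokenization followed by stop-word removal can be sketched without any library. This is only an illustration: the STOP_WORDS set below is a small hand-picked list (an assumption, not the NLTK list, which is much longer), and the regular expression keeps plain English letters only.

```python
import re

# Small hand-picked stop-word list for illustration only.
STOP_WORDS = {"i", "me", "your", "of", "them", "for", "on", "to",
              "he", "she", "is", "a", "and", "are"}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


tokens = tokenize("He is a nice boy")
filtered = remove_stop_words(tokens)
print(tokens)
print(filtered)
```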
Date: 17.10.2020
Date: 19.10.2020
Bag of Words
It is a technique which tells which words are important, with the help of word frequency
For ex: to tell whether a review is good or not
It is also called binary filtration
YouTube also uses bag of words
Ex.
He is a nice boy        1
She is a nice girl      1
Boy and girl are nice   0
After removing stop words:
1. nice boy
2. nice girl
3. boy girl nice
Sentence   nice   boy   girl   Vector
1          1      1     0      110
2          1      0     1      101
3          1      1     1      111
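The table above can be reproduced with a short Python sketch. It builds a binary vector (1 if the word occurs, 0 if not) for each sentence; the function name bag_of_words is only for illustration, and the vocabulary order nice, boy, girl is taken from the table.

```python
# Sentences after stop-word removal, as in the example.
docs = ["nice boy", "nice girl", "boy girl nice"]
vocabulary = ["nice", "boy", "girl"]


def bag_of_words(doc, vocabulary):
    """Binary bag of words: 1 if the vocabulary word occurs in doc, else 0."""
    words = doc.split()
    return [1 if w in words else 0 for w in vocabulary]


vectors = [bag_of_words(d, vocabulary) for d in docs]
print(vectors)
```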
Date: 20.10.2020
Neural Network:
Layers:
1. Input layer
2. Hidden layer(s)
3. Output layer
Date: 22.10.2020
Forward propagation:
Activation function
It converts the input signal of a node to its output signal
Weights and biases start as random numbers
f -> indicates the activation function
Hidden layer (inputs u1, u2):
v1=f(u1*wa+u2*wx+b1)
v2=f(u1*wb+u2*wy+b2)
v3=f(u1*wc+u2*wz+b3)
Output layer (final activation function O):
O=f(v1*W1+v2*W2+v3*W3+B)
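A minimal Python sketch of these forward-propagation equations is given here, for 2 inputs, 3 hidden nodes and 1 output. The weight and bias values are hypothetical numbers chosen only for illustration, and sigmoid is assumed as the activation function f.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(u1, u2, hidden, output, f=sigmoid):
    """hidden: three (w_u1, w_u2, bias) tuples; output: (W1, W2, W3, B)."""
    # Hidden layer: v = f(u1*w + u2*w' + b) for each of the three nodes.
    v = [f(u1 * w1 + u2 * w2 + b) for (w1, w2, b) in hidden]
    W1, W2, W3, B = output
    # Output layer: O = f(v1*W1 + v2*W2 + v3*W3 + B).
    return f(v[0] * W1 + v[1] * W2 + v[2] * W3 + B)


# Hypothetical weights and biases, for illustration only.
hidden = [(0.1, 0.4, 0.0), (0.2, 0.5, 0.0), (0.3, 0.6, 0.0)]
output = (0.7, 0.8, 0.9, 0.0)
O = forward(1.0, 2.0, hidden, output)
print(O)
```

Passing a different function as f (for example the identity function) shows how the choice of activation changes the output.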
Do neural networks need an activation function?
The answer is both yes and no.
When the data is non-linear or the complexity increases, an activation function is necessary.
Activation function
1. Identity function / Linear activation function
Fills the activation function's place in the node without changing the signal
Equation: f(x)=x for all x
Generates an output signal that is the same as the input signal
Used when the problems are very simple
2. Heaviside activation function / binary step function
Helps in complex decision
For example multiple classes
Used in a single-layer network to convert the net input to an output signal that is 0 or 1
Equation: f(x) = 1 if x >= t
          f(x) = 0 if x < t
t is the threshold value
3. Sigmoid activation function
Used for backward propagation
Equation: f(x) = 1/(1+e^(-x))
4. Hyperbolic tangent activation function
It is also called the tanh activation function
Optimization is easy
Output range: -1 to 1
Equation: f(x) = (1 - e^(-2x)) / (1 + e^(-2x))
5. Ramp activation function / ReLU
Equation: f(x) = x if x > 0
          f(x) = 0 if x <= 0
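The five activation functions listed above (identity, binary step, sigmoid, tanh and ramp/ReLU) can be written in Python as a short sketch. The function names and the default threshold t=0 are choices made for this illustration.

```python
import math


def identity(x):
    return x


def binary_step(x, t=0.0):
    """Heaviside step: 1 when x reaches the threshold t, otherwise 0."""
    return 1 if x >= t else 0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """tanh written with e^(-2x); the output lies between -1 and 1."""
    return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))


def relu(x):
    """Ramp / ReLU: passes positive inputs through, returns 0 otherwise."""
    return x if x > 0 else 0


for x in (-2.0, 0.0, 2.0):
    print(x, identity(x), binary_step(x), sigmoid(x), tanh(x), relu(x))
```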
Length H(X1)   Width W(X2)   Colour of flower
2              1.5           R(1)
3              1             B(0)
4              1.5           R(1)
2              0.5           ?
Code in python
Cost function
Difference between the actual data point and the calculated (predicted) data point
1. SE (squared error)
2. MSE (mean squared error)
3. RMSE (root mean squared error)
Date: 23-10-2020
--------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------
To import and merge many CSV files (e.g. 1 lakh files):
import os
import glob  # for listing the files in a directory
import pandas as pd
os.chdir(r"path of file where all the csv files are saved")
file_extension = ".csv"
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
read_all_data = []
for name in all_filenames:
    df = pd.read_csv(name)  # read one csv file
    read_all_data.append(df)
combine_file = pd.concat(read_all_data)  # stack all files into one table
os.chdir(r"path of file for output dir")
combine_file.to_csv("combine_file.csv")
--------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------
1. SE (squared error)
Formula:
SE = sum((ŷi - yi)^2), where ŷi is the predicted value and yi is the actual value
2. MSE (mean squared error)
Formula:
MSE = sum((ŷi - yi)^2) / n
3. RMSE (root mean squared error)
Formula:
RMSE = sqrt(sum((ŷi - yi)^2) / n)
The cost function does not give a value in percentage; it gives a continuous value
How to reduce error using the cost function:
Actual value: 2
Predicted value: 1.2
Cost = actual value - predicted value = 2 - 1.2 = 0.8
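The three cost functions can be sketched in Python as follows. The data is hypothetical: the first pair is the 2 versus 1.2 example above, and the other two pairs are made up for illustration.

```python
import math


def squared_error(actual, predicted):
    """SE: sum of squared differences between predicted and actual values."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted))


def mse(actual, predicted):
    """MSE: squared error divided by the number of data points n."""
    return squared_error(actual, predicted) / len(actual)


def rmse(actual, predicted):
    """RMSE: square root of the mean squared error."""
    return math.sqrt(mse(actual, predicted))


# Hypothetical data; the first pair is the 2 vs 1.2 example.
actual = [2.0, 3.0, 5.0]
predicted = [1.2, 3.5, 4.0]
print(squared_error(actual, predicted), mse(actual, predicted), rmse(actual, predicted))
```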
Date: 24.10.2020
Date: 26.10.2020-last day
Name