Machine Learning and Python Basics

Q: What are the steps involved in a data science model's lifecycle, and how do each of these steps contribute to successful model implementation and deployment?

The lifecycle of a data science model typically involves the following steps: collecting data, analyzing the data, data wrangling, training/testing the model, and validation. 1. **Collecting Data**: Gathering relevant data sets that pertain to the problem at hand, from reliable sources. 2. **Analyzing the Data**: Examining data attributes to understand their structure and the correlation between variables, which assists feature selection and data preprocessing. 3. **Data Wrangling**: Cleaning the data by handling missing values and outliers and transforming data as required, so that the model inputs are of high quality. 4. **Training/Testing the Model**: Splitting data into training and test sets to build the model and evaluate it on unseen data, which exposes overfitting and shows how well the model generalizes. 5. **Validation**: Assessing model accuracy and performance through metrics such as the confusion matrix and precision/recall, and refining the model as needed before deployment. These steps interconnect so that model development is systematic, robust, and suitable for deployment in real-world scenarios.

Q: Explain the role of dataframes in data analysis using Pandas and illustrate with an example how they can be beneficial in handling real-world data.

DataFrames in Pandas provide a powerful data structure for handling tabular data, enabling easy manipulation, cleansing, and analysis of datasets similar to a spreadsheet or SQL table. Key advantages include: 1. **Intuitive data representation**: As a two-dimensional, size-mutable structure, a DataFrame allows easy data handling with labels for both rows and columns. 2. **Efficient handling of missing data**: With methods like `fillna`, `dropna`, and `interpolate`, DataFrames simplify the process of managing missing data. 3. **Complex data operations**: DataFrames support operations such as merging, joining, grouping, and pivoting, which are crucial for real-world data manipulation. For example, when analyzing weather data, a DataFrame can be used to calculate the mean, max, or min temperature, or to aggregate data by grouping on events or days, which is far simpler than handling the same data through lists or dictionaries.

Q: In what ways do the matplotlib library functions facilitate the creation of various types of plots, and how can these plots be used to enhance data analysis?

The matplotlib library offers a suite of functions for creating diverse types of plots. Key features include: 1. **Line Plots**: `plt.plot()` creates line graphs, which can be customized with colors, styles, and markers to track trends over continuous variables. 2. **Bar Charts**: `plt.bar()` draws vertical bars and `plt.barh()` draws horizontal bars, useful for comparing discrete categories or groups, with flexibility in arranging and labeling bars. 3. **Histograms**: `plt.hist()` shows the distribution of a set of continuous data points, giving insight into the underlying frequency distribution and data spread. 4. **Pie Charts**: `plt.pie()` generates circular graphics divided into slices, illustrating numerical proportions of categorical data. 5. **Customization**: Plots can be enhanced with titles, labels, legends, and color modifications to make them more informative. These capabilities let analysts present complex datasets meaningfully, identify patterns and outliers, and communicate findings effectively.

Q: How does the sklearn library simplify the process of implementing machine learning algorithms in Jupyter notebooks?

The sklearn library simplifies the implementation of machine learning algorithms in Jupyter notebooks by providing a consistent, user-friendly API for a wide range of machine learning tasks such as classification, regression, and clustering. 1. **Preprocessing**: Sklearn provides built-in classes and functions for scaling, normalization, and handling missing data, which puts the data in a suitable format for modeling. 2. **Algorithm Implementation**: It offers a simple interface to create and run various machine learning models with minimal code (e.g., LinearRegression, LogisticRegression), handling the computational heavy lifting behind the scenes. 3. **Model Validation**: Sklearn includes tools for splitting datasets, cross-validation, and various metrics to evaluate model performance. 4. **Visualization**: Together with matplotlib, results and learning curves can be visualized to interpret model behavior. 5. **Interoperability**: Sklearn works with other Python libraries such as pandas and numpy, which makes it versatile in data science workflows. Overall, sklearn streamlines the model development cycle and makes machine learning accessible in Jupyter notebook environments.

Q: How does regularization help improve the performance of machine learning models?

Regularization improves the performance of machine learning models by adding a penalty term to the loss function, discouraging complex models with extreme coefficient values that might fit the training data perfectly but fail to generalize to new data. Two common forms of regularization are: 1. **L1 Regularization (Lasso)**: It adds the sum of the absolute values of the coefficients as a penalty term to the loss function. This form also performs feature selection, because it can shrink some coefficients to exactly zero, effectively removing those features. 2. **L2 Regularization (Ridge)**: It adds the sum of the squared coefficients to the loss function, shrinking all coefficients toward zero (large ones the most) without setting them exactly to zero, which lowers the complexity of the model. Overall, regularization helps to avoid overfitting, particularly in models with many predictors, and supports better performance on unseen data.
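As a rough illustration of the L2 penalty described above, the sketch below uses the closed-form ridge solution on made-up data. The data, the helper name `ridge_fit`, and the penalty value `lam=10.0` are assumptions chosen only for this example; no intercept is fitted, to keep it short.

```python
import numpy as np

# Hypothetical toy data: 20 samples, 3 features, known weights plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
true_w = np.array([3.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=20)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w_plain = ridge_fit(X, y, lam=0.0)    # no penalty: ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)   # L2 penalty shrinks the weights

print("no penalty:", w_plain, "norm =", np.linalg.norm(w_plain))
print("lam = 10  :", w_ridge, "norm =", np.linalg.norm(w_ridge))
```

The norm of the penalized weight vector is smaller than the unpenalized one, which is the shrinking effect the answer refers to.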

Q: How does numpy facilitate mathematical and scientific calculations in Python, and what are the key features that enhance its performance over traditional Python lists?

Numpy facilitates mathematical and scientific calculations in Python by providing a powerful n-dimensional array object that allows efficient storage and handling of large data sets. Key features that enhance performance over traditional lists include: 1. **Memory efficiency**: Numpy arrays are more compact in memory than Python lists. They store only data of the same type, which reduces memory overhead. 2. **Vectorization**: It supports vectorized operations, meaning explicit loops can be replaced with single array operations, which improves speed. 3. **Predefined functions**: Numpy provides many mathematical, statistical, and linear algebra functions that are heavily optimized for performance. These features remove the need for explicit Python loops and speed up computations considerably compared to native Python lists.

Q: Describe how the concept of overfitting can impact the performance of a machine learning model, and what techniques might be employed to mitigate this issue.

Overfitting occurs when a machine learning model learns the training data too well, capturing noise and details that don't generalize to new data. This results in a model that performs well on training data but poorly on unseen data, causing a gap between training and test accuracy. To mitigate overfitting, several techniques are employed: 1. **Cross-validation**: Techniques like k-fold cross-validation check that the model performs well on various subsets of the data, providing a more reliable performance measure than a single train-test split. 2. **Regularization**: Techniques like L1 and L2 regularization add a penalty for larger coefficients, discouraging complex models. 3. **Pruning (for decision trees)**: Removing parts of the tree that provide little predictive power reduces overfitting and improves generalizability. 4. **Early Stopping**: Monitoring the model's performance on a validation set and halting training when performance starts to degrade. 5. **Increasing Data**: More data gives the model a more comprehensive training base and helps it generalize. These techniques help achieve a balance between bias and variance, improving the model's ability to generalize beyond the training data.

Q: What are the main types of machine learning, and how do they differ in terms of data requirements and learning objectives?

Machine learning is mainly categorized into supervised, unsupervised, and reinforcement learning. In supervised learning, the algorithm is trained on labeled data, meaning both input and output are provided. The objective is to learn a mapping from inputs to outputs, minimizing the error margin between predicted and actual outputs. Examples include classification (e.g., k-NN, logistic regression) and regression (e.g., linear regression). Unsupervised learning involves training without labeled outputs, using only input data to identify patterns or groupings, such as clustering (e.g., k-means clustering). Reinforcement learning involves an agent interacting with an environment and receiving rewards for certain actions, with the aim of maximizing the cumulative reward over time. It is often used where strategies must be explored, such as games or robotics (algorithms such as Monte Carlo methods and Q-learning are used here).

Q: Discuss the significance of the sigmoid function in logistic regression and how it differs from linear regression in handling categorical data.

The sigmoid function plays a crucial role in logistic regression as it converts the output of the linear equation into a probability value between 0 and 1, suitable for binary classification tasks. Unlike linear regression, which predicts continuous values, logistic regression handles the categorical nature of the target variable by applying the logistic function: y = 1 / (1 + e^-(b_0 + b_1*x)), where 'e' is the base of natural logarithms. This transformation allows logistic regression to predict the probability that a given input belongs to a certain class (e.g., success/failure, yes/no), rather than predicting an arbitrary linear value. The resulting probability can then be mapped to the binary classes, making logistic regression well-suited for classification problems, whereas linear regression is typically applied to regression problems involving continuous data.
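The formula above can be sketched in a few lines of numpy. The coefficients `b0` and `b1` and the 0.5 threshold are assumptions chosen only for illustration; in practice they would come from fitting a model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -4.0, 1.0
x = np.array([0.0, 2.0, 4.0, 6.0, 8.0])

prob = sigmoid(b0 + b1 * x)          # predicted probability of class 1
label = (prob >= 0.5).astype(int)    # threshold at 0.5 to get a class

print(prob)
print(label)   # x = 4 gives a probability of exactly 0.5, so it becomes class 1
```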

Q: What is the concept of R-squared in regression analysis, and how does it help evaluate the performance of a regression model?

R-squared, or the coefficient of determination, is a statistical measure in regression analysis that indicates the proportion of the variance in the dependent variable that is predictable from the independent variables. For a least-squares model with an intercept, evaluated on its own training data, R-squared ranges from 0 to 1: a value of 0 indicates that the model explains none of the variability, whereas a value of 1 signifies that the model explains all the variability of the response data around its mean. On held-out data the value can be negative when the model predicts worse than simply using the mean. R-squared helps in evaluating a regression model by reflecting how well the independent variables predict the dependent variable. Nonetheless, a high R-squared does not by itself imply a good model, because it does not reveal overfitting or a mis-specified (for example non-linear) relationship; it is most useful in conjunction with other statistics.

The document discusses machine learning and its different types including supervised learning, unsupervised learning, and reinforcement learning. It also discusses various machine learning algorithms such as classification algorithms like K-nearest neighbors and logistic regression, regression algorithms like single linear regression and multiple linear regression, and clustering algorithms like K-means clustering. The document also covers concepts related to numpy and pandas libraries in Python for machine learning.

Uploaded by

Bhagyashri Rane

Introduction to Machine Learning
Python Setup for Machine Learning
Introduction to Numpy
Pandas for Data Handling
Matplotlib for Visualization
Advanced Machine Learning Algorithms

Date: 01.09.2020

What is machine learning?

- It is the process of teaching a machine to recognize images, pictures, or some other kind of data.
- Combination of data

Categories
Machine learning
Supervised- we have the input as well as the correct output to check against

input

output

The difference between predicted and actual output is called error margin

Two type of algorithms (to handle/complete the big main process divide it into small process)

1. Classification – to differentiate/distinguish between things (e.g. difference between cat and dog)
Example
o K-nn
o Logistic regression
2. Regression – where we have to predict a numeric value (e.g. a salary)
o Single linear
o multiple linear
o polynomial

Unsupervised- we only have input.

We mostly use clustering (grouping) in unsupervised

Algorithms:

1. Clustering(grouping)
Example
o K mean clustering
2. Anomaly detection
To understand the pattern

Reinforcement- reward based learning (often combined with deep learning).

Algorithms:

1. Monte Carlo
2. Q learning

Pre-processing- python pandas library

PyCharm
Jupyter Notebook

IDLE

Python should be recognised

Python 3.7.4

Python install package

In cmd type pip install jupyter

After that create new folder and open cmd from that path

Shift+enter to run jupyter code

NumPy

In cmd pip install numpy

import numpy/pandas/matplotlib

Commands in cmd

1. python (hit enter)

2. exit()
3. pip --version
4. pip install jupyter
5. pip install numpy
6. pip install pandas
7. pip install matplotlib
Date: 03.09.2020

Numpy
It is a library which is used in python. Used for mathematical and scientific calculations

In jupyter notebook

import numpy as np

print(dir(np))

help(np)

In google

Numpy documentation for python 3 (for guidance)

Array (collection of same data type)

An array is different from a matrix, as the operations performed on them are different.

# 1D array

A_1 = np.array([1,2,3,4,5])  # '()' is the call to numpy's array method and '[]' defines the array items

size is used to know the number of elements in an array

print(A_1)

output

[1 2 3 4 5]-----shape of this array is (5,)

# 2D array

A_2 = np.array([

[1,2,3],  # 1st row

[4,5,6],  # 2nd row

[7,8,9]   # 3rd row

])  # shape of this array is (3, 3)

print(A_2)

Output

[[1 2 3]
[4 5 6]
[7 8 9]]
# 3D array shape 3x3x3---->3 blocks, each with 3 rows and 3 cols

A_3=np.array([

[[1,2,3],[4,5,6],[7,8,9]],

[[10,11,12],[13,14,15],[16,17,18]],

[[19,20,21],[22,23,24],[25,26,27]],

])

Output:

[[[1 2 3]
[4 5 6]
[7 8 9]]

[[10 11 12]
[13 14 15]
[16 17 18]]

[[19 20 21]
[22 23 24]
[25 26 27]]]

Range function
Has three parameters (start, stop, increment/decrement)
Gives integer type of data
size = range(10)
print(size)
type(size)
for i in size:
    print(i)
    print(type(i))

arange (works like the range function; the parameters are almost the same)

It generates array type of data
np.arange(p1,p2,p3)

p1=starting position
p2=ending position (not included)
p3=increment/decrement
arr = np.arange(1,10)
print(arr)
print(type(arr))
output :[1 2 3 4 5 6 7 8 9]
<class 'numpy.ndarray'>

nd means n-dimensional array

ar_1=np.arange(10,51,2)

print(ar_1)
output: [10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50]

How to create zero array?

np.zeros((no_of_rows,no_of_cols))
np.ones((no_of_rows,no_of_cols))

zeros=np.zeros((4,3))
zeros
array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]])

zeros_1=np.zeros((3,4),dtype=np.int16)
zeros_1
array([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]], dtype=int16)

The same goes for a ones array (np.ones)

Methods and keywords (numpy)

ndim (gives the number of dimensions of the array)

--------------print(arr_1.ndim)

size (gives the number of elements of the array)

--------------print(arr_1.size)

itemsize (gives how many bytes a single element consumes)

-------------print(arr_1.itemsize)

dtype (gives the datatype of the elements)

---------------print(arr_1.dtype)

shape (gives the number of rows and cols)

---------------print(arr_1.shape) output=(8,)

Reshape an array (conversion of 1D array to 2D or 3D or vice versa)

temp_arr=arr_1.reshape(2,4)

print(temp_arr)

temp_arr.ndim

flatten (converts a 2D or 3D array to 1D)
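The attributes and methods listed above can be tried together in one small sketch. The 8-element array built with `np.arange(8)` is an assumption standing in for `arr_1`, which these notes use without showing its definition.

```python
import numpy as np

arr_1 = np.arange(8)            # assumed stand-in: 1D array with 8 elements
print(arr_1.ndim)               # 1
print(arr_1.size)               # 8
print(arr_1.shape)              # (8,)

temp_arr = arr_1.reshape(2, 4)  # 2 rows, 4 columns (2 * 4 must equal size)
print(temp_arr.ndim)            # 2
print(temp_arr.shape)           # (2, 4)

flat = temp_arr.flatten()       # back to 1D (returns a copy)
print(flat.shape)               # (8,)
```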

Array consumes less memory than list / why array is faster than list.
import numpy as np

import time

import sys

s=range(1000)

print(sys.getsizeof(5)*len(s))

14000

d=np.arange(1000)

print(d.size*d.itemsize)

4000

Matrix multiplication, matrix addition and matrix division were explained theoretically

Date: 04.09.2020

- zip is used to pair up the elements of two lists (it does not compress them)

Array is faster than list
Program
import numpy as np

import time

size = 10000

#list

l1=range(size)

l2=range(size)

start=time.time()

print(start)

#result=l1+l2--->with lists this will concatenate

result=[(x,y) for x,y in zip(l1,l2)] #zip pairs up the elements of the two lists

print((time.time()-start)*1000)

#array

a1=np.arange(size)
a2=np.arange(size)

start=time.time()

result=a1+a2

print((time.time()-start)*1000)

What is line?

A line is the combination of multiple/infinite points.

Linspace function
linspace is used to get evenly spaced points between two points of a line (both end points are included)

Syntax: np.linspace(a,b,c)

a-starting point

b- Ending point

c- No of points you want
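A short sketch of the difference between `linspace` (you give the number of points) and `arange` (you give the step). The end points 0 and 1 are arbitrary values picked for the example.

```python
import numpy as np

points = np.linspace(0, 1, 5)   # 5 evenly spaced points, end point included
print(points)

steps = np.arange(0, 1, 0.25)   # step of 0.25, end point 1 is excluded
print(steps)
```

Here `linspace` returns 0, 0.25, 0.5, 0.75 and 1, while `arange` stops at 0.75.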

to get minimum value from an array

array_name.min()

to get maximum value from an array

array_name.max()

to get sum of array element

array_name.sum()

arr_1=

8 9
10 11
12 13

axis=1 works along each row (across the columns)

axis=0 works along each column (down the rows)

#used mostly in pre-processing

arr_1=np.array([

[8,9],

[10,11],

[12,13]

])

arr_1.shape

print(arr_1.sum())

print(arr_1.sum(axis=0))  # gives the sum of each column

print(arr_1.sum(axis=1))  # gives the sum of each row
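For reference, a minimal sketch of what the three `sum` calls return for this 3x2 array; the variable names `total`, `col_sums` and `row_sums` are introduced here only for the example.

```python
import numpy as np

arr_1 = np.array([[8, 9],
                  [10, 11],
                  [12, 13]])

total = arr_1.sum()            # all elements: 63
col_sums = arr_1.sum(axis=0)   # one value per column: [30 33]
row_sums = arr_1.sum(axis=1)   # one value per row: [17 21 25]

print(total, col_sums, row_sums)
```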

Square Root of array

It gives the element-wise square root

Syntax: np.sqrt(array_name)

Example: arr_1=np.array([

[8,9],

[10,11],

[12,13]])

print(np.sqrt(arr_1))

output:

[[2.82842712 3. ]
[3.16227766 3.31662479]
[3.46410162 3.60555128]]

Addition, subtraction, multiplication and division of array

Program
arr_1=np.array([[1,2,3],[3,4,5]])

arr_2=np.array([[1,2,3],[3,4,5]])

print("addition \n",arr_1+arr_2)

print("subtraction \n",arr_1-arr_2)

print("multiplication \n",arr_1*arr_2)

print("division\n",arr_1/arr_2)
output
addition
[[ 2 4 6]
[ 6 8 10]]
subtraction
[[0 0 0]
[0 0 0]]
multiplication
[[ 1 4 9]
[ 9 16 25]]
division
[[1. 1. 1.]
[1. 1. 1.]]

Horizontal stacking vertical stacking

The arrays should be passed together as one tuple
Syntax: np.vstack(tuple)
np.hstack(tuple)
Program
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[3,4,5]])
#vertical stacking
print(np.vstack((a,b)))
#horizontal stacking
print(np.hstack((a,b)))
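A quick way to see the difference between the two stacking directions is to look at the resulting shapes. This sketch reuses the two 2x3 arrays from the program above; `v` and `h` are names introduced only here.

```python
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5]])
b = np.array([[1, 2, 3], [3, 4, 5]])

v = np.vstack((a, b))   # rows of b are placed below the rows of a
h = np.hstack((a, b))   # columns of b are placed to the right of a

print(v.shape)          # (4, 3)
print(h.shape)          # (2, 6)
```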

Matrix
#convert array to matrix
m_1=np.matrix(a)
print(type(m_1))

#creating a matrix 3X3

m_2=np.matrix("1 2 3 ;4 5 6;7 8 9")
print(m_2)
print([Link](m_2))
print([Link](m_2))
print([Link](m_2))
print([Link](m_2))

PANDAS
Python has 4 types of number systems
Decimal

Hexadecimal

Octal

Binary

Dictionary
A key is usually one of two data types, either numeric (int) or string

Dataframe(mostly used in data analysis)

It is a table like structure which looks like an excel sheet

Key works as a column name in dataframe

Program
#creation of dataframe
import pandas as pd

weather_data={

"Day":['1/1/2020','22/1/2020','3/2/2020','12/4/2020','25/5/2020'],

"Temp":[31,29,22,35,19],

"Wind_speed":[7,9,4,5,6],

"Event":["sunny","sunny","rain","fog","sunny"]

}

print(weather_data)

#conversion of dictionary to data frame

df=pd.DataFrame(weather_data)

Date: 05.09.2020

Shape: used to give dimensions of the table that is the number of rows and columns
ROW
Head
Syntax: dataframe_name.head(no of rows)

Gives/displays the upper/starting rows of the data frame

By default: 5

Tail
Syntax: dataframe_name.tail(no of rows)

Gives the lower/last rows of the dataframe

By default: 5

Slicing:
It is used to create a sublist

Syntax:

List_name[starting index:ending index]

Indexing/slicing in dataframe:
Dataframe_name[starting_index:ending index]

COLUMN
Dataframe_name.columns

Gives name of the columns

To get the data of a particular column

--->dataframe_name[column_name]

Day=df['Day'].values

print(Day)

.values--->is used to get the values in array

Program to get two columns values

df[['Day','Temp']].values

METHODS

Temp_col=df["Temp"].values

Temp_col.max()
Program

df[df["Temp"]>32]-->queries

df[df["Temp"]==df["Temp"].max()]

df.describe()------>used to get a summary of the numeric (int/float) columns; gives count, mean, standard deviation, min, max
and many more things.
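Put together, the column selection and query lines above look roughly like this on the weather data from the earlier program. The variable names `hot`, `hottest_day` and `mean_temp` are introduced only for this sketch.

```python
import pandas as pd

weather_data = {
    "Day": ['1/1/2020', '22/1/2020', '3/2/2020', '12/4/2020', '25/5/2020'],
    "Temp": [31, 29, 22, 35, 19],
    "Wind_speed": [7, 9, 4, 5, 6],
    "Event": ["sunny", "sunny", "rain", "fog", "sunny"],
}
df = pd.DataFrame(weather_data)

hot = df[df["Temp"] > 32]                      # rows where Temp is above 32
hottest_day = df[df["Temp"] == df["Temp"].max()]["Day"].values[0]
mean_temp = df["Temp"].mean()

print(hot)
print(hottest_day)    # 12/4/2020
print(mean_temp)      # 27.2
print(df.describe())  # count, mean, std, min, quartiles, max of numeric columns
```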

Using Tuple list

weather_data=[ ('12/2/2020',32,8,"rain"),

('29/3/2020',22,8,"rain"),

('28/5/2020',19,8,"rain"),

('22/7/2020',23,8,"rain"),]

df=pd.DataFrame(data=weather_data,columns=["day","temp","wind_speed","events"])

Date: 07.09.2020

CSV file

How to import data from csv file

Copy the csv data path

read_csv()

It is a method of pandas used to read a csv file into a dataframe

#csv data

df=pd.read_csv(r"path")

#XLS

df=pd.read_excel(r"path")

Pandas is fast because of dataframe

Set first column as index

df.set_index("column_name",inplace=True)--->inplace=True makes the change permanent on df itself

[Link]

------

df.loc["index_name"]

------gives the data of the row with the index mentioned

READ and WRITE the data operations

Csv and excel

skiprows attribute is used to skip the upper rows

Syntax

df= pd.read_csv(r"C:\Users\K\Documents\stock_csv_data.csv",skiprows=No_of_rows)

header
Makes the mentioned row the header

Index wise (counting starts from 0)

If we put 1 then it skips the 1st row and makes the 2nd row the header

nrows stands for the number of rows to read

na_values
To replace the empty values from the table to NaN

So that it becomes easy to perform operations

Example:

df= pd.read_csv(r"C:\Users\K\Documents\stock_csv_data.csv",na_values=["n.a.","not available"])

Write csv
to_csv()=to write the dataframe to a new csv file

Example: df1.to_csv("[Link]",index=False)

Write excel:
df_1.to_excel("[Link]",sheet_name="stock",index=False,startrow=2,startcol=1,header=False)

to_excel:----- to write the dataframe to a new excel file

#write two dataframes to two separate sheets in excel

program

import pandas as pd

df_stock=pd.DataFrame({

"tikers":["Google","WMT","MSFT"],

"price":[30,40,10],

"pe":[20.5,65.10,35.2],

"eps":[20.5,20.5,56.1]

})

df_stock

df_weather=pd.DataFrame({

"day":["monday","tuesday","friday"],

"temp":[30,40,10],

"event":["rain","sunny","rain"],

"humidity":[20.5,20.5,56.1]

})

df_weather

#openpyxl----package used to write an excel file

with pd.ExcelWriter("stock_weather.xlsx") as writer:

    df_stock.to_excel(writer ,sheet_name="stock_file")

    df_weather.to_excel(writer ,sheet_name="weather_file")

Handle missing values

df.fillna(value)----used to fill the missing data for the whole table

value can be anything you want to replace the missing value with

Handling the empty values column wise

new_df=df.fillna({

"day" : "no date",

"temperature" : 0,

"windspeed" : 0,

"event" : "no events"})

Date:09.09.2020

Handling missing data

On index methods are not performed

parse_dates=['day']

used to change the day/date str values into date format/ time stamp

example: df=pd.read_csv("weather_data.csv",parse_dates=['day'])

fillna(value)

Fills all the NaN values of the table with the value passed

method of fillna

ffill (forward fill)

It is used to fill the current value with the previous value/above cell value

bfill (backward fill)

It is used to fill the current value with the next value/below cell value

interpolate

It gives meaningful data. By default it uses linear interpolation.

dropna

Used to drop the rows which have missing data

thresh

It sets the minimum number of non-NaN values a row must have to be kept

Like if (thresh=1)----then it will keep all the rows which have at least 1 non-NaN value in them
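The missing-value tools above can be compared side by side on a small table. The data below is hypothetical, made up only so that each method has something to fill or drop.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps (NaN / None = missing value).
df = pd.DataFrame({
    "temperature": [30.0, np.nan, 34.0, 35.0],
    "windspeed":   [6.0, 7.0, np.nan, np.nan],
    "event":       ["rain", None, "sunny", None],
})

# Column-wise replacement values, as in the fillna dictionary above.
filled = df.fillna({"temperature": 0, "windspeed": 0, "event": "no events"})

forward = df.ffill()                       # copy the value from the row above
interp = df["temperature"].interpolate()   # linear: the gap becomes 32.0
kept = df.dropna(thresh=2)                 # keep rows with >= 2 non-NaN values

print(filled)
print(forward)
print(interp.tolist())
print(kept)
```

Rows 1 and 3 have only one real value each, so `thresh=2` drops them and keeps rows 0 and 2.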

#date_range is used to insert the missing dates into the table

rg=pd.date_range("2017-01-01","2017-01-11")

index=pd.DatetimeIndex(rg)

df=df.reindex(index)

#replace using regular expression (regex)

df2=df.replace({

"temperature":'[a-z]',

"windspeed":'[a-z]'

},'',regex=True)
df2

Grouping / clustering

Used when the data has a column with repeating values of different kinds (categories); the rows are grouped by those values

Date: 10.09.2020

Concatenation:
Used for basic concatenation

#concatenate with keys

df=pd.concat([india_weather,us_weather],keys=["india","us"])

#with the index

df=pd.concat([india_weather,us_weather])

#ignore the index

df=pd.concat([india_weather,us_weather],ignore_index=True)

Merge
Merge is used to merge two data frames

Syntax: pd.merge(dataframe1,dataframe2,on="columnname",
how="inner"/"outer"/"left"/"right",indicator=True/False (False by default))

"on"---it merges the dataframes on the basis of the column name mentioned.

Merge has the same concept as joins in DBMS

Inner join-----common things only (intersection)

Outer join-----whole data (union)

Left join------all left data (with common things from right)

Right join-------all right data (with common things from left)

Indicator flag-----adds a _merge column that shows whether a row came from both, left only or right only

Ex: df3=pd.merge(df1,df2,on="city",how="outer",indicator=True)

Suffixes----- used to tell apart same-named columns coming from the two dataframes

Ex:
df3=pd.merge(df1,df2,on="city",how="outer",suffixes=("_first","_second"),indicator=True)
df3
output:

        city  temp_first  humidity_first  temp_second  humidity_second      _merge
0   new york        22.0            55.0         18.0             68.0        both
1    chicago        15.0            85.0         23.0             65.0        both
2    orlando        35.0            76.0          NaN              NaN   left_only
3  baltimore        40.0            68.0          NaN              NaN   left_only
4  san diego         NaN             NaN         35.0             71.0  right_only
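The two input dataframes are not shown in these notes, so the sketch below rebuilds them from the output table above; treat `df1` and `df2` as reconstructed assumptions. The row order of an outer merge can differ between pandas versions.

```python
import pandas as pd

# Input frames reconstructed from the merged output shown above.
df1 = pd.DataFrame({
    "city": ["new york", "chicago", "orlando", "baltimore"],
    "temp": [22, 15, 35, 40],
    "humidity": [55, 85, 76, 68],
})
df2 = pd.DataFrame({
    "city": ["new york", "chicago", "san diego"],
    "temp": [18, 23, 35],
    "humidity": [68, 65, 71],
})

# Outer join keeps every city; suffixes tell the same-named columns apart.
df3 = pd.merge(df1, df2, on="city", how="outer",
               suffixes=("_first", "_second"), indicator=True)
print(df3)

# Inner join keeps only the cities present in both frames.
inner = pd.merge(df1, df2, on="city", how="inner")
print(inner)
```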

MATPLOT LIB
import matplotlib.pyplot as plt

%matplotlib inline (always used in jupyter only)

To plot: plt.plot(x,y)

plt.title("Weather Graph")

plt.xlabel("Days")

plt.ylabel("Temp")

plt.plot(x,y,color="blue", linewidth=2,linestyle="dotted",marker="*")

attributes of plot
color

linewidth

linestyle

marker----style of the data points of the plot

alpha----controls opacity

In IDLE python to see the graph you have to write

plt.show()
String format

Refer: [Link]

Date: 11.09.2020

To get three cities temperature information in one graph

Program:

plt.title("weather")

plt.xlabel("day")

plt.ylabel("temp")

plt.plot(day,mumbai,"g*-",label="mumbai")

plt.plot(day,delhi,"ro-",label="delhi")

plt.plot(day,pune,"b^-",label="pune")

plt.legend(loc="upper right")-------used to place the legend box (the default value is "best", i.e. it will fit
itself into the empty space)

attributes of legend:

loc, fontsize="large"/"small", shadow (gives a shadow to the legend box)

plt.show()

Bar Chart:
company=["Reliance","Indian Oil","State Bank Of India","TATA"]

revenue=[82,77,47,65]

plt.bar(company,revenue,color="red")

#if an error comes about integer values

company_position=np.arange(len(company))

plt.bar(company_position,revenue,color="green")

plt.xticks(company_position,company)

plt.show()

#multiple bar vertical (profit is a list of the same length as revenue)

plt.bar(company_position-0.2,revenue,width=0.4,label="revenue")

plt.bar(company_position+0.2,profit,width=0.4,label="profit")

plt.legend(fontsize="large")

plt.xticks(company_position,company)

plt.show()
#multiple bar horizontal

plt.barh(company_position-0.2,revenue,height=0.4,label="revenue")

plt.barh(company_position+0.2,profit,height=0.4,label="profit")

plt.legend(fontsize="large")

plt.yticks(company_position,company)

plt.show()

Histogram
It can also be generated with only one parameter

X axis carries the variable

Y axis shows the frequency accordingly

people_ages=[12,45,18,8,3,85,75,65,15,95,35,23,44,66,58,62,73,84,92,110]

age_group=[1,10,20,30,40,50,60,70,80,90,100,110]

plt.hist(people_ages,age_group,rwidth=0.8)

plt.yticks(range(0,7))

plt.show()
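To see the numbers behind the bars, the frequencies can be computed without plotting. This sketch uses `np.histogram`, which counts values per bin with the same bin-edge convention as the histogram plot; it is an illustration and not part of the original program.

```python
import numpy as np

people_ages = [12, 45, 18, 8, 3, 85, 75, 65, 15, 95,
               35, 23, 44, 66, 58, 62, 73, 84, 92, 110]
age_group = [1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]

# counts[i] = number of ages in [age_group[i], age_group[i+1]);
# only the last bin also includes its right edge (110).
counts, edges = np.histogram(people_ages, bins=age_group)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>3} to {right:>3}: {n}")
```

The tallest bars have a frequency of 3 (ages 10 to 20 and 60 to 70).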

Pie Chart
exp=[1400,600,300,410,250]

exp_label=["bike","food","phone","internet","others"]

plt.pie(exp,labels=exp_label,shadow=True,autopct="%1.5f%%",explode=[0,0,0,0.4,0])

plt.show()

#plt.axis("equal") ---when you get an oval shaped pie chart, this is used to make it circular

To save the chart

plt.savefig("[Link]")

#never write plt.show() before savefig; the saved file will not contain the graph

Date: 14.09.2020
ALGORITHMS
Simple linear Regression
----an algorithm implementation in ML is called a model

When the value to predict is a number, we can use regression

Independent variable---data that can be controlled directly

Dependent variable---data that cannot be controlled directly

Used to find the best fit line with minimum error margin

x (Y_e, year of experience)   y (Sal, salary)
1                             20
2                             40
3                             50
4                             40
5                             50

Formula: y=b0+b1*x / y=c+m*x

b0--->it is the intercept

b1---->slope (coefficient of x)

Main formula to find the best fit line: y=b0+b1*x

b0=y_mean-b1*x_mean b1=sum((x-x_mean)*(y-y_mean))/sum((x-x_mean)^2)

c---->constant

m---->slope

y---->dependent x---->independent
Required things to find best fit line

X         Y          x-x'   y-y'   (x-x')^2   (x-x')*(y-y')
1         20         -2     -20    4          40
2         40         -1     0      1          0
3         50         0      10     0          0
4         40         1      0      1          0
5         50         2      10     4          20
Xmean=3   Ymean=40   0      0      10         60     (last row: means and column sums)

m= sum((x-x')*(y-y'))/sum((x-x')^2)=60/10=6

to find constant

y=c+m*x (using the means: y=40, x=3)

40=c+18

c=22

simple linear regression code in python

import numpy as np
import matplotlib.pyplot as plt

def coe(x,y):
    n=np.size(x)
    print(n)
    #mean
    m_x=np.mean(x)
    m_y=np.mean(y)
    #calculating cross-deviation and deviation about x
    ss_xy=np.sum(y*x)-(n*m_y*m_x) #for (x-x')*(y-y')
    ss_xx=np.sum(x*x)-(n*m_x*m_x) #for (x-x')^2
    #calculating regression coefficient
    m=ss_xy/ss_xx
    c=m_y - m*m_x
    return (c,m)

def plotting_regeression_line(x,y,b):
    #plotting data points
    plt.scatter(x,y)
    #predict value
    y_pred=b[0]+(b[1]*x)
    print(y_pred)
    #plotting regression line
    plt.plot(x,y_pred)
    #labels
    plt.xlabel("year of experience")
    plt.ylabel("Salary")
    plt.show()

def main():
    x=np.array([1,2,3,4,5])
    y=np.array([20,40,50,40,50])
    #call function coe
    b=coe(x,y)
    print(b)
    plotting_regeression_line(x,y,b)

main()

sklearn is used to make things easy

code jupyter
from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

dataframe=pd.read_csv("emp_data.csv")
dataframe

#isnull gives the number of null values in the data
dataframe.isnull().sum()

X=dataframe["Year of Experience"].values
Y=dataframe["salary"].values

#reshaping the x,y because the model only takes 2D arrays of values
X=X.reshape(-1,1)
X.shape
Y=Y.reshape(-1,1)

lr=LinearRegression()
#training the machine according to the data
lr.fit(X,Y) #used to fit the values in the machine
y_pred=lr.predict(X)
y_pred

#plotting
plt.scatter(X,Y,color="red")
plt.plot(X,y_pred,color="blue")
plt.show()

Date: 15.09.2020

Cross-checking simple linear regression with the sums formula

Main formula to find best fit line:

Y=b0+b1*x

b0 =(sum(y)*sum(x**2)-sum(x)*sum(xy))/(n*sum(x**2)-(sum(x))**2)=(200*55-15*660)/(5*55-15**2)=22

b1=(n*sum(xy)-sum(x)*sum(y))/(n*sum(x**2)-(sum(x))**2)=(5*660-15*200)/(5*55-15**2)=6

X     y     xy    x**2
1     20    20    1
2     40    80    4
3     50    150   9
4     40    160   16
5     50    250   25
15    200   660   55     (last row: column sums)

Y=22+6*5=52 (to get the predicted value for x=5)

separate x and y using iloc method

date: 16.09.2020(absent)

Multiple Linear Regression:

output

X1 X2 Y
21 31 44
22 36 45
23 32 46
24 35 47
28 34 48
27 38 49
Mean: 24.16 34.33 46.5

Simple linear:

Y = b0 + b1*x

Multiple linear:

Y_pred = b0 + b1*x1 + b2*x2

b1 = sum[(x1 - x1_mean) * (y - y_mean)] / sum[(x1 - x1_mean)**2]

b2 = sum[(x2 - x2_mean) * (y - y_mean)] / sum[(x2 - x2_mean)**2]

x1_mean = 24.16

x2_mean = 34.33

y_mean = 46.5

In the table below, dx1 = x1 - x1_mean, dx2 = x2 - x2_mean and dy = y - y_mean.

x1 x2 y dx1 dx2 dy dx1^2 dx2^2 dx1*dy dx2*dy
21 31 44 -3.16 -3.33 -2.5 9.9856 11.0889 7.9 8.325
22 36 45 -2.16 1.67 -1.5 4.6656 2.7889 3.24 -2.505
23 32 46 -1.16 -2.33 -0.5 1.3456 5.4289 0.58 1.165
24 35 47 -0.16 0.67 0.5 0.0256 0.4489 -0.08 0.335
28 34 48 3.84 -0.33 1.5 14.7456 0.1089 5.76 -0.495
27 38 49 2.84 3.67 2.5 8.0656 13.4689 7.1 9.175

Means: x1 = 24.16, x2 = 34.33, y = 46.5
Sums: dx1^2 = 38.8336, dx2^2 = 33.3334, dx1*dy = 24.5, dx2*dy = 16

b1 = sum[(x1 - x1_mean) * (y - y_mean)] / sum[(x1 - x1_mean)**2]

b1 = 24.5 / 38.8336

b1 = 0.63089

b2 = sum[(x2 - x2_mean) * (y - y_mean)] / sum[(x2 - x2_mean)**2]

b2 = 16 / 33.3334
b2 = 0.48

y_mean = b1*x1_mean + b2*x2_mean + b0

b0 = y_mean - b1*x1_mean - b2*x2_mean

b0 = 46.5 - (0.63089)(24.16) - (0.48)(34.33)

b0 = 46.5 - 15.24 - 16.4784

b0 = 14.78

Check the prediction (row 2: x1 = 22, x2 = 36, actual y = 45):

y_pred = b1*x1 + b2*x2 + b0

y_pred = 0.63089*22 + 0.48*36 + 14.78

y_pred = 13.88 + 17.28 + 14.78

y_pred = 45.94
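The same steps can be sketched in plain Python. The helper names `mean`, `slope` and `predict` are made up for this illustration. The code follows the per-feature formulas used in these notes (each slope is computed from one input at a time), which is not the same as a joint least-squares fit when x1 and x2 are correlated. Because it uses unrounded means, the last digit can differ slightly from the hand calculation.

```python
x1 = [21, 22, 23, 24, 28, 27]
x2 = [31, 36, 32, 35, 34, 38]
y = [44, 45, 46, 47, 48, 49]

def mean(values):
    return sum(values) / len(values)

def slope(x, y):
    """Per-feature slope: sum((x - x_mean)*(y - y_mean)) / sum((x - x_mean)**2)."""
    x_mean, y_mean = mean(x), mean(y)
    num = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    den = sum((xi - x_mean) ** 2 for xi in x)
    return num / den

b1 = slope(x1, y)
b2 = slope(x2, y)
b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)

def predict(a, b):
    """Prediction for inputs x1 = a, x2 = b."""
    return b0 + b1 * a + b2 * b

print(round(b1, 4), round(b2, 4), round(b0, 2))
print(round(predict(22, 36), 2))
```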

Date: 18.09.2020

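Simple vs multiple linear regression:

Simple: single input (x, y); columns x, y; reshape(-1,1) arranges the values as one column.
Multiple: multiple inputs (x1, x2, x3, y); columns X1, X2, X3, y; reshape(1,-1) arranges the values as one row.

Get dummies

It converts string (categorical) data into 0/1 integer columns, one column per category.

Refer to the get dummies Excel sheet.

New York California Florida
1 0 0
0 1 0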
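The idea behind get dummies can be sketched without any library. This is only an illustration: `one_hot` is a made-up helper and the sample column is invented. In practice `pandas.get_dummies` produces the equivalent 0/1 columns from a dataframe column.

```python
def one_hot(values):
    """Return (categories, rows): each row holds one 0/1 flag per category."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows

# Made-up sample column of states
states = ["New York", "California", "Florida", "New York"]
categories, rows = one_hot(states)

print(categories)
for state, row in zip(states, rows):
    print(state, row)
```

The columns come out in alphabetical order here, so the column order differs from the table in the notes, but each row still has a single 1 marking its category.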
We cannot plot multiple linear regression on a simple 2D graph, because there are multiple inputs and one output.

Score (uses the R squared method)

Date. 21.09.2020

Underfitting model (does not create much of a problem)

The score of the model is very poor (50 or 60).

The model does not fully fit the data.

Training gives only 50-60% accuracy.

The actual and predicted values differ by a large amount, which means testing is also not accurate.

Overfitting model

The model tries to pass through every point of the training data.

Type of data

Training data: wholly memorised and close to 100% accurate

Testing data: gives bad predictions

In this model the training accuracy is about 99%, but the accuracy on testing data is much lower.

Underfitting is the less serious problem; overfitting is the more serious one.

Generalized model

It covers almost all the points and gives as low an error as possible.

We should get a low error in training as well as in testing.

This is the ideal model.

R squared method

R2 = 1 - RSS/TSS

Residual = y - y_pred

T = y - y_mean

RSS = sum((y - y_pred)^2)

TSS = sum((y - y_mean)^2)
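As an illustration, the R squared score can be computed by hand for the simple regression data used earlier in these notes (x = 1..5, line Y = 22 + 6x). The function name `r_squared` is an assumption for this sketch.

```python
def r_squared(y, y_pred):
    """R2 = 1 - RSS/TSS, with squared residuals summed term by term."""
    y_mean = sum(y) / len(y)
    rss = sum((a - p) ** 2 for a, p in zip(y, y_pred))
    tss = sum((a - y_mean) ** 2 for a in y)
    return 1 - rss / tss

x = [1, 2, 3, 4, 5]
y = [20, 40, 50, 40, 50]
y_pred = [22 + 6 * xi for xi in x]  # line from the simple regression example

print(r_squared(y, y_pred))
```

Here RSS = 240 and TSS = 600, so the score is 1 - 240/600 = 0.6.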

Polynomial regression

Simple linear regression (one input, one output):

Y = b0 + b1*x

Polynomial regression:

Y = b0 + b1*x + b2*x^2 + b3*x^3

The more the data curves, the higher the degree we need.

Like the other regression algorithms, it is used where we have to predict continuous numbers.

Generalized form (normal equations for degree 2):

| n         Sum(x)    Sum(x^2) |   | B0 |   | Sum(y)     |
| Sum(x)    Sum(x^2)  Sum(x^3) | * | B1 | = | Sum(y*x)   |
| Sum(x^2)  Sum(x^3)  Sum(x^4) |   | B2 |   | Sum(y*x^2) |

Date: 22.09.2020

Date: 23.09.2020

Regression: predicting a continuous value.

Classification: categorising the data, i.e. predicting one output class out of multiple classes.

(Categorical data) Animal data example

Categorical data

Class 1 is cat, class 2 is dog

Cat
Dog
Dog
Cat
Cat
Cat
Dog
Dog

Logistic Regression:
It is a classification algorithm, although its equation is similar to linear regression.

For binary classification, logistic regression is a commonly used algorithm.

Sigmoid curve

The predicted class is either 0 or 1.

It works on probability: the sigmoid output lies between 0 and 1.
Formula of the sigmoid curve, simple linear (one input):

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
or

y = 1 / (1 + e^-(b0 + b1*x))

Multiple linear (several inputs):

y = e^(b0 + b1*x1 + b2*x2 + ... + bn*xn) / (1 + e^(b0 + b1*x1 + b2*x2 + ... + bn*xn))
or

y = 1 / (1 + e^-(b0 + b1*x1 + b2*x2 + ... + bn*xn))

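Example:

x y (x - x_mean) (y - y_mean) (x - x_mean)**2 (x - x_mean)*(y - y_mean)
21 1 -1.5 0.5 2.25 -0.75
22 0 -0.5 -0.5 0.25 0.25
23 0 0.5 -0.5 0.25 -0.25
24 1 1.5 0.5 2.25 0.75
Mean/Sum: 22.5 0.5 0 0 5 0

(In the last row, 22.5 and 0.5 are the means of x and y; the other entries are column sums.)

B1 = 0 / 5 = 0

B0 = y_mean - B1*x_mean = 0.5

If we have a new value x = 46, predict y:

b0 + b1*x = 0.5

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
y = 1 / (1 + e^-(b0 + b1*x))
y = 0.622459

Since 0.622459 is greater than 0.5, this will go in the 1's category.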
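The last step of the example can be reproduced in a few lines. The function names and the 0.5 threshold are assumptions for this sketch; the coefficients b0 = 0.5 and b1 = 0 are the ones worked out in the example.

```python
import math

def predict_probability(x, b0, b1):
    """Sigmoid of the linear score b0 + b1*x."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def predict_class(x, b0, b1, threshold=0.5):
    """Class 1 when the probability reaches the threshold, else class 0."""
    return 1 if predict_probability(x, b0, b1) >= threshold else 0

p = predict_probability(46, b0=0.5, b1=0)
print(round(p, 6), predict_class(46, b0=0.5, b1=0))
```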

Code of logistic regression with sklearn for a single variable, i.e. a single input

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("insurance_data.csv")
df.isnull().sum()

X = df["age"].values
y = df["bought_insuarance"].values

# plotting the actual data
plt.scatter(X, y)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# creating a model
log_model = LogisticRegression()

# sklearn expects the inputs as a 2D array; the labels stay 1D
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)

log_model.fit(x_train, y_train)
y_pred = log_model.predict(x_test)
log_model.score(x_test, y_test) * 100

# prediction for an external value
test = np.array([[25]])
log_model.predict(test)

# plotting the predictions over the actual data
plt.scatter(X, y)
y_pred = log_model.predict(X.reshape(-1, 1))
plt.plot(X, y_pred)

Date:24.09.2020

Logistic regression (Titanic project)

• Analyse the data: see what information and columns are given, and remove the unnecessary data.
• If there is categorical data, use get_dummies to separate it into columns.
• Life cycle of a model, step by step:
1. Collecting data
2. Analysing the data: which variables correlate most with the label, i.e. which columns to remove and which to keep
3. Data wrangling: clean the data, i.e. remove unwanted data and fill or remove null values as required
4. Training and testing: separate the data into training and testing sets and train the model
5. Validation: whether the model gives the right output or not, i.e. checking the accuracy of the model

Confusion matrix:

                          Predicted 0 (not survived)   Predicted 1 (survived)
Actual 0 (not survived)   74                           8
Actual 1 (survived)       18                           43

It is used to check how many predictions are right and how many are wrong.
The correct predictions are always on the diagonal (74 and 43).
The off-diagonal values (8 and 18) are the errors of the model.
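The usual scores can be read off this confusion matrix directly. A minimal sketch, assuming rows are actual classes and columns are predicted classes as in the matrix above; the function name `metrics` is made up.

```python
# Rows are actual classes, columns are predicted classes
matrix = [[74, 8],
          [18, 43]]

def metrics(m):
    """Accuracy, precision and recall for the positive (survived) class."""
    tn, fp = m[0]
    fn, tp = m[1]
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

for name, value in metrics(matrix).items():
    print(name, round(value, 4))
```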

Date: 25.09.2020

K Nearest Neighbour
We compute the distance from the newly entered data point to every existing point.

Euclidean distance formula:

root((x1-x2)^2+(y1-y2)^2)

male=0, female=1

Name Age Gender Sports Distance

A 32 0 Football 27.02
M 40 0 Neither 35.01
S 16 1 Cricket 11
Z 34 1 Cricket 29
S 55 0 Neither 50.01
R 40 0 Cricket 35.01
A 20 1 Neither 15
A 15 0 Cricket 10.05
P 55 1 Football 50
A 15 0 Football 10.05
New point: 5 1 ? (predicted: Cricket)

x1 = 32  # existing data point (age)
y1 = 0   # existing data point (gender)
x2 = 5   # new data point (age)
y2 = 1   # new data point (gender)
k = 3    # k is usually an odd number, to avoid ties

math.sqrt((x1-x2)**2 + (y1-y2)**2)

With k = 3 the nearest neighbours are at distances 10.05 (Cricket), 10.05 (Football) and 11 (Cricket), so the majority vote is Cricket.

This worked example solves the problem manually, without sklearn.
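The manual procedure can be written as a short function. This is a sketch under the assumptions of the table (age and gender as the two coordinates, male = 0, female = 1); `knn_predict` is a made-up name, not a library call.

```python
import math
from collections import Counter

# (age, gender, sport) rows from the table; male = 0, female = 1
people = [
    (32, 0, "Football"), (40, 0, "Neither"), (16, 1, "Cricket"),
    (34, 1, "Cricket"), (55, 0, "Neither"), (40, 0, "Cricket"),
    (20, 1, "Neither"), (15, 0, "Cricket"), (55, 1, "Football"),
    (15, 0, "Football"),
]

def knn_predict(rows, age, gender, k=3):
    """Majority vote among the k rows closest to (age, gender)."""
    by_distance = sorted(
        rows,
        key=lambda r: math.sqrt((r[0] - age) ** 2 + (r[1] - gender) ** 2),
    )
    votes = Counter(sport for _, _, sport in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_predict(people, age=5, gender=1))
```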

Date: 28.09.2020

KNN is widely used in YouTube, online shopping apps and websites for recommending things.

Date. 30.09.2020

SVM (Support Vector Machine)

The hyperplane is the decision boundary used for classification; in 2D it is a straight line.

Support vectors are the data points which are nearest to the hyperplane.

The nearest points of the opposite classes lie at distances -d and +d from the hyperplane.

The margin is created by using the support vectors.

Maximum margin hyperplane

On this basis we choose how our hyperplane will be formed.

How to decide which margin we have to pick:

In the example we choose margin 2 because it has the larger distance (width).

There are two types of SVM:

1. Linear support vector machine

We can easily separate the data with a straight line.
2. Non-linear support vector machine
Kernel: converts low-dimensional data to higher-dimensional data,
i.e. if the data is in 2D then it converts it into 3D.

There are 4 types of kernel:

1. Linear
2. Polynomial
3. RBF, radial basis function (non-linear; the default in sklearn's SVC)
4. Sigmoid
Example: worked calculation

How the hyperplane is drawn mathematically

Q. Draw the hyperplane for the given data:

(1,1) (2,1) (1,-1) (2,-1) (4,0) (5,1) (5,-1) (6,0)

x1 x2
1 1
2 1
1 -1
2 -1
4 0
5 1
5 -1
6 0

Step 1: plot the graph and pick the support vectors

s1 = (2, 1, 1)   s2 = (2, -1, 1)   s3 = (4, 0, 1)

The last added 1 is the bias.

We will use the linear kernel (plain dot products).

Vector separation equations (s1 and s2 belong to class -1, s3 to class +1):

α1 s1·s1 + α2 s1·s2 + α3 s1·s3 = -1   (s1 is held fixed)

α1 s1·s2 + α2 s2·s2 + α3 s2·s3 = -1   (s2 is held fixed)

α1 s1·s3 + α2 s2·s3 + α3 s3·s3 = 1   (s3 is held fixed)

Solving the above equations:

For vector s1

α1 s1·s1 + α2 s1·s2 + α3 s1·s3 = -1

6α1 + 4α2 + 9α3 = -1

Similarly,

for s2

4α1 + 6α2 + 9α3 = -1

for s3

9α1 + 9α2 + 17α3 = 1

α1 = -3.25   α2 = -3.25   α3 = 3.5

w = ∑ αi si

w = α1*s1 + α2*s2 + α3*s3

w = (1, 0, -3)

Equation of the hyperplane

w·x + b = 0

w = (1, 0)

b = -3 (offset/bias)

So the hyperplane is x1 - 3 = 0, i.e. the vertical line x1 = 3.
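The three equations can be solved mechanically. The sketch below uses Cramer's rule with exact fractions; the helper names `dot`, `det3` and `solve3` are made up for this illustration, and the class labels (-1, -1, +1) are the right-hand sides from the equations in the notes.

```python
from fractions import Fraction

# Support vectors with the bias entry 1 appended, and their class labels
s = [(2, 1, 1), (2, -1, 1), (4, 0, 1)]
labels = [-1, -1, 1]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def det3(m):
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(a, rhs):
    """Cramer's rule for a 3x3 linear system a * unknowns = rhs."""
    d = det3(a)
    solution = []
    for col in range(3):
        replaced = [[rhs[r] if c == col else a[r][c] for c in range(3)]
                    for r in range(3)]
        solution.append(det3(replaced) / d)
    return solution

# Matrix of dot products: row i is the equation for s[i]
gram = [[Fraction(dot(si, sj)) for sj in s] for si in s]
alphas = solve3(gram, labels)

# w = sum of alpha_i * s_i; the last entry is the bias
w = [sum(alpha * vec[i] for alpha, vec in zip(alphas, s)) for i in range(3)]

print([float(a) for a in alphas])
print([float(v) for v in w])
```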

Date: 01.10.2020

In sklearn's SVC the default kernel is RBF.

Benefit
We can do parameter tuning to increase accuracy.

Disadvantage of SVM
SVM does not handle very large datasets well, because its training time becomes very long.

To see all the built-in datasets in sklearn:

from sklearn import datasets

dir(datasets)

Homework for the cancer dataset

Convert it into a dataframe

Check the 0 and 1 classes

Date: 02.10.2020

Decision tree
Helps to take decisions such as yes or no, profit or loss.

Example: a salesman for a loan company

This is called a decision tree.

It can do either classification or regression; it is mostly used for classification.

Target: to sell the loan

The rectangular boxes are nodes.

Two nodes are very important:

1. Starting node: root node

Where the decision starts
2. Ending node: leaf node
Where the decision has been taken

Step one: find the target

Step two: find the root node

Old = 50 and above, Mid = 20 to 50, New = 20 and below

Age Competition Type Profit

Old Yes Software Down
Old No Software Down
Old No Hardware Down
Mid Yes Software Down
Mid Yes Hardware Down
Mid No Hardware Up
Mid No Software Up
New Yes Software Up
New No Hardware Up
New No software Up

Step 1: find the target attribute

Target attribute: profit

Step 2: find the information gain of the target attribute

IG (information gain; in standard terminology this per-node quantity is the entropy)

If either of the two counts is 0, the IG is 0; if both counts are the same, the IG is 1.

Formula:

IG = -(P/(P+N)) log2(P/(P+N)) - (N/(P+N)) log2(N/(P+N))

P = number of "up" rows, i.e. profit gain

N = number of "down" rows, i.e. profit loss

P = 5

N = 5

IG = 1

Step 3: find the gain of each feature attribute

1. Age:
Information gain (IG) for each of old, mid, new

Age Down Up
Old 3 0
Mid 2 2
New 0 3
For old:
IG = -(3/(3+0)) log2(3/(3+0)) - (0/(3+0)) log2(0/(3+0)), taking 0*log2(0) as 0
IG = 0
Similarly,
for mid IG = 1
for new IG = 0
Find the entropy E(A), the weighted sum of the IGs:
E(A) = ∑ ((Pi+Ni)/(P+N)) * IG(Pi, Ni)

E(A) = ((3+0)/10)*0 + ((2+2)/10)*1 + ((0+3)/10)*0 = 0.4
Gain
Gain = IG(target) - E(A) = 1 - 0.4
Gain = 0.6
2. Competition
Information gain (IG) for yes and no

Comp Down(N) Up(P)
Yes 3 1
No 2 4
For yes and no:
IG = -(P/(P+N)) log2(P/(P+N)) - (N/(P+N)) log2(N/(P+N))
IG(yes) = 0.81127
IG(no) = 0.91829
E(C) = ((1+3)/10)*0.81127 + ((4+2)/10)*0.91829
E(C) = 0.87548
Gain = 1 - 0.87548 = 0.12452
3. Type

Type Up Down
Software 3 3
Hardware 2 2
IG(s) = 1
IG(h) = 1
E(T) = (6/10)*1 + (4/10)*1 = 1
Gain = 1 - 1 = 0

Now compare the gains:

Age = 0.6
Competition = 0.12
Type = 0

As the gain of age is the highest, age is the root node.
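The three gains can be recomputed from the raw table. A minimal sketch: `entropy` and `gain` are made-up helper names, and `entropy` is the quantity the notes call IG for a node.

```python
import math

# (age, competition, type, profit) rows from the table
rows = [
    ("Old", "Yes", "Software", "Down"), ("Old", "No", "Software", "Down"),
    ("Old", "No", "Hardware", "Down"), ("Mid", "Yes", "Software", "Down"),
    ("Mid", "Yes", "Hardware", "Down"), ("Mid", "No", "Hardware", "Up"),
    ("Mid", "No", "Software", "Up"), ("New", "Yes", "Software", "Up"),
    ("New", "No", "Hardware", "Up"), ("New", "No", "Software", "Up"),
]

def entropy(labels):
    """-sum(p * log2(p)) over the label values that actually occur."""
    total = len(labels)
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / total
        result -= p * math.log2(p)
    return result

def gain(rows, column):
    """Entropy of the target minus the weighted entropy after splitting on column."""
    target = [r[-1] for r in rows]
    weighted = 0.0
    for value in set(r[column] for r in rows):
        subset = [r[-1] for r in rows if r[column] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return entropy(target) - weighted

for name, column in [("Age", 0), ("Competition", 1), ("Type", 2)]:
    print(name, round(gain(rows, column), 5))
```

Looping only over the values that occur avoids evaluating log2(0), which matches the convention of treating 0*log2(0) as 0.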

Building a decision tree this way is the ID3 algorithm.

Date: 06.10.2020

[Figure: a dataset is passed to a model (SVM) to produce a prediction]
Ensemble learning: mobile in ss

Random Forest Algorithm

It uses ensemble learning to choose the final prediction.

It uses only decision trees,

i.e. many decision trees combined.

Original dataset

SR A B C Target
1 Y
2 N
3 Y
4 Y
5 N
Bootstrap dataset: duplicates are allowed (rows are sampled with replacement)

Sr A B C Target
2 N
1 Y
1 Y
3 Y
4 Y
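Drawing a bootstrap dataset can be sketched with the standard library. The function name `bootstrap_sample` and the `seed` parameter are assumptions for illustration; only the row number and target columns are filled in, as in the tables above.

```python
import random

def bootstrap_sample(rows, seed=None):
    """Draw len(rows) rows with replacement, so duplicates are allowed."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in rows]

# (row number, target) pairs from the original dataset
original = [(1, "Y"), (2, "N"), (3, "Y"), (4, "Y"), (5, "N")]
sample = bootstrap_sample(original, seed=0)
print(sample)
```

A random forest would draw one such sample per tree and train each decision tree on its own sample.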

Date: 07.10.2020

Absent

Date:08.10.2020

Naive Bayes algorithm

We find the probability of each class.

Used in predicting spam messages.

It uses conditional probability.

It is a classification algorithm; the Gaussian variant handles continuous-valued features.

It has different variants.

In total it has three variants:

1. Bernoulli naive Bayes

For binary classification
Best for success/failure, yes/no, true/false
2. Multinomial naive Bayes
For multiple classes
3. Gaussian naive Bayes
For continuous values

Mathematical approach

Sample of fruits

Fruit (target) Yellow (feature) Sweet (feature) Long (feature) Total
Mango 350 450 0 650
Banana 400 300 350 400
Others 50 100 50 150
Total 800 850 400 1200

Predict: given (yellow, sweet, long), which fruit is this?

Formula to find the probability:

p(A|B) = probability of A when B is true

p(A|B) = p(B|A)*p(A)/p(B)

Step 1: find the probability for mango

Probability for mango:

(yellow, sweet, long) = x (the feature being considered, which changes in each step)

1. Probability of yellow given mango, x = yellow

P(yellow|mango) = p(mango|yellow)*p(yellow)/p(mango)
= (350/800)*(800/1200)/(650/1200)
= (0.4375)*(0.6667)/(0.5417)
= 0.5385
2. x = sweet
p(sweet|mango) = (450/850)*(850/1200)/(650/1200)
= 0.692
3. x = long
p(long|mango) = (0/400)*(400/1200)/(650/1200)
= 0
Total (product of the three) = 0.5385 * 0.692 * 0 = 0

Probability for banana:

1. Probability of yellow given banana, x = yellow

p(yellow|banana) = (400/800)*(800/1200)/(400/1200) = 1
2. x = sweet
p(sweet|banana) = (300/850)*(850/1200)/(400/1200) = 0.75
3. x = long
p(long|banana) = (350/400)*(400/1200)/(400/1200) = 0.875

Total: 1 * 0.75 * 0.875 = 0.6562

Probability for others:

1. Probability of yellow given others, x = yellow

p(yellow|others) = (50/800)*(800/1200)/(150/1200) = 0.333
2. x = sweet
p(sweet|others) = (100/850)*(850/1200)/(150/1200) = 0.667
3. x = long
p(long|others) = (50/400)*(400/1200)/(150/1200) = 0.333
Total: 0.333 * 0.667 * 0.333 = 0.0741

Therefore the predicted fruit is banana.
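The fruit calculation can be sketched as follows. Each Bayes expression above simplifies to count/total for that fruit (for example 350/650 for yellow mango), and the sketch multiplies the three conditional probabilities exactly as the notes do, without a class prior. The names `counts` and `score` are made up for this illustration.

```python
# Counts from the fruit table: how many fruits of each kind have each feature
counts = {
    "Mango":  {"yellow": 350, "sweet": 450, "long": 0,   "total": 650},
    "Banana": {"yellow": 400, "sweet": 300, "long": 350, "total": 400},
    "Others": {"yellow": 50,  "sweet": 100, "long": 50,  "total": 150},
}

def score(fruit, features):
    """Product of p(feature | fruit) = count / total over the given features."""
    result = 1.0
    for feature in features:
        result *= counts[fruit][feature] / counts[fruit]["total"]
    return result

observed = ["yellow", "sweet", "long"]
scores = {fruit: score(fruit, observed) for fruit in counts}
best = max(scores, key=scores.get)

print(scores)
print(best)
```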

Example 2

colour type origin Stolen

Red Sports Domestic Y
Red Sports Domestic N
Red Sports Domestic Y
Yellow Sports Domestic N
Yellow Sports Imported Y
Yellow SUV Imported N
Yellow SUV Imported Y
Yellow SUV Domestic N
Red SUV Imported N
Red sports Imported Y
Red=5 yellow=5 total 10

Sports=6 SUV=4 total=10

Domestic=5 imported=5 total=10

Y=5 n=5 total 10

(red, SUV, domestic) = is it stolen or not?

Prob(red|yes) = (3/5)(5/10)/(5/10) = 0.6

Prob(red|no) = (2/5)(5/10)/(5/10) = 0.4

Prob(yellow|yes) = (2/5)(5/10)/(5/10) = 0.4

Prob(yellow|no) = (3/5)(5/10)/(5/10) = 0.6

For colour: yes: 0.24, no: 0.24

Prob(sports|yes) = (4/6)(6/10)/(5/10) = 0.8

Prob(sports|no) = (2/6)(6/10)/(5/10) = 0.4

Prob(SUV|yes) = (1/4)(4/10)/(5/10) = 0.2

Date: 13.10.2020

K-means clustering
Unsupervised machine learning

Example: Amazon groups its products into clusters such as Home, Electronics, Appliances, Sports and Swim.

Example 1:

Data={2,3,4,10,11,12,20,25,30 }

K=2

Form 2 clusters

Step 1: pick any two random values as the initial means

m1 = 4 (mean value for c1)   m2 = 12 (mean value for c2)

Each value joins the cluster whose mean is nearest. For example, take the value 10 and subtract it from each mean: |10-4| = 6 and |12-10| = 2, therefore 10 belongs to c2. Always take the positive (absolute) difference.

c1={2,3,4}

c2={10,11,12,20,25,30}

Find the mean of c1 and c2:

m1 = (2+3+4)/3 = 3

m2 = 18

The same steps are repeated with the newly calculated m1 and m2:

c1={2,3,4,10}

c2={11,12,20,25,30}

m1 = 4.75   m2 = 19.6

c1={2,3,4,10,11,12}

c2={20,25,30}

m1 = 7   m2 = 25

c1={2,3,4,10,11,12}

c2={20,25,30}

These are the final clusters, as m1 and m2 are the same as before.
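The assign-and-update loop of this example can be sketched in plain Python. `kmeans_1d` is a made-up name, and the sketch assumes no cluster ever becomes empty (true for this data).

```python
def kmeans_1d(data, means, max_iter=100):
    """Repeat assign-then-update until the means stop changing."""
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for value in data:
            # index of the nearest mean, using the absolute difference
            nearest = min(range(len(means)), key=lambda i: abs(value - means[i]))
            clusters[nearest].append(value)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:
            break
        means = new_means
    return clusters, means

data = [2, 3, 4, 10, 11, 12, 20, 25, 30]
clusters, means = kmeans_1d(data, [4, 12])
print(clusters, means)
```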

Euclidean formula

√((XH-H1)**2 + (XW-W1)**2)

H -> height, W -> weight; (XH, XW) is the row being assigned and (H1, W1) is a centroid.

Sr no Height Weight
1 185 72
2 170 56
3 168 60
4 179 68
5 182 72
6 188 77
7 180 71
8 180 70
9 183 84
10 180 88
11 180 67
12 177 76

Form 2 clusters

Cluster Seed height Seed weight Centroid value
C1 185 72 (185,72)
C2 170 56 (170,56)
C1={1}

C2={2}

Row 3: 168,60

For C1:

√((XH-H1)**2 + (XW-W1)**2)

√((168-185)**2 + (60-72)**2) = 20.8087

For C2:

√((168-170)**2 + (60-56)**2) = 4.47

Row 3 is nearer to C2, so it joins C2.

Update the centroid value of C2:

Mean = ((170+168)/2, (56+60)/2) = (169,58)

Cluster Seed height Seed weight Centroid value
C1 185 72 (185,72)
C2 170 56 (169,58)
C1={1}

C2={2,3}

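Row 4: 179,68
For C1:

√((XH-H1)**2 + (XW-W1)**2)

√((179-185)**2 + (68-72)**2) = 7.211

For C2:

√((179-169)**2 + (68-58)**2) = 14.14

Row 4 is nearer to C1, so it joins C1. The centroid of C1 moves to the midpoint of the old centroid and the new row: ((185+179)/2, (72+68)/2) = (182,70)

Cluster Seed height Seed weight Centroid value
C1 185 72 (182,70)
C2 170 56 (169,58)
C1={1,4}

C2={2,3}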

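Row 5: 182,72

For C1:

√((182-182)**2 + (72-70)**2) = 2

For C2:

√((182-169)**2 + (72-58)**2) = 19.105

Row 5 is nearer to C1, so it joins C1. New centroid of C1: midpoint of (182,70) and (182,72) = (182,71)

Cluster Seed height Seed weight Centroid value
C1 185 72 (182,71)
C2 170 56 (169,58)
C1={1,4,5}

C2={2,3}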

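Row 6: 188,77

For C1:

√((188-182)**2 + (77-71)**2) = 8.485

For C2:

√((188-169)**2 + (77-58)**2) = 26.870

Row 6 is nearer to C1, so it joins C1. New centroid of C1: midpoint of (182,71) and (188,77) = (185,74)

Cluster Seed height Seed weight Centroid value
C1 185 72 (185,74)
C2 170 56 (169,58)
C1={1,4,5,6}
C2={2,3}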

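Row 7: 180,71

For C1:

√((180-185)**2 + (71-74)**2) = 5.831

For C2:

√((180-169)**2 + (71-58)**2) = 17.029

Row 7 is nearer to C1, so it joins C1. New centroid of C1: midpoint of (185,74) and (180,71) = (182.5,72.5)

Cluster Seed height Seed weight Centroid value
C1 185 72 (182.5,72.5)
C2 170 56 (169,58)
C1={1,4,5,6,7}

C2={2,3}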

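Row 8: 180,70

For C1:

√((180-182.5)**2 + (70-72.5)**2) = 3.5355

For C2:

√((180-169)**2 + (70-58)**2) = 16.2788

Row 8 is nearer to C1, so it joins C1. New centroid of C1: midpoint of (182.5,72.5) and (180,70) = (181.25,71.25)

Cluster Seed height Seed weight Centroid value
C1 185 72 (181.25,71.25)
C2 170 56 (169,58)
C1={1,4,5,6,7,8}

C2={2,3}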
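The row-by-row procedure can be sketched as below for rows 1 to 8. The function name `sequential_kmeans` is made up. The centroid update is the one used in these notes (midpoint of the old centroid and the newly added row), which is not the same as the plain mean of all members used by standard k-means.

```python
import math

def sequential_kmeans(points, seeds):
    """Assign points one by one; after each assignment move that centroid to the
    midpoint of its old position and the new point (the update used in the notes)."""
    centroids = [tuple(map(float, s)) for s in seeds]
    members = [[i + 1] for i in range(len(seeds))]  # row numbers, 1-based
    for row, point in enumerate(points[len(seeds):], start=len(seeds) + 1):
        distances = [math.dist(point, c) for c in centroids]
        nearest = distances.index(min(distances))
        members[nearest].append(row)
        old = centroids[nearest]
        centroids[nearest] = ((old[0] + point[0]) / 2, (old[1] + point[1]) / 2)
    return members, centroids

# Rows 1 to 8 of the height/weight table; rows 1 and 2 are the seeds
rows = [(185, 72), (170, 56), (168, 60), (179, 68),
        (182, 72), (188, 77), (180, 71), (180, 70)]
members, centroids = sequential_kmeans(rows, seeds=rows[:2])
print(members)
print(centroids)
```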

Date: 14.10.2020

Date: 15.10.2020

NLP: Natural Language Processing

It is a part of computer science, machine learning (including deep learning) and artificial intelligence which deals with human language.

Tokenization
Types

1. Bigram
2. Trigram
3. N-gram
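Bigrams, trigrams and general n-grams are just runs of n consecutive tokens. A minimal sketch, assuming simple whitespace tokenization; `ngrams` here is a hand-written helper, not a library function, and the sample sentence is only an example.

```python
def ngrams(text, n):
    """Split text on whitespace and return every run of n consecutive tokens."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "he is a nice boy"
print(ngrams(sentence, 2))  # bigrams
print(ngrams(sentence, 3))  # trigrams
```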

Library: NLTK (Natural Language Toolkit)

Helps in text analysis

Applications:

1. Sentiment analysis: it analyses the mood, wording or politeness of a text

2. Alexa, Siri, chatbots

NLP has two parts:

1. NLU (Natural language understanding)

It maps the text or sentence to entries in the database
2. NLG (Natural language generation)
It answers with a meaningful sentence

Ambiguity is the kind of error that occurs in NLP

NLU ambiguity

1. Lexical ambiguity
A word which has two meanings
Ex: she is looking for a match
Here "match" has two meanings: a sports match or a partner
2. Syntactic ambiguity
A sentence which has two different meanings because of its grammar or structure
Ex: chicken ready to eat
3. Referential ambiguity
A sentence with an unclear reference
Ex: this is that and that is this

Step 1: pip install nltk

Step 2: import nltk in the Python IDLE shell

Step 3: nltk.download()

An NLTK downloader window opens.

Date: 16.10.2020

Tokenization

Divides the paragraph into small tokens so that only the important things are kept; stop words are then removed.

Stemming

Finds the common part (stem) of words which are similar to each other and reduces them to it. Stemming is faster.

There is a possibility that the resulting word has no meaning.

Ex: history and historical -> histori

finally and finalization -> final

Lemmatization

It is similar to stemming but it always gives meaningful words. It takes more time to generate the words.

Ex: history and historical -> history

Stopwords -> (I, me, your, of, them, for, on, to, ...)

Words which are not useful for the analysis

Date: 17.10.2020

Date: 19.10.2020

Bag of Words

It is a technique which tells which words are important.

For example, to tell whether a review is good or not

It is also called binary filtration

It works with the help of word frequency

YouTube also uses bag of words

Ex.

He is a nice boy 1

She is a nice girl 1

Boy and girl are nice 0

After removing the stop words:

Nice boy

Nice girl

Boy girl nice

Sentence Nice Boy Girl Vector
1 1 1 0 110
2 1 0 1 101
3 1 1 1 111
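The table can be reproduced with a short sketch. The stop-word list is a small hand-picked set for these three sentences only, and `bag_of_words` is a made-up helper that records presence (1) or absence (0) of each vocabulary word.

```python
# Small hand-picked stop-word list, sufficient for this example only
STOP_WORDS = {"he", "she", "is", "a", "and", "are"}

def bag_of_words(sentences, vocabulary):
    """Return one 0/1 vector per sentence: 1 if the vocabulary word occurs."""
    vectors = []
    for sentence in sentences:
        tokens = [t for t in sentence.lower().split() if t not in STOP_WORDS]
        vectors.append([1 if word in tokens else 0 for word in vocabulary])
    return vectors

sentences = ["He is a nice boy", "She is a nice girl", "Boy and girl are nice"]
vocabulary = ["nice", "boy", "girl"]

for vector in bag_of_words(sentences, vocabulary):
    print(vector)
```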

Date: 20.10.2020

Neural Network:

Layers:

1. Input layer
2. Hidden layer
3. Output layer

Date: 22.10.2020

Forward propagation:

Activation function

It converts the input signal to the output signal.

Weights and biases start as random numbers.

f -> indicates the activation function

Hidden layer:
v1 = f(u1*wa + u2*wx + b1)
v2 = f(u1*wb + u2*wy + b2)
v3 = f(u1*wc + u2*wz + b3)

Output layer (final activation O):
O = f(v1*W1 + v2*W2 + v3*W3 + B)
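The forward pass for two inputs, three hidden nodes and one output can be sketched as below. All weights and biases are hypothetical numbers chosen only to make the example runnable, and the sigmoid is used as the activation function f by assumption.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def forward(u, hidden_weights, hidden_biases, out_weights, out_bias, f=sigmoid):
    """u: inputs; hidden_weights[j] holds the weights feeding hidden node j."""
    # hidden layer: v_j = f(sum_i u_i * w_ij + b_j)
    v = [f(sum(ui * w for ui, w in zip(u, weights)) + b)
         for weights, b in zip(hidden_weights, hidden_biases)]
    # output layer: O = f(sum_j v_j * W_j + B)
    return f(sum(vj * w for vj, w in zip(v, out_weights)) + out_bias)

# Hypothetical weights and biases, not values from the notes
output = forward(
    u=[2, 0.5],
    hidden_weights=[[0.1, 0.4], [-0.2, 0.3], [0.5, -0.1]],
    hidden_biases=[0.0, 0.1, -0.1],
    out_weights=[0.3, -0.6, 0.9],
    out_bias=0.05,
)
print(output)
```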

Do neural networks need an activation function?

The answer is both yes and no.

When the data becomes non-linear or the complexity increases, an activation function is necessary.

Activation functions

1. Identity function / linear activation function

Occupies the activation function's place in the node without changing the signal
Equation: f(x) = x for all x
Generates an output signal that is the same as the input signal
Used when the problems are very simple

2. Heaviside activation function / binary step function

Helps in complex decisions
For example, multiple classes
Used in a single-layer network to convert the net input into an output signal that is 0 or 1
Equation: f(x) = 1 if x >= t
          f(x) = 0 if x < t
t is the threshold value

3. Sigmoid activation function

Used with backward propagation
Equation: f(x) = 1 / (1 + e^(-x))

4. Hyperbolic tangent activation function

It is called the tanh activation function
Optimization is easy
Output range: -1 to 1
Equation: f(x) = (1 - e^(-2x)) / (1 + e^(-2x))
5. Ramp activation function / ReLU
Equation: f(x) = x if x > 0
          f(x) = 0 if x <= 0
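The five activation functions can be written out directly from their equations. A minimal sketch; the function names are chosen for this illustration and the default threshold t = 0 is an assumption.

```python
import math

def identity(x):
    return x

def binary_step(x, t=0.0):
    """Heaviside step: 1 when x reaches the threshold t, else 0."""
    return 1 if x >= t else 0

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def tanh_activation(x):
    """(1 - e^(-2x)) / (1 + e^(-2x)), which equals tanh(x)."""
    return (1 - math.exp(-2 * x)) / (1 + math.exp(-2 * x))

def ramp(x):
    """Ramp / ReLU: passes positive inputs, blocks the rest."""
    return x if x > 0 else 0

for f in (identity, binary_step, sigmoid, tanh_activation, ramp):
    print(f.__name__, [round(f(x), 4) for x in (-2, 0, 2)])
```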

Length H (X1) Width W (X2) Colour of flower
2 1.5 R (1)
3 1 B (0)
4 1.5 R (1)
2 0.5 ?

Code in python

Cost function

The difference between the actual data point and the calculated (predicted) data point

1. R2 / squared error
2. MSE (mean squared error)
3. RMSE (root mean squared error)

Date: 23-10-2020

--------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------

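To import and merge many files (1 lakh):

import os
import glob  # for directories
import pandas as pd

os.chdir(r"path of file where all the csv files are saved")

file_extension = ".csv"
all_filenames = [i for i in glob.glob(f"*{file_extension}")]

# read every csv file and collect the dataframes
read_all_data = []
for name in all_filenames:
    df = pd.read_csv(name)
    read_all_data.append(df)

# assumed step: the notes do not show how combine_file is built; pd.concat is one way
combine_file = pd.concat(read_all_data)

os.chdir(r"path of file for output dir")
combine_file.to_csv("combine_file.csv")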

2. MSE (mean squared error)

Formula:
MSE = sum((y^i - yi)^2) / n

3. RMSE (root mean squared error)

Formula:
RMSE = root(sum((y^i - yi)^2) / n)

Here y^i is the predicted value and yi is the actual value.

The cost function does not give a value in percentage; it gives continuous values.

How to reduce error using the cost function:

Actual data: 2

Predicted value = 1.2

Cost = actual data - predicted value = 2 - 1.2 = 0.8
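MSE and RMSE can be computed by hand in a few lines. The sketch below reuses the simple regression data from earlier in these notes (line Y = 22 + 6x); the function names `mse` and `rmse` are assumptions for illustration.

```python
import math

def mse(actual, predicted):
    """Mean of the squared differences between predicted and actual values."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return math.sqrt(mse(actual, predicted))

actual = [20, 40, 50, 40, 50]
predicted = [22 + 6 * x for x in [1, 2, 3, 4, 5]]  # line from the simple regression example

print(mse(actual, predicted))
print(round(rmse(actual, predicted), 4))
```

The squared errors are 64, 36, 100, 36 and 4, so MSE = 240/5 = 48 and RMSE is its square root.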

Date: 24.10.2020

Date: 26.10.2020-last day
