
Python Data Handling with Pandas

The document outlines a series of Python programming tasks related to data manipulation, preprocessing, and machine learning using libraries such as pandas, numpy, and scikit-learn. It covers importing/exporting data, handling missing values, implementing PCA for dimensionality reduction, and various supervised learning algorithms including linear and logistic regression. Each week focuses on different aspects of data science, providing code examples and explanations for practical implementation.

Uploaded by

leebux64
Copyright
© All Rights Reserved

DEPARTMENT OF CSE

Week 1: Write a Python program to import and export data using the pandas library
1. Manual function

import pandas as pd

def load_csv(filepath):
    # Read a comma-separated file; the first line holds the column names
    data = []
    col = []
    checkcol = False
    with open(filepath) as f:
        for val in f.readlines():
            val = val.replace("\n", "")
            val = val.split(',')
            if checkcol is False:
                # first line: column headers
                col = val
                checkcol = True
            else:
                data.append(val)
    df = pd.DataFrame(data=data, columns=col)
    return df

2. numpy.loadtxt function

import numpy as np

df = np.loadtxt('[Link]', delimiter=',')
print(df[:5, :])

3. numpy.genfromtxt()

data = np.genfromtxt('100 Sales [Link]', delimiter=',')
pd.DataFrame(data)

4. pandas.read_csv()

pdDf = pd.read_csv('100 Sales [Link]')
pdDf.head()

5. Pickle

import pickle

with open('[Link]', 'wb') as f:
    pickle.dump(pdDf, f)


WEEK-2: Data preprocessing


1. Handling missing values
• isnull()
• notnull()
• dropna()
• fillna()
• replace()
• interpolate()

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# creating a dataframe from the dictionary
df = pd.DataFrame(scores)

# using isnull() function
df.isnull()

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("[Link]")

# creating bool series True for NaN values
bool_series = pd.isnull(data["Gender"])

# filtering data
# displaying data only with Gender = NaN
data[bool_series]

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# creating a dataframe using the dictionary
df = pd.DataFrame(scores)

# using notnull() function
df.notnull()

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("[Link]")

# creating bool series True for non-NaN values
bool_series = pd.notnull(data["Gender"])

# filtering data
# displaying data only with Gender = Not NaN
data[bool_series]

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# creating a dataframe from the dictionary
df = pd.DataFrame(scores)

# filling missing values with 0 using fillna()
df.fillna(0)

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# creating a dataframe from the dictionary
df = pd.DataFrame(scores)

# filling a missing value with the previous one
# (older pandas versions: df.fillna(method='pad'))
df.ffill()

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# creating a dataframe from the dictionary
df = pd.DataFrame(scores)

# filling a missing value with the next one
# (older pandas versions: df.fillna(method='bfill'))
df.bfill()


WEEK-3: Dimensionality Reduction


1. Implementation of PCA

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline   (Jupyter notebooks only)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# import the breast_cancer dataset
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
data.keys()

# Check the output classes
print(data['target_names'])

# Check the input attributes
print(data['feature_names'])

# construct a dataframe using pandas
df1 = pd.DataFrame(data['data'], columns=data['feature_names'])

# Scale data before applying PCA
scaling = StandardScaler()

# Use fit and transform method
scaling.fit(df1)
Scaled_data = scaling.transform(df1)

# Set the n_components=3
principal = PCA(n_components=3)
principal.fit(Scaled_data)
x = principal.transform(Scaled_data)

# Check the dimensions of data after PCA
print(x.shape)

# Check the values of the eigenvectors
# produced by the principal components
principal.components_

# 2D scatter plot of the first two principal components
plt.figure(figsize=(10, 10))
plt.scatter(x[:, 0], x[:, 1], c=data['target'], cmap='plasma')
plt.xlabel('pc1')
plt.ylabel('pc2')

# import relevant libraries for 3d graph
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 10))

# choose projection 3d for creating a 3d graph
axis = fig.add_subplot(111, projection='3d')

# x[:,0] is pc1, x[:,1] is pc2 while x[:,2] is pc3
axis.scatter(x[:, 0], x[:, 1], x[:, 2], c=data['target'], cmap='plasma')
axis.set_xlabel("PC1", fontsize=10)
axis.set_ylabel("PC2", fontsize=10)
axis.set_zlabel("PC3", fontsize=10)
plt.show()


WEEK-4: Write a python program to demonstrate various data visualisation


# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("[Link]")

# Printing rows 10 to 24 of
# the data frame for visualization
data[10:25]

# importing pandas and numpy packages
import pandas as pd
import numpy as np

# making data frame from csv file
data = pd.read_csv("[Link]")

# will replace NaN values in the dataframe with the value -99
data.replace(to_replace=np.nan, value=-99)

# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [None, 2, 54, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

# Print the dataframe
df

# importing the required module
import matplotlib.pyplot as plt

# x axis values
x = [1, 2, 3]
# corresponding y axis values
y = [2, 4, 1]

# plotting the points
plt.plot(x, y)

# naming the x axis
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')

# giving a title to my graph
plt.title('My first graph!')

# function to show the plot
plt.show()


# Final line of calculateClassProbabilities(info, test)
    return probabilities

# Return the class label with the highest probability for one test row
def predict(info, test):
    probabilities = calculateClassProbabilities(info, test)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

# Predict a label for every row of the test set
def getPredictions(info, test):
    predictions = []
    for i in range(len(test)):
        result = predict(info, test[i])
        predictions.append(result)
    return predictions

# Percentage of test rows whose last column equals the predicted label
def accuracy_rate(test, predictions):
    correct = 0
    for i in range(len(test)):
        if test[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(test))) * 100.0

# driver code
filename = r'E:\user\MACHINE LEARNING\machine learning algos\Naive bayes\[Link]'

# load the file, encode the class labels and convert every value to float
mydata = csv.reader(open(filename, "rt"))
mydata = list(mydata)
mydata = encode_class(mydata)
for i in range(len(mydata)):
    mydata[i] = [float(x) for x in mydata[i]]

# 70% of the data is used for training and 30% for testing
ratio = 0.7
train_data, test_data = splitting(mydata, ratio)
print('Total number of examples are: ', len(mydata))
print('Out of these, training examples are: ', len(train_data))
print("Test examples are: ", len(test_data))

# prepare the model, then test it
info = MeanAndStdDevForClass(train_data)
predictions = getPredictions(info, test_data)
accuracy = accuracy_rate(test_data, predictions)
print("Accuracy of your model is: ", accuracy)
1. Implementation of SVM Classification

# importing make_blobs from scikit-learn
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt

# creating dataset X containing n_samples
# Y containing two classes
X, Y = make_blobs(n_samples=500, centers=2, random_state=0, cluster_std=0.40)

# plotting scatters
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')
plt.show()

# creating linspace between -1 to 3.5
xfit = np.linspace(-1, 3.5)

# plotting scatter
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')

# plot a line between the different sets of data
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
                     color='#AAAAAA', alpha=0.4)

plt.xlim(-1, 3.5)
plt.show()


# importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

x = pd.read_csv(r"C:\...\[Link]")
a = np.array(x)
y = a[:, 30]  # classes having 0 and 1
x = np.column_stack(([Link],[Link]))
x.shape
print(x, y)


WEEK-5: Supervised Learning


1. Implementation of Linear Regression
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)
    # cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
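
The quantities computed in estimate_coef are the closed-form least-squares estimates for simple linear regression. As a sketch in standard notation (the symbols n, x-bar and y-bar below correspond to n, m_x and m_y in the listing):

```latex
\begin{aligned}
SS_{xy} &= \sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}, &
SS_{xx} &= \sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2, \\
b_1 &= \frac{SS_{xy}}{SS_{xx}}, &
b_0 &= \bar{y} - b_1\,\bar{x}.
\end{aligned}
```

For the ten points used in main(), n = 10, x-bar = 4.5 and y-bar = 6.5, so SS_xy = 389 - 292.5 = 96.5 and SS_xx = 285 - 202.5 = 82.5. This gives b_1 = 96.5 / 82.5, roughly 1.170, and b_0 = 6.5 - 1.170 * 4.5, roughly 1.236.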


WEEK-6 : Implementation of Logistic regression

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings("ignore")

class LogitRegression():
    def __init__(self, learning_rate, iterations):
        self.learning_rate = learning_rate
        self.iterations = iterations

    # train the model with gradient descent
    def fit(self, X, Y):
        # m = number of training examples, n = number of features
        self.m, self.n = X.shape
        self.W = np.zeros(self.n)
        self.b = 0
        self.X = X
        self.Y = Y
        for i in range(self.iterations):
            self.update_weights()
        return self

    # one gradient-descent step
    def update_weights(self):
        # sigmoid of the linear score
        A = 1 / (1 + np.exp(-(self.X.dot(self.W) + self.b)))
        # gradients of the loss with respect to W and b
        tmp = (A - self.Y.T)
        tmp = np.reshape(tmp, self.m)
        dW = np.dot(self.X.T, tmp) / self.m
        db = np.sum(tmp) / self.m
        self.W = self.W - self.learning_rate * dW
        self.b = self.b - self.learning_rate * db
        return self

    # predict class 1 when the sigmoid output exceeds 0.5
    def predict(self, X):
        Z = 1 / (1 + np.exp(-(X.dot(self.W) + self.b)))
        Y = np.where(Z > 0.5, 1, 0)
        return Y

def main():
    df = pd.read_csv("[Link]")
    X = df.iloc[:, :-1].values
    Y = df.iloc[:, -1:].values
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=1/3, random_state=0)
    # our model and the scikit-learn model, trained on the same split
    model = LogitRegression(learning_rate=0.01, iterations=1000)
    model.fit(X_train, Y_train)
    model1 = LogisticRegression()
    model1.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    Y_pred1 = model1.predict(X_test)
    # count the correct predictions of both models
    correctly_classified = 0
    correctly_classified1 = 0
    count = np.size(Y_pred)
    for i in range(count):
        if Y_test[i] == Y_pred[i]:
            correctly_classified = correctly_classified + 1
        if Y_test[i] == Y_pred1[i]:
            correctly_classified1 = correctly_classified1 + 1
    print("Accuracy on test set by our model : ",
          (correctly_classified / count) * 100)
    print("Accuracy on test set by sklearn model : ",
          (correctly_classified1 / count) * 100)

if __name__ == "__main__":
    main()
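
The update_weights step in the listing can be read as one iteration of gradient descent on the average log-loss. A sketch of the corresponding formulas, where the learning rate is written as alpha (an assumed symbol for learning_rate) and m is the number of training examples:

```latex
\begin{aligned}
A &= \sigma(XW + b) = \frac{1}{1 + e^{-(XW + b)}}, \\
dW &= \frac{1}{m}\, X^{\top}(A - Y), \qquad
db = \frac{1}{m} \sum_{i=1}^{m} (A_i - Y_i), \\
W &\leftarrow W - \alpha\, dW, \qquad
b \leftarrow b - \alpha\, db.
\end{aligned}
```

The prediction rule then assigns class 1 whenever the sigmoid output exceeds 0.5, which is the same as the linear score XW + b being positive.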

WEEK-7: Supervised Learning


1. Implementation of Decision tree classification
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Function importing the dataset
def importdata():
    balance_data = pd.read_csv(
        '[Link]' + 'databases/balance-scale/[Link]',
        sep=',', header=None)
    # Printing the dataset length, shape and first observations
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)
    print("Dataset: ", balance_data.head())
    return balance_data

# Function to split the dataset
def splitdataset(balance_data):
    # Column 0 is the target; columns 1 to 4 are the features
    X = balance_data.values[:, 1:5]
    Y = balance_data.values[:, 0]
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)
    return X, Y, X_train, X_test, y_train, y_test

# Function to train with the gini criterion
def train_using_gini(X_train, X_test, y_train):
    clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                      max_depth=3, min_samples_leaf=5)
    clf_gini.fit(X_train, y_train)
    return clf_gini

# Function to train with the entropy criterion
def train_using_entropy(X_train, X_test, y_train):
    clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                         max_depth=3, min_samples_leaf=5)
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

# Function to make predictions
def prediction(X_test, clf_object):
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))
    print("Accuracy : ", accuracy_score(y_test, y_pred)*100)
    print("Report : ", classification_report(y_test, y_pred))

# Driver code
def main():
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)
    print("Results Using Gini Index:")
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)
    print("Results Using Entropy:")
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)

if __name__ == "__main__":
    main()
2. Implementation of K-nearest Neighbor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt

# load the iris dataset: features in X, class labels in y
iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# try k = 1 to 8 and record the accuracy on both sets
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

plt.plot(neighbors, test_accuracy, label='Testing dataset Accuracy')
plt.plot(neighbors, train_accuracy, label='Training dataset Accuracy')
plt.legend()
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.show()


WEEK-8

Implementation of Naïve Bayes classifier algorithm


import math
import random
import csv

# the class names in the last column are replaced by numeric codes
def encode_class(mydata):
    classes = []
    for i in range(len(mydata)):
        if mydata[i][-1] not in classes:
            classes.append(mydata[i][-1])
    for i in range(len(classes)):
        for j in range(len(mydata)):
            if mydata[j][-1] == classes[i]:
                mydata[j][-1] = i
    return mydata

# split the data into training and test sets
def splitting(mydata, ratio):
    train_num = int(len(mydata) * ratio)
    train = []
    # initially the test set holds the whole dataset
    test = list(mydata)
    while len(train) < train_num:
        # move a randomly chosen row from test to train
        index = random.randrange(len(test))
        train.append(test.pop(index))
    return train, test

# group the data rows under each class
def groupUnderClass(mydata):
    groups = {}
    for i in range(len(mydata)):
        if mydata[i][-1] not in groups:
            groups[mydata[i][-1]] = []
        groups[mydata[i][-1]].append(mydata[i])
    return groups

# mean of a list of numbers
def mean(numbers):
    return sum(numbers) / float(len(numbers))

# sample standard deviation of a list of numbers
def std_dev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

def MeanAndStdDev(mydata):
    info = [(mean(attribute), std_dev(attribute)) for attribute in zip(*mydata)]
    # the last column is the class label, so its summary is dropped
    del info[-1]
    return info

# mean and standard deviation of each attribute, computed per class
def MeanAndStdDevForClass(mydata):
    info = {}
    groups = groupUnderClass(mydata)
    for classValue, instances in groups.items():
        info[classValue] = MeanAndStdDev(instances)
    return info

# Gaussian probability density function
def calculateGaussianProbability(x, mean, stdev):
    expo = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * expo

# product of the per-attribute densities for each class
def calculateClassProbabilities(info, test):
    probabilities = {}
    for classValue, classSummaries in info.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, std_dev = classSummaries[i]
            x = test[i]
            probabilities[classValue] *= calculateGaussianProbability(x, mean, std_dev)
    return probabilities


Week-9: Implementation of K-nearest Neighbor


from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt

# load the iris dataset: features in X, class labels in y
iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# try k = 1 to 8 and record the accuracy on both sets
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

plt.plot(neighbors, test_accuracy, label='Testing dataset Accuracy')
plt.plot(neighbors, train_accuracy, label='Training dataset Accuracy')
plt.legend()
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.show()


WEEK-10: Build Artificial Neural Network model with back propagation


Let’s first understand the term neural network. In a neural network, each neuron receives inputs, computes a weighted sum over them, passes that sum through an activation function, and sends the result on to the next neuron.

Requirements:
Python: to run our script
Pip: necessary to install Python packages

pip install tensorflow
pip install keras

# Importing libraries
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Conv1D
from keras.layers import MaxPooling1D
from keras.layers import Embedding
from keras.preprocessing import sequence

# Our dictionary will contain only the top 7000 words appearing most frequently
top_words = 7000

# Now we split our data-set into training and test data
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

# Looking at the nature of training data
print(X_train[0])
print(y_train[0])

print('Shape of training data: ')
print(X_train.shape)
print(y_train.shape)

print('Shape of test data: ')
print(X_test.shape)
print(y_test.shape)


Output :
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36,
256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172,
112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192,
50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16,
43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62,
386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12,
16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28,
77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766,
5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4,
381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134,
476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65,
16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19,
178, 32]
1
Shape of training data:
(25000,)
(25000,)
Shape of test data:
(25000,)
(25000,)

# Padding the data samples to a maximum review length in words
max_words = 450
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

# Building the CNN Model
model = Sequential()  # initializing the Sequential model for the CNN

# Adding the embedding layer, which takes a maximum of 450 words as input
# and provides a 32-dimensional output for those words that belong to the
# top_words dictionary
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Conv1D(32, 3, padding='same', activation='relu'))
model.add(MaxPooling1D())
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.summary()


WEEK-11
Implementing Random Forest
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv('[Link]')
print(data)

# Assumption: the listing never defines x and y; here column 1 is taken as
# the position level (feature) and column 2 as the salary (target)
x = data.iloc[:, 1:2].values
y = data.iloc[:, 2].values

# Fitting Random Forest Regression to the dataset
# import the regressor
from sklearn.ensemble import RandomForestRegressor

# create regressor object
regressor = RandomForestRegressor(n_estimators=100, random_state=0)

# fit the regressor with x and y data
regressor.fit(x, y)

# test the output by changing values
Y_pred = regressor.predict(np.array([6.5]).reshape(1, 1))

# Visualising the Random Forest Regression results
# arange creates a range of values from the min value of x to the max
# value of x with a difference of 0.01 between two consecutive values
X_grid = np.arange(x.min(), x.max(), 0.01)

# reshape the data into a len(X_grid)*1 array,
# i.e. make a column out of the X_grid values
X_grid = X_grid.reshape((len(X_grid), 1))

# Scatter plot for original data
plt.scatter(x, y, color='blue')

# plot predicted data
plt.plot(X_grid, regressor.predict(X_grid), color='green')
plt.title('Random Forest Regression')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

WEEK-11(B) : Model Selection, Bagging and Boosting


1. Cross Validation
# importing KFold from sklearn
# (the older sklearn.cross_validation module is not available in current scikit-learn)
from sklearn.model_selection import KFold

# value of K is 10; train_set is assumed to be loaded already
kfold = KFold(n_splits=10)
data = kfold.split(train_set)
2. Implementing AdaBoost
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
import warnings
warnings.filterwarnings("ignore")

# Reading the dataset from the csv file
data = pd.read_csv("[Link]")

# Printing the shape of the dataset
print(data.shape)

# Dropping the Id column; the last column is the target
data = data.drop('Id', axis=1)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
print("Shape of X is %s and shape of y is %s" % (X.shape, y.shape))

total_classes = y.nunique()
print("Number of unique species in dataset are: ", total_classes)

distribution = y.value_counts()
print(distribution)

X_train, X_val, Y_train, Y_val = train_test_split(X, y, test_size=0.25, random_state=28)

# Creating and fitting the AdaBoost classifier with default parameters
# (this step is missing from the listing; adb_model is the name used below)
adb_model = AdaBoostClassifier().fit(X_train, Y_train)

print("The accuracy of the model on validation set is", adb_model.score(X_val, Y_val))


WEEK-12: Unsupervised Learning


Implementing K-means Clustering
import math
import sys
from random import shuffle, uniform

def ReadData(fileName):
    # Read the file, splitting by lines
    f = open(fileName, 'r')
    lines = f.read().splitlines()
    f.close()
    items = []
    # skip the header line; the last column of each row is left out
    for i in range(1, len(lines)):
        line = lines[i].split(',')
        itemFeatures = []
        for j in range(len(line) - 1):
            # Convert feature value to float
            v = float(line[j])
            # Add feature value to the list
            itemFeatures.append(v)
        items.append(itemFeatures)
    shuffle(items)
    return items

def FindColMinMax(items):
    n = len(items[0])
    minima = [sys.maxsize for i in range(n)]
    maxima = [-sys.maxsize - 1 for i in range(n)]
    for item in items:
        for f in range(len(item)):
            if item[f] < minima[f]:
                minima[f] = item[f]
            if item[f] > maxima[f]:
                maxima[f] = item[f]
    return minima, maxima


def InitializeMeans(items, k, cMin, cMax):
    # Initialize means to random numbers between
    # the min and max of each column/feature
    f = len(items[0])  # number of features
    means = [[0 for i in range(f)] for j in range(k)]
    for mean in means:
        for i in range(len(mean)):
            # Set value to a random float
            # (adding +-1 to avoid a wide placement of a mean)
            mean[i] = uniform(cMin[i] + 1, cMax[i] - 1)
    return means

def EuclideanDistance(x, y):
    S = 0  # The sum of the squared differences of the elements
    for i in range(len(x)):
        S += math.pow(x[i] - y[i], 2)
    # The square root of the sum
    return math.sqrt(S)

def UpdateMean(n, mean, item):
    # Running-mean update after the n-th item joins the cluster
    for i in range(len(mean)):
        m = mean[i]
        m = (m*(n - 1) + item[i]) / float(n)
        mean[i] = round(m, 3)
    return mean

def Classify(means, item):
    # Classify item to the mean with minimum distance
    minimum = sys.maxsize
    index = -1
    for i in range(len(means)):
        # Find distance from item to mean
        dis = EuclideanDistance(item, means[i])
        if dis < minimum:
            minimum = dis
            index = i
    return index

def CalculateMeans(k, items, maxIterations=100000):
    # Find the minima and maxima for columns
    cMin, cMax = FindColMinMax(items)
    # Initialize means at random points
    means = InitializeMeans(items, k, cMin, cMax)
    # Initialize clusters, the array to hold
    # the number of items in a class
    clusterSizes = [0 for i in range(len(means))]
    # An array to hold the cluster an item is in
    belongsTo = [0 for i in range(len(items))]
    # Calculate means
    for e in range(maxIterations):
        # If no change of cluster occurs, halt
        noChange = True
        for i in range(len(items)):
            item = items[i]
            # Classify item into a cluster and update the
            # corresponding means.
            index = Classify(means, item)
            clusterSizes[index] += 1
            cSize = clusterSizes[index]
            means[index] = UpdateMean(cSize, means[index], item)
            # Item changed cluster
            if index != belongsTo[i]:
                noChange = False
            belongsTo[i] = index
        # Nothing changed, return
        if noChange:
            break
    return means

def FindClusters(means, items):
    clusters = [[] for i in range(len(means))]  # Init clusters
    for item in items:
        # Classify item into a cluster
        index = Classify(means, item)
        # Add item to cluster
        clusters[index].append(item)
    return clusters
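
For comparison with the listing above, here is a compact, self-contained sketch of the batch (Lloyd's) form of k-means. It is not the manual's method: the listing updates each mean incrementally as items are classified and starts from random means, whereas this sketch recomputes every mean from all of its members after each pass and, as a stated assumption for reproducibility, uses the first k points as the initial means. The function name kmeans and the sample points are hypothetical.

```python
import math


def kmeans(points, k, max_iterations=100):
    """Batch (Lloyd's) k-means on a list of equal-length numeric tuples."""
    # Assumption: the first k points serve as the initial means.
    means = [list(p) for p in points[:k]]
    assignment = [-1] * len(points)
    for _ in range(max_iterations):
        changed = False
        # Assignment step: attach every point to its nearest mean.
        for idx, p in enumerate(points):
            nearest = min(range(k), key=lambda c: math.dist(p, means[c]))
            if nearest != assignment[idx]:
                assignment[idx] = nearest
                changed = True
        if not changed:
            break
        # Update step: move each mean to the centroid of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous mean
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return means, assignment


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.8, 1.2), (9.2, 8.8)]
means, assignment = kmeans(points, k=2)
print(means)
print(assignment)
```

With these points the loop stops on the second pass, because no point changes cluster once the means sit near (1, 1) and (9, 9).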


Common questions


Preprocessing functions prepare a dataset by managing missing values:
1. **isnull()**: Identifies NaN values in the dataset, showing which fields need treatment.
2. **fillna()**: Imputes missing values with a constant or a computed substitute (e.g., the mean or median), or by propagating neighboring values.
3. **dropna()**: Removes entries with missing values, which reduces the dataset size but ensures that no missing data affects model training.
These functions support data cleaning and more stable machine learning models by resolving inconsistencies in the input data.

The Naïve Bayes algorithm predicts by estimating the probability of each class with Bayes' Theorem, using the likelihood of the data under each class-conditional distribution. For continuous data it uses Gaussian distributions: for a given feature, it evaluates the probability density function with that feature's per-class mean and standard deviation. For each class, the product of the per-feature probabilities is calculated, and the class with the highest value is selected. This process assumes that features are independent given the class, an assumption that simplifies the calculation but can be limiting where dependencies exist.
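
The prediction rule described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the manual's full program: the names gaussian_pdf and predict_label and the summaries dictionary are hypothetical, and, like the manual's listing, the sketch multiplies the per-feature densities without a class prior.

```python
import math


def gaussian_pdf(x, mean, stdev):
    """Density of the normal distribution N(mean, stdev^2) at x."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)


def predict_label(summaries, row):
    """Return the class with the largest product of per-feature densities."""
    scores = {}
    for label, feature_stats in summaries.items():
        score = 1.0
        for value, (mean, stdev) in zip(row, feature_stats):
            score *= gaussian_pdf(value, mean, stdev)
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical per-class (mean, stdev) summaries for two features.
summaries = {
    0: [(1.0, 0.5), (10.0, 2.0)],
    1: [(3.0, 0.5), (20.0, 2.0)],
}
label, scores = predict_label(summaries, [1.2, 11.0])
print(label, scores)
```

The test row lies close to the class 0 means on both features, so class 0 receives the larger score. For many features, summing log densities instead of multiplying densities avoids numerical underflow.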

Implementing an SVM involves several key steps:
1. **Data Preparation**: Organizing data into input features (X) and output targets (Y), often with preprocessing such as scaling.
2. **Model Training**: Using the SVM module of 'sklearn' to fit the model to the data, often with parameter tuning (e.g., kernel type).
3. **Visualization of Decision Boundaries**: Plotting decision boundaries helps to understand the model's classification behavior, to identify potential overfitting or underfitting, and to check that the classification results make intuitive sense.
Decision boundaries show how effectively the classes are separated and illustrate the margin-maximization approach of the SVM algorithm.

Principal Component Analysis (PCA) in Python is implemented using the 'sklearn' library. The process involves standardizing the data with 'StandardScaler', fitting the PCA model to the standardized data with 'PCA(n_components)', and transforming the data to its principal components. PCA reduces the dimensionality of the dataset while preserving as much variance as possible, which simplifies models and reduces computational cost. It also aids in visualizing high-dimensional data and in combating the curse of dimensionality by keeping the directions of greatest variance.

Data scaling, typically with 'StandardScaler', centers every feature around zero with unit variance before PCA is applied. This matters because PCA is sensitive to the relative scaling of the input variables: without scaling, features with larger magnitudes dominate the components. Proper scaling lets PCA identify the directions that are driven by variance rather than by units, resulting in effective dimensionality reduction. The transformed dataset preserves meaningful variance, which improves interpretability and computational efficiency and supports further analytics or machine learning tasks.
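
To make the effect of standardization concrete, the following sketch computes z-scores with the standard library only. It assumes the population standard deviation, which is the convention StandardScaler uses; the helper name standardize and the two feature columns are hypothetical. A constant column (zero standard deviation) is not handled here.

```python
import statistics


def standardize(column):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(column)
    stdev = statistics.pstdev(column)
    return [(value - mean) / stdev for value in column]


# Hypothetical features measured on very different scales.
area = [500.0, 1000.0, 1500.0, 2000.0]
smoothness = [0.08, 0.10, 0.12, 0.14]

# Before scaling, the variance of area is far larger than that of smoothness.
print(statistics.pvariance(area), statistics.pvariance(smoothness))

scaled_area = standardize(area)
scaled_smoothness = standardize(smoothness)

# After scaling, both columns have mean 0 and variance 1 (up to rounding).
print(statistics.pvariance(scaled_area), statistics.pvariance(scaled_smoothness))
```

Without this step, a variance-based method such as PCA would align its first component almost entirely with the area column simply because of its units.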

Implementing a decision tree involves the following steps:
1. **Data Splitting**: Dividing the dataset into training and testing subsets with functions like 'train_test_split'.
2. **Model Training**: Fitting a decision tree classifier to the data, using either the 'gini' or the 'entropy' criterion to measure the quality of a split.
3. **Prediction and Evaluation**: Using the trained model to make predictions on test data and evaluating its performance with metrics such as accuracy, the confusion matrix, and the classification report.
The choice between 'gini' and 'entropy' can affect the result: both measure node impurity ('gini' through the Gini impurity, 'entropy' through information gain), and they can lead to different tree structures and generalization behavior.
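
The two split criteria can be computed by hand for a list of class labels. The sketch below uses the standard definitions (Gini impurity as one minus the sum of squared class proportions, entropy as the Shannon entropy in bits); the function names and the sample label lists are hypothetical.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of the samples that belongs to each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


pure = ["L", "L", "L", "L"]    # a node holding a single class
mixed = ["L", "L", "R", "R"]   # a node split evenly between two classes
print(gini(pure), entropy(pure))
print(gini(mixed), entropy(mixed))
```

Both measures are zero for a pure node and largest for an evenly mixed one (0.5 and 1.0 respectively with two classes), which is why the two criteria often, though not always, select the same splits.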

Model accuracy is assessed by how well the predictions match the actual labels. In the context of the Naïve Bayes classifier, accuracy is calculated as the fraction of correctly predicted instances over the total number of instances. Several factors can influence the reported accuracy, including data quality, violations of the feature independence assumption, class imbalance, and parameter settings. Overfitting to the training data can also skew accuracy metrics, so it is essential to validate models on unseen data (e.g., with cross-validation) to ensure that the reported accuracy is reliable and generalizes.
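
A small sketch of both ideas, accuracy as a percentage and the index bookkeeping behind k-fold cross-validation, using only plain Python. The helper names accuracy and kfold_indices are hypothetical; the folds here are contiguous and unshuffled, which is a simplifying assumption.

```python
def accuracy(actual, predicted):
    """Percentage of positions where the prediction equals the true label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)


def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs for contiguous folds."""
    fold_sizes = [n_samples // n_folds] * n_folds
    for i in range(n_samples % n_folds):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
for train, test in kfold_indices(5, 2):
    print(train, test)
```

Each sample lands in exactly one test fold, so averaging the per-fold accuracies gives an estimate based entirely on data the model did not see during the corresponding training run.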

Several methods can be used to import CSV data in Python, including a manual implementation (plain file reading or the 'csv' library), 'numpy.loadtxt', 'numpy.genfromtxt', and 'pandas.read_csv'. Each method has its considerations:
1. **Manual Function**: Offers full control over the import process but requires manual handling of data and headers.
2. **Numpy Loadtxt/Genfromtxt**: Useful for numerical data but may require additional handling for complex CSV structures. 'genfromtxt' is more robust for missing values.
3. **Pandas Read_csv**: Versatile and efficient; it handles headers, missing values, and data types automatically and is often the preferred choice because of its simplicity and power.
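
As an illustration of the manual route, the sketch below reads and writes CSV data with the standard-library csv module. The column names and values are made up, and io.StringIO stands in for a real file so that the example runs without any data file.

```python
import csv
import io

# Hypothetical CSV content; io.StringIO stands in for an open file.
raw = io.StringIO("Region,Units Sold\nAsia,120\nEurope,\nAfrica,75\n")

rows = []
for record in csv.DictReader(raw):
    # Empty fields arrive as '' and are mapped to None (missing).
    units = record["Units Sold"]
    record["Units Sold"] = int(units) if units else None
    rows.append(record)

print(rows)

# Export: write the records back out, header first.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Region", "Units Sold"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Type conversion and missing-value handling are explicit here, which is exactly the work that 'pandas.read_csv' performs automatically.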

Data preprocessing improves machine learning model quality by handling missing data to ensure dataset completeness and reliability. Techniques such as 'isnull()', 'notnull()', 'dropna()', 'fillna()', 'replace()', and 'interpolate()' are employed to manage NaN values, which can introduce bias and reduce model accuracy if not addressed. For instance, 'fillna()' can replace missing values with statistical substitutes (mean, median) or carry forward the last valid observation ('pad' method), maintaining data continuity. Effective preprocessing ensures the model is trained on a representative dataset, increasing its predictive performance.

The 'fillna()' method using 'pad' (carry forward) or 'bfill' (carry backward) provides continuity by filling NaN values with previously or subsequently observed data, respectively. This technique assumes that neighboring observations are representative of the missing one, which may not always hold. While these methods prevent data loss, they can introduce bias if that assumption is not met, leading to skewed results. In downstream machine learning tasks, such bias can affect predictive performance because the dataset may not accurately reflect the underlying distributions, so the fill method should be chosen carefully based on domain knowledge.
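
What 'pad' and 'bfill' do to a single column can be shown with plain lists. This is a sketch of the behaviour, not pandas' implementation; None marks a missing value, the function names are hypothetical, and the sample columns reuse the scores from the Week 2 examples.

```python
def forward_fill(values):
    """Replace each None with the most recent non-missing value (like 'pad')."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def backward_fill(values):
    """Replace each None with the next non-missing value (like 'bfill')."""
    return forward_fill(values[::-1])[::-1]


first_score = [100, 90, None, 95]
third_score = [None, 40, 80, 98]
print(forward_fill(first_score))
print(backward_fill(first_score))
print(forward_fill(third_score))
```

A missing value at the very start of a column has nothing before it to carry forward, so forward filling leaves it missing; the same holds for a trailing gap under backward filling.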
