INDEX
Sr. No.  Practical  Date  Sign
1 Install, configure and run Hadoop and HDFS, and explore HDFS.
2 Implement decision tree classification techniques.
3 Implement SVM classification techniques.
4 Implement a regression model.
5 Implement simple linear regression.
6 Implement multiple linear regression.
7 Implement logistic regression.
8 Read a datafile grades_km_input.csv and apply k-means clustering.
9 Perform the Apriori algorithm using the Groceries dataset from the R arules package.
Shanti Chourasiya
Roll no.2023ITI1
Practical 1
Aim: Install, configure and run Hadoop and HDFS, and explore HDFS on Windows
Code:
Steps to Install Hadoop
1. Install Java JDK 1.8
2. Download Hadoop and extract and place under C drive
3. Set Path in Environment Variables
4. Config files under Hadoop directory
5. Create folder datanode and namenode under data directory
6. Edit HDFS and YARN files
7. Set Java Home environment in Hadoop environment
8. Setup complete. Test by executing start-all.cmd
There are two ways to install Hadoop, i.e.
– Single node
– Multi node
Here, we use a single-node cluster: all daemons run on localhost.
1. Install Java
– Java JDK Link to download
[Link]
– extract and install Java in C:\Java
– open cmd and type -> javac -version
2. Download Hadoop
[Link]
[Link]
Right-click the downloaded archive -> Show more options -> 7-Zip -> extract to C:\Hadoop-3.3.0\
3. Set the JAVA_HOME environment variable
4. Set the HADOOP_HOME environment variable
Click New under both user variables and system variables to create each one.
Then, under user variables, select Path -> Edit and add the Hadoop and Java paths up to their 'bin' folders.
Click OK on each of the three dialogs.
5. Configurations
Edit file C:/Hadoop-3.3.0/etc/hadoop/core-site.xml,
paste the XML below into the file and save
======================================================
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
======================================================
Rename “mapred-site.xml.template” to “mapred-site.xml”, then edit the file
C:/Hadoop-3.3.0/etc/hadoop/mapred-site.xml, paste the XML below and save it.
======================================================
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
======================================================
Create folder “data” under “C:\Hadoop-3.3.0”
Create folder “datanode” under “C:\Hadoop-3.3.0\data”
Create folder “namenode” under “C:\Hadoop-3.3.0\data”
======================================================
Edit file C:/Hadoop-3.3.0/etc/hadoop/hdfs-site.xml,
paste the XML below and save this file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/hadoop-3.3.0/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/hadoop-3.3.0/data/datanode</value>
</property>
</configuration>
======================================================
Edit file C:/Hadoop-3.3.0/etc/hadoop/yarn-site.xml,
paste the XML below and save this file.
<configuration>
</configuration>
======================================================
6. Edit file C:/Hadoop-3.3.0/etc/hadoop/hadoop-env.cmd
Find “JAVA_HOME=%JAVA_HOME%” and replace it with
set JAVA_HOME="C:\Java\jdk1.8.0_361"
======================================================
7. Download the “redistributable” package
Download and run VC_redist.x64.exe
This is Microsoft's redistributable package of the Visual C++ runtime for 64-bit applications. It
contains shared code that applications built with Visual C++ expect to find on the Windows
computer they run on.
8. Hadoop Configurations
Download bin folder from
[Link]
– Copy the bin folder to C:\Hadoop-3.3.0, replacing the existing bin folder.
9. Copy "[Link]" from C:\Hadoop-3.3.0\share\hadoop\yarn\timelineservice to the
C:\Hadoop-3.3.0\share\hadoop\yarn folder.
10. Format the NameNode
– Open cmd with ‘Run as Administrator’ and type the command “hdfs namenode -format”
11. Testing
– Open cmd with ‘Run as Administrator’ and change directory to C:\Hadoop-3.3.0\sbin
– type start-all.cmd
OR
– type start-dfs.cmd
– type start-yarn.cmd
– Four more command windows open, one each for the DataNode, NameNode, ResourceManager and
NodeManager
Output:
12. Type the jps command at the command prompt; you will get the following output.
13. Run [Link] from any browser
Or [Link]
Practical 2
Aim: Implement decision tree classification techniques.
# 'datasets' ships with base R, so only the add-on packages need installing
install.packages('caTools')
install.packages('party')
install.packages('dplyr')
install.packages('magrittr')
library(datasets)
library(caTools)
library(party)
library(dplyr)
library(magrittr)
data("readingSkills")
head(readingSkills)
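# Split on the outcome vector; passing the whole data frame would split its columns, not its rows
sample_data = sample.split(readingSkills$nativeSpeaker, SplitRatio = 0.8)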
train_data <- subset(readingSkills, sample_data == TRUE)
test_data <- subset(readingSkills, sample_data == FALSE)
model<- ctree(nativeSpeaker ~ ., train_data)
plot(model)
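The practical stops at plotting the tree. A natural next step is to check how well the tree classifies the held-out rows. The sketch below is one way to do that, assuming the party package is installed. The `accuracy` helper, the seed value and the use of base R's `sample()` for the split are illustrative choices made here, not part of the original practical.

```r
# Hypothetical helper: share of predictions that match the true labels
accuracy <- function(actual, predicted) {
  mean(as.character(actual) == as.character(predicted))
}

if (requireNamespace("party", quietly = TRUE)) {
  library(party)
  data("readingSkills", package = "party")

  set.seed(42)  # arbitrary seed, only for reproducibility
  n <- nrow(readingSkills)
  train_rows <- sample(n, size = round(0.8 * n))  # 80% of the row indices
  train_data <- readingSkills[train_rows, ]
  test_data <- readingSkills[-train_rows, ]

  model <- ctree(nativeSpeaker ~ ., data = train_data)
  pred <- predict(model, newdata = test_data)

  # Confusion table: rows are true classes, columns are predicted classes
  print(table(Actual = test_data$nativeSpeaker, Predicted = pred))
  print(accuracy(test_data$nativeSpeaker, pred))
}
```

The helper compares labels as character strings so that it also works when the two inputs are factors with different level sets.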
Practical 3
Aim:- Implement SVM classification techniques
# Install all necessary packages
install.packages("caret")
install.packages("ggplot2")
install.packages("GGally")
install.packages("psych")
install.packages("ggpubr")
install.packages("reshape")
# Load all necessary packages
library(caret)
library(ggplot2)
library(GGally)
library(psych)
library(ggpubr)
library(reshape)
# Load the dataset and preview the first rows
df <- read.csv("D:\\[Link]")
head(df)
# Count the missing values
sum(is.na(df))
# Number of rows and columns
dim(df)
# Class of each column
sapply(df, class)
# Summary statistics for each column
summary(df)
# Code
a <- ggplot(data = df, aes(x = Pregnancies)) +
geom_histogram( color = "red", fill = "blue", alpha = 0.1) +
geom_density()
b <- ggplot(data = df, aes(x = Glucose)) +
geom_histogram( color = "red", fill = "blue", alpha = 0.1) +
geom_density()
c <- ggplot(data = df, aes(x = BloodPressure)) +
geom_histogram( color = "red", fill = "blue", alpha = 0.1) +
geom_density()
d <- ggplot(data = df, aes(x = SkinThickness)) +
geom_histogram( color = "red", fill = "blue", alpha = 0.1) +
geom_density()
e <- ggplot(data = df, aes(x = Insulin)) +
geom_histogram( color = "red", fill = "blue", alpha = 0.1) +
geom_density()
f <- ggplot(data = df, aes(x = BMI)) +
geom_histogram( color = "red", fill = "blue", alpha = 0.1) +
geom_density()
g <- ggplot(data = df, aes(x = DiabetesPedigreeFunction)) +
geom_histogram( color = "red", fill = "blue", alpha = 0.1) +
geom_density()
h <- ggplot(data = df, aes(x = Age)) +
geom_histogram( color = "red", fill = "blue", alpha = 0.1) +geom_density()
ggarrange(a, b, c, d,e,f,g, h + rremove("[Link]"),
labels = c("a", "b", "c", "d","e", "f", "g", "h"),
ncol = 3, nrow = 3)
# Code
ggplot(data = df, aes(x =Outcome, fill = Outcome)) +
geom_bar()
# Code to label our categorical variable as a factor
df$Outcome<- factor(df$Outcome,
levels = c(0, 1),
labels = c("Negative", "Positive"))
out <- subset(df,
select = c(Pregnancies,Glucose,
BloodPressure,SkinThickness,
Insulin,BMI,
DiabetesPedigreeFunction,Age))
# Code for boxplot
ggplot(data = melt(out),
aes(x=variable, y=value)) +
geom_boxplot(aes(fill=variable))
corPlot(df[, 1:8])
cutoff <- createDataPartition(df$Outcome, p=0.85, list=FALSE)
# select 15% of the data for validation
testdf <- df[-cutoff,]
# use the remaining 85% of the data for training and tuning the models
traindf <- df[cutoff,]
# Code to train the SVM
set.seed(1234)
# Use 10-fold cross-validation and let caret pick
# the best model by accuracy
control <- trainControl(method="cv",number=10, classProbs = TRUE)
metric <- "Accuracy"
model <- train(Outcome ~., data = traindf, method = "svmRadial",
tuneLength = 8,preProc = c("center","scale"),
metric=metric, trControl=control)
# Print the model summary (R is case-sensitive, so the object is 'model')
model
# Code
plot(model)
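The call `trainControl(method="cv", number=10)` asks caret to evaluate each candidate model with 10-fold cross-validation. As a rough illustration of what that means, the base R sketch below assigns rows to folds and loops over them. The row count `n` and the names `fold_id`, `training_rows` and `validation_rows` are assumptions made for this example, and caret's own fold construction may differ in detail.

```r
set.seed(1234)
n <- 100   # illustrative number of training rows
k <- 10    # number of folds, as in trainControl(number = 10)

# Assign each row to one of k folds at random, keeping fold sizes equal
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  validation_rows <- which(fold_id == fold)   # held out in this round
  training_rows <- which(fold_id != fold)     # used to fit the model
  # A model would be fitted on training_rows and scored on validation_rows here
  cat("Fold", fold, ":", length(training_rows), "train,",
      length(validation_rows), "validation\n")
}
```

Each row is held out exactly once, so the averaged score uses every observation for validation without ever scoring a model on rows it was fitted to.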
Practical 4
Aim: Implement a regression model.
# Generate random IQ values with mean = 30 and sd =2
IQ <- rnorm(40, 30, 2)
# Sorting IQ level in ascending order
IQ <- sort(IQ)
# Generate vector with pass and fail values of 40 students
result <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
1, 1, 1, 0, 1, 1, 1, 1, 0, 1)
# Data Frame
df <- as.data.frame(cbind(IQ, result))
# Print data frame
print(df)
# Plotting IQ on x-axis and result on y-axis
plot(IQ, result, xlab = "IQ Level",
ylab = "Probability of Passing")
# Create a logistic model
g = glm(result~IQ, family=binomial, data=df)
# Add the predicted-probability curve from the regression model
curve(predict(g, data.frame(IQ=x), type="response"), add=TRUE)
# Add the fitted probabilities as solid points
points(IQ, fitted(g), pch=19)
# Summary of the regression model
summary(g)
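To make the link between the fitted coefficients and the plotted curve concrete, the sketch below computes one predicted probability by hand with the inverse-logit formula and compares it with `predict()`. The seed and the value `iq_new` are illustrative assumptions. The IQ values are simulated, so the numbers change with the seed.

```r
set.seed(1)  # arbitrary seed so the simulated IQ values are reproducible
IQ <- sort(rnorm(40, 30, 2))
result <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
            1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
            0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
            1, 1, 1, 0, 1, 1, 1, 1, 0, 1)
g <- glm(result ~ IQ, family = binomial)

# Probability from the fitted coefficients: p = 1 / (1 + exp(-(b0 + b1 * IQ)))
b <- coef(g)
iq_new <- 31  # illustrative value
p_manual <- unname(1 / (1 + exp(-(b[1] + b[2] * iq_new))))

# The same probability from predict()
p_predict <- unname(predict(g, data.frame(IQ = iq_new), type = "response"))

print(c(manual = p_manual, predict = p_predict))
```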
Practical 5
Aim: Implement simple linear regression.
years_of_exp = c(7,5,1,3)
salary_in_lakhs = c(21,13,6,8)
# Build the data frame (the satisfaction score is not used in this practical)
employee.data = data.frame(years_of_exp, salary_in_lakhs)
employee.data
# Estimate the salary of an employee from their years of experience.
model <- lm(salary_in_lakhs ~ years_of_exp, data = employee.data)
summary(model)
# The fitted regression equation is
# Y = 2 + 2.5 * years_of_exp
# Visualization of Regression
plot(salary_in_lakhs ~ years_of_exp, data = employee.data)
abline(model)
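The regression equation quoted in the comment can be checked by hand. The sketch below computes the least-squares slope and intercept from their closed-form expressions and then uses `lm()` to predict a salary. The names `slope`, `intercept` and `fit`, and the choice of 4 years as the query value, are illustrative.

```r
years_of_exp <- c(7, 5, 1, 3)
salary_in_lakhs <- c(21, 13, 6, 8)

# Least-squares estimates: slope = cov(x, y) / var(x),
# intercept = mean(y) - slope * mean(x)
slope <- cov(years_of_exp, salary_in_lakhs) / var(years_of_exp)
intercept <- mean(salary_in_lakhs) - slope * mean(years_of_exp)
print(c(intercept = intercept, slope = slope))

# The same model with lm(), used to predict the salary for 4 years of experience
fit <- lm(salary_in_lakhs ~ years_of_exp)
predict(fit, data.frame(years_of_exp = 4))
```

With these four observations the intercept is 2 and the slope is 2.5, which matches Y = 2 + 2.5 * years_of_exp, so the prediction for 4 years is 12 lakhs.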
Practical 6
Aim: Implement multiple linear regression.
# Importing the dataset
dataset = read.csv('D:\\[Link]')
# Encoding categorical data
dataset$State = factor(dataset$State,
levels = c('New York', 'California', 'Florida'),
labels = c(1, 2, 3))
dataset$State
# Splitting the dataset into the Training set and Test set
install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Profit, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
# training_set = scale(training_set)
# test_set = scale(test_set)
# Fitting Multiple Linear Regression to the Training set
regressor = lm(formula = Profit ~ .,
data = training_set)
# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)
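The practical ends with the predictions in `y_pred` but does not compare them with the actual values. One common way to do so is the root mean squared error (RMSE). Because the practical's CSV file is not available here, the sketch below uses the built-in `mtcars` data as a stand-in and base R's `sample()` for the split. The predictors `wt` and `hp` and the seed are assumptions made for illustration.

```r
set.seed(123)
# Stand-in data: mtcars ships with R; the practical's own CSV is not used here
n <- nrow(mtcars)
train_rows <- sample(n, size = round(0.8 * n))
training_set <- mtcars[train_rows, ]
test_set <- mtcars[-train_rows, ]

# Multiple linear regression with two predictors
regressor <- lm(mpg ~ wt + hp, data = training_set)
y_pred <- predict(regressor, newdata = test_set)

# Root mean squared error on the held-out rows
rmse <- sqrt(mean((test_set$mpg - y_pred)^2))
print(data.frame(actual = test_set$mpg, predicted = round(y_pred, 1)))
print(rmse)
```

The same two closing lines apply to the practical's data by replacing `test_set$mpg` with `test_set$Profit`.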
Practical 7
Aim: Implement logistic regression.
Source code:
install.packages("ISLR")
library(ISLR)
#load dataset
data <- ISLR::Default
print (head(ISLR::Default))
#view summary of dataset
summary(data)
#find total observations in dataset
nrow(data)
#Create Training and Test Samples
#split the dataset into a training set to train the model on and a testing set to test the model
set.seed(1)
#Use 70% of dataset as training set and remaining 30% as testing set
sample <- sample(c(TRUE, FALSE), nrow(data), replace=TRUE, prob=c(0.7,0.3))
print (sample)
train <- data[sample, ]
test <- data[!sample, ]
nrow(train)
nrow(test)
# Fit the Logistic Regression Model
# use the glm (generalized linear model) function and specify family="binomial"
#so that R fits a logistic regression model to the dataset
model <- glm(default~student+balance+income, family="binomial", data=train)
#view model summary
summary(model)
#Model Diagnostics
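install.packages("InformationValue")
library(InformationValue)
predicted <- predict(model, test, type="response")
#confusionMatrix() expects 0/1 actuals, so convert the "No"/"Yes" factor first
test$default <- ifelse(test$default == "Yes", 1, 0)
confusionMatrix(test$default, predicted)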
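A confusion matrix can also be built without an extra package by thresholding the predicted probabilities. The sketch below shows the idea in base R. It uses the built-in `mtcars` data (outcome `am`, predictor `wt`) as a stand-in for the Default dataset, and the 0.5 cutoff is an assumption. For brevity it scores the same rows the model was fitted on, whereas the practical scores a separate test set.

```r
# Stand-in data: mtcars ships with R; 'am' is a 0/1 outcome
model <- glm(am ~ wt, family = "binomial", data = mtcars)
predicted <- predict(model, mtcars, type = "response")

# Turn probabilities into classes with a 0.5 cutoff
predicted_class <- ifelse(predicted >= 0.5, 1, 0)

# Rows are actual classes, columns are predicted classes
cm <- table(Actual = mtcars$am, Predicted = predicted_class)
print(cm)

# Accuracy: share of cases where the predicted class equals the actual class
acc <- mean(predicted_class == mtcars$am)
print(acc)
```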
Practical 8
Aim: Read a datafile grades_km_input.csv and apply k-means clustering.
Datafile:
# install required packages ('grid' ships with base R and needs no installation)
install.packages("plyr")
install.packages("ggplot2")
install.packages("cluster")
install.packages("lattice")
install.packages("gridExtra")
# Load the package
library(plyr)
library(ggplot2)
library(cluster)
library(lattice)
library(grid)
library(gridExtra)
# A data frame is a two-dimensional array-like structure in which each column contains values of one
# variable and each row contains one set of values from each column.
grade_input=as.data.frame(read.csv("D:\\grades_km_input.csv"))
kmdata_orig=as.matrix(grade_input[, c ("Student","English","Math","Science")])
kmdata=kmdata_orig[,2:4]
kmdata[1:10,]
# The k-means algorithm is used to identify clusters for k = 1, 2, ..., 15. For each value of k, the
# within-cluster sum of squares (WSS) is calculated.
wss=numeric(15)
# The option nstart=25 specifies that the k-means algorithm will be repeated 25 times, each starting
# with k random initial centroids.
for(k in 1:15)wss[k]=sum(kmeans(kmdata,centers=k,nstart=25)$withinss)
plot(1:15,wss,type="b",xlab="Number of Clusters",ylab="Within sum of squares")
# As can be seen, the WSS is greatly reduced when k increases from one to two. Another substantial
# reduction in WSS occurs at k = 3. However, the improvement in WSS is fairly linear for k > 3.
km = kmeans(kmdata,3,nstart=25)
km
c( wss[3] , sum(km$withinss))
df=as.data.frame(kmdata_orig[,2:4])
df$cluster=factor(km$cluster)
centers=as.data.frame(km$centers)
g1=ggplot(data=df, aes(x=English, y=Math, color=cluster )) +
geom_point() + theme(legend.position="right") +
geom_point(data=centers,aes(x=English,y=Math, color=as.factor(c(1,2,3))),size=10, alpha=.3,
show.legend=FALSE)
g2=ggplot(data=df, aes(x=English, y=Science, color=cluster )) +
geom_point () +geom_point(data=centers,aes(x=English,y=Science,
color=as.factor(c(1,2,3))),size=10, alpha=.3, show.legend=FALSE)
g3 = ggplot(data=df, aes(x=Math, y=Science, color=cluster )) +
geom_point () + geom_point(data=centers,aes(x=Math,y=Science,
color=as.factor(c(1,2,3))),size=10, alpha=.3, show.legend=FALSE)
tmp=ggplot_gtable(ggplot_build(g1))
grid.arrange(arrangeGrob(g1 + theme(legend.position="none"),g2 +
theme(legend.position="none"),g3 + theme(legend.position="none"),
top ="High School Student Cluster Analysis" ,ncol=1))
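The WSS loop in this practical depends on the grades file, so it cannot be run without that CSV. The sketch below repeats the same elbow computation on the four numeric columns of the built-in `iris` data, used here purely as a stand-in. The upper limit `max_k` and the seed are illustrative. It also checks one property that helps when reading the plot: for k = 1 the WSS equals the total sum of squares around the column means.

```r
set.seed(1)
# Stand-in data: the four numeric columns of iris, which ships with R
x <- as.matrix(iris[, 1:4])

max_k <- 6
wss <- numeric(max_k)
for (k in 1:max_k) {
  # tot.withinss is the sum of the per-cluster within sums of squares
  wss[k] <- kmeans(x, centers = k, nstart = 25)$tot.withinss
}

# For k = 1 the WSS equals the total sum of squares around the column means
total_ss <- sum(scale(x, scale = FALSE)^2)
print(c(wss_k1 = wss[1], total_ss = total_ss))

plot(1:max_k, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within sum of squares")
```

Here `tot.withinss` gives the same number as `sum(...$withinss)` in the practical's loop.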
Practical 9
Aim: Perform Apriori algorithm using Groceries dataset from the R arules package.
install.packages("arules")
install.packages("arulesViz")
install.packages("RColorBrewer")
# Loading Libraries
library(arules)
library(arulesViz)
library(RColorBrewer)
# import dataset
data(Groceries)
Groceries
summary(Groceries)
class(Groceries)
# using apriori() function
rules = apriori(Groceries, parameter = list(supp = 0.02, conf = 0.2))
summary (rules)
# using inspect() function
inspect(rules[1:10])
# using itemFrequencyPlot() function
arules::itemFrequencyPlot(Groceries, topN = 20,
col = brewer.pal(8, 'Pastel2'),
main = 'Relative Item Frequency Plot',
type = "relative",
ylab = "Item Frequency (Relative)")
itemsets = apriori(Groceries, parameter = list(minlen=2, maxlen=2, support=0.02,
target="frequent itemsets"))
summary(itemsets)
# using inspect() function
inspect(itemsets[1:10])
itemsets_3 = apriori(Groceries, parameter = list(minlen=3, maxlen=3, support=0.02,
target="frequent itemsets"))
summary(itemsets_3)
# using inspect() function
inspect(itemsets_3)
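The `supp` and `conf` parameters passed to `apriori()` are thresholds on two simple ratios. The base R sketch below computes both for a single rule on a handful of made-up transactions, to show what the thresholds measure. The transactions and the helper names `support` and `confidence` are hypothetical and are not taken from the Groceries data or from the arules API.

```r
# Toy transactions, made up for illustration (not taken from Groceries)
transactions <- list(
  c("milk", "bread"),
  c("milk"),
  c("bread", "butter"),
  c("milk", "bread", "butter")
)

# Support: share of transactions that contain every item in 'items'
support <- function(items, trans) {
  mean(vapply(trans, function(t) all(items %in% t), logical(1)))
}

# Confidence of lhs -> rhs: support(lhs and rhs together) / support(lhs)
confidence <- function(lhs, rhs, trans) {
  support(c(lhs, rhs), trans) / support(lhs, trans)
}

supp_rule <- support(c("milk", "bread"), transactions)
conf_rule <- confidence("milk", "bread", transactions)
print(c(support = supp_rule, confidence = conf_rule))
```

For the rule milk -> bread, two of the four transactions contain both items, so the support is 0.5. Three contain milk, so the confidence is 2/3. A rule like this would pass the practical's thresholds of supp = 0.02 and conf = 0.2.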