Practical Big Data Analytics Guide

The document outlines a series of practical exercises focused on data science techniques, including the installation and configuration of Hadoop, various classification and regression models, and clustering methods. Each practical includes specific aims, installation instructions, code snippets, and data analysis steps using R programming. The exercises cover decision trees, SVM, linear regression, logistic regression, and k-means clustering, providing a comprehensive overview of machine learning applications.

INDEX

Sr. No.  Practical  Date  Sign

1  Install, configure and run Hadoop and HDFS and explore HDFS
2  Implement decision tree classification techniques.
3  Implement SVM classification techniques.
4  Implement a regression model.
5  Implement simple linear regression.
6  Implement multiple linear regression.
7  Implement logistic regression.
8  Read a datafile grades_km_input.csv and apply k-means clustering.
9  Perform the Apriori algorithm using the Groceries dataset from the R arules package.

Shanti Chourasiya

Roll no.2023ITI1

Practical 1

Aim: -Install, configure and run Hadoop and HDFS and explore HDFS on Windows

Code:

Steps to Install Hadoop

1. Install Java JDK 1.8
2. Download Hadoop, extract it, and place it under the C drive
3. Set the path in Environment Variables
4. Edit the config files under the Hadoop directory
5. Create the folders datanode and namenode under the data directory
6. Edit the HDFS and YARN files
7. Set the Java home environment in the Hadoop environment file
8. Setup complete. Test by executing start-all.cmd

There are two ways to install Hadoop:

– Single node
– Multi node
Here, we use a single-node cluster: every daemon runs on localhost and the replication factor is 1.
1. Install Java
– Java JDK link to download:
[Link]
– Extract and install Java in C:\Java
– Open cmd and type -> javac -version

2. Download Hadoop
[Link]
[Link]

– Right-click the .[Link] file -> Show more options -> 7-Zip -> extract to C:\Hadoop-3.3.0\

3. Set the JAVA_HOME environment variable

4. Set the HADOOP_HOME environment variable

Click New under both user variables and system variables.

Click on user variable -> Path -> Edit -> add the paths for Hadoop and Java up to ‘bin’.
Click OK, OK, OK.
5. Configurations
Edit the file C:/Hadoop-3.3.0/etc/hadoop/core-site.xml,
paste the XML code below into the file and save it.

======================================================

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
======================================================

Rename “[Link]” to “mapred-site.xml” and edit the file C:/Hadoop-3.3.0/etc/hadoop/mapred-site.xml, paste the XML code below and save the file.
======================================================

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
======================================================

Create folder “data” under “C:\Hadoop-3.3.0”

Create folder “datanode” under “C:\Hadoop-3.3.0\data”

Create folder “namenode” under “C:\Hadoop-3.3.0\data”

======================================================

Edit the file C:/Hadoop-3.3.0/etc/hadoop/hdfs-site.xml,

paste the XML code below and save the file.

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.namenode.name.dir</name>
<value>/hadoop-3.3.0/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/hadoop-3.3.0/data/datanode</value>
</property>
</configuration>
======================================================

Edit the file C:/Hadoop-3.3.0/etc/hadoop/yarn-site.xml,

paste the XML code below and save the file.

<configuration>

</configuration>
======================================================

6. Edit the file C:/Hadoop-3.3.0/etc/hadoop/hadoop-env.cmd

Find “JAVA_HOME=%JAVA_HOME%” and replace it with
set JAVA_HOME="C:\Java\jdk1.8.0_361"
======================================================

7. Download “redistributable” package

Download and run VC_redist.[Link]

This is a “redistributable” package of the Visual C runtime code for 64-bit applications, from
Microsoft. It contains certain shared code that every application written with Visual C expects to
have available on the Windows computer it runs on.

8. Hadoop Configurations
Download bin folder from

[Link]

– Copy the bin folder to c:\hadoop-3.3.0. Replace the existing bin folder.

9. Copy "[Link]" from ~\hadoop-3.0.3\share\hadoop\yarn\timelineservice to the ~\hadoop-3.0.3\share\hadoop\yarn folder.

10. Format the NameNode

– Open cmd with ‘Run as Administrator’ and type the command “hdfs namenode -format”

11. Testing

– Open cmd with ‘Run as Administrator’ and change directory to C:\Hadoop-3.3.0\sbin

– type start-all.cmd

OR

– type start-dfs.cmd

– type start-yarn.cmd

– You will get four more running processes: DataNode, NameNode, ResourceManager and NodeManager.
Output:

12. Type the jps command in the command prompt where start-all.cmd was run; you will get the following output.

13. Open [Link] in any browser,

or [Link]
Practical 2
Aim: - Implement decision tree classification techniques.

# 'datasets' ships with base R, so only the add-on packages need installing
install.packages('caTools')

install.packages('party')

install.packages('dplyr')

install.packages('magrittr')
library(datasets)

library(caTools)

library(party)

library(dplyr)

library(magrittr)

data("readingSkills")

head(readingSkills)

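set.seed(123)  # make the random split reproducible

# sample.split() expects the label vector, not the whole data frame;
# about 80% of each class is marked TRUE (training)
sample_data = sample.split(readingSkills$nativeSpeaker, SplitRatio = 0.8)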

train_data <- subset(readingSkills, sample_data == TRUE)

test_data <- subset(readingSkills, sample_data == FALSE)

model<- ctree(nativeSpeaker ~ ., train_data)

plot(model)
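
The 80/20 split above relies on `sample.split()`, which samples within each class of the label vector so that both subsets keep roughly the same class balance. As a rough illustration of that idea (not the actual caTools implementation), the base-R sketch below performs a stratified split by hand. The helper `stratified_split` and the toy `labels` vector are hypothetical names introduced only for this example.

```r
# Hypothetical helper: returns a logical vector, TRUE = training row.
# Within each class, about `ratio` of the rows are marked TRUE.
stratified_split <- function(labels, ratio = 0.8) {
  in_train <- logical(length(labels))
  for (cls in as.character(unique(labels))) {
    idx <- which(labels == cls)
    n_train <- round(ratio * length(idx))
    # sample positions within idx, so a single-row class is handled correctly
    chosen <- idx[sample.int(length(idx), n_train)]
    in_train[chosen] <- TRUE
  }
  in_train
}

set.seed(1)
labels <- factor(rep(c("yes", "no"), times = c(60, 40)))  # toy labels
is_train <- stratified_split(labels, ratio = 0.8)

# 48 of the 60 "yes" rows and 32 of the 40 "no" rows land in training
table(labels, is_train)
```

Because the sampling is done per class, the class proportions in the training and test parts match those of the full data, which matters when one class is much rarer than the other.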
Practical 3
Aim:- Implement SVM classification techniques

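# Install all necessary packages

install.packages("caret")

install.packages("kernlab")  # backend used by caret for method = "svmRadial"

install.packages("ggplot2")

install.packages("GGally")

install.packages("psych")

install.packages("ggpubr")

install.packages("reshape")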

# Code for importation of all necessary packages

library(caret)

library(ggplot2)

library(GGally)

library(psych)

library(ggpubr)

library(reshape)

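# Load the dataset and preview the first rows
df <- read.csv("D:\\[Link]")

head(df)

# Count missing values in the whole data frame

sum(is.na(df))

# Number of rows and columns

dim(df)

# Data type of each column

sapply(df, class)

# Summary statistics for each column

summary(df)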


# Histogram with a density curve for each predictor

a <- ggplot(data = df, aes(x = Pregnancies)) +

geom_histogram( color = "red", fill = "blue", alpha = 0.1) +

geom_density()

b <- ggplot(data = df, aes(x = Glucose)) +

geom_histogram( color = "red", fill = "blue", alpha = 0.1) +

geom_density()

c <- ggplot(data = df, aes(x = BloodPressure)) +

geom_histogram( color = "red", fill = "blue", alpha = 0.1) +

geom_density()

d <- ggplot(data = df, aes(x = SkinThickness)) +

geom_histogram( color = "red", fill = "blue", alpha = 0.1) +

geom_density()

e <- ggplot(data = df, aes(x = Insulin)) +

geom_histogram( color = "red", fill = "blue", alpha = 0.1) +

geom_density()

f <- ggplot(data = df, aes(x = BMI)) +

geom_histogram( color = "red", fill = "blue", alpha = 0.1) +

geom_density()

g <- ggplot(data = df, aes(x = DiabetesPedigreeFunction)) +

geom_histogram( color = "red", fill = "blue", alpha = 0.1) +

geom_density()

h <- ggplot(data = df, aes(x = Age)) +

geom_histogram( color = "red", fill = "blue", alpha = 0.1) +geom_density()

ggarrange(a, b, c, d, e, f, g, h + rremove("x.text"),

labels = c("a", "b", "c", "d","e", "f", "g", "h"),

ncol = 3, nrow = 3)
# Bar chart of the outcome classes

ggplot(data = df, aes(x =Outcome, fill = Outcome)) +

geom_bar()
# Code to label our categorical variable as a factor

df$Outcome<- factor(df$Outcome,

levels = c(0, 1),

labels = c("Negative", "Positive"))

out <- subset(df,

select = c(Pregnancies,Glucose,

BloodPressure,SkinThickness,

Insulin,BMI,

DiabetesPedigreeFunction,Age))

# Code for boxplot

ggplot(data = melt(out),

aes(x=variable, y=value)) +

geom_boxplot(aes(fill=variable))
corPlot(df[, 1:8])
# Stratified split on Outcome: cutoff holds the row indices of the 85% training part
cutoff <- createDataPartition(df$Outcome, p=0.85, list=FALSE)

# hold out the remaining 15% of the data for validation

testdf <- df[-cutoff,]

# use the selected 85% of the data to train and tune the models

traindf <- df[cutoff,]

# Code to train the SVM

set.seed(1234)

# set up 10-fold cross-validation; the best model is
# picked by the metric chosen below

control <- trainControl(method="cv",number=10, classProbs = TRUE)

metric <- "Accuracy"

model <- train(Outcome ~., data = traindf, method = "svmRadial",

tuneLength = 8,preProc = c("center","scale"),

metric=metric, trControl=control)

# Code for model summary

model
# Plot cross-validated accuracy against the tuning parameter

plot(model)
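
The `preProc = c("center","scale")` argument asks caret to standardize each predictor before the SVM is fitted. A radial kernel works on distances between rows, so predictors measured on very different scales would otherwise dominate. The base-R sketch below shows the general idea under stated assumptions: the matrices `train_x` and `test_x` hold made-up values, and this is not caret's internal code.

```r
set.seed(1234)
# Toy predictors standing in for columns such as Glucose and BMI (made-up values)
train_x <- cbind(Glucose = rnorm(50, 120, 30), BMI = rnorm(50, 32, 7))
test_x  <- cbind(Glucose = rnorm(10, 120, 30), BMI = rnorm(10, 32, 7))

# Estimate the centring and scaling constants on the training rows only
mu    <- colMeans(train_x)
sigma <- apply(train_x, 2, sd)

# Apply the same constants to both sets
train_scaled <- scale(train_x, center = mu, scale = sigma)
test_scaled  <- scale(test_x,  center = mu, scale = sigma)

round(colMeans(train_scaled), 10)      # approximately 0 for every column
round(apply(train_scaled, 2, sd), 10)  # approximately 1 for every column
```

Reusing the training means and standard deviations on the test rows keeps information from the held-out data out of the preprocessing step.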
Practical 4
Aim: - Implement a regression model.

# Generate random IQ values with mean = 30 and sd =2

IQ <- rnorm(40, 30, 2)

# Sorting IQ level in ascending order

IQ <- sort(IQ)

# Generate vector with pass and fail values of 40 students

result <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1,

1, 0, 0, 0, 1, 1, 0, 0, 1, 0,

0, 0, 1, 0, 0, 1, 1, 0, 1, 1,

1, 1, 1, 0, 1, 1, 1, 1, 0, 1)

# Data Frame

df <- as.data.frame(cbind(IQ, result))

# Print data frame

print(df)
# Plotting IQ on x-axis and result on y-axis

plot(IQ, result, xlab = "IQ Level",

ylab = "Probability of Passing")

# Create a logistic model

g = glm(result~IQ, family=binomial, df)

# Create a curve based on prediction using the regression model

curve(predict(g, data.frame(IQ=x), type="response"), add=TRUE)

# Overlay the fitted probabilities as solid points
# (pch values 26 to 31 are unused in R, so pch=30 would draw nothing)

points(IQ, fitted(g), pch=19)


# Summary of the regression model

summary(g)
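
The coefficients reported by `summary(g)` describe a straight line on the log-odds scale. The fitted probability of passing is the inverse logit of that line. The sketch below checks this by hand on toy data of the same shape as the practical; the simulated `result` values are an assumption made only so the example is self-contained.

```r
set.seed(42)
# Toy data: one numeric predictor and a simulated 0/1 outcome
IQ <- sort(rnorm(40, 30, 2))
result <- rbinom(40, 1, plogis(IQ - 30))
df <- data.frame(IQ, result)

g <- glm(result ~ IQ, family = binomial, data = df)

# Linear predictor b0 + b1 * IQ, then the inverse logit
b <- coef(g)
manual_prob <- 1 / (1 + exp(-(b[1] + b[2] * df$IQ)))

# Same numbers as fitted(g), i.e. predict(g, type = "response")
all.equal(unname(manual_prob), unname(fitted(g)))
```

Reading the model this way makes the slope interpretable: each additional IQ point multiplies the odds of passing by `exp(b[2])`.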

Practical 5
Aim :- Implement simple linear regression.

years_of_exp = c(7,5,1,3)

salary_in_lakhs = c(21,13,6,8)

#employee.data = data.frame(satisfaction_score, years_of_exp, salary_in_lakhs)

employee.data = data.frame(years_of_exp, salary_in_lakhs)

employee.data

# Estimate the salary of an employee from the years of experience in the company.

model <- lm(salary_in_lakhs ~ years_of_exp, data = employee.data)

summary(model)

# The fitted regression equation is

# Y = 2 + 2.5*years_of_exp

# Visualization of Regression

plot(salary_in_lakhs ~ years_of_exp, data = employee.data)

abline(model)
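
The equation Y = 2 + 2.5*years_of_exp can be verified without `lm()`, using the closed-form least-squares formulas for one predictor. The sketch below reuses the four data points from this practical; the prediction at 4 years is an illustrative value chosen here, not one taken from the original.

```r
years_of_exp    <- c(7, 5, 1, 3)
salary_in_lakhs <- c(21, 13, 6, 8)

# Least-squares slope = covariance(x, y) / variance(x)
slope <- cov(years_of_exp, salary_in_lakhs) / var(years_of_exp)
# The fitted line passes through the point of means
intercept <- mean(salary_in_lakhs) - slope * mean(years_of_exp)

c(intercept = intercept, slope = slope)   # 2 and 2.5

# Cross-check against lm() and predict a new value
model <- lm(salary_in_lakhs ~ years_of_exp)
coef(model)
predict(model, newdata = data.frame(years_of_exp = 4))  # 2 + 2.5 * 4 = 12
```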
Practical 6
Aim :- Implement multiple linear regression.

# Importing the dataset

dataset = read.csv('D:\\[Link]')

# Encoding categorical data

dataset$State = factor(dataset$State,

levels = c('New York', 'California', 'Florida'),

labels = c(1, 2, 3))

dataset$State

# Splitting the dataset into the Training set and Test set

install.packages('caTools')

library(caTools)

set.seed(123)

split = sample.split(dataset$Profit, SplitRatio = 0.8)

training_set = subset(dataset, split == TRUE)

test_set = subset(dataset, split == FALSE)

# Feature Scaling

# training_set = scale(training_set)

# test_set = scale(test_set)

# Fitting Multiple Linear Regression to the Training set

regressor = lm(formula = Profit ~ .,

data = training_set)

# Predicting the Test set results

y_pred = predict(regressor, newdata = test_set)
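
The practical stops once `y_pred` is computed. A common next step is to compare the predictions with the actual test values. The sketch below does this with root mean squared error and R-squared on a made-up data frame, since the original CSV is not available here. The column names `Spend`, `State` and `Profit` and all the numbers are assumptions for illustration only.

```r
set.seed(123)
n <- 50
# Made-up stand-in for the practical's dataset
toy <- data.frame(
  Spend = runif(n, 0, 100),
  State = factor(sample(c("New York", "California", "Florida"), n, replace = TRUE))
)
toy$Profit <- 20 + 0.8 * toy$Spend + rnorm(n, sd = 5)

# 80/20 split by row index
train_rows   <- sample(n, size = 0.8 * n)
training_set <- toy[train_rows, ]
test_set     <- toy[-train_rows, ]

regressor <- lm(Profit ~ ., data = training_set)
y_pred <- predict(regressor, newdata = test_set)

# Root mean squared error and R-squared on the held-out rows
rmse <- sqrt(mean((test_set$Profit - y_pred)^2))
r2   <- 1 - sum((test_set$Profit - y_pred)^2) /
            sum((test_set$Profit - mean(test_set$Profit))^2)
c(rmse = rmse, r2 = r2)
```

RMSE is expressed in the units of the response, which makes it easy to judge whether the typical prediction error is acceptable for the problem at hand.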


Practical 7
Aim :- Implement logistic regression.

Source code:

install.packages("ISLR")

library(ISLR)

#load dataset

data <- ISLR::Default

print (head(ISLR::Default))

#view summary of dataset

summary(data)

#find total observations in dataset

nrow(data)

#Create Training and Test Samples

#split the dataset into a training set to train the model on and a testing set to test the model

set.seed(1)

#Use 70% of dataset as training set and remaining 30% as testing set

sample <- sample(c(TRUE, FALSE), nrow(data), replace=TRUE, prob=c(0.7,0.3))

print (sample)

train <- data[sample, ]

test <- data[!sample, ]

nrow(train)

nrow(test)

# Fit the Logistic Regression Model

# use the glm (generalized linear model) function and specify family="binomial"

# so that R fits a logistic regression model to the dataset

model <- glm(default~student+balance+income, family="binomial", data=train)

#view model summary

summary(model)

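#Model Diagnostics

install.packages("InformationValue")
library(InformationValue)

# predicted probability of default for each test row
predicted <- predict(model, test, type="response")

# confusionMatrix() here works with 0/1 actuals, so recode the Yes/No factor
test$default <- ifelse(test$default == "Yes", 1, 0)

confusionMatrix(test$default, predicted)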
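
A confusion matrix can also be built with base R alone, which helps if the `InformationValue` package cannot be installed. The sketch below uses ten made-up actual labels and predicted probabilities with a 0.5 cut-off. The vectors are illustrative and are not output from the model above.

```r
# Toy actual classes (1 = default) and predicted probabilities; values are made up
actual    <- c(0, 0, 0, 0, 1, 1, 0, 1, 0, 1)
predicted <- c(0.05, 0.20, 0.40, 0.10, 0.80, 0.35, 0.60, 0.90, 0.15, 0.70)

# Turn probabilities into class labels at a 0.5 cut-off
pred_class <- ifelse(predicted >= 0.5, 1, 0)

# Rows = actual class, columns = predicted class
cm <- table(Actual    = factor(actual,     levels = c(0, 1)),
            Predicted = factor(pred_class, levels = c(0, 1)))
cm

accuracy    <- sum(diag(cm)) / sum(cm)          # 8 of 10 correct = 0.8
sensitivity <- cm["1", "1"] / sum(cm["1", ])    # 3 of 4 defaulters found = 0.75
specificity <- cm["0", "0"] / sum(cm["0", ])    # 5 of 6 non-defaulters = 0.833...
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

With a rare outcome, accuracy alone can look good even when few positive cases are detected, so sensitivity and specificity are worth reporting alongside it.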
Practical 8
Aim: Read a datafile grades_km_input.csv and apply k-means clustering.

Datafile:

# install required packages ('grid' ships with base R and needs no installation)

install.packages("plyr")

install.packages("ggplot2")

install.packages("cluster")

install.packages("lattice")

install.packages("gridExtra")

# Load the package

library(plyr)

library(ggplot2)

library(cluster)

library(lattice)

library(grid)

library(gridExtra)

# A data frame is a two-dimensional array-like structure in which each column contains values of one
# variable and each row contains one set of values from each column.

grade_input=as.data.frame(read.csv("D:\\grades_km_input.csv"))

kmdata_orig=as.matrix(grade_input[, c ("Student","English","Math","Science")])

kmdata=kmdata_orig[,2:4]

kmdata[1:10,]

# the k-means algorithm is used to identify clusters for k = 1, 2, ..., 15. For each value of k, the WSS
# is calculated.

wss=numeric(15)

# the option nstart=25 specifies that the k-means algorithm will be repeated 25 times, each starting
# with k random initial centroids

for(k in 1:15)wss[k]=sum(kmeans(kmdata,centers=k,nstart=25)$withinss)

plot(1:15,wss,type="b",xlab="Number of Clusters",ylab="Within sum of squares")

# As can be seen, the WSS is greatly reduced when k increases from one to two. Another substantial
# reduction in WSS occurs at k = 3. However, the improvement in WSS is fairly linear for k > 3.

km = kmeans(kmdata,3,nstart=25)

km

c( wss[3] , sum(km$withinss))

df=as.data.frame(kmdata_orig[,2:4])

df$cluster=factor(km$cluster)

centers=as.data.frame(km$centers)

g1=ggplot(data=df, aes(x=English, y=Math, color=cluster )) +
  geom_point() + theme(legend.position="right") +
  geom_point(data=centers,aes(x=English,y=Math, color=as.factor(c(1,2,3))),size=10, alpha=.3,
             show.legend=FALSE)

g2=ggplot(data=df, aes(x=English, y=Science, color=cluster )) +
  geom_point() +
  geom_point(data=centers,aes(x=English,y=Science, color=as.factor(c(1,2,3))),size=10, alpha=.3,
             show.legend=FALSE)

g3=ggplot(data=df, aes(x=Math, y=Science, color=cluster )) +
  geom_point() +
  geom_point(data=centers,aes(x=Math,y=Science, color=as.factor(c(1,2,3))),size=10, alpha=.3,
             show.legend=FALSE)

tmp=ggplot_gtable(ggplot_build(g1))

grid.arrange(arrangeGrob(g1 + theme(legend.position="none"),
                         g2 + theme(legend.position="none"),
                         g3 + theme(legend.position="none"),
                         top="High School Student Cluster Analysis", ncol=1))
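
The WSS values plotted for the elbow curve are sums of squared distances from each point to the centre of its own cluster. The sketch below recomputes that quantity by hand and compares it with what `kmeans()` reports. The toy score matrix is made up for illustration and is unrelated to grades_km_input.csv.

```r
set.seed(25)
# Toy scores for three loosely separated groups (made-up numbers)
toy <- rbind(
  matrix(rnorm(40, mean = 60, sd = 3), ncol = 2),
  matrix(rnorm(40, mean = 75, sd = 3), ncol = 2),
  matrix(rnorm(40, mean = 90, sd = 3), ncol = 2)
)
colnames(toy) <- c("English", "Math")

km <- kmeans(toy, centers = 3, nstart = 25)

# WSS by hand: squared distance of every point to its own cluster centre
assigned_centres <- km$centers[km$cluster, ]
manual_wss <- sum((toy - assigned_centres)^2)

c(manual = manual_wss, kmeans = km$tot.withinss)  # the two values agree
```

Since adding clusters can only lower the WSS, the elbow plot looks for the point where the reduction levels off rather than for the minimum.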
Practical 9
Aim: Perform Apriori algorithm using Groceries dataset from the R arules package.

install.packages("arules")

install.packages("arulesViz")

install.packages("RColorBrewer")

# Loading Libraries

library(arules)

library(arulesViz)

library(RColorBrewer)

# import dataset

data(Groceries)

Groceries

summary(Groceries)

class(Groceries)

# using apriori() function

rules = apriori(Groceries, parameter = list(supp = 0.02, conf = 0.2))

summary (rules)

# using inspect() function

inspect(rules[1:10])

# using itemFrequencyPlot() function

arules::itemFrequencyPlot(Groceries, topN = 20,

col = brewer.pal(8, 'Pastel2'),

main = 'Relative Item Frequency Plot',

type = "relative",

ylab = "Item Frequency (Relative)")

itemsets = apriori(Groceries, parameter = list(minlen=2, maxlen=2, support=0.02, target="frequent itemsets"))

summary(itemsets)

# using inspect() function

inspect(itemsets[1:10])
itemsets_3 = apriori(Groceries, parameter = list(minlen=3, maxlen=3, support=0.02, target="frequent itemsets"))

summary(itemsets_3)

# using inspect() function

inspect(itemsets_3)
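
The `supp` and `conf` thresholds passed to `apriori()` refer to support and confidence. To make those two measures concrete, the sketch below computes them by hand in base R on five made-up baskets. The helper functions `support` and `confidence` are hypothetical names written for this example and are not part of arules.

```r
# Five made-up baskets; item names are illustrative only
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread", "butter"),
  c("milk", "yogurt"),
  c("milk", "bread", "yogurt")
)

# Support: share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Confidence of the rule lhs => rhs: support(lhs and rhs) / support(lhs)
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

support(c("milk", "bread"), baskets)   # 3 of 5 baskets = 0.6
confidence("milk", "bread", baskets)   # 0.6 / 0.8 = 0.75
```

In this toy data the rule {milk} => {bread} would pass the thresholds used above (support 0.02, confidence 0.2). Raising either threshold prunes more candidate rules.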
