Data Mining Lab Programs 2023-24
Accredited with A++ grade by NAAC, An ISO 9001: 2015 Certified Institution
Peelamedu, Coimbatore-641004
2023-2024
College of Excellence, 2022-4th Rank
Autonomous and Affiliated to Bharathiar University
REGISTER NUMBER:
S. No.  Topics
PYTHON PROGRAMS
1   Linear Regression
2   Decision Tree
3   Naive Bayes
4   Support Vector Machine
5   Comparison of KNN, SVM, Decision Tree, Logistic Regression, Naïve Bayes
6   BIRCH Clustering
R PROGRAMS
7   Data Exploration and Visualization
8   Linear Regression
9   Association Rule Mining
10  Classification using SVM
12  Text Preprocessing
18  Visualizing Product-Wise Sales using Superstore Dataset
19  Visualizing Region-Wise and Category-Wise Profit using Superstore Dataset
20  Power Query Operations using Walmart and Masters Dataset
Ex. No: 01
LINEAR REGRESSION
Date:
AIM
ALGORITHM
STEP 4: Import datasets, linear_model from sklearn and mean_squared_error, r2_score from
sklearn.metrics.
STEP 5: Load the diabetes dataset using datasets.load_diabetes( ) and use one feature.
STEP 6: Split the data and target into training, testing set.
STEP 8: Train the model using training dataset and make predictions using testing set by using
predict( ).
STEP 9: Print coefficients, variance score and mean-squared error using coef_, r2_score( ) and
mean_squared_error( ).
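The quantities named in the last step can be illustrated without scikit-learn. The following is a minimal pure-Python sketch of one-feature least squares; the helper names and the toy data are hypothetical (they are not the diabetes data), and scikit-learn's own implementation is more general.

```python
# One-feature least-squares sketch (pure Python, hypothetical toy data).

def fit_line(xs, ys):
    """Return (slope, intercept) that minimise the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(y_true, y_pred):
    """Average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Variance score: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy data, made up for illustration: y is roughly 2x + 1
x_train, y_train = [0.0, 1.0, 2.0, 3.0], [1.1, 2.9, 5.2, 6.8]
x_test, y_test = [4.0, 5.0], [9.1, 10.9]

slope, intercept = fit_line(x_train, y_train)
y_pred = [slope * x + intercept for x in x_test]
print("Coefficient:", slope)
print("Mean squared error:", mean_squared_error(y_test, y_pred))
print("Variance score:", r2_score(y_test, y_pred))
```

A variance score of 1 would mean the test targets are predicted exactly; values near 0 mean the line does no better than predicting the mean.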
PROGRAM
OUTPUT
Coefficients:938
Mean squared error:2548
Variance score:0
RESULT
Thus, the above Python program to perform linear regression on the diabetes dataset
has been executed and verified successfully.
Ex. No: 02
DECISION TREE
Date:
AIM
To write a Python program to classify the samples as man and woman based on the
attributes of height and length of hair using decision tree classifier.
ALGORITHM
STEP 4: Assign X as the input features namely height and length of hair
STEP 5: Split the dataset into training set and test set using train_test_split()
STEP 6: Create a decision tree classifier object and fit the model using the training
set
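How a decision tree picks a split can be sketched in pure Python. The example below assumes Gini impurity as the split criterion (the default criterion of scikit-learn's DecisionTreeClassifier); the helper names and the four-row toy dataset are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every (feature, threshold) pair and return the one with the lowest
    weighted Gini impurity as (impurity, feature_index, threshold)."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue  # a split must put samples on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

# Hypothetical samples: [height in cm, length of hair in cm]
X_toy = [[180, 5], [175, 8], [160, 30], [165, 35]]
y_toy = ['Man', 'Man', 'Woman', 'Woman']
print(best_split(X_toy, y_toy))
```

A full tree repeats this search recursively on each side of the chosen split until the leaves are pure or a stopping rule applies.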
PROGRAM
6,40],[197,20],[150,25],[140,32],[136,35]]
Y=['Man','Woman','Woman','Man','Woman','Man','Woman','Man','Woman','Man','Woman','Man',
   'Woman','Woman','Woman','Man','Woman','Woman','Man','Woman','Woman','Man','Man',
   'Woman','Woman']
data_feature_names = ['height','length of hair']
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3,random_state=10)
# Assumes "from sklearn import tree" in the part of the program that is cut off above
DTclf = tree.DecisionTreeClassifier()
# Note: the model is fitted on the full data (X, Y), not on X_train, Y_train
DTclf1 = DTclf.fit(X,Y)
prediction = DTclf1.predict([[140,37]])
print(prediction)
OUTPUT
['Woman']
RESULT
Ex. No: 03
NAIVE BAYES
Date:
AIM
ALGORITHM
STEP 3: Import the numpy, pandas, io packages and train_test_split, preprocessing from
sklearn package.
STEP 9: Print the classification report for the expected and predicted data.
STEP 10: Print the confusion matrix for the expected and predicted data.
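The model used in this exercise, Gaussian Naive Bayes, can be sketched in a few lines of pure Python. This is a simplified illustration, not scikit-learn's implementation: the function names, the toy data and the small variance constant are assumptions of the sketch.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate the prior and the per-feature mean and variance of each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # The small constant keeps every variance strictly positive (sketch assumption)
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(rows), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(params):
        prior, means, variances = params
        total = math.log(prior)
        for x, m, v in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))

# Hypothetical two-feature data with two well-separated classes
X_toy = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [5.0, 7.9], [5.2, 8.1], [4.9, 8.0]]
y_toy = [1, 1, 1, 2, 2, 2]
nb = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(nb, [1.1, 2.0]), predict_gaussian_nb(nb, [5.1, 8.0]))
```

The "naive" part is the product over features: each feature contributes its own one-dimensional Gaussian likelihood, as if the features were independent within a class.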
PROGRAM
import numpy as np
import pandas as pd
from pandas import read_csv
import io
from sklearn.model_selection import train_test_split
path = "D:\\Datasets\\[Link]"
names = ['A1','A2','A3','A4','A5','A6','A7','A8','A9','class']
dataset = read_csv(path, names=names)
array = dataset.values
print(array)
X = array[:,0:8]  # feature columns A1..A8 (A9, at index 8, is not used)
print(X)
y = array[:,9]  # class label column
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20,
    random_state=1, shuffle=True)
from sklearn import preprocessing
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, Y_train)
print(model)
expected = Y_validation
predicted = model.predict(X_validation)
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
OUTPUT
RESULT
Ex. No: 04
SUPPORT VECTOR MACHINE
Date:
AIM
To write a Python program to implement Support Vector Machine using wine quality
dataset
ALGORITHM
STEP 1: Start the process
STEP 2: Open Jupyter notebook and create Python3 notebook
STEP 3: Import read_csv from pandas, train_test_split from sklearn.model_selection and
classification_report, confusion_matrix from sklearn.metrics
STEP 4: Load wine quality dataset using read_csv( ) and split out validation dataset using
train_test_split( )
STEP 5: Make predictions using model and summarize the fit of model
STEP 6: Print classification report and confusion matrix using expected and predicted values
STEP 7: Save the program and run it
STEP 8: Stop the process
PROGRAM
OUTPUT
...
SVC()
accuracy 0.44 0.55 0.53 36
[[10 4 0]
[4 9 0]
[5 4 0]]
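The accuracy figure in the report can be checked by hand from the confusion matrix. The short sketch below does that in plain Python, assuming the usual scikit-learn layout in which rows are expected classes and columns are predicted classes.

```python
# Confusion matrix as printed in the output above
# (rows: expected class, columns: predicted class)
cm = [[10, 4, 0],
      [4, 9, 0],
      [5, 4, 0]]

total = sum(sum(row) for row in cm)              # number of validation samples
correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal = correct predictions
accuracy = correct / total

# Recall per class: correct predictions divided by the row total
recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

print(total, correct, round(accuracy, 2))
print([round(r, 2) for r in recall])
```

The diagonal sums to 19 out of 36 samples, which rounds to the accuracy of 0.53 shown in the report. The third column is all zeros, so the third class is never predicted and its recall is 0.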
RESULT
Ex. No: 05
COMPARISON OF KNN, SVM, DECISION TREE, LOGISTIC REGRESSION, NAÏVE BAYES
Date:
AIM
To write a Python program to compare classification algorithms on the iris dataset
ALGORITHM
STEP 1: Start the process
STEP 2: Open Jupyter Notebook and create new Python3 notebook
STEP 3: Import read_csv from pandas and pyplot from matplotlib
STEP 4: Import train_test_split, cross_val_score and StratifiedKFold from
sklearn.model_selection
STEP 5: Import LogisticRegression from sklearn.linear_model, DecisionTreeClassifier from
sklearn.tree, KNeighborsClassifier from sklearn.neighbors, GaussianNB from sklearn.naive_bayes
and SVC from sklearn.svm
STEP 6: Load iris dataset and their column names using read_csv( )
STEP 7: Split out validation dataset using train_test_split( )
STEP 8: Create models using LogisticRegression( ), KNeighborsClassifier( ), GaussianNB( ),
DecisionTreeClassifier( ), SVC( )
STEP 9: Check algorithms and evaluate each model in turn using cross_val_score( )
STEP 10: Compare the algorithms using pyplot.boxplot( ) and display the graph using
pyplot.show( )
STEP 11: Save and run the program
STEP 12: Stop the process
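The evaluation step can be sketched in pure Python. The example below is a simplified, non-stratified and unshuffled k-fold loop (the exercise itself uses StratifiedKFold), with a majority-class baseline standing in for a real model; all names and the toy labels are hypothetical. It prints the result in a "name: value (value)" layout like the one in the output, on the assumption that those two figures are the mean and standard deviation of the fold accuracies.

```python
import statistics

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_accuracy(labels, k, fit_predict):
    """Hold out each fold once; fit_predict(train_idx, test_idx) returns predictions."""
    scores = []
    folds = k_fold_indices(len(labels), k)
    for test_idx in folds:
        train_idx = [i for fold in folds if fold is not test_idx for i in fold]
        predicted = fit_predict(train_idx, test_idx)
        hits = sum(p == labels[i] for p, i in zip(predicted, test_idx))
        scores.append(hits / len(test_idx))
    return scores

# Hypothetical labels, made up for illustration
y_toy = ['a', 'a', 'b', 'a', 'a', 'b', 'a', 'a', 'b', 'a']

def majority_baseline(train_idx, test_idx):
    """Predict the most frequent training label for every test sample."""
    train_labels = [y_toy[i] for i in train_idx]
    most_common = max(set(train_labels), key=train_labels.count)
    return [most_common] * len(test_idx)

scores = cross_val_accuracy(y_toy, 5, majority_baseline)
print('BASELINE: %f (%f)' % (statistics.mean(scores), statistics.pstdev(scores)))
```

Replacing majority_baseline with a function that trains and applies an actual classifier gives one such line per algorithm, which is what makes the models comparable.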
PROGRAM
pyplot.title('ALGORITHM Comparison')
pyplot.show()
OUTPUT
CART: 0.941667 (0.053359)
SVM: 0.983333 (0.033333)
RESULT
Ex. No: 06
BIRCH CLUSTERING
Date:
AIM
ALGORITHM
STEP 3: Import unique, where from numpy, make_classification from sklearn.datasets, Birch
from sklearn.cluster and pyplot from matplotlib
PROGRAM
OUTPUT
RESULT
Ex. No: 07
DATA EXPLORATION AND VISUALIZATION
Date:
AIM
To perform the data exploration of the iris dataset and to implement various statistical
operations.
ALGORITHM
STEP 4: Display the descriptive features of the dataset using commands like dim(iris),
names(iris), str(iris), attributes(iris), iris[1:3], head(iris,10), tail(iris,10), summary(iris),
iris[1:10,"Sepal.Length"].
STEP 5: Find the mean, range, median, quantiles and histogram using the following commands:
mean(iris$Sepal.Length), range(iris$Sepal.Length), median(iris$Sepal.Length),
quantile(iris$Sepal.Length), hist(iris$Sepal.Length).
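The same summary statistics can be illustrated with Python's standard library. The sketch below uses only the first ten values of the first iris column, as listed by str(iris) in the program that follows, so its results differ from the full-column figures in the R output; the "inclusive" quantile method is chosen here as an assumption, not as a verified match for R's default.

```python
import statistics

# First ten values of the first iris column, taken from the str(iris) listing
values = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]

mean_value = statistics.mean(values)
value_range = (min(values), max(values))
median_value = statistics.median(values)
# Quartiles, interpolating between the smallest and largest observation
quartiles = statistics.quantiles(values, n=4, method="inclusive")

print("mean:", mean_value)
print("range:", value_range)
print("median:", median_value)
print("quartiles:", quartiles)
```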
PROGRAM
> View(iris)
> dim(iris)
[1] 150 5
> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
[5] "Species"
> structure(iris)
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> attributes(iris)
$names
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
[5] "Species"
$class
[1] "data.frame"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
[17] 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
[49] 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
[65] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
[81] 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
[97] 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
[113] 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
[129] 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
[145] 145 146 147 148 149 150
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> head(iris,10)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
> summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor :50
virginica :50
> tail(iris,3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
> mean(iris$Sepal.Length)
[1] 5.843333
> range(iris$Sepal.Length)
[1] 4.3 7.9
> cov(iris$Sepal.Length,iris$Sepal.Width)
[1] -0.042434
> median(iris$Sepal.Length)
[1] 5.8
> plot(density(iris$Sepal.Length))
> hist(iris$Sepal.Length)
> table(iris$Sepal.Length)
4.3 4.4 4.5 4.6 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6
1 3 1 4 2 5 6 10 9 4 1 6 7 6 8 7 3 6
6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7 7.1 7.2 7.3 7.4 7.6 7.7 7.9
6 4 9 7 5 2 8 3 4 1 1 3 1 1 1 4 1
> pairs(iris)
> pie(table(iris$Species))
RESULT
Thus, the data exploration of the iris dataset is performed and various statistical
operations are implemented.
Ex. No: 08
LINEAR REGRESSION
Date:
AIM
To perform linear regression using the salary dataset to predict salary based on
experience.
ALGORITHM
STEP 6: Find the line of fit using the lm formula, using years of experience and salary.
PROGRAM
> dataset = read.csv("C:/Users/2msccs02/Documents/Salary_Data.csv")
> View(dataset)
> install.packages('caTools')
package ‘caTools’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\2msccs02\AppData\Local\Temp\RtmpkXv9S1\downloaded_packages
> library(caTools)
> set.seed(123)
> split = sample.split(dataset$Salary, SplitRatio = 2/3)
> training_set = subset(dataset, split == TRUE)
> test_set = subset(dataset, split == FALSE)
> regressor = lm(formula = Salary ~ YearsExperience,
+ data = training_set)
> y_pred = predict(regressor, newdata = test_set)
> library(ggplot2)
> ggplot() +
+ geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary),
+ colour = 'red') +
+ geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
+ colour = 'blue') +
+ ggtitle('Salary vs Experience (Training set)') +
+ xlab('Years of experience') +
+ ylab('Salary')
OUTPUT
RESULT
Ex. No: 9
ASSOCIATION RULE MINING
Date:
AIM
ALGORITHM
STEP 4: Calculate the support for frequent items using eclat() function
STEP 5: Display all the frequent itemsets in the Groceries dataset using inspect(frequentItems)
STEP 7: Sort the items by high support, high confidence and high lift
PROGRAM
> install.packages("arules")
> install.packages("arulesViz")
> library(arules)
> library(arulesViz)
> data("Groceries")
> transactions<-Groceries
> summary(Groceries)
transactions as itemMatrix in sparse format with
9835 rows (elements/itemsets/transactions) and
169 columns (items) and a density of 0.02609146
most frequent items:
whole milk other vegetables rolls/buns soda
2513 1903 1809 1715
yogurt (Other)
1372 34055
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14
2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77
15 16 17 18 19 20 21 22 23 24 26 27 28 29
55 46 29 14 14 9 11 4 6 1 1 1 1 3
32
1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 3.000 4.409 6.000 32.000
> itemFrequencyPlot(transactions,support=0.05,cex.names=0.8)
> itemFrequencyPlot(transactions,topN=20)
> frequentItems<-eclat(Groceries,parameter=list(supp=0.07,maxlen=15))
Eclat
parameter specification:
tidLists support minlen maxlen target ext
FALSE 0.07 1 15 frequent itemsets TRUE
algorithmic control:
sparse sort verbose
7 -2 TRUE
Absolute minimum support count: 688
create itemset ...
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [18 item(s)] done [0.00s].
creating sparse bit matrix ... [18 row(s), 9835 column(s)] done [0.00s].
writing ... [19 set(s)] done [0.00s].
Creating S4 object ... done [0.00s].
> inspect(frequentItems)
items support count
[1] {other vegetables, whole milk} 0.07483477 736
[2] {whole milk} 0.25551601 2513
[3] {other vegetables} 0.19349263 1903
[4] {rolls/buns} 0.18393493 1809
[5] {yogurt} 0.13950178 1372
[6] {soda} 0.17437722 1715
[7] {root vegetables} 0.10899847 1072
[8] {tropical fruit} 0.10493137 1032
[9] {bottled water} 0.11052364 1087
[10] {sausage} 0.09395018 924
[11] {shopping bags} 0.09852567 969
[12] {citrus fruit} 0.08276563 814
[13] {pastry} 0.08896797 875
[14] {pip fruit} 0.07564820 744
[15] {whipped/sour cream} 0.07168277 705
[16] {fruit/vegetable juice} 0.07229283 711
[17] {newspapers} 0.07981698 785
[18] {bottled beer} 0.08052872 792
[19] {canned beer} 0.07768175 764
> rules<-apriori(Groceries,parameter=list(supp=0.001,conf=0.5))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support
0.5 0.1 1 none FALSE TRUE 5 0.001
minlen maxlen target ext
1 10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 9
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [157 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 done [0.01s].
writing ... [5668 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
> rules_conf<-sort(rules,by="support",decreasing=TRUE)[1:20]
> inspect((rules_conf))
    lhs                                       rhs          support     confidence coverage   lift     count
[1] {other vegetables, yogurt}             => {whole milk} 0.022267412 0.5128806  0.04341637 2.007235 219
[2] {tropical fruit, yogurt}               => {whole milk} 0.015149975 0.5173611  0.02928317 2.024770 149
[3] {other vegetables, whipped/sour cream} => {whole milk} 0.014641586 0.5070423  0.02887646 1.984385 144
[4] {root vegetables, yogurt}              => {whole milk} 0.014539908 0.5629921  0.02582613 2.203354 143
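The support, confidence and lift reported for rule [1] can be recomputed from figures that already appear in the output. The sketch below is plain Python arithmetic using the standard definitions of the three measures; the variable names are illustrative only.

```python
# Figures taken from the output above
n_transactions = 9835
count_rule = 219           # transactions containing {other vegetables, yogurt, whole milk}
coverage_lhs = 0.04341637  # support of the left-hand side {other vegetables, yogurt}
support_rhs = 0.25551601   # support of the right-hand side {whole milk}

support = count_rule / n_transactions  # share of all transactions matching the rule
confidence = support / coverage_lhs    # share of LHS transactions that also contain the RHS
lift = confidence / support_rhs        # confidence relative to the RHS base rate

print(round(support, 6), round(confidence, 4), round(lift, 3))
```

A lift of about 2 means that customers who buy other vegetables and yogurt are roughly twice as likely to buy whole milk as a customer picked at random.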
RESULT
Ex. No: 10
CLASSIFICATION USING SVM
Date:
AIM
ALGORITHM
PROGRAM
> install.packages("kernlab")
package ‘kernlab’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\2msccs02\AppData\Local\Temp\Rtmpyw8HQk\downloaded_packages
> library(kernlab)
> rbf<-rbfdot(sigma=0.1)
> install.packages("partykit")
package ‘libcoin’ successfully unpacked and MD5 sums checked
package ‘mvtnorm’ successfully unpacked and MD5 sums checked
package ‘Formula’ successfully unpacked and MD5 sums checked
package ‘inum’ successfully unpacked and MD5 sums checked
package ‘partykit’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\2msccs02\AppData\Local\Temp\RtmpM1EoLr\downloaded_packages
> library(partykit)
Loading required package: grid
Loading required package: libcoin
Loading required package: mvtnorm
> ind <- sample(2, nrow(iris),replace=TRUE, prob = c(0.7,0.3))
> trainData <- iris[ind==1,]
> testData <- iris[ind==2,]
> irisSVM <- ksvm(Species~., data=trainData, type="C-bsvc", kernel=rbf, C=10,
+ prob.model=TRUE)
> irisSVM1 <-ksvm(Species~.,data=iris,type="C-bsvc",kernel="rbfdot",prob.model=TRUE)
> fitted(irisSVM)
[1] setosa setosa setosa setosa setosa setosa setosa setosa
setosa setosa
[11] setosa setosa setosa setosa setosa setosa setosa setosa
setosa setosa
[21] setosa setosa setosa setosa setosa setosa setosa setosa
setosa setosa
[31] versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor
versicolor versicolor
[41] versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor
versicolor versicolor
[51] versicolor versicolor versicolor versicolor versicolor versicolor virginica versicolor
versicolor versicolor
[61] versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor
virginica virginica
[71] virginica virginica virginica virginica virginica virginica virginica virginica
virginica virginica
[81] virginica virginica virginica virginica virginica virginica virginica virginica
virginica virginica
[91] virginica virginica versicolor virginica virginica virginica virginica virginica
virginica virginica
[101] virginica virginica
Levels: setosa versicolor virginica
> fitted(irisSVM1)
[1] setosa setosa setosa setosa setosa setosa setosa setosa
setosa setosa
[11] setosa setosa setosa setosa setosa setosa setosa setosa
setosa setosa
[21] setosa setosa setosa setosa setosa setosa setosa setosa
setosa setosa
[31] setosa setosa setosa setosa setosa setosa setosa setosa
setosa setosa
[41] setosa setosa setosa setosa setosa setosa setosa setosa
setosa setosa
[51] versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor
versicolor versicolor
[61] versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor
versicolor versicolor
[71] versicolor versicolor versicolor versicolor versicolor versicolor versicolor virginica
versicolor versicolor
[81] versicolor versicolor versicolor virginica versicolor versicolor versicolor versicolor
versicolor versicolor
[91] versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor
versicolor versicolor
[101] virginica virginica virginica virginica virginica virginica virginica virginica
virginica virginica
[111] virginica virginica virginica virginica virginica virginica virginica virginica
virginica versicolor
[121] virginica virginica virginica virginica virginica virginica virginica virginica
virginica virginica
[131] virginica virginica virginica versicolor virginica virginica virginica virginica
virginica virginica
[141] virginica virginica virginica virginica virginica virginica virginica virginica
virginica virginica
Levels: setosa versicolor virginica
RESULT
Thus, SVM classification of the given dataset has been executed and verified successfully.
Ex. No: 11
K-MEANS CLUSTERING AND OUTLIERS DETECTION
Date:
AIM
To create K-means clustering and to detect outliers of Iris dataset using R tool.
ALGORITHM
STEP 4: The clustering is done using K-means command by specifying the total number of
clusters.
STEP 5: Finally using the table command, the K-means cluster is displayed.
STEP 8: Find the outliers using the functions like sqrt, and order.
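The outlier step can be sketched in pure Python: a point's outlier score is its Euclidean distance to its cluster centre, and the points with the largest scores are reported. The sketch reuses the cluster centres and five data rows printed in the program output below, and assumes each point belongs to its nearest centre; the function name is hypothetical.

```python
import math

# Cluster centres as printed by the K-means output (four iris measurements each)
centers = {
    1: (6.850000, 3.073684, 5.742105, 2.071053),
    2: (5.006000, 3.428000, 1.462000, 0.246000),
    3: (5.901613, 2.748387, 4.393548, 1.433871),
}

# Five rows of iris2 keyed by row number (values from the program output),
# listed here in arbitrary order
rows = {
    119: (7.7, 2.6, 6.9, 2.3),
    61: (5.0, 2.0, 3.5, 1.0),
    99: (5.1, 2.5, 3.0, 1.1),
    94: (5.0, 2.3, 3.3, 1.0),
    58: (4.9, 2.4, 3.3, 1.0),
}

def distance_to_nearest_center(point):
    """Euclidean distance from a point to its closest cluster centre."""
    return min(math.dist(point, c) for c in centers.values())

# Largest distance first, mirroring order(distances, decreasing=T) in R
ranked = sorted(rows, key=lambda r: distance_to_nearest_center(rows[r]), reverse=True)
print(ranked)
```

Applied to all 150 rows instead of these five, the first entries of such a ranking are the outlier candidates.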
PROGRAM
> iris2<-iris
> iris2$Species<-NULL
> kmeans.result<-kmeans(iris2,3)
> table(iris$Species, kmeans.result$cluster)
1 2 3
setosa 0 50 0
versicolor 2 0 48
virginica 36 0 14
> kmeans.result
K-means clustering with 3 clusters of sizes 38, 50, 62
Cluster means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 6.850000 3.073684 5.742105 2.071053
2 5.006000 3.428000 1.462000 0.246000
3 5.901613 2.748387 4.393548 1.433871
Clustering vector:
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3
[59] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3 3 1
[117] 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3 1 1 3
Available components:
> table(iris$Species, kmeans.result$cluster)
1 2 3
setosa 0 50 0
versicolor 2 0 48
virginica 36 0 14
> plot(iris2[c("Sepal.Length", "Sepal.Width")],col=kmeans.result$cluster)
> kmeans.result$centers
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 6.850000 3.073684 5.742105 2.071053
2 5.006000 3.428000 1.462000 0.246000
3 5.901613 2.748387 4.393548 1.433871
> kmeans.result$cluster
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3
[59] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3 3 1
[117] 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3 1 1 3
> centers<-kmeans.result$centers[kmeans.result$cluster,]
> distances<-sqrt(rowSums((iris2 - centers)^2))
> outliers<-order(distances,decreasing=T)[1:5]
> print(outliers)
[1] 99 58 94 61 119
> print(iris2[outliers,])
Sepal.Length Sepal.Width Petal.Length Petal.Width
99 5.1 2.5 3.0 1.1
58 4.9 2.4 3.3 1.0
94 5.0 2.3 3.3 1.0
61 5.0 2.0 3.5 1.0
119 7.7 2.6 6.9 2.3
> plot(iris2[,c("Sepal.Length","Sepal.Width")],pch="o",col=kmeans.result$cluster,cex=0.3)
> points(kmeans.result$centers[,c("Sepal.Length","Sepal.Width")],col=1:3,pch=8,cex=1.5)
> points(iris2[outliers,c("Sepal.Length","Sepal.Width")],pch="+",col=4,cex=1.5)
RESULT
Ex. No: 12
TEXT PREPROCESSING
Date:
AIM
To perform text preprocessing and to create a term document matrix using Reuters
dataset.
ALGORITHM
STEP 3: Create a corpus of the text documents from the texts/crude folder of tm package and
name it as Reuters dataset.
STEP 4: Perform pre-processing steps like punctuation removal, stop words removal, white
space removal and lowercase conversion.
STEP 5: Create a document-term matrix for the Reuters dataset using
dtm <- DocumentTermMatrix(reuters) and display it.
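The preprocessing pipeline can be sketched with Python's standard library. The example below applies the same operations in the order used by the program (whitespace, lowercase, punctuation, numbers, stop words) to two sentences adapted from the first Reuters document; the tiny stop-word list and the function name are assumptions of the sketch, and the tm package ships a much longer English list.

```python
import re
import string
from collections import Counter

# Tiny stop-word list for illustration only (hypothetical, not tm's list)
STOP_WORDS = {"that", "it", "had", "its", "for", "by", "a", "the", "to"}

def preprocess(text):
    """Return the list of cleaned tokens for one document."""
    text = " ".join(text.split())                                     # collapse whitespace
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = re.sub(r"\d+", "", text)                                   # drop digits
    return [w for w in text.split() if w not in STOP_WORDS]           # drop stop words

docs = [
    "Diamond Shamrock Corp said that effective today it had cut its "
    "contract prices for crude oil by 1.50 dlrs a barrel.",
    "The reduction brings its posted price for West Texas Intermediate "
    "to 16.00 dlrs a barrel.",
]

# Document-term matrix as one term-frequency Counter per document
dtm = [Counter(preprocess(d)) for d in docs]
print(preprocess(docs[0]))
print(dtm[0]["crude"], dtm[1]["barrel"], dtm[0]["the"])
```

Each Counter is one row of the matrix: terms are the columns and the counts are the term-frequency weights.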
PROGRAM
> install.packages("tm")
> library(tm)
Loading required package: NLP
> cor <- system.file("texts","crude",package="tm")
> reuters <- Corpus(DirSource(cor))
> reuters
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 20
> inspect(reuters[1])
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 1
[Link]
<?xml version="1.0"?>\n<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN"
CGISPLIT="TRAINING-SET" OLDID="5670" NEWID="127">\n <DATE>26-
FEB-1987 17:00:56.04</DATE>\n <TOPICS>\n <D>crude</D>\n </TOPICS>\n
<PLACES>\n <D>usa</D>\n </PLACES>\n <PEOPLE/>\n
<ORGS/>\n
<EXCHANGES/>\n <COMPANIES/>\n <UNKNOWN>Y\n f0119 reute\nu f BC-
DIAMOND-SHAMROCK-(DIA 02-26 0097</UNKNOWN>\n
<TEXT>\n
<TITLE>DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES</TITLE>\n
<DATELINE>NEW YORK, FEB 26 -</DATELINE>\n <BODY>Diamond Shamrock
Corp said that\n effective today it had cut its contract prices for crude oil by\n1.50 dlrs
a barrel.\n The reduction brings its posted price for West Texas\nIntermediate to 16.00
dlrs a barrel, the copany said.\n "The price reduction today was made in the light
of falling\noil product prices and a weak crude oil market," a
company\nspokeswoman said.\n Diamond is the latest in a line of U.S. oil companies
that\nhave cut its contract, or posted, prices over the last two days\nciting weak oil
markets.\n Reuter</BODY>\n </TEXT>\n</REUTERS>
> reuters<-tm_map(reuters,stripWhitespace)
> writeLines(as.character(reuters[1]))
<?xml version="1.0"?> <REUTERS TOPICS="YES" LEWISSPLIT="TRAIN"
CGISPLIT="TRAINING-SET" OLDID="5670" NEWID="127"> <DATE>26-FEB-
1987 17:00:56.04</DATE> <TOPICS> <D>crude</D> </TOPICS> <PLACES>
<D>usa</D> </PLACES> <PEOPLE/> <ORGS/>
<EXCHANGES/>
<COMPANIES/> <UNKNOWN>Y f0119 reute u f BC-DIAMOND-SHAMROCK-
(DIA 02-26 0097</UNKNOWN> <TEXT> <TITLE>DIAMOND SHAMROCK
(DIA) CUTS CRUDE PRICES</TITLE> <DATELINE>NEW YORK, FEB 26
</DATELINE> <BODY>Diamond Shamrock Corp said that effective today it had cut
its contract prices for crude oil by 1.50 dlrs a barrel. The reduction brings its posted
price for West Texas Intermediate to 16.00 dlrs a barrel, the copany said. "The
price reduction today was made in the light of falling oil product prices and a weak
crude oil market," a company spokeswoman said. Diamond is the latest in a line
of U.S. oil companies that have cut its contract, or posted, prices over the last two days
citing weak oil markets. Reuter</BODY> </TEXT> </REUTERS>
list(language = "en")
list()
> reuters<-tm_map(reuters,tolower)
> writeLines(as.character(reuters[1]))
<?xml version="1.0"?> <reuters topics="yes" lewissplit="train" cgisplit="training-set"
oldid="5670" newid="127"> <date>26-feb-1987 17:00:56.04</date>
<topics>
<d>crude</d> </topics> <places> <d>usa</d> </places> <people/>
<orgs/>
<exchanges/> <companies/> <unknown>y f0119 reute u f bc-diamond-shamrock-(dia
02-26 0097</unknown> <text> <title>diamond shamrock (dia) cuts crude
prices</title> <dateline>new york, feb 26 -</dateline> <body>diamond shamrock corp
said that effective today it had cut its contract prices for crude oil by 1.50 dlrs a barrel.
the reduction brings its posted price for west texas intermediate to 16.00 dlrs a barrel,
the copany said. "the price reduction today was made in the light of falling oil
product prices and a weak crude oil market," a company spokeswoman said.
diamond is the latest in a line of u.s. oil companies that have cut its contract, or posted,
prices over the last two days citing weak oil markets. reuter</body> </text> </reuters>
list(language = "en")
list()
> reuters<-tm_map(reuters,removePunctuation)
> writeLines(as.character(reuters[1]))
xml version10 reuters topicsyes lewissplittrain cgisplittrainingset oldid5670 newid127
date26feb1987 17005604date topics dcruded topics places dusad places people orgs
list()
> reuters<-tm_map(reuters,removeNumbers)
> writeLines(as.character(reuters[1]))
xml version reuters topicsyes lewissplittrain cgisplittrainingset oldid newid datefeb date
topics dcruded topics places dusad places people orgs exchanges companies unknownyf
reute u f bcdiamondshamrockdia unknown text titlediamond shamrock dia cuts crude
pricestitle datelinenew york feb dateline bodydiamond shamrock corp said that
effective today it had cut its contract prices for crude oil by dlrs a barrel the reduction
brings its posted price for west texas intermediate to dlrs a barrel the copany said
quotthe price reduction today was made in the light of falling oil product prices and a
weak crude oil marketquot a company spokeswoman said diamond is the latest in a line
of us oil companies that have cut its contract or posted prices over the last two days
citing weak oil markets reuterbody text reuters
list(language = "en")
list()
> reuters<-tm_map(reuters,removeWords,stopwords("english"))
> writeLines(as.character(reuters[1]))
xml version reuters topicsyes lewissplittrain cgisplittrainingset oldid newid datefeb date
topics dcruded topics places dusad places people orgs exchanges companies unknowny
f reute u f bcdiamondshamrockdia unknown text titlediamond shamrock dia cuts crude
pricestitle datelinenew york feb dateline bodydiamond shamrock corp said effective
today cut contract prices crude oil dlrs barrel reduction brings posted price west texas
intermediate dlrs barrel copany said quotthe price reduction today made light falling oil
product prices weak crude oil marketquot company spokeswoman said diamond latest
line us oil companies cut contract posted prices last two days citing weak oil markets
reuterbody text reuters
list(language = "en")
list()
> dtm<-DocumentTermMatrix(reuters)
> inspect(dtm)
<<DocumentTermMatrix (documents: 20, terms: 1134)>>
Non-/sparse entries: 2351/20329
Sparsity : 90%
Maximal term length: 22
Weighting : term frequency (tf)
Sample :
Terms
Docs mln oil opec orgs places prices reuters said text topics
[Link] 4 12 10 2 2 6 3 11 2 2
[Link] 4 7 7 2 2 4 2 10 2 2
[Link] 1 3 1 2 2 1 2 1 2 2
[Link] 0 3 2 2 2 2 2 3 2 2
[Link] 0 5 1 1 2 1 2 5 2 2
[Link] 3 9 7 2 2 9 2 7 2 2
[Link] 10 5 5 2 2 5 2 8 2 2
[Link] 3 5 0 1 2 2 2 2 2 2
[Link] 3 6 0 1 2 2 2 2 2 2
[Link] 0 3 0 1 2 3 2 4 2 2
> findFreqTerms(dtm, 5)
[1] "barrel" "cgisplittrainingset" "companies" "company" "contract"
[6] "crude" "date" "datefeb" "dateline" "datelinenew"
[11] "dcruded" "dlrs" "dusad" "exchanges" "feb"
[16] "last" "lewissplittrain" "markets" "newid" "oil"
[21] "oldid" "orgs" "people" "places" "posted"
[26] "price" "prices" "reute" "reuterbody" "reuters"
[31] "said" "text" "today" "topics" "topicsyes"
[36] "unknown" "unknowny" "version" "west" "xml"
[41] "york" "ability" "agreement" "analysts" "april"
[46] "bpd" "december" "demand" "dopecd" "emergency"
[51] "energy" "hold" "industry" "march" "market"
[56] "may" "meet" "meeting" "mln" "new"
[61] "now" "one" "opec" "output" "petroleum"
[66] "production" "quota" "research" "sell" "sources"
[71] "will" "world" "pct" "present" "reserve"
[76] "reserves" "study" "year" "ali" "also"
[81] "arab" "barrels" "ceiling" "daily" "datemar"
[86] "expected" "gulf" "international" "kuwait" "members"
[91] "minister" "month" "official" "plans" "qatar"
[96] "quoted" "recent" "says" "sheikh" "traders"
[101] "united" "unknownrm" "week" "economic" "economy"
[106] "exports" "government" "group" "help" "imports"
[111] "report" "states" "dsaudiarabiad" "abdulaziz" "billion"
[116] "budget" "expenditure" "riyals" "accord" "agency"
[121] "arabia" "commitment" "nazer" "saudi" "february"
[126] "january" "exchange" "futures" "nymex"
> findAssocs(dtm, "opec", 0.7)
$opec
RESULT
Ex. No: 13
VISUALIZING MAXIMUM CONFIRMED, CURED COVID-19 PATIENTS IN COVID-19 DATASET
Date:
AIM
To visualize the COVID-19 dataset using Tableau and find the cured cases, death rate, and
confirmed cases.
ALGORITHM
STEP 3: In the worksheet choose the dimension and measure as state and cured to find the
state with the maximum number of people cured from covid.
STEP 4: In the worksheet choose the dimension and measure as Month and death rate to find
the month in 2021 with the highest death rate.
STEP 5: In the next worksheet choose the dimension and measure as state and confirmed case
to find the state with the maximum number of confirmed covid cases.
OUTPUT
RESULT
Ex. No: 14
VISUALIZING THE DIFFERENCE IN VOLUME YEAR-WISE AND QUARTER-WISE
Date:
AIM
To use Tableau with the stock dataset (2010-13) to create a chart that visualizes the
percent difference in volume for each company by year and quarter, and to find in how
many quarters the company Biogen Idec shows a positive percent difference in volume.
ALGORITHM
OUTPUT
RESULT
Ex. No: 15
VISUALIZING DELAY OF FLIGHTS
Date:
AIM
To use the flights table, create a bar chart showing the average of minutes of delay per
flight broken down by carrier name, and filtered by a state to only show Minnesota (MN).
ALGORITHM
STEP 4: In the worksheet choose the dimension as carrier name, state and measure as minutes
of delay per flight.
STEP 5: Select the State attribute -> Filter -> select MN from the list -> OK.
OUTPUT
RESULT
Ex. No: 16
VISUALIZING THE SALES OF THE PRODUCT IN BAKERY DATASET
Date:
AIM
To use the bakery dataset to summarize the sales of products for each day of the week and
to find the product which has the maximum sales on Monday using Tableau.
ALGORITHM
STEP 3: In the worksheet choose the dimensions as items, weekday(date), and measures as
transactions.
STEP 4: Select the Weekday attribute -> Filter -> select Monday from the list -> OK.
OUTPUT
RESULT
AIM
Create a chart to visualize the monthly change in volumes of stocks, from the beginning
of 2010 to the end of 2013 for the two consecutive months that have the least fluctuation in
increase or decrease.
ALGORITHM
STEP 3: Choose the chart and set the x-axis to date (year, month) and the y-axis to
volume(min), to find the product which has the least fluctuation.
STEP 4: In the next page set the x-axis to date (year, month) and the y-axis to
volume(min), to find the product which has the least fluctuation, and choose any
two months and years in the filter type.
OUTPUT
RESULT
Thus the above program is executed successfully and the output is verified.
Ex. No: 18
VISUALIZING PRODUCT-WISE SALES
AIM
To visualize the superstore dataset using Power BI and find the product that
produces the maximum sales.
ALGORITHM
STEP 4: In Page three create a Bar Chart. Drag and drop the Product Name in X-axis and Sales
Field in Y-axis. To view the Maximum Sales choose Maximum from the drop-down list to
view the Maximum Sales of each Product.
OUTPUT
RESULT
Thus the above program is executed successfully and the output is verified.
Ex. No: 19
VISUALIZING REGION-WISE AND CATEGORY-WISE PROFIT USING SUPERSTORE DATASET
Date:
AIM
To visualize the superstore dataset using Power BI and find the region that produces
the maximum profit.
ALGORITHM
STEP 3: Select and Load the Orders, Return and People Tables.
STEP 4: In Page One create a Pie Chart which displays the Maximum Sales and in the filter
add the condition to display the maximum sales of Product Names that starts with the Letter A.
Drag and drop the Product Name field to Legend to view the sales of each product.
STEP 5: In Page two create a Funnel Chart. Drag and drop the Profit field from the table in
X-axis and Region Field in Y-axis. To view the Maximum profit choose Maximum from the
drop-down list to view the Maximum Profit in each region.
STEP 6: In Page three create a Bar Chart. Drag and drop the Product Name in X-axis and Sales
Field in Y-axis. To view the Maximum Sales choose Maximum from the drop-down list to
view the Maximum Sales of each Product.
STEP 7: In Page four create a dashboard view of all the charts and create a Q&A from
Visualization -> Build Visualization -> Select Q&A. In the search box type the query to get
the answer.
OUTPUT
RESULT
Thus, the Superstore dataset is visualized using Power BI and the output is verified
successfully.
Ex. No: 20
POWER QUERY OPERATIONS USING WALMART AND MASTERS DATASET
Date:
AIM
To perform Power Query operations like append, merge, custom column and conditional
column on the Walmart and the Masters datasets in Power BI.
ALGORITHM
OUTPUT
c) Change the data type of Temperature, Fuel_Price, CPI and Unemployment to whole number
d) Calculate price using custom column
e) Using a conditional column, check whether the date is a holiday (Holiday Flag = 1) and
bring in the weekday
RESULT
Thus the above program is executed successfully and the output is verified.