Wine Data Analysis and Classification

The document outlines practical exercises in data mining under the supervision of Dr. Bhavya Deep. It includes tasks such as data cleaning, pre-processing, applying the Apriori algorithm, using classification algorithms, and clustering with K-Means. Each section provides code examples and expected outputs for datasets, primarily focusing on the wine dataset.

Uploaded by

kavyachauhan374

PRACTICAL RECORD FILE

DATA MINING
(Under the supervision of Dr. Bhavya Deep sir)

DEVESH MEENA
2302016
2nd YEAR 4th SEMESTER
BSC(H).COMPUTER SCIENCE
INDEX

Sr. No.    Practical Question    Sign

1    Apply data cleaning techniques on any dataset (e.g., wine dataset). Techniques may include handling missing values, outliers, inconsistent values. A set of validation rules can be prepared based on the dataset and validations can be performed.

2    Apply data pre-processing techniques such as standardization/normalization, transformation, aggregation, discretization/binarization, sampling etc. on any dataset.

3    Run Apriori algorithm to find frequent item sets and association rules on 2 real datasets and use appropriate evaluation measures to compute correctness of obtained patterns.
     a) Use minimum support as 50% and minimum confidence as 75%.
     b) Use minimum support as 60% and minimum confidence as 60%.

4    Use Naive Bayes, K-Nearest, and Decision Tree classification algorithms and build classifiers on any two datasets. Divide the dataset into training and test sets. Compare the accuracy of the different classifiers under the following situations:
     I. a) Training set = 75%, Test set = 25%.
        b) Training set = 66.6%, Test set = 33.3%.
     II. Training set is chosen by:
        i) Hold-out method
        ii) Random subsampling
        iii) Cross-validation.
     Compare the accuracy of the classifiers obtained. Data needs to be scaled to standard format.

5    Use Simple K-Means algorithm for clustering on any dataset. Compare the performance of clusters by changing the parameters involved in the algorithm. Plot MSE computed after each iteration using a line plot for any set of parameters.

Q1. Apply data cleaning techniques on any dataset (e.g., wine dataset). Techniques may include
handling missing values, outliers, inconsistent values. A set of validation rules can be prepared
based on the dataset and validations can be performed.

Code:
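import pandas as pd

# 1. Load dataset (semicolon-delimited).
# The file name was lost in extraction and appears as "[Link]";
# replace it with the path to the wine CSV.
df = pd.read_csv("[Link]", sep=';')

# 2. Check for missing values
print("Missing values per column:\n", df.isnull().sum())

# 3. Handle missing values (if any appear; this dataset has none by default).
# The fill statistic was garbled in the source; the column median is assumed here.
df.fillna(df.median(numeric_only=True), inplace=True)

# 4. Normalize/standardize text columns.
# The string method was garbled in the source; lower-casing is assumed here.
if 'type' in df.columns:
    df['type'] = df['type'].str.lower()

print("\nPost-cleaning summary statistics:")
print(df.describe())
print("\nData cleaning completed.")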

Output:
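The listing for this practical covers missing values and text normalization. The outlier handling and validation rules that the task also mentions can be sketched as follows. This is a minimal illustration in plain Python so that it runs without the dataset file; the rows, the rule ranges and the helper names `validate` and `iqr_outliers` are assumptions made for the example, not part of the original record.

```python
from statistics import quantiles

# Hypothetical rows standing in for a few wine records (not real data).
rows = [
    {"pH": 3.51, "alcohol": 9.4, "quality": 5},
    {"pH": 3.20, "alcohol": 9.8, "quality": 5},
    {"pH": 3.26, "alcohol": 9.8, "quality": 6},
    {"pH": 3.16, "alcohol": 10.0, "quality": 6},
    {"pH": 3.30, "alcohol": 10.5, "quality": 7},
    {"pH": 3.39, "alcohol": 9.5, "quality": 5},
    {"pH": 15.0, "alcohol": 9.9, "quality": 6},    # pH outside the 0-14 scale
    {"pH": 3.35, "alcohol": 45.0, "quality": 11},  # extreme alcohol, invalid score
]

# Validation rules: column name -> predicate a valid value must satisfy.
# The ranges are illustrative assumptions.
rules = {
    "pH": lambda v: 0 <= v <= 14,
    "alcohol": lambda v: v > 0,
    "quality": lambda v: 0 <= v <= 10,
}

def validate(rows, rules):
    """Return (row_index, column) pairs for every value that breaks a rule."""
    return [(i, col) for i, row in enumerate(rows)
            for col, is_valid in rules.items() if not is_valid(row[col])]

def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v < low or v > high]

violations = validate(rows, rules)
alcohol_outliers = iqr_outliers([r["alcohol"] for r in rows])
print("Rule violations:", violations)
print("Alcohol outlier rows:", alcohol_outliers)
```

Note that the alcohol value of 45.0 passes its validation rule but is still flagged by the IQR test, which is why the two checks are kept separate.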
Q2. Apply data pre-processing techniques such as standardization/normalization, transformation,
aggregation, discretization/binarization, sampling etc. on any dataset.

Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# 1. Load dataset (the file name was lost in extraction and appears as "[Link]")
df = pd.read_csv("[Link]", sep=';')

# 2. Select all numeric feature columns for scaling (the target 'quality' is excluded)
numeric_cols = [c for c in df.columns if df[c].dtype in ['float64', 'int64'] and c != 'quality']

# 3. Initialize the scaler
scaler = StandardScaler()

# 4. Fit & transform the numeric features
scaled_array = scaler.fit_transform(df[numeric_cols])

# 5. Convert back to a DataFrame
df_scaled = pd.DataFrame(scaled_array, columns=numeric_cols)

# 6. Re-attach the target column
df_scaled['quality'] = df['quality']

print("Standardized feature summary:")
print(df_scaled.describe().loc[['mean', 'std']])

output:
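The listing for this practical demonstrates standardization only. The other techniques named in the task (normalization, discretization, binarization, sampling) can be sketched in a few lines. This is a minimal example in plain Python; the values, the cut points and the helper names are assumptions chosen for illustration.

```python
import random

# Hypothetical alcohol readings (illustrative values, not real data).
alcohol = [9.4, 9.8, 10.0, 10.5, 11.2, 12.8, 13.0, 14.0]

def min_max(values):
    """Normalization: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, cut_points, labels):
    """Discretization: map each value to a label.

    cut_points are the inclusive upper bounds of every bin except the last.
    """
    result = []
    for v in values:
        bin_index = sum(v > c for c in cut_points)  # cut points lying below v
        result.append(labels[bin_index])
    return result

def binarize(values, threshold):
    """Binarization: 1 if the value exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

scaled = min_max(alcohol)
levels = discretize(alcohol, cut_points=[10.0, 12.0], labels=["low", "medium", "high"])
flags = binarize(alcohol, threshold=11.0)

# Sampling: draw 4 readings without replacement (seeded for repeatability).
sample = random.Random(0).sample(alcohol, 4)

print(scaled)
print(levels)
print(flags)
print(sample)
```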

Q3. Run Apriori algorithm to find frequent item sets and association rules on 2 real datasets and
use appropriate evaluation measures to compute correctness of obtained patterns.

a) Use minimum support as 50% and minimum confidence as 75%.
b) Use minimum support as 60% and minimum confidence as 60%.

code:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Load dataset (the file name was lost in extraction and appears as "[Link]")
df = pd.read_csv("[Link]", sep=';')

# Discretize selected features into low/medium/high using the quartiles as cut points
features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'alcohol']
for col in features:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    bins = [df[col].min() - 1, q1, q3, df[col].max() + 1]
    df[col + '_cat'] = pd.cut(df[col], bins=bins, labels=['low', 'medium', 'high'])

# Build one transaction per row.
# Note: the labels carry no feature name, so an item such as 'medium' means
# "some selected feature is medium" rather than a specific feature.
transactions = df[[c + '_cat' for c in features]].astype(str).values.tolist()

# Encode transactions as a boolean item matrix
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df_trans = pd.DataFrame(te_ary, columns=te.columns_)

# (a) Support ≥ 50%, Confidence ≥ 75%
itemsets_50 = apriori(df_trans, min_support=0.50, use_colnames=True)
rules_50 = association_rules(itemsets_50, metric="confidence", min_threshold=0.75)
print("Support ≥ 50%, Confidence ≥ 75%")
print(rules_50[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

# (b) Support ≥ 60%, Confidence ≥ 60%
itemsets_60 = apriori(df_trans, min_support=0.60, use_colnames=True)
rules_60 = association_rules(itemsets_60, metric="confidence", min_threshold=0.60)
print("Support ≥ 60%, Confidence ≥ 60%")
print(rules_60[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

OUTPUT:
Q4. Use Naive Bayes, K-Nearest Neighbors, and Decision Tree classification algorithms and build
classifiers on any two datasets. Divide the dataset into training and test sets. Compare the accuracy
of the different classifiers under the following situations:
I. a) Training set = 75%, Test set = 25%
   b) Training set = 66.6% (2/3 of total), Test set = 33.3%
II. Training set is chosen by i) hold-out method ii) random subsampling iii) cross-validation.
Compare the accuracy of the classifiers obtained. Data needs to be scaled to standard format.

Code:

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris, load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Function to evaluate classifiers
def evaluate_models(X, y, dataset_name):
    results = []
    classifiers = {
        'Naive Bayes': GaussianNB(),
        'KNN': KNeighborsClassifier(),
        'Decision Tree': DecisionTreeClassifier()
    }

    # Standardize features (the scaler is fitted on the full dataset before splitting)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # I.a) 75/25 split
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        results.append((dataset_name, name, "75/25 Split", acc))

    # I.b) 66.6/33.3 split
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.333, random_state=42)
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        results.append((dataset_name, name, "66.6/33.3 Split", acc))

    # II.i) Hold-out method (one fixed 70/30 split shared by all classifiers)
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=0)
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        acc = clf.score(X_test, y_test)
        results.append((dataset_name, name, "Hold Out", acc))

    # II.ii) Random subsampling (average over 5 unseeded random 70/30 splits)
    for name, clf in classifiers.items():
        scores = []
        for _ in range(5):
            X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3)
            clf.fit(X_train, y_train)
            scores.append(clf.score(X_test, y_test))
        results.append((dataset_name, name, "Random Subsampling", np.mean(scores)))

    # II.iii) Cross-validation (5-fold)
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X_scaled, y, cv=5)
        results.append((dataset_name, name, "5-Fold CV", np.mean(scores)))

    return results

# Load datasets
iris = load_iris()
wine = load_wine()

# Run evaluation
iris_results = evaluate_models(iris.data, iris.target, "Iris")
wine_results = evaluate_models(wine.data, wine.target, "Wine")

# Combine all results
combined_results = pd.DataFrame(iris_results + wine_results,
                                columns=["Dataset", "Classifier", "Evaluation Method", "Accuracy"])
print(combined_results)
output:
Q5. Use Simple K-Means algorithm for clustering on any dataset. Compare the performance of
clusters by changing the parameters involved in the algorithm. Plot MSE computed after each
iteration using a line plot for any set of parameters.
Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine
from sklearn.metrics import mean_squared_error

# Load the Wine dataset
wine = load_wine()
X = wine.data

# Standardize the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Function to run KMeans and collect MSE after each iteration
def kmeans_with_mse(X, n_clusters=3, max_iter=10):
    mse_list = []
    kmeans = KMeans(n_clusters=n_clusters, init='random', n_init=1, max_iter=1, random_state=42)
    for i in range(max_iter):
        # Each fit restarts from the same seeded initialization, so raising
        # max_iter by one reproduces the state after i + 1 iterations.
        kmeans.max_iter = i + 1
        kmeans.fit(X)
        labels = kmeans.predict(X)
        # MSE between every sample and its assigned centroid,
        # averaged over all samples and features
        mse = mean_squared_error(X, kmeans.cluster_centers_[labels])
        mse_list.append(mse)
    return mse_list

# Parameters
clusters = 3
iterations = 10

# Run and collect MSEs
mse_values = kmeans_with_mse(X_scaled, n_clusters=clusters, max_iter=iterations)

# Plotting MSE vs Iterations
plt.figure(figsize=(8, 5))
plt.plot(range(1, iterations + 1), mse_values, marker='o', linestyle='-', color='blue')
plt.title(f'K-Means Clustering MSE vs Iterations (k={clusters})')
plt.xlabel('Iteration')
plt.ylabel('Mean Squared Error (MSE)')
plt.grid(True)
plt.tight_layout()
plt.show()

Output:

Common questions

Scaling and standardization transform features to a comparable scale so that models are not biased toward features with larger magnitudes. This matters most for models that rely on distance measurements, such as K-Nearest Neighbors, where an unscaled feature with a large range can dominate the distance computation. Putting features on an equal footing can improve the accuracy of such models.
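A small numeric sketch can make the effect on distances concrete. The three samples below are hypothetical; the second feature is on a much larger scale than the first, and `standardize` is a helper written for this example.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical samples: (small-scale feature, large-scale feature).
samples = [(0.3, 1000.0), (0.9, 1010.0), (0.3, 1200.0)]

def standardize(rows):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]

# Euclidean distances from sample 0 to samples 1 and 2, before scaling.
raw_d01 = dist(samples[0], samples[1])
raw_d02 = dist(samples[0], samples[2])

# The same distances after standardization.
z = standardize(samples)
z_d01 = dist(z[0], z[1])
z_d02 = dist(z[0], z[2])

print("raw:   ", round(raw_d01, 2), round(raw_d02, 2))
print("scaled:", round(z_d01, 2), round(z_d02, 2))
```

Before scaling, the large-scale feature decides almost entirely which neighbour is closer. After scaling, the difference in the first feature counts as well and the two distances become comparable.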

The size of the training and test sets influences the measured accuracy of classifiers. A larger training set (e.g., a 75% share) often allows models such as Naive Bayes, K-Nearest Neighbors, and Decision Trees to learn more from the data, potentially improving accuracy. Conversely, a smaller training set (e.g., 66.6%) may capture less information and reduce accuracy. A larger test set gives a more reliable estimate of the model's generalization performance, but it leaves less data for training.

The Simple K-Means algorithm adjusts cluster centroids iteratively: it reassigns each point to its nearest centroid, then recalculates every centroid as the mean of the points assigned to it. Each of the two steps can only lower (or leave unchanged) the within-cluster variance, so the process converges to a local minimum of that variance; because the total variance of the data is fixed, lower within-cluster variance also means higher between-cluster variance. Iterating in this way refines the clusters to better reflect the structure of the data than the initial assignment does.
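The two alternating steps can be sketched in plain Python. This is a minimal illustration and not the scikit-learn implementation used in the practical: the points and starting centroids are hypothetical, `kmeans_trace` is a name chosen here, and the error recorded is the mean squared distance per point rather than per coordinate.

```python
from math import dist

def kmeans_trace(points, centroids, n_iter=5):
    """Run K-Means iterations from the given centroids.

    Returns the final centroids and the mean squared distance after each iteration.
    """
    centroids = [tuple(c) for c in centroids]
    history = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
        # Mean squared distance from each point to its updated centroid.
        mse = sum(dist(p, centroids[label]) ** 2
                  for p, label in zip(points, labels)) / len(points)
        history.append(mse)
    return centroids, history

# Hypothetical 2-D points and a deliberately poor initialization.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
initial = [(1.0, 1.0), (1.5, 2.0)]

centroids, history = kmeans_trace(points, initial)
print("Final centroids:", centroids)
print("Error per iteration:", [round(m, 3) for m in history])
```

In this run the error drops over the first two iterations and then stays constant, which is the point at which the assignments have stopped changing.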

Plotting the MSE during K-Means iterations provides insights into the convergence process of the algorithm. A decreasing MSE indicates that the within-cluster variance is reducing, suggesting improving clustering quality. Monitoring MSE helps identify whether the algorithm has reached a stable configuration, or if additional iterations might continue to improve the partitioning. It also assists in detecting any volatility or anomalies in convergence, ensuring that the chosen parameters effectively guide the clustering process.

Data cleaning involves handling missing values, outliers, and inconsistent data to enhance the quality and accuracy of the dataset for better analysis. After applying cleaning techniques, validation is crucial to ensure the integrity and representativeness of the data. Validation checks that the transformations have not introduced errors and that the dataset conforms to expected standards, so that subsequent data mining steps are based on accurate data.

Cross-validation, unlike hold-out and random subsampling, divides the dataset into multiple partitions and trains and tests on each in turn, ensuring that every data point is used for both training and validation. Averaging the results over the folds gives a more robust estimate of model performance and makes overfitting easier to detect. In contrast, the hold-out estimate can vary considerably because it depends on a single division of the data, and random subsampling, although repeated, may test some points several times and others never because the partitioning is not systematic.
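The difference in coverage between the methods can be checked on index lists alone, without training any model. The following is a minimal sketch in plain Python; `k_fold_indices` and `random_subsampling_indices` are helper names assumed for this example and are not scikit-learn functions.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the indices once, then yield (train, test) index lists for k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

def random_subsampling_indices(n_samples, test_fraction=0.3, repeats=5, seed=0):
    """Yield independent random (train, test) splits; test sets may overlap."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]

n = 20
cv_tested = sorted(j for _, test in k_fold_indices(n) for j in test)
sub_tested = {j for _, test in random_subsampling_indices(n) for j in test}

print("Each sample tested exactly once under 5-fold CV:", cv_tested == list(range(n)))
print("Distinct samples ever tested under random subsampling:", len(sub_tested), "of", n)
```

A hold-out split corresponds to a single (train, test) pair from either generator.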

The Naive Bayes classifier calculates probabilities using Bayes' theorem, assuming that each feature is independent of every other feature given the class. It computes a score for each class by multiplying the prior probability of the class by the likelihood of each feature value given the class, and normalizes the scores to obtain posterior probabilities. The independence assumption makes computation fast but may not hold in real-world datasets, which can affect predictive performance.
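The calculation can be followed by hand on a tiny categorical example. This sketch uses hypothetical training rows and plain counting; it is not the `GaussianNB` model from the practical, which models numeric features with normal distributions, and it applies no smoothing.

```python
from collections import Counter, defaultdict

# Hypothetical training rows: (features, class label). Not taken from the wine data.
train = [
    ({"acidity": "high", "alcohol": "low"}, "poor"),
    ({"acidity": "high", "alcohol": "low"}, "poor"),
    ({"acidity": "low", "alcohol": "low"}, "poor"),
    ({"acidity": "high", "alcohol": "high"}, "poor"),
    ({"acidity": "low", "alcohol": "high"}, "good"),
    ({"acidity": "low", "alcohol": "high"}, "good"),
    ({"acidity": "high", "alcohol": "high"}, "good"),
    ({"acidity": "low", "alcohol": "low"}, "good"),
]

def fit(rows):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (label, feature name) -> Counter of values
    for features, label in rows:
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
    return class_counts, value_counts

def posterior(features, class_counts, value_counts):
    """Return P(class | features) under the feature-independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(class)
        for name, value in features.items():
            # likelihood P(value | class); without smoothing an unseen value gives 0
            score *= value_counts[(label, name)][value] / n_label
        scores[label] = score
    evidence = sum(scores.values())  # assumed non-zero for this example
    return {label: score / evidence for label, score in scores.items()}

class_counts, value_counts = fit(train)
result = posterior({"acidity": "high", "alcohol": "low"}, class_counts, value_counts)
print(result)
```

For the query shown, the score for "poor" is 0.5 × 0.75 × 0.75 and the score for "good" is 0.5 × 0.25 × 0.25, which normalize to posteriors of 0.9 and 0.1.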

Changing minimum support and confidence thresholds significantly affects the output of the Apriori algorithm. Lowering these thresholds generally increases the number of association rules generated, capturing more patterns, but at the risk of including weaker, less significant rules. Conversely, higher thresholds result in fewer rules that are stronger and potentially more useful, as they represent more frequent and confident associations. The balance between these parameters is crucial to derive meaningful insights without overfitting or underrepresenting the underlying data patterns.

Data pre-processing techniques include standardization/normalization, transformation, aggregation, discretization/binarization, and sampling. These techniques help in preparing the data for better performance in data mining and machine learning tasks by ensuring that each feature contributes equally to the analysis. For example, standardization transforms features to have a mean of zero and standard deviation of one, reducing the bias in analytical models towards features with higher magnitudes.
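The standardization formula, z = (x − mean) / standard deviation, can be verified with the standard library. The values below are hypothetical. The sketch uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses, while pandas' `describe()` reports the sample standard deviation; that difference is why a standardized column can show a `std` slightly above 1 in a `describe()` summary.

```python
from statistics import mean, pstdev, stdev

def standardize(values):
    """Return z-scores using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

alcohol = [9.4, 9.8, 10.0, 10.5, 11.2, 12.8]  # hypothetical readings
z = standardize(alcohol)

print("mean:          ", round(mean(z), 6))
print("population std:", round(pstdev(z), 6))
print("sample std:    ", round(stdev(z), 6))  # equals sqrt(n / (n - 1)) for n values
```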

The Apriori algorithm determines frequent item sets by iteratively expanding item sets and evaluating them against a minimum support threshold. Only those with support above this threshold are considered frequent. The evaluation of correctness of patterns is often based on support, which measures the frequency of item sets, and confidence, which assesses the likelihood of consequent items given antecedent items. Metrics such as lift and conviction further help evaluate the strength and novelty of the association rules derived.
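Support, confidence and lift can be computed directly from their definitions. The sketch below uses a handful of hypothetical transactions and exact fractions so that comparisons against thresholds such as 75% are not disturbed by floating-point rounding; it illustrates the measures only and does not implement the Apriori search itself.

```python
from fractions import Fraction

# Hypothetical market-basket transactions (illustrative only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(itemset <= t for t in transactions)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """confidence / support(consequent); 1 means the two sides are independent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
rule_lift = lift({"bread"}, {"milk"}, transactions)

print("support   =", float(rule_support))
print("confidence=", float(rule_confidence))
print("lift      =", float(rule_lift))
```

The rule bread → milk meets a 50% support and 75% confidence threshold, yet its lift is below 1, so milk is slightly less likely in baskets with bread than in baskets overall. This is why lift is worth reporting alongside confidence.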
