Weka Data Preprocessing Guide

The document is a lab manual detailing Experiment No. 06, which focuses on data pre-processing using the WEKA tool. It outlines steps for loading datasets, computing statistics, filtering attributes, and discretizing numerical data, along with tasks for students to complete. Additionally, it covers the importance of preprocessing in data analysis, types of data, and various preprocessing techniques.

Uploaded by

yuvrajzamindar

LAB Manual

PART A
(PART A: TO BE REFERRED BY STUDENTS)

Experiment No.06
1. Demonstration of pre-processing on the available datasets

Aim: To understand some of the basic data pre-processing operations that can be performed
using WEKA-Explorer. The sample dataset used for this example is the student/labor data
available in .arff format.

Prerequisite:
WEKA downloaded (open source), fundamental knowledge of database management

Learning Outcomes:
Pre-processing, working of the WEKA tool, pre-processing using the WEKA tool.

Theory:
Step 1: Loading the data. We can load the dataset into WEKA by clicking the Open file button
in the Preprocess interface and selecting the appropriate file.
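
To make the loading step more concrete, the sketch below reads a tiny ARFF text by hand. It is only an illustration of the file layout (`@relation`, `@attribute`, `@data`), not how WEKA itself parses files. The sample data, the relation name `student_sample` and the helper `load_arff` are hypothetical, and the parser assumes unquoted names and dense comma-separated rows.

```python
import io

# Hypothetical miniature dataset in ARFF syntax (not the lab's student/labor file).
SAMPLE_ARFF = """\
@relation student_sample
@attribute age numeric
@attribute grade {A,B,C}
@data
20,A
22,B
?,C
"""


def load_arff(text):
    """Parse a simple ARFF string into (relation, attributes, rows).

    Assumption: unquoted names, numeric/nominal attributes, dense rows only.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in io.StringIO(text):
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            # Every line after @data is one instance; "?" marks a missing value.
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = load_arff(SAMPLE_ARFF)
print(relation, attributes, len(rows))
```

The returned attribute list corresponds roughly to what the Explorer shows in its attribute panel after a file is opened.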

Step 2: Once the data is loaded, WEKA recognizes the attributes and, while scanning the data,
computes some basic statistics on each attribute. The left panel in the Explorer shows the
list of recognized attributes, while the top panel shows the names of the base relation (or
table) and the current working relation (which are the same initially).

Step 3: Clicking on an attribute in the left panel shows basic statistics for that attribute.
For categorical attributes, the frequency of each attribute value is shown, while for
continuous attributes we can obtain the min, max, mean, and standard deviation.
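
The statistics described in Step 3 can be reproduced by hand as a check. The sketch below is a minimal illustration with made-up values; the helper `summarize` is hypothetical, and it uses the sample standard deviation, which is an assumption since the manual does not say which estimator WEKA reports.

```python
from collections import Counter
import statistics


def summarize(values, numeric):
    """Return basic statistics for one attribute, ignoring "?" (missing) entries."""
    present = [v for v in values if v != "?"]
    if numeric:
        nums = [float(v) for v in present]
        return {
            "min": min(nums),
            "max": max(nums),
            "mean": statistics.mean(nums),
            "stdev": statistics.stdev(nums),  # sample standard deviation (assumption)
        }
    # Categorical attribute: frequency of each distinct value.
    return dict(Counter(present))


ages = ["20", "22", "24", "?"]    # hypothetical continuous attribute
grades = ["A", "B", "A", "C"]     # hypothetical categorical attribute

age_stats = summarize(ages, numeric=True)
grade_counts = summarize(grades, numeric=False)
print(age_stats)
print(grade_counts)
```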

Step 4: Create a visualization in the bottom-right panel in the form of a cross-tabulation
across two attributes.

Note: we can select another attribute using the attribute list.
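
A cross-tabulation simply counts how often each pair of values occurs together. The following sketch shows the idea on two made-up attributes (`grade` and `passed` are hypothetical names, not columns from the lab datasets); it is a plain-Python illustration, not WEKA's own visualization code.

```python
from collections import Counter


def cross_tabulate(col_a, col_b):
    """Count co-occurrences of values from two equally long attribute columns."""
    return Counter(zip(col_a, col_b))


grade = ["A", "B", "A", "C", "B"]            # hypothetical attribute 1
passed = ["yes", "yes", "yes", "no", "no"]   # hypothetical attribute 2

table = cross_tabulate(grade, passed)

# Print one row per value of the first attribute.
for a in sorted(set(grade)):
    row = {b: table[(a, b)] for b in sorted(set(passed))}
    print(a, row)
```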

Step 5: Selecting or filtering attributes

Removing an attribute
When we need to remove an attribute, we can do this by using the attribute filters in WEKA.
In the Filter panel, click the Choose button; this shows a popup window with a list of
available filters.

Scroll down the list and select the “[Link]” filter.

Step 6:
a) Next, click the text box immediately to the right of the Choose button. In the resulting
dialog box, enter the index of the attribute to be filtered out.
b) Make sure that the invert selection option is set to false, then click OK. The filter box
will now show “Remove -R 7”.
c) Click the Apply button to apply the filter to the data. This removes the attribute and
creates a new working relation.
d) Save the new working relation as an .arff file by clicking the Save button on the top
panel.
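
The effect of the remove filter and its invert selection option can be sketched in a few lines. This is an illustration only: the function `remove_attributes`, the header names and the rows are hypothetical, and the code assumes 1-based attribute indices as entered in the filter dialog.

```python
def remove_attributes(header, rows, indices, invert=False):
    """Drop the attributes at the given 1-based indices.

    With invert=True the selection is inverted: only the listed
    attributes are kept and all others are dropped.
    """
    selected = {i - 1 for i in indices}  # convert to 0-based positions
    if invert:
        keep = [i for i in range(len(header)) if i in selected]
    else:
        keep = [i for i in range(len(header)) if i not in selected]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


header = ["id", "age", "pension"]                 # hypothetical attributes
rows = [["1", "20", "none"], ["2", "35", "full"]]

new_header, new_rows = remove_attributes(header, rows, [3])
print(new_header, new_rows)
```

Leaving invert selection set to false, as the step instructs, corresponds to the default `invert=False` here, so only the listed attribute disappears from the working relation.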
Discretization of an attribute
Association rule mining can sometimes be performed only on categorical data, which requires
discretizing numeric or continuous attributes first. In the following example, we discretize
a numerical attribute.

• Divide the values of a numerical attribute into three bins (intervals).
• First load the dataset into WEKA ([Link]).
• Select any numerical attribute.
• Activate the filter dialog box and select “[Link]”
from the list.
• To change the defaults for the filter, click on the box immediately to the right of the
Choose button.
• Enter the index of the attribute to be discretized. In this case, if the attribute appears at
serial no. 1, we have to enter ‘1’ corresponding to this attribute.
• Enter ‘3’ as the number of bins. Leave the remaining field values as they are.
• Click the OK button.
• Click Apply in the filter panel. This results in a new working relation with the
selected attribute partitioned into 3 bins.
• Save the new working relation in a new file.
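
The binning performed in the steps above can be approximated by hand. The sketch below assumes equal-width binning, which the manual does not state explicitly, and uses made-up ages; `equal_width_bins` is a hypothetical helper, not a WEKA function.

```python
def equal_width_bins(values, bins=3):
    """Assign each value a bin index 0..bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        # All values identical: everything falls into the first bin.
        return [0 for _ in values]
    labels = []
    for v in values:
        idx = int((v - lo) / width)
        labels.append(min(idx, bins - 1))  # the maximum value belongs to the last bin
    return labels


ages = [18, 21, 25, 30, 42, 48]   # hypothetical numeric attribute
binned = equal_width_bins(ages, bins=3)
print(binned)
```

With a range of 18 to 48 and three bins, each interval is 10 units wide, so the first three ages share one bin, 30 falls in the second, and 42 and 48 fall in the third.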

Tasks:

1. Download Weka tool and understand the working.

2. Explore [Link] and [Link] dataset.

3. Work with another dataset called [Link].

4. Apply pre-processing for all the datasets as explained in PART A.

Removing an attribute called “Pension”

Filter and Normalize Data (Optional): Use filters in the "Filters" panel to apply data preprocessing
techniques such as smoothing, normalization, or discretization.

Handling Missing Values (Optional): If your dataset has missing values, you can handle them by
selecting the "Edit" button in the Preprocess panel and choosing how to deal with missing values
(e.g., replace them with mean, median, or a specific value).

Attribute Selection (Optional): Use the "Select attributes" panel to perform attribute selection if
needed.

5. Try to apply some data mining techniques on above dataset and visualize the
output.

6. Record all the screen shots.

PART B
(PART B: TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per following segments within two hours of the
practical slot. The soft copy must be uploaded on the LMS (Teams/Portal) or emailed to the
concerned lab in charge faculty at the end of the practical in case there is no LMS access
available)

Roll No. 70022200455 Name: Avni Bhardwaj

Class : Btech CE Batch : 1
Date of Experiment: 23/9/23 Date of Submission: 23/9/23

B.1 Preprocessing of [Link], [Link] and [Link]:

(Paste your screen shots of all tasks completed during the 2 hours of practical here)

B.2 Observations and learning:

(Students are expected to comment on the output obtained with clear observations and
learning for each task/ sub part assigned).

B.3 Questions of Curiosity:

1. What is preprocessing? Why is it required?
Preprocessing in the context of data analysis and machine learning refers to a series of steps
and techniques applied to raw data before it is used for modeling or analysis. The primary
purpose of preprocessing is to prepare the data in a clean, structured, and suitable format for
the specific task at hand. It ensures that the data is clean, relevant, and properly formatted,
which in turn helps improve the accuracy and performance of models and analysis results.
The specific preprocessing steps used may vary depending on the nature of the data and the
goals of the analysis or modeling task.
Preprocessing is required for several reasons:
• Data Quality Improvement: Remove errors, inconsistencies, and handle missing values.

• Normalization and Scaling: Ensure consistent feature scales.

• Handling Missing Data: Address missing values to prevent model issues.

• Feature Engineering: Create or transform features for better model performance.

• Dimensionality Reduction: Reduce features to avoid overfitting and save computation.

• Categorical Data Encoding: Convert text or labels into numerical data.

• Outlier Detection: Identify and manage outliers that can skew results.

• Data Splitting: Separate data into training and testing sets.

• Normalization of Distributions: Ensure data follows the expected distribution.

• Text Data Preprocessing: Tokenize, stem, and clean text data.

• Balancing Imbalanced Data: Address class imbalance in classification tasks.

• Data Scaling for Algorithms: Standardize data for specific algorithms.
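
As one concrete instance of the reasons listed above, handling missing data can be as simple as replacing each missing entry with the mean of the observed values. The sketch below is a minimal illustration with made-up numbers; `impute_mean` is a hypothetical helper, and mean imputation is only one of several possible strategies.

```python
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    fill = statistics.mean(present)
    return [fill if v is None else v for v in values]


incomes = [30.0, None, 50.0, 40.0]   # hypothetical attribute with one missing value
filled = impute_mean(incomes)
print(filled)
```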

2. List and explain the different data types (numerical, categorical (ordinal and
nominal)).
—>In the realm of data analysis and machine learning, understanding different data types is
fundamental. Data can be categorized into several types, with numerical and categorical data
being two primary classifications. This assignment delves into these data types, providing
explanations and examples for each.

1. Numerical Data:
- Numerical data comprises numbers and can be further subdivided into two subtypes:
a. Continuous Numerical Data: This type involves values that can take any real number
within a certain range. For instance, attributes like age, height, temperature, and income fall
under this category. Continuous data is measurable and can assume an infinite number of
values within its defined range.
b. Discrete Numerical Data: In contrast, discrete numerical data entails values that are
counted and are typically whole numbers. Examples encompass the count of products sold,
the number of children in a family, and the quantity of cars in a parking lot. Discrete data
assumes distinct, separate values.

2. Categorical Data:
- Categorical data represents distinct groups or categories rather than measured
quantities. Categorical data can be further divided into two subtypes:

a. Nominal Data: Nominal data denotes categories with no inherent order or ranking.
Classic instances include colors (e.g., red, blue, green), types of animals (e.g., cat, dog, bird),
and country names. Nominal data can only be categorized and compared based on equality or
inequality; it does not permit arithmetic operations.
b. Ordinal Data: Ordinal data signifies categories with a specific order or ranking, though
the intervals between categories may not be uniformly meaningful. Examples encompass
education levels (e.g., high school, bachelor's, master's), customer satisfaction ratings (e.g.,
poor, fair, good, excellent), and star ratings for products (e.g., 1 star, 2 stars, 3 stars). While
ordinal data possesses an order, the differences between categories may not be consistently
interpretable.
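
The practical difference between nominal and ordinal data shows up in which comparisons make sense. In the sketch below, the ordering of education levels is taken from the example above, while the names `EDUCATION_ORDER`, `rank` and `at_least` are hypothetical and chosen only for illustration.

```python
# Ordinal data: the categories carry an order, so each level can be given a rank.
EDUCATION_ORDER = ["high school", "bachelor's", "master's"]
rank = {level: i for i, level in enumerate(EDUCATION_ORDER)}


def at_least(level, threshold):
    """Return True if `level` is ranked at or above `threshold`."""
    return rank[level] >= rank[threshold]


print(at_least("master's", "bachelor's"))      # order comparison is meaningful
print(at_least("high school", "bachelor's"))

# Nominal data: only equality or inequality is meaningful, there is no ranking.
same_color = ("red" == "blue")
print(same_color)
```

Note that the ranks only encode order; the gap between rank 0 and rank 1 is not claimed to equal the gap between rank 1 and rank 2.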

3. Which preprocessing techniques are currently used in data analysis? Explain.
(Please refer to the relevant websites/latest research papers to answer this question.)
—>
1. Data Cleaning: This involves handling missing values, dealing with duplicate records, and
addressing outliers. Techniques include imputation, outlier detection, and removal.

2. Data Transformation: Data transformation methods include log transformation, scaling
(e.g., Min-Max scaling), and standardization (mean-removal and variance-scaling). These
techniques help make data suitable for certain algorithms and improve model performance.

3. Feature Engineering: Feature engineering involves creating new features from existing
ones or selecting the most relevant features. Techniques include one-hot encoding for
categorical variables, creating interaction terms, and using dimensionality reduction
techniques like Principal Component Analysis (PCA).

4. Handling Categorical Data: Categorical data often requires special preprocessing,
including one-hot encoding for nominal data and ordinal encoding for ordinal data. This
transforms categorical variables into a numerical format for analysis.

5. Text Preprocessing: In natural language processing (NLP) tasks, text data preprocessing
techniques include tokenization, stemming, lemmatization, and stop word removal to prepare
text for analysis.

6. Data Sampling: Data imbalance is common in classification tasks. Techniques such as
oversampling (creating more instances of the minority class) and undersampling (reducing
instances of the majority class) can help balance the dataset.

7. Data Splitting: Datasets are typically split into training, validation, and test sets to evaluate
model performance. This ensures that models are tested on unseen data.

8. Normalization of Distributions: Some algorithms assume specific data distributions (e.g.,
Gaussian). Techniques like the Box-Cox transformation can help normalize data distributions.

9. Handling Time-Series Data: For time-series data, preprocessing may involve resampling,
handling missing time points, and feature extraction from temporal data.

10. Data Integration: When working with multiple data sources, integrating and merging data
is crucial. Techniques include data alignment, join operations, and data aggregation.

11. Noise Reduction: In signal processing and image analysis, noise reduction techniques,
such as filtering, are used to remove unwanted noise from the data.

12. Handling Spatial Data: In GIS and spatial analysis, preprocessing techniques involve
georeferencing, coordinate transformation, and spatial data filtering.
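
Three of the techniques listed above (Min-Max scaling, standardization, and one-hot encoding) are small enough to sketch directly. The code below is a plain-Python illustration under stated assumptions: the function names and sample values are hypothetical, and standardization here uses the population standard deviation, which is one common convention rather than the only one.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd if sd else 0.0 for v in values]


def one_hot(values):
    """Encode nominal values as 0/1 indicator vectors, one column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


heights = [150.0, 160.0, 170.0]          # hypothetical numeric attribute
scaled = min_max_scale(heights)
z_scores = standardize(heights)
cats, encoded = one_hot(["red", "blue", "red"])

print(scaled)
print(z_scores)
print(cats, encoded)
```

After standardization the values have mean 0 and unit variance, which is what the mean-removal and variance-scaling description in item 2 refers to.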
