Weka Data Preprocessing Guide

The document is a lab manual detailing Experiment No. 06, which focuses on data pre-processing using the WEKA tool. It outlines steps for loading datasets, computing statistics, filtering attributes, and discretizing numerical data, along with tasks for students to complete. Additionally, it covers the importance of preprocessing in data analysis, types of data, and various preprocessing techniques.

Uploaded by

yuvrajzamindar

LAB Manual

PART A
(PART A: TO BE REFERRED BY STUDENTS)

Experiment No.06
1. Demonstration of pre-processing on the available datasets

Aim: To understand some of the basic data pre-processing operations that can be performed
using WEKA-Explorer. The sample dataset used for this example is the student/labor data
available in .arff format.

Prerequisite:
Weka Downloaded (Open source), Fundamental Knowledge of Database Management
Learning Outcomes:
Pre-processing, working of WEKA TOOL, Pre-processing using weka tool.

Theory:
Step 1: Loading the data. We can load the dataset into weka by clicking the Open file button in
the pre-processing (Preprocess) interface and selecting the appropriate file.
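
To make the .arff layout concrete, the following is a minimal Python sketch of reading such a file outside weka. It is an illustration only, not how weka itself loads data. It assumes a simple dense file: a header of @relation and @attribute lines, an @data line, then comma-separated rows in which '?' marks a missing value. Quoted names, quoted commas, and the sparse format are not handled. The function name parse_arff and the toy file contents are invented for this example.

```python
def parse_arff(text):
    """Parse a simple dense ARFF document into (relation, attributes, rows)."""
    relation = None
    attributes = []   # (name, type) pairs in declaration order
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):    # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            # '?' marks a missing value in ARFF; map it to None
            values = [v.strip() for v in line.split(",")]
            rows.append([None if v == "?" else v for v in values])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, attr_type = line.split(None, 2)
            attributes.append((name, attr_type))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Toy file contents, invented for illustration.
sample = """\
% toy student data
@relation student
@attribute age numeric
@attribute grade {A,B,C}
@data
20,A
?,B
"""

relation, attributes, rows = parse_arff(sample)
print(relation)
print(attributes)
print(rows)
```

All values are kept as strings here; converting numeric columns to numbers would be a separate step.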

Step 2: Once the data is loaded, weka recognizes the attributes and, while scanning
the data, computes some basic statistics on each attribute. The left panel in the
explorer shows the list of recognized attributes, while the top panel indicates the names of the
base relation (or table) and the current working relation (which are the same initially).

Step 3: Clicking on an attribute in the left panel shows the basic statistics for that attribute.
For categorical attributes, the frequency of each attribute value is shown, while for
continuous attributes we can obtain the min, max, mean, and standard deviation.
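
The kind of per-attribute summary described in Step 3 can be sketched in Python as follows. This is an illustration under assumptions, not weka's own code: the function name summarize and the sample columns are invented, missing values are represented as None, and the sample standard deviation is used (weka's exact formula is not stated in this manual).

```python
from collections import Counter
from statistics import mean, stdev


def summarize(values):
    """Summarize one attribute column; None marks a missing value."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    if present and all(isinstance(v, (int, float)) for v in present):
        # Continuous attribute: report min, max, mean, standard deviation
        return {
            "type": "numeric",
            "missing": missing,
            "min": min(present),
            "max": max(present),
            "mean": mean(present),
            "stdev": stdev(present) if len(present) > 1 else 0.0,
        }
    # Categorical attribute: report the frequency of each value
    return {"type": "nominal", "missing": missing, "counts": dict(Counter(present))}


# Sample columns, invented for illustration.
ages = [20, 22, None, 24]
grades = ["A", "B", "A", None]

age_stats = summarize(ages)
grade_stats = summarize(grades)
print(age_stats)
print(grade_stats)
```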

Step 4: Create a visualization in the bottom-right panel in the form of a cross-tabulation across
two attributes.

Note: we can select another attribute using the attribute list.

Step 5: Selecting or filtering attributes

Removing an attribute
When we need to remove an attribute, we can do this by using the attribute filters in weka. In
the Filter panel, click on the Choose button; this will show a popup window with a list of
available filters.

Scroll down the list and select the “[Link]” filter.

Step 6:
a) Next, click the text box immediately to the right of the Choose button. In the resulting dialog
box, enter the index of the attribute to be filtered out.
b) Make sure that the invert selection option is set to false. Then click OK. The filter box
will now show “Remove -R 7” (here, for attribute index 7).
c) Click the Apply button to apply the filter to this data. This will remove the attribute and create
a new working relation.
d) Save the new working relation as an .arff file by clicking the Save button on the top (button)
panel.
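
The effect of the steps above can be sketched in Python as follows. This is a minimal illustration of removing a column by its 1-based index, including the invert selection behaviour, and is not weka's implementation. The function name remove_attributes and the small table are invented for the example.

```python
def remove_attributes(header, rows, indices, invert=False):
    """Drop (or, with invert=True, keep only) the columns at the given 1-based indices."""
    selected = {i - 1 for i in indices}           # convert to 0-based positions
    keep = [
        pos for pos in range(len(header))
        if (pos in selected) == invert            # invert=False keeps the unselected columns
    ]
    new_header = [header[pos] for pos in keep]
    new_rows = [[row[pos] for pos in keep] for row in rows]
    return new_header, new_rows


# Toy relation, invented for illustration.
header = ["duration", "wage", "pension"]
rows = [[1, 5.0, "none"], [2, 4.5, "full"]]

new_header, new_rows = remove_attributes(header, rows, indices=[3])
only_header, only_rows = remove_attributes(header, rows, indices=[3], invert=True)
print(new_header, new_rows)
print(only_header, only_rows)
```

With invert left at False the listed attribute is removed; with invert set to True only the listed attribute is kept, which is why step b) asks for it to be false.
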
Discretization of an attribute
Sometimes association rule mining can be performed only on categorical data. This
requires discretizing numeric or continuous attributes. In the following
example, let us discretize any numerical attribute.

• Divide the values of any numerical attribute into three bins (intervals).
• First load the dataset into weka ([Link])
• Select any one of the numerical attributes.
• Activate the filter dialog box and select “[Link]”
from the list.
• To change the defaults for the filter, click on the box immediately to the right of the
Choose button.
• We enter the index of the attribute to be discretized. In this case, if the attribute is
at serial no. 1, we have to enter ‘1’ corresponding to this attribute.
• Enter ‘3’ as the number of bins. Leave the remaining field values as they are.
• Click OK button.
• Click Apply in the filter panel. This will result in a new working relation with the
selected attribute partitioned into 3 bins.
• Save the new working relation in a new file.
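
What dividing an attribute into three bins means can be sketched in Python as follows. The sketch assumes equal-width binning, where the range from the minimum to the maximum is cut into intervals of the same width. The function name equal_width_bins and the sample values are invented, and the way values exactly on a boundary are assigned may differ from what weka does.

```python
def equal_width_bins(values, bins=3):
    """Map each numeric value to a bin index in 0..bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # all values identical: everything falls in bin 0
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum would land on index `bins`, so clamp it into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


# Sample values, invented for illustration: range 18..48, so each bin is 10 wide.
ages = [18, 20, 25, 31, 40, 48]
labels = equal_width_bins(ages, bins=3)
print(labels)   # [0, 0, 0, 1, 2, 2]
```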

Tasks:

1. Download Weka tool and understand the working.

2. Explore [Link] and [Link] dataset.

3. Work with another dataset called [Link].

4. Apply pre-processing for all the datasets as explained in PART 1.

Removing an attribute called “Pension”

Filter and Normalize Data (Optional): Use filters in the "Filters" panel to apply data preprocessing
techniques such as smoothing, normalization, or discretization.

Handling Missing Values (Optional): If your dataset has missing values, you can handle them by
selecting the "Edit" button in the Preprocess panel and choosing how to deal with missing values
(e.g., replace them with mean, median, or a specific value).

Attribute Selection (Optional): Use the "Select attributes" panel to perform attribute selection if
needed.

5. Try to apply some data mining techniques on above dataset and visualize the
output.

6. Record all the screen shots.
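
For the optional missing-value handling mentioned under task 4, replacing missing entries with the attribute mean can be sketched in Python as follows. This is an illustration only: the function name impute_mean and the sample column are invented, and missing values are represented as None.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)            # nothing to estimate the mean from
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Sample column, invented for illustration.
wages = [4.0, None, 5.0, 6.0]
filled = impute_mean(wages)
print(filled)   # [4.0, 5.0, 5.0, 6.0]
```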

PART B
(PART B: TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per following segments within two hours of the
practical slot. The soft copy must be uploaded on the LMS (Teams/Portal) or emailed to the
concerned lab in charge faculty at the end of the practical in case there is no LMS access
available)

Roll No. 70022200455 Name: Avni Bhardwaj


Class : Btech CE Batch : 1
Date of Experiment: 23/9/23 Date of Submission: 23/9/23

B.1 Preprocessing of [Link], [Link] and [Link]:


(Paste your screen shots of all tasks completed during the 2 hours of practical here)

B.2 Observations and learning:


(Students are expected to comment on the output obtained with clear observations and
learning for each task/ sub part assigned).

B.3 Questions of Curiosity:


1. What is preprocessing? Why is it required?
Preprocessing in the context of data analysis and machine learning refers to a series of steps
and techniques applied to raw data before it is used for modeling or analysis. The primary
purpose of preprocessing is to prepare the data in a clean, structured, and suitable format for
the specific task at hand. It ensures that the data is clean, relevant, and properly formatted,
which in turn helps improve the accuracy and performance of models and analysis results.
The specific preprocessing steps used may vary depending on the nature of the data and the
goals of the analysis or modeling task.
Preprocessing is required for several reasons:
• Data Quality Improvement: Remove errors, inconsistencies, and handle missing values.

• Normalization and Scaling: Ensure consistent feature scales.

• Handling Missing Data: Address missing values to prevent model issues.

• Feature Engineering: Create or transform features for better model performance.

• Dimensionality Reduction: Reduce features to avoid overfitting and save computation.

• Categorical Data Encoding: Convert text or labels into numerical data.

• Outlier Detection: Identify and manage outliers that can skew results.

• Data Splitting: Separate data into training and testing sets.

• Normalization of Distributions: Ensure data follows the expected distribution.

• Text Data Preprocessing: Tokenize, stem, and clean text data.

• Balancing Imbalanced Data: Address class imbalance in classification tasks.

• Data Scaling for Algorithms: Standardize data for specific algorithms.
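
As one concrete instance from the list above, separating data into training and testing sets can be sketched in Python as follows. The function name train_test_split, the 25% test fraction, and the fixed seed are assumptions made for this illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)       # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Eight placeholder instances, invented for illustration.
rows = list(range(8))
train, test = train_test_split(rows, test_fraction=0.25, seed=0)
print(len(train), len(test))   # 6 2
```

Shuffling before splitting avoids a test set made up only of the last rows of the file, which matters when the file is sorted by class or date.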

2. List and explain the different types of data (numerical, categorical (ordinal and
nominal)).
—>In the realm of data analysis and machine learning, understanding different data types is
fundamental. Data can be categorized into several types, with numerical and categorical data
being two primary classifications. This assignment delves into these data types, providing
explanations and examples for each.

1. Numerical Data:
- Numerical data comprises numbers and can be further subdivided into two subtypes:
a. Continuous Numerical Data: This type involves values that can take any real number
within a certain range. For instance, attributes like age, height, temperature, and income fall
under this category. Continuous data is measurable and can assume an infinite number of
values within its defined range.
b. Discrete Numerical Data: In contrast, discrete numerical data entails values that are
counted and are typically whole numbers. Examples encompass the count of products sold,
the number of children in a family, and the quantity of cars in a parking lot. Discrete data
assumes distinct, separate values.

2. Categorical Data:
- Categorical data represents distinct groups or categories and lacks a natural numerical
order. Categorical data can be further divided into two subtypes:

a. Nominal Data: Nominal data denotes categories with no inherent order or ranking.
Classic instances include colors (e.g., red, blue, green), types of animals (e.g., cat, dog, bird),
and country names. Nominal data can only be categorized and compared based on equality or
inequality; it does not permit arithmetic operations.
b. Ordinal Data: Ordinal data signifies categories with a specific order or ranking, though
the intervals between categories may not be uniformly meaningful. Examples encompass
education levels (e.g., high school, bachelor's, master's), customer satisfaction ratings (e.g.,
poor, fair, good, excellent), and star ratings for products (e.g., 1 star, 2 stars, 3 stars). While
ordinal data possesses an order, the differences between categories may not be consistently
interpretable.
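
The difference between nominal and ordinal data shows up when the values are turned into numbers. The following Python sketch is an illustration under assumptions: nominal values get one 0/1 indicator per category, so no order is implied, while ordinal values are mapped to ranks using an order that must be supplied explicitly. The function names one_hot and ordinal_encode are invented, and the sample values reuse the colors and satisfaction ratings mentioned above.

```python
def one_hot(values):
    """Encode nominal values as 0/1 indicator vectors, one position per category."""
    categories = sorted(set(values))            # sorted only to fix the column order
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


def ordinal_encode(values, order):
    """Encode ordinal values as ranks, using an explicitly supplied category order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


colors = ["red", "blue", "green", "blue"]
categories, color_vectors = one_hot(colors)

ratings = ["good", "poor", "excellent"]
rating_ranks = ordinal_encode(ratings, order=["poor", "fair", "good", "excellent"])

print(categories)      # ['blue', 'green', 'red']
print(color_vectors)   # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
print(rating_ranks)    # [2, 0, 3]
```

The ranks preserve the order of the ratings, but the gap between rank 2 and rank 3 carries no guaranteed meaning, which matches the caveat about intervals between ordinal categories.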

3. Which preprocessing techniques are currently used in data analysis? Explain.
(Please refer to the relevant websites/latest research papers to answer this question.)
—>
1. Data Cleaning: This involves handling missing values, dealing with duplicate records, and
addressing outliers. Techniques include imputation, outlier detection, and removal.

2. Data Transformation: Data transformation methods include log transformation, scaling
(e.g., Min-Max scaling), and standardization (mean-removal and variance-scaling). These
techniques help make data suitable for certain algorithms and improve model performance.

3. Feature Engineering: Feature engineering involves creating new features from existing
ones or selecting the most relevant features. Techniques include one-hot encoding for
categorical variables, creating interaction terms, and using dimensionality reduction
techniques like Principal Component Analysis (PCA).

4. Handling Categorical Data: Categorical data often requires special preprocessing,
including one-hot encoding for nominal data and ordinal encoding for ordinal data. This
transforms categorical variables into a numerical format for analysis.

5. Text Preprocessing: In natural language processing (NLP) tasks, text data preprocessing
techniques include tokenization, stemming, lemmatization, and stop word removal to prepare
text for analysis.

6. Data Sampling: Data imbalance is common in classification tasks. Techniques such as
oversampling (creating more instances of the minority class) and undersampling (reducing
instances of the majority class) can help balance the dataset.

7. Data Splitting: Datasets are typically split into training, validation, and test sets to evaluate
model performance. This ensures that models are tested on unseen data.

8. Normalization of Distributions: Some algorithms assume specific data distributions (e.g.,
Gaussian). Techniques like the Box-Cox transformation can help normalize data distributions.

9. Handling Time-Series Data: For time-series data, preprocessing may involve resampling,
handling missing time points, and feature extraction from temporal data.

10. Data Integration: When working with multiple data sources, integrating and merging data
is crucial. Techniques include data alignment, join operations, and data aggregation.

11. Noise Reduction: In signal processing and image analysis, noise reduction techniques,
such as filtering, are used to remove unwanted noise from the data.

12. Handling Spatial Data: In GIS and spatial analysis, preprocessing techniques involve
georeferencing, coordinate transformation, and spatial data filtering.
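
Of the techniques listed above, the scaling methods in point 2 are easy to state as formulas: Min-Max scaling maps each value x to (x - min) / (max - min), and standardization maps x to (x - mean) / standard deviation. The Python sketch below illustrates both. The function names and the sample values are invented, and the population standard deviation is assumed for standardization.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Remove the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:                     # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Sample values, invented for illustration.
incomes = [10.0, 20.0, 30.0]
scaled = min_max_scale(incomes)
z = standardize(incomes)
print(scaled)   # [0.0, 0.5, 1.0]
print(z)
```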
