Understanding Data Classification Techniques

Classification is a data categorization technique aimed at identifying the class of new data based on training sets with known labels. It is essential for managing large, diverse datasets in the era of big data, and involves creating models that can predict outcomes based on attributes. The process includes model construction and usage, with accuracy being a key metric for evaluating classification models.

Uploaded by

petersonmwero

CLASSIFICATION:

Classification is a technique where we categorize data into a given number of classes.

The main goal of a classification problem is to identify the category (class) into which a new
data item falls.
Classification analysis is the organization of data into given classes.
Also known as supervised classification, it uses given class labels to order the objects in the
data collection.
Why Classification?
Very large databases are becoming the norm in today's world of big data. Imagine a database
with terabytes of data—a terabyte is one trillion bytes of data. Facebook alone crunches 600
terabytes of new data every single day (as of 2014, the last time it reported these specs). The
primary challenge of big data is how to make sense of it. And sheer volume is not the only
problem: big data also tends to be diverse, unstructured and fast-changing. Consider audio and
video data, social media posts, 3D data, or geospatial data. This kind of data is not easily
categorized or organized.
To meet this challenge, a range of automatic methods for extracting useful information has been
developed, among them classification.
Classification approaches normally use a training set in which every object is already associated
with a known class label.
The classification algorithm learns from the training set and builds a model. The model is then
used to classify new objects.
For example, after starting a credit policy, the managers of a video store could analyze their
customers' behavior with respect to credit and label the customers who received credit with one
of three possible labels: "safe", "risky", or "very risky". The classification analysis would
generate a model that could be used to accept or reject credit requests in the future.
How Classification Works
An analyst's goal is to create a set of classification rules that answer a question, make a decision,
or predict behavior. To start, a set of training data is developed that contains a certain set of
attributes as well as the likely outcome. The job of the classification algorithm is to discover how
that set of attributes reaches its conclusion.
Consider a credit-card company trying to determine which prospects should receive a credit card
offer.
The company's training data might include:

Name       Age   Gender   Annual Income   Credit Card Offer
John Doe   25    M        $39,500         No
Jane Doe   56    F        $125,000        Yes
Training Data
The predictor columns Age, Gender, and Annual Income determine the value of the "prediction
attribute", Credit Card Offer. In a training set, the prediction attribute is known. The
classification algorithm then tries to determine how the value of the prediction attribute was
reached: what relationships exist between the predictors and the decision? It will develop a set
of prediction rules, usually in the form of IF/THEN statements.
This is a simple example, and the algorithm would need a far larger data sample than the two
records shown here. Further, the prediction rules are likely to be far more complex, including
subrules to capture attribute details.
Next, the algorithm is given a "prediction set" of data to analyze, but this set lacks the
prediction attribute (or decision). The expected decision is shown in parentheses:
Name          Age   Gender   Annual Income   Credit Card Offer
Jack Frost    42    M        $88,000         ? (Y)
Mary Murray   16    F        $0              ? (N)
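To make the idea of an IF/THEN prediction rule concrete, here is a minimal Python sketch. The rule itself (an age threshold of 18 and an income threshold of $50,000) is purely hypothetical: it was chosen by hand so that it agrees with the two training records and with the expected answers in the prediction set, and it is not a rule given in these notes. A real algorithm would learn its rules from far more data. The function name offer_card is also an assumption.

```python
def offer_card(age, annual_income):
    """Hypothetical hand-written rule, not one learned from data."""
    # IF age >= 18 AND annual income >= 50,000 THEN offer = "Yes" ELSE "No"
    if age >= 18 and annual_income >= 50_000:
        return "Yes"
    return "No"


# Training records: (name, age, annual income, known decision)
training = [("John Doe", 25, 39_500, "No"), ("Jane Doe", 56, 125_000, "Yes")]

# Prediction set: the decision column is missing
prediction_set = [("Jack Frost", 42, 88_000), ("Mary Murray", 16, 0)]

# Check that the rule reproduces both known training decisions.
training_ok = all(
    offer_card(age, income) == label for _, age, income, label in training
)

# Fill in the missing decision for each record in the prediction set.
predictions = {name: offer_card(age, income) for name, age, income in prediction_set}
print(training_ok, predictions)
```

With this hypothetical rule, Jack Frost receives "Yes" and Mary Murray receives "No", matching the (Y) and (N) shown in the prediction set.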
Example
A classification task begins with a data set in which the class assignments are known.
For example, a classification model that predicts credit risk could be developed based on
observed data for many loan applicants over a period of time. In addition to the historical credit
rating, the data might track employment history, home ownership or rental, years of residence,
number and type of investments, and so on. Credit rating would be the target, the other attributes
would be the predictors, and the data for each customer would constitute a case.
Classifier: An algorithm that maps the input data to a specific category.
Classification model: A classification model tries to draw conclusions from the input values
given for training. It predicts the class labels (categories) for new data.
Feature: A feature is an individual measurable property of a phenomenon being observed.
Binary classification: A classification task with two possible outcomes. E.g., gender
classification (male / female).
Multi-class classification: Classification with more than two classes. In multi-class
classification each sample is assigned to one and only one target label. E.g., an animal can be a
cat or a dog, but not both at the same time.
Multi-label classification: A classification task where each sample is mapped to a set of target
labels (more than one class). E.g., a news article can be about sports, a person, and a location
at the same time.
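The difference between these three task types shows up in the shape of the labels. The following Python sketch is only an illustration: the sample values, the TOPICS vocabulary, and the helper to_indicator are assumptions made for this example.

```python
# Binary and multi-class: exactly one label per sample.
binary_labels = ["male", "female", "female"]   # two possible values
multiclass_labels = ["cat", "dog", "cat"]      # one of several values

# Multi-label: a set of labels per sample.
multilabel_labels = [{"sports", "person", "location"}, {"sports"}]

# Hypothetical label vocabulary used for the encoding below.
TOPICS = ["location", "person", "politics", "sports"]


def to_indicator(label_set, all_labels):
    """Encode a label set as a 0/1 vector, one position per known label."""
    return [1 if label in label_set else 0 for label in all_labels]


encoded = [to_indicator(labels, TOPICS) for labels in multilabel_labels]
print(encoded)
```

A 0/1 indicator vector is one common way to hand multi-label targets to a learning algorithm; a multi-class target, by contrast, never needs more than one value per sample.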
Classification: A Two-Step Process
1. Model construction: describing a set of predetermined classes
-Each tuple/sample is assumed to belong to a predefined class, as determined by the class label
attribute
-The set of tuples used for model construction is the training set
-The model is represented as classification rules, decision trees, or mathematical formulae
2. Model usage: for classifying future or unknown objects
-Estimate the accuracy of the model
o The known label of each test sample is compared with the classified result from the model
o The accuracy rate is the percentage of test set samples that are correctly classified by the
model
o The test set is independent of the training set (otherwise overfitting goes undetected)
-If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called a validation (test) set
Process (1): Model Construction

Training Data -> Classification Algorithm -> Classifier (Model)

Training data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (model):
IF rank = "professor" OR years > 6
THEN tenured = "yes"
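The rule produced in the model construction step can be written as a small function and checked against the training table. This is a minimal Python sketch: the function name predict_tenured is an assumption, and the rank abbreviations from the table are spelled out as "Assistant Prof", "Associate Prof", and "Professor".

```python
# Training table from the model construction figure, rank abbreviations spelled out.
TRAINING = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]


def predict_tenured(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes', else 'no'."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Names of training rows that the rule classifies incorrectly.
misclassified = [
    name
    for name, rank, years, tenured in TRAINING
    if predict_tenured(rank, years) != tenured
]
print(misclassified)
```

The rule fits all six training rows, which is expected because it was derived from them; that says nothing yet about how it performs on data it has not seen.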

Model usage
-for classifying future or unknown objects
o Estimate the accuracy of the model
The known label of each test sample is compared with the classified result from the model
The test accuracy rate is the percentage of test set samples that are correctly classified by the
model
The test set is independent of the training set but drawn from the same probability distribution

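Testing Data -> Classifier -> predicted labels
Unseen Data -> Classifier -> Tenured?

Testing data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Melissa   Assistant Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4)
Tenured? Yes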

Accuracy is one metric for evaluating classification models.

Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has
the following definition:

Accuracy = Number of correct predictions / Total number of predictions
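As a worked instance of this definition, the sketch below applies the tenure rule from the model construction step (IF rank = "professor" OR years > 6 THEN tenured = "yes") to the four rows of the testing table and to the unseen tuple (Jeff, Professor, 4). The function names accuracy and predict_tenured are assumptions made for this example, and the rank abbreviations are spelled out.

```python
def accuracy(predicted, actual):
    """Number of correct predictions divided by total number of predictions."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def predict_tenured(rank, years):
    """The rule learned from the training data."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Testing table: (name, rank, years, known label)
TEST = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Melissa", "Assistant Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

predicted = [predict_tenured(rank, years) for _, rank, years, _ in TEST]
actual = [tenured for *_, tenured in TEST]

test_accuracy = accuracy(predicted, actual)
unseen = predict_tenured("Professor", 4)  # (Jeff, Professor, 4)
print(test_accuracy, unseen)
```

The rule misclassifies only Melissa (7 years, but not tenured), so three of the four test predictions are correct and the test accuracy is 0.75. Jeff is classified as tenured.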
ISSUES REGARDING CLASSIFICATION
• The following preprocessing steps may be applied to the data to help improve the accuracy,
efficiency, and scalability of the classification:
• Data cleaning: preprocessing the data to reduce noise and handle missing values.
• Relevance analysis (feature selection): remove irrelevant or redundant attributes.
• Data transformation and reduction: generalize and/or normalize the data.
• Overfitting: Good performance on the training data, poor generalization to other data.
• Underfitting: Poor performance on the training data and poor generalization to other data.
• Examples of Overfitting
• Let’s say we want to predict if a student will land a job interview based on her resume.
• Now, assume we train a model from a dataset of 10,000 resumes and their outcomes.
• Next, we try the model out on the original dataset, and it predicts outcomes with 99%
accuracy… wow!
• But now comes the bad news.
• When we run the model on a new (“unseen”) dataset of resumes, we only get 50% accuracy…
uh-oh!
Techniques to Reduce Overfitting
1. Increase the training data
2. Reduce the model complexity
Techniques to Reduce Underfitting
1. Increase the model complexity
2. Increase the number of features
3. Remove noise from the data
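The gap between training accuracy and accuracy on unseen data can be reproduced with a deliberately extreme toy model. The Python sketch below is an illustration only, built on assumptions that are not part of these notes: the "data" are the integers 0 to 19, the true label is "yes" for even numbers, and the overfitted model is a lookup table that memorizes the training examples and answers "no" for anything it has not seen.

```python
def true_label(x):
    """Toy ground truth (an assumption): even numbers are 'yes'."""
    return "yes" if x % 2 == 0 else "no"


def make_memorizer(train, default="no"):
    """Overfitted model: memorizes training pairs, guesses `default` otherwise."""
    table = dict(train)
    return lambda x: table.get(x, default)


def accuracy(model, data):
    """Fraction of examples in `data` that `model` labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


train = [(x, true_label(x)) for x in range(10)]
unseen = [(x, true_label(x)) for x in range(10, 20)]

memorizer = make_memorizer(train)
train_acc = accuracy(memorizer, train)
unseen_acc = accuracy(memorizer, unseen)
print(train_acc, unseen_acc)
```

The memorizer scores 1.0 on the data it was built from and 0.5 on the unseen data, the same pattern as the resume example: it has stored the training set rather than learned the underlying rule.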

NAIVE BAYES ALGORITHM

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x),
and P(x|c):
P(c|x) = P(x|c) P(c) / P(x)
P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes).
P(c) is the prior probability of the class.
P(x|c) is the likelihood, which is the probability of the predictor given the class.
P(x) is the prior probability of the predictor.
EXAMPLE-2
CONDITION:
HEIGHT >= 2 m          TALL
1.7 m < HEIGHT < 2 m   MEDIUM
HEIGHT <= 1.7 m        SHORT
Table 1:
No.   NAME        GENDER   HEIGHT   OUTPUT
1     KRISTINA    F        1.6 m    SHORT
2     JIM         M        2 m      TALL
3     MAGGIE      F        1.9 m    MEDIUM
4     MARTHA      F        1.88 m   MEDIUM
5     STEPHANIE   F        1.7 m    SHORT
6     BOB         M        1.85 m   MEDIUM
7     KATHY       F        1.6 m    SHORT
8     DAVE        M        1.7 m    SHORT
9     WORTH       M        2.2 m    TALL
10    STEVEN      M        2.1 m    TALL
11    DEBBIE      F        1.8 m    MEDIUM
12    TODD        M        1.95 m   MEDIUM
13    KIM         F        1.9 m    MEDIUM
14    AMY         F        1.8 m    MEDIUM
15    WYNETTE     F        1.75 m   MEDIUM
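The three height conditions that produce the OUTPUT column can be sketched as a small Python function. The function name height_class is an assumption; the thresholds are the ones stated under CONDITION.

```python
def height_class(height_m):
    """Map a height in metres to SHORT, MEDIUM, or TALL per the stated conditions."""
    if height_m >= 2.0:      # 2 m <= height
        return "TALL"
    if height_m > 1.7:       # 1.7 m < height < 2 m
        return "MEDIUM"
    return "SHORT"           # height <= 1.7 m


# A few rows of Table 1: (name, height in metres, expected OUTPUT)
samples = [
    ("KRISTINA", 1.6, "SHORT"),
    ("JIM", 2.0, "TALL"),
    ("STEPHANIE", 1.7, "SHORT"),
    ("TODD", 1.95, "MEDIUM"),
    ("WYNETTE", 1.75, "MEDIUM"),
]
all_match = all(height_class(h) == label for _, h, label in samples)
print(all_match)
```

Note the boundary cases: exactly 1.7 m is SHORT and exactly 2 m is TALL.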
Step 1: Convert the data set into a frequency table.
Step 2: Create a likelihood table by computing the probability of each attribute value within
each class.
Step 3: Use the Naive Bayes equation to calculate the posterior probability for each class. The
class with the highest posterior probability is the outcome of the prediction.
Using the OUTPUT column of Table 1, the class counts are:
Short  - 4   Male: 1 - Female: 3
Medium - 8   Male: 2 - Female: 6
Tall   - 3   Male: 3 - Female: 0
With these training data, we estimate the prior probabilities:
P(SHORT) = 4/15 = 0.267
P(MEDIUM) = 8/15 = 0.533
P(TALL) = 3/15 = 0.2
We divide the height attribute into six ranges:
[0-1.6], [1.6-1.7], [1.7-1.8], [1.8-1.9], [1.9-2.0], [2.0-above]
A height that lies exactly on a boundary is counted in the lower range (for example, 1.7 m falls
in [1.6-1.7]).
Table 2 shows the counts and subsequent probabilities associated with the attributes.
Table 2:
ATTRIBUTE   VALUE         COUNT                    PROBABILITY
                          Short   Medium   Tall    Short   Medium   Tall
Gender      M             1       2        3       1/4     2/8      3/3
            F             3       6        0       3/4     6/8      0/3
Height      [0-1.6]       2       0        0       2/4     0        0
            [1.6-1.7]     2       0        0       2/4     0        0
            [1.7-1.8]     0       3        0       0       3/8      0
            [1.8-1.9]     0       4        0       0       4/8      0
            [1.9-2.0]     0       1        1       0       1/8      1/3
            [2.0-above]   0       0        2       0       0        2/3
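The counts and probabilities in Table 2 can be recomputed from the 15 rows of Table 1. The Python sketch below is one way to do it, under a few assumptions: heights are entered in centimetres to avoid floating-point boundary issues, a height on a boundary goes to the lower range, Kim is entered as F (which is what the gender counts in Table 2 imply), and the names ROWS, BIN_UPPER, height_bin, and likelihood are made up for this example.

```python
from collections import Counter
from fractions import Fraction

# (gender, height in cm, class) for rows 1-15 of Table 1, in order.
ROWS = [
    ("F", 160, "short"), ("M", 200, "tall"), ("F", 190, "medium"),
    ("F", 188, "medium"), ("F", 170, "short"), ("M", 185, "medium"),
    ("F", 160, "short"), ("M", 170, "short"), ("M", 220, "tall"),
    ("M", 210, "tall"), ("F", 180, "medium"), ("M", 195, "medium"),
    ("F", 190, "medium"), ("F", 180, "medium"), ("F", 175, "medium"),
]

# Upper bounds (cm) of the six height ranges; a boundary value goes to the lower range.
BIN_UPPER = [160, 170, 180, 190, 200, float("inf")]


def height_bin(cm):
    """Index (0-5) of the height range that contains `cm`."""
    for index, upper in enumerate(BIN_UPPER):
        if cm <= upper:
            return index


# Step 1: frequency tables.
class_counts = Counter(label for _, _, label in ROWS)
gender_counts = Counter((gender, label) for gender, _, label in ROWS)
bin_counts = Counter((height_bin(cm), label) for _, cm, label in ROWS)


# Step 2: likelihood = count of (value, class) divided by count of class.
def likelihood(counts, value, label):
    return Fraction(counts[(value, label)], class_counts[label])


print(class_counts)
print(likelihood(gender_counts, "M", "medium"), likelihood(bin_counts, 4, "tall"))
```

Fraction keeps the probabilities exact, so 2/8 is reported in lowest terms as 1/4; the values are otherwise the same as in Table 2.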

We use these values to classify a new tuple.

For example, we wish to classify t = {Adam, M, 1.95 m}. Using these values and the associated
probabilities of gender and height, we obtain the following estimates:
P(t|short) = 1/4 × 0 = 0
P(t|medium) = 2/8 × 1/8 = 0.031
P(t|tall) = 3/3 × 1/3 = 0.333
Combining these with the prior probabilities, we get:

Likelihood of being short = 0 × 0.267 = 0
Likelihood of being medium = 0.031 × 0.533 = 0.0166
Likelihood of being tall = 0.333 × 0.2 = 0.066
We estimate P(t) by summing these individual likelihood values, since t will be either short,
medium, or tall:
P(t) = 0 + 0.0166 + 0.066 = 0.0826
Finally, we obtain the actual probabilities of each event:
P(short|t) = (0 × 0.267) / 0.0826 = 0
P(medium|t) = (0.031 × 0.533) / 0.0826 = 0.2
P(tall|t) = (0.333 × 0.2) / 0.0826 = 0.799
Therefore, based on these probabilities, we classify the new tuple as Tall because it has the
highest probability.
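The same calculation can be repeated with exact fractions as a check on the rounded arithmetic. This is a minimal Python sketch using only the values from Table 2 that apply to t (gender M, height in the range [1.9-2.0]); the variable names are assumptions made for this example.

```python
from fractions import Fraction as F

# Prior probabilities of each class.
priors = {"short": F(4, 15), "medium": F(8, 15), "tall": F(3, 15)}

# Conditional probabilities from Table 2 for gender = M and height in [1.9-2.0].
p_male = {"short": F(1, 4), "medium": F(2, 8), "tall": F(3, 3)}
p_height = {"short": F(0), "medium": F(1, 8), "tall": F(1, 3)}

# Naive Bayes: multiply the per-attribute probabilities, then the prior.
joint = {c: p_male[c] * p_height[c] * priors[c] for c in priors}

# P(t) is the sum over all classes, and the posterior is joint / P(t).
evidence = sum(joint.values())
posterior = {c: joint[c] / evidence for c in priors}

prediction = max(posterior, key=posterior.get)
print(posterior, prediction)
```

With exact fractions, P(t) = 1/12, P(medium|t) = 1/5 and P(tall|t) = 4/5, so the values 0.0826 and 0.799 obtained by hand differ from the exact results only through intermediate rounding. The predicted class is tall either way.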
