
Understanding Data Classification Techniques

Classification is a data categorization technique aimed at identifying the class of new data based on training sets with known labels. It is essential for managing large, diverse datasets in the era of big data, and involves creating models that can predict outcomes based on attributes. The process includes model construction and usage, with accuracy being a key metric for evaluating classification models.

Uploaded by

petersonmwero
Copyright
© © All Rights Reserved

CLASSIFICATION:

Classification is a technique where we categorize data into a given number of classes.


The main goal of a classification problem is to identify the category/class into which a new data
item will fall.
Classification analysis is the organization of data into given classes.
Also known as supervised classification, it uses given class labels to order the
objects in the data collection.
Why Classification?
Very large databases are becoming the norm in today's world of big data. Imagine a database
with terabytes of data—a terabyte is one trillion bytes of data. Facebook alone crunches 600
terabytes of new data every single day (as of 2014, the last time it reported these specs). The
primary challenge of big data is how to make sense of it. And sheer volume is not the only
problem: big data also tends to be diverse, unstructured and fast-changing. Consider audio and
video data, social media posts, 3D data, or geospatial data. This kind of data is not easily
categorized or organized.
To meet this challenge, a range of automatic methods for extracting useful information has been
developed, among them classification.
Classification approaches normally use a training set where all objects are already associated
with known class labels.
The classification algorithm learns from the training set and builds a model. The model is used
to classify new objects.
For example, after starting a credit policy, the managers of a video store could analyze their
customers' behavior vis-à-vis their credit and label the customers who received credit with one of
three possible labels: “safe”, “risky”, or “very risky”. The classification analysis would generate a
model that could be used to either accept or reject credit requests in the future.
How Classification Works
An analyst's goal is to create a set of classification rules that answer a question, make a decision,
or predict behavior. To start, a set of training data is developed that contains a certain set of
attributes as well as the likely outcome. The job of the classification algorithm is to discover how
that set of attributes reaches its conclusion.
Consider a credit-card company trying to determine which prospects should receive a credit card
offer.
The company's training data might include:

Name       Age   Gender   Annual Income   Credit Card Offer
John Doe   25    M        $39,500         No
Jane Doe   56    F        $125,000        Yes
Training Data
The predictor columns Age, Gender, and Annual Income determine the value of the "prediction
attribute", Credit Card Offer. In a training set, the prediction attribute is known. The classification
algorithm then tries to determine how the value of the prediction attribute was reached: what
relationships exist between the predictors and the decision? It will develop a set of prediction
rules, usually in the form of IF/THEN statements.
This is a simple example, and the algorithm would need a far larger data sample than the two
records shown here. Further, the prediction rules are likely to be far more complex, including
subrules to capture attribute details.
Next, the algorithm is given a "prediction set" of data to analyze, but this set lacks the
prediction attribute (or decision):
Name          Age   Gender   Annual Income   Credit Card Offer
Jack Frost    42    M        $88,000         ? (Y)
Mary Murray   16    F        $0              ? (N)
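An IF/THEN prediction rule of this kind can be sketched in a few lines of Python. The rule below is hypothetical: the age cutoff of 18 and the income cutoff of $40,000 are assumptions chosen only so that the rule agrees with the two training records, and the function name offer_credit_card is illustrative. A real algorithm would learn its rules from far more data.

```python
# Hypothetical IF/THEN prediction rule. The thresholds are assumptions
# made for illustration; they are not values stated in the text.
MIN_AGE = 18          # assumed age cutoff
MIN_INCOME = 40_000   # assumed annual income cutoff, in dollars


def offer_credit_card(age: int, annual_income: float) -> str:
    """IF age >= MIN_AGE AND income >= MIN_INCOME THEN 'Yes' ELSE 'No'."""
    if age >= MIN_AGE and annual_income >= MIN_INCOME:
        return "Yes"
    return "No"


# Training data: (name, age, gender, annual income, known decision)
training_set = [
    ("John Doe", 25, "M", 39_500, "No"),
    ("Jane Doe", 56, "F", 125_000, "Yes"),
]

# Prediction set: same attributes, but the decision is unknown
prediction_set = [
    ("Jack Frost", 42, "M", 88_000),
    ("Mary Murray", 16, "F", 0),
]

# The assumed rule reproduces the known decisions in the training set
for _name, age, _gender, income, decision in training_set:
    assert offer_credit_card(age, income) == decision

# Apply the rule to the records whose decision is unknown
predictions = {
    name: offer_credit_card(age, income)
    for name, age, _gender, income in prediction_set
}
print(predictions)
```

This particular rule ignores Gender altogether; a learned rule set may or may not use every predictor.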
Example
A classification task begins with a data set in which the class assignments are known.
For example, a classification model that predicts credit risk could be developed based on
observed data for many loan applicants over a period of time. In addition to the historical credit
rating, the data might track employment history, home ownership or rental, years of residence,
number and type of investments, and so on. Credit rating would be the target, the other attributes
would be the predictors, and the data for each customer would constitute a case.
Classifier: An algorithm that maps the input data to a specific category.
Classification model: A classification model tries to draw some conclusion from the input values
given for training. It will predict the class labels/categories for the new data.
Feature: A feature is an individual measurable property of a phenomenon being observed.
Binary classification: A classification task with two possible outcomes. E.g., gender classification
(Male / Female).
Multi-class classification: Classification with more than two classes. In multi-class classification,
each sample is assigned to one and only one target label. E.g., an animal can be a cat or a dog but not
both at the same time.
Multi-label classification: A classification task where each sample is mapped to a set of target
labels (more than one class). E.g., a news article can be about sports, a person, and a location at the
same time.
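The difference between these three task types shows up in how the target labels are stored. The sketch below is illustrative only: the sample values are made up, "bird" is an added class so that the multi-class example has more than two classes, and to_indicator is a hypothetical helper name.

```python
# Binary: every sample carries one of exactly two labels.
binary_labels = ["Male", "Female", "Female"]

# Multi-class: every sample carries exactly one label out of three or more.
# "bird" is an added illustrative class.
animal_classes = ["cat", "dog", "bird"]
multi_class_labels = ["cat", "dog", "cat"]

# Multi-label: every sample carries a set of labels (possibly several).
topics = ["sports", "person", "location"]
multi_label_targets = [
    {"sports", "person", "location"},  # an article about all three topics
    {"person"},
    set(),                             # an article about none of them
]


def to_indicator(label_set, classes):
    """Encode a set of labels as a 0/1 vector with one position per class."""
    return [1 if c in label_set else 0 for c in classes]


indicator_rows = [to_indicator(s, topics) for s in multi_label_targets]
print(indicator_rows)
```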
Classification—A Two-Step Process
1. Model construction: describing a set of predetermined classes
   - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
   - The set of tuples used for model construction is the training set
   - The model is represented as classification rules, decision trees, or mathematical formulae
2. Model usage: classifying future or unknown objects
   - Estimate the accuracy of the model
     o The known label of each test sample is compared with the classified result from the model
     o Accuracy rate is the percentage of test set samples that are correctly classified by the model
     o The test set must be independent of the training set; otherwise overfitting goes undetected and the accuracy estimate is too optimistic
   - If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select among models, it is called a validation (test) set.
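One common way to keep the test set independent of the training set is to hold out part of the labeled data before any model is built. The sketch below assumes a simple random holdout; the function name holdout_split, the 25% test fraction, and the made-up samples are illustrative choices, not something the notes prescribe.

```python
import random


def holdout_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and split it into (training_set, test_set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up labeled samples: (sample id, class label)
samples = [(i, "yes" if i % 2 == 0 else "no") for i in range(20)]

training_set, test_set = holdout_split(samples)
print(len(training_set), "training samples,", len(test_set), "test samples")
```

The model is then constructed from training_set only, and test_set is used solely to estimate accuracy.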
Process (1): Model Construction

Training Data -> Classification Algorithm -> Classifier (Model)

Training Data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model):
IF Rank = "Professor" OR Years > 6
THEN Tenured = "yes"

Model usage
- For classifying future or unknown objects
  o Estimate the accuracy of the model
    • The known label of each test sample is compared with the classified result from the model
    • Test accuracy rate is the percentage of test set samples that are correctly classified by the model
    • The test set is independent of the training set but drawn from the same probability distribution

Testing Data -> Classifier

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Melissa   Assistant Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data -> Classifier

(Jeff, Professor, 4) -> Tenured? Yes
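The model-usage step can be traced in code by applying the tenure rule from the model construction step to the four test rows and then to the unseen tuple for Jeff. This is a minimal sketch: rank abbreviations are written out in full, labels are normalized to lowercase, and the function name predict_tenured is illustrative.

```python
def predict_tenured(rank: str, years: int) -> str:
    """IF rank = 'Professor' OR years > 6 THEN 'yes' ELSE 'no'."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Test rows: (name, rank, years, known label)
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Melissa", "Assistant Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Compare each known label with the label the rule predicts
results = [
    (name, actual, predict_tenured(rank, years))
    for name, rank, years, actual in test_set
]
misclassified = [name for name, actual, predicted in results if actual != predicted]
test_accuracy = (len(results) - len(misclassified)) / len(results)

print("misclassified:", misclassified)
print("test accuracy:", test_accuracy)

# Unseen data: no known label, so the rule's output is the prediction
jeff_prediction = predict_tenured("Professor", 4)
print("Jeff tenured?", jeff_prediction)
```

Melissa is the one test sample the rule gets wrong: with 7 years the rule predicts "yes", while her known label is "no".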

Accuracy is one metric for evaluating classification models.


Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has
the following definition:

Accuracy = Number of correct predictions / Total number of predictions
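The accuracy definition translates directly into a small helper. The sketch below is illustrative: the function name accuracy and the two label lists are made up, reusing the “safe” / “risky” / “very risky” labels from the video store example.

```python
def accuracy(true_labels, predicted_labels):
    """Number of correct predictions divided by total number of predictions."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up known labels and model predictions for five customers
true_labels = ["safe", "risky", "very risky", "safe", "risky"]
predicted_labels = ["safe", "risky", "risky", "safe", "safe"]

score = accuracy(true_labels, predicted_labels)
print(score)  # 3 correct out of 5
```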
ISSUES REGARDING CLASSIFICATION
• The following preprocessing steps may be applied to the data to help improve the
accuracy, efficiency, and scalability of the classification:
  - Data cleaning: preprocessing the data to reduce noise and handle missing values
  - Relevance analysis (feature selection): removing irrelevant or redundant attributes
  - Data transformation and reduction: generalizing and/or normalizing the data
• Overfitting: Good performance on the training data, poor generalization to other data.
• Underfitting: Poor performance on the training data and poor generalization to other data.
• Examples of Overfitting
• Let’s say we want to predict if a student will land a job interview based on her resume.
• Now, assume we train a model from a dataset of 10,000 resumes and their outcomes.
• Next, we try the model out on the original dataset, and it predicts outcomes with 99%
accuracy… wow!
• But now comes the bad news.
• When we run the model on a new (“unseen”) dataset of resumes, we only get 50% accuracy…
uh-oh!
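The gap between training accuracy and accuracy on unseen data can be reproduced with a deliberately over-complex "model" that simply memorizes its training examples. Everything in this sketch is made up for illustration: the Memorizer class, the (GPA, years of experience) features, and the tiny datasets stand in for the 10,000 resumes in the example.

```python
from collections import Counter


class Memorizer:
    """Stores every training example verbatim instead of learning a pattern."""

    def fit(self, features, labels):
        self.table = dict(zip(features, labels))
        # Fallback for inputs never seen in training: the most common label
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, features):
        return [self.table.get(f, self.default) for f in features]


def accuracy(true_labels, predicted_labels):
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up resumes: (GPA, years of experience) -> outcome
train_x = [(3.9, 0), (2.1, 5), (3.5, 2), (2.8, 1), (3.2, 4), (2.5, 0)]
train_y = ["interview", "interview", "interview", "no", "interview", "no"]
unseen_x = [(3.8, 1), (2.2, 0), (3.0, 3), (2.6, 1)]
unseen_y = ["interview", "no", "interview", "no"]

model = Memorizer().fit(train_x, train_y)
train_accuracy = accuracy(train_y, model.predict(train_x))
unseen_accuracy = accuracy(unseen_y, model.predict(unseen_x))
print(train_accuracy, unseen_accuracy)
```

The memorizer is perfect on the data it has already seen and no better than a constant guess on new resumes, which is the overfitting pattern described above.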
Techniques to Reduce Overfitting
1. Increase the training data
2. Reduce the model complexity
Techniques to Reduce Underfitting
1. Increase model complexity
2. Increase the number of features
3. Remove the noise from the data

NAIVE BAYES ALGORITHM

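Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x),
and P(x|c):
P(c|x) = P(x|c) × P(c) / P(x)
P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
P(c) is the prior probability of the class.
P(x|c) is the likelihood, which is the probability of the predictor given the class.
P(x) is the prior probability of the predictor.
Naive Bayes additionally assumes that the attributes are conditionally independent given the
class, so for several attributes P(x|c) is the product P(x1|c) × P(x2|c) × ... × P(xn|c).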
EXAMPLE-2
CONDITION:
HEIGHT >= 2 m           TALL
1.7 m < HEIGHT < 2 m    MEDIUM
HEIGHT <= 1.7 m         SHORT
Table 1:
NO.   NAME        GENDER   HEIGHT (m)   OUTPUT
1     KRISTINA    F        1.6          SHORT
2     JIM         M        2.0          TALL
3     MAGGIE      F        1.9          MEDIUM
4     MARTHA      F        1.88         MEDIUM
5     STEPHANIE   F        1.7          SHORT
6     BOB         M        1.85         MEDIUM
7     KATHY       F        1.6          SHORT
8     DAVE        M        1.7          SHORT
9     WORTH       M        2.2          TALL
10    STEVEN      M        2.1          TALL
11    DEBBIE      F        1.8          MEDIUM
12    TODD        M        1.95         MEDIUM
13    KIM         F        1.9          MEDIUM
14    AMY         F        1.8          MEDIUM
15    WYNETTE     F        1.75         MEDIUM
Step 1: Convert the data set into a frequency table.
Step 2: Create a likelihood table by computing the probability of each attribute value within each class.
Step 3: Use the Naive Bayes equation to calculate the posterior probability
for each class. The class with the highest posterior probability is the
outcome of the prediction.
Using the OUTPUT column of Table 1, the class counts are:
Short:  4 (Male: 1, Female: 3)
Medium: 8 (Male: 2, Female: 6)
Tall:   3 (Male: 3, Female: 0)
These give the prior probabilities:
P(SHORT) = 4/15 = 0.267
P(MEDIUM) = 8/15 = 0.533
P(TALL) = 3/15 = 0.2
We divide the height attribute into six ranges, each including its upper bound:
(0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, above)
Table 2 shows the counts and the resulting probabilities associated with the attributes.
The prior probabilities estimated from the training data are P(SHORT) = 0.267,
P(MEDIUM) = 0.533, and P(TALL) = 0.2.

Table 2:
ATTRIBUTE   VALUE          COUNT                     PROBABILITY
                           Short   Medium   Tall     Short   Medium   Tall
Gender      M              1       2        3        1/4     2/8      3/3
            F              3       6        0        3/4     6/8      0/3
Height      (0, 1.6]       2       0        0        2/4     0        0
            (1.6, 1.7]     2       0        0        2/4     0        0
            (1.7, 1.8]     0       3        0        0       3/8      0
            (1.8, 1.9]     0       4        0        0       4/8      0
            (1.9, 2.0]     0       1        1        0       1/8      1/3
            (2.0, above)   0       0        2        0       0        2/3
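Steps 1 and 2 (the frequency table and the likelihood table) can be reproduced from the rows of Table 1 with a short script. This is a sketch under two stated assumptions: Kim is entered as F, which is what the counts "Medium: Male 2, Female 6" require, and each height range is taken to include its upper bound, which is what the height counts require. The names height_range and likelihood are illustrative.

```python
from bisect import bisect_left
from collections import Counter
from fractions import Fraction

# (name, gender, height in metres, class) rows from Table 1.
# Assumption: Kim is F, consistent with the Medium counts (Male 2, Female 6).
rows = [
    ("Kristina", "F", 1.6, "Short"), ("Jim", "M", 2.0, "Tall"),
    ("Maggie", "F", 1.9, "Medium"), ("Martha", "F", 1.88, "Medium"),
    ("Stephanie", "F", 1.7, "Short"), ("Bob", "M", 1.85, "Medium"),
    ("Kathy", "F", 1.6, "Short"), ("Dave", "M", 1.7, "Short"),
    ("Worth", "M", 2.2, "Tall"), ("Steven", "M", 2.1, "Tall"),
    ("Debbie", "F", 1.8, "Medium"), ("Todd", "M", 1.95, "Medium"),
    ("Kim", "F", 1.9, "Medium"), ("Amy", "F", 1.8, "Medium"),
    ("Wynette", "F", 1.75, "Medium"),
]

UPPER_EDGES = [1.6, 1.7, 1.8, 1.9, 2.0]
RANGE_NAMES = ["(0, 1.6]", "(1.6, 1.7]", "(1.7, 1.8]",
               "(1.8, 1.9]", "(1.9, 2.0]", "(2.0, above)"]


def height_range(height):
    """Return the range containing height; each range includes its upper bound."""
    return RANGE_NAMES[bisect_left(UPPER_EDGES, height)]


# Step 1: frequency tables
class_counts = Counter(cls for *_, cls in rows)
gender_counts = Counter((gender, cls) for _, gender, _, cls in rows)
height_counts = Counter((height_range(h), cls) for _, _, h, cls in rows)

# Prior probability of each class
priors = {cls: Fraction(n, len(rows)) for cls, n in class_counts.items()}


# Step 2: likelihood of an attribute value within a class
def likelihood(counts, value, cls):
    return Fraction(counts[(value, cls)], class_counts[cls])


print(dict(class_counts))
print(likelihood(gender_counts, "M", "Medium"))             # 1/4, i.e. 2/8
print(likelihood(height_counts, "(1.9, 2.0]", "Tall"))      # 1/3
```

Fraction keeps the probabilities exact, so 2/8 is printed in lowest terms as 1/4.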

We use these values to classify a new tuple.


For example, we wish to classify t = {Adam, M, 1.95 m}. Because Naive Bayes treats gender and
height as independent given the class, P(t|class) is the product of the gender and height
probabilities from Table 2 (1.95 m falls in the range 1.9 to 2.0):
P(t|short) = 1/4 × 0 = 0
P(t|medium) = 2/8 × 1/8 = 1/32 ≈ 0.031
P(t|tall) = 3/3 × 1/3 = 1/3 ≈ 0.333
Combining these with the prior probabilities, we get:
Likelihood of being short = 0 × 4/15 = 0
Likelihood of being medium = 1/32 × 8/15 = 1/60 ≈ 0.0167
Likelihood of being tall = 1/3 × 3/15 = 1/15 ≈ 0.0667
We estimate P(t) by summing these individual likelihood values, since t will be either short,
medium, or tall:
P(t) = 0 + 1/60 + 1/15 = 1/12 ≈ 0.0833
Finally, we obtain the actual probability of each event:
P(short|t) = 0 / 0.0833 = 0
P(medium|t) = 0.0167 / 0.0833 ≈ 0.2
P(tall|t) = 0.0667 / 0.0833 ≈ 0.8
Therefore, based on these probabilities, we classify the new tuple as Tall because it has the highest
probability.
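The whole calculation for t can be checked with exact fractions. The sketch below copies the relevant entries of Table 2 by hand (the probability of M and of the height range containing 1.95 m for each class); the variable names are illustrative. Exact arithmetic gives P(medium|t) = 1/5 and P(tall|t) = 4/5; carrying rounded intermediate values through by hand can shift the last digit slightly.

```python
from fractions import Fraction

# Prior probabilities from the class counts (4, 8 and 3 out of 15)
priors = {
    "Short": Fraction(4, 15),
    "Medium": Fraction(8, 15),
    "Tall": Fraction(3, 15),
}

# Table 2 entries for t = (Adam, M, 1.95 m)
p_male = {"Short": Fraction(1, 4), "Medium": Fraction(2, 8), "Tall": Fraction(3, 3)}
p_height = {"Short": Fraction(0), "Medium": Fraction(1, 8), "Tall": Fraction(1, 3)}

# Naive independence assumption: P(t|c) = P(M|c) * P(height range|c)
likelihoods = {c: p_male[c] * p_height[c] for c in priors}

# Numerator of Bayes' theorem for each class: P(t|c) * P(c)
joint = {c: likelihoods[c] * priors[c] for c in priors}

# P(t) is the sum over classes, since t must be short, medium, or tall
evidence = sum(joint.values())

# Posterior probabilities P(c|t)
posteriors = {c: joint[c] / evidence for c in priors}
predicted_class = max(posteriors, key=posteriors.get)

for c, p in posteriors.items():
    print(c, p, float(p))
print("predicted:", predicted_class)
```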
