Machine Learning for
Developers
Dr Prakash Goteti
Technology Learning Services
Agenda
Big Picture: Introduction to Data Science
Where machine learning fits in
What is machine learning?
Machine learning case studies
Machine learning – key terminology
Predictive Analytics and Recommendation Systems
(Un)Supervised learning algorithms
Copyright © 2017 Tech Mahindra. All rights reserved. 2
Introduction to Data Science
Big Picture – Data Science
Data science process:
1. Establish the research goal
   • Define the research goal
   • Prepare the project charter
2. Gather the data
   • Internal data
   • External data
3. Prepare the data
   • Data cleansing
   • Data transformation
   • Data aggregation
4. Explore the data
   • Graphical techniques
   • Visualization techniques
   • Non-graphical techniques
5. Build a model
   • Model selection
   • Model execution
   • Model evaluation
6. Present the findings
   • Presentation
   • Automation and inferences
Big Picture – Data Science
Python libraries for each task:
• Data cleansing: NumPy and pandas
• Data visualization and reporting: matplotlib package
• Machine learning algorithms: scikit-learn toolkit
• Natural language processing: NLTK framework
• Social network analysis: NetworkX library
Introduction to Machine Learning
Machine Learning
Machine learning is an amalgamation of computer science,
engineering and statistics.
It is a tool that can be applied to many problems that involve
interpreting data and acting on it for the benefit of the business.
Machine learning uses statistics extensively.
Machine learning case studies (1-2)
GE already makes hundreds of millions of dollars by crunching the
data it collects from deep-sea oil wells or jet engines to optimize
performance, anticipate breakdowns, and streamline maintenance.
Outside North America:
In Europe, more than a dozen banks have replaced older
statistical-modeling approaches with machine-learning techniques
and, in some cases, experienced 10 percent increases in sales of
new products, 20 percent savings in capital expenditures, 20
percent increases in cash collections, and 20 percent declines in
churn.
These gains came through new recommendation engines for clients in retailing
and in small and medium-sized companies, and through more accurate
forecasting.
Machine learning case studies (2-2)
A Canadian bank uses predictive analytics to increase campaign response rates
by 600%, cut customer acquisition costs in half, and boost campaign ROI by
100%.
A research group at a leading hospital combined predictive and text analytics to
improve its ability to classify and treat pediatric brain tumors.
An airline increased revenue and customer satisfaction by better estimating the
number of passengers who won’t show up for a flight. This reduces the number of
overbooked flights that require re-accommodating passengers as well as the
number of empty seats.
These use cases show that predictive analytics (PA) can have a significant
impact on return on investment (ROI) for organizations.
PA can help companies achieve operational excellence through cost
reduction and process improvement, understand customer behavior better, identify
unexpected opportunities, and anticipate problems before they happen so that risk
mitigation and avoidance steps can be taken effectively.
Key Terminology
Features: individual measurements that, when combined with
other features, make up a training example
• Features are the key properties that describe the entities being studied.
• If the entities are represented as a table, each column is a feature
or attribute.
• Each row in the table is an instance.
Training set:
• The set of training examples (rows) used to train the algorithm.
• On a separate test set, the known target variable (or class) of each example is compared
to the predicted value to understand how accurate the algorithm is.
Training example:
• Each training example consists of its features and a target variable (its class).
Key Terminology
Data Types
• Numeric data (quantifiable things: discrete or continuous)
• Categorical data (based on categories; the categories can be enumerated)
• Ordinal data (a mixture of the two: categories with a meaningful order, e.g. star ratings for a product or movie)
Knowledge Representation:
• Knowledge is represented in the form of rules or a probability distribution
• These rules are readable by the machine.
Classification: predicting which class an instance of data should fall into.
Regression: predicting a numeric value, for example by drawing a best-fit line through the data points to generalize them.
Supervised learning:
• There is a target value given for the data
Un-supervised learning:
• There is no target value given for the data
Steps in Machine learning
1. Data collection: RSS feeds, likes, dislikes, extracting data from websites
2. Data cleansing: refining the data and columns
3. Analyze input data: recognize any patterns
4. Train the algorithm: feed the machine learning algorithm (MLA) with clean data
5. Test the algorithm: infer the results
Mathematical and Statistical Foundations
Binning
– Convert numeric data into categorical data (bins)
– Use pre-defined ranges as bins
– Example: a classification algorithm with age as the class variable
– Indicator variables: convert categorical data into Boolean data

No. of people   Age range
20              20-30
33              31-40
45              41-50
41              51-60
37              >60

Centering and Scaling
– Standardise the range of values
– Allows better comparison
– Values are “centered” by subtracting the mean from them
– Values are scaled by dividing the centered values by the standard deviation (SD)
– ML algorithms give better results with standardized values
Mean, variance and SD
– Variance describes the spread around the mean; SD is the square root of the variance
– Example sample: (2, 5, 6, 5, 9); Mean = 27/5 = 5.4
– Differences from the mean = (-3.4, -0.4, 0.6, -0.4, 3.6)
– Squared differences = (11.56, 0.16, 0.36, 0.16, 12.96)
– Variance = average of squared differences = 25.2/5 = 5.04
– SD = √5.04 ≈ 2.24
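The binning and standardization steps can be sketched with NumPy roughly as follows. This is a minimal illustration, not the deck's own notebook: the `ages` values and the variable names are assumptions made up for the example, while the sample (2, 5, 6, 5, 9) is the one used on the slide.

```python
import numpy as np

# --- Binning: map numeric ages to the slide's age ranges ---
# Hypothetical ages, chosen only for illustration.
ages = np.array([25, 38, 47, 59, 72])
bin_labels = ["20-30", "31-40", "41-50", "51-60", ">60"]
# Lower edges of the 2nd..5th bins; np.digitize returns the bin index 0..4.
bin_index = np.digitize(ages, [31, 41, 51, 61])
age_bins = [bin_labels[i] for i in bin_index]

# --- Centering and scaling the slide's sample ---
sample = np.array([2, 5, 6, 5, 9], dtype=float)
mean = sample.mean()          # 27 / 5
variance = sample.var()       # population variance: mean of squared deviations
sd = sample.std()             # square root of the variance
standardized = (sample - mean) / sd   # centered, then scaled

print(age_bins)
print(mean, variance, sd)
print(standardized)
```

After standardization the values have mean 0 and SD 1, which is what makes differently scaled features comparable.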
Correlation
The Pearson correlation coefficient r measures the strength and direction of a
linear relationship between two variables on a scatterplot. The value of r is always
between +1 and –1.
Covariance and Correlation
How strongly two attributes (X, Y) vary together
Measuring covariance:
– Capture the data sets as n-dimensional vectors: (x1, x2, …, xn); (y1, y2, …, yn)
– Convert them into vectors of deviations from their means x0 and y0: (x1 – x0, x2 – x0, …, xn – x0)
and (y1 – y0, y2 – y0, …, yn – y0)
– Take the dot product of these two vectors (it is proportional to the cosine of the angle between them) and divide by the
sample size
Measuring correlation:
– Divide the covariance by the SDs of both sets
– -1 means perfect inverse correlation; 0 means no linear correlation; 1 means perfect correlation
– Correlation does not imply causation; it helps decide which experiments to conduct
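The covariance and correlation steps listed above can be sketched in NumPy as below. The function names and the sample vectors are illustrative assumptions; the result is compared with NumPy's own `np.corrcoef` as a sanity check.

```python
import numpy as np

def covariance(x, y):
    """Dot product of the deviation vectors, divided by the sample size."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations of x from its mean
    dy = y - y.mean()   # deviations of y from its mean
    return np.dot(dx, dy) / len(x)

def correlation(x, y):
    """Covariance divided by the standard deviations of both sets."""
    return covariance(x, y) / (np.std(x) * np.std(y))

# Hypothetical data: y grows with x, z shrinks as x grows.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
z = [10, 8, 6, 4, 2]

r_xy = correlation(x, y)
r_xz = correlation(x, z)
print(r_xy, r_xz)
```

Dividing by the two standard deviations removes the units, which is why the result always lies between -1 and 1.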
Solving linear equation
• In machine learning, algorithms are trained and tested on large data sets, and
matrices are a good representation for such data.
• Matrices help reduce the dimensionality of a data set through Principal
Component Analysis (PCA).
• Training a classifier or regression algorithm by minimizing the error between the value calculated
by the nascent model and the actual value from the training data can be done using
linear algebra techniques.
Steps in solving linear equations:
Consider:
  −3x − 2y + 4z = 9
        3y − 2z = 5
   4x − 3y + 2z = 7
These can be expressed as AX = B, so X = A⁻¹B, where
A = [ −3  −2   4 ]
    [  0   3  −2 ]
    [  4  −3   2 ]
B = [ 9 ]
    [ 5 ]
    [ 7 ]
X = [ x ]
    [ y ]
    [ z ]
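As a rough sketch, the system AX = B above can be solved with NumPy. `np.linalg.solve` is used here instead of forming the inverse explicitly, which is usually the more numerically stable choice; the variable names are assumptions for the example.

```python
import numpy as np

# Coefficient matrix A and right-hand side B from the three equations.
A = np.array([[-3.0, -2.0,  4.0],
              [ 0.0,  3.0, -2.0],
              [ 4.0, -3.0,  2.0]])
B = np.array([9.0, 5.0, 7.0])

# Solve AX = B directly.
X = np.linalg.solve(A, B)

# Equivalent (but less stable) textbook form: X = A^-1 . B
X_via_inverse = np.linalg.inv(A) @ B

x, y, z = X
print(x, y, z)
```

Substituting the solution back into the three equations is a quick way to confirm it.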
Working with Data structures -Set
• A | B or A.union(B): returns a set which is the union of sets A and B.
• A |= B or A.update(B): adds all elements of B to the set A.
• A & B or A.intersection(B): returns a set which is the intersection of sets A and B.
• A &= B or A.intersection_update(B): leaves in the set A only the items that also belong to the set B.
• A - B or A.difference(B): returns the set difference of A and B (the elements included in A, but not included in B).
• A -= B or A.difference_update(B): removes all elements of B from the set A.
• A ^ B or A.symmetric_difference(B): returns the symmetric difference of sets A and B (the elements belonging to either A or B, but not to both sets simultaneously).
• A ^= B or A.symmetric_difference_update(B): writes in A the symmetric difference of sets A and B.
• A <= B or A.issubset(B): returns True if A is a subset of B.
• A >= B or A.issuperset(B): returns True if B is a subset of A.
• A < B: equivalent to A <= B and A != B
• A > B: equivalent to A >= B and A != B
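A short sketch of these set operations in Python, using two small made-up sets (the values are assumptions chosen only so each result is easy to check by eye):

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}

union = A | B                    # same as A.union(B)
intersection = A & B             # same as A.intersection(B)
difference = A - B               # same as A.difference(B)
symmetric = A ^ B                # same as A.symmetric_difference(B)

# In-place variants modify the left-hand set, so work on a copy of A.
C = set(A)
C |= B                           # same as C.update(B)

D = set(A)
D &= B                           # same as D.intersection_update(B)

is_subset = {3, 4} <= A          # same as {3, 4}.issubset(A)
is_proper_subset = A < A         # False: a set is not a proper subset of itself

print(union, intersection, difference, symmetric)
```

The operator forms require both operands to be sets, while the method forms such as `A.union(B)` accept any iterable.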
Statistics
Mean: sum of the values in the sample divided by the size of the sample:
– (x1 + x2 + x3 + … + xn)/n
Median: the middle value of the sorted set of values in the
sample.
– The median is less susceptible to outliers than the mean
– The median is a better indicator than the mean when the data is skewed or contains outliers
Mode: the most common value in the data set
– It is indicative of frequency
– Ex. 0, 1, 3, 4, 0, 3, 6, 0: the mode is 0, which occurs 3 times in the sample
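These three measures are available in Python's standard `statistics` module. A minimal sketch using the sample from the mode example (the second list with an outlier is an assumption added to show why the median is more robust):

```python
import statistics

sample = [0, 1, 3, 4, 0, 3, 6, 0]

mean = statistics.mean(sample)       # sum of values / number of values
median = statistics.median(sample)   # middle of the sorted values
mode = statistics.mode(sample)       # most common value

# Hypothetical data with one outlier: the mean moves a lot, the median barely.
with_outlier = [1, 2, 3, 4, 1000]
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print(mean, median, mode)
print(mean_outlier, median_outlier)
```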
Statistics
For normally distributed data:
• 68% of the data falls within one SD of the mean
• 95% of the data falls within two SD of the mean
• 99.7% of the data falls within three SD of the mean
Statistics
The probability density for a Gaussian distribution is given in terms of the mean
value (μ) and the variance (σ²) of the population as:
f(x) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))
The Central Limit Theorem states that
“Given a sufficiently large sample size from a population with a finite level of
variance, the mean of all samples from the same population will be approximately
equal to the mean of the population.
Furthermore, all of the samples will follow an approximate normal distribution
pattern, with all variances being approximately equal to the variance of the
population divided by each sample's size”.
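The statement can be checked with a small simulation. The sketch below is illustrative only: the exponential population, its mean of 2.0, the sample size of 50 and the fixed seed are all assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

population_mean = 2.0              # exponential population: mean 2, variance 4
population_variance = population_mean ** 2
sample_size = 50
n_samples = 10_000

# Draw many independent samples and take the mean of each one.
samples = rng.exponential(scale=population_mean, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

mean_of_means = sample_means.mean()
variance_of_means = sample_means.var()
expected_variance = population_variance / sample_size

print(mean_of_means, variance_of_means, expected_variance)
```

Although the exponential population is strongly skewed, a histogram of `sample_means` looks close to a bell curve centred on the population mean.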
Working with Numpy –’NumpyNotebook1’ examples
Cleansing the Data
Data Cleansing
Issues with data quality
• Invalid values
• Formats of the data (dd-mm-yy); spelling issues
• Dependency: referential constraints, one-to-many and unary relations
• Domain constraints, referential integrity constraints
• Duplicate records
• Missing values
• Values in wrong columns
Understanding data quality issues with pandas
• Outlier analysis
• Exploratory data analysis: charts, visualization tools
Fixing the data quality issues
• Use a programming language (R, Python, …); fix the sources
• Find issues in data processing streams
Data Cleansing –Data imputation
If a column is empty, what value do we fill in?
Fixing null and empty values
Unlike in an RDBMS, any value is valid in ML
ML considers nulls as a ‘class of data’
Techniques:
– Populate with the mean, median or mode
– Multiple imputation techniques (regression, mean, median, …)
– Use a prediction algorithm to predict the missing value
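The simplest of these techniques, filling with the median or mode, might look like this in pandas. The DataFrame contents and column names are hypothetical, made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing number and one missing category.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Pune", "Delhi", None, "Pune"],
})

# Numeric column: fill the gap with the median of the known values.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: fill the gap with the most frequent value (the mode).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```

The median is used for the numeric column because it is less sensitive to outliers than the mean; either can be swapped in depending on the data.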
Data Cleansing –Data Standardization
Numeric data
– Logarithm
– Decimal places
– Floor, ceiling
Date and time
– Time zone
– Fixing null, empty values
Text data
– Name formatting
– Upper case /lower case
Python Libraries
Installation:
Approach 1: pip install numpy scipy matplotlib ipython jupyter pandas sympy
Approach 2: Python library bundles are available through distribution platforms:
Anaconda: [Link]
Canopy: [Link]
NumPy: stands for 'Numerical Python'.
• Useful for performing operations on arrays (vectors), including multidimensional array objects.
• Other operations cover linear algebra, random number generation, etc.
pandas: provides two important data structures, Series and DataFrame
Pandas (1- 4):
A library that provides a way of processing tabular data supported by
two data structures: Series, DataFrame
Pandas (2- 4):
Creating a Series:
– By passing a list of values
– pd.Series?
– animals = ['Lion', 'Tiger', 'Bear', 'Mouse']
– pd.Series(animals)   # pandas automatically assigns index values
0     Lion
1    Tiger
2     Bear
3    Mouse
dtype: object
Series from a dictionary
– city_cap = {'India': 'New Delhi', 'US': 'Washington, D.C.'}
– s = pd.Series(city_cap)
India           New Delhi
US       Washington, D.C.
dtype: object
– To know the type of the keys:
for i in city_cap.keys():
    print(type(i))
Series from a list of indices and corresponding values
– pandas overrides the automatic creation of index values using the list of values provided
through the index parameter
– s = pd.Series([value_item_list], index=[keys_list])
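The three ways of creating a Series can be put together in one runnable sketch. It assumes the conventional `import pandas as pd` alias; the values in the last Series are made up for illustration.

```python
import pandas as pd

# 1. From a list: pandas assigns the integer index 0..n-1 automatically.
animals = ["Lion", "Tiger", "Bear", "Mouse"]
s_list = pd.Series(animals)

# 2. From a dictionary: the keys become the index labels.
city_cap = {"India": "New Delhi", "US": "Washington, D.C."}
s_dict = pd.Series(city_cap)

# 3. From values plus an explicit index (hypothetical values and labels).
s_indexed = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s_list)
print(s_dict)
print(s_indexed["b"])   # label-based lookup, like a dictionary
```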
Pandas (3- 4):
Working with DataFrame:
– pandas provides a way of processing tabular data supported by
two data structures: Series and DataFrame
Series
– A Series is a cross between an array (positional indexing) and a dictionary (label lookup)
Examples:
Data Visualization in Python
Data Visualization (1 - 6):
Data visualization: storytelling by means of visual patterns
– Before looking at the data, create an interesting story
– The story tells us which specific tools are needed for visualization
1. Identify the tool (Excel/Tableau/Python …)
2. Define the story clearly
3. Pick the right visual aid to tell the story
4. Assess the data visualization
a) Are there any distractions from the main story?
b) Does it describe your story?
Story ink ratio = story ink / total ink used to print the graphic
– The proportion of a graphic's ink devoted to the non-redundant display of the story information
Data Visualization (2 - 6):
– Pick the chart that communicates the story best!
– Bar chart: to make comparisons between categories or across time intervals
– Two types:
Horizontal (long list of categories)
Vertical (showing negative values, time periods)
– Line charts: comparing trends
– Pie chart:
Best for showing a few categories
Parts of a pie chart should add up to a meaningful whole
– Stacked areas (e.g. cumulative flow diagram)
Use when cumulative proportions matter
They are poor at showing specific values
– Histograms: to understand the spread in the data
– Box plots:
Summarise the distribution (median, minimum, maximum) of the data;
identify outliers in the data
– Scatter plots:
Used to establish the relationship between variables
Data Visualization (3 - 6):
Data Visualization (4 - 6):
Comparing colours
– Use colour only if it communicates additional information
– Themes:
– Qualitative colours (contrast): distinct colours with no obvious relationship among them
– Sequential colours (range of values):
Same colour from a faded shade to a dark shade
– Diverging colours (values that split into two segments around a midpoint):
Two colours, each running from a dark shade at its extreme to a faded shade at the midpoint
Data Visualization (5 - 6):
– Good practices
– A colour scheme should
Add information
Encode data well
Accommodate colour blindness
Print well –BW and colour
– Colour scheme tools
Color Brewer 2.0 [Link]
ColorBrewer is implemented in Python through:
[Link]
– Selection of colours
Light grey with dark lines: to show simple data
Black and red: correlation
Use legends: they indicate what each component represents
Use labels painted directly on the chart instead of on the axes
Make sure the visualization stands by itself
Use the squint test: can this visualization tell a story?
Data Visualization (6 - 6):
– matplotlib library:
– Steps
1. Create the data set and visualize the figure
2. Plot the data
3. Configure axes
4. Add annotations/legends
5. Show() or save the file as image/pdf ….
– Implementation aspects
1. import matplotlib.pyplot as plt
2. plt.figure()
3. plt.plot(x_vals, y_vals)
4. plt.plot(x2_vals, y2_vals)
5. plt.xticks([List of values])
6. plt.yticks([List of values])
7. plt.xlim(lower_x, upper_x)
8. plt.ylim(lower_y, upper_y)
9. [Link](‘’)
10. [Link](‘ ‘)
11. [Link]()
12. [Link]()
13. plt.show() / plt.savefig(…<filename>)
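The five steps can be sketched end to end with matplotlib as follows. The data values, labels, title and the output file name `example_plot.png` are assumptions made up for the example; the `Agg` backend is selected only so the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to use plt.show()
import matplotlib.pyplot as plt

# 1. Create the data set and the figure (hypothetical values).
x_vals = [0, 1, 2, 3, 4]
y_vals = [0, 1, 4, 9, 16]
y2_vals = [0, 2, 4, 6, 8]
fig = plt.figure()

# 2. Plot the data.
plt.plot(x_vals, y_vals, label="squares")
plt.plot(x_vals, y2_vals, label="doubles")

# 3. Configure the axes.
plt.xticks(x_vals)
plt.yticks([0, 4, 8, 12, 16])
plt.xlim(0, 4)
plt.ylim(0, 16)
plt.xlabel("x")
plt.ylabel("y")

# 4. Add annotations and a legend.
plt.title("Two simple series")
plt.legend()
n_lines = len(plt.gca().get_lines())  # number of plotted series

# 5. Save the figure as an image (or call plt.show() interactively).
plt.savefig("example_plot.png")
plt.close(fig)
```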
Classification of Algorithms
Supervised Learning
It is the process of creating predictive models using a set of
historical data that contains the results you are trying to
predict.
A supervised learning algorithm is one that is given examples that contain the
desired target value.
Supervised learning approaches: use past results to train a
model
Classification: identifies which group a new record (i.e., customer or
event) belongs to, based on its inherent characteristics.
Regression: uses past values to predict future values; it is used in forecasting
and variance analysis.
Predictive analytics: the practice of extracting information
from existing data sets in order to determine patterns and
predict future outcomes and trends.
Collaborative filtering: mining user behavior to make product
recommendations
Un-Supervised Learning
Unsupervised learning does not use previously known
results to train its models.
Unsupervised algorithms are not given the desired target answer;
they must find something plausible on their own.
They use descriptive statistics to identify clusters (e.g. market analysis).
They can identify:
• clusters or groups of similar records within a database (i.e., clustering)
• relationships among values in a database (i.e., association)
Tasks
Supervised learning algorithms:
k-nearest neighbors
Naïve Bayes
Support vector machines
Decision trees
Unsupervised learning algorithms:
k-means
DBSCAN
Why do we have so many algorithms?
Choice of the Algorithm
Consider your goal
If you are trying to predict or forecast a target value: supervised
learning
• If the target value is discrete {Yes/No, 1/2/3, A/B/C, Red/Yellow, …},
use a classification algorithm
• If the target value is continuous (a range of values such as 0.00 to 10.00, -99 to +99, or -∞ to +∞),
use regression
If you are NOT trying to predict or forecast a target value: unsupervised
learning
• Try to fit the data into some discrete groups (clustering)
Supervised Learning
– Classification
Introduction to classification: kNN Algorithm
To classify an input vector inX:
For every point in our data set:
    compute the distance between inX and the current point
Sort the distances in increasing order
Take the k items with the lowest distances to inX
Find the majority class among these items
Return the majority class as our prediction for the class of inX
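The pseudocode above translates almost line by line into Python. This is a minimal sketch rather than a production classifier: the function name `knn_classify` is hypothetical, Euclidean distance is assumed, and the four training points with their labels are illustrative values assumed for the example.

```python
import math
from collections import Counter

def knn_classify(in_x, data_set, labels, k):
    """Predict the class of in_x by majority vote among its k nearest points."""
    # Compute the distance between in_x and every point in the data set.
    distances = [
        (math.dist(in_x, point), label)
        for point, label in zip(data_set, labels)
    ]
    # Sort the distances in increasing order.
    distances.sort(key=lambda pair: pair[0])
    # Take the k items with the lowest distances to in_x.
    k_nearest = [label for _, label in distances[:k]]
    # Return the majority class among these items.
    return Counter(k_nearest).most_common(1)[0][0]

# Assumed training samples: (acid durability, strength) -> quality label.
training = [(7, 7), (7, 4), (3, 4), (1, 4)]
labels = ["Bad", "Bad", "Good", "Good"]

prediction = knn_classify((3, 7), training, labels, k=3)
print(prediction)
```

With k=3 the three nearest points to (3, 7) are two "Good" samples and one "Bad" sample, so the vote returns "Good".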
Example -kNN
Consider a questionnaire survey and objective testing with two attributes, acid durability and
strength, used to classify whether a special paper tissue is good or not.
Four training samples:
Suppose the factory produces a new tissue whose test values are X1=3, X2=7.
Without another expensive survey, can we guess what the classification of this new tissue is?
Example -kNN
Let the number of nearest neighbours be K=3. Compute the distance between the query instance and each training sample, and identify the 3 smallest distances.
Example -kNN
Gather the category Y of the nearest neighbours.
Example -kNN
Within k=3, we have 2 good and 1 bad as per the survey input
data.
Conclude that the new tissue paper that passes the laboratory tests with
X1=3, X2=7 is included in the good category.
Naïve Bayes:
Naïve: it simplifies the probability computations by assuming that the
predictive features are mutually independent.
Bayes: it maps the probabilities of observing the input features given the
classes to the probability distribution over classes, based on Bayes' theorem:
P(A|B) = P(B|A) P(A) / P(B)
Probability of A occurring given that B is true: P(A|B)
Probability of occurrence of A: P(A)
Probability of occurrence of B: P(B)
Probability of observing B given that A occurs: P(B|A)
Naïve Bayes (2-3):
Example 1: A doctor reported the following scenario for a
cancer screening test:
Test        Cancer   No cancer   Total
Test +ve    80       900         980
Test –ve    20       9000        9020
Total       100      9900        10000
80 out of 100 cancer patients are correctly diagnosed while the remaining 20 are not
Cancer is falsely detected in 900 out of 9900 healthy people
If the result of this screening test on a person is positive, what is the probability
that the person actually has cancer?
P(C|Pos) = P(Pos|C) P(C) / P(Pos)
P(Pos|C) = 80/100 = 0.8; P(C) = 100/10000 = 0.01; P(Pos) = 980/10000 = 0.098
P(C|Pos) = 0.8 × 0.01 / 0.098 ≈ 8.16%, which is significantly higher than the prior probability of 100/10000 = 1%
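The same calculation can be reproduced in a few lines of Python, using the counts from the screening table. The function and variable names are illustrative assumptions.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Counts taken from the screening table.
p_pos_given_c = 80 / 100      # P(Pos|C): positive test among cancer patients
p_c = 100 / 10000             # P(C): prior probability of cancer
p_pos = 980 / 10000           # P(Pos): overall probability of a positive test

p_c_given_pos = posterior(p_pos_given_c, p_c, p_pos)
print(f"P(C|Pos) = {p_c_given_pos:.4f}")
```

The result equals 80/980, the share of true positives among all positive tests, which is a useful cross-check against the table.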
Naïve Bayes (3-3):
Example 2: Spam mail detection. We observe a tendency for mails
containing the word “gift” to be spam. Classify a given new mail as spam or
ham based on the probability:
P(Spam|gift) = P(gift|Spam) P(Spam) / P(gift)
Probability of an email being spam if it contains the word “gift”: P(Spam|gift)
The numerator is the probability of a message being spam and containing the word “gift”:
P(gift|Spam) P(Spam)
The denominator is the overall probability of an email containing the word “gift”, equivalent
to: P(gift|Spam) P(Spam) + P(gift|Ham) P(Ham)
Naïve: the presence of different words is assumed to be independent of each other
Naïve Bayes (3-3):
Notation for Example 1: let C denote the event that a person has cancer and Pos the event that
the screening test is positive. The probability that the person has cancer, given that the test
result is positive, is P(C|Pos).
With A = C (the patient has cancer) and B = Pos (the test is positive), Bayes' theorem
P(A|B) = P(B|A) P(A) / P(B) gives P(C|Pos) = P(Pos|C) P(C) / P(Pos).
Un –Supervised Learning
– K Means clustering
K Means clustering (1-7):
It is the process of grouping complex data into clusters
(e.g. demographics, movies)
K stands for the number of clusters, chosen based on the
attributes of the data
“Split the data into k groups”
Which group does a given data point belong to? (scatter
plot)
Helps with categorization that we do not know
a priori!
Unlike supervised learning, we do not
already know the correct groups; we try to
converge the data into groups based on the data itself,
and the groups are also unknown (latent values)
By contrast, a supervised learning algorithm is one that is
given examples that contain the desired target
value
Ex: interesting clusters of songs based on the
attributes of each song
K Means clustering (2-7):
K Means clustering (3-7):
We randomly choose the following two centroids (k=2) for the two clusters.
In this case the 2 centroids are: m1=(1.0,1.0) and m2=(5.0,7.0).
K Means clustering (4-7):
We obtain two clusters containing: {1,2,3} and {4,5,6,7}.
Their new centroids are:
K Means clustering (5-7):
Now using these centroids we
compute the Euclidean distance of
each object, as shown in table.
Therefore, the new clusters are:
{1,2} and {3,4,5,6,7}
Next centroids are: m1=(1.25,1.5) and
m2 = (3.9,5.1)
K Means clustering (6-7):
The clusters obtained are:
{1,2} and {3,4,5,6,7}
Therefore, there is no change in
the clusters.
Thus, the algorithm comes to a halt
here and the final result consists of 2
clusters: {1,2} and {3,4,5,6,7}.
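The iteration described on these slides can be sketched in NumPy as follows. The seven data points are assumed values (the slide's data table is not reproduced in the text), chosen so that the sketch starts from the centroids (1.0,1.0) and (5.0,7.0) and ends with the clusters {1,2} and {3,4,5,6,7}. The function name `k_means` is hypothetical, and the sketch assumes no cluster ever becomes empty.

```python
import numpy as np

def k_means(points, centroids, max_iter=100):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)   # ties go to the lower-numbered cluster
        if labels is not None and np.array_equal(new_labels, labels):
            break                           # no change in the clusters: halt
        labels = new_labels
        # New centroid of each cluster = mean of its member points.
        centroids = np.array(
            [points[labels == j].mean(axis=0) for j in range(len(centroids))]
        )
    return labels, centroids

# Assumed data points 1..7 (illustrative values).
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
initial_centroids = [(1.0, 1.0), (5.0, 7.0)]

labels, centroids = k_means(points, initial_centroids)
print(labels)      # cluster index (0 or 1) for each point
print(centroids)   # final centroids m1 and m2
```

In practice, a library implementation such as scikit-learn's `KMeans` would normally be preferred; the loop above only mirrors the hand calculation.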
K Means clustering (7-7):
Thank you
Disclaimer
Tech Mahindra Limited, herein referred to as TechM provide a wide array of presentations and reports, with the contributions of various
professionals. These presentations and reports are for information purposes and private circulation only and do not constitute an offer to buy or sell
any services mentioned therein. They do not purport to be a complete description of the market conditions or developments referred to in the
material. While utmost care has been taken in preparing the above, we claim no responsibility for their accuracy. We shall not be liable for any direct
or indirect losses arising from the use thereof and the viewers are requested to use the information contained herein at their own risk. These
presentations and reports should not be reproduced, re-circulated, published in any media, website or otherwise, in any form or manner, in part or as
a whole, without the express consent in writing of TechM or its subsidiaries. Any unauthorized use, disclosure or public dissemination of information
contained herein is prohibited. Individual situations and local practices and standards may vary, so viewers and others utilizing information contained
within a presentation are free to adopt differing standards and approaches as they see fit. You may not repackage or sell the presentation. Products
and names mentioned in materials or presentations are the property of their respective owners and the mention of them does not constitute an
endorsement by TechM. Information contained in a presentation hosted or promoted by TechM is provided “as is” without warranty of any kind, either
expressed or implied, including any warranty of merchantability or fitness for a particular purpose. TechM assumes no liability or responsibility for the
contents of a presentation or the opinions expressed by the presenters. All expressions of opinion are subject to change without notice.