MachineLearning Presentation

MachineLearning

Uploaded by

Ram Prasad

Agenda
Introduction
Data Science Overview
Introduction to Machine Learning
Machine Learning Case Studies
Key Terminology
Steps in Machine Learning
Mathematical and Statistical Foundations
Working with Data Structures
Statistics
Working with Numpy
Data Cleansing
Python Libraries
Data Visualization
Classification of Algorithms
Supervised and Unsupervised Learning
Naïve Bayes
Unsupervised Learning - K Means
Conclusion

Machine Learning for

Developers

Dr Prakash Goteti
Technology Learning Services
Agenda

• Big Picture: Introduction to Data Science
• Where does machine learning fit in?
• What is machine learning?
• Machine learning case studies
• Machine learning: key terminology
• Predictive analytics and recommendation systems
• (Un)supervised learning algorithms

Copyright © 2017 Tech Mahindra. All rights reserved. 2

Introduction to Data Science

Big Picture –Data Science

The data science process has six steps, each with its own activities:

1. Establish the research goal
   – Define the research goal
   – Prepare the project charter
2. Gather the data
   – Internal data
   – External data
3. Prepare the data
   – Data cleansing
   – Data transformation
   – Data aggregation
4. Explore the data
   – Graphical techniques
   – Non-graphical techniques
   – Visualization techniques
5. Build a model
   – Model selection
   – Model execution
   – Model evaluation
6. Present the findings
   – Presentation
   – Automation and inferences

Python libraries that support these steps:
   – Numpy and Pandas: data cleansing
   – matplotlib: data visualization and reporting package
   – scikit-learn: toolkit of machine learning algorithms
   – Nltk: natural language processing framework
   – NetworkX: social network analysis library

Introduction to Machine Learning

Machine Learning

Machine learning is an amalgamation of computer science, engineering, and statistics.

It is a tool that can be applied to many problems that involve interpreting data and acting on it for the benefit of the business.

Machine learning uses statistics extensively.

Machine learning case studies (1-2)

GE already makes hundreds of millions of dollars by crunching the data it collects from deep-sea oil wells or jet engines to optimize performance, anticipate breakdowns, and streamline maintenance.

Outside North America:

In Europe, more than a dozen banks have replaced older statistical-modeling approaches with machine-learning techniques and, in some cases, experienced 10 percent increases in sales of new products, 20 percent savings in capital expenditures, 20 percent increases in cash collections, and 20 percent declines in churn.
These gains came through new recommendation engines for clients in retailing and in small and medium-sized companies, which enable more accurate forecasts.

Machine learning case studies (2-2)
• A Canadian bank uses predictive analytics to increase campaign response rates by 600%, cut customer acquisition costs in half, and boost campaign ROI by 100%.
• A research group at a leading hospital combined predictive and text analytics to improve its ability to classify and treat pediatric brain tumors.
• An airline increased revenue and customer satisfaction by better estimating the number of passengers who won't show up for a flight. This reduces the number of overbooked flights that require re-accommodating passengers as well as the number of empty seats.
• These use cases reflect an important fact: predictive analytics (PA) can have a significant impact on return on investment (ROI) for organizations.
• PA can help companies achieve operational excellence through cost reduction and process improvement, understand customer behavior better, identify unexpected opportunities, and anticipate problems before they happen so that risk mitigation and avoidance steps can be taken effectively.

Key Terminology
Features: individual measurements that, when combined with other features, make up a training example
• Features identify the key properties that describe the entities under study.
• If these entities are represented as a table, each column is a feature or attribute.
• Each row in the table is an instance.
• Features are usually the columns in a training or test set.

Training set:
• The set of training examples (rows) used to train the algorithm.
• The target variable (the class a training example belongs to) is compared with the value the algorithm predicts to measure how accurate the algorithm is.

Training example:
• Each training example consists of its features and a target variable (its class).

Key Terminology
Data Types
• Numeric data (quantifiable things: discrete or continuous)
• Categorical data (based on categories that can be enumerated)
• Ordinal data (a mixture of the two: categories with a meaningful order, such as star ratings for a product or movie)

Knowledge Representation:
• It is in the form of rules, for example a probability distribution
• These rules are readable by the machine.

Classification: predicting what class an instance of data should fall into.

Regression: predicting a numeric value, for example by drawing a best-fit line through some data points to generalize them.

Supervised learning:
• There is a target value given for the data

Un-supervised learning:
• There is no target value given for the data
Steps in Machine learning
1. Data collection: RSS feeds, likes, dislikes, extracting from websites
2. Data cleansing: refining the data/columns
3. Analyze input data: recognize any patterns
4. Train the algorithm: feed the machine learning algorithm (MLA) with clean data
5. Test the algorithm: infer the results

Mathematical and Statistical Foundations

Binning
• Convert numeric data into categorical data (bins)
• Use pre-defined ranges as bins
• Example: a classification algorithm in which age is the class variable

No of people    Age range
20              20-30
33              31-40
45              41-50
41              51-60
37              >60

• Indicator variables: convert categorical data into Boolean data

Centering and Scaling
– Standardise the range of values
– Better comparison
– Values are "centered" by subtracting the mean from them: $x - \mu$
– Values are scaled by dividing the centered values by the SD: $z = (x - \mu)/\sigma$
– ML algorithms often give better results with standardized values

Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

– Variance describes the spread around the mean: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$

– SD example: sample (2, 5, 6, 5, 9); mean = 27/5 = 5.4
– Differences from the mean = (-3.4, -0.4, 0.6, -0.4, 3.6)
– Squared differences = (11.56, 0.16, 0.36, 0.16, 12.96)
– Variance = average of the squared differences = (11.56 + 0.16 + 0.36 + 0.16 + 12.96)/5 = 25.2/5 = 5.04
– SD = √5.04 ≈ 2.24
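The binning and the centering and scaling steps can be sketched with the Python standard library alone. This is a minimal illustration for the sample (2, 5, 6, 5, 9): the function names `age_bin` and `standardize` are assumptions made for this sketch, the label for ages under 20 is also assumed, and the population SD (divide by N) is used.

```python
import math

# Pre-defined age ranges used as bins (from the table on the slide).
AGE_BINS = [(20, 30), (31, 40), (41, 50), (51, 60)]

def age_bin(age):
    """Map a numeric age to a categorical bin label."""
    if age > 60:
        return ">60"
    for low, high in AGE_BINS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "<20"  # assumption: ages below the first bin get their own label

def standardize(values):
    """Center values on their mean and scale them by the population SD."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n  # divide by N
    sd = math.sqrt(variance)
    z_scores = [(v - mean) / sd for v in values]
    return mean, variance, sd, z_scores

sample = [2, 5, 6, 5, 9]
mean, variance, sd, z_scores = standardize(sample)
print(mean, round(variance, 2), round(sd, 2))
print(age_bin(35), age_bin(72))
```

After standardization the values have mean 0 and SD 1, which makes features measured on different scales comparable.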
Correlation
• The Pearson correlation coefficient r measures the strength and direction of a linear relationship between two variables on a scatterplot. The value of r is always between +1 and –1.

Covariance and Correlation
• They measure how strongly two attributes (X, Y) are related
• Measuring covariance:
– Capture the data sets as n-dimensional vectors: (x1, x2, ..., xn); (y1, y2, ..., yn)
– Convert them into vectors of deviations from their means x0 and y0: (x1 – x0, x2 – x0, ..., xn – x0) and (y1 – y0, y2 – y0, ..., yn – y0)
– Take the dot product of these two vectors and divide by the sample size; the result is the covariance
• Measuring correlation:
– Divide the covariance by the SDs of both sets; the result equals the cosine of the angle between the two deviation vectors
– -1 means perfect inverse correlation; 0 means no correlation; 1 means perfect correlation
– Correlation is not an indicator of causation; it helps in deciding which experiments to conduct
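The covariance and correlation steps can be sketched in plain Python as follows. The function names and the small data sets are assumptions chosen for illustration; the covariance divides by the sample size N, as in the steps listed here.

```python
import math

def covariance(xs, ys):
    """Dot product of the deviation vectors, divided by the sample size."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    return sum(dx * dy for dx, dy in zip(dev_x, dev_y)) / n

def correlation(xs, ys):
    """Covariance divided by the SDs of both data sets."""
    sd_x = math.sqrt(covariance(xs, xs))  # the variance is cov(X, X)
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 6, 8, 10]    # rises with xs
ys_down = [10, 8, 6, 4, 2]  # falls as xs rises

print(covariance(xs, ys_up))
print(correlation(xs, ys_up), correlation(xs, ys_down))
```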


Solving linear equation
• In machine learning, algorithms are trained on large training sets and evaluated on test data; matrices are a good representation for such data.

• Matrices help reduce the dimensionality of a data set through Principal Component Analysis (PCA).

• Training a classifier or a regression algorithm by minimizing the error between the value calculated by the nascent classifier and the actual value from the training data can be done using linear algebra techniques.
Steps in solving linear equations:
Consider:
$$-3x - 2y + 4z = 9, \qquad 3y - 2z = 5, \qquad 4x - 3y + 2z = 7$$
These can be expressed as $AX = B$, so that $X = A^{-1}B$, where
$$A = \begin{bmatrix} -3 & -2 & 4 \\ 0 & 3 & -2 \\ 4 & -3 & 2 \end{bmatrix}, \qquad B = \begin{bmatrix} 9 \\ 5 \\ 7 \end{bmatrix}, \qquad X = \begin{bmatrix} x \\ y \\ z \end{bmatrix}$$
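As a minimal sketch, the same system can be solved with NumPy. The example uses `numpy.linalg.solve`, which is a choice made for this illustration: it solves AX = B directly and is generally preferred over forming the inverse explicitly.

```python
import numpy as np

# Coefficient matrix A and right-hand side B of the system AX = B.
A = np.array([[-3.0, -2.0, 4.0],
              [0.0, 3.0, -2.0],
              [4.0, -3.0, 2.0]])
B = np.array([9.0, 5.0, 7.0])

# Solve for X = (x, y, z) without computing the inverse of A.
X = np.linalg.solve(A, B)
print(X)

# Cross-check with the textbook form X = A^-1 . B.
X_via_inverse = np.linalg.inv(A) @ B
print(np.allclose(X, X_via_inverse))
```

Substituting the result back into the three equations confirms the solution x = 3, y = 7, z = 8.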
Working with Data structures -Set

Operator    Method                              Description
A | B       A.union(B)                          Returns a set which is the union of sets A and B.
A |= B      A.update(B)                         Adds all elements of B to the set A.
A & B       A.intersection(B)                   Returns a set which is the intersection of sets A and B.
A &= B      A.intersection_update(B)            Leaves in the set A only items that belong to the set B.
A - B       A.difference(B)                     Returns the set difference of A and B (the elements included in A, but not included in B).
A -= B      A.difference_update(B)              Removes all elements of B from the set A.
A ^ B       A.symmetric_difference(B)           Returns the symmetric difference of sets A and B (the elements belonging to either A or B, but not to both sets simultaneously).
A ^= B      A.symmetric_difference_update(B)    Writes in A the symmetric difference of sets A and B.
A <= B      A.issubset(B)                       Returns true if A is a subset of B.
A >= B      A.issuperset(B)                     Returns true if B is a subset of A.
A < B                                           Equivalent to A <= B and A != B
A > B                                           Equivalent to A >= B and A != B
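A short sketch of these set operations on two small example sets follows. The set contents and variable names are assumptions chosen for illustration.

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}

union = A | B                 # same as A.union(B)
intersection = A & B          # same as A.intersection(B)
difference = A - B            # same as A.difference(B)
symmetric = A ^ B             # same as A.symmetric_difference(B)

# In-place variants modify the set on the left-hand side.
C = set(A)                    # work on a copy so that A stays unchanged
C |= B                        # same as C.update(B)

D = set(A)
D -= B                        # same as D.difference_update(B)

is_subset = {3, 4} <= A       # same as {3, 4}.issubset(A)
is_proper_subset = A < A      # a set is never a proper subset of itself

print(union, intersection, difference, symmetric)
print(C, D, is_subset, is_proper_subset)
```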

Statistics
• Mean: the sum of the values in the sample divided by the size of the sample:
– (x1 + x2 + x3 + ... + xN)/N

• Median: the middle value of the sorted set of values in the sample.
– The median is less susceptible to outliers than the mean
– The median is a better indicator than the mean when the data is skewed or contains outliers

• Mode: the most common value in the data set
– It is indicative of frequency
– Ex. 0, 1, 3, 4, 0, 3, 6, 0: the mode is 0, which occurs 3 times in the sample
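These three measures are available in Python's built-in `statistics` module. The sketch below reuses the sample from the mode example; the extra value 1000 is an assumed outlier added to show how differently the mean and the median react.

```python
import statistics

sample = [0, 1, 3, 4, 0, 3, 6, 0]

mean = statistics.mean(sample)      # sum of the values / number of values
median = statistics.median(sample)  # middle of the sorted values
mode = statistics.mode(sample)      # most common value

# Add one extreme value: the mean shifts a lot, the median only a little.
with_outlier = sample + [1000]
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print(mean, median, mode)
print(mean_outlier, median_outlier)
```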

Statistics

For normally distributed data:
• 68% of the data falls within one SD of the mean
• 95% of the data falls within two SD of the mean
• 99.7% of the data falls within three SD of the mean

Statistics
• The probability density for a Gaussian distribution is given in terms of the mean value ($\mu$) and the variance ($\sigma^2$) of the population as:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

• The Central Limit Theorem states that

“Given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population.

Furthermore, all of the samples will follow an approximate normal distribution pattern, with all variances being approximately equal to the variance of the population divided by each sample's size”.

[Link]
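The Central Limit Theorem can be checked with a small simulation. The sketch below is an illustration under assumed settings: the population (uniform on [0, 1)), the sample size, the number of samples, and the seed are all choices made for this example. It draws many samples, takes the mean of each one, and compares the spread of those sample means with the population variance divided by the sample size.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so that the run is repeatable
sample_size = 30
num_samples = 5000

# Population: uniform on [0, 1), with mean 0.5 and variance 1/12.
population_mean = 0.5
population_variance = 1 / 12

sample_means = []
for _ in range(num_samples):
    sample = [rng.random() for _ in range(sample_size)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
variance_of_means = statistics.pvariance(sample_means)
expected_variance = population_variance / sample_size

print(mean_of_means, population_mean)
print(variance_of_means, expected_variance)
```

A histogram of `sample_means` would also show the bell shape, even though the population itself is flat.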

Working with Numpy: 'NumpyNotebook1' examples

Cleansing the Data

Data Cleansing
• Issues with data quality
– Invalid values
– Formats of the data (dd-mm-yy); spelling issues
– Dependency: referential constraints, one-to-many unary relations
– Domain constraints, referential integrity constraints
– Duplicate records
– Missing values
– Values in wrong columns
– …..
• Understanding data quality issues (Pandas):
– Outlier analysis
– Exploratory data analysis: charts, visualization tools
• Fixing the data quality issues
– Use a coding language (R, Python..); fix the sources
– Find issues in data processing streams
Data Cleansing –Data imputation
• If a column is empty, what value do we fill in?
• Fixing null, empty values
• Unlike RDBMS, any value in ML is valid
• ML considers nulls as a 'class of data'
• Techniques:
– Populate by mean, median, mode
– Multiple imputation techniques (regression, mean, median..)
– Prediction algorithm to predict the missing value
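The simplest technique, populating missing values with the mean, median, or mode, can be sketched with Pandas as follows. The column names and values are made up for illustration; the filled values are written to new columns so the three results can be compared side by side.

```python
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [25, None, 40, 31],
    "city": ["Pune", None, "Pune", "Delhi"],
})

# Numeric column: fill the gap with the mean or the median.
df["age_mean"] = df["age"].fillna(df["age"].mean())
df["age_median"] = df["age"].fillna(df["age"].median())

# Categorical column: fill the gap with the most frequent value (mode).
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```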

Data Cleansing –Data Standardization
• Numeric data
– Logarithm
– Decimal places
– Floor, ceiling

• Date and time
– Time zone
– Fixing null, empty values

• Text data
– Name formatting
– Upper case/lower case

Python Libraries
Installation:

Approach 1: pip install numpy scipy matplotlib ipython jupyter pandas sympy

Approach 2: Python library bundles are available through environment platforms:

Anaconda: [Link]
Canopy: [Link]

Numpy: It stands for 'Numerical Python'.

• Useful for performing operations on arrays (vectors), including multidimensional array objects.
• It also covers areas such as linear algebra and random number generation.

Pandas: The Pandas library provides two important data structures, namely Series and DataFrame

Pandas (1- 4):
• A library that provides a way of processing tabular data supported by two data structures: Series, DataFrame

Pandas (2- 4):
• Creating a Series:
– By passing a list of values
– [Link]?
– animals = ['Lion', 'Tiger', 'Bear', 'Mouse']
– pd.Series(animals)    Pandas automatically assigns index values
0     Lion
1    Tiger
2     Bear
3    Mouse
dtype: object
• Series from a dictionary
– city_cap = {'India': 'New Delhi', 'US': 'Washington, D.C.'}
– s = pd.Series(city_cap)
India           New Delhi
US       Washington, D.C.
dtype: object
– To know the type of the keys:
for i in city_cap.keys():
    print(type(i))
• Series from a list of indices and corresponding values
– Pandas overrides the automatic creation of index values using the list of values provided through the index parameter
– s = pd.Series([value_item_list], index=[keys_list])
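The three ways of creating a Series can be put together in one runnable sketch. It assumes Pandas is imported under the conventional alias `pd`; the variable names `s_list`, `s_dict`, and `s_indexed` are chosen for this example.

```python
import pandas as pd

# 1. From a list: Pandas assigns the index values 0, 1, 2, ... automatically.
animals = ["Lion", "Tiger", "Bear", "Mouse"]
s_list = pd.Series(animals)

# 2. From a dictionary: the keys become the index.
city_cap = {"India": "New Delhi", "US": "Washington, D.C."}
s_dict = pd.Series(city_cap)

# 3. From values plus an explicit index, which overrides the automatic one.
s_indexed = pd.Series(["New Delhi", "Washington, D.C."], index=["India", "US"])

print(s_list)
print(s_dict)
print(s_dict["India"])  # label-based lookup, as with a dictionary
print(s_list[0])        # position-style lookup, as with an array
```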
Pandas (3- 4), (4 - 4):
• Working with DataFrame and the Pandas data structures:
– Pandas processes tabular data through two data structures: Series and DataFrame
• Series
– A Series is a cross between an indexed array and a dictionary

Data Visualization in Python

Data Visualization (1 - 6):
• Data visualization: storytelling by means of visual patterns
– Before looking at the data, decide on an interesting story to tell
– The story will tell us which specific tools are needed for visualization

1. Identify the tool (Excel/Tableau/Python …)
2. Define the story clearly
3. Pick the right visual aid to tell the story
4. Assess the data visualization
a) Are there any distractions from the main story?
b) Does it describe your story?

• Story-ink ratio: $\text{story-ink ratio} = \dfrac{\text{story ink}}{\text{total ink used to print the graphic}}$
– The portion of a graphic's ink that is devoted to the non-redundant display of the story information

Data Visualization (2 - 6):
Creating effective visualization
– Pick the chart that communicates the story best!
– Bar chart: to make comparisons between categories or across time intervals
– Two types:
  • Horizontal (long list of categories)
  • Vertical (showing negative values, time periods)
– Line charts: for comparing trends
– Pie chart:
  • Best for showing a few categories
  • The parts of a pie chart should add up to a meaningful whole
– Stacked areas (ex. cumulative flow diagram)
  • Useful when cumulative proportions matter
  • They are poor at showing specific values
– Histograms: to understand the spread in the data
– Box plots:
  • Summarise the distribution (median, min_val, max_val) of the data
  • Identify outliers in the data
– Scatter plots:
  • Used to establish the relationship between the variables

Data Visualization (3 - 6):

Data Visualization (4 - 6):
• Comparing colours
– Use colour only if it communicates additional information
– Themes:
– Qualitative colours (contrast): the colours carry no obvious relationship among them
– Sequential colours (range of values): the same colour from a faded shade to a dark shade
– Diverging colours (obviously dividing segments): the same colour from a dark shade to a faded shade

Data Visualization (5 - 6):
– Good practices

– A colour scheme should
  • Add information
  • Encode data well
  • Accommodate colour blindness
  • Print well in both black-and-white and colour

– Colour scheme tools
  • Color Brewer 2.0 [Link]
  • ColorBrewer implementations in Python are available through: [Link]

– Selection of colours
  • Light grey with dark lines: to show simple data
  • Black and red: correlation
  • Use legends: they indicate what each component represents
  • Use labels painted directly on the chart instead of on the axes
  • Make sure the visualization stands by itself
  • Use the squint test: can this visualization tell a story?

Data Visualization (6 - 6):
– matplotlib library:
– Steps
1. Create the data set and visualize the figure
2. Plot the data
3. Configure axes
4. Add annotations/legends
5. Show() or save the file as image/pdf ….

– Implementation aspects
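The five steps can be sketched with matplotlib as follows. The data values, labels, and the output file name `sales.png` are made up for illustration, and the non-interactive `Agg` backend is selected so that the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render to files; no window is opened
import matplotlib.pyplot as plt

# 1. Create the data set and the figure
years = [2013, 2014, 2015, 2016, 2017]
sales = [10, 12, 15, 14, 18]  # hypothetical values
fig, ax = plt.subplots()

# 2. Plot the data
ax.plot(years, sales, marker="o", label="Sales")

# 3. Configure the axes
ax.set_xlabel("Year")
ax.set_ylabel("Sales (units)")
ax.set_xticks(years)

# 4. Add annotations/legends
ax.set_title("Yearly sales")
ax.legend()

# 5. Save the figure as an image (use plt.show() in an interactive session)
fig.savefig("sales.png")
```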

Supervised Learning
• It is the process of creating predictive models using a set of historical data that contains the results you are trying to predict.
– A supervised learning algorithm is one that is given examples that contain the desired target value

• Supervised learning approaches: use past results to train a model
– Classification: identifies which group a new record (i.e., customer or event) belongs to, based on its inherent characteristics.
– Regression: uses past values to predict future values; it is used in forecasting and variance analysis

• Predictive analytics: the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends.
– Collaborative filtering: mining user behavior to make product recommendations
Un-Supervised Learning
• Unsupervised learning does not use previously known results to train its models.
– Unsupervised algorithms are not given the desired target answer; they must find something plausible on their own.

• It uses descriptive statistics to identify clusters (ex: market analysis)

• Unsupervised algorithms can identify
– clusters or groups of similar records within a database (i.e., clustering)
– relationships among values in a database (i.e., association)

Tasks
• Supervised learning tasks
– k-Nearest neighbors
– Naïve Bayes
– Support vector machines
– Decision trees

• Unsupervised learning tasks:
– k-Means
– DBSCAN

Why do we have so many algorithms?

Choice of the Algorithm
• Consider your goal
• If you are trying to predict or forecast a target value: supervised learning
– If the target value is discrete {Yes/No, 1/2/3, A/B/C, Red/Yellow…}, use a classification algorithm
– If the target value is continuous [a range of values], use regression {0.00 to 10.00; -99 to +99; -infinity to +infinity}

• If you are NOT trying to predict or forecast a target value: unsupervised learning
– Try to fit the data into some discrete groups (clustering)

Supervised Learning
– Classification

Introduction to classification: kNN Algorithm

To classify a new point inX:

for every point in our data set:
    compute the distance between inX and the current point
sort the distances in increasing order
take the k items with the lowest distances to inX
find the majority class among these items
return the majority class as our prediction for the class of inX

Example -kNN
Consider a questionnaire survey and objective testing with two attributes, acid durability (X1) and strength (X2), used to classify whether a special paper tissue is good or not.
Four training samples:

Suppose the factory produces a new tissue with test values X1=3, X2=7.

Without an expensive survey, can we guess what the classification of this new tissue is?
[Link]

Example -kNN
1. Let the number of nearest neighbours be K=3.
2. Compute the distance between the query instance and each training sample, and identify the 3 smallest distances.
3. Gather the category Y of the nearest neighbours.

Example -kNN
• Within k=3, we have 2 good and 1 bad as per the survey input data
• Conclude that the new tissue paper that passes the laboratory tests with X1=3, X2=7 is included in the good category
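The kNN procedure can be sketched in a few lines of Python. The table of four training samples is not present in the text, so the values below are assumed for illustration; they were chosen to be consistent with the stated outcome of 2 good and 1 bad neighbours for K=3. The function name `knn_classify` is also an assumption.

```python
import math
from collections import Counter

def knn_classify(query, samples, k):
    """Return the majority class among the k training samples nearest to query."""
    # Compute the distance from the query to every training sample.
    distances = [(math.dist(query, features), label) for features, label in samples]
    # Sort in increasing order and keep the k nearest.
    distances.sort()
    nearest_labels = [label for _, label in distances[:k]]
    # The majority class is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]

# Assumed training samples: ((acid durability X1, strength X2), class Y)
training = [
    ((7, 7), "Bad"),
    ((7, 4), "Bad"),
    ((3, 4), "Good"),
    ((1, 4), "Good"),
]

prediction = knn_classify((3, 7), training, k=3)
print(prediction)
```

With these assumed samples the three nearest neighbours of (3, 7) are (3, 4), (1, 4), and (7, 7), so the majority class is "Good".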

Naïve Bayes:
Naïve: It simplifies the probability computations by assuming that predictive features are mutually independent.

Bayes: It maps the probabilities of observing the input features given the classes to the probability distribution over classes, based on Bayes' theorem:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

• Probability of A occurring given that B is true: $P(A \mid B)$
• Probability of occurrence of A: $P(A)$
• Probability of occurrence of B: $P(B)$
• Probability of observing B given that A occurs: $P(B \mid A)$

Test        Cancer    No cancer    Total
Test +ve    80        900          980
Test –ve    20        9000         9020
Total       100       9900         10000

• 80 out of the 100 people with cancer are correctly diagnosed, while the remaining 20 are not
• Cancer is falsely detected in 900 out of the 9900 healthy people
• If the result of this screening test on a person is positive, what is the probability that the person actually has cancer?
$$P(C \mid Pos) = \frac{P(Pos \mid C)\,P(C)}{P(Pos)}$$
$$P(Pos \mid C) = \frac{80}{100} = 0.8; \qquad P(C) = \frac{100}{10000} = 0.01; \qquad P(Pos) = \frac{980}{10000} = 0.098$$
$$P(C \mid Pos) = \frac{0.8 \times 0.01}{0.098} \approx 8.16\%$$
which is significantly higher than our general assumption: 100/10000 = 1%
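The screening calculation can be reproduced directly from the counts in the table. This is a minimal sketch; the function name `posterior` is an assumption chosen for the example.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Counts taken from the screening table.
cancer_and_positive = 80
cancer_total = 100
positive_total = 980
people_total = 10000

p_pos_given_c = cancer_and_positive / cancer_total  # P(Pos | C)
p_c = cancer_total / people_total                   # P(C)
p_pos = positive_total / people_total               # P(Pos)

p_c_given_pos = posterior(p_pos_given_c, p_c, p_pos)
print(f"P(C | Pos) = {p_c_given_pos:.2%}")
```

The result equals 80/980, the share of true positives among all positive tests, which is the same figure read straight from the first row of the table.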
Naïve Bayes (3-3):
Example 2: Spam mail detection. We observe a tendency for mails containing the word "gift" to be spam. Classify a given new mail as spam or ham based on the probability:
$$P(Spam \mid gift) = \frac{P(gift \mid Spam)\,P(Spam)}{P(gift)}$$

• Probability of an email being spam if it contains the word "gift": $P(Spam \mid gift)$

• The numerator is the probability of a message being spam and containing the word "gift": $P(gift \mid Spam)\,P(Spam)$

• The denominator is the overall probability of an email containing the word "gift", which is equivalent to: $P(gift \mid Spam)\,P(Spam) + P(gift \mid Ham)\,P(Ham)$

• Naïve: the presence of different words is assumed to be independent of each other
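The spam calculation can be sketched with the denominator expanded over spam and ham. The text gives no numbers for this example, so the three probabilities below are hypothetical values chosen only to show the arithmetic.

```python
# Hypothetical inputs (not from the slides):
p_spam = 0.20             # P(Spam): share of all mail that is spam
p_gift_given_spam = 0.25  # P(gift | Spam)
p_gift_given_ham = 0.01   # P(gift | Ham)

p_ham = 1 - p_spam        # every mail is either spam or ham

# Numerator: probability of a message being spam and containing "gift".
numerator = p_gift_given_spam * p_spam

# Denominator: overall probability of a message containing "gift".
p_gift = p_gift_given_spam * p_spam + p_gift_given_ham * p_ham

p_spam_given_gift = numerator / p_gift
label = "spam" if p_spam_given_gift > 0.5 else "ham"
print(round(p_spam_given_gift, 3), label)
```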

Naïve Bayes (3-3):
• Let C be the event of having cancer and Pos the event of a positive test result. The probability that the person has cancer, given that the test result is positive, is $P(C \mid Pos)$
• This follows from Bayes' theorem with A = the patient has cancer and B = the test shows positive:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Un-Supervised Learning
– K Means clustering

K Means clustering (1-7):
• It is the process of grouping complex data into clusters
– Demographics, movies
• K stands for the number of clusters, chosen based on attributes of the data
– "Split the data into k groups"
• Which group a given data point belongs to can be seen on a scatter plot
• Helps in categorization that we don't know a priori!
• Unlike supervised learning, we do not already know the correct groups; we try to converge the data into groups based on the data itself, and the groups are also unknown (latent values)
• By contrast, a supervised learning algorithm is one that is given examples that contain the desired target value
• Ex: interesting clusters of songs based on the attributes of the songs

K Means clustering (3-7):
• We randomly choose the following two centroids (k=2) for the two clusters.
• In this case the 2 centroids are: m1=(1.0,1.0) and m2=(5.0,7.0).

K Means clustering (5-7):
• Using these centroids, we compute the Euclidean distance of each object to each centroid, as shown in the table.
• Therefore, the new clusters are: {1,2} and {3,4,5,6,7}
• The next centroids are: m1=(1.25,1.5) and m2=(3.9,5.1)

K Means clustering (6-7):
• The clusters obtained are: {1,2} and {3,4,5,6,7}
• Therefore, there is no change in the clusters.
• Thus, the algorithm comes to a halt here, and the final result consists of 2 clusters: {1,2} and {3,4,5,6,7}.
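The iteration described here (assign each object to its nearest centroid, recompute the centroids, stop when the clusters no longer change) can be sketched in plain Python. The coordinates of the seven objects are not present in the text, so the points below are assumed values chosen to be consistent with the stated centroids; the function name `k_means` is also an assumption.

```python
import math

def k_means(points, initial_centroids, max_iter=100):
    """Batch k-means: returns (cluster index per point, final centroids)."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to the nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no change in the clusters: the algorithm halts
        assignment = new_assignment
        # Recompute each centroid as the mean of its member points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, centroids

# Assumed coordinates for objects 1..7 (not given in the text).
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]

assignment, centroids = k_means(points, [(1.0, 1.0), (5.0, 7.0)])
clusters = [
    {i + 1 for i, a in enumerate(assignment) if a == c}
    for c in range(len(centroids))
]
print(clusters)
print(centroids)
```

With these assumed points the run ends with the clusters {1,2} and {3,4,5,6,7} and centroids close to (1.25, 1.5) and (3.9, 5.1).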

K Means clustering (7-7):

Join Our community:

[Link]
s/[Link]?ID=2
Thank you
[Link]@[Link]

Disclaimer
Tech Mahindra Limited, herein referred to as TechM, provides a wide array of presentations and reports, with the contributions of various
professionals. These presentations and reports are for information purposes and private circulation only and do not constitute an offer to buy or sell
any services mentioned therein. They do not purport to be a complete description of the market conditions or developments referred to in the
material. While utmost care has been taken in preparing the above, we claim no responsibility for their accuracy. We shall not be liable for any direct
or indirect losses arising from the use thereof and the viewers are requested to use the information contained herein at their own risk. These
presentations and reports should not be reproduced, re-circulated, published in any media, website or otherwise, in any form or manner, in part or as
a whole, without the express consent in writing of TechM or its subsidiaries. Any unauthorized use, disclosure or public dissemination of information
contained herein is prohibited. Individual situations and local practices and standards may vary, so viewers and others utilizing information contained
within a presentation are free to adopt differing standards and approaches as they see fit. You may not repackage or sell the presentation. Products
and names mentioned in materials or presentations are the property of their respective owners and the mention of them does not constitute an
endorsement by TechM. Information contained in a presentation hosted or promoted by TechM is provided “as is” without warranty of any kind, either
expressed or implied, including any warranty of merchantability or fitness for a particular purpose. TechM assumes no liability or responsibility for the
contents of a presentation or the opinions expressed by the presenters. All expressions of opinion are subject to change without notice.

No ratings yet
K-Means Clustering Explained
11 pages
Introduction to Deep Learning Concepts
No ratings yet
Introduction to Deep Learning Concepts
23 pages
Predictive Analytics in Data Science
No ratings yet
Predictive Analytics in Data Science
35 pages
Logistic Regression vs Random Forest
100% (1)
Logistic Regression vs Random Forest
72 pages
Sign Language Recognition Using Deep Learning
No ratings yet
Sign Language Recognition Using Deep Learning
6 pages
Evolution of Machine Learning Insights
No ratings yet
Evolution of Machine Learning Insights
7 pages
Machine Learning Overview by Jane Dizon
100% (2)
Machine Learning Overview by Jane Dizon
23 pages
Machine Learning Fundamentals Overview
No ratings yet
Machine Learning Fundamentals Overview
25 pages
Linear Regression in Machine Learning
100% (1)
Linear Regression in Machine Learning
55 pages
Supervised vs Unsupervised Learning Algorithms
100% (1)
Supervised vs Unsupervised Learning Algorithms
41 pages
Data Mining and Warehousing Overview
No ratings yet
Data Mining and Warehousing Overview
25 pages
Artificial Intelligence Part1
No ratings yet
Artificial Intelligence Part1
24 pages
Machine Learning Dimensionality Reduction Techniques
No ratings yet
Machine Learning Dimensionality Reduction Techniques
79 pages
Types of Machine Learning Algorithms
No ratings yet
Types of Machine Learning Algorithms
9 pages
Machine Learning Techniques Overview
No ratings yet
Machine Learning Techniques Overview
7 pages
Linear Regression and SVM in ML
100% (1)
Linear Regression and SVM in ML
23 pages
Machine Learning Course Overview
No ratings yet
Machine Learning Course Overview
25 pages
Overview of Support Vector Machines
No ratings yet
Overview of Support Vector Machines
13 pages
Machine Learning Lab Experiments in Python
100% (1)
Machine Learning Lab Experiments in Python
15 pages
Exploratory Data Analysis Course Overview
100% (1)
Exploratory Data Analysis Course Overview
3 pages
Introduction to Machine Learning Basics
No ratings yet
Introduction to Machine Learning Basics
38 pages
Feature Selection in Sentiment Analysis
No ratings yet
Feature Selection in Sentiment Analysis
48 pages
Introduction to Machine Learning Basics
No ratings yet
Introduction to Machine Learning Basics
29 pages
Understanding Measures of Variability
No ratings yet
Understanding Measures of Variability
36 pages
Behavior Social Discounting
No ratings yet
Behavior Social Discounting
5 pages
Grade 11 Statistics 4th Quarter Exam
100% (1)
Grade 11 Statistics 4th Quarter Exam
2 pages
Introduction To Statistics and Data Analysis Roxy Peck Ebook Available Instantly
100% (1)
Introduction To Statistics and Data Analysis Roxy Peck Ebook Available Instantly
63 pages
Comprehensive Probability Formulas Guide
No ratings yet
Comprehensive Probability Formulas Guide
4 pages
JEE Main 2025 Math Qs Bank
No ratings yet
JEE Main 2025 Math Qs Bank
193 pages
Causal Hypothesis and Path Models
No ratings yet
Causal Hypothesis and Path Models
25 pages
Continuous Probability Distributions Overview
No ratings yet
Continuous Probability Distributions Overview
22 pages
Understanding Poisson Distribution
0% (1)
Understanding Poisson Distribution
22 pages
Probability Distributions: ©the Mcgraw Hill Companies, Inc. 2008 Mcgraw Hill/Irwin
No ratings yet
Probability Distributions: ©the Mcgraw Hill Companies, Inc. 2008 Mcgraw Hill/Irwin
34 pages
Comprehensive Linear Algebra & Calculus Course
No ratings yet
Comprehensive Linear Algebra & Calculus Course
12 pages
Data Science Course Overview
No ratings yet
Data Science Course Overview
8 pages
Data Analytics Course Syllabus Overview
No ratings yet
Data Analytics Course Syllabus Overview
12 pages
Discrete Random Variable Probability Analysis
No ratings yet
Discrete Random Variable Probability Analysis
2 pages
Scalars, Vectors, and Matrices in ML
No ratings yet
Scalars, Vectors, and Matrices in ML
21 pages
Probability and Statistics Course Syllabus
No ratings yet
Probability and Statistics Course Syllabus
3 pages
Financial Econometrics Course Notes
No ratings yet
Financial Econometrics Course Notes
554 pages
One-Way ANOVA of Broadband Bills
No ratings yet
One-Way ANOVA of Broadband Bills
23 pages
SPSS Independent Samples t-Test Guide
No ratings yet
SPSS Independent Samples t-Test Guide
21 pages
QMSS Math Camp Exam Instructions
No ratings yet
QMSS Math Camp Exam Instructions
6 pages
Chapter 3 Questions With Answers Fixed
No ratings yet
Chapter 3 Questions With Answers Fixed
4 pages
Mechanical Engineering Semester 3 Curriculum
No ratings yet
Mechanical Engineering Semester 3 Curriculum
12 pages
Probability and Random Variables Exam
No ratings yet
Probability and Random Variables Exam
7 pages
Understanding Principal Component Analysis
No ratings yet
Understanding Principal Component Analysis
21 pages
Central Limit Theorem in Practice
No ratings yet
Central Limit Theorem in Practice
14 pages
Statistics for Behavioral Sciences Test Bank
No ratings yet
Statistics for Behavioral Sciences Test Bank
19 pages
Hypothesis Tests & Confidence Intervals Guide
No ratings yet
Hypothesis Tests & Confidence Intervals Guide
9 pages
Quantitative Methods in Linguistics 1st Edition Keith Johnson Ebook Testbank Solutions Premium Unlock Version
100% (4)
Quantitative Methods in Linguistics 1st Edition Keith Johnson Ebook Testbank Solutions Premium Unlock Version
154 pages
Post Hoc Analysis in ANOVA Explained
No ratings yet
Post Hoc Analysis in ANOVA Explained
24 pages