Data Science With Python Notes
There are three main measures of central tendency: the mode, the median and the mean.
Consider this dataset showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
This table shows a simple frequency distribution of the retirement age data.
Age Frequency
54 3
55 1
56 1
57 2
58 2
60 2
The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.
The mode has an advantage over the median and the mean as it can be found for both numerical
and categorical (non-numerical) data.
In some cases, particularly where the data are continuous, the distribution may have no mode at
all (i.e. if all values are different).
The median is the middle value in a distribution when the values are arranged in ascending or
descending order.
The median divides the distribution in half (there are 50% of observations on either side of the
median value). In a distribution with an odd number of observations, the median value is the
middle value.
Looking at the retirement age distribution (which has 11 observations), the median is the middle
value, which is 57 years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
When the distribution has an even number of observations, the median value is the mean of the
two middle values. In the following distribution, the two middle values are 56 and 57, therefore
the median equals 56.5 years:
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The median is less affected by outliers and skewed data than the mean, and is usually the
preferred measure of central tendency when the distribution is not symmetrical.
The median cannot be identified for categorical nominal data, as it cannot be logically ordered.
The mean is the sum of the value of each observation in a dataset divided by the number of
observations. This is also known as the arithmetic average.
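54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
For the retirement age data, the values sum to 623 and there are 11 observations, so the mean is
623 / 11 ≈ 56.6 years.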
The mean can be used for both continuous and discrete numeric data.
The mean cannot be calculated for categorical data, as the values cannot be summed.
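The three measures can be checked quickly in Python. This is a minimal sketch using the standard
`statistics` module on the retirement age data above; the variable name `ages` is only illustrative.

```python
from statistics import mean, median, mode

# Retirement ages of the 11 people in the dataset above
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

print("mode:", mode(ages))            # most frequently occurring value
print("median:", median(ages))        # middle value of the sorted data
print("mean:", round(mean(ages), 2))  # sum of the values / number of values
```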
Harmonic Mean
A simple way to define a harmonic mean is to call it the reciprocal of the arithmetic mean of the
reciprocals of the observations. The most important requirement is that none of the observations
is zero.
A harmonic mean is used when averaging ratios. Common examples of ratios are speed and time, cost
and unit of material, work and time, etc. The harmonic mean (H.M.) of n observations
x1, x2, … , xn is
$\text{H.M.} = \dfrac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$
Geometric Mean
A geometric mean is a mean or average which shows the central tendency of a set of numbers by
using the product of their values. For a set of n observations, a geometric mean is the nth root of
their product. The geometric mean G.M., for a set of numbers x1, x2, … , xn is given as
$\text{G.M.} = \sqrt[n]{x_1 x_2 \cdots x_n}$
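As a rough illustration of both definitions, the sketch below uses `harmonic_mean` and
`geometric_mean` from Python's `statistics` module (this assumes Python 3.8 or later). The speed
and growth figures are made up for the example.

```python
from statistics import geometric_mean, harmonic_mean

# Hypothetical data: speeds (km/h) on two legs of equal distance
speeds = [40, 60]
hm = harmonic_mean(speeds)    # n / (1/40 + 1/60)

# Hypothetical data: growth factors in two consecutive years
growth = [2, 8]
gm = geometric_mean(growth)   # (2 * 8) ** (1 / 2)

print("harmonic mean:", hm)
print("geometric mean:", gm)
```

The harmonic mean of the two speeds is lower than their arithmetic mean of 50, which is the
expected behaviour when averaging rates.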
MEASURES OF DISPERSION
In statistics, measures of dispersion help to interpret the variability of data, i.e. to know how
homogeneous or heterogeneous the data are. In simple terms, they show how squeezed or
scattered the variable is.
Types of Measures of Dispersion
There are two main types of dispersion measures in statistics: absolute and relative.
Absolute measures express the variation in terms of the average of deviations of observations,
such as the standard deviation or the mean deviation. They include the range, standard deviation,
quartile deviation, etc.
The types of absolute measures of dispersion are:
1. Range: It is simply the difference between the maximum value and the minimum value
given in a data set. Example: 1, 3,5, 6, 7 => Range = 7 -1= 6
2. Variance: Subtract the mean from each value in the set, square each difference, add the
squares, and finally divide the sum by the total number of values in the data set. The result is
the variance: $\sigma^2 = \sum (X - \mu)^2 / N$
3. Standard Deviation: The square root of the variance is known as the standard deviation,
i.e. S.D. $= \sigma = \sqrt{\sigma^2}$.
4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of numbers
into quarters. The quartile deviation is half of the distance between the third and the first
quartile.
5. Mean and Mean Deviation: The average of numbers is known as the mean and the
arithmetic mean of the absolute deviations of the observations from a measure of central
tendency is known as the mean deviation (also called mean absolute deviation).
The relative measures of dispersion are:
1. Co-efficient of Range
2. Co-efficient of Variation
3. Co-efficient of Standard Deviation
4. Co-efficient of Quartile Deviation
5. Co-efficient of Mean Deviation
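The measures listed above can be sketched in a few lines of Python. This example reuses the data
from the range example (1, 3, 5, 6, 7). Quartiles can be computed by several conventions; the
default method of `statistics.quantiles` is used here as an assumption, and the coefficient of
variation is shown as one example of a relative measure.

```python
from statistics import mean, pstdev, pvariance, quantiles

data = [1, 3, 5, 6, 7]  # values from the range example above

data_range = max(data) - min(data)      # maximum - minimum
variance = pvariance(data)              # sum((x - mu)^2) / N
std_dev = pstdev(data)                  # square root of the variance
q1, q2, q3 = quantiles(data, n=4)       # quartiles (library default method)
quartile_deviation = (q3 - q1) / 2      # half the distance between Q3 and Q1
mu = mean(data)
mean_deviation = mean([abs(x - mu) for x in data])  # mean absolute deviation
coeff_variation = std_dev / mu          # relative measure: S.D. divided by the mean

print(data_range, variance, std_dev, quartile_deviation, mean_deviation, coeff_variation)
```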
RANDOM VARIABLES
A random variable is a rule that assigns a numerical value to each outcome in a sample space.
Random variables may be either discrete or continuous. A random variable is said to be discrete
if it assumes only specified values in an interval. Otherwise, it is continuous. We generally
denote the random variables with capital letters such as X and Y.
Examples of discrete random variables include:
The number of eggs that a hen lays in a given day (it can’t be 2.3)
The number of people going to a given soccer match
The number of students that come to class on a given day
Continuous random variables, on the other hand, take on values that vary continuously within
one or more real intervals, and have a cumulative distribution function (CDF) that is absolutely
continuous. As a result, the random variable has an uncountably infinite number of possible
values.
DISCRETE PROBABILITY DISTRIBUTIONS
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure),
and a single trial. So the random variable X which has a Bernoulli distribution can take value 1
with the probability of success, say p, and the value 0 with the probability of failure, say q or 1-p.
The probabilities of success and failure need not be equal, like the result of a fight
between me and Hulk. He is pretty much certain to win. So in this case the probability of my
success is 0.15 while the probability of my failure is 0.85.
Here, the probability of success (p) is not the same as the probability of failure. So, the chart
below shows the Bernoulli Distribution of our fight.
The expected value of a random variable X from a Bernoulli distribution is found as follows:
$E(X) = 1 \cdot p + 0 \cdot (1 - p) = p$
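For the fight example, the expected value can be worked out directly from the two outcomes. The
sketch below weights each outcome (1 or 0) by its probability. The variance line is not part of
the notes above; it is included as the standard result Var(X) = p(1 − p).

```python
p = 0.15     # probability of success in the fight example
q = 1 - p    # probability of failure

# E(X): each outcome weighted by its probability
expected_value = 1 * p + 0 * q

# Var(X): squared distance of each outcome from E(X), weighted by its probability
variance = (1 - expected_value) ** 2 * p + (0 - expected_value) ** 2 * q

print("E(X) =", expected_value)
print("Var(X) =", variance)
```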
Binomial Distribution
Suppose you win the toss today; this counts as a successful event. You toss again, but this time
you lose. Winning a toss today does not mean that you will win the toss tomorrow. Assign a
random variable, say X, to the number of times you win the toss. What are the possible values of
X? It can be any whole number from 0 up to the number of times you toss the coin.
There are only two possible outcomes: a head denoting success and a tail denoting failure.
Therefore, the probability of getting a head is p = 0.5 and the probability of failure can be
easily computed as: q = 1 - p = 0.5.
A distribution where only two outcomes are possible, such as success or failure, gain or loss, win
or lose and where the probability of success and failure is same for all the trials is called a
Binomial Distribution.
Each trial is independent since the outcome of the previous toss doesn’t determine or affect the
outcome of the current toss. An experiment with only two possible outcomes repeated n number
of times is called binomial. The parameters of a binomial distribution are n and p where n is the
total number of trials and p is the probability of success in each trial.
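To make the roles of n and p concrete, here is a small sketch of the binomial probability of
exactly k successes, using the standard formula P(X = k) = C(n, k) · p^k · (1 − p)^(n − k). The
formula is not written out in these notes, and the function name and the choice of 10 tosses are
only illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # hypothetical experiment: 10 tosses of a fair coin
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(k, round(prob, 4))
```

The probabilities over k = 0 … n add up to 1, and with p = 0.5 the distribution is symmetric
around n/2.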
On the basis of the above explanation, the properties of a Binomial Distribution are
A binomial distribution graph where the probability of success does not equal the probability of
failure looks like
Poisson Distribution
Suppose you work at a call center. Approximately how many calls do you get in a day? It can be
any number. The total number of calls at a call center in a day can be modeled by a Poisson
distribution. Some more examples are
You can now think of many examples following the same course. Poisson Distribution is
applicable in situations where events occur at random points of time and space wherein our
interest lies only in the number of occurrences of the event.
A distribution is called Poisson distribution when the following assumptions are valid:
1. Any successful event should not influence the outcome of another successful event.
2. The probability of success in an interval depends only on the length of the interval, so two
intervals of equal length have the same probability of success.
3. The probability of success in an interval approaches zero as the interval becomes smaller.
Now, if any distribution validates the above assumptions then it is a Poisson distribution. Some
notations used in Poisson distribution are:
Here, X is called a Poisson Random Variable and the probability distribution of X is called
Poisson distribution.
Let µ denote the mean number of events in an interval of length t. Then, µ = λ*t, where λ is the
rate at which events occur.
The mean µ is the parameter of this distribution. The graph of a Poisson distribution is shown
below:
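As a sketch of how µ = λ*t is used, the code below evaluates the standard Poisson probability
P(X = k) = e^(−µ) · µ^k / k!. That formula is not written out in these notes, λ is assumed to be
the rate of events per unit time, and the call-center numbers are invented for the example.

```python
from math import exp, factorial

def poisson_pmf(k, mu):
    """Probability of observing exactly k events when the mean count is mu."""
    return exp(-mu) * mu**k / factorial(k)

rate = 3       # hypothetical: 3 calls per hour (lambda)
t = 2          # interval length in hours
mu = rate * t  # mean number of calls in the interval

for k in range(0, 13, 3):
    print(k, round(poisson_pmf(k, mu), 4))
```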
CONTINUOUS PROBABILITY DISTRIBUTIONS
Exponential Distribution
Consider the call center example. What about the interval of time between the calls? Here, the
exponential distribution comes to our rescue. It models the interval of time between the calls.
Exponential distribution is widely used for survival analysis. From the expected life of a machine
to the expected life of a human, exponential distribution successfully delivers the result.
$f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$, and $f(x) = 0$ otherwise
For survival analysis, λ is called the failure rate of a device at any time t, given that it has
survived up to t.
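A minimal sketch of the density above, together with the survival probability P(X > t) =
e^(−λt). The survival probability is a standard property that these notes do not state. The
function names and the failure rate of 0.5 per year are assumptions made for illustration.

```python
from math import exp

def exponential_pdf(x, lam):
    """Density lam * e^(-lam * x) for x >= 0, and 0 for negative x."""
    return lam * exp(-lam * x) if x >= 0 else 0.0

def survival(t, lam):
    """Probability that the waiting time (or lifetime) exceeds t."""
    return exp(-lam * t)

lam = 0.5  # hypothetical failure rate: 0.5 failures per year
print("density at 1 year:", exponential_pdf(1, lam))
print("P(survive beyond 2 years):", survival(2, lam))
```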
CHI-SQUARE DISTRIBUTION
A random variable χ follows a chi-square distribution if it can be written as a sum of squared
standard normal variables.
Degrees of freedom:
Degrees of freedom refers to the maximum number of logically independent values, which have
the freedom to vary. In simple words, it can be defined as the total number of observations minus
the number of independent constraints imposed on the observations.
In the above figure, we could see Chi-Square distribution for different degrees of freedom. We
can also observe that as the degrees of freedom increase Chi-Square distribution approximates to
normal distribution.
A chi-square test is used in statistics to test the independence of two events. Given the data of
two variables, we can get the observed count O and the expected count E. Chi-square measures how
much the expected count E and the observed count O deviate from each other.
Let’s consider a scenario where we need to determine the relationship between an independent
categorical feature (predictor) and a dependent categorical feature (response). In feature
selection, we aim to select the features which are highly dependent on the response.
When two features are independent, the observed count is close to the expected count, so the
Chi-Square value is small. A high Chi-Square value therefore indicates that the hypothesis of
independence is incorrect. In simple words, the higher the Chi-Square value, the more dependent
the feature is on the response, and the better a candidate it is for model training.
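The comparison of observed and expected counts can be sketched as follows. The statistic used
is the usual χ² = Σ (O − E)² / E, with each expected count computed as row total × column total /
grand total. The formula is not written out in these notes, and the 2×2 table of counts is
hypothetical.

```python
# Hypothetical observed counts: rows = feature categories, columns = response categories
observed = [
    [30, 10],
    [20, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count for each cell if the feature and the response were independent
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Sum of (O - E)^2 / E over all cells
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
print("expected counts:", expected)
print("chi-square:", round(chi_square, 3))
```

A larger value means the observed counts sit further from what independence would predict.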
Normal distribution represents the behavior of most of the situations in the universe (That is
why it’s called a “normal” distribution. I guess!). The large sum of (small) random variables
often turns out to be normally distributed, contributing to its widespread application. Any
distribution is known as Normal distribution if it has the following characteristics:
A normal distribution is highly different from Binomial Distribution. However, if the number of
trials approaches infinity then the shapes will be quite similar.
The mean and variance of a normally distributed random variable X, with parameters µ and σ, are
given by:
$E(X) = \mu$, $\mathrm{Var}(X) = \sigma^2$
A standard normal distribution is defined as the distribution with mean 0 and standard deviation
1. For such a case, the PDF becomes:
$f(x) = \dfrac{1}{\sqrt{2\pi}} e^{-x^2 / 2}$
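The standard normal density is f(x) = e^(−x²/2) / √(2π). The sketch below evaluates it by hand
and compares the result with `statistics.NormalDist` from the standard library (this assumes
Python 3.8 or later). The function name is illustrative.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def standard_normal_pdf(x):
    """Density of the normal distribution with mean 0 and standard deviation 1."""
    return exp(-(x**2) / 2) / sqrt(2 * pi)

std_normal = NormalDist(mu=0, sigma=1)
for x in (-2, -1, 0, 1, 2):
    # The two columns should agree up to floating-point rounding
    print(x, standard_normal_pdf(x), std_normal.pdf(x))
```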
WHITE-NOISE PROCESS
If a time series is white noise, it is a sequence of random numbers and cannot be predicted. If
the series of forecast errors is not white noise, it suggests improvements could be made to the
predictive model.
1. Predictability: If your time series is white noise, then, by definition, it is random. You cannot
reasonably model it and make predictions.
2. Model Diagnostics: The series of errors from a time series forecast model should ideally be
white noise.
VARIANCE
In statistics, variance refers to the spread of a data set. It’s a measurement used to identify how
far each number in the data set is from the mean.
In market research, variance is particularly useful when estimating probabilities of future
events, because it describes how widely the values that a random variable can take are spread
around its mean.
A variance value of zero represents that all of the values within a data set are identical, while all
variances that are not equal to zero will come in the form of positive numbers.
The larger the variance, the more spread in the data set.
A large variance means that the numbers in a set are far from the mean and each other. A small
variance means that the numbers are closer together in value.
Variance is calculated by taking the differences between each number in a data set and the mean,
squaring those differences to give them positive value, and dividing the sum of the resulting
squares by the number of values in the set:
$\sigma^2 = \sum (X - \mu)^2 / N$
In this formula, X represents an individual data point, μ represents the mean of the data points,
and N represents the total number of data points.
Note that when calculating a sample variance in order to estimate a population variance, the
denominator of the variance equation becomes N - 1. This removes bias from the estimate, as
dividing by N would tend to underestimate the population variance.
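The difference between the two denominators can be seen with the standard `statistics` module:
`pvariance` divides by N and `variance` divides by N − 1. The data set below is hypothetical.

```python
from statistics import mean, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set with mean 5

population_var = pvariance(data)  # sum of squared deviations / N
sample_var = variance(data)       # sum of squared deviations / (N - 1)

print("mean:", mean(data))
print("population variance:", population_var)
print("sample variance:", sample_var)
```

The sample variance is always the larger of the two for the same data, since the same sum is
divided by a smaller number.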
An Advantage of Variance
One of the primary advantages of variance is that it treats all deviations from the mean of the
data set in the same way, regardless of direction.
Because the deviations are squared, positive and negative deviations cannot cancel each other out
and sum to zero, which would give the appearance that there was no variability in the data set at
all.
CORRELATION COEFFICIENT
The correlation coefficient is the term used to refer to the resulting correlation measurement. It
will always maintain a value between one and negative one.
When the correlation coefficient is one, the variables under examination have a perfect positive
correlation. In other words, when one moves, so does the other in the same direction,
proportionally.
If the correlation coefficient is less than one, but still greater than zero, it indicates a less than
perfect positive correlation. The closer the correlation coefficient gets to one, the stronger the
correlation between the two variables.
When the correlation coefficient is zero, it means that there is no identifiable linear
relationship between the variables. If one variable moves, the correlation gives no basis for
predicting the movement of the other variable.
If the correlation coefficient is negative one, this means that the variables are perfectly
negatively or inversely correlated. If one variable increases, the other will decrease at the same
proportion. The variables will move in opposite directions from each other.
If the correlation coefficient is greater than negative one but less than zero, it indicates an
imperfect negative correlation. As the coefficient approaches negative one, the negative
correlation grows stronger.
COVARIANCE
Covariance signifies the direction of the linear relationship between the two variables. By
direction we mean if the variables are directly proportional or inversely proportional to each
other. (Increasing the value of one variable might have a positive or a negative impact on the
value of the other variable).
The values of covariance can be any number between the two opposite infinities. Also, it’s
important to mention that covariance only measures how two variables change together, not
the dependency of one variable on another one.
The value of covariance between 2 variables is achieved by taking the summation of the
product of the differences from the means of the variables as follows:
The upper and lower limits for the covariance depend on the variances of the variables
involved. These variances, in turn, can vary with the scaling of the variables. Even a change
in the units of measurement can change the covariance. Thus, covariance is only useful to
find the direction of the relationship between two variables and not the magnitude. Below are
the plots which help us understand how the covariance between two variables would look in
different directions.
CORRELATION
Correlation not only shows the kind of relation (in terms of direction) but also how strong the
relationship is. Thus, we can say the correlation values have standardized notions, whereas
the covariance values are not standardized and cannot be used to compare how strong or
weak the relationship is because the magnitude has no direct significance. It can assume
values from -1 to +1.
To determine whether the covariance of the two variables is large or small, we need to assess
it relative to the standard deviations of the two variables.
To do so we have to normalize the covariance by dividing it with the product of the standard
deviations of the two variables, thus providing a correlation between the two variables.
The correlation coefficient is a dimensionless metric and its value ranges from -1 to +1.
The closer it is to +1 or -1, the more closely the two variables are related.
If there is no relationship at all between two variables, then the correlation coefficient will
certainly be 0. However, if it is 0 then we can only say that there is no linear relationship.
There could exist other functional relationships between the variables.
When the correlation coefficient is positive, an increase in one variable also increases the
other. When the correlation coefficient is negative, the changes in the two variables are in
opposite directions.
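The relationship between covariance and correlation described above can be sketched in plain
Python. The covariance here divides by N (the population form); dividing by N − 1 is an equally
common convention, so treat that choice as an assumption. The data and function names are
hypothetical.

```python
from math import sqrt

def covariance(a, b):
    """Mean of the products of deviations from the means (divides by N)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

def correlation(a, b):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(a, b) / sqrt(covariance(a, a) * covariance(b, b))

x = [1, 2, 3, 4, 5]            # hypothetical measurements
y = [2, 4, 5, 4, 5]
x_scaled = [10 * v for v in x]  # the same quantity in different units

print("cov(x, y):", covariance(x, y))
print("cov(10x, y):", covariance(x_scaled, y))    # changes with the units
print("corr(x, y):", correlation(x, y))
print("corr(10x, y):", correlation(x_scaled, y))  # unchanged by the units
```

Rescaling x multiplies the covariance by the same factor but leaves the correlation unchanged,
which is why only the correlation can be used to compare the strength of relationships.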
A statistical hypothesis is an assumption about a population which may or may not be true.
Hypothesis testing is a set of formal procedures used by statisticians to either accept or reject
statistical hypotheses. Statistical hypotheses are of two types:
Null hypothesis, H0 - represents the hypothesis that the observations result purely from
chance.
Alternative hypothesis, Ha - represents the hypothesis that the observations are
influenced by some non-random cause.
Example
Suppose we wanted to check whether a coin was fair and balanced. A null hypothesis might say
that half of the flips will be heads and half will be tails, whereas the alternative hypothesis
might say that the numbers of heads and tails will be very different.
H0: P=0.5
Ha: P≠0.5
For example, suppose we flipped the coin 50 times and got 40 heads and 10 tails. Given this
result, we would reject the null hypothesis and conclude, based on the evidence, that the
coin was probably not fair and balanced.
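One way to quantify "very unlikely" for the coin example is an exact binomial calculation. The
sketch below computes the probability, under H0: P = 0.5, of a result at least as extreme as 40
heads in 50 flips. The two-sided p-value and the 0.05 significance level are assumptions chosen
for illustration; the notes do not prescribe a particular test.

```python
from math import comb

n, heads = 50, 40

# P(X >= 40) when each flip lands heads with probability 0.5
upper_tail = sum(comb(n, k) for k in range(heads, n + 1)) / 2**n

# Two-sided p-value: under H0 the distribution is symmetric, so double the tail
p_value = 2 * upper_tail

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha

print("p-value:", p_value)
print("reject H0:", reject_null)
```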
Hypothesis Tests
The following formal process is used by statisticians to determine whether to reject a null
hypothesis, based on sample data. This process is called hypothesis testing and consists of the
following four steps:
1. State the hypotheses - This step involves stating both null and alternative hypotheses.
The hypotheses should be stated in such a way that they are mutually exclusive. If one is
true then the other must be false.
2. Formulate an analysis plan - The analysis plan is to describe how to use the sample
data to evaluate the null hypothesis. The evaluation process focuses around a single test
statistic.
3. Analyze sample data - Find the value of the test statistic (using properties like mean
score, proportion, t statistic, z-score, etc.) stated in the analysis plan.
4. Interpret results - Apply the decisions stated in the analysis plan. If the value of the test
statistic is very unlikely based on the null hypothesis, then reject the null hypothesis.
UNIT -2
1. EXPLORATORY DATA ANALYSIS (EDA)
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets
and summarize their main characteristics, often employing data visualization methods. It
helps determine how best to manipulate data sources to get the answers you need, making it
easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check
assumptions.
The main purpose of EDA is to help look at data before making any assumptions. It can help
identify obvious errors, as well as better understand patterns within the data, detect outliers or
anomalous events, and find interesting relations among the variables.
Data scientists can use exploratory analysis to ensure the results they produce are valid and
applicable to any desired business outcomes and goals. EDA also helps stakeholders by
confirming they are asking the right questions. EDA can help answer questions about
standard deviations, categorical variables, and confidence intervals. Once EDA is complete
and insights are drawn, its features can then be used for more sophisticated data analysis or
modeling, including machine learning.
Specific statistical functions and techniques you can perform with EDA tools include:
Clustering and dimension reduction techniques, which help create graphical displays
of high-dimensional data containing many variables.
Univariate visualization of each field in the raw dataset, with summary statistics.
Bivariate visualizations and summary statistics that allow you to assess the
relationship between each variable in the dataset and the target variable you’re
looking at.
Multivariate visualizations, for mapping and understanding interactions between
different fields in the data.
K-means Clustering is a clustering method in unsupervised learning where data
points are assigned into K groups, i.e. the number of clusters, based on the distance
from each group’s centroid. The data points closest to a particular centroid will be
clustered under the same category. K-means Clustering is commonly used in market
segmentation, pattern recognition, and image compression.
Predictive models, such as linear regression, use statistics and data to predict
outcomes.
Univariate non-graphical. This is the simplest form of data analysis, where the data
being analyzed consists of just one variable. Since it’s a single variable, it doesn’t
deal with causes or relationships. The main purpose of univariate analysis is to
describe the data and find patterns that exist within it.
Univariate graphical. Non-graphical methods don’t provide a full picture of the
data. Graphical methods are therefore required. Common types of univariate
graphics include:
o Stem-and-leaf plots, which show all data values and the shape of the
distribution.
o Histograms, a bar plot in which each bar represents the frequency (count) or
proportion (count/total count) of cases for a range of values.
o Box plots, which graphically depict the five-number summary of minimum,
first quartile, median, third quartile, and maximum.
Multivariate nongraphical: Multivariate data arises from more than one variable.
Multivariate non-graphical EDA techniques generally show the relationship
between two or more variables of the data through cross-tabulation or statistics.
Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships
between two or more sets of data. The most used graphic is a grouped bar plot or bar
chart with each group representing one level of one of the variables and each bar
within a group representing the levels of the other variable.
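As a small non-graphical illustration of the ideas above, the sketch below computes a
five-number summary for one variable (univariate) and a cross-tabulation of two variables
(multivariate) using only the standard library. The records, the age threshold of 35 and the
choice of the "inclusive" quartile method are all assumptions made for the example.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical records: (age, department)
records = [
    (23, "sales"), (35, "sales"), (31, "tech"),
    (44, "tech"), (29, "tech"), (52, "sales"),
]

# Univariate non-graphical: five-number summary of the ages
ages = sorted(age for age, _ in records)
q1, med, q3 = quantiles(ages, n=4, method="inclusive")
five_number = (ages[0], q1, med, q3, ages[-1])
print("min, Q1, median, Q3, max:", five_number)

# Multivariate non-graphical: cross-tabulation of department against age group
crosstab = Counter(
    (dept, "under 35" if age < 35 else "35+") for age, dept in records
)
for cell, count in sorted(crosstab.items()):
    print(cell, count)
```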
Some of the most common data science tools used to create an EDA include:
There are important reasons anyone working with data should do EDA.
In the context of data generated from logs, EDA also helps with debugging the logging
process. For example, “patterns” you find in the data could actually be something wrong in
the logging process that needs to be fixed. If you never go to the trouble of debugging, you’ll
continue to think your patterns are real. The engineers we’ve worked with are always grateful
for help in this area.
2. THE LIFECYCLE OF DATA SCIENCE
1. Business Understanding: The complete cycle revolves around the business goal. What
will you solve if you do not have a specific problem? It is extremely important to
understand the business goal clearly, because that will be the ultimate aim of the
analysis. Only after gaining a proper understanding can we set the specific goal of the
analysis so that it is in sync with the business objective. You need to understand whether the
customer wants to minimize savings loss, or whether they want to predict the price of a
commodity, etc.
3. Preparation of Data: Next comes the data preparation stage. This consists of steps like
selecting the relevant data, integrating the data by merging the data sets, cleaning it,
treating missing values by either eliminating them or imputing them, treating inaccurate
data by eliminating them, and checking for outliers using box plots. It also includes
constructing new data by deriving new features from existing ones.
4. Exploratory Data Analysis: This step involves getting some idea about the solution
and the factors affecting it before building the actual model. The distribution of data
within the different variables is explored graphically using bar graphs.
Relations between different features are captured through graphical representations like
scatter plots and heat maps. Many data visualization techniques are used extensively to
explore every feature individually and in combination with other features.
5. Data Modeling: A model takes the prepared data as input and gives the desired
output. This step consists of selecting the suitable kind of model, depending on whether the
problem is a classification problem, a regression problem or a clustering problem. After
deciding on the model family, we need to carefully choose which of the algorithms in that
family to implement. We need to tune the hyperparameters of every model to obtain the desired
performance.
7. Model Deployment: This is the last step in the data science life cycle. Each step in the
data science life cycle defined above must be worked on carefully. If any step is
performed improperly, it affects the subsequent step and the complete
effort goes to waste. For example, if data is not collected properly, you’ll lose
records and you will not be building an ideal model. If data is not cleaned
properly, the model will not work. If the model is not evaluated properly, it will fail
in the real world. Right from business understanding to model deployment, every step has to
be given appropriate attention, time, and effort.
3. DESCRIPTIVE STATISTICS
Descriptive statistics summarize and organize characteristics of a data set. A data set is a
collection of responses or observations from a sample or entire population.
In quantitative research, after collecting data, the first step of statistical analysis is to describe
characteristics of the responses, such as the average of one variable (e.g., age), or the relation
between two variables (e.g., age and creativity).
The next step is inferential statistics, which help you decide whether your data confirms or
refutes your hypothesis and whether it is generalizable to a larger population.
Frequency distribution
Frequency distributions are particularly useful for normal distributions, which show the
observations of probabilities divided among standard deviations.
In finance, traders use frequency distributions to take note of price action and identify trends.
The mean, or M, is the most commonly used method for finding the average.
To find the mean, simply add up all response values and divide the sum by the total number
of responses. The total number of responses or observations is called N.
To find the median, order each response value from the smallest to the biggest. Then, the
median is the number in the middle. If there are two numbers in the middle, find their mean.
The mode is simply the most popular or most frequent response value. A data set can
have no mode, one mode, or more than one mode.
To find the mode, order your data set from lowest to highest and find the response that occurs
most frequently.
Measures of variability
Measures of variability give you a sense of how spread out the response values are. The
range, standard deviation and variance each reflect different aspects of spread.
Range
The range gives you an idea of how far apart the most extreme response scores are. To find
the range, simply subtract the lowest value from the highest value.
Range of visits to the library in the past year Ordered data set: 0, 3, 3, 12, 15, 24
Range: 24 – 0 = 24
Standard deviation
The standard deviation (s) is the average amount of variability in your dataset. It tells you, on
average, how far each score lies from the mean. The larger the standard deviation, the more
variable the data set is.
Standard deviations of visits to the library in the past year
In the table below, you complete Steps 1 through 4.
Raw data Deviation from mean Squared deviation
From learning that s = 9.18, you can say that on average, each score deviates from the mean
by 9.18 points.
Variance
The variance is the average of squared deviations from the mean. Variance reflects the degree
of spread in the data set. The more spread the data, the larger the variance is in relation to the
mean.
To find the variance, simply square the standard deviation. The symbol for variance is s².
Variance of visits to the library in the past year Data set: 15, 3, 12, 0, 24, 3
s = 9.18
s² = 84.3
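The figures for the library visits can be reproduced with the standard `statistics` module. This
sketch assumes the sample formulas (dividing by N − 1), which is what gives the values quoted
above.

```python
from statistics import mean, stdev, variance

visits = [15, 3, 12, 0, 24, 3]  # visits to the library in the past year

m = mean(visits)        # 57 / 6
s2 = variance(visits)   # sample variance: sum of squared deviations / (N - 1)
s = stdev(visits)       # sample standard deviation: square root of s2

print("range:", max(visits) - min(visits))
print("mean:", m)
print("s:", round(s, 2))
print("s2:", round(s2, 1))
```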
DATA VISUALIZATION
Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to
see and understand trends, outliers, and patterns in data.
In the world of Big Data, data visualization tools and technologies are essential to analyze
massive amounts of information and make data-driven decisions.
Bar Chart
Box-and-whisker Plots
Bubble Cloud
Gantt Chart
Heat Map
Histogram
Radial Tree
Scatter Plot (2D or 3D)
Scatter Plot
A scatter plot is a chart type that is normally used to observe and visually display the
relationship between variables. The values of the variables are represented by dots.
The positioning of the dots on the vertical and horizontal axis will inform the value of
the respective data point; hence, scatter plots make use of Cartesian coordinates to
display the values of the variables in a data set. Scatter plots are also known as
scattergrams, scatter graphs, or scatter charts.
Scatter Plot Applications and Uses
The scatter plot diagram for the data above is seen below:
Bar Graph
The pictorial representation of grouped data, in the form of vertical or horizontal
rectangular bars, where the lengths of the bars are proportional to the measure of data, is
known as a bar graph or bar chart.
The bars drawn are of uniform width, and the variable quantity is represented on one of the
axes. The measure of the variable is depicted on the other axis. The heights or the
lengths of the bars denote the value of the variable, and these graphs are also used to compare
certain quantities. Frequency distribution tables can be easily represented using bar charts,
which simplify the calculations and understanding of data.
The three major attributes of bar graphs are:
The bar graph helps to compare the different sets of data among different groups
easily.
It shows the relationship using two axes, with the categories on one axis and the
discrete values on the other axis.
The graph shows the major changes in data over time.
The types of bar charts are as follows:
A bar graph summarises a large set of data in a simple visual form.
It displays each category of data in the frequency distribution.
It clarifies the trend of data better than the table.
It helps in estimating the key values at a glance.
Following is a simple example of the Matplotlib bar plot. It shows the number of
students enrolled for various courses offered at an institute.
Histogram
A histogram is a graphical representation of a grouped frequency distribution with
continuous classes. It is an area diagram and can be defined as a set of rectangles with bases
along with the intervals between class boundaries and with areas proportional to frequencies
in the corresponding classes. In such representations, all the rectangles are adjacent since the
base covers the intervals between class boundaries. The heights of rectangles are proportional
to corresponding frequencies of similar classes and for different classes, the heights will be
proportional to corresponding frequency densities.
In other words, a histogram is a diagram made of rectangles whose area is proportional to the
frequency of a variable and whose width is equal to the class interval.
Histogram Types
The histogram can be classified into different types based on the frequency distribution of the
data. There are different types of distributions, such as normal distribution, skewed
distribution, bimodal distribution, multimodal distribution, comb distribution, edge peak
distribution, dog food distributions, heart cut distribution, and so on.
The following example plots a histogram of marks obtained by students in a class. Four bins,
0-25, 26-50, 51-75, and 76-100, are defined. The histogram shows the number of students
falling in each range.
from matplotlib import pyplot as plt
import numpy as np
fig, ax = plt.subplots(1, 1)
# marks obtained by 15 students
a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])
# bin edges at 0, 25, 50, 75 and 100 give four bins
ax.hist(a, bins=[0, 25, 50, 75, 100])
ax.set_title("histogram of result")
ax.set_xticks([0, 25, 50, 75, 100])
ax.set_xlabel('marks')
ax.set_ylabel('no. of students')
plt.show()
The plot appears as shown below −
Heat Map
A heat map (or heatmap) is a graphical representation of data where values are depicted by
color. Heat maps make it easy to visualize complex data and understand it at a glance.
Types of heatmap
Heat map is really an umbrella term for different heatmapping tools: scroll maps, click maps,
and move maps.
Scroll maps show you the exact percentage of people who scroll down to any point on
the page: the redder the area, the more visitors saw it.
Click maps show you an aggregate of where visitors click their mouse on desktop
devices and tap their finger on mobile devices (in this case, they are known as touch
heatmaps). The map is color-coded to show the elements that have been clicked and
tapped the most (red, orange, yellow).
Move maps track where desktop users move their mouse as they navigate the page. The
hot spots in a move map represent where users have moved their cursor on a page.
Box Plots
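import matplotlib.pyplot as plt
import numpy as np
# Creating dataset: 200 values drawn from a normal distribution (mean 100, std 20)
np.random.seed(10)
data = np.random.normal(100, 20, 200)
fig = plt.figure(figsize=(10, 7))
# Creating plot
plt.boxplot(data)
# show plot
plt.show()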
Output:
UNIT - 3
Introduction
K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values
of new datapoints which further means that the new data point will be assigned a
value based on how closely it matches the points in the training set. We can
understand its working with the help of following steps −
Step 1 − For implementing any algorithm, we need dataset. So during the first
step of KNN, we must load the training as well as test data.
Step 2 − Next, we need to choose the value of K i.e. the nearest data points. K
can be any integer.
Step 3 − For each point in the test data do the following −
3.1 − Calculate the distance between test data and each row of training data
with the help of any of the method namely: Euclidean, Manhattan or
Hamming distance. The most commonly used method to calculate distance
is Euclidean.
3.2 − Now, based on the distance value, sort them in ascending order.
3.3 − Next, it will choose the top K rows from the sorted array.
3.4 − Now, it will assign a class to the test point based on most frequent
class of these rows.
Step 4 − End
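The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name knn_predict and the small training set are hypothetical, and Euclidean distance is used because the notes name it as the most common choice.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 3.1: Euclidean distance from the query to every training row.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3.2: sort by distance, ascending.
    distances.sort(key=lambda pair: pair[0])
    # Step 3.3: keep the top K rows.
    nearest = distances[:k]
    # Step 3.4: the most frequent class among those rows wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data with two features per point.
train_points = [(20, 35), (30, 40), (25, 30), (65, 70), (70, 60), (55, 65)]
train_labels = ["blue", "blue", "blue", "red", "red", "red"]

print(knn_predict(train_points, train_labels, query=(60, 60), k=3))
```

Choosing an odd K for a two-class problem avoids tied votes.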
Example
The following is an example to understand the concept of K and working of KNN
algorithm −
Suppose we have a dataset which can be plotted as follows −
Now, we need to classify the new data point shown as a black dot (at point 60,60) into the
blue or red class. We are assuming K = 3, i.e. it would find the three nearest data points.
This is shown in the next diagram −
We can see in the above diagram the three nearest neighbors of the data point
with the black dot. Among those three, two lie in the red class, hence the black
dot will also be assigned to the red class.
Pros
Cons
Applications of KNN
The following are some of the areas in which KNN can be applied successfully −
Banking System
KNN can be used in a banking system to predict whether an individual is fit for loan
approval, for example by checking whether that individual has characteristics similar
to those of known defaulters.
KNN algorithms can be used to find an individual’s credit rating by comparing with
the persons having similar traits.
Support vector machines (SVMs)
Introduction to SVM
Support vector machines (SVMs) are powerful yet flexible supervised machine
learning algorithms which are used for both classification and regression, although
they are generally used in classification problems. SVMs were first introduced in
the 1960s and were later refined in 1990. SVMs have their own unique way of
implementation compared to other machine learning algorithms. Lately, they
have become extremely popular because of their ability to handle multiple continuous and
categorical variables.
Working of SVM
SVM Kernels
Linear Kernel
The linear kernel is the dot product between any two observations. The formula of
the linear kernel is as below −
$$K(x, x_i) = \sum_{j} x_j \, x_{ij}$$
From the above formula, we can see that the product between two vectors, say 𝑥
and 𝑥𝑖, is the sum of the multiplication of each pair of input values.
Polynomial Kernel
SVM classifiers offer great accuracy and work well in high-dimensional spaces.
SVM classifiers basically use a subset of the training points, and as a result use very
little memory.
They have a high training time and hence in practice are not suitable for large datasets.
Another disadvantage is that SVM classifiers do not work well with overlapping
classes.
Decision Tree
In general, decision tree analysis is a predictive modelling tool that can be applied
across many areas. Decision trees can be constructed by an algorithmic approach
that can split the dataset in different ways based on different conditions. Decision
trees are among the most powerful algorithms that fall under the category of supervised
algorithms.
They can be used for both classification and regression tasks. The two main
entities of a tree are decision nodes, where the data is split, and leaves, where we
get the outcome. An example of a binary tree for predicting whether a person is fit
or unfit, given information such as age, eating habits and exercise habits,
is given below −
In the above decision tree, the questions are decision nodes and the final outcomes
are leaves. We have the following two types of decision trees.
Classification decision trees − In this kind of decision trees, the decision
variable is categorical. The above decision tree is an example of
classification decision tree.
Regression decision trees − In this kind of decision trees, the decision
variable is continuous.
Splitting criterion
The splitting criterion also tells us which branches to grow from node N
with respect to the outcomes of the chosen test. More specifically, the splitting
criterion indicates the splitting attribute and may also indicate either a split-point or
a splitting subset.
In a decision tree, the major challenge is identifying the attribute for the root
node at each level. This process is known as attribute selection. We have three
popular attribute selection measures:
1. Information Gain
2. Gini Index
3. Gain Ratio
Information Gain
Let node N represent or hold the tuples of partition D. The attribute with the highest
information gain is chosen as the splitting attribute for node N. This attribute
minimizes the information needed to classify the tuples in the resulting partitions
and reflects the least randomness or “impurity” in these partitions.
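To make the idea concrete, here is a minimal sketch that measures impurity with Shannon entropy and picks the attribute with the highest information gain. The function names, the attribute names and the four-row dataset are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Entropy of the parent partition minus the weighted entropy of the
    partitions produced by splitting on `attribute`."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    weighted = sum(
        len(part) / len(labels) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - weighted


# Hypothetical toy data: "exercise" separates the classes perfectly, "age" does not.
rows = [
    {"exercise": "yes", "age": "young"},
    {"exercise": "yes", "age": "old"},
    {"exercise": "no", "age": "young"},
    {"exercise": "no", "age": "old"},
]
labels = ["fit", "fit", "unfit", "unfit"]

best = max(["exercise", "age"], key=lambda a: information_gain(rows, labels, a))
print(best)
```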
We can understand the working of Random Forest algorithm with the help of
following steps −
Step 1 − First, start with the selection of random samples from a given
dataset.
Step 2 − Next, this algorithm will construct a decision tree for every sample.
Then it will get the prediction result from every decision tree.
Step 3 − In this step, voting will be performed for every predicted result.
Step 4 − At last, select the most voted prediction result as the final
prediction result.
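The four steps can be sketched as follows. To keep the example short, each "tree" is replaced by a one-split decision stump on a single feature; that simplification, the function names and the toy data are all assumptions made for illustration, and a real random forest also samples features at each split.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Stand-in for a decision tree: a single threshold on one feature,
    chosen to misclassify the fewest points in the sample."""
    best = None
    for threshold, _ in sample:
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if x <= threshold else right) != y for x, y in sample
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def random_forest_predict(data, x, n_trees=25, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]  # Step 1: random sample with replacement
        tree = fit_stump(sample)                   # Step 2: one tree per sample
        votes.append(tree(x))                      # Step 2: prediction from that tree
    return Counter(votes).most_common(1)[0][0]     # Steps 3 and 4: majority vote


# Hypothetical data: (feature value, class label).
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
print(random_forest_predict(data, 2.5), random_forest_predict(data, 8.5))
```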
The following diagram will illustrate its working −
Pros
CONFUSION MATRIX
It is the easiest way to measure the performance of a classification problem where
the output can be of two or more type of classes. A confusion matrix is nothing
but a table with two dimensions viz. “Actual” and “Predicted” and furthermore, both
the dimensions have “True Positives (TP)”, “True Negatives (TN)”, “False
Positives (FP)”, “False Negatives (FN)” as shown below −
The explanation of the terms associated with confusion matrix are as follows −
True Positives (TP) − It is the case when both actual class & predicted
class of data point is 1.
True Negatives (TN) − It is the case when both actual class & predicted
class of data point is 0.
False Positives (FP) − It is the case when actual class of data point is 0 &
predicted class of data point is 1.
False Negatives (FN) − It is the case when actual class of data point is 1 &
predicted class of data point is 0.
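The four counts can be computed directly from two lists of labels. The sketch below assumes binary labels coded as 0 and 1, as in the definitions above; the function name and the example labels are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for binary labels coded as 0 and 1."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    return tp, tn, fp, fn


# Hypothetical labels for eight data points.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
print(confusion_counts(actual, predicted))
```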
EXAMPLE
Metrics for Evaluating Classifier Performance
The accuracy of a classifier on a given test set is the percentage of test set tuples
that are correctly classified by the classifier. That is,
The precision and recall measures are also widely used in classification. Precision
can be thought of as a measure of exactness (i.e., what percentage of tuples
labeled as positive are actually such), whereas recall is a measure of
completeness (what percentage of positive tuples are labeled as such). If recall
seems familiar, that’s because it is the same as sensitivity (or the true positive
rate). These measures can be computed as
An alternative way to use precision and recall is to combine them into a single
measure. This is the approach of the F measure (also known as the F1 score or
F-score)
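These measures follow directly from the confusion matrix counts. The sketch below uses the standard definitions (accuracy as correct over total, precision as TP/(TP+FP), recall as TP/(TP+FN), and F1 as the harmonic mean of precision and recall); the function name and the example counts are hypothetical, and the code assumes none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of the tuples labeled positive, how many are positive
    recall = tp / (tp + fn)      # of the positive tuples, how many were labeled positive
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts.
accuracy, precision, recall, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(accuracy, precision, recall, f1)
```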
Naïve Bayes Classifier Algorithm
o Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes theorem and used for solving classification
problems.
o It is mainly used in text classification that includes a high-
dimensional training dataset.
o Naïve Bayes Classifier is one of the simple and most effective
Classification algorithms which helps in building the fast machine
learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the
basis of the probability of an object.
o Some popular applications of the Naïve Bayes algorithm are spam
filtering, sentiment analysis, and classifying articles.
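The name Naïve Bayes is made up of two words, Naïve and Bayes,
which can be described as:
Naïve: the algorithm assumes that the occurrence of each feature is independent of the
occurrence of the other features.
Bayes: the algorithm is based on Bayes' theorem.
Bayes' Theorem:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
Where,
P(A|B) is the posterior probability of hypothesis A given the observed evidence B,
P(B|A) is the likelihood of the evidence B given that hypothesis A is true,
P(A) is the prior probability of the hypothesis, and
P(B) is the marginal probability of the evidence.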
Example
Pattern is everything around in this digital world. A pattern can either be seen physically or it
can be observed mathematically by applying algorithms.
Example: The colors on the clothes, speech pattern etc. In computer science, a pattern is
represented using vector feature values.
CURSE OF DIMENSIONALITY
Handling high-dimensional data is very difficult in practice, a problem commonly known as the
curse of dimensionality. As the dimensionality of the input dataset increases, any machine learning
algorithm and model becomes more complex. As the number of features increases, the number of
samples needed to cover the feature space also increases, and so does the chance of overfitting.
A machine learning model trained on high-dimensional data with too few samples tends to
overfit, which results in poor performance.
Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.
DIMENSIONALITY REDUCTION
In machine learning classification problems, there are often too many factors on the basis of
which the final classification is done. These factors are basically variables called features.
The higher the number of features, the harder it gets to visualize the training set and then
work on it. Sometimes, most of these features are correlated, and hence redundant. This is
where dimensionality reduction algorithms come into play. Dimensionality reduction is the
process of reducing the number of random variables under consideration, by obtaining a set
of principal variables. It can be divided into feature selection and feature extraction.
Principal Component Analysis (PCA) was introduced by Karl Pearson. It works on the condition
that while the data in a higher dimensional space is mapped to data in a lower dimensional
space, the variance of the data in the lower dimensional space should be maximum.
It involves the following steps:
Construct the covariance matrix of the data.
Compute the eigenvectors of this matrix.
Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of
the variance of the original data.
Hence, we are left with fewer eigenvectors, and there might have been some data
loss in the process. But the most important variances should be retained by the remaining
eigenvectors.
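The three steps can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the function name pca and the synthetic two-feature dataset are hypothetical, and np.linalg.eigh is used because a covariance matrix is symmetric.

```python
import numpy as np


def pca(data, n_components):
    """Project `data` (rows are samples) onto its top principal components."""
    centered = data - data.mean(axis=0)
    # Step 1: covariance matrix of the features.
    cov = np.cov(centered, rowvar=False)
    # Step 2: eigenvalues and eigenvectors (eigh returns them in ascending order).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 3: keep the eigenvectors with the largest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    return centered @ components, eigenvalues[order]


# Hypothetical data: two strongly correlated features.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

projected, variances = pca(data, n_components=1)
print(projected.shape, variances)
```

The variance of the projected data along each kept component equals the corresponding eigenvalue, which is why the largest eigenvalues are the ones retained.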
Supervised learning | Unsupervised learning
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
Supervised learning model takes direct feedback to check if it is predicting correct output or not. | Unsupervised learning model does not take any feedback.
Supervised learning model predicts the output. | Unsupervised learning model finds the hidden patterns in data.
In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning can be categorized in Classification and Regression problems. | Unsupervised learning can be classified in Clustering and Associations problems.
Supervised learning model produces an accurate result. | Unsupervised learning model may give less accurate result as compared to supervised learning.
o Linear Models
o Logistic Regression
o Support Vector Machines
o Non-linear Models
o K-Nearest Neighbours
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification
PERCEPTRON
Perceptron is an algorithm used for supervised learning of binary classifiers. Binary classifiers
decide whether an input, usually represented by a vector of numbers, belongs to a specific class.
A perceptron is a single-layer neural network. It consists of four main parts: input
values, weights and bias, net sum, and an activation function.
The process begins by taking all the input values and multiplying them by their weights. Then,
all of these multiplied values are added together to create the weighted sum. The weighted sum is
then applied to the activation function, producing the perceptron's output. The activation function
plays the integral role of ensuring the output is mapped between required values such as (0,1) or
(-1,1). It is important to note that the weight of an input is indicative of the strength of a node.
Similarly, an input's bias value gives the ability to shift the activation function curve up or down.
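The forward computation described above can be sketched as follows. The step activation and the particular weights and bias (chosen so the perceptron behaves like a logical AND) are assumptions made for illustration.

```python
def perceptron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through a step activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum >= 0 else 0


# Hypothetical weights and bias that implement a logical AND of two binary inputs.
and_weights, and_bias = [1.0, 1.0], -1.5
for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron_output([a, b], and_weights, and_bias))
```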
LOGISTIC REGRESSION
Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
Logistic Regression is very similar to Linear Regression except in how the two are
used. Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
In Logistic Regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, whose output is bounded by two limiting values (0 and 1).
The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
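The "S" shaped function is the logistic (sigmoid) function. A minimal sketch for a single feature follows; the function names, the weight and bias values, and the 0.5 decision threshold are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weight, bias):
    """Probability that the class is 1 for a single feature value x."""
    return sigmoid(weight * x + bias)


def predict_class(x, weight, bias, threshold=0.5):
    """Turn the probability into a 0/1 label using a decision threshold."""
    return 1 if predict_proba(x, weight, bias) >= threshold else 0


# Hypothetical fitted parameters.
weight, bias = 1.2, -3.0
for x in (0.0, 2.5, 5.0):
    print(x, round(predict_proba(x, weight, bias), 3), predict_class(x, weight, bias))
```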
o In Logistic Regression y can be between 0 and 1 only, so for this let's divide the above
equation by (1-y):
Types of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial logistic regression, there can be only two possible types of the
dependent variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered
types of the dependent variable.
o Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of
the dependent variable, such as "Low", "Medium", or "High".
Boosting is a sequential process, where each subsequent model attempts to correct the
errors of the previous model. The succeeding models are dependent on the previous
model.
In this technique, learners are trained sequentially, with early learners fitting simple
models to the data and the data then being analyzed for errors. In other words, we fit consecutive
trees (each on a random sample) and at every step, the goal is to reduce the net error from the prior
tree.
When an input is misclassified by a hypothesis, its weight is increased so that the next
hypothesis is more likely to classify it correctly. Combining the whole set at the end
converts the weak learners into a better performing model.
Let’s understand the way boosting works in the below steps.
1. A subset is created from the original dataset.
2. Initially, all data points are given equal weights.
3. A base model is created on this subset.
4. This model is used to make predictions on the whole dataset.
5. Errors are calculated using the actual values and predicted values.
6. The observations which are incorrectly predicted are given higher weights. (Here, the
three misclassified blue-plus points will be given higher weights.)
7. Another model is created and predictions are made on the dataset. (This model tries to
correct the errors from the previous model.)
8. Similarly, multiple models are created, each correcting the errors of the previous model.
9. The final model (strong learner) is the weighted mean of all the models (weak learners).
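The reweighting step can be sketched for a single round. The notes do not name a specific boosting algorithm, so the AdaBoost-style update used here is an assumption, as are the function name and the example data; the code also assumes the weighted error lies strictly between 0 and 1.

```python
import math


def reweight(weights, actual, predicted):
    """One AdaBoost-style round: return the model's weight in the final vote
    and the updated, normalized sample weights."""
    # Weighted error: total weight of the misclassified points.
    error = sum(w for w, a, p in zip(weights, actual, predicted) if a != p)
    # Models with a lower error get a larger say in the final vote.
    alpha = 0.5 * math.log((1 - error) / error)
    # Raise the weights of misclassified points and lower the others.
    updated = [
        w * math.exp(alpha if a != p else -alpha)
        for w, a, p in zip(weights, actual, predicted)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Hypothetical round: 10 equally weighted points, the first 3 misclassified.
actual = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
predicted = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
alpha, new_weights = reweight([0.1] * 10, actual, predicted)
print(round(alpha, 4), [round(w, 4) for w in new_weights])
```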
The K-means clustering algorithm computes the centroids and iterates until it finds the optimal
centroids. It assumes that the number of clusters is already known. It is also called a flat
clustering algorithm. The number of clusters identified from the data by the algorithm is
represented by ‘K’ in K-means.
In this algorithm, the data points are assigned to a cluster in such a manner that the sum of the
squared distance between the data points and centroid would be minimum. It is to be understood
that less variation within the clusters will lead to more similar data points within same cluster.
Step 1 − First, we need to specify the number of clusters, K, that need to be generated by this
algorithm.
Step 2 − Next, randomly select K data points and assign each data point to a cluster. In simple
words, classify the data based on the number of data points.
Step 3 − Now it will compute the cluster centroids.
Step 4 − Next, keep iterating the following until we find the optimal centroids, that is, until
the assignment of data points to the clusters no longer changes.
4.1 − First, the sum of squared distances between data points and centroids would be
computed.
4.2 − Assign each data point to the cluster whose centroid is closest.
4.3 − At last, compute the centroids for the clusters by taking the average of all data
points of that cluster.
K-means follows Expectation-Maximization approach to solve the problem. The Expectation-
step is used for assigning the data points to the closest cluster and the Maximization-step is used
for computing the centroid of each cluster.
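A minimal sketch of the assignment and update loop follows. The function name, the iteration cap and the toy points are hypothetical; the initial centroids are K randomly chosen data points, and the loop stops when the assignment no longer changes.

```python
import math
import random


def k_means(points, k, n_iter=100, seed=0):
    """Return (centroids, assignment) after iterating assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K randomly selected data points
    assignment = None
    for _ in range(n_iter):
        # Expectation-step: assign each point to the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Maximization-step: each centroid becomes the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment


# Hypothetical data: two well separated groups of points.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centroids, assignment = k_means(points, k=2)
print(centroids, assignment)
```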
Applications of K-Means Clustering Algorithm
Market segmentation
Document Clustering
Image segmentation
Image compression
Customer segmentation
Analyzing the trend on dynamic data
EVALUATION METRICS :
Root mean square error or root mean square deviation is one of the most commonly used
measures for evaluating the quality of predictions. It shows how far predictions fall from
measured true values using Euclidean distance.
To compute RMSE, calculate the residual (difference between prediction and truth) for each data
point, square the residual for each data point, compute the mean of the squared residuals and take
the square root of that mean. RMSE is commonly used in supervised learning applications, as
RMSE uses and needs true measurements at each predicted data point.
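The computation is short enough to write out directly. The function name and the example values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between true values and predictions."""
    squared_residuals = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_residuals) / len(squared_residuals))


# Hypothetical true values and predictions.
print(rmse([3, 5, 7], [2, 5, 9]))
```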
Mean Absolute Error (also called L1 loss) is one of the most simple yet robust loss functions
used for regression models.
MAE takes the average of the absolute differences between the actual and the predicted
values. For a data point xi and its predicted value yi, n being the total number of data points in the
dataset, the mean absolute error is defined as:
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| x_i - y_i \right|$$
The coefficient of determination is the square of the correlation (r), thus it ranges from 0
to 1.
With linear regression, the coefficient of determination is equal to the square of the
correlation between the x and y variables.
If R2 is equal to 0, then the dependent variable cannot be predicted from the independent
variable.
If R2 is equal to 1, then the dependent variable can be predicted from the independent
variable without any error.
If R2 is between 0 and 1, then it indicates the extent to which the dependent variable is
predictable. An R2 of 0.10 means that 10 percent of the variance in the y variable is
predicted from the x variable. An R2 of 0.20 means that 20 percent of the variance in the y
variable is predicted from the x variable, and so on.
The value of R2 shows whether the model would be a good fit for the given data set.
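One common way to compute R2 is as one minus the ratio of the residual sum of squares to the total sum of squares. This form is an assumption here, since the notes define R2 through the correlation; the two agree for a least-squares linear fit with an intercept. The function name and the example values are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [2.0, 4.0, 6.0, 8.0]
print(r_squared(actual, actual))      # perfect predictions
print(r_squared(actual, [5.0] * 4))   # always predicting the mean
```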
TRAINING AND TESTING A CLASSIFIER
Training and testing is the process through which a system gets trained and becomes
able to give accurate results. Learning is the most important phase, as how
well the system performs on the data provided to it depends on which algorithms are
used on the data. The entire dataset is divided into two categories: one that is used in training
the model, i.e. the training set, and the other that is used in testing the model after training, i.e.
the testing set.
Training set:
The training set is used to build a model. It consists of the set of images that are used to train
the system. The training rules and algorithms used give relevant information on how to
associate input data with an output decision. The system is trained by applying these
algorithms on the dataset; all the relevant information is extracted from the data and
results are obtained. Generally, 80% of the data in the dataset is taken as training data.
Testing set:
Testing data is used to test the system. It is the set of data which is used to verify whether
the system is producing the correct output after being trained or not. Generally, 20% of
the data in the dataset is used for testing.
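The 80/20 division can be sketched as follows. The function name and its parameters are hypothetical; the rows are shuffled first so that the split does not depend on the original ordering.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and return (training set, testing set)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(100))
print(len(train), len(test))
```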
CROSS-VALIDATION
Cross-validation is a technique in which we train our model using the subset of the data-set
and then evaluate using the complementary subset of the data-set.
The three steps involved in cross-validation are as follows :
1. Reserve some portion of the sample data-set.
2. Train the model using the rest of the data-set.
3. Test the model using the reserved portion of the data-set.
Methods of Cross Validation
Validation
LOOCV (Leave One Out Cross Validation)
K-Fold Cross Validation
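For K-fold cross validation, the data is divided into K parts and each part is used once as the reserved portion. A minimal sketch of the index bookkeeping follows; the function name is hypothetical, and setting k equal to the number of samples gives LOOCV.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(train, test)
```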
ROC CURVE
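An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a
classification model at all classification thresholds. This curve plots two parameters: the
True Positive Rate and the False Positive Rate.
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
$$TPR = \frac{TP}{TP + FN}$$
False Positive Rate (FPR) is defined as follows:
$$FPR = \frac{FP}{FP + TN}$$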
To compute the points in an ROC curve, we could evaluate a logistic regression model many
times with different classification thresholds, but this would be inefficient. Fortunately, there's
an efficient, sorting-based algorithm that can provide this information for us, called AUC.
AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-
dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).
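The sorting-based idea can be sketched as follows: sort the examples by score, lower the threshold one example at a time, and integrate the resulting curve with the trapezoidal rule. The function names and the example data are hypothetical, and the sketch assumes all scores are distinct and that both classes are present.

```python
def roc_points(labels, scores):
    """(FPR, TPR) points obtained by sweeping the threshold over the sorted scores."""
    ranked = sorted(zip(scores, labels), reverse=True)
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels and predicted scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))
```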
(COST FUNCTIONS : same as evaluation functions)
UNIT - 4
1. MULTILAYER PERCEPTRON
The Perceptron consists of an input layer and an output layer which are fully connected. MLPs
have the same input and output layers but may have multiple hidden layers in between the
aforementioned layers, as seen below.
1. Just as with the perceptron, the inputs are pushed forward through the MLP by taking the
dot product of the input with the weights that exist between the input layer and the hidden
layer (WH). This dot product yields a value at the hidden layer.
2. MLPs utilize activation functions at each of their calculated layers. There are many
activation functions to discuss: rectified linear units (ReLU), sigmoid function,
tanh. Push the calculated output at the current layer through any of these activation
functions.
3. Once the calculated output at the hidden layer has been pushed through the activation
function, push it to the next layer in the MLP by taking the dot product with the
corresponding weights.
4. Repeat steps two and three until the output layer is reached.
5. At the output layer, the calculations will either be used for a backpropagation algorithm
that corresponds to the activation function that was selected for the MLP or a decision
will be made based on the output.
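The forward pass in steps 1 to 4 can be sketched in plain Python. The layer sizes, weights, biases and choice of activations below are hypothetical values picked so the arithmetic is easy to follow.

```python
import math


def dense(inputs, weights, biases):
    """Dot product of the inputs with each neuron's weight vector, plus its bias."""
    return [
        sum(x * w for x, w in zip(inputs, neuron_weights)) + b
        for neuron_weights, b in zip(weights, biases)
    ]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mlp_forward(inputs, layers):
    """Push the inputs through each (weights, biases, activation) layer in turn."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = [activation(z) for z in dense(activations, weights, biases)]
    return activations


# Hypothetical network: 2 inputs, 2 hidden ReLU neurons, 1 sigmoid output neuron.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0], relu),
    ([[1.0, -1.0]], [3.0], sigmoid),
]
print(mlp_forward([1.0, 2.0], layers))
```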
MLPs form the basis for all neural networks and have greatly improved the power of computers
when applied to classification and regression problems.
2. BACK PROPAGATION
Back propagation is a supervised learning algorithm for training Multi-layer Perceptrons
(Artificial Neural Networks).
Calculate the error – How far is your model output from the actual output.
Minimum Error – Check whether the error is minimized or not.
Update the parameters – If the error is huge then, update the parameters (weights and
biases). After that again check the error. Repeat the process until the error becomes
minimum.
Model is ready to make a prediction – Once the error becomes minimum, you can feed
some inputs to your model and it will produce the output.
The Backpropagation algorithm looks for the minimum value of the error function in weight
space using a technique called the delta rule or gradient descent. The weights that minimize the
error function are then considered to be a solution to the learning problem.
two inputs
two hidden neurons
two output neurons
two biases
Repeat this process for the output layer neurons, using the output from the hidden layer neurons
as inputs.
Now, we will propagate backwards. This way we will try to reduce the error by changing the
values of weights and biases.
Consider W5, we will calculate the rate of change of error w.r.t change in weight W5.
Since we are propagating backwards, first thing we need to do is, calculate the change in total
errors w.r.t the output O1 and O2.
Now, propagate further backwards and calculate the change in output O1 w.r.t to its total net
input.
How does the total net input of O1 change w.r.t. W5?
Step – 3: Putting all the values together and calculating the updated weight value
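The three rates of change described above combine through the chain rule. The following is a sketch in which $E_{total}$ denotes the total error, $out_{O1}$ and $net_{O1}$ the output and total net input of O1, and $\eta$ a learning rate; these symbol names are assumptions, since the notes do not fix a notation.

```latex
\frac{\partial E_{total}}{\partial W_5}
  = \frac{\partial E_{total}}{\partial out_{O1}}
    \cdot \frac{\partial out_{O1}}{\partial net_{O1}}
    \cdot \frac{\partial net_{O1}}{\partial W_5},
\qquad
W_5^{new} = W_5 - \eta \, \frac{\partial E_{total}}{\partial W_5}
```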
Regression involves predicting a specific value that is continuous in nature. Estimating the price
Mean Absolute Error (also called L1 loss) is one of the most simple yet robust loss functions
used for regression models.
MAE takes the average of the absolute differences between the actual and the predicted
values. For a data point xi and its predicted value yi, n being the total number of data points in the
dataset, the mean absolute error is defined as:
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| x_i - y_i \right|$$
Mean Squared Error (MSE)
Mean Squared Error (also called L2 loss) is almost every data scientist’s preference when it
comes to loss functions for regression.
Mean Squared Error is the average of the squared differences between the actual and the
predicted values. For a data point Yi and its predicted value Ŷi, where n is the total number of
data points in the dataset, the mean squared error is defined as:
$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2$$
Calculating the Mean Squared Logarithmic Error is the same as Mean Squared Error, except that
the natural logarithms of the predicted and actual values are used rather than the values themselves.
Classification problems involve predicting a discrete class output. Mail can be classified as
spam or not spam, and a person’s dietary preferences can be put in one of three categories:
vegetarian, non-vegetarian and vegan.
Binary Cross Entropy Loss
Cross-entropy loss, or log loss, measures the performance of a classification model whose output
is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability
diverges from the actual label. So predicting a probability of .012 when the actual observation
label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of
0.
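A minimal sketch of binary cross-entropy follows, using the standard definition (the negative mean of y log p + (1 - y) log(1 - p)). The function name is hypothetical, and the small eps clipping value is an assumption added to avoid taking the logarithm of zero.

```python
import math


def binary_cross_entropy(actual, predicted, eps=1e-12):
    """Mean log loss for labels in {0, 1} and predicted probabilities in [0, 1]."""
    total = 0.0
    for y, p in zip(actual, predicted):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(actual)


# A confident wrong prediction is penalized far more than a good one.
print(binary_cross_entropy([1], [0.012]))
print(binary_cross_entropy([1], [0.9]))
```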
Consider a dataset of 1000 data points. If the batch size is 1000, we can complete an epoch with
a single iteration. Similarly, if the batch size is 500, an epoch takes two iterations, and if the
batch size is 100, an epoch takes 10 iterations to complete. Simply put, for each epoch, the
required number of iterations times the batch size gives the number of data points.
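The relationship can be checked with a one-line helper. The function name is hypothetical, and rounding up with ceil is an assumption that covers the case where the batch size does not divide the dataset evenly, so the last batch is smaller.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed to pass over every data point once."""
    return math.ceil(n_samples / batch_size)


for batch_size in (1000, 500, 100):
    print(batch_size, iterations_per_epoch(1000, batch_size))
```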
We can use multiple epochs in training. In this case, the neural network is fed the same data
more than once.
Grid search
One traditional and popular way to perform hyperparameter tuning is by using an Exhaustive Grid
Search. This method tries every possible combination of each set of hyper-parameters. Using this
method, we can find the best set of values in the parameter search space. This usually uses more
computational power and takes a long time to run since this method needs to try every
combination in the grid size.
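An exhaustive grid search can be sketched without any library. The function names are hypothetical, and toy_score is an invented stand-in for whatever validation score a real model would return for a given set of hyper-parameters.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in the grid and return the best (params, score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(max_depth, n_estimators):
    # Hypothetical score that peaks at max_depth=5, n_estimators=100.
    return -((max_depth - 5) ** 2) - abs(n_estimators - 100) / 100


param_grid = {"max_depth": [3, 5, 7], "n_estimators": [50, 100, 200]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

With 3 values for each of 2 hyper-parameters the search evaluates 9 combinations; the count multiplies with every hyper-parameter added, which is where the computational cost comes from.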
Randomized Search
The main difference in RandomizedSearchCV, when compared with GridSearchCV, is that instead
of trying every possible combination, it chooses the hyperparameter sample combinations
randomly from the grid space. For this reason, there is no guarantee that we will find the best
result, as we would with Grid Search. But this search can be extremely effective in practice, as
the computational time is much lower.
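The random variant samples a fixed number of combinations instead of enumerating all of them. As before, the function names are hypothetical and toy_score is an invented stand-in for a real validation score.

```python
import random


def random_search(param_grid, score_fn, n_iter=5, seed=0):
    """Evaluate n_iter randomly chosen combinations and return the best found."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(max_depth, n_estimators):
    # Hypothetical score that peaks at max_depth=5, n_estimators=100.
    return -((max_depth - 5) ** 2) - abs(n_estimators - 100) / 100


param_grid = {"max_depth": [3, 5, 7], "n_estimators": [50, 100, 200]}
found_params, found_score = random_search(param_grid, toy_score, n_iter=5)
print(found_params, found_score)
```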
Other approaches :
Bayesian optimization
Gradient-based optimization
Evolutionary optimization
1. max_depth: int, default=None. This sets how deep each tree in the forest is allowed
to grow.
2. criterion: {"gini", "entropy"}, default="gini". Measures the quality of each split.
3. min_samples_leaf: int or float, default=1. This parameter sets the minimum
number of observations required at a leaf node, i.e. at the end of each decision tree branch.
4. n_estimators: int, default=100. This is perhaps the most important parameter. It
represents the number of trees you want to build within a random forest before calculating
the predictions.
Sentiment analysis is a machine learning text analysis technique that assigns sentiment (opinion,
feeling, or emotion) to words within a text, or an entire text, on a polarity scale
of Positive, Negative, or Neutral.
It can automatically read through thousands of pages in minutes or constantly monitor social
media for posts about you. Individual statements can then be pulled out and tagged with their
sentiment, for example as Positive. This allows companies to follow product releases and
marketing campaigns in real time, to see how customers are reacting.
Email Spam
One of the most common uses of classification, working non-stop and with little need for human
interaction, email spam classification saves us from tedious deletion tasks and sometimes even
costly phishing scams.
Email applications use the above algorithms to calculate the likelihood that an email is either not
intended for the recipient or unwanted spam. Using text analysis classification techniques, spam
emails are weeded out from the regular inbox: perhaps a recipient’s name is spelled incorrectly,
or certain scamming keywords are used.
Document Classification
Document classification is the ordering of documents into categories according to their content.
This was previously done manually, as in the library sciences or hand-ordered legal files.
Machine learning classification algorithms, however, allow this to be performed automatically.
Image Classification
Image classification assigns previously trained categories to a given image. These could be the
subject of the image, a numerical value, a theme, etc. Image classification can even use
multi-label image classifiers, which work similarly to multi-label text classifiers, to tag an
image of a stream, for example, with several labels, like “stream,” “water,” “outdoors,” etc.
b. Applications to regression
Forecasting
A top advantage of using a linear regression model in machine learning is the ability to forecast
trends and make feasible predictions. Data scientists can use these predictions and make further
deductions based on machine learning. The process is quick, efficient, and accurate, mainly
because machines process large volumes of data with minimal human intervention. Once the
algorithm is established, the process of learning becomes simplified.
Preparing Strategies
Since machine learning enables prediction, one of the biggest advantages of a linear regression
model is the ability to prepare a strategy for a given situation well in advance and to analyze
various outcomes. Meaningful information can be derived from the regression forecast, helping
companies plan strategically and make executive decisions.
c. Applications to unsupervised learning
Clustering automatically splits the dataset into groups based on their similarities.
Anomaly detection can discover unusual data points in your dataset. It is useful for
finding fraudulent transactions.
Association mining identifies sets of items which often occur together in your dataset.
Latent variable models are widely used for data preprocessing, such as reducing the number
of features in a dataset or decomposing the dataset into multiple components.
A Recurrent Neural Network works on the principle of saving the output of a particular layer and
feeding this back to the input in order to predict the output of the layer.
The nodes in different layers of the neural network are compressed to form a single layer of
recurrent neural networks. A, B, and C are the parameters of the network.
Here, “x” is the input layer, “h” is the hidden layer, and “y” is the output layer. A, B, and C are
the network parameters used to improve the output of the model. At any given time t, the current
input is a combination of input at x(t) and x(t-1). The output at any given time is fed back into
the network to improve the output.
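The feedback loop described above can be sketched with scalar values. The notes do not say which weight A, B and C each stand for, so the roles below are an assumption: A weights the input, B weights the previous hidden state, and C maps the hidden state to the output. The tanh activation is also an assumption.

```python
import math

def rnn_forward(xs, A, B, C, h0=0.0):
    # Assumed roles: A scales the input x(t), B scales the previous hidden
    # state h(t-1), and C maps the hidden state h(t) to the output y(t).
    h = h0
    ys = []
    for x in xs:
        h = math.tanh(A * x + B * h)  # new state mixes current input and memory
        ys.append(C * h)
    return ys

# Only the first input is non-zero, yet later outputs are non-zero too,
# because the hidden state carries information forward in time.
ys = rnn_forward([1.0, 0.0, 0.0], A=0.5, B=0.8, C=1.0)
```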
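Recurrent neural networks were created because there were a few issues in the feed-forward
neural network: it considers only the current input and cannot memorize previous inputs, so it
cannot handle sequential data well.
Applications of recurrent neural networks include:
Image Captioning
Time Series Prediction
Natural Language Processing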
Convolutional Neural Networks are designed to address image recognition and classification
problems. They have wide applications in image and video recognition, recommendation systems
and natural language processing.
A convolutional neural network is a feed-forward neural network, often with up to 20 or 30
layers. The power of a convolutional neural network comes from a special kind of layer called
the convolutional layer.
Convolutional neural networks contain many convolutional layers stacked on top of each other,
each one capable of recognizing more sophisticated shapes. With three or four convolutional
layers it is possible to recognize handwritten digits and with 25 layers it is possible to
distinguish human faces.
The hidden layers are typically convolutional layers followed by activation layers, some of them
followed by pooling layers.
1. Convolution
2. ReLU
3. Pooling
4. Full Connectedness (Fully Connected Layer)
Convolution Layer
This is the first step in the process of extracting valuable features from an image. A convolution
layer has several filters that perform the convolution operation. Every image is considered as a
matrix of pixel values.
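A filter sliding over the pixel matrix can be sketched in plain Python. The image and the 2x2 filter below are made-up values, and the function assumes no padding and a stride of 1. As in most deep learning libraries, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
def convolve2d(image, kernel):
    # Slide the kernel over the image and sum the element-wise products.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Illustrative image: bright left half, dark right half.
image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
kernel = [[1, -1], [1, -1]]  # responds to a bright-to-dark vertical edge
feature_map = convolve2d(image, kernel)
# feature_map == [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map is non-zero only in the column where the edge sits, which is the sense in which a filter extracts a feature.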
ReLU layer
ReLU stands for rectified linear unit. Once the feature maps are extracted, the next step is to
move them to a ReLU layer. ReLU performs an element-wise operation and sets all the negative
pixels to 0. It introduces non-linearity to the network, and the generated output is a rectified
feature map.
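A minimal sketch of the element-wise ReLU operation on a feature map stored as a list of lists (the input values are made up for illustration):

```python
def relu(feature_map):
    # Keep positive values, replace negative values with 0.
    return [[max(0, value) for value in row] for row in feature_map]

rectified = relu([[3, -1], [-2, 5]])
# rectified == [[3, 0], [0, 5]]
```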
Pooling Layer
Pooling is a down-sampling operation that reduces the dimensionality of the feature map. The
rectified feature map now goes through a pooling layer to generate a pooled feature map.
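The notes do not name a specific pooling method. The sketch below assumes max pooling with non-overlapping 2x2 windows, which is a common choice, and uses made-up input values.

```python
def max_pool(feature_map, size=2):
    # Take the maximum of each non-overlapping size x size window.
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, cols - size + 1, size)
        ]
        for i in range(0, rows - size + 1, size)
    ]

pooled = max_pool([
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 7, 5],
    [1, 0, 3, 8],
])
# pooled == [[6, 2], [2, 8]]
```

A 4x4 feature map becomes a 2x2 pooled feature map, so each dimension is halved.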
9. LONG SHORT-TERM MEMORY (LSTM)
UNIT - 5
1. RECOMMENDER SYSTEMS
Recommender systems are so commonplace now that many of us use them without even
knowing it. Because we can't possibly look through all the products or content on a website, a
recommendation system plays an important role in helping us have a better user experience.
UNDERSTANDING RELATIONSHIPS
User-Product Relationship
The user-product relationship occurs when some users have a preference towards specific
products that they need. For example, a cricket player might have a preference for cricket-related
items, thus the e-commerce website will build a user-product relation of player->cricket.
Product-Product Relationship
Product-product relationships occur when items are similar in nature, either by appearance or
description. Some examples include books or music of the same genre, dishes from the same
cuisine, or news articles from a particular event.
User-User Relationship
User-user relationships occur when some customers have similar taste with respect to a particular
product or service. Examples include mutual friends, similar backgrounds, similar age, etc.
DATA & RECOMMENDER SYSTEMS
Two kinds of data are particularly important to recommender systems: explicit and implicit ratings.
Explicit Ratings
Explicit ratings are provided by the user. They state the user’s preference directly. Examples include star
ratings, reviews, feedback, likes and following. Since users don't always rate products, explicit
ratings can be hard to get.
Implicit Ratings
Implicit ratings are collected when users interact with an item. They infer the user’s preference
from behavior and are easy to get, since users produce them simply by clicking around. Examples
include clicks, views and purchases.
Product similarity is the most useful system for suggesting products based on how much the user
would like the product. If the user is browsing or searching for a particular product, they can be
shown similar products. Users often expect to find products they want quickly and move on if
they have a hard time finding the relevant product. When the user clicks on one product we can
show another similar product, or if the user buys the product we can email the user
advertisements or coupons based on a similar product.
User Similarity (User-User Filtering)
User similarity measures how alike two users are. If two users have similar preferences for a
product, we can assume they have similar interests. It’s like a friend recommending a product.
Similarity Measures
Minkowski Distance
Manhattan Distance
Euclidean Distance
Pearson Coefficient
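The four measures listed above can be written in a few lines of Python. The rating vectors for the two users are made up for illustration. Manhattan and Euclidean distance are the Minkowski distance with p = 1 and p = 2.

```python
import math

def minkowski(u, v, p):
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

def manhattan(u, v):
    return minkowski(u, v, 1)

def euclidean(u, v):
    return minkowski(u, v, 2)

def pearson(u, v):
    # Correlation of the two vectors after centering each on its mean.
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    std_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    std_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (std_u * std_v)

# Hypothetical ratings given by two users to the same three products.
alice = [5, 3, 4]
bob = [2, 7, 4]
print(manhattan(alice, bob), euclidean(alice, bob), pearson(alice, bob))
```

A smaller distance means more similar users, whereas the Pearson coefficient ranges from -1 (opposite tastes) to 1 (identical rating pattern).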
APPROACHES TO RECOMMENDER SYSTEMS
Content-based recommendation systems use their knowledge about each product to recommend
new ones. Recommendations are based on attributes of the item. Content-based recommender
systems work well when descriptive data on the content is provided beforehand. “Similarity” is
measured against product attributes.
Suppose I watch a movie in a particular genre; I will then be recommended other movies within
that genre. The movie's attributes, like title, year of release, director and cast, are also
helpful in identifying similar movie content.
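A content-based recommender can be sketched by comparing attribute sets. The movie titles and attributes below are invented, and Jaccard similarity (shared attributes divided by all attributes) is just one possible way to measure similarity between items.

```python
# Hypothetical catalogue: each movie is described by a set of attributes.
movies = {
    "Movie A": {"action", "director_x", "actor_p"},
    "Movie B": {"action", "director_x", "actor_q"},
    "Movie C": {"comedy", "director_y", "actor_p"},
    "Movie D": {"drama", "director_z", "actor_r"},
}

def jaccard(a, b):
    # Share of attributes the two items have in common.
    return len(a & b) / len(a | b)

def recommend(watched, movies, top_n=2):
    scores = {
        title: jaccard(movies[watched], attrs)
        for title, attrs in movies.items()
        if title != watched
    }
    # Highest similarity first; ties broken alphabetically.
    return sorted(scores, key=lambda title: (-scores[title], title))[:top_n]

print(recommend("Movie A", movies))  # ['Movie B', 'Movie C']
```

No user ratings are involved: the recommendation depends only on the attributes of the items.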
COLLABORATIVE FILTERING RECOMMENDER
A collaborative filtering recommender makes suggestions based on how users rated products in the
past, not on the products themselves. It only knows how other customers rated the product.
“Similarity” is measured between users.
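A minimal sketch of user-based collaborative filtering follows. The ratings are invented. Cosine similarity over co-rated items and a similarity-weighted average are assumptions chosen for brevity, and other similarity measures would work as well.

```python
import math

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "ann": {"item1": 5, "item2": 4, "item3": 1},
    "ben": {"item1": 4, "item2": 5, "item4": 5},
    "cat": {"item1": 1, "item3": 5, "item4": 2},
}

def cosine(u, v):
    # Similarity of two users over the items both have rated.
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item, ratings):
    # Weighted average of other users' ratings, weighted by similarity.
    numerator = denominator = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        similarity = cosine(ratings[user], their_ratings)
        numerator += similarity * their_ratings[item]
        denominator += similarity
    return numerator / denominator if denominator else None

prediction = predict("ann", "item4", ratings)
```

Here ann rates like ben and unlike cat, so the predicted rating for item4 lands closer to ben's 5 than to cat's 2.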
2. IMAGE CLASSIFICATION
Classification between objects is a fairly easy task for us, but it has proved to be a complex one
for machines and therefore image classification has been an important task within the field of
computer vision. Image classification refers to the labeling of images into one of a number of
predefined classes.
The advancements in the field of autonomous driving also serve as a great example of the use of
image classification in the real world. For example, we can build an image classification model
that recognizes various objects, such as other vehicles, pedestrians, traffic lights, and
signposts on the road.
Performance evaluation
There is a collection of entities that participate in the network. Typically, these entities are
people.
There is at least one relationship between entities of the network. On Facebook or its ilk, this
relationship is called friends. Sometimes the relationship is all-or-nothing; two people are either
friends or they are not. However, in other examples of social networks, the relationship has a
degree.
There is an assumption of nonrandomness or locality: relationships tend to cluster. That is, if
entity A is related to both B and C, then there is a higher probability than average that B and C
are related.
Social networks are naturally modeled as graphs, which we sometimes refer to as a social graph.
The entities are the nodes, and an edge connects two nodes if the nodes are related by the
relationship that characterizes the network. If there is a degree associated with the relationship,
this degree is represented by labeling the edges.
The figure is an example of a tiny social network. The entities are the nodes A through G. The
relationship, which we might think of as “friends,” is represented by the edges. For instance, B is
friends with A, C, and D.
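A social graph is conveniently stored as an adjacency structure. The edge list below is an assumed example: it is chosen so that B is friends with A, C, and D as stated above, but the remaining edges are illustrative and not taken from the figure.

```python
# Assumed edge list for a tiny social network with nodes A through G.
edges = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
    ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G"),
]

def build_graph(edges):
    # Undirected graph: each edge is recorded for both endpoints.
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

graph = build_graph(edges)
friends_of_b = sorted(graph["B"])  # ['A', 'C', 'D']
```

If the relationship has a degree, the sets can be replaced by dictionaries that map each neighbor to an edge label or weight.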
Telephone Networks
Here the nodes represent phone numbers, which are really individuals. There is an edge between
two nodes if a call has been placed between those phones in some fixed period of time, such as
last month, or “ever.” The edges could be weighted by the number of calls made between these
phones during the period.
Email Networks
The nodes represent email addresses, which are again individuals. An edge represents the fact
that there was at least one email in at least one direction between the two addresses.
Alternatively, we may only place an edge if there were emails in both directions. In that way, we
avoid viewing spammers as “friends” with all their victims. Another approach is to label edges as
weak or strong. Strong edges represent communication in both directions, while weak edges
indicate that the communication was in one direction only.
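The strong/weak labeling can be sketched from a log of (sender, recipient) pairs. The addresses below are invented for illustration.

```python
# Hypothetical email log: (sender, recipient).
emails = [
    ("ann", "ben"), ("ben", "ann"),    # two-way communication
    ("spam", "ann"), ("spam", "ben"),  # one-way only
    ("ann", "cat"),
]

def classify_edges(emails):
    sent = set(emails)
    edges = {}
    for sender, recipient in sent:
        if sender == recipient:
            continue
        pair = tuple(sorted((sender, recipient)))
        # Strong if mail flowed in both directions, weak otherwise.
        edges[pair] = "strong" if (recipient, sender) in sent else "weak"
    return edges

edges = classify_edges(emails)
# edges[("ann", "ben")] == "strong"; the spammer only has weak edges.
```

Keeping only the strong edges gives the stricter graph in which an edge requires emails in both directions.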
Collaboration Networks
Nodes represent individuals who have published research papers. There is an edge between two
individuals who published one or more papers jointly. Optionally, we can label edges by the
number of joint publications. The communities in this network are authors working on a
particular topic.
An alternative view of the same data is as a graph in which the nodes are papers. Two papers are
connected by an edge if they have at least one author in common. Now, we form communities
that are collections of papers on the same topic.
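Both views can be built from the same data. The paper identifiers and author names below are made up. The first function builds the author graph with edges labeled by the number of joint publications, and the second builds the graph whose nodes are papers.

```python
from collections import Counter
from itertools import combinations

# Hypothetical publication data: paper -> list of authors.
papers = {
    "P1": ["Ann", "Ben"],
    "P2": ["Ann", "Ben", "Cat"],
    "P3": ["Dan"],
    "P4": ["Cat", "Dan"],
}

def author_graph(papers):
    # Edge between two authors, weighted by their number of joint papers.
    weights = Counter()
    for authors in papers.values():
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights

def paper_graph(papers):
    # Edge between two papers that share at least one author.
    return {
        (p, q)
        for p, q in combinations(sorted(papers), 2)
        if set(papers[p]) & set(papers[q])
    }

print(author_graph(papers)[("Ann", "Ben")])  # 2
print(sorted(paper_graph(papers)))
```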
Clustering of Social-Network Graphs
If we were to apply standard clustering techniques to a social-network graph, our first step would
be to define a distance measure. We would then combine nodes that are nearby, and repeating the
same process forms clusters.
Suppose the graph contains two communities. Traditional clustering is likely to put two nodes
with a small distance between them in the same cluster. Because social-network graphs have edges
that cross communities, severe merging of communities is likely.
Another approach
In order to exploit the betweenness of edges, we need to calculate the number of shortest paths
going through each edge. We shall describe a method called the Girvan-Newman (GN)
Algorithm, which visits each node X once and computes the number of shortest paths from X to
each of the other nodes that go through each of the edges.
The algorithm begins by performing a breadth-first search (BFS) of the graph, starting at the
node X.
The level of each node in the BFS presentation is the length of the shortest path from X to that
node.
The second step of the GN algorithm is to label each node by the number of shortest paths that
reach it from the root. Start by labeling the root 1.
The third and final step is to calculate the credit value of each node. The credit of a node is
calculated using a shortcut method: find the total number of nodes that are reached from the root
node by way of the current node.
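The three steps for a single root X can be sketched as follows. The graph is an assumed example, and the credit rule is the usual Girvan-Newman formulation rather than the shortcut wording above: every node starts with credit 1 and passes its credit up to its parents in proportion to their shortest-path counts.

```python
from collections import deque

# Assumed example graph (undirected), stored as node -> set of neighbors.
graph = {
    "A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"},
    "D": {"B", "E", "F", "G"}, "E": {"D", "F"},
    "F": {"D", "E", "G"}, "G": {"D", "F"},
}

def credits_from_root(graph, root):
    # Steps 1 and 2: BFS gives each node its level and shortest-path count.
    level, paths, parents, order = {root: 0}, {root: 1}, {root: []}, []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nb in sorted(graph[node]):
            if nb not in level:
                level[nb] = level[node] + 1
                paths[nb] = 0
                parents[nb] = []
                queue.append(nb)
            if level[nb] == level[node] + 1:
                paths[nb] += paths[node]
                parents[nb].append(node)
    # Step 3: work upward from the deepest nodes, sharing out the credit.
    node_credit = {node: 1.0 for node in order}
    edge_credit = {}
    for node in reversed(order):
        for parent in parents[node]:
            share = node_credit[node] * paths[parent] / paths[node]
            node_credit[parent] += share
            edge_credit[tuple(sorted((parent, node)))] = share
    return level, paths, edge_credit

level, paths, edge_credit = credits_from_root(graph, "E")
# edge_credit[("D", "E")] == 4.5 and edge_credit[("E", "F")] == 1.5
```

Repeating this for every root, summing the edge credits and dividing by 2 gives the betweenness of each edge.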
Bottom-up approach:
Keep adding edges (among existing ones), starting from the lowest betweenness. Gradually join
small components to build large connected components.
Top-down approach:
Start from all existing edges. The graph may look like one big component.
Keep removing edges, starting from the highest betweenness.
Gradually split large components to arrive at communities.
Repeat the process until the desired number of clusters is formed.
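The top-down procedure can be sketched end to end. The graph is an assumed example, betweenness is recomputed after every removal, and function names such as top_down_communities are illustrative only.

```python
from collections import deque

def edge_betweenness(graph):
    total = {}
    for root in sorted(graph):
        # BFS from this root: levels and shortest-path counts.
        level, paths, parents, order = {root: 0}, {root: 1}, {root: []}, []
        queue = deque([root])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nb in sorted(graph[node]):
                if nb not in level:
                    level[nb], paths[nb], parents[nb] = level[node] + 1, 0, []
                    queue.append(nb)
                if level[nb] == level[node] + 1:
                    paths[nb] += paths[node]
                    parents[nb].append(node)
        # Share credit upward and accumulate it on the edges.
        credit = {node: 1.0 for node in order}
        for node in reversed(order):
            for parent in parents[node]:
                share = credit[node] * paths[parent] / paths[node]
                credit[parent] += share
                edge = tuple(sorted((parent, node)))
                total[edge] = total.get(edge, 0.0) + share
    # Every shortest path was counted once from each of its two endpoints.
    return {edge: value / 2 for edge, value in total.items()}

def components(graph):
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(graph[node] - group)
        seen |= group
        result.append(group)
    return result

def top_down_communities(graph, k):
    g = {node: set(nbs) for node, nbs in graph.items()}  # work on a copy
    while len(components(g)) < k:
        scores = edge_betweenness(g)
        if not scores:
            break  # no edges left to remove
        u, v = max(sorted(scores), key=scores.get)  # highest betweenness
        g[u].discard(v)
        g[v].discard(u)
    return components(g)

# Assumed example graph (undirected), stored as node -> set of neighbors.
graph = {
    "A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"},
    "D": {"B", "E", "F", "G"}, "E": {"D", "F"},
    "F": {"D", "E", "G"}, "G": {"D", "F"},
}
communities = top_down_communities(graph, 2)
```

In this example the edge B-D carries every shortest path between the two groups, so it is removed first and the graph splits into {A, B, C} and {D, E, F, G}.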