Data Science With Python Notes

The document discusses measures of central tendency, including mode, median, and mean, explaining their definitions and applications with examples. It also covers measures of dispersion, random variables, and various probability distributions such as Bernoulli, Binomial, Poisson, and Exponential distributions. Additionally, it touches on chi-square distribution, normal distribution, and the concept of white noise in time series forecasting.

Uploaded by

gvsm.hpcl

UNIT -1

MEASURES OF CENTRAL TENDENCY

A measure of central tendency (also referred to as a measure of centre or central location) is
a summary measure that attempts to describe a whole set of data with a single value that
represents the middle or centre of its distribution.

There are three main measures of central tendency: the mode, the median and the mean.

The mode is the most commonly occurring value in a distribution.

Consider this dataset showing the retirement age of 11 people, in whole years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

This table shows a simple frequency distribution of the retirement age data.

Age Frequency
54 3
55 1
56 1
57 2
58 2
60 2

The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.

The mode has an advantage over the median and the mean as it can be found for both numerical
and categorical (non-numerical) data.

In some cases, particularly where the data are continuous, the distribution may have no mode at
all (i.e. if all values are different).

The median is the middle value in a distribution when the values are arranged in ascending or
descending order.

The median divides the distribution in half (there are 50% of observations on either side of the
median value). In a distribution with an odd number of observations, the median value is the
middle value.

Looking at the retirement age distribution (which has 11 observations), the median is the middle
value, which is 57 years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

When the distribution has an even number of observations, the median value is the mean of the
two middle values. In the following distribution, the two middle values are 56 and 57, therefore
the median equals 56.5 years:

52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

The median is less affected by outliers and skewed data than the mean, and is usually the
preferred measure of central tendency when the distribution is not symmetrical.

The median cannot be identified for categorical nominal data, as it cannot be logically ordered.

The mean is the sum of the value of each observation in a dataset divided by the number of
observations. This is also known as the arithmetic average.

Looking at the retirement age distribution again:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

The mean is calculated by adding together all the values
(54+54+54+55+56+57+57+58+58+60+60 = 623) and dividing by the number of observations
(11), which equals 56.6 years.

The mean can be used for both continuous and discrete numeric data.

The mean cannot be calculated for categorical data, as the values cannot be summed.
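
The three measures can be checked quickly in Python. The following is a minimal sketch that uses only the standard library (collections and statistics) on the retirement age data from the examples above; the variable names are illustrative.

```python
from collections import Counter
from statistics import mean, median, multimode

# Retirement ages from the example above.
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

frequency = Counter(ages)      # frequency table: age -> count
modes = multimode(ages)        # list of the most commonly occurring value(s)
middle = median(ages)          # middle value of the sorted data
average = mean(ages)           # sum of the values / number of values

print("Frequency:", dict(frequency))
print("Mode(s):", modes)
print("Median:", middle)
print("Mean:", round(average, 1))

# With an even number of observations the median is the mean of the two middle values.
ages_even = [52] + ages
median_even = median(ages_even)
print("Median (12 observations):", median_even)
```

multimode returns a list, so a distribution with several equally common values reports all of them.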

Harmonic Mean

A simple way to define the harmonic mean is as the reciprocal of the arithmetic mean of the
reciprocals of the observations. The most important criterion for it is that none of the observations
should be zero.
A harmonic mean is used in averaging ratios. The most common examples of ratios are those of
speed and time, cost and unit of material, work and time, etc. The harmonic mean (H.M.) of n
observations x1, x2, … , xn is

H.M. = n / (1/x1 + 1/x2 + … + 1/xn)

Geometric Mean

A geometric mean is a mean or average which shows the central tendency of a set of numbers by
using the product of their values. For a set of n observations, the geometric mean is the nth root of
their product. The geometric mean G.M. for a set of numbers x1, x2, … , xn is given as

G.M. = (x1 · x2 · … · xn)^(1/n)
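
Both means are available in Python's standard statistics module (geometric_mean requires Python 3.8 or later). The sketch below uses two small hypothetical data sets, chosen only for illustration: two speeds over equal distances, and two yearly growth factors.

```python
from statistics import geometric_mean, harmonic_mean

# Hypothetical example: the same distance is covered at 40 km/h and then at 60 km/h.
speeds = [40, 60]
average_speed = harmonic_mean(speeds)             # 2 / (1/40 + 1/60)

# Hypothetical example: growth factors for two consecutive years.
growth_factors = [2, 8]
average_growth = geometric_mean(growth_factors)   # square root of 2 * 8

print("Harmonic mean of speeds:", average_speed)
print("Geometric mean of growth factors:", round(average_growth, 6))
```

The harmonic mean (48 km/h) is lower than the arithmetic mean (50 km/h) because more time is spent travelling at the slower speed.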

MEASURES OF DISPERSION
In statistics, measures of dispersion help to interpret the variability of data, i.e. to know how
homogeneous or heterogeneous the data is. In simple terms, they show how squeezed or
scattered the variable is.
Types of Measures of Dispersion
There are two main types of dispersion measures in statistics:

• Absolute Measure of Dispersion
• Relative Measure of Dispersion

Absolute Measure of Dispersion

An absolute measure of dispersion expresses variation in the same units as the observations,
often as an average of deviations, as with the standard deviation or the mean deviation. Absolute
measures include the range, standard deviation, quartile deviation, etc.
The types of absolute measures of dispersion are:

1. Range: It is simply the difference between the maximum value and the minimum value
given in a data set. Example: 1, 3, 5, 6, 7 => Range = 7 − 1 = 6
2. Variance: Subtract the mean from each value in the set, square each difference, add the
squares, and finally divide the sum by the total number of values in the data set.
Variance (σ²) = ∑(X − μ)² / N
3. Standard Deviation: The square root of the variance is known as the standard deviation,
i.e. S.D. (σ) = √(σ²).
4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of numbers
into quarters. The quartile deviation is half of the distance between the third and the first
quartile.
5. Mean and Mean Deviation: The average of numbers is known as the mean, and the
arithmetic mean of the absolute deviations of the observations from a measure of central
tendency is known as the mean deviation (also called mean absolute deviation).
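
The absolute measures listed above can be computed with the standard statistics module. This is a minimal sketch on the data set from the range example; note that quartiles depend on the interpolation convention, and the "inclusive" method used here is only one common choice.

```python
from statistics import mean, pstdev, pvariance, quantiles

data = [1, 3, 5, 6, 7]   # the data set used in the range example above

data_range = max(data) - min(data)
mu = mean(data)
variance = pvariance(data)   # population variance: sum((x - mu)^2) / N
std_dev = pstdev(data)       # square root of the population variance

# Quartiles: values that divide the sorted data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
quartile_deviation = (q3 - q1) / 2

# Mean (absolute) deviation about the mean.
mean_deviation = sum(abs(x - mu) for x in data) / len(data)

print("Range:", data_range)
print("Variance:", round(variance, 2))
print("Standard deviation:", round(std_dev, 3))
print("Quartile deviation:", quartile_deviation)
print("Mean deviation:", round(mean_deviation, 2))
```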

Relative Measure of Dispersion

The relative measures of dispersion are used to compare the distribution of two or more data sets.
This measure compares values without units. Common relative dispersion methods include:

1. Co-efficient of Range
2. Co-efficient of Variation
3. Co-efficient of Standard Deviation
4. Co-efficient of Quartile Deviation
5. Co-efficient of Mean Deviation

RANDOM VARIABLES

A random variable is a rule that assigns a numerical value to each outcome in a sample space.
Random variables may be either discrete or continuous. A random variable is said to be discrete
if it assumes only specified values in an interval. Otherwise, it is continuous. We generally
denote the random variables with capital letters such as X and Y.

As a function, a random variable is required to be measurable, which allows probabilities to be
assigned to sets of its potential values.

Types of Random Variable

• Discrete Random Variable
• Continuous Random Variable

Discrete Random Variable

A discrete random variable can take only a finite or countably infinite number of distinct values,
such as 0, 1, 2, 3, 4, …

Examples of discrete random variables include:

• The number of eggs that a hen lays in a given day (it can’t be 2.3)
• The number of people going to a given soccer match
• The number of students that come to class on a given day

Continuous Random Variables

Continuous random variables, on the other hand, take on values that vary continuously within
one or more real intervals, and have a cumulative distribution function (CDF) that is absolutely
continuous. As a result, the random variable has an uncountably infinite number of possible
values.

DISCRETE PROBABILITY DISTRIBUTIONS

A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure),
and a single trial. So the random variable X which has a Bernoulli distribution can take value 1
with the probability of success, say p, and the value 0 with the probability of failure, say q or 1-p.

The probability mass function is given by

P(X = x) = p^x (1 − p)^(1 − x), for x ∈ {0, 1}

The probabilities of success and failure need not be equally likely, like the result of a fight
between me and the Hulk. He is pretty much certain to win. So in this case the probability of my
success is 0.15 while the probability of my failure is 0.85.

Here, the probability of success (p) is not the same as the probability of failure. So, the chart below
shows the Bernoulli distribution of our fight.

The expected value of a random variable X from a Bernoulli distribution is found as follows:

E(X) = 1·p + 0·(1 − p) = p

The variance of a random variable from a Bernoulli distribution is:

V(X) = E(X²) – [E(X)]² = p – p² = p(1-p)
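
The expected value and variance above can be verified numerically. The following is a minimal sketch with p = 0.15 from the fight example; the helper bernoulli_pmf and the arbitrary random seed are illustrative assumptions, not part of the original notes.

```python
import random


def bernoulli_pmf(x, p):
    """P(X = x) for a Bernoulli variable: p if x == 1, 1 - p if x == 0."""
    if x not in (0, 1):
        return 0.0
    return p if x == 1 else 1 - p


p = 0.15   # probability of success from the fight example

expected_value = sum(x * bernoulli_pmf(x, p) for x in (0, 1))      # E(X) = p
second_moment = sum(x**2 * bernoulli_pmf(x, p) for x in (0, 1))    # E(X^2) = p
variance = second_moment - expected_value**2                       # p(1 - p)

# Simulation: the share of successes in many trials should be close to p.
rng = random.Random(0)   # seed chosen arbitrarily, for reproducibility only
trials = [1 if rng.random() < p else 0 for _ in range(100_000)]
success_share = sum(trials) / len(trials)

print("E(X):", expected_value)
print("V(X):", round(variance, 4))
print("Simulated share of successes:", round(success_share, 3))
```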

Binomial Distribution

Suppose that you won the toss today, which counts as a successful event. You toss again but this
time you lose. Winning a toss today does not mean that you will win the toss tomorrow.
Assign a random variable, say X, to the number of times you win the toss. What are the possible
values of X? It can be any whole number from 0 up to the number of times you toss the
coin.

There are only two possible outcomes: heads denoting success and tails denoting failure.
For a fair coin, the probability of getting a head is p = 0.5 and the probability of failure can be
easily computed as q = 1 − p = 0.5.

A distribution where only two outcomes are possible, such as success or failure, gain or loss, win
or lose and where the probability of success and failure is same for all the trials is called a
Binomial Distribution.

Each trial is independent since the outcome of the previous toss doesn’t determine or affect the
outcome of the current toss. An experiment with only two possible outcomes repeated n number
of times is called binomial. The parameters of a binomial distribution are n and p where n is the
total number of trials and p is the probability of success in each trial.

On the basis of the above explanation, the properties of a Binomial Distribution are

1. Each trial is independent.
2. There are only two possible outcomes in a trial: either a success or a failure.
3. A total of n identical trials is conducted.
4. The probability of success and failure is the same for all trials. (Trials are identical.)
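
Under these four properties, the probability of exactly x successes in n trials is the number of ways to choose which x trials succeed, multiplied by p^x (1 − p)^(n − x). The sketch below computes these probabilities with math.comb (Python 3.8 or later) for a hypothetical example of 10 tosses of a fair coin; the function name binomial_pmf is illustrative.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 10, 0.5   # hypothetical example: 10 tosses of a fair coin

pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
mean_heads = sum(x * prob for x, prob in enumerate(pmf))

print("P(X = 5):", round(pmf[5], 4))
print("Total probability:", round(sum(pmf), 10))
print("Mean number of heads:", round(mean_heads, 10))
```

The probabilities over all possible values of X sum to 1, and the mean equals n times p.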

The mathematical representation of the binomial distribution is given by:

P(X = x) = [n! / (x! (n − x)!)] · p^x · q^(n − x), for x = 0, 1, 2, … , n

In a binomial distribution graph where the probability of success does not equal the probability of
failure, the bars are skewed to one side rather than symmetric about the centre.

Poisson Distribution

Suppose you work at a call center. Approximately how many calls do you get in a day? It can be
any number. The total number of calls at a call center in a day can be modeled by a Poisson
distribution. Some more examples are:

1. The number of emergency calls recorded at a hospital in a day.
2. The number of thefts reported in an area on a day.
3. The number of customers arriving at a salon in an hour.
4. The number of suicides reported in a particular city.
5. The number of printing errors on each page of a book.

You can now think of many examples following the same course. Poisson Distribution is
applicable in situations where events occur at random points of time and space wherein our
interest lies only in the number of occurrences of the event.

A distribution is called Poisson distribution when the following assumptions are valid:

1. Any successful event should not influence the outcome of another successful event.
2. The probability of success in an interval must be the same for all intervals of equal length.
3. The probability of success in an interval approaches zero as the interval becomes smaller.

If a process satisfies the above assumptions, the number of events follows a Poisson distribution.
Some notations used in the Poisson distribution are:

• λ is the rate at which an event occurs,
• t is the length of a time interval,
• and X is the number of events in that time interval.

Here, X is called a Poisson Random Variable and the probability distribution of X is called
Poisson distribution.

Let µ denote the mean number of events in an interval of length t. Then, µ = λ*t.
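
As a numeric illustration of µ = λ·t, the sketch below evaluates the standard Poisson probability e^(−µ) µ^x / x! for a hypothetical call center that receives 3 calls per hour on average and is observed for 2 hours. The rate, the interval and the function name poisson_pmf are assumptions chosen for the example.

```python
from math import exp, factorial


def poisson_pmf(x, mu):
    """Probability of exactly x events when the mean number of events is mu."""
    return exp(-mu) * mu**x / factorial(x)


rate_per_hour = 3           # hypothetical λ: 3 calls per hour
hours = 2                   # hypothetical t: a 2-hour interval
mu = rate_per_hour * hours  # mean number of calls in the interval

p_no_calls = poisson_pmf(0, mu)
p_at_most_6 = sum(poisson_pmf(x, mu) for x in range(7))

print("Mean number of calls:", mu)
print("P(no calls in 2 hours):", round(p_no_calls, 5))
print("P(at most 6 calls in 2 hours):", round(p_at_most_6, 4))
```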

The PMF of X following a Poisson distribution is given by:

P(X = x) = e^(−µ) µ^x / x!, for x = 0, 1, 2, …

The mean µ is the parameter of this distribution. µ is also defined as λ times the length of that
interval. The graph of a Poisson distribution is shown below:

CONTINUOUS PROBABILITY DISTRIBUTIONS

Exponential Distribution

Consider the call center example. What about the interval of time between the calls? Here, the
exponential distribution comes to our rescue. The exponential distribution models the interval of
time between the calls.

Other examples are:

1. Length of time between metro arrivals
2. Length of time between arrivals at a gas station
3. The life of an air conditioner

The exponential distribution is widely used for survival analysis. From the expected life of a machine
to the expected life of a human, it is often used as a simple model of lifetimes.

A random variable X is said to have an exponential distribution if its PDF is:

f(x) = λe^(−λx) for x ≥ 0, and f(x) = 0 for x < 0

where the parameter λ > 0 is also called the rate.

For survival analysis, λ is called the failure rate of a device at any time t, given that it has
survived up to t.

Mean and Variance of a random variable X following an exponential distribution:

Mean -> E(X) = 1/λ

Variance -> Var(X) = (1/λ)²
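
The mean 1/λ and variance (1/λ)² can be checked by simulation. The sketch below draws waiting times with random.expovariate from the standard library; the rate of 0.5 (on average one call every 2 minutes), the seed and the function name exponential_pdf are hypothetical choices for illustration.

```python
import random
from math import exp


def exponential_pdf(x, rate):
    """Density of the exponential distribution: rate * e^(-rate * x) for x >= 0, else 0."""
    return rate * exp(-rate * x) if x >= 0 else 0.0


rate = 0.5   # hypothetical λ: on average one call every 2 minutes

rng = random.Random(1)   # seed chosen arbitrarily, for reproducibility only
waits = [rng.expovariate(rate) for _ in range(100_000)]

sample_mean = sum(waits) / len(waits)
sample_var = sum((w - sample_mean) ** 2 for w in waits) / len(waits)

print("Theoretical mean:", 1 / rate, "simulated:", round(sample_mean, 2))
print("Theoretical variance:", (1 / rate) ** 2, "simulated:", round(sample_var, 2))
print("Density at x = 0 equals the rate:", exponential_pdf(0, rate))
```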

Also, the greater the rate, the faster the curve drops, and the lower the rate, the flatter the curve. This
is explained better with the graph shown below.

CHI-SQUARE DISTRIBUTION

A random variable χ² follows a chi-square distribution if it can be written as a sum of squared
independent standard normal variables: χ² = Z1² + Z2² + … + Zk², where the number of terms k is
the degrees of freedom.

Degrees of freedom:

Degrees of freedom refers to the maximum number of logically independent values, which have
the freedom to vary. In simple words, it can be defined as the total number of observations minus
the number of independent constraints imposed on the observations.

In the above figure, we can see the chi-square distribution for different degrees of freedom. We
can also observe that as the degrees of freedom increase, the chi-square distribution approaches a
normal distribution.

Chi-Square Test for Feature Selection

A chi-square test is used in statistics to test the independence of two events. Given the data of two
variables, we can get the observed count O and the expected count E. The chi-square statistic
measures how much the expected counts E and the observed counts O deviate from each other:
χ² = ∑(O − E)² / E.

Let’s consider a scenario where we need to determine the relationship between an independent
categorical feature (predictor) and a dependent categorical feature (response). In feature selection,
we aim to select the features which are highly dependent on the response.

When two features are independent, the observed count is close to the expected count, and thus we
will have a smaller chi-square value. A high chi-square value therefore indicates that the hypothesis
of independence is incorrect. In simple words, the higher the chi-square value, the more dependent
the feature is on the response, and the more suitable it is for selection for model training.

Steps to perform the Chi-Square Test:

1. Define Hypothesis.
2. Find the expected values.
3. Calculate the Chi-Square statistic.
4. Accept or Reject the Null Hypothesis.
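
The steps above can be sketched by hand for a small contingency table. The observed counts below are hypothetical, and 3.841 is the chi-square critical value for 1 degree of freedom at an assumed 5% significance level.

```python
# Hypothetical 2x2 contingency table of observed counts:
# rows = feature value (A, B), columns = response (yes, no).
observed = [
    [10, 20],
    [30, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Step 2: expected count under independence = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Step 3: chi-square statistic = sum of (O - E)^2 / E over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Step 4: compare with the critical value for (rows - 1) * (columns - 1) degrees of freedom.
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 3.841   # 1 degree of freedom, 5% significance level
reject_independence = chi_square > critical_value

print("Expected counts:", expected)
print("Chi-square statistic:", round(chi_square, 4))
print("Degrees of freedom:", degrees_of_freedom)
print("Reject independence:", reject_independence)
```

Here the statistic is far below the critical value, so this hypothetical feature would not be selected.
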

Normal Distribution

The normal distribution describes the behavior of many naturally occurring quantities (perhaps that
is why it is called a “normal” distribution). The sum of a large number of small, independent random
variables often turns out to be approximately normally distributed, contributing to its widespread
application. A normal distribution has the following characteristics:

1. The mean, median and mode of the distribution coincide.
2. The curve of the distribution is bell-shaped and symmetrical about the line x=μ.
3. The total area under the curve is 1.
4. Exactly half of the values are to the left of the center and the other half to the right.

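A normal distribution is continuous, whereas a binomial distribution is discrete. However, as the
number of trials grows large, the shape of the binomial distribution becomes quite similar to that
of a normal distribution.

The PDF of a random variable X following a normal distribution is given by:

f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²)), for −∞ < x < ∞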

The mean and variance of a random variable X which is said to be normally distributed are given
by:

Mean -> E(X) = µ

Variance -> Var(X) = σ^2

Here, µ (mean) and σ (standard deviation) are the parameters.
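
Python's standard library provides statistics.NormalDist (Python 3.8 or later) for working with these parameters. The sketch below uses a hypothetical variable, heights with µ = 170 cm and σ = 10 cm, purely for illustration.

```python
from statistics import NormalDist

heights = NormalDist(mu=170, sigma=10)   # hypothetical heights in cm

density_at_mean = heights.pdf(170)       # the PDF peaks at x = mu
share_below_mean = heights.cdf(170)      # half of the area lies to the left of the mean
share_within_one_sd = heights.cdf(180) - heights.cdf(160)

standard = NormalDist(0, 1)              # standard normal: mean 0, standard deviation 1

print("Mean:", heights.mean, "Variance:", heights.variance)
print("Share below the mean:", share_below_mean)
print("Share within one standard deviation:", round(share_within_one_sd, 4))
print("Standard normal density at 0:", round(standard.pdf(0), 4))
```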

The graph of a random variable X ~ N (µ, σ) is shown below.

A standard normal distribution is defined as the distribution with mean 0 and standard deviation
1. For such a case, the PDF becomes:

f(x) = (1 / √(2π)) · e^(−x² / 2), for −∞ < x < ∞

WHITE-NOISE PROCESS

White noise is an important concept in time series forecasting.

If a time series is white noise, it is a sequence of random numbers and cannot be predicted. If the
series of forecast errors is not white noise, it suggests improvements could be made to the
predictive model.

It is important for two main reasons:

1. Predictability: If your time series is white noise, then, by definition, it is random. You cannot
reasonably model it and make predictions.
2. Model Diagnostics: The series of errors from a time series forecast model should ideally be
white noise.
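
One simple check for white noise is that the series shows no correlation with its own past values. The following is a minimal sketch under stated assumptions: the noise is simulated Gaussian noise, the autocorrelation helper is written for illustration, and a random walk is included only as a contrasting series that is not white noise.

```python
import random


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return numer / denom


rng = random.Random(42)   # seed chosen arbitrarily, for reproducibility only
white_noise = [rng.gauss(0, 1) for _ in range(5000)]

# A random walk (cumulative sum of the noise) is predictable from its last value.
random_walk = []
level = 0.0
for shock in white_noise:
    level += shock
    random_walk.append(level)

print("Lag-1 autocorrelation, white noise:", round(autocorrelation(white_noise, 1), 3))
print("Lag-1 autocorrelation, random walk:", round(autocorrelation(random_walk, 1), 3))
```

The same check can be applied to the errors of a forecast model: autocorrelations close to zero are consistent with white noise.
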

VARIANCE

In statistics, variance refers to the spread of a data set. It’s a measurement used to identify how
far each number in the data set is from the mean.

While performing market research, variance is particularly useful when calculating probabilities
of future events. Variance summarizes how widely the possible values of a random variable are
spread around its expected value.

A variance of zero means that all of the values within a data set are identical, while all
variances that are not equal to zero are positive numbers.

The larger the variance, the more spread in the data set.

A large variance means that the numbers in a set are far from the mean and each other. A small
variance means that the numbers are closer together in value.

How to Calculate Variance

Variance is calculated by taking the differences between each number in a data set and the mean,
squaring those differences to give them positive value, and dividing the sum of the resulting
squares by the number of values in the set.

The formula for variance is as follows:

σ² = ∑(X − µ)² / N

In this formula, X represents an individual data point, µ represents the mean of the data points,
and N represents the total number of data points.

Note that when calculating a sample variance in order to estimate a population variance, the
denominator of the variance equation becomes N – 1. This removes bias from the estimation, as
it corrects the tendency of the sample variance to underestimate the population variance.
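
The difference between the two denominators can be seen with the standard statistics module, where pvariance divides by N and variance divides by N – 1. The data set below is hypothetical and chosen so that the mean is exactly 5.

```python
from statistics import mean, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data set with mean 5

mu = mean(data)
squared_deviations = [(x - mu) ** 2 for x in data]

population_variance = pvariance(data)   # sum of squared deviations / N
sample_variance = variance(data)        # sum of squared deviations / (N - 1)

print("Sum of squared deviations:", sum(squared_deviations))
print("Population variance (divide by N):", population_variance)
print("Sample variance (divide by N - 1):", round(sample_variance, 3))
```
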

An Advantage of Variance

One of the primary advantages of variance is that it treats all deviations from the mean of the
data set in the same way, regardless of direction.

Because the deviations are squared, positive and negative deviations cannot cancel each other out
and sum to zero, which would give the appearance that there was no variability in the data set at all.

CORRELATION COEFFICIENT

The correlation coefficient is the term used to refer to the resulting correlation measurement. It
will always maintain a value between one and negative one.

When the correlation coefficient is one, the variables under examination have a perfect positive
correlation. In other words, when one moves, so does the other in the same direction,
proportionally.

If the correlation coefficient is less than one, but still greater than zero, it indicates a less than
perfect positive correlation. The closer the correlation coefficient gets to one, the stronger the
correlation between the two variables.

When the correlation coefficient is zero, it means that there is no identifiable linear relationship
between the variables. If one variable moves, it’s impossible to make linear predictions about the
movement of the other variable.

If the correlation coefficient is negative one, this means that the variables are perfectly
negatively or inversely correlated. If one variable increases, the other will decrease at the same
proportion. The variables will move in opposite directions from each other.

If the correlation coefficient is greater than negative one but less than zero, it indicates that there
is an imperfect negative correlation. As the coefficient approaches negative one, the negative
correlation grows stronger.

COVARIANCE

Covariance signifies the direction of the linear relationship between the two variables. By
direction we mean if the variables are directly proportional or inversely proportional to each
other. (Increasing the value of one variable might have a positive or a negative impact on the
value of the other variable).

The value of covariance can be any real number, from negative infinity to positive infinity. Also,
it’s important to mention that covariance only measures how two variables change together, not
the dependency of one variable on another one.

The value of covariance between two variables X and Y is obtained by summing the products of
their deviations from their respective means and dividing by the number of observations N:

Cov(X, Y) = ∑(X − µX)(Y − µY) / N

The upper and lower limits for the covariance depend on the variances of the variables
involved. These variances, in turn, can vary with the scaling of the variables. Even a change
in the units of measurement can change the covariance. Thus, covariance is only useful to
find the direction of the relationship between two variables and not the magnitude. Below are
the plots which help us understand how the covariance between two variables would look in
different directions.

CORRELATION

Correlation analysis is a method of statistical evaluation used to study the strength of a
relationship between two numerically measured, continuous variables.

It not only shows the kind of relation (in terms of direction) but also how strong the
relationship is. Thus, we can say that correlation values are standardized, whereas
covariance values are not standardized and cannot be used to compare how strong or
weak the relationship is, because the magnitude has no direct significance. Correlation can
assume values from -1 to +1.

To determine whether the covariance of the two variables is large or small, we need to assess
it relative to the standard deviations of the two variables.

To do so we have to normalize the covariance by dividing it by the product of the standard
deviations of the two variables, thus providing a correlation between the two variables:
r = Cov(X, Y) / (σX · σY).

The main result of a correlation is called the correlation coefficient.

The correlation coefficient is a dimensionless metric and its value ranges from -1 to +1.

The closer it is to +1 or -1, the more closely the two variables are related.

If there is no relationship at all between two variables, then the correlation coefficient will
certainly be 0. However, if it is 0 then we can only say that there is no linear relationship.
There could exist other functional relationships between the variables.

When the correlation coefficient is positive, an increase in one variable tends to accompany an
increase in the other. When the correlation coefficient is negative, the changes in the two
variables are in opposite directions.
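
The contrast between covariance and correlation can be illustrated numerically. The sketch below uses a small hypothetical data set and hand-written helper functions (covariance and correlation, using the population form with N in the denominator); rescaling one variable changes the covariance but leaves the correlation unchanged.

```python
from math import sqrt


def covariance(xs, ys):
    """Population covariance: mean of the products of deviations from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_rescaled = [value * 100 for value in y]   # same variable in different units

cov_xy = covariance(x, y)
cov_rescaled = covariance(x, y_rescaled)
r_xy = correlation(x, y)
r_rescaled = correlation(x, y_rescaled)

print("Covariance:", round(cov_xy, 4), "after rescaling:", round(cov_rescaled, 4))
print("Correlation:", round(r_xy, 4), "after rescaling:", round(r_rescaled, 4))
```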

HYPOTHESIS AND INFERENCE

A statistical hypothesis is an assumption about a population which may or may not be true.
Hypothesis testing is a set of formal procedures used by statisticians to either accept or reject
statistical hypotheses. Statistical hypotheses are of two types:
• Null hypothesis, H0: represents the hypothesis that the observations are the result of chance alone.
• Alternative hypothesis, Ha: represents the hypothesis that the observations are
influenced by some non-random cause.
Example
Suppose we wanted to check whether a coin was fair and balanced. A null hypothesis might say
that half of the flips will be heads and half will be tails, whereas the alternative hypothesis might
say that the numbers of heads and tails may be very different.
H0: P=0.5

Ha: P≠0.5

For example, suppose we flipped the coin 50 times and obtained 40 heads and 10 tails. Using this
result, we would reject the null hypothesis and conclude, based on the evidence, that the
coin was probably not fair and balanced.
Hypothesis Tests
The following formal process is used by statisticians to determine whether to reject a null hypothesis,
based on sample data. This process is called hypothesis testing and consists of the following four
steps:
1. State the hypotheses - This step involves stating both null and alternative hypotheses.
The hypotheses should be stated in such a way that they are mutually exclusive. If one is
true then other must be false.
2. Formulate an analysis plan - The analysis plan is to describe how to use the sample
data to evaluate the null hypothesis. The evaluation process focuses around a single test
statistic.
3. Analyze sample data - Find the value of the test statistic (using properties like mean
score, proportion, t statistic, z-score, etc.) stated in the analysis plan.
4. Interpret results - Apply the decisions stated in the analysis plan. If the value of the test
statistic is very unlikely based on the null hypothesis, then reject the null hypothesis.
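
The four steps can be sketched for the coin example (40 heads in 50 flips). This is a minimal illustration using an exact binomial calculation; the 5% significance level and the two-sided definition of "at least as extreme" are assumptions made for the example.

```python
from math import comb

# Step 1: hypotheses. H0: P = 0.5, Ha: P != 0.5.
p_null = 0.5

# Step 2: analysis plan. Test statistic = number of heads; significance level assumed to be 5%.
alpha = 0.05

# Step 3: sample data from the coin example.
n_flips, n_heads = 50, 40


def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n flips when P(head) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Two-sided p-value: total probability, under H0, of results at least as far from the
# expected number of heads (25) as the observed result.
expected_heads = n_flips * p_null
observed_distance = abs(n_heads - expected_heads)
p_value = sum(
    binomial_pmf(k, n_flips, p_null)
    for k in range(n_flips + 1)
    if abs(k - expected_heads) >= observed_distance
)

# Step 4: interpret the result.
reject_null = p_value < alpha

print("p-value:", p_value)
print("Reject H0:", reject_null)
```
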

UNIT -2
1. EXPLORATORY DATA ANALYSIS (EDA)

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets
and summarize their main characteristics, often employing data visualization methods. It
helps determine how best to manipulate data sources to get the answers you need, making it
easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check
assumptions.

The main purpose of EDA is to help look at data before making any assumptions. It can help
identify obvious errors, as well as better understand patterns within the data, detect outliers or
anomalous events, and find interesting relations among the variables.

Data scientists can use exploratory analysis to ensure the results they produce are valid and
applicable to any desired business outcomes and goals. EDA also helps stakeholders by
confirming they are asking the right questions. EDA can help answer questions about
standard deviations, categorical variables, and confidence intervals. Once EDA is complete
and insights are drawn, its features can then be used for more sophisticated data analysis or
modeling, including machine learning.

Exploratory data analysis tools

Specific statistical functions and techniques you can perform with EDA tools include:

• Clustering and dimension reduction techniques, which help create graphical displays
  of high-dimensional data containing many variables.
• Univariate visualization of each field in the raw dataset, with summary statistics.
• Bivariate visualizations and summary statistics that allow you to assess the
  relationship between each variable in the dataset and the target variable you’re
  looking at.
• Multivariate visualizations, for mapping and understanding interactions between
  different fields in the data.
• K-means Clustering is a clustering method in unsupervised learning where data
  points are assigned into K groups, i.e. the number of clusters, based on the distance
  from each group’s centroid. The data points closest to a particular centroid will be
  clustered under the same category. K-means Clustering is commonly used in market
  segmentation, pattern recognition, and image compression.
• Predictive models, such as linear regression, use statistics and data to predict
  outcomes.

Types of exploratory data analysis

There are four primary types of EDA:

• Univariate non-graphical. This is the simplest form of data analysis, where the data
  being analyzed consists of just one variable. Since it’s a single variable, it doesn’t
  deal with causes or relationships. The main purpose of univariate analysis is to
  describe the data and find patterns that exist within it.
• Univariate graphical. Non-graphical methods don’t provide a full picture of the
  data. Graphical methods are therefore required. Common types of univariate
  graphics include:
    ◦ Stem-and-leaf plots, which show all data values and the shape of the
      distribution.
    ◦ Histograms, a bar plot in which each bar represents the frequency (count) or
      proportion (count/total count) of cases for a range of values.
    ◦ Box plots, which graphically depict the five-number summary of minimum,
      first quartile, median, third quartile, and maximum.
• Multivariate non-graphical. Multivariate data arises from more than one variable.
  Multivariate non-graphical EDA techniques generally show the relationship
  between two or more variables of the data through cross-tabulation or statistics.
• Multivariate graphical. Multivariate graphical EDA uses graphics to display relationships
  between two or more sets of data. The most used graphic is a grouped bar plot or bar
  chart with each group representing one level of one of the variables and each bar
  within a group representing the levels of the other variable.
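
Several of these techniques can be tried with the standard library alone. The following is a minimal sketch, reusing the retirement age data from Unit 1 and a small hypothetical set of categorical records: a five-number summary (the numbers behind a box plot), a text histogram, and a cross-tabulation. The "inclusive" quartile method is one convention among several.

```python
from collections import Counter
from statistics import quantiles

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]   # retirement ages from Unit 1

# Univariate non-graphical: the five-number summary that a box plot depicts.
q1, med, q3 = quantiles(ages, n=4, method="inclusive")
five_number_summary = (min(ages), q1, med, q3, max(ages))
print("Five-number summary:", five_number_summary)

# Univariate graphical: a text histogram, one '#' per observation.
for age, count in sorted(Counter(ages).items()):
    print(f"{age} | {'#' * count}")

# Multivariate non-graphical: cross-tabulation of two hypothetical categorical variables
# (gender, whether the person retired before 57).
records = [("M", "yes"), ("F", "no"), ("F", "yes"), ("M", "yes"), ("F", "yes")]
crosstab = Counter(records)
for (gender, early), count in sorted(crosstab.items()):
    print(f"gender={gender}, early={early}: {count}")
```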

Exploratory Data Analysis Tools

Some of the most common data science tools used to create an EDA include:

• Python: An interpreted, object-oriented programming language with dynamic
  semantics. Its high-level, built-in data structures, combined with dynamic typing
  and dynamic binding, make it very attractive for rapid application development, as
  well as for use as a scripting or glue language to connect existing components
  together. Python and EDA can be used together to identify missing values in a data
  set, which is important so you can decide how to handle missing values for machine
  learning.
• R: An open-source programming language and free software environment for
  statistical computing and graphics supported by the R Foundation for Statistical
  Computing. The R language is widely used among statisticians in data science in
  developing statistical observations and data analysis.
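
As a small illustration of identifying missing values in Python, the sketch below counts missing entries per column in a hypothetical data set, represented as a list of dictionaries with None marking a missing value. Only built-in features are used here; in practice a data-frame library would typically be used for the same check.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 54, "city": "Pune", "income": 52000},
    {"age": None, "city": "Delhi", "income": 61000},
    {"age": 57, "city": None, "income": None},
    {"age": 60, "city": "Chennai", "income": 58000},
]

# Count the missing values in each column.
missing_per_column = {
    column: sum(1 for row in rows if row[column] is None)
    for column in rows[0]
}

# One simple way to handle them: keep only the rows with no missing values.
complete_rows = [row for row in rows if None not in row.values()]

print("Missing values per column:", missing_per_column)
print("Complete rows:", len(complete_rows), "of", len(rows))
```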

Philosophy of Exploratory Data Analysis

There are important reasons anyone working with data should do EDA:

• To gain intuition about the data;
• To make comparisons between distributions;
• For sanity checking (making sure the data is on the scale you expect, in the format
  you thought it should be);
• To find out where data is missing or if there are outliers;
• To summarize the data.

In the context of data generated from logs, EDA also helps with debugging the logging
process. For example, “patterns” you find in the data could actually be something wrong in
the logging process that needs to be fixed. If you never go to the trouble of debugging, you’ll
continue to think your patterns are real. The engineers we’ve worked with are always grateful
for help in this area.

2. THE LIFECYCLE OF DATA SCIENCE

1. Business Understanding: The complete cycle revolves around the business goal. What
will you solve if you do not have a specific problem? It is extremely important
to understand the business objective clearly, because that will be the
ultimate aim of the analysis. Only after a proper understanding can we set the specific goal of
analysis that is in sync with the business objective. You need to understand whether the
customer wants to minimize savings loss, or whether they prefer to predict the price of a
commodity, etc.

2. Data Understanding: After business understanding, the next step is data
understanding. This involves collecting all the available data. Here you need to work closely
with the business team, as they are aware of what
data is present, what data should be used for this business problem,
and other information. This step includes describing the data, their structure, their
relevance, and their data types. Explore the data using graphical plots. Basically, extract
any information that you can get about the data by simply exploring it.

3. Preparation of Data: Next comes the data preparation stage. This consists of steps like
selecting the relevant data, integrating the data by merging the data sets,
cleaning it, treating missing values (for instance by eliminating them), treating inaccurate
data by eliminating it, and checking for outliers using box plots. It also includes
constructing new data by deriving new features from existing ones.

4. Exploratory Data Analysis: This step involves getting some idea about the solution
and the factors affecting it before building the actual model. The distribution of data
within different variables is explored graphically using bar graphs.
Relations between different features are captured through graphical representations like scatter
plots and heat maps. Many data visualization techniques are used extensively to
explore every feature individually and in combination with
other features.

5. Data Modeling: A model takes the prepared data as input and produces the desired output.
This step consists of selecting the appropriate type of model, depending on whether the
problem is a classification, regression, or clustering problem. After choosing the model
family, we need to carefully choose which of the algorithms in that family to implement. We
need to tune the hyperparameters of each model to obtain the desired performance.

6. Model Evaluation: Here the model is evaluated to check whether it is ready to be deployed.
The model is tested on unseen data and evaluated on a carefully thought out set of evaluation
metrics.

7. Model Deployment: This is the last step in the data science life cycle. Each step in the
data science life cycle described above must be worked on carefully. If any step is performed
improperly, it affects the subsequent steps and the complete effort goes to waste. For
example, if data is not collected properly, you will lose information and will not be able to
build an ideal model. If data is not cleaned properly, the model will not work. If the model
is not evaluated properly, it will fail in the real world. Right from business understanding
to model deployment, every step has to be given proper attention, time, and effort.

3. DESCRIPTIVE STATISTICS
Descriptive statistics summarize and organize characteristics of a data set. A data set is a
collection of responses or observations from a sample or entire population.

In quantitative research, after collecting data, the first step of statistical analysis is to describe
characteristics of the responses, such as the average of one variable (e.g., age), or the relation
between two variables (e.g., age and creativity).

The next step is inferential statistics, which help you decide whether your data confirms or
refutes your hypothesis and whether it is generalizable to a larger population.

Types of descriptive statistics

There are 3 main types of descriptive statistics:

 The distribution concerns the frequency of each value.

 The central tendency concerns the averages of the values.
 The variability or dispersion concerns how spread out the values are.

Frequency distribution

Frequency distribution in statistics is a representation that displays the number of
observations within a given interval.
The representation of a frequency distribution can be graphical or tabular so that it is easier to
understand.
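
As a small illustration, the tabular form of a frequency distribution can be built with Python's `collections.Counter`. The sketch uses the retirement-age data from Unit 1; the interval width of 2 years in the second part is an assumption chosen only for the example.

```python
from collections import Counter

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

# Frequency of each individual value
frequency = Counter(ages)
for age in sorted(frequency):
    print(age, frequency[age])

# Frequency per interval: group ages into 2-year intervals starting at 54
# (54-55, 56-57, 58-59, 60-61); each key is the lower bound of an interval
interval_counts = Counter(age - (age - 54) % 2 for age in ages)
for start in sorted(interval_counts):
    print(f"{start}-{start + 1}", interval_counts[start])
```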

Frequency distributions are particularly useful for checking whether data follow a normal
distribution, in which the observations fall in predictable proportions within each standard
deviation of the mean.

In finance, traders use frequency distributions to take note of price action and identify trends.

Measures of central tendency

Measures of central tendency estimate the center, or average, of a data set.
The mean, median and mode are 3 ways of finding the average.

The mean, or M, is the most commonly used method for finding the average.

To find the mean, simply add up all response values and divide the sum by the total number
of responses. The total number of responses or observations is called N.

Mean number of library visits

Data set: 15, 3, 12, 0, 24, 3
Sum of all values: 15 + 3 + 12 + 0 + 24 + 3 = 57
Total number of responses: N = 6
Mean: divide the sum of values by N to find M: 57/6 = 9.5

The median is the value that’s exactly in the middle of a data set.

To find the median, order each response value from the smallest to the biggest. Then, the
median is the number in the middle. If there are two numbers in the middle, find their mean.

Median number of library visits

Ordered data set: 0, 3, 3, 12, 15, 24
Middle numbers: 3, 12
Median: find the mean of the two middle numbers: (3 + 12)/2 = 7.5

The mode is simply the most popular or most frequent response value. A data set can
have no mode, one mode, or more than one mode.

To find the mode, order your data set from lowest to highest and find the response that occurs
most frequently.

Mode number of library visits

Ordered data set: 0, 3, 3, 12, 15, 24
Mode: find the most frequently occurring response: 3
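
The three measures can be checked with Python's standard `statistics` module. This is a minimal sketch on the library-visits data; `multimode` is used because it returns every value tied for most frequent, which covers the case of more than one mode.

```python
import statistics

visits = [15, 3, 12, 0, 24, 3]

mean = statistics.mean(visits)         # sum of values divided by N
median = statistics.median(visits)     # mean of the two middle values when N is even
modes = statistics.multimode(visits)   # list of the most frequent value(s)

print("mean:", mean)
print("median:", median)
print("mode(s):", modes)
```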

Measures of variability
Measures of variability give you a sense of how spread out the response values are. The
range, standard deviation and variance each reflect different aspects of spread.

Range
The range gives you an idea of how far apart the most extreme response scores are. To find
the range, simply subtract the lowest value from the highest value.

Range of visits to the library in the past year

Ordered data set: 0, 3, 3, 12, 15, 24
Range: 24 – 0 = 24

Standard deviation
The standard deviation (s) is the average amount of variability in your dataset. It tells you, on
average, how far each score lies from the mean. The larger the standard deviation, the more
variable the data set is.

There are six steps for finding the standard deviation:

1. List each score and find their mean.

2. Subtract the mean from each score to get the deviation from the mean.
3. Square each of these deviations.
4. Add up all of the squared deviations.
5. Divide the sum of the squared deviations by N – 1.
6. Find the square root of the number you found.

Standard deviation of visits to the library in the past year

In the table below, you complete Steps 1 through 4.

Raw data     Deviation from mean     Squared deviation
15           15 – 9.5 = 5.5          30.25
3            3 – 9.5 = -6.5          42.25
12           12 – 9.5 = 2.5          6.25
0            0 – 9.5 = -9.5          90.25
24           24 – 9.5 = 14.5         210.25
3            3 – 9.5 = -6.5          42.25
M = 9.5      Sum = 0                 Sum of squares = 421.5

Step 5: 421.5/5 = 84.3

Step 6: √84.3 = 9.18

From learning that s = 9.18, you can say that on average, each score deviates from the mean
by 9.18 points.

Variance
The variance is the average of squared deviations from the mean. Variance reflects the degree
of spread in the data set. The more spread the data, the larger the variance is in relation to the
mean.

To find the variance, simply square the standard deviation. The symbol for variance is s².

Variance of visits to the library in the past year

Data set: 15, 3, 12, 0, 24, 3
s = 9.18
s² = 84.3
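
The range, variance and standard deviation of the library-visits data can be reproduced with a short sketch that follows the six steps listed earlier (dividing by N – 1). The variable names are chosen for the example only.

```python
import statistics

visits = [15, 3, 12, 0, 24, 3]

# Range: highest value minus lowest value
value_range = max(visits) - min(visits)

# Steps 1-4: mean, deviations, squared deviations, and their sum
mean = sum(visits) / len(visits)
squared_deviations = [(x - mean) ** 2 for x in visits]
sum_of_squares = sum(squared_deviations)

# Step 5: divide by N - 1 to get the sample variance
variance = sum_of_squares / (len(visits) - 1)

# Step 6: the square root of the variance is the standard deviation
std_dev = variance ** 0.5

print(value_range, sum_of_squares, variance, round(std_dev, 2))

# The standard library gives the same sample statistics
print(statistics.variance(visits), statistics.stdev(visits))
```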
DATA VISUALIZATION
Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to
see and understand trends, outliers, and patterns in data.

In the world of Big Data, data visualization tools and technologies are essential to analyze
massive amounts of information and make data-driven decisions.

More specific examples of methods to visualize data:

 Bar Chart
 Box-and-whisker Plots
 Bubble Cloud
 Gantt Chart
 Heat Map
 Histogram
 Radial Tree
 Scatter Plot (2D or 3D)

Scatter Plot
A scatter plot is a chart type that is normally used to observe and visually display the
relationship between variables. The values of the variables are represented by dots.
The positioning of the dots on the vertical and horizontal axis will inform the value of
the respective data point; hence, scatter plots make use of Cartesian coordinates to
display the values of the variables in a data set. Scatter plots are also known as
scattergrams, scatter graphs, or scatter charts.
Scatter Plot Applications and Uses

1. Demonstration of the relationship between two variables

2. Identification of correlational relationships

3. Identification of data patterns

Creating a Scatter Plot Diagram

[Figure: scatter plot diagram]

Drawing a Scatter Plot

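A scatter plot can be created using the DataFrame.plot.scatter() method.

import pandas as pd
import numpy as np

# 50 rows of random values in four columns
df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])
# Plot column 'b' against column 'a'
df.plot.scatter(x='a', y='b')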
Its output is as follows −

Bar Graph
The pictorial representation of grouped data in the form of vertical or horizontal
rectangular bars, where the lengths of the bars are proportional to the measure of the data,
is known as a bar graph or bar chart.
The bars drawn are of uniform width, and the variable quantity is represented on one of the
axes. The measure of the variable is depicted on the other axis. The heights or the lengths
of the bars denote the value of the variable, and these graphs are also used to compare
certain quantities. Frequency distribution tables can be easily represented using bar charts,
which simplify the calculations and understanding of data.
The three major attributes of bar graphs are:

 The bar graph helps to compare the different sets of data among different groups
easily.
 It shows the relationship using two axes, with the categories on one axis and the
discrete values on the other axis.
 The graph shows the major changes in data over time.
The types of bar charts are as follows:

1. Vertical bar chart

2. Horizontal bar chart

Properties of Bar Graph

Some of the important properties of a bar graph are as follows:

 All the bars should have a common base.

 Each column in the bar graph should have equal width.
 The height of the bar should correspond to the data value.
 The distance between each bar should be the same.

Advantages:

 Bar graph summarises the large set of data in simple visual form.
 It displays each category of data in the frequency distribution.
 It clarifies the trend of data better than the table.
 It helps in estimating the key values at a glance.

Following is a simple example of the Matplotlib bar plot. It shows the number of
students enrolled for various courses offered at an institute.

import matplotlib.pyplot as plt

fig = plt.figure()
# Axes spanning the whole figure: [left, bottom, width, height]
ax = fig.add_axes([0, 0, 1, 1])
langs = ['C', 'C++', 'Java', 'Python', 'PHP']
students = [23, 17, 35, 29, 12]
# One bar per course, with height equal to the number of students
ax.bar(langs, students)
plt.show()

Histogram
A histogram is a graphical representation of a grouped frequency distribution with
continuous classes. It is an area diagram and can be defined as a set of rectangles with bases
along the intervals between class boundaries and with areas proportional to frequencies
in the corresponding classes. In such representations, all the rectangles are adjacent since the
base covers the intervals between class boundaries. When the classes have equal widths, the
heights of the rectangles are proportional to the corresponding frequencies; when the class
widths differ, the heights are proportional to the corresponding frequency densities.
In other words, a histogram is a diagram involving rectangles whose area is proportional to the
frequency of a variable and whose width is equal to the class interval.

When to Use Histogram?

The histogram graph is used under certain conditions. They are:

 The data should be numerical.

 A histogram is used to check the shape of the data distribution.
 Used to check whether the process changes from one period to another.
 Used to determine whether the output is different when it involves two or more
processes.
 Used to analyse whether the given process meets the customer requirements.

Histogram Types
The histogram can be classified into different types based on the frequency distribution of the
data. There are different types of distributions, such as normal distribution, skewed
distribution, bimodal distribution, multimodal distribution, comb distribution, edge peak
distribution, dog food distributions, heart cut distribution, and so on.
The following example plots a histogram of marks obtained by students in a class. Four bins,
0-25, 26-50, 51-75, and 76-100, are defined. The histogram shows the number of students
falling in each range.
from matplotlib import pyplot as plt
import numpy as np

fig, ax = plt.subplots(1, 1)
a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])
# The bin edges 0, 25, 50, 75, 100 define the four bins
ax.hist(a, bins=[0, 25, 50, 75, 100])
ax.set_title("histogram of result")
ax.set_xticks([0, 25, 50, 75, 100])
ax.set_xlabel('marks')
ax.set_ylabel('no. of students')
plt.show()
The plot appears as shown below −
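
The bar heights in this histogram can also be checked numerically. A minimal sketch with `numpy.histogram`, using the same marks and bin edges; note that NumPy treats each bin as half-open, so a mark of exactly 25 would be counted in the second bin, and only the last bin includes its right edge.

```python
import numpy as np

marks = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])

# counts[i] is the number of marks between edges[i] and edges[i + 1]
counts, edges = np.histogram(marks, bins=[0, 25, 50, 75, 100])

for low, high, count in zip(edges[:-1], edges[1:], counts):
    print(f"{low}-{high}: {count} students")
```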
Heat Map

A heat map (or heatmap) is a graphical representation of data where values are depicted by
color. Heat maps make it easy to visualize complex data and understand it at a glance.

Types of heatmap

Heat map is really an umbrella term for different heatmapping tools: scroll maps, click maps,
and move maps.

Scroll maps show you the exact percentage of people who scroll down to any point on
the page: the redder the area, the more visitors saw it.
Click maps show you an aggregate of where visitors click their mouse on desktop
devices and tap their finger on mobile devices (in this case, they are known as touch
heatmaps). The map is color-coded to show the elements that have been clicked and
tapped the most (red, orange, yellow).
Move maps track where desktop users move their mouse as they navigate the page. The
hot spots in a move map represent where users have moved their cursor on a page.

The example below is a two-dimensional plot of values which are mapped to the
indices and columns of the chart.

from pandas import DataFrame
import matplotlib.pyplot as plt

# Each row is a list (not a set) so that column order and repeated values are preserved
data = [[2, 3, 4, 1], [6, 3, 5, 2], [6, 3, 5, 4], [3, 7, 5, 4], [2, 8, 1, 5]]
Index = ['I1', 'I2', 'I3', 'I4', 'I5']
Cols = ['C1', 'C2', 'C3', 'C4']
df = DataFrame(data, index=Index, columns=Cols)

# Draw each value of the table as a colored cell
plt.pcolor(df)
plt.show()
Its output is as follows −


Box Plots

When we display the data distribution in a standardized way using the five-number summary
(minimum, Q1 (first quartile), median, Q3 (third quartile), and maximum), it is called
a box plot. It is also termed a box and whisker plot.

Parts of Box Plots

[Figure: box plot showing the minimum, maximum, first quartile, third quartile, median and
outliers]

Minimum: The minimum value in the given dataset

First Quartile (Q1): The first quartile is the median of the lower half of the data set.
Median: The median is the middle value of the dataset, which divides the given dataset into
two equal parts. The median is considered as the second quartile.
Third Quartile (Q3): The third quartile is the median of the upper half of the data.
Maximum: The maximum value in the given dataset.
Apart from these five terms, the other terms used in the box plot are:
Interquartile Range (IQR): The difference between the third quartile and first quartile is
known as the interquartile range. (i.e.) IQR = Q3-Q1
Outlier: Data that fall far to the left or right of the rest of the ordered data are tested
as possible outliers. Generally, outliers lie more than a specified distance below the first
quartile or above the third quartile; a common choice for that distance is 1.5 × IQR.
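
The terms above can be sketched in plain Python. The quartiles here follow the definitions given in these notes (median of the lower and upper halves), so they may differ slightly from library functions that interpolate. The cut-off of 1.5 × IQR and the extra value 60 are assumptions used only to illustrate an outlier.

```python
import statistics

def five_number_summary(values):
    """Return min, Q1, median, Q3 and max (needs at least two values)."""
    data = sorted(values)
    half = len(data) // 2
    # When the number of values is odd, the middle value is left out of both halves
    lower, upper = data[:half], data[-half:]
    return {
        "min": data[0],
        "q1": statistics.median(lower),
        "median": statistics.median(data),
        "q3": statistics.median(upper),
        "max": data[-1],
    }

def find_outliers(values, distance=1.5):
    """Values lying more than distance * IQR below Q1 or above Q3."""
    summary = five_number_summary(values)
    iqr = summary["q3"] - summary["q1"]
    low = summary["q1"] - distance * iqr
    high = summary["q3"] + distance * iqr
    return [v for v in values if v < low or v > high]

visits = [15, 3, 12, 0, 24, 3]
print(five_number_summary(visits))
print(find_outliers(visits))          # no value is far enough from the quartiles
print(find_outliers(visits + [60]))   # the hypothetical value 60 is flagged
```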

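import matplotlib.pyplot as plt
import numpy as np

# Creating dataset: 200 values from a normal distribution (mean 100, standard deviation 20)
np.random.seed(10)
data = np.random.normal(100, 20, 200)
fig = plt.figure(figsize=(10, 7))
# Creating plot
plt.boxplot(data)
# show plot
plt.show()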

Output:
UNIT - 3

K-nearest neighbors (KNN)

Introduction

K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which
can be used for both classification and regression predictive problems.
However, it is mainly used for classification predictive problems in industry. The
following two properties define KNN well −
 Lazy learning algorithm − KNN is a lazy learning algorithm because it
does not have a specialized training phase and uses all the training data at
classification time.
 Non-parametric learning algorithm − KNN is also a non-parametric
learning algorithm because it doesn’t assume anything about the distribution of the
underlying data.

Working of KNN Algorithm

The K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values
of new data points, which means that a new data point is assigned a
value based on how closely it matches the points in the training set. We can
understand its working with the help of the following steps −
Step 1 − For implementing any algorithm, we need a dataset. So during the first
step of KNN, we must load the training as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points
to consider. K can be any positive integer.
Step 3 − For each point in the test data, do the following −
 3.1 − Calculate the distance between the test point and each row of training data
using one of the following methods: Euclidean, Manhattan or
Hamming distance. The most commonly used method to calculate distance
is Euclidean.
 3.2 − Sort the rows in ascending order of distance.
 3.3 − Choose the top K rows from the sorted array.
 3.4 − Assign a class to the test point based on the most frequent
class of these rows.
Step 4 − End

Example
The following is an example to understand the concept of K and working of KNN
algorithm −
Suppose we have a dataset which can be plotted as follows −

Now, we need to classify a new data point, shown as a black dot (at point 60,60), into the
blue or the red class. We are assuming K = 3, i.e. the algorithm finds the three nearest data
points. It is shown in the next diagram −

We can see in the above diagram the three nearest neighbors of the data point
with the black dot. Among those three, two lie in the red class, hence the black
dot will also be assigned to the red class.
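
The procedure and the K = 3 example can be sketched in a few lines of Python. The training points below are hypothetical (the coordinates in the original diagram are not given); they are chosen so that two of the three nearest neighbors of (60, 60) are red. Euclidean distance is computed with `math.dist`, available from Python 3.8.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 3.1: distance from the query to every training row
    distances = [(math.dist(point, query), label)
                 for point, label in zip(train_points, train_labels)]
    # Step 3.2: sort in ascending order of distance
    distances.sort(key=lambda pair: pair[0])
    # Step 3.3: keep the top K rows
    nearest_labels = [label for _, label in distances[:k]]
    # Step 3.4: the most frequent class among them wins
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data
points = [(55, 62), (62, 58), (80, 80), (58, 70), (20, 20), (30, 35)]
labels = ["red", "red", "red", "blue", "blue", "blue"]

print(knn_predict(points, labels, query=(60, 60), k=3))
```

Because every prediction scans the whole training set, the cost grows with the number of training rows, which is the reason prediction becomes slow for large N.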

Pros and Cons of KNN

Pros

 It is a very simple algorithm to understand and interpret.

 It is very useful for nonlinear data because the algorithm makes no assumption about
the data.
 It is a versatile algorithm as we can use it for classification as well as
regression.
 It has relatively high accuracy, although there are much better supervised learning
models than KNN.

Cons

 It is computationally expensive because it stores all the training data and compares
each new point against it.
 High memory storage is required as compared to other supervised learning
algorithms.
 Prediction is slow when N is large.
 It is very sensitive to the scale of data as well as irrelevant features.

Applications of KNN

The following are some of the areas in which KNN can be applied successfully −

Banking System

KNN can be used in a banking system to predict whether an individual is fit for loan
approval, i.e. whether that individual has characteristics similar to those of known
defaulters.

Calculating Credit Ratings

KNN algorithms can be used to find an individual’s credit rating by comparing the individual
with persons having similar traits.
Support vector machines (SVMs)

Introduction to SVM

Support vector machines (SVMs) are powerful yet flexible supervised machine
learning algorithms which are used both for classification and regression. But
generally, they are used in classification problems. SVMs were first introduced in the
1960s and were later refined in the 1990s. SVMs have their unique way of
implementation as compared to other machine learning algorithms. Lately, they
are extremely popular because of their ability to handle multiple continuous and
categorical variables.

Working of SVM

An SVM model is basically a representation of different classes separated by a hyperplane
in multidimensional space. The hyperplane is generated in an iterative
manner by SVM so that the error can be minimized. The goal of SVM is to divide
the datasets into classes by finding a maximum marginal hyperplane (MMH).

The following are important concepts in SVM −

 Support Vectors − Data points that are closest to the hyperplane are called
support vectors. The separating line is defined with the help of these data
points.
 Hyperplane − It is a decision plane or space that divides a set of objects having
different classes.
 Margin − It may be defined as the gap between the two lines on the closest data
points of different classes. It can be calculated as the perpendicular
distance from the line to the support vectors. A large margin is considered
a good margin and a small margin is considered a bad margin.
The main goal of SVM is to divide the datasets into classes by finding a maximum
marginal hyperplane (MMH), which can be done in the following two steps −
 First, SVM generates hyperplanes iteratively that segregate the classes
in the best way.
 Then, it chooses the hyperplane that separates the classes correctly with the largest
margin.
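
The geometry behind the margin can be sketched as follows, assuming the hyperplane is written as w · x + b = 0 (this notation, and the values of w and b, are assumptions for the example; they are not given in the notes). The perpendicular distance of a point from the hyperplane is |w · x + b| / ||w||, and the sign of w · x + b tells on which side the point lies. The sketch does not train an SVM; it only evaluates a given hyperplane.

```python
import math

def decision_value(w, b, x):
    """w . x + b: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm

def classify(w, b, x):
    """Assign class +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0
w, b = (3.0, 4.0), -5.0

for point in [(3.0, 4.0), (0.0, 0.0)]:
    print(point, classify(w, b, point), distance_to_hyperplane(w, b, point))
```

In these terms, the support vectors are the training points with the smallest distance to the hyperplane, and a maximum-margin hyperplane is one that makes that smallest distance as large as possible.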

SVM Kernels

In practice, the SVM algorithm is implemented with a kernel that transforms an input
data space into the required form. SVM uses a technique called the kernel trick, in
which the kernel takes a low-dimensional input space and transforms it into a
higher-dimensional space. In simple words, the kernel converts non-separable problems into
separable problems by adding more dimensions. It makes SVM more
powerful, flexible and accurate. The following are some of the types of kernels
used by SVM.

Linear Kernel

It can be used as a dot product between any two observations. The formula of the
linear kernel is as below −

K(𝑥, 𝑥𝑖) = sum(𝑥 * 𝑥𝑖)

From the above formula, we can see that the product between two vectors, say 𝑥
& 𝑥𝑖, is the sum of the multiplication of each pair of input values.

Polynomial Kernel

It is a more generalized form of the linear kernel and can distinguish curved or nonlinear
input spaces. Following is the formula for polynomial kernel −

Here d is the degree of the polynomial, which we need to specify manually in the
learning algorithm.

Radial Basis Function (RBF) Kernel

The RBF kernel, mostly used in SVM classification, maps the input space into an
infinite-dimensional space. The following formula explains it mathematically −

Here, gamma is typically chosen between 0 and 1. We need to specify it manually in the
learning algorithm. A good default value of gamma is 0.1.
Just as SVM can be implemented for linearly separable data, it can be implemented in
Python for data that is not linearly separable. This is done by using kernels.
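
The three kernels can be sketched as plain Python functions. The formulas used here are commonly quoted textbook forms and are an assumption: the dot product for the linear kernel, (1 + dot product)^d for the polynomial kernel, and exp(−gamma × squared Euclidean distance) for the RBF kernel. The exact forms intended by these notes may differ, for example in the constant added in the polynomial kernel.

```python
import math

def linear_kernel(x, xi):
    """Sum of the products of each pair of input values (dot product)."""
    return sum(a * b for a, b in zip(x, xi))

def polynomial_kernel(x, xi, d=2):
    """(1 + x . xi) ** d, where d is the degree chosen by the user."""
    return (1 + linear_kernel(x, xi)) ** d

def rbf_kernel(x, xi, gamma=0.1):
    """exp(-gamma * ||x - xi||^2); equals 1 for identical points."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-gamma * squared_distance)

x, xi = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(x, xi))
print(polynomial_kernel(x, xi, d=2))
print(rbf_kernel(x, xi, gamma=0.1))
```

With d = 1 the polynomial kernel reduces to the linear kernel plus a constant, which is the sense in which it generalizes the linear kernel.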

Pros and Cons of SVM Classifiers

Pros of SVM classifiers

SVM classifiers offer great accuracy and work well in high-dimensional spaces.
SVM classifiers basically use a subset of the training points (the support vectors), and as a
result use very little memory.

Cons of SVM classifiers

They have high training time, hence in practice they are not suitable for large datasets.
Another disadvantage is that SVM classifiers do not work well with overlapping
classes.

Decision Tree

Introduction to Decision Tree

In general, decision tree analysis is a predictive modelling tool that can be applied
across many areas. Decision trees can be constructed by an algorithmic approach
that can split the dataset in different ways based on different conditions. Decision
trees are among the most powerful algorithms that fall under the category of supervised
algorithms.
They can be used for both classification and regression tasks. The two main
entities of a tree are decision nodes, where the data is split, and leaves, where we
get the outcome. An example of a binary tree for predicting whether a person is fit
or unfit, given information such as age, eating habits and exercise habits,
is given below −

In the above decision tree, the questions are decision nodes and the final outcomes
are leaves. We have the following two types of decision trees.
 Classification decision trees − In this kind of decision tree, the decision
variable is categorical. The above decision tree is an example of a
classification decision tree.
 Regression decision trees − In this kind of decision tree, the decision
variable is continuous.

Splitting Criterion

The splitting criterion also tells us which branches to grow from node N
with respect to the outcomes of the chosen test. More specifically, the splitting
criterion indicates the splitting attribute and may also indicate either a split-point or
a splitting subset. Let A be the splitting attribute. There are three possible scenarios:

1. A is discrete-valued: In this case, the outcomes of the test at node N correspond
directly to the known values of A.
2. A is continuous-valued: In this case, the test at node N has two possible
outcomes, corresponding to the conditions A ≤ split point and A > split point, respectively,
where split point is the split-point returned by the attribute selection method as part
of the splitting criterion.
3. A is discrete-valued and a binary tree must be produced: In this case, the test at node N
checks whether the value of A belongs to the splitting subset.

In a decision tree, the major challenge is the identification of the attribute to test at the
root node of each level. This process is known as attribute selection. We have three
popular attribute selection measures:

1. Information Gain
2. Gini Index
3. Gain Ratio

Information Gain

Let node N represent or hold the tuples of partition D. The attribute with the highest
information gain is chosen as the splitting attribute for node N. This attribute
minimizes the information needed to classify the tuples in the resulting partitions
and reflects the least randomness or “impurity” in these partitions.

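Such an approach minimizes the expected number of tests needed to classify
a given tuple and guarantees that a simple (but not necessarily the simplest) tree
is found.
The expected information needed to classify a tuple in D is given by

Info(D) = − Σ p_i log2(p_i), summed over the classes i = 1, ..., m

where p_i is the probability that a tuple in D belongs to class i, estimated as the fraction
of tuples in D that have that class.
How much more information would we still need (after partitioning D on attribute A into
partitions D_1, ..., D_v) to arrive at an exact classification? This amount is measured by

Info_A(D) = Σ (|D_j| / |D|) × Info(D_j), summed over the partitions j = 1, ..., v

Information gain is defined as the difference between the original information
requirement and the new requirement. That is,

Gain(A) = Info(D) − Info_A(D)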
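
The three quantities described here (the expected information of D, the information still needed after partitioning on an attribute, and their difference) can be sketched in Python. The toy data set and the attribute names `outlook`, `windy` and `play` are hypothetical and serve only to show one attribute with maximal gain and one with zero gain.

```python
import math
from collections import Counter

def info(labels):
    """Expected information (entropy, in bits) needed to classify a tuple."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def info_after_split(rows, attribute, target):
    """Weighted information still needed after partitioning rows on `attribute`."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    total = len(rows)
    return sum(len(part) / total * info(part) for part in partitions.values())

def gain(rows, attribute, target):
    """Information gain: original requirement minus the new requirement."""
    original = info([row[target] for row in rows])
    return original - info_after_split(rows, attribute, target)

# Hypothetical toy data
rows = [
    {"outlook": "sunny", "windy": True,  "play": "no"},
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "rainy", "windy": True,  "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
]

for attribute in ("outlook", "windy"):
    print(attribute, gain(rows, attribute, "play"))
```

Here `outlook` separates the classes perfectly, so it has the highest gain and would be chosen as the splitting attribute, while `windy` tells nothing about the class.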

Random Forest Algorithm

Random forest is a supervised learning algorithm which is used for both
classification and regression. However, it is mainly used for classification
problems. Just as a forest is made up of trees, and more trees mean a
more robust forest, the random forest algorithm creates decision trees on
data samples, gets the prediction from each of them, and finally selects
the best solution by means of voting. It is an ensemble method which is better
than a single decision tree because it reduces over-fitting by averaging the
result.

Working of Random Forest Algorithm

We can understand the working of Random Forest algorithm with the help of
following steps −
 Step 1 − First, start with the selection of random samples from a given
dataset.
 Step 2 − Next, this algorithm will construct a decision tree for every sample.
Then it will get the prediction result from every decision tree.
 Step 3 − In this step, voting will be performed for every predicted result.
 Step 4 − At last, select the most voted prediction result as the final
prediction result.
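
The four steps can be sketched in plain Python. This is a deliberately simplified illustration, not a full random forest: each "tree" is a one-split stump on a single numeric feature, and the random selection of features that a real random forest also performs is left out. The data, the function names and the number of trees are assumptions for the example.

```python
import random
from collections import Counter

def majority(labels):
    """Return the most frequent label."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(sample):
    """Fit a one-split tree on (value, label) pairs by trying each observed value."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        left = [y for x, y in sample if x <= threshold]
        right = [y for x, y in sample if x > threshold]
        left_label = majority(left)
        right_label = majority(right) if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def forest_predict(data, query, n_trees=25, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_trees):
        # Step 1: random sample (with replacement) from the given dataset
        sample = [rng.choice(data) for _ in data]
        # Step 2: build a tree for the sample and get its prediction
        tree = fit_stump(sample)
        votes.append(tree(query))
    # Steps 3 and 4: vote and return the most voted prediction
    return majority(votes)

# Hypothetical one-feature data: small values are class "A", large values class "B"
data = [(1, "A"), (2, "A"), (3, "A"), (4, "A"), (7, "B"), (8, "B"), (9, "B"), (10, "B")]

print(forest_predict(data, query=1))
print(forest_predict(data, query=10))
```

Individual trees can disagree because each sees a different sample; the vote is what makes the combined prediction more stable than any single tree.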
The following diagram will illustrate its working −

Pros

The following are the advantages of Random Forest algorithm −

 It overcomes the problem of overfitting by averaging or combining the
results of different decision trees.
 Random forests work well for a larger range of data items than a single
decision tree does.
 A random forest has less variance than a single decision tree.
 Random forests are very flexible and possess very high accuracy.
 The random forest algorithm does not require scaling of data. It maintains
good accuracy even when given data without scaling.
 The random forest algorithm maintains good accuracy even when a large
proportion of the data is missing.
Cons

The following are the disadvantages of Random Forest algorithm −

 Complexity is the main disadvantage of random forest algorithms.
 Construction of random forests is much harder and more time-consuming than
construction of decision trees.
 More computational resources are required to implement the Random Forest
algorithm.

CONFUSION MATRIX
It is the easiest way to measure the performance of a classification problem where
the output can be of two or more types of classes. A confusion matrix is nothing
but a table with two dimensions, viz. “Actual” and “Predicted”, whose cells hold the
counts of “True Positives (TP)”, “True Negatives (TN)”, “False
Positives (FP)” and “False Negatives (FN)”, as shown below −

The explanation of the terms associated with confusion matrix are as follows −
 True Positives (TP) − It is the case when both actual class & predicted
class of data point is 1.
 True Negatives (TN) − It is the case when both actual class & predicted
class of data point is 0.
 False Positives (FP) − It is the case when actual class of data point is 0 &
predicted class of data point is 1.
 False Negatives (FN) − It is the case when actual class of data point is 1 &
predicted class of data point is 0.
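
The four counts can be obtained by comparing the actual and predicted classes pair by pair. A minimal sketch, with hypothetical label lists in which 1 is the positive class and 0 the negative class:

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    return {
        "TP": sum(a == 1 and p == 1 for a, p in pairs),
        "TN": sum(a == 0 and p == 0 for a, p in pairs),
        "FP": sum(a == 0 and p == 1 for a, p in pairs),
        "FN": sum(a == 1 and p == 0 for a, p in pairs),
    }

# Hypothetical actual and predicted classes for eight data points
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

print(confusion_counts(actual, predicted))
```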

EXAMPLE
Metrics for Evaluating Classifier Performance

The accuracy of a classifier on a given test set is the percentage of test set tuples
that are correctly classified by the classifier. That is,

accuracy = (TP + TN) / (P + N)

where P and N are the numbers of positive and negative tuples in the test set.

We can also speak of the error rate or misclassification rate of a classifier, M, which is
simply 1 − accuracy(M), where accuracy(M) is the accuracy of M. This also can be computed
as

error rate = (FP + FN) / (P + N)

We now consider the class imbalance problem, where the main class of interest
is rare. That is, the data set distribution reflects a significant majority of the negative
class and a minority positive class. For example, in fraud detection applications,
the class of interest (or positive class) is “fraud,” which occurs much less
frequently. The sensitivity and specificity measures can be used to measure how well the
classifier recognizes each class:

sensitivity = TP / P
specificity = TN / N

The precision and recall measures are also widely used in classification. Precision
can be thought of as a measure of exactness (i.e., what percentage of tuples
labeled as positive are actually such), whereas recall is a measure of
completeness (what percentage of positive tuples are labeled as such). If recall
seems familiar, that’s because it is the same as sensitivity (or the true positive
rate). These measures can be computed as

precision = TP / (TP + FP)
recall = TP / (TP + FN)

An alternative way to use precision and recall is to combine them into a single
measure. This is the approach of the F measure (also known as the F1 score or
F-score), which is the harmonic mean of precision and recall:

F = (2 × precision × recall) / (precision + recall)
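
The measures discussed in this section can be computed directly from the four confusion-matrix counts. The sketch below uses their standard definitions; the counts themselves are hypothetical and describe an imbalanced test set with 300 positive and 9700 negative tuples.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard evaluation measures computed from confusion-matrix counts."""
    positives = tp + fn          # all actual positive tuples
    negatives = tn + fp          # all actual negative tuples
    total = positives + negatives
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / positives      # same as sensitivity
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "sensitivity": recall,
        "specificity": tn / negatives,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts for a rare positive class
metrics = classification_metrics(tp=90, tn=9560, fp=140, fn=210)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is high even though only 30% of the positive tuples are recognized, which is why accuracy alone can be misleading when the class of interest is rare.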
Naïve Bayes Classifier Algorithm
o Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes theorem and used for solving classification
problems.
o It is mainly used in text classification that includes a high-
dimensional training dataset.
o Naïve Bayes Classifier is one of the simple and most effective
Classification algorithms which helps in building the fast machine
learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the
basis of the probability of an object.
o Some popular applications of the Naïve Bayes algorithm are spam
filtering, sentiment analysis, and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm is made up of two words, Naïve and Bayes,
which can be described as:

o Naïve: It is called Naïve because it assumes that the occurrence
of a certain feature is independent of the occurrence of other
features. For example, if a fruit is identified on the basis of color,
shape, and taste, then a red, spherical, and sweet fruit is recognized
as an apple. Hence each feature individually contributes to identifying
it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle
of Bayes' Theorem.

Bayes' Theorem:

o Bayes' theorem is also known as Bayes' Rule or Bayes' law,
which is used to determine the probability of a hypothesis with
prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A given the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis
is true.

P(A) is Prior Probability: Probability of the hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of the evidence.
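
A small numeric sketch of Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B), in the spam-filtering setting mentioned earlier. All probabilities below are hypothetical: A is "the message is spam" and B is "the message contains a particular word".

```python
def posterior(prior, likelihood, likelihood_if_not):
    """P(A|B) from P(A), P(B|A) and P(B|not A), using Bayes' theorem."""
    # Marginal probability of the evidence: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical values
p_spam = 0.2                  # prior P(A)
p_word_given_spam = 0.6       # likelihood P(B|A)
p_word_given_not_spam = 0.05  # P(B|not A)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word_given_not_spam)
print(round(p_spam_given_word, 4))
```

Observing the word raises the probability of spam from the prior of 0.2 to the posterior computed above. A Naïve Bayes classifier applies the same rule with several features, multiplying their likelihoods under the independence assumption.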

Example

The tuple we wish to classify is

PATTERNS, FEATURES, PATTERN REPRESENTATION

Patterns are everywhere in this digital world. A pattern can either be seen physically or it
can be observed mathematically by applying algorithms.

Example: The colors on clothes, speech patterns, etc. In computer science, a pattern is
represented using a vector of feature values.

Pattern recognition is the process of recognizing patterns by using a machine learning
algorithm. Pattern recognition can be defined as the classification of data based on knowledge
already gained or on statistical information extracted from patterns and/or their representation.
One of the important aspects of pattern recognition is its application potential.

Examples: Speech recognition, speaker identification, multimedia document recognition.

In a typical pattern recognition application, the raw data is processed and converted into a
form that is amenable for a machine to use. Pattern recognition involves classification and
clustering of patterns.

CURSE OF DIMENSIONALITY

Handling high-dimensional data is very difficult in practice, a problem commonly known as the curse of
dimensionality. As the dimensionality of the input dataset increases, any machine learning
algorithm and model becomes more complex. As the number of features increases, the number of
samples needed to cover the feature space grows rapidly, and the chance of overfitting also increases. If a
machine learning model is trained on high-dimensional data without enough samples, it tends to overfit, which results in
poor performance.

Hence, it is often necessary to reduce the number of features, which can be done with
dimensionality reduction.
DIMENSIONALITY REDUCTION

In machine learning classification problems, there are often too many factors on the basis of
which the final classification is done. These factors are basically variables called features.
The higher the number of features, the harder it gets to visualize the training set and then
work on it. Sometimes, most of these features are correlated, and hence redundant. This is
where dimensionality reduction algorithms come into play. Dimensionality reduction is the
process of reducing the number of random variables under consideration, by obtaining a set
of principal variables. It can be divided into feature selection and feature extraction.

Components of Dimensionality Reduction

There are two components of dimensionality reduction:
• Feature selection: Here we try to find a subset of the original set of variables, or features,
that can be used to model the problem. It usually involves one of three approaches:
1. Filter
2. Wrapper
3. Embedded
• Feature extraction: This maps the data from a high-dimensional space to a lower-dimensional space,
i.e. a space with fewer dimensions.

Methods of Dimensionality Reduction

The various methods used for dimensionality reduction include:
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)

Principal Component Analysis

This method was introduced by Karl Pearson. It works on the condition that when the data in a
higher-dimensional space is mapped to a lower-dimensional space, the variance of the data
in the lower-dimensional space should be maximal.
It involves the following steps:
• Construct the covariance matrix of the data.
• Compute the eigenvectors and eigenvalues of this matrix.
• Use the eigenvectors corresponding to the largest eigenvalues to reconstruct a large fraction of
the variance of the original data.
Hence, we are left with fewer eigenvectors, and some information may have been
lost in the process. However, the most important variances should be retained by the remaining
eigenvectors.
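
The steps above can be sketched in plain Python for a small 2-D dataset. This is an illustrative sketch, not a full PCA implementation: it finds only the first principal component, and it uses power iteration as a stand-in for a full eigen-decomposition. The data values and function names are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centred rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered

def top_eigenvector(matrix, iterations=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical 2-D data lying roughly along the line y = x.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]
cov, centered = covariance_matrix(data)
pc1 = top_eigenvector(cov)
# Project each centred point onto the first principal component (2-D -> 1-D).
projected = [sum(c[j] * pc1[j] for j in range(2)) for c in centered]
```

Because the points lie close to the diagonal, `pc1` comes out close to (0.707, 0.707), and the single projected coordinate retains almost all of the variance.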

SUPERVISED AND UNSUPERVISED LEARNING

| Supervised Learning | Unsupervised Learning |
|---|---|
| Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data. |
| A supervised learning model takes direct feedback to check whether it is predicting the correct output. | An unsupervised learning model does not take any feedback. |
| A supervised learning model predicts the output. | An unsupervised learning model finds the hidden patterns in data. |
| In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model. |
| The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset. |
| Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model. |
| Supervised learning can be categorized into classification and regression problems. | Unsupervised learning can be classified into clustering and association problems. |
| A supervised learning model generally produces more accurate results. | An unsupervised learning model may give less accurate results compared to supervised learning. |
| It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc. | It includes various algorithms such as clustering (e.g. K-Means) and the Apriori algorithm. |

CLASSIFICATION—LINEAR AND NON-LINEAR

Classification algorithms can be divided mainly into two categories:

o Linear Models
    o Logistic Regression
    o Support Vector Machines
o Non-linear Models
    o K-Nearest Neighbours
    o Kernel SVM
    o Naïve Bayes
    o Decision Tree Classification
    o Random Forest Classification

PERCEPTRON

Perceptron is an algorithm used for supervised learning of binary classifiers. Binary classifiers
decide whether an input, usually represented by a vector of feature values, belongs to a specific class. A
perceptron is a single-layer neural network. It consists of four main parts: input
values, weights and bias, net sum, and an activation function.

The process begins by taking all the input values and multiplying them by their weights. Then,
all of these products are added together, along with the bias, to create the weighted sum. The weighted sum is
then passed to the activation function, producing the perceptron's output. The activation function
plays the integral role of ensuring the output is mapped to the required range, such as (0,1) or
(-1,1). It is important to note that the weight of an input indicates the strength of its influence on the output.
Similarly, the bias value gives the ability to shift the activation function curve up or down.

As a simplified form of a neural network, specifically a single-layer neural network, perceptrons
play an important role in binary classification. This means the perceptron is used to classify data
into two parts, hence binary. Sometimes, perceptrons are also referred to as linear binary
classifiers for this reason.
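
The weighted sum and activation described above can be sketched as follows. This is a minimal illustration that assumes a step activation and the classic perceptron learning rule (neither is spelled out in these notes). The logical AND data and all names are hypothetical.

```python
def step(z):
    """Step activation: map the weighted sum to class 0 or 1."""
    return 1 if z >= 0 else 0

def predict(weights, bias, inputs):
    # Weighted sum of the inputs plus the bias, then the activation function.
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(net)

def train(samples, labels, lr=1.0, epochs=20):
    """Perceptron learning rule: adjust each weight by lr * error * input."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Logical AND is linearly separable, so a single perceptron can learn it.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
w, b = train(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)  # [0, 0, 0, 1]
```

A perceptron can only draw a single straight decision boundary, so the same code would not converge on a problem such as XOR.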

LOGISTIC REGRESSION

• Logistic regression is one of the most popular machine learning algorithms and
belongs to the supervised learning techniques. It is used for predicting a categorical
dependent variable from a given set of independent variables.

• Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or
False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values
that lie between 0 and 1.

• Logistic regression is very similar to linear regression except in how they are
used. Linear regression is used for solving regression problems, whereas logistic
regression is used for solving classification problems.

• In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic
function, whose output is bounded by the two limiting values 0 and 1.

• The curve from the logistic function indicates the likelihood of something, such as
whether cells are cancerous or not, or whether a mouse is obese or not based on its weight.

• Logistic regression is a significant machine learning algorithm because it can
provide probabilities and classify new data using both continuous and discrete datasets.

• Logistic regression can be used to classify observations using different types of data
and can easily determine the most effective variables for the classification. The
image below shows the logistic function:

[Figure: the S-shaped logistic function]

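Logistic Regression Equation:

The logistic regression equation can be obtained from the linear regression equation. The
mathematical steps to get the logistic regression equation are given below:

o We know the equation of a straight line can be written as:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

o In logistic regression y can be between 0 and 1 only, so let's divide the above
equation by (1-y):

$$\frac{y}{1-y}, \quad \text{which is } 0 \text{ for } y = 0 \text{ and tends to infinity as } y \to 1$$

o To obtain a range from minus infinity to plus infinity, take the logarithm, which gives:

$$\log\left[\frac{y}{1-y}\right] = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$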
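
As a small sketch of the "S"-shaped logistic function and its inverse (the log-odds), the snippet below turns a linear combination into a probability and then into a class label. The coefficients `b0` and `b1` are hypothetical, and the 0.5 threshold is a common convention rather than something fixed by these notes.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, b0, b1):
    # Linear combination b0 + b1*x passed through the logistic function.
    return sigmoid(b0 + b1 * x)

def log_odds(p):
    """Logit: the inverse of the logistic function, log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -4.0, 2.0
p = predict_proba(3.0, b0, b1)   # probability of the positive class
label = 1 if p >= 0.5 else 0     # threshold at 0.5 to obtain a class
```

Applying `log_odds` to `p` recovers the linear term `b0 + b1 * x`, which is exactly the relationship the derivation arrives at.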
Types of Logistic Regression:

On the basis of the categories, logistic regression can be classified into three types:

o Binomial: In binomial logistic regression, the dependent variable can have only two possible
values, such as 0 or 1, Pass or Fail, etc.

o Multinomial: In multinomial logistic regression, the dependent variable can have 3 or more possible
unordered values, such as "cat", "dog", or "sheep".

o Ordinal: In ordinal logistic regression, the dependent variable can have 3 or more possible ordered
values, such as "low", "medium", or "high".

BOOSTING AND BAGGING

• Bagging (or Bootstrap Aggregation) is a simple and very powerful ensemble method.
Bagging is the application of the bootstrap procedure to a high-variance machine
learning algorithm, typically decision trees.
• The idea behind bagging is to combine the results of multiple models (for instance, all
decision trees) to get a generalized result. This is where bootstrapping comes into the picture:
subsets of observations are drawn from the original dataset with replacement.
• Bagging uses these subsets (bags) to get a fair idea
of the distribution (complete set). The size of the subsets created for bagging may be less
than the original set.

Bagging works as follows:

1. Multiple subsets are created from the original dataset, selecting observations with
replacement.
2. A base model (weak model) is created on each of these subsets.
3. The models run in parallel and are independent of each other.
4. The final predictions are determined by combining the predictions from all the models.

[Figure: diagram of the bagging process]
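
The four bagging steps can be sketched with a deliberately simple base model. Here the weak model is a single threshold on one feature (a "decision stump") and the combination rule is a majority vote. Both are illustrative assumptions, and the data and function names are hypothetical.

```python
import random

def fit_stump(points):
    """Weak model: best single threshold on a 1-D feature (predict 1 if x > t)."""
    best_t, best_errors = None, len(points) + 1
    for t, _ in points:
        errors = sum((x > t) != bool(y) for x, y in points)
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

def bagging_fit(points, n_models=25, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: drawn from the data with replacement.
        bag = [rng.choice(points) for _ in points]
        # Each base model is fitted independently on its own bag.
        models.append(fit_stump(bag))
    return models

def bagging_predict(models, x):
    # Combine the independent models by majority vote.
    votes = sum(x > t for t in models)
    return 1 if votes * 2 > len(models) else 0

# Hypothetical 1-D data: label 1 for the larger values of x.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
models = bagging_fit(data)
```

Calling `bagging_predict(models, 0)` returns class 0 and `bagging_predict(models, 10)` returns class 1, because every fitted threshold lies between the smallest and largest training value.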

• Boosting is a sequential process, where each subsequent model attempts to correct the
errors of the previous model. The succeeding models are dependent on the previous
model.
• In this technique, learners are trained sequentially, with early learners fitting simple
models to the data and then analyzing the data for errors. In other words, we fit consecutive
trees (each on a random sample) and, at every step, the goal is to reduce the net error from the prior
tree.
• When an input is misclassified by a hypothesis, its weight is increased so that the next
hypothesis is more likely to classify it correctly. Combining the whole set at the end
converts the weak learners into a better-performing model.
• Boosting works in the following steps:
1. A subset is created from the original dataset.
2. Initially, all data points are given equal weights.
3. A base model is created on this subset.
4. This model is used to make predictions on the whole dataset.
5. Errors are calculated using the actual values and predicted values.
6. The observations which are incorrectly predicted are given higher weights.
7. Another model is created and predictions are made on the dataset. (This model tries to
correct the errors from the previous model.)
8. Similarly, multiple models are created, each correcting the errors of the previous model.
9. The final model (strong learner) is the weighted mean of all the models (weak learners).
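
One concrete way to realise the reweighting idea is AdaBoost, sketched below with threshold stumps as the weak learners. The notes do not name a specific boosting algorithm, so treating it as AdaBoost is an assumption, and for simplicity each stump is fitted on the whole weighted dataset rather than on a subset. The data and function names are hypothetical.

```python
import math

def best_stump(xs, ys, weights):
    """Weak learner: threshold/polarity pair with the lowest weighted error."""
    best = None
    for t in xs:
        for polarity in (1, -1):
            preds = [polarity if x > t else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n                       # all points start with equal weight
    ensemble = []
    for _ in range(rounds):
        err, t, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)     # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)   # weight of this weak learner
        ensemble.append((alpha, t, polarity))
        # Misclassified points get larger weights, correct ones smaller.
        weights = [w * math.exp(-alpha * y * (polarity if x > t else -polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    # Strong learner: sign of the weighted sum of the weak learners' votes.
    score = sum(a * (p if x > t else -p) for a, t, p in ensemble)
    return 1 if score > 0 else -1

# Hypothetical 1-D data (labels +1 / -1) that no single threshold separates.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
```

No single stump classifies this data correctly, but the weighted combination of three stumps does.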

CLUSTERING---PARTITIONAL AND HIERARCHICAL; K-MEANS CLUSTERING

Cluster analysis, or clustering, is an unsupervised machine learning task.

It involves automatically discovering natural groupings in data. Unlike supervised learning,
clustering algorithms only interpret the input data and find natural groups or clusters.

Examples of Clustering Algorithms

1. BIRCH
2. DBSCAN
3. K-Means
4. Spectral Clustering
5. Gaussian Mixture Model
K-MEANS

The K-means clustering algorithm computes the centroids and iterates until it finds the optimal
centroids. It assumes that the number of clusters is already known. It is also called a flat
clustering algorithm. The number of clusters identified from the data by the algorithm is represented by
‘K’ in K-means.
In this algorithm, the data points are assigned to clusters in such a manner that the sum of the
squared distances between the data points and their centroids is minimized. Less variation
within a cluster means the data points in that cluster are more similar to one another.

Working of K-Means Algorithm

Step 1 − First, specify the number of clusters, K, to be generated by this
algorithm.
Step 2 − Next, randomly select K data points as the initial centroids and assign each data point to a cluster.
Step 3 − Now compute the cluster centroids.
Step 4 − Next, keep iterating the following until the optimal centroids are found, i.e. until the
assignment of data points to clusters no longer changes:
• 4.1 − First, compute the sum of squared distances between the data points and the centroids.
• 4.2 − Assign each data point to the cluster whose centroid is closest.
• 4.3 − Finally, recompute the centroid of each cluster by taking the average of all data
points in that cluster.
K-means follows the Expectation-Maximization approach to solve the problem. The Expectation
step assigns the data points to the closest cluster, and the Maximization step
computes the centroid of each cluster.
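
The loop described in the steps above can be sketched in plain Python. This is an illustrative version with a fixed random seed and squared Euclidean distance. The two-blob dataset and the function names are hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # K data points as initial centroids
    assignment = None
    for _ in range(max_iterations):
        # Expectation step: assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # assignments stopped changing
            break
        assignment = new_assignment
        # Maximization step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignment

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, labels = kmeans(points, k=2)
```

On this data the first three points end up in one cluster and the last three in the other. Which cluster is numbered 0 depends on the random initialisation.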
Applications of K-Means Clustering Algorithm

• Market segmentation
• Document Clustering
• Image segmentation
• Image compression
• Customer segmentation
• Analyzing the trend on dynamic data

EVALUATION METRICS :

Root mean square error (RMSE), or root mean square deviation, is one of the most commonly used
measures for evaluating the quality of predictions. It shows how far predictions fall from
measured true values using Euclidean distance.

To compute RMSE, calculate the residual (difference between prediction and truth) for each data
point, square each residual, compute the mean of the squared residuals, and take
the square root of that mean. RMSE is commonly used in supervised learning applications, as
it needs true measurements at each predicted data point.

Root mean square error can be expressed as

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y(i) - \hat{y}(i) \right)^2}$$

where N is the number of data points, y(i) is the i-th measurement, and ŷ(i) is its
corresponding prediction.

Mean Absolute Error (MAE)

Mean Absolute Error (also called L1 loss) is one of the simplest yet most robust loss functions
used for regression models.

MAE is the average of the absolute differences between the actual and the predicted
values. For a data point x_i and its predicted value y_i, with n being the total number of data points in the
dataset, the mean absolute error is defined as:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| x_i - y_i \right|$$

Coefficient of Determination (R Squared)

• The coefficient of determination is the square of the correlation (r); thus it ranges from 0
to 1.
• With linear regression, the coefficient of determination is equal to the square of the
correlation between the x and y variables.
• If R² is equal to 0, then the dependent variable cannot be predicted from the independent
variable.
• If R² is equal to 1, then the dependent variable can be predicted from the independent
variable without any error.
• If R² is between 0 and 1, it indicates the extent to which the dependent variable is
predictable. An R² of 0.10 means that 10 percent of the variance in the y variable is
predicted from the x variable; an R² of 0.20 means that 20 percent of the variance in the y variable
is predicted from the x variable, and so on.
The value of R² shows how well the model fits the given data set.
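
The three metrics in this section can be computed with a few lines of Python. In this sketch R² is calculated as 1 − SS_res / SS_tot, a common general definition that coincides with the squared correlation for simple linear regression with an intercept. The sample values are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical measurements and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(round(rmse(y_true, y_pred), 4))       # 0.7071
print(mae(y_true, y_pred))                  # 0.5
print(round(r_squared(y_true, y_pred), 4))  # 0.9
```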
TRAINING AND TESTING A CLASSIFIER
Training and testing is the process through which a system is trained and becomes
able to give accurate results. Learning is the most important phase, as how
well the system performs on the data provided to it depends on which algorithms
are used on the data. The entire dataset is divided into two parts: one used for training
the model, i.e. the training set, and the other used for testing the model after training, i.e. the
testing set.

• Training set:

The training set is used to build a model. It consists of the set of examples (for instance, images) that are used to train
the system. The training rules and algorithms used give relevant information on how to
associate input data with an output decision. The system is trained by applying these
algorithms to the dataset; all the relevant information is extracted from the data and
results are obtained. Generally, 80% of the dataset is taken as training data.

• Testing set:

Testing data is used to test the system. It is the set of data which is used to verify whether
the system is producing the correct output after being trained. Generally, 20% of
the dataset is used for testing.
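
An 80/20 split like the one described above can be sketched as follows. The helper `train_test_split` here is a hypothetical stand-alone function written for illustration (not the library function of the same name), and the shuffle seed is arbitrary.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split off the requested test fraction."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of 100 examples, represented by their indices.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))  # 80 20
```

Shuffling before splitting matters when the rows are ordered, for example by class or by date of collection.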

CROSS-VALIDATION

Cross-validation is a technique in which we train our model using a subset of the dataset
and then evaluate it using the complementary subset.
The three steps involved in cross-validation are as follows:
1. Reserve some portion of the sample dataset.
2. Train the model using the rest of the dataset.
3. Test the model using the reserved portion of the dataset.
Methods of Cross Validation
• Validation
• LOOCV (Leave One Out Cross Validation)
• K-Fold Cross Validation
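
As a sketch of K-fold cross-validation, the generator below produces the train and test indices for each fold. It assumes the data has already been shuffled, and the name `k_fold_indices` is hypothetical. Setting k equal to the number of samples gives LOOCV.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
print(folds[0])  # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
```

Each sample appears in exactly one test fold, so every data point is used for both training and evaluation across the k rounds.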

HANDLING- EXPLORATORY DATA ANALYSIS (EDA)

Steps in Data Exploration and Preprocessing:

1. Identification of variables and data types
2. Analyzing the basic metrics
3. Non-Graphical Univariate Analysis
4. Graphical Univariate Analysis
5. Bivariate Analysis
6. Variable transformations
7. Missing value treatment
8. Outlier treatment
9. Correlation Analysis
10. Dimensionality Reduction

ROC CURVE

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a
classification model at all classification thresholds. This curve plots two parameters:

• True Positive Rate
• False Positive Rate

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

$$TPR = \frac{TP}{TP + FN}$$

False Positive Rate (FPR) is defined as follows:

$$FPR = \frac{FP}{FP + TN}$$

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the
classification threshold classifies more items as positive, thus increasing both False Positives and
True Positives. The following figure shows a typical ROC curve.

To compute the points in an ROC curve, we could evaluate a logistic regression model many
times with different classification thresholds, but this would be inefficient. Fortunately, there's
an efficient, sorting-based algorithm that can provide this information for us, called AUC.

AUC: Area Under the ROC Curve

AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-
dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).
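
The sorting-based idea can be sketched as follows: sort the examples by score, lower the threshold one example at a time to trace the ROC points, and integrate with the trapezoidal rule. This is a minimal illustration that assumes no tied scores and at least one example of each class. The labels and scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one example at a time, highest score first.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    # Trapezoidal rule over consecutive ROC points.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical true labels and predicted scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
area = auc(roc_points(labels, scores))
print(round(area, 4))  # 0.8889
```

The result 8/9 equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, which is a standard interpretation of AUC.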
(COST FUNCTIONS : same as evaluation functions)
UNIT - 4

1. MULTILAYER PERCEPTRON
The Perceptron consists of an input layer and an output layer which are fully connected. MLPs
have the same input and output layers but may have multiple hidden layers in between the
aforementioned layers, as seen below.

The algorithm for the MLP is as follows:

1. Just as with the perceptron, the inputs are pushed forward through the MLP by taking the
dot product of the input with the weights that exist between the input layer and the hidden
layer (WH). This dot product yields a value at the hidden layer.
2. MLPs utilize activation functions at each of their calculated layers. There are many
activation functions to discuss: rectified linear units (ReLU), sigmoid function,
tanh. Push the calculated output at the current layer through any of these activation
functions.
3. Once the calculated output at the hidden layer has been pushed through the activation
function, push it to the next layer in the MLP by taking the dot product with the
corresponding weights.
4. Repeat steps two and three until the output layer is reached.
5. At the output layer, the calculations will either be used for a backpropagation algorithm
that corresponds to the activation function that was selected for the MLP or a decision
will be made based on the output.
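
The forward pass described in steps 1 to 4 can be sketched for a tiny network. The layer sizes, weights and biases below are hypothetical, and biases are included even though the steps above mention only the weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def dense(inputs, weights, biases, activation):
    """One fully connected layer: dot products with the weights, then the activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def mlp_forward(inputs, layers):
    # Each layer is a (weights, biases, activation) triple, applied in order.
    out = inputs
    for weights, biases, activation in layers:
        out = dense(out, weights, biases, activation)
    return out

# Hypothetical 2-2-1 network: ReLU hidden layer, sigmoid output layer.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    ([[1.0, 1.0]], [-1.0], sigmoid),
]
output = mlp_forward([2.0, 1.0], layers)
```

For the input (2, 1) the hidden layer produces (1.0, 1.5) and the output neuron returns sigmoid(1.5), roughly 0.82.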

MLPs form the basis for many neural network architectures and have greatly improved the power of computers
when applied to classification and regression problems.
2. BACK PROPAGATION
Back propagation is a supervised learning algorithm for training multi-layer perceptrons
(artificial neural networks).

The training loop can be summarized as follows:

• Calculate the error – how far your model output is from the actual output.
• Minimum error – check whether the error is minimized or not.
• Update the parameters – if the error is large, update the parameters (weights and
biases). After that, check the error again. Repeat the process until the error becomes
minimal.
• Model is ready to make a prediction – once the error becomes minimal, you can feed
some inputs to your model and it will produce the output.

The backpropagation algorithm looks for the minimum value of the error function in weight
space using a technique called the delta rule or gradient descent. The weights that minimize the
error function are then considered to be a solution to the learning problem.

• We first initialize ‘W’ with some random value and propagate forward.
• Then, we notice that there is some error. To reduce that error, we propagate backwards
and increase the value of ‘W’.
• After that, if we notice that the error has increased, we know that we should not
increase the value of ‘W’.
• So, we propagate backwards again and decrease the value of ‘W’.
• Now, if we notice that the error has reduced, we keep moving in the decreasing direction
until we reach the ‘Global Loss Minimum’.
Example: Consider a neural network with the following properties.

The network contains the following:

• two inputs
• two hidden neurons
• two output neurons
• two biases

Below are the steps involved in Backpropagation:

• Step – 1: Forward Propagation
• Step – 2: Backward Propagation
• Step – 3: Putting all the values together and calculating the updated weight value

Step – 1: Forward Propagation

We will start by propagating forward.

Repeat this process for the output layer neurons, using the output from the hidden layer neurons
as inputs.

The value of the error:

Step – 2: Backward Propagation

Now, we will propagate backwards. This way we will try to reduce the error by changing the
values of weights and biases.

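Consider W5. We will calculate the rate of change of the error w.r.t. the change in weight W5.

Since we are propagating backwards, the first thing we need to do is calculate the change in the total
error w.r.t. the outputs O1 and O2.

Now, propagate further backwards and calculate the change in the output O1 w.r.t. its total net
input.

How much does the total net input of O1 change w.r.t. W5?

Step – 3: Putting all the values together and calculating the updated weight value

Now, put all the values together:

$$\frac{\partial E_{total}}{\partial W_5} = \frac{\partial E_{total}}{\partial out_{O1}} \cdot \frac{\partial out_{O1}}{\partial net_{O1}} \cdot \frac{\partial net_{O1}}{\partial W_5}$$

The updated value of W5, where η is the learning rate:

$$W_5^{+} = W_5 - \eta \, \frac{\partial E_{total}}{\partial W_5}$$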

• Similarly, we can calculate the other weight values as well.
• After that, we propagate forward again and calculate the output. Again, we
calculate the error.
• If the error is at a minimum, we stop right there; otherwise we propagate backwards again
and update the weight values.
• This process keeps repeating until the error becomes minimal.
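
The three steps can be traced numerically for a single weight. The sketch below assumes a 2-2-2 network with sigmoid activations and a squared-error loss. All input values, initial weights, targets and the learning rate are hypothetical and are not taken from the figures of these notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values for illustration only.
i1, i2 = 0.05, 0.10
w1, w2, w3, w4, b1 = 0.15, 0.20, 0.25, 0.30, 0.35   # input -> hidden
w5, w6, w7, w8, b2 = 0.40, 0.45, 0.50, 0.55, 0.60   # hidden -> output
target_o1, target_o2 = 0.01, 0.99
learning_rate = 0.5

# Step 1: forward propagation.
out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)
out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)
out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)
error_total = 0.5 * (target_o1 - out_o1) ** 2 + 0.5 * (target_o2 - out_o2) ** 2

# Step 2: backward propagation (chain rule for W5).
d_error_d_out = out_o1 - target_o1      # change in total error w.r.t. output O1
d_out_d_net = out_o1 * (1 - out_o1)     # derivative of the sigmoid at O1
d_net_d_w5 = out_h1                     # net input of O1 changes with W5 by out_h1
grad_w5 = d_error_d_out * d_out_d_net * d_net_d_w5

# Step 3: gradient-descent update of W5.
w5_new = w5 - learning_rate * grad_w5
print(round(w5_new, 4))  # 0.3589
```

The gradient is positive here, so the update decreases W5. Repeating the same calculation for every weight and then propagating forward again gives one full training iteration.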
3. LOSS FUNCTIONS
Loss functions measure how far an estimated value is from its true value. A loss function maps
decisions to their associated costs. Loss functions are not fixed; they change depending on the
task in hand and the goal to be met.

Loss functions for regression

Regression involves predicting a specific value that is continuous in nature. Estimating the price
of a house or predicting stock prices are examples of regression.

Mean Absolute Error (MAE)

Mean Absolute Error (also called L1 loss) is one of the simplest yet most robust loss functions
used for regression models.

MAE is the average of the absolute differences between the actual and the predicted
values. For a data point x_i and its predicted value y_i, with n being the total number of data points in the
dataset, the mean absolute error is defined as:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| x_i - y_i \right|$$

Mean Squared Error (MSE)

Mean Squared Error (also called L2 loss) is one of the most widely preferred
loss functions for regression.

Mean Squared Error is the average of the squared differences between the actual and the
predicted values. For a data point Y_i and its predicted value Ŷ_i, where n is the total number of
data points in the dataset, the mean squared error is defined as:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2$$

Mean Bias Error (MBE)

Mean Bias Error takes the actual difference between the target and the predicted value, and not
the absolute difference. One has to be cautious as the positive and the negative errors could
cancel each other out, which is why it is one of the lesser-used loss functions.

The formula of Mean Bias Error is:

$$\mathrm{MBE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)$$

Mean Squared Logarithmic Error (MSLE)

Calculating the Mean Squared Logarithmic Error is the same as Mean Squared Error, except that the
natural logarithm of the actual and predicted values (typically log(1 + value)) is used rather than the values themselves.

Loss functions for classification

Classification problems involve predicting a discrete class output. For example, mail can be classified as
spam or not spam, and a person’s dietary preferences can be put in one of three categories:
vegetarian, non-vegetarian and vegan.

Binary Cross Entropy Loss

Cross-entropy loss, or log loss, measures the performance of a classification model whose output
is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability
diverges from the actual label. So predicting a probability of .012 when the actual observation
label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of
0.
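
The loss functions in this section can be written as short Python functions. This is a sketch under two assumptions: MSLE uses log(1 + value) for both the actual and the predicted values, and MBE uses the sign convention actual minus predicted. The sample inputs are hypothetical.

```python
import math

def mse(y, y_hat):
    """Mean Squared Error: average squared difference."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def mbe(y, y_hat):
    """Mean Bias Error: signed differences, so errors can cancel out."""
    return sum(a - b for a, b in zip(y, y_hat)) / len(y)

def msle(y, y_hat):
    """Mean Squared Logarithmic Error, using log(1 + value) on both sides."""
    return sum((math.log1p(a) - math.log1p(b)) ** 2 for a, b in zip(y, y_hat)) / len(y)

def binary_cross_entropy(y, p):
    """Log loss for labels y in {0, 1} and predicted probabilities p in (0, 1)."""
    return -sum(a * math.log(q) + (1 - a) * math.log(1 - q)
                for a, q in zip(y, p)) / len(y)

print(mbe([1, 2, 3], [2, 1, 3]))                     # 0.0 (errors cancel)
print(round(binary_cross_entropy([1], [0.012]), 2))  # 4.42 (confident and wrong)
print(round(binary_cross_entropy([1], [0.99]), 2))   # 0.01 (confident and right)
```

The first line shows why MBE must be read with care: the predictions are wrong on two of three points, yet the bias is zero.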

4. EPOCHS AND BATCH SIZES

An epoch means training the neural network with all the training data for one cycle. In an epoch,
we use all of the data exactly once. A forward pass and a backward pass together are counted as
one pass.
An epoch is made up of one or more batches, where we use a part of the dataset to train the
neural network. We call passing through the training examples in a batch an iteration.
An epoch is sometimes confused with an iteration. To clarify the concepts, let’s consider a simple
example where we have 1000 data points:

If the batch size is 1000, we can complete an epoch with a single iteration. Similarly, if the batch
size is 500, an epoch takes two iterations, and if the batch size is 100, an epoch takes 10 iterations
to complete. Put simply, for each epoch, the required number of iterations times the batch size gives
the number of data points.
We can use multiple epochs in training. In this case, the neural network is fed the same data
more than once.
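
The relationship between batch size, iterations and epochs can be checked with a short sketch. The helper names are hypothetical, and the ceiling division covers the case where the last batch is smaller than the rest.

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

def batches(data, batch_size):
    """Yield consecutive slices of the data, one per iteration."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

data = list(range(1000))   # 1000 data points, as in the example above
for batch_size in (1000, 500, 100):
    print(batch_size, iterations_per_epoch(len(data), batch_size))
# 1000 1
# 500 2
# 100 10
```

Training for several epochs simply repeats the inner loop over `batches(data, batch_size)` once per epoch.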

5. HYPER PARAMETER TUNING

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set
of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose
value is used to control the learning process. By contrast, the values of other parameters
(typically node weights) are learned.
The same kind of machine learning model can require different constraints, weights or learning
rates to generalize different data patterns. These settings are called hyperparameters, and they have
to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter
optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a
predefined loss function on given independent data.

Approaches

Grid search
One traditional and popular way to perform hyperparameter tuning is by using an exhaustive grid
search. This method tries every possible combination of the specified hyperparameter values. Using this
method, we can find the best set of values in the parameter search space. It usually uses more
computational power and takes a long time to run, since it needs to try every
combination in the grid.

Randomized Search
The main difference in RandomizedSearchCV, compared with GridSearchCV, is that instead
of trying every possible combination, it chooses hyperparameter combinations
randomly from the grid space. For this reason, there is no guarantee that we will find the best
result, as we would with grid search. However, this search can be extremely effective in practice, as the computational
time is much lower.
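
The difference between the two strategies can be sketched without any library. The `validation_loss` function below is a hypothetical stand-in for "train a model with these hyperparameters and measure its validation loss", and the grid values are made up for illustration.

```python
import itertools
import random

def validation_loss(params):
    # Hypothetical stand-in for training a model and measuring validation loss.
    return (params["max_depth"] - 6) ** 2 + (params["n_estimators"] - 100) ** 2 / 1000

grid = {"max_depth": [2, 4, 6, 8], "n_estimators": [50, 100, 200]}

def grid_search(grid, loss):
    """Evaluate every combination in the grid and keep the best one."""
    keys = list(grid)
    candidates = [dict(zip(keys, values)) for values in itertools.product(*grid.values())]
    return min(candidates, key=loss)

def random_search(grid, loss, n_iter=5, seed=0):
    """Evaluate only n_iter randomly drawn combinations."""
    rng = random.Random(seed)
    candidates = [{k: rng.choice(v) for k, v in grid.items()} for _ in range(n_iter)]
    return min(candidates, key=loss)

best_grid = grid_search(grid, validation_loss)      # 12 evaluations
best_random = random_search(grid, validation_loss)  # 5 evaluations
print(best_grid)  # {'max_depth': 6, 'n_estimators': 100}
```

Grid search evaluates all 12 combinations and is guaranteed to return the best one in the grid. Random search evaluates only 5, so its result may be the same or slightly worse.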
Other approaches :

Bayesian optimization
Gradient-based optimization
Evolutionary optimization

Some important Parameters in Random Forest:

1. max_depth: int, default=None: This sets the maximum depth of each tree
in the forest.

2. criterion: {"gini", "entropy"}, default="gini": The function used to measure the quality of each split.

3. min_samples_leaf: int or float, default=1: The minimum
number of samples required at a leaf node, i.e. at the end of each decision tree branch.

4. n_estimators: int, default=100: This is perhaps the most important parameter. It
represents the number of trees you want to build within a random forest before aggregating
the predictions.

6. APPLICATIONS TO CLASSIFICATION, REGRESSION AND UNSUPERVISED LEARNING
a. Applications to classification

Sentiment Analysis

Sentiment analysis is a machine learning text analysis technique that assigns sentiment (opinion,
feeling, or emotion) to words within a text, or an entire text, on a polarity scale
of Positive, Negative, or Neutral.

It can automatically read through thousands of pages in minutes or constantly monitor social
media for posts about you. Each individual statement is then tagged with its polarity, for example
as Positive. This allows companies to follow product releases and marketing campaigns in real
time, to see how customers are reacting.

Email Spam

One of the most common uses of classification, working non-stop and with little need for human
interaction, email spam classification saves us from tedious deletion tasks and sometimes even
costly phishing scams.

Email applications use the above algorithms to calculate the likelihood that an email is either not
intended for the recipient or unwanted spam. Using text analysis classification techniques, spam
emails are weeded out from the regular inbox: perhaps a recipient’s name is spelled incorrectly,
or certain scamming keywords are used.

Document Classification

Document classification is the ordering of documents into categories according to their content.
This was previously done manually, as in the library sciences or hand-ordered legal files.
Machine learning classification algorithms, however, allow this to be performed automatically

Image Classification

Image classification assigns previously trained categories to a given image. These could be the
subject of the image, a numerical value, a theme, etc. Image classification can even use multi-
label image classifiers, that work similarly to multi-label text classifiers, to tag an image of a
stream, for example, into different labels, like “stream,” “water,” “outdoors,” etc.

b. Applications to regression
Forecasting

A top advantage of using a linear regression model in machine learning is the ability to forecast
trends and make feasible predictions. Data scientists can use these predictions to make further
deductions. The process is quick, efficient, and accurate, mainly because machines process large
volumes of data with minimal human intervention. Once the algorithm is established, the process
of learning becomes simplified.
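As a minimal sketch of forecasting with linear regression, the example below fits a straight line by ordinary least squares and extends it one step ahead. The monthly sales figures and the helper name fit_line are hypothetical and only serve to illustrate the idea.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: sales observed in months 1..5.
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, sales)

# Forecast month 6 by extending the fitted line.
forecast = slope * 6 + intercept
print(slope, intercept, forecast)
```

In practice a library such as scikit-learn would normally be used; the hand-written fit is only meant to show what the model computes.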

Beneficial to small businesses

By altering one or two variables, machines can understand the impact on sales. Since deploying
linear regression is cost-effective, it is greatly advantageous to small businesses, because
short- and long-term sales forecasts can be made. This means that small businesses can plan
their resources well and create a growth trajectory for themselves. They will also be able to
understand the market and its preferences and learn about supply and demand.

Preparing Strategies
Since machine learning enables prediction, one of the biggest advantages of a linear regression
model is the ability to prepare a strategy for a given situation well in advance and analyze
various outcomes. Meaningful information can be derived from the regression forecasting model,
thereby helping companies plan strategically and make executive decisions.
c. Applications to unsupervised learning

Some applications of unsupervised learning techniques are:

 Clustering automatically splits the dataset into groups based on their similarities.
 Anomaly detection can discover unusual data points in your dataset. It is useful for
finding fraudulent transactions.
 Association mining identifies sets of items which often occur together in your dataset.
 Latent variable models are widely used for data preprocessing, such as reducing the number
of features in a dataset or decomposing the dataset into multiple components.

7. RECURRENT NEURAL NETWORK

A Recurrent Neural Network works on the principle of saving the output of a particular layer and
feeding this back to the input in order to predict the output of the layer.

Below is how you can convert a Feed-Forward Neural Network into a Recurrent Neural Network:

The nodes in different layers of the neural network are compressed to form a single layer of
the recurrent neural network. Here, “x” is the input layer, “h” is the hidden layer, and “y” is
the output layer. A, B, and C are the network parameters used to improve the output of the
model. At any given time t, the current input is a combination of the input at x(t) and x(t-1).
The output at any given time is fed back into the network to improve the output.
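The feedback loop can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the definitive formulation: here A is taken as the input-to-hidden weights, B as the hidden-to-hidden weights and C as the hidden-to-output weights, the activation is tanh, and all numeric values are made up.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(x_t, h_prev, A, B, C):
    """One time step: h(t) = tanh(A x(t) + B h(t-1)), y(t) = C h(t)."""
    pre_activation = [a + b for a, b in zip(matvec(A, x_t), matvec(B, h_prev))]
    h_t = [math.tanh(p) for p in pre_activation]
    y_t = matvec(C, h_t)
    return h_t, y_t


# Assumed parameter values (1 input, 2 hidden units, 1 output).
A = [[0.5], [-0.3]]            # input  -> hidden
B = [[0.1, 0.2], [0.0, 0.4]]   # hidden -> hidden (the recurrent connection)
C = [[1.0, -1.0]]              # hidden -> output

h = [0.0, 0.0]                 # initial hidden state
outputs = []
for x in [[1.0], [0.5], [-1.0]]:
    # The hidden state from the previous step is fed back in here.
    h, y = rnn_step(x, h, A, B, C)
    outputs.append(y[0])

print(outputs)
```

Because h is carried from one step to the next, each output depends on the earlier inputs as well as the current one, which is what a feed-forward network cannot do.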

Recurrent neural networks were created because there were a few issues in the feed-forward
neural network:

 Cannot handle sequential data

 Considers only the current input
 Cannot memorize previous input

Applications of Recurrent Neural Networks

Image Captioning
Time Series Prediction
Natural Language Processing

Types of Recurrent Neural Networks

There are four types of Recurrent Neural Networks:

1. One to One
2. One to Many
3. Many to One
4. Many to Many
8. CONVOLUTIONAL NEURAL NETWORKS

Convolutional Neural Networks are designed to address image recognition and classification
problems. They have wide applications in image and video recognition, recommendation systems
and natural language processing.

A convolutional neural network is a feed-forward neural network, often with up to 20 or 30
layers. The power of a convolutional neural network comes from a special kind of layer called
the convolutional layer.

Convolutional neural networks contain many convolutional layers stacked on top of each other,
each one capable of recognizing more sophisticated shapes. With three or four convolutional
layers it is possible to recognize handwritten digits and with 25 layers it is possible to
distinguish human faces.

The architecture of a convolutional neural network is a multi-layered feed-forward neural
network, made by stacking many hidden layers on top of each other in sequence. It is this
sequential design that allows convolutional neural networks to learn hierarchical features.

The hidden layers are typically convolutional layers followed by activation layers, some of them
followed by pooling layers.

There are four layer concepts in Convolutional Neural Networks:

1. Convolution
2. ReLU
3. Pooling
4. Full Connectedness (Fully Connected Layer)
Convolution Layer

This is the first step in the process of extracting valuable features from an image. A convolution
layer has several filters that perform the convolution operation. Every image is considered as a
matrix of pixel values.
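To make the convolution operation concrete, here is a small sketch that slides one filter over an image stored as a matrix of pixel values. The 4x4 image and the 2x2 vertical-edge filter are made-up values, and the function name convolve2d is just a label for this example. As in most deep learning libraries, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and
    return the resulting feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the window element-wise with the kernel and sum.
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical filter that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
```

The feature map is largest in the middle column, which is where the edge sits in the image.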

ReLU layer

ReLU stands for the rectified linear unit. Once the feature maps are extracted, the next step is
to move them to a ReLU layer. ReLU performs an element-wise operation and sets all the negative
pixels to 0. It introduces non-linearity to the network, and the generated output is a rectified
feature map.

Pooling Layer

Pooling is a down-sampling operation that reduces the dimensionality of the feature map. The
rectified feature map now goes through a pooling layer to generate a pooled feature map.
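The ReLU and pooling steps can be sketched as follows. The feature map values are invented for the example, and max pooling with a 2x2 window is assumed here; the notes do not say which pooling variant is meant, and average pooling is another common choice.

```python
def relu(feature_map):
    """Set every negative value to 0, element by element."""
    return [[max(0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map produced by a convolution layer.
feature_map = [
    [1, -2, 3, 0],
    [-1, 5, -3, 2],
    [0, -4, -1, -2],
    [2, 1, -5, 4],
]

rectified = relu(feature_map)   # rectified feature map
pooled = max_pool(rectified)    # pooled feature map (2x2)
print(rectified)
print(pooled)
```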
9. LONG SHORT-TERM MEMORY (LSTM)

Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture
used in the field of deep learning. Unlike standard feedforward neural networks, LSTM has
feedback connections. It can process not only single data points, but also entire sequences
of data.

For example, LSTM is applicable to tasks such as unsegmented, connected handwriting recognition
and speech recognition. A common LSTM unit is composed of a cell, an input gate, an output gate
and a forget gate. The cell remembers values over arbitrary time intervals and the three gates
regulate the flow of information into and out of the cell.
LSTM networks are well-suited to classifying, processing and making predictions based
on time series data. LSTMs were developed to deal with the vanishing gradient problem that
can be encountered when training traditional RNNs. Relative insensitivity to gap length is an
advantage of LSTM over traditional RNNs.
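The interplay of the cell and the three gates can be sketched with a single scalar LSTM unit. This follows the commonly used LSTM equations; the weight values, the dictionary layout of params and the name lstm_step are assumptions made for this illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a scalar LSTM unit.

    params maps a gate name to (input weight, recurrent weight, bias).
    """
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x_t + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell value to keep
    i = gate("input", sigmoid)         # how much new information to let in
    o = gate("output", sigmoid)        # how much of the cell to expose
    g = gate("candidate", math.tanh)   # candidate value for the cell

    c_t = f * c_prev + i * g           # the cell remembers across time steps
    h_t = o * math.tanh(c_t)           # output of the unit
    return h_t, c_t


# Assumed weights, chosen arbitrarily for the example.
params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.6, 0.2, 0.0),
    "output": (0.7, 0.3, 0.0),
    "candidate": (0.8, 0.4, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

When the forget gate stays close to 1 the cell value is carried forward almost unchanged, which is how an LSTM bridges long gaps in a sequence.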

UNIT - 5
1. RECOMMENDER SYSTEMS

Recommender systems are so commonplace now that many of us use them without even
knowing it. Because we can't possibly look through all the products or content on a website, a
recommendation system plays an important role in helping us have a better user experience.

Some examples of recommender systems in action include product recommendations on
Amazon, Netflix suggestions for movies and TV shows in your feed, recommended videos on
YouTube, music on Spotify, the Facebook newsfeed and Google Ads.

HOW DO RECOMMENDER SYSTEMS WORK

UNDERSTANDING RELATIONSHIPS
User-Product Relationship
The user-product relationship occurs when some users have a preference towards specific
products that they need. For example, a cricket player might have a preference for cricket-related
items, thus the e-commerce website will build a user-product relation of player->cricket.

Product-Product Relationship
Product-product relationships occur when items are similar in nature, either by appearance or
description. Some examples include books or music of the same genre, dishes from the same
cuisine, or news articles from a particular event.

User-User Relationship
User-user relationships occur when some customers have similar taste with respect to a particular
product or service. Examples include mutual friends, similar backgrounds, similar age, etc.
DATA & RECOMMENDER SYSTEMS

In addition to relationships, recommender systems utilize the following kinds of data:

User Behavior Data

User behavior data is information about the user's engagement with the product. It can
be collected from ratings, clicks and purchase history.

User Demographic Data

User demographic information is related to the user’s personal information such as age,
education, income and location.

Product Attribute Data

Product attribute data is information related to the product itself such as genre in case of books,
cast in case of movies, and cuisine in case of food.

There are two particularly important kinds of rating: explicit and implicit.

Explicit Ratings
Explicit ratings are provided by the user and directly express the user’s preference. Examples
include star ratings, reviews, feedback, likes and following. Since users don't always rate
products, explicit ratings can be hard to get.

Implicit Ratings
Implicit ratings are inferred from how users interact with an item. They reflect a user’s
behavior and are easy to get because users generate them simply by clicking. Examples include
clicks, views and purchases.

Product Similarity (Item-Item Filtering)

Product similarity is the most useful system for suggesting products based on how much the user
would like the product. If the user is browsing or searching for a particular product, they can be
shown similar products. Users often expect to find products they want quickly and move on if
they have a hard time finding the relevant product. When the user clicks on one product we can
show another similar product, or if the user buys the product we can email the user
advertisements or coupons based on a similar product.
User Similarity (User-User Filtering)
User similarity measures how alike two users are. If two users have similar preferences for a
product we can assume they have similar interests. It’s like a friend recommending a product.

Similarity Measures

Minkowski Distance

Manhattan Distance

Euclidean Distance

Pearson Coefficient
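The four measures listed above can be sketched with their standard textbook definitions. For two vectors a and b, the Minkowski distance of order p is the p-th root of the sum of |a_i - b_i|^p; Manhattan distance is the case p = 1 and Euclidean distance is the case p = 2. The Pearson coefficient is the covariance of a and b divided by the product of their standard deviations. The rating vectors below are made up for illustration.

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def manhattan(a, b):
    """Minkowski distance with p = 1."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Minkowski distance with p = 2."""
    return minkowski(a, b, 2)


def pearson(a, b):
    """Pearson correlation coefficient, between -1 and 1.

    Note: undefined (division by zero) if either vector is constant.
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return covariance / math.sqrt(var_a * var_b)


# Hypothetical ratings that two users gave to the same three products.
u = [5, 3, 4]
v = [2, 7, 4]

print(manhattan(u, v))
print(euclidean(u, v))
print(pearson(u, v))
```

A small distance means the two users (or products) are similar, while for the Pearson coefficient a value close to 1 means similar and a value close to -1 means opposite tastes.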
APPROACHES TO RECOMMENDER SYSTEMS

CONTENT BASED FILTERING RECOMMENDER

Content-based recommendation systems use their knowledge about each product to recommend
new ones. Recommendations are based on attributes of the item. Content-based recommender
systems work well when descriptive data on the content is provided beforehand. “Similarity” is
measured against product attributes.

Suppose I watch a movie in a particular genre, then I will be recommended movies within that
specific genre. The movie's attributes, like title, year of release, director and cast, are also helpful
in identifying similar movie content.
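A content-based recommender can be sketched by describing each movie with an attribute vector and ranking the other movies by similarity to the one just watched. The movie titles, the three genre columns and the use of cosine similarity are all assumptions for this example; the notes do not prescribe a particular similarity measure here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical catalogue. Columns: action, comedy, sci-fi (1 = has the genre).
movies = {
    "Movie A": [1, 0, 1],
    "Movie B": [1, 0, 0],
    "Movie C": [0, 1, 0],
    "Movie D": [1, 0, 1],
}


def recommend(watched, movies, top_n=2):
    """Return the top_n movies whose attributes are closest to the watched one."""
    scores = [
        (cosine(movies[watched], vector), title)
        for title, vector in movies.items()
        if title != watched
    ]
    scores.sort(reverse=True)
    return [title for _, title in scores[:top_n]]


print(recommend("Movie A", movies))
```

Only the attributes of the items are used; no information about other users is needed.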
COLLABORATIVE FILTERING RECOMMENDER

A collaborative filtering recommender makes suggestions based on how users rated products in
the past, not based on the products themselves. It only knows how other customers rated the
product. “Similarity” is measured between users.
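A minimal user-user collaborative filtering sketch is shown below. It predicts a missing rating as the similarity-weighted average of the ratings given by other users. The user names, the ratings and the choice of cosine similarity over co-rated items are assumptions for illustration; real systems often use mean-centred ratings or the Pearson coefficient instead.

```python
import math

# Hypothetical rating data: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"item1": 5, "item2": 4, "item3": 1},
    "bob": {"item1": 5, "item2": 5, "item3": 1, "item4": 5},
    "carol": {"item1": 1, "item2": 2, "item3": 5, "item4": 1},
}


def user_similarity(r1, r2):
    """Cosine similarity computed over the items both users have rated."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    norm1 = math.sqrt(sum(r1[i] ** 2 for i in common))
    norm2 = math.sqrt(sum(r2[i] ** 2 for i in common))
    return dot / (norm1 * norm2)


def predict(user, item, ratings):
    """Similarity-weighted average of other users' ratings for the item."""
    numerator = 0.0
    denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = user_similarity(ratings[user], other_ratings)
        numerator += similarity * other_ratings[item]
        denominator += similarity
    return numerator / denominator if denominator else None


predicted = predict("alice", "item4", ratings)
print(predicted)
```

Alice's past ratings resemble Bob's much more than Carol's, so the prediction for item4 is pulled towards Bob's rating.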
2. IMAGE CLASSIFICATION
Classification between objects is a fairly easy task for us, but it has proved to be a complex one
for machines, and therefore image classification has been an important task within the field of
computer vision. Image classification refers to the labeling of images into one of a number of
predefined classes.

Some examples of image classification include:

 Labeling an x-ray as cancer or not (binary classification).

 Classifying a handwritten digit (multiclass classification).

 Assigning a name to a photograph of a face (multiclass classification).

The advancements in the field of autonomous driving also serve as a great example of the use of
image classification in the real world. For example, we can build an image classification model
that recognizes various objects, such as other vehicles, pedestrians, traffic lights,
and signposts on the road.

Structure of an Image Classification Task

 Image Preprocessing - The aim of this process is to improve the image data (features) by
suppressing unwanted distortions and enhancing important image features, so that our
computer vision models can benefit from the improved data.
 Detection of an object - Detection refers to the localization of an object, which means
segmenting the image and identifying the position of the object of interest.
 Feature extraction and Training - This is a crucial step wherein statistical or deep learning
methods are used to identify the most interesting patterns of the image, features that might be
unique to a particular class and that will, later on, help the model to differentiate between
different classes. This process where the model learns the features from the dataset is called
model training.
 Classification of the object - This step categorizes detected objects into predefined classes
by using a suitable classification technique that compares the image patterns with the target
patterns.
Image Classification Techniques
We will start with some statistical machine learning classifiers like Support Vector
Machine and Decision Tree and then move on to deep learning architectures like Convolutional
Neural Networks.

Performance evaluation

3. SOCIAL NETWORK GRAPHS

The essential characteristics of a social network are:

There is a collection of entities that participate in the network. Typically, these entities are
people.

There is at least one relationship between entities of the network. On Facebook or its ilk, this
relationship is called friends. Sometimes the relationship is all-or-nothing; two people are either
friends or they are not. However, in other examples of social networks, the relationship has a
degree.

There is an assumption of nonrandomness or locality. This condition is the hardest to formalize,
but the intuition is that relationships tend to cluster. That is, if entity A is related to both B
and C, then there is a higher probability than average that B and C are related.

Social Networks as Graphs

Social networks are naturally modeled as graphs, which we sometimes refer to as a social graph.
The entities are the nodes, and an edge connects two nodes if the nodes are related by the
relationship that characterizes the network. If there is a degree associated with the relationship,
this degree is represented by labeling the edges.

Figure is an example of a tiny social network. The entities are the nodes A through G. The
relationship, which we might think of as “friends,” is represented by the edges. For instance, B is
friends with A, C, and D.
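A social graph of this kind can be stored as an adjacency structure. In the sketch below, only the fact that B is friends with A, C and D comes from the text; the remaining edges among the nodes A through G are assumed for illustration, and build_graph is a hypothetical helper name.

```python
# Edge list for a tiny social network on nodes A..G.
# Only B's friendships (A, C, D) are stated in the text; the rest is assumed.
edges = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"),
    ("D", "E"), ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G"),
]


def build_graph(edges):
    """Build an undirected graph as a dict: node -> set of neighbouring nodes."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


graph = build_graph(edges)

# The friends of B are simply B's neighbours in the graph.
print(sorted(graph["B"]))

# "Friends" is a symmetric relationship, so every edge appears in both directions.
print(all(u in graph[v] for u in graph for v in graph[u]))
```

If the relationship had a degree, the sets could be replaced by dictionaries mapping each neighbour to an edge label or weight.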

Varieties of Social Networks

Telephone Networks

Here the nodes represent phone numbers, which are really individuals. There is an edge between
two nodes if a call has been placed between those phones in some fixed period of time, such as
last month, or “ever.” The edges could be weighted by the number of calls made between these
phones during the period.

Email Networks

The nodes represent email addresses, which are again individuals. An edge represents the fact
that there was at least one email in at least one direction between the two addresses.
Alternatively, we may only place an edge if there were emails in both directions. In that way, we
avoid viewing spammers as “friends” with all their victims. Another approach is to label edges as
weak or strong. Strong edges represent communication in both directions, while weak edges
indicate that the communication was in one direction only.

Collaboration Networks

Nodes represent individuals who have published research papers. There is an edge between two
individuals who published one or more papers jointly. Optionally, we can label edges by the
number of joint publications. The communities in this network are authors working on a
particular topic.

An alternative view of the same data is as a graph in which the nodes are papers. Two papers are
connected by an edge if they have at least one author in common. Now, we form communities
that are collections of papers on the same topic.
Clustering of Social-Network Graphs

Distance Measures for Social-Network Graphs

If we were to apply standard clustering techniques to a social-network graph, our first step would
be to define a distance measure. We would then combine nodes that are near each other, and
repeating the same process forms clusters.

Suppose the graph contains two communities. Traditional clustering is likely to put two nodes
with a small distance between them in the same cluster, but social-network graphs have edges
that cross between communities, so severe merging of communities is likely.

Another approach

The Girvan-Newman Algorithm

In order to exploit the betweenness of edges (the number of pairs of nodes whose shortest path
runs through a given edge), we need to calculate the number of shortest paths going through each
edge. We shall describe a method called the Girvan-Newman (GN) Algorithm, which visits each node
X once and computes the number of shortest paths from X to each of the other nodes that go
through each of the edges.

The algorithm begins by performing a breadth-first search (BFS) of the graph, starting at the
node X.

The level of each node in the BFS presentation is the length of the shortest path from X to that
node.
The second step of the GN algorithm is to label each node by the number of shortest paths that
reach it from the root. Start by labeling the root 1.

The third and final step is to calculate the credit value of each node. The credit of a node is
calculated using a shortcut method: find the total number of nodes that the current node is
responsible for reaching from the root node.
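The three steps can be sketched as follows. This version uses the general credit rule (each node starts with credit 1 and passes its credit up to its parents in proportion to their shortest-path counts), which reduces to the shortcut above when shortest paths are unique. The example graph is an assumed seven-node network, and the function names are chosen for this illustration.

```python
from collections import deque


def edge_credits_from_root(graph, root):
    """Run the three GN steps from one root and return a credit per edge."""
    # Step 1: BFS gives each node its level (shortest distance from root).
    # Step 2: label each node with the number of shortest paths reaching it.
    level = {root: 0}
    paths = {root: 1}
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in graph[node]:
            if nbr not in level:
                level[nbr] = level[node] + 1
                paths[nbr] = 0
                queue.append(nbr)
            if level[nbr] == level[node] + 1:
                paths[nbr] += paths[node]

    # Step 3: working from the deepest level upward, every node starts with
    # credit 1 and shares its credit among the edges to its parents.
    node_credit = {node: 1.0 for node in order}
    edge_credit = {}
    for node in reversed(order):
        for nbr in graph[node]:
            if level[nbr] == level[node] - 1:      # nbr is a parent of node
                share = node_credit[node] * paths[nbr] / paths[node]
                edge_credit[tuple(sorted((nbr, node)))] = share
                node_credit[nbr] += share
    return edge_credit


def edge_betweenness(graph):
    """Sum the credits over all roots; halve because each path is seen twice."""
    total = {}
    for root in graph:
        for edge, credit in edge_credits_from_root(graph, root).items():
            total[edge] = total.get(edge, 0.0) + credit
    return {edge: credit / 2 for edge, credit in total.items()}


# Assumed example graph (undirected), stored as node -> set of neighbours.
graph = {
    "A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"},
    "D": {"B", "E", "F", "G"}, "E": {"D", "F"},
    "F": {"D", "E", "G"}, "G": {"D", "F"},
}

betweenness = edge_betweenness(graph)
for edge in sorted(betweenness):
    print(edge, betweenness[edge])
```

In this assumed graph the edge between B and D is the only link between the two halves of the network, so it receives the highest betweenness.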

Using Betweenness to Find Communities

Bottom-up approach:

Keep adding edges (among the existing ones) starting from the lowest betweenness. Gradually join
small components to build large connected components.
Top-down approach:

Start from all existing edges. The graph may look like one big component.
Keep removing edges starting from the highest betweenness.
Gradually split large components to arrive at communities.
Repeat the process until the desired number of clusters is formed.
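The top-down approach can be sketched as below. The betweenness scores are example values for an assumed seven-node graph and are given as input rather than computed. This simplified version removes edges in order of their initial betweenness; the full Girvan-Newman procedure would recompute betweenness after every removal.

```python
def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    components = []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for nbr in adjacency[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        components.append(component)
    return components


def top_down_communities(betweenness, k):
    """Remove edges from highest betweenness down until k components exist."""
    nodes = {node for edge in betweenness for node in edge}
    remaining = sorted(betweenness, key=betweenness.get, reverse=True)
    components = connected_components(nodes, remaining)
    while len(components) < k and remaining:
        remaining.pop(0)   # drop the edge with the highest betweenness
        components = connected_components(nodes, remaining)
    return components


# Assumed example scores: edge -> betweenness.
betweenness = {
    ("A", "B"): 5, ("A", "C"): 1, ("B", "C"): 5, ("B", "D"): 12,
    ("D", "E"): 4.5, ("D", "F"): 4, ("D", "G"): 4.5,
    ("E", "F"): 1.5, ("F", "G"): 1.5,
}

communities = top_down_communities(betweenness, 2)
print(communities)
```

With these scores, removing the single edge between B and D is enough to split the graph into two communities.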

