STATISTICS FOR DATA SCIENCE
STATISTICS
• Statistics is the discipline that concerns the collection, organization,
analysis, interpretation, and presentation of data. In applying statistics to a
scientific, industrial, or social problem, it is conventional to begin with a
statistical population or a statistical model to be studied.
CENTRAL TENDENCY
• Central tendency is a statistical measure that represents an entire
distribution or dataset with a single value. It aims to give an accurate
description of all the data in the distribution.
• In statistics, a measure of central tendency is a descriptive summary of a
data set: a single value that reflects the centre of the data distribution.
It summarizes the dataset as a whole and gives no information about
individual data points.
MEAN
• The mean, median and mode are the three commonly used measures of central tendency.
• Mean is the average of the given numbers and is calculated by dividing the sum of given
numbers by the total number of numbers.
• Mean = (Sum of all the observations/Total number of observations)
• Example: What is the mean of 2, 4, 6, 8 and 10?
• Solution: First, add all the numbers.
• 2 + 4 + 6 + 8 + 10 = 30
• Now divide by 5 (total number of observations).
• Mean = 30/5 = 6
• Calculate the mean from the data showing marks of students in a class in
a test: 40, 50, 55, 78, 58.
• A class consists of 50 students, out of which 30 are girls. The mean of
marks scored by girls in a test is 73 (out of 100), and that of boys is 71.
Determine the mean score of the whole class.
• A batsman has a certain average over 9 matches. In the 10th match, he
scores 100 runs and his average increases by 8 runs. What is his new
average?
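The worked example earlier in this section (2, 4, 6, 8 and 10) can be checked with a few lines of Python. This is only a minimal sketch of the formula Mean = sum / count; the function name `mean` is chosen here for illustration.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    if not values:
        raise ValueError("mean of an empty list is undefined")
    return sum(values) / len(values)


observations = [2, 4, 6, 8, 10]
print(mean(observations))  # 6.0
```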
Find the mean salary of 60 workers of a factory from the following table:
Salary (in Rs.)    Number of workers
3000               16
4000               12
5000               10
6000               8
7000               6
8000               4
9000               3
10000              1
Total              60
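When data come as a frequency table like the one above, each value is weighted by its frequency: mean = Σ(f·x) / Σf. The sketch below applies that to the salary table; the names `grouped_mean`, `salaries` and `workers` are illustrative, not part of the original slides.

```python
def grouped_mean(values, frequencies):
    """Mean of a frequency table: sum of f*x divided by sum of f."""
    if len(values) != len(frequencies):
        raise ValueError("values and frequencies must have the same length")
    total_frequency = sum(frequencies)
    weighted_sum = sum(x * f for x, f in zip(values, frequencies))
    return weighted_sum / total_frequency


salaries = [3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]
workers = [16, 12, 10, 8, 6, 4, 3, 1]

print(sum(workers))                              # 60
print(round(grouped_mean(salaries, workers), 2))  # 5083.33
```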
MEDIAN
• The median is the middle number in a sorted, ascending or descending list of numbers and can be more descriptive of that
data set than the average. It is the point above and below which half (50%) the observed data falls, and so represents the
midpoint of the data.
• The median is sometimes used instead of the mean when there are outliers in the sequence that might skew the
average of the values.
• If there is an odd number of values, the median is the value in the middle, with the same number of
values below and above it.
• If there is an even number of values in the list, the middle pair must be determined, added together, and divided by two
to find the median value.
• In a normal distribution, the median is the same as the mean and the mode.
If the number of observations n is odd, the median is the value of x at position (n+1)/2 in the ordered observations. That is,
Median = value of the ((n+1)/2)th observation in the data array
If the number of observations n is even, the median is the arithmetic mean of the two middle values in
the array. That is,
Median = [value of the (n/2)th observation + value of the (n/2 + 1)th observation] / 2
• The export of agricultural products (in million dollars) from a country during
eight quarters in 1974 and 1975 was recorded as 29.7, 16.6, 2.3, 14.1, 36.6,
18.7, 3.5, 21.3.
Find the median of the given set of values.
• The numbers of rooms in seven five-star hotels in Chennai city are 71, 30,
61, 59, 31, 40 and 29. Find the median number of rooms.
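The odd and even rules can be sketched in Python as follows. This is one straightforward way to implement the definition (sort, then pick the middle value or average the middle pair); the function name `median` is illustrative, and the two data sets are the ones from the exercises above.

```python
def median(values):
    """Middle value of the sorted data; mean of the middle pair if n is even."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n+1)/2)th observation (index mid when counting from 0).
        return ordered[mid]
    # Even n: average of the (n/2)th and (n/2 + 1)th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


rooms = [71, 30, 61, 59, 31, 40, 29]                          # n = 7 (odd)
exports = [29.7, 16.6, 2.3, 14.1, 36.6, 18.7, 3.5, 21.3]      # n = 8 (even)

print(median(rooms))               # 40
print(f"{median(exports):.2f}")    # 17.65
```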
Cumulative Frequency
In a grouped distribution, values are associated with frequencies.
The cumulative frequencies are calculated to know the total number of items above or below a certain limit.
This is obtained by adding the frequencies successively up to the required level.
These cumulative frequencies are useful for calculating the median, quartiles, deciles and percentiles.
Median for Discrete grouped data
We can find the median using the following steps:
i. Calculate the cumulative frequencies
ii. Find (N+1)/2, where N=Σf=total frequencies
iii. Identify the cumulative frequency just greater than (N+1)/2
iv. The value of x corresponding to that cumulative frequency is the median.
• The following data are the weights of students in a class. Find the median weight of the students.
Solution:
The cumulative frequency just greater than 30.5 is 38. The value of x corresponding to
38 is 40. The median weight of the students is 40 kg.
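The four steps for discrete grouped data can be sketched in Python. The frequency table used here is invented for illustration (it is not the table from the slide, which is not reproduced in this text); it is chosen so that N = 60 and (N+1)/2 = 30.5. The comparison uses "greater than or equal to" so that the case where a cumulative frequency lands exactly on (N+1)/2 is also handled.

```python
from itertools import accumulate


def discrete_grouped_median(values, frequencies):
    """Median of discrete grouped data via cumulative frequencies."""
    pairs = sorted(zip(values, frequencies))          # order by x
    cumulative = list(accumulate(f for _, f in pairs))  # step i
    target = (cumulative[-1] + 1) / 2                 # step ii: (N+1)/2
    for (x, _), cf in zip(pairs, cumulative):
        if cf >= target:                              # step iii
            return x                                  # step iv
    raise ValueError("frequencies must sum to a positive number")


# Hypothetical weights (kg) and numbers of students, for illustration only.
weights = [30, 35, 40, 45, 50]
students = [8, 12, 18, 14, 8]   # cumulative: 8, 20, 38, 52, 60

print(discrete_grouped_median(weights, students))  # 40
```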
MODE
• The mode is one of the measures of central tendency. It is the value
that occurs most frequently in a data set, so it gives a rough idea of
which items tend to occur most often.
• For example, suppose a college offers 10 different courses to its
students. The course with the highest number of registrations is the
mode of the data (the course chosen by each student).
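A minimal Python sketch of finding the mode by counting occurrences is shown below. The registration list is made up for illustration, and the helper name `modes` is hypothetical; it returns a list because a data set can have more than one most frequent value.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    if not values:
        raise ValueError("mode of an empty list is undefined")
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


# Hypothetical data: the course each student registered for.
registrations = ["Statistics", "Physics", "Statistics",
                 "History", "Physics", "Statistics"]

print(modes(registrations))      # ['Statistics']
print(modes([1, 1, 2, 2, 3]))    # [1, 2]  (two modes)
```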
VARIANCE
• The term variance refers to a statistical measurement of the spread between
numbers in a data set. More specifically, variance measures how far each number in
the set is from the mean (average), and thus from every other number in the set.
Variance is often depicted by this symbol: σ². It is used by both analysts and
traders to determine volatility and market security.
• The square root of the variance is the standard deviation (SD or σ), which helps
determine the consistency of an investment’s returns over a period of time.
• Variance is a measurement of the spread between numbers in a data set.
• In particular, it measures the degree of dispersion of data around the sample's
mean.
• Investors use variance to see how much risk an investment carries and whether it
will be profitable.
• Variance is also used in finance to compare the relative performance of each asset
in a portfolio to achieve the best asset allocation.
• The square root of the variance is the standard deviation.
Let’s say the heights (in mm) are 610, 450, 160, 420,
310.
Mean and variance are interrelated. The first step is
finding the mean, which is done as follows:
Mean = (610 + 450 + 160 + 420 + 310)/5 = 1950/5 = 390
So the mean is 390 mm.
To calculate the variance, compute the difference of
each value from the mean, square it, and then find the
average of the squared differences.
So for this particular case the variance is:
= (220² + 60² + (−230)² + 30² + (−80)²)/5
= (48400 + 3600 + 52900 + 900 + 6400)/5
= 112200/5
Final answer: Variance = 22440 mm²
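The calculation above can be reproduced step by step in Python. This sketch divides by the number of values (5), as the worked example does, which is the population form of the variance; the function name is illustrative.

```python
def population_variance(values):
    """Average of the squared deviations from the mean (divides by n)."""
    n = len(values)
    if n == 0:
        raise ValueError("variance of an empty list is undefined")
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / n


heights_mm = [610, 450, 160, 420, 310]
print(population_variance(heights_mm))  # 22440.0
```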
Q. Find the variance of the numbers 3, 8, 6, 10, 12, 9,
11, 10, 12, 7.
STANDARD DEVIATION
• Standard deviation is a statistic that measures the dispersion of a dataset
relative to its mean. It is calculated as the square root of the variance,
which is itself obtained from each data point's deviation from the mean.
• If the data points are further from the mean, there is a higher deviation
within the data set; thus, the more spread out the data, the higher the
standard deviation.
• Standard deviation measures the dispersion of a dataset relative to its mean.
• It is calculated as the square root of the variance.
• Standard deviation, in finance, is often used as a measure of a relative riskiness of
an asset.
• A volatile stock has a high standard deviation, while the deviation of a stable blue-
chip stock is usually rather low.
• As a downside, the standard deviation calculates all uncertainty as risk, even when
it’s in the investor's favor—such as above-average returns.
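As a sketch, Python's standard `statistics` module computes both common forms of the standard deviation. `pstdev` divides by n (population form), while `stdev` divides by n − 1 (sample form), so the two give different answers on the same data. The heights reused here are the ones from the variance example; which form an exercise expects is an assumption to check against the course's convention.

```python
import statistics

heights_mm = [610, 450, 160, 420, 310]

# Population form: square root of the variance that divides by n.
population_sd = statistics.pstdev(heights_mm)

# Sample form: square root of the variance that divides by n - 1.
sample_sd = statistics.stdev(heights_mm)

print(round(population_sd, 2))  # about 149.8
print(round(sample_sd, 2))      # about 167.48
```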
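FORMULA
• Variance (population form): σ² = Σ(xi − μ)² / N
• Standard deviation (population form): σ = √( Σ(xi − μ)² / N )
• Where xi is each observation, μ is the mean and N is the number of observations.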
• Find the variance and standard deviation of the following scores on an exam: 92,
95, 85, 80, 75, 50
• Find the standard deviation of the average temperatures recorded over a five-day
period last winter: 18, 22, 19, 25, 12
• Find the variance and standard deviation for the five states with the most covered
bridges: Oregon: 106 Vermont: 121 Indiana: 152 Ohio: 234 Pennsylvania: 347
Difference between variance, covariance and correlation
• Variance tells us how much a quantity varies with respect to its mean. It is the
spread of the data around the mean value. It gives only a magnitude: how much the
data is spread.
• Covariance tells us the direction in which two quantities vary with each other.
• Correlation shows us both the direction and the magnitude of how two quantities vary
with each other.
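One way to see why covariance mainly conveys direction while correlation also conveys strength: changing the units of a variable rescales the covariance but leaves the correlation unchanged. The sketch below uses made-up height and weight data and illustrative helper names.

```python
import math


def covariance(xs, ys):
    """Population covariance: mean product of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical data, for illustration only.
heights_m = [1.5, 1.6, 1.7, 1.8]
weights_kg = [52, 58, 67, 75]
heights_cm = [h * 100 for h in heights_m]   # same heights, different unit

# Covariance grows by the factor of 100; its sign (direction) stays positive.
print(math.isclose(covariance(heights_cm, weights_kg),
                   100 * covariance(heights_m, weights_kg)))   # True

# Correlation is unaffected by the change of unit.
print(math.isclose(correlation(heights_cm, weights_kg),
                   correlation(heights_m, weights_kg)))        # True
```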
COVARIANCE
• In statistics, covariance is a measure of the relationship between two
random variables. It indicates whether the two variables vary together:
a positive value means they tend to move in the same direction, a
negative value means they tend to move in opposite directions. The
covariance between two random variables X and Y is denoted Cov(X, Y).
• Covariance formula
• Covariance formula for population:
• Cov(X, Y) = Σ (Xi − X̄)(Yi − Ȳ) / n
• Where,
• Xi is the values of the X-variable
• Yi is the values of the Y-variable
• X̄ is the mean of the X-variable
• Ȳ is the mean of the Y-variable
• n is the number of data points
Question: The table below describes the rate of economic growth (xi) and the rate of return on the S&P 500 (yi). Using the covariance formula, determine whether
economic growth and S&P 500 returns have a positive or inverse relationship. Before you compute the covariance, calculate the mean of x and y.
Economic Growth % (xi) S&P 500 Returns % (yi)
2.1 8
2.5 12
4.0 14
3.6 10
• Find the covariance for the following data set: x = {2,5,6,8,9}, y = {4,3,7,5,6}
• Using the covariance formula, find the covariance for the following data set: x =
{5,6,8,11,4,6}, y = {1,4,3,7,9,12}.
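The economic growth question above can be worked through in Python as a sketch. The function first computes the two means, as the question asks, and then averages the products of the deviations. The `sample` flag is an addition for illustration: it switches between dividing by n (the population form) and by n − 1 (the sample form), since textbooks differ on which one to use for this kind of exercise. The sign, and so the direction of the relationship, is the same either way.

```python
def covariance(xs, ys, sample=False):
    """Covariance of paired data; divides by n, or by n - 1 if sample=True."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


growth = [2.1, 2.5, 4.0, 3.6]     # economic growth % (xi)
returns = [8, 12, 14, 10]         # S&P 500 returns % (yi)

print(round(covariance(growth, returns), 4))               # 1.15
print(round(covariance(growth, returns, sample=True), 4))  # 1.5333
```

Both values are positive, which indicates that growth and returns tend to move in the same direction in this table.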
What is a Correlation?
• Correlation refers to a process for establishing whether or not a relationship
exists between two given variables.
• Through the correlation coefficient, one can get a general idea of whether or not
two variables are related.
• A correlation coefficient very close to zero, whether positive or negative,
implies little or no linear relationship between the two variables. A correlation
coefficient close to +1 means that an increase in one of the variables is
associated with an increase in the other; a coefficient close to −1 means that
an increase in one is associated with a decrease in the other.
PEARSON CORRELATION COEFFICIENT
• The covariance of two variables divided by the product of their standard deviations gives
Pearson’s correlation coefficient. It is usually represented by ρ (rho).
• ρ(X, Y) = cov(X, Y) / (σX · σY)
• Here cov is the covariance. σX is the standard deviation of X and σY is the standard deviation of
Y. The given equation for correlation coefficient can be expressed in terms of means and
expectations.
• ρ(X, Y) = E[(X − μx)(Y − μy)] / (σx · σy)
• μx and μy are the mean of x and the mean of y respectively. E is the expectation.
• Example 1: Calculate the correlation coefficient of the given data:
x     50     51     52     53     54
y     3.1    3.2    3.3    3.4    3.5
Working table:
x     50     51     52     53     54
y     3.1    3.2    3.3    3.4    3.5
xy    155    163.2  171.6  180.2  189
x²    2500   2601   2704   2809   2916
y²    9.61   10.24  10.89  11.56  12.25
• Calculate the correlation coefficient of the given data:
x     12     15     18     21     27
y     2      4      6      8      12
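The xy, x² and y² rows in Example 1 suggest the computational form of Pearson's coefficient, r = (nΣxy − ΣxΣy) / √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)], which is algebraically equivalent to cov(X, Y) / (σX · σY). That this is the form the slide intends is an assumption. A sketch in Python, using the Example 1 data and an illustrative function name:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation via sums of x, y, xy, x^2 and y^2."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return numerator / denominator


x = [50, 51, 52, 53, 54]
y = [3.1, 3.2, 3.3, 3.4, 3.5]

print(round(pearson_r(x, y), 6))  # 1.0
```

The result is 1 (up to floating-point rounding) because y increases by exactly 0.1 for every unit increase in x, so the points lie on a straight line with positive slope.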