
Introduction to Statistics Overview

The document provides an overview of statistics, dividing it into descriptive and inferential statistics. Descriptive statistics summarize data characteristics, while inferential statistics draw conclusions about populations based on sample data. Key concepts include measures of central tendency (mean, median, mode) and measures of variance (range, quartiles, variance, standard deviation).

Uploaded by

raj.nandurkar23

Statistics

Vaishnavee Rathod
1. Introduction to Statistics
Statistics is a mathematical science that includes methods for collecting,
organizing, analyzing and visualizing data in such a way that meaningful
conclusions can be drawn.

Statistics is also a field of study concerned with summarizing data, interpreting data, and
making decisions based on the data.

Statistics is composed of two broad categories:


1. Descriptive Statistics
2. Inferential Statistics
2. Descriptive Statistics
Descriptive statistics describe the characteristics or properties of the data. They
summarize the data in a meaningful way, so that important patterns can emerge.
Data summarization techniques are used to identify the properties of the data and
to understand its distribution. Descriptive statistics do not generalize beyond the data at hand.

Two types of descriptive statistics:


1. Measures of central tendency (mean, median, mode)
2. Measures of data spread or dispersion (range, quartiles, variance and standard
deviation)

3. Measures of Central Tendency (Mean, Median, Mode)
A measure of central tendency is a single value that attempts to describe a set of
data by identifying the central position within that set of data. The mean,
median and mode are all valid measures of central tendency.

4. Mean (Arithmetic)
The mean (or average) is the most popular and well known measure of central tendency. It
can be used with both discrete and continuous data, although its use is most often with
continuous data.

The mean is equal to the sum of all the values in the data set divided by the number of values
in the data set. So, if a data set has n values x_1, x_2, ..., x_n, the sample
mean, usually denoted by x̄, is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

An important property of the mean is that it includes every value in the data set as part of
the calculation. In addition, the mean is the only measure of central tendency where the sum
of the deviations of each value from the mean is always zero.
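
The definition and the zero-sum property of the deviations can be checked with a few lines of Python. This is only a minimal sketch: the function name `mean` is our own, and the data set is the one used later for the range example.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

data = [7, 10, 21, 33, 43, 45, 45, 65, 67, 87, 98, 99]
x_bar = mean(data)

# Deviation of each value from the mean; these always sum to zero
# (up to floating-point rounding).
deviations = [x - x_bar for x in data]
total_deviation = sum(deviations)

print(x_bar)
print(total_deviation)
```

Because of floating-point rounding, the printed total may be a tiny number rather than exactly 0.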

5. Median
The median is the middle score for a set of data that has been arranged in order of
magnitude; with an even number of values it is the mean of the two middle scores. The median
is less affected by outliers and skewed data than the mean. It is a holistic measure: it has to
be computed on the data set as a whole, so for a large data set its value is often approximated.
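
A minimal sketch of the median calculation follows, assuming the usual convention of averaging the two middle scores when the count is even. The function name and the small data sets are hypothetical and chosen only for illustration.

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

scores = [3, 1, 4, 1, 5]
scores_with_outlier = [3, 1, 4, 1, 500]   # one extreme value replaces the 5

median_plain = median(scores)
median_outlier = median(scores_with_outlier)
mean_plain = sum(scores) / len(scores)
mean_outlier = sum(scores_with_outlier) / len(scores_with_outlier)

print(median_plain, median_outlier)   # the median does not move
print(mean_plain, mean_outlier)       # the mean is pulled toward the outlier
```

The comparison illustrates why the median is preferred for skewed data or data with outliers.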

6. Mode
The mode is the most frequent score in our data set. The mode is used for categorical data
where we want to know which is the most common category occurring in the population.
The greatest frequency may correspond to several different values, in which case the
data set has more than one mode. Data sets with one, two, or more modes are called unimodal,
bimodal and multimodal, respectively. If each data value occurs only once, then there is no
mode.
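
The cases above (one mode, several modes, no mode) can be sketched with a frequency count. The helper `modes` is a hypothetical name; returning an empty list when every value occurs once is our own convention for "no mode".

```python
from collections import Counter

def modes(values):
    """Return all values that share the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    # Every value occurs exactly once: treat the data set as having no mode.
    if top == 1 and len(counts) > 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3]))            # unimodal
print(modes([1, 1, 2, 2, 3]))         # bimodal
print(modes([1, 2, 3]))               # no mode
print(modes(["A", "O", "O", "B"]))    # works for categorical data too
```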

For a unimodal frequency curve with a symmetric data distribution, the mean, median and mode
are all the same.

In real applications the data are often not symmetric. The distribution may be
positively skewed or negatively skewed. If it is positively skewed, the mode is smaller than the
median; if it is negatively skewed, the mode occurs at a value greater than the median.

7. Measures of Variance
Measures of spread summarize a group of data by describing how far the scores
are spread out. A number of statistics are available to describe this spread, including
the range, quartiles, absolute deviation, variance and standard deviation.
• The degree to which numerical data tend to spread is called the dispersion, or variance, of
the data. Common tools for describing dispersion are the range, quartiles, the interquartile
range, outlier detection and boxplots.

Range: The range of a data set is the difference between its largest (max()) and smallest (min()) values.

Example:
Step 1: Sort the numbers in order, from smallest to largest: 7, 10, 21, 33, 43, 45, 45, 65, 67, 87, 98, 99
Step 2: Subtract the smallest number in the set from the largest number in the set: 99 – 7 = 92
The range is 92
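
The same calculation in Python, as a small sketch. Sorting is not actually required, because `max()` and `min()` scan the data directly; the list below is the example data in shuffled order.

```python
data = [45, 7, 99, 21, 10, 33, 98, 43, 65, 45, 87, 67]

# Range = largest value minus smallest value.
data_range = max(data) - min(data)
print(data_range)
```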

Quartiles: The kth percentile of a set of data in numerical order is the value xi having the
property that k percent of the data entries lie at or below xi.
• The first quartile (Q1) is the 25th percentile.
• The third quartile (Q3) is the 75th percentile.
• The distance between the first and third quartiles is the range covered by the middle half of the
data. It is called the interquartile range (IQR) and is defined as IQR = Q3 - Q1.
• Outliers: a common rule is to single out values falling at least 1.5 × IQR above the third quartile
or below the first quartile.
• Five-number summary: the median, the quartiles Q1 and Q3, and the smallest and largest individual
observations, written in the order Minimum; Q1; Median; Q3; Maximum.

Example: Quartiles
• Start with the following data set: 1, 2, 2, 3, 4, 6, 6, 7, 7, 7, 8, 11, 12, 15, 15, 15, 17, 17, 18, 20
• There are twenty data points in the set. Because the number of values is even,
the median is the mean of the tenth and eleventh values: (7 + 8)/2 = 7.5.
• The first quartile is the median of the lower half of the set, 1, 2, 2, 3, 4, 6, 6, 7, 7, 7,
which lies between its fifth and sixth values: Q1 = (4 + 6)/2 = 5.
• The third quartile is the median of the upper half of the set, 8, 11, 12, 15, 15, 15, 17, 17, 18, 20:
Q3 = (15 + 15)/2 = 15.
• The interquartile range is therefore IQR = Q3 - Q1 = 15 - 5 = 10.
A small interquartile range indicates data that are clumped about the median. A larger
interquartile range shows that the data are more spread out.
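
The worked example uses the "median of each half" method, which can be sketched as below. The function names are our own, and other quartile conventions (for example, interpolated percentiles in statistical libraries) can give slightly different values.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumes at least two values. For an odd count, the middle value
    is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return median(lower), median(ordered), median(upper)

data = [1, 2, 2, 3, 4, 6, 6, 7, 7, 7, 8, 11, 12, 15, 15, 15, 17, 17, 18, 20]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```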
Example: Outliers
Data: 7, 10, 21, 33, 43, 45, 45, 65, 67, 87, 98, 99
An outlier is a value that is significantly higher or lower than most of the other values in a dataset. It
can distort averages and spread measures such as the mean and the range.
Common method to detect outliers: the IQR rule (interquartile range)
Step 1: Find Q1 (25th percentile) and Q3 (75th percentile)
Q1 (lower quartile) = median of lower half = median of `7, 10, 21, 33, 43, 45` → Q1 = (21 + 33)/2 = 27
Q3 (upper quartile) = median of upper half = median of `45, 65, 67, 87, 98, 99` → Q3 = (67 + 87)/2 = 77
Step 2: Compute the IQR: IQR = Q3 - Q1 = 77 - 27 = 50
Step 3: Calculate the outlier boundaries
Lower bound = Q1 - 1.5 × IQR = 27 - 75 = -48 (no data points this low)
Upper bound = Q3 + 1.5 × IQR = 77 + 75 = 152
All data points lie between 7 and 99, which is within the acceptable range (-48 to 152).
There are no outliers in this data set using the IQR method.
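
The three steps can be sketched in Python as follows. The helper names are hypothetical, quartiles are computed with the same median-of-halves method as in the example, and the extra value 200 in the second data set is made up purely to show a detected outlier.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def iqr_bounds(values, k=1.5):
    """Lower and upper outlier boundaries: Q1 - k*IQR and Q3 + k*IQR."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[len(ordered) - half:])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values):
    low, high = iqr_bounds(values)
    return [x for x in values if x < low or x > high]

data = [7, 10, 21, 33, 43, 45, 45, 65, 67, 87, 98, 99]
bounds = iqr_bounds(data)
outliers = find_outliers(data)

# Hypothetical variation: append one extreme value to the same data.
outliers_with_extreme = find_outliers(data + [200])

print(bounds, outliers, outliers_with_extreme)
```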
8. Variance and Standard Deviation
Variance measures how much the values in a dataset vary from the mean (i.e., how spread out they are). It
is denoted by σ² (sigma squared). The standard deviation, σ, is the square root of the variance.

For N values x_1, ..., x_N with mean μ, the variance is the average of the squared differences from the mean:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2$$

The two forms are mathematically equivalent:
1. The first is intuitive (deviation from the mean).
2. The second is easier to compute, especially for large datasets.
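
As a minimal sketch, the two forms of the population variance (average squared deviation from the mean, and mean of the squares minus the square of the mean) can be compared numerically. The function names and the small data set are hypothetical.

```python
import math

def variance_definition(values):
    """Average of the squared deviations from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def variance_shortcut(values):
    """Mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2

data = [2, 4, 4, 4, 5, 5, 7, 9]
var_a = variance_definition(data)
var_b = variance_shortcut(data)
std_dev = math.sqrt(var_a)
print(var_a, var_b, std_dev)
```

The shortcut needs only running totals of x and x², which is why it suits large data sets; in floating-point arithmetic it can lose precision when the mean is very large compared with the spread.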

9. Inferential Statistics
Inferential statistics is used when we need to draw a conclusion about a
whole population, typically by means of statistical tests. It is a
technique for understanding trends and drawing conclusions about a large
population by taking and analyzing a sample from it. Descriptive statistics, on the other hand,
is only about the data set at hand; it does not generalize to a larger population.
Using variables and the relationships between them in the sample, we can make
generalizations and predict other relationships within the whole population, regardless of how
large it is.

With inferential statistics, data is taken from samples and generalizations are made about a
population. Inferential statistics use statistical models to compare sample data to other
samples or to previous research.

10. Two main areas of inferential statistics:
1. Estimating parameters:
This means taking a statistic from the sample data (for example the sample mean) and using it
to infer a population parameter (for example the population mean). There may be sampling
variations because of chance fluctuations, variations in sampling techniques, and other
sampling errors, and estimates of population characteristics may be influenced by such
factors. Therefore, the important question in estimation is how close our estimate is
to the true value.

Characteristics of a good estimator: a good statistical estimator should be
(i) unbiased, (ii) consistent and (iii) accurate.

10. Estimating Parameters
i) Unbiased

An unbiased estimator is one for which, if we were to obtain an infinite number of random samples of a
certain size, the mean of the statistic would be equal to the parameter. The sample mean (x̄) is an
unbiased estimate of the population mean (μ) because, if we look at all possible random
samples of size N from a population, the mean of the sample means is equal to μ.

ii) Consistent

A consistent estimator is one for which, as the sample size increases, the probability that the estimate has a
value close to the parameter also increases. Because the sample mean is a consistent estimator, a sample mean based
on 20 scores has a greater probability of being close to μ than does a sample mean based upon only
5 scores.

iii) Accuracy

The sample mean is an unbiased and consistent estimator of the population mean (μ). But we should not
overlook the fact that an estimate is only an approximation. It is unlikely in any
estimate that x̄ will be exactly equal to the population mean (μ). Whether or not x̄ is a good estimate of
μ depends upon the representativeness of the sample, the sample size, and the variability of scores in the
population.
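
Unbiasedness and consistency of the sample mean can be illustrated with a small simulation. Everything below is a sketch under stated assumptions: the population is synthetic (10,000 normally distributed scores), and the seed, the sample sizes 5 and 20, and the "within 2 units of μ" threshold are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population of 10,000 scores.
population = [rng.gauss(50, 10) for _ in range(10_000)]
mu = statistics.fmean(population)

def sample_means(n, repeats=2000):
    """Means of `repeats` random samples of size n drawn without replacement."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]

def share_close(means, tolerance=2.0):
    """Fraction of sample means lying within `tolerance` of the population mean."""
    return sum(abs(m - mu) <= tolerance for m in means) / len(means)

means_5 = sample_means(5)
means_20 = sample_means(20)

# Unbiased: the average of many sample means is close to mu for both sizes.
avg_5 = statistics.fmean(means_5)
avg_20 = statistics.fmean(means_20)

# Consistent: larger samples land close to mu more often.
close_5 = share_close(means_5)
close_20 = share_close(means_20)

print(mu, avg_5, avg_20)
print(close_5, close_20)
```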

11. Random Variables
A random variable, X, is a variable whose possible values are numerical outcomes of
a random phenomenon. There are two types of random variables, discrete and continuous.
Examples of random variables:
- A person’s blood type (once the types are coded as numbers)
- Number of leaves on a tree
- Number of times a user visits LinkedIn in a day
- Length of a tweet

12. Discrete Random Variable
A discrete random variable is one which may take on only a countable number of distinct values
such as 0, 1, 2, 3, 4, .... Discrete random variables are usually counts. If a random variable can take
only a finite number of distinct values, then it must be discrete. Examples of discrete random
variables include the number of children in a family, the Friday night attendance at a cinema, the
number of patients in a doctor's surgery, and the number of defective light bulbs in a box of ten.
The probability distribution of a discrete random variable is a list of probabilities associated with
each of its possible values. It is also sometimes called the probability function or the probability
mass function.
Suppose a random variable X may take k different values, with the probability that X = xi defined
to be P(X = xi) = pi. The probabilities pi must satisfy the following:
1: 0 ≤ pi ≤ 1 for each i
2: p1 + p2 + ... + pk = 1.
12. Example
Suppose a variable X can take the values 1, 2, 3, or 4.
The probabilities associated with each outcome are described by the following table:

Outcome      1    2    3    4
Probability  0.1  0.3  0.4  0.2

The probability that X is equal to 2 or 3 is the sum of the two probabilities:
P(X = 2 or X = 3) = P(X = 2) + P(X = 3) = 0.3 + 0.4 = 0.7.
Similarly, the probability that X is greater than 1 is equal to 1 - P(X = 1) = 1 - 0.1 = 0.9, by the
complement rule.
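
The example can be reproduced with a dictionary that maps each outcome to its probability. This is a minimal sketch; the helper names `is_valid_pmf` and `prob` are our own, and `math.isclose` is used because decimal fractions are not stored exactly in floating point.

```python
import math

# Probability mass function from the table: outcome -> probability.
pmf = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.2}

def is_valid_pmf(p):
    """Each probability lies in [0, 1] and the probabilities sum to 1."""
    in_range = all(0 <= v <= 1 for v in p.values())
    return in_range and math.isclose(sum(p.values()), 1.0)

def prob(p, event):
    """P(event): sum of the probabilities of the outcomes that satisfy `event`."""
    return sum(v for x, v in p.items() if event(x))

p_2_or_3 = prob(pmf, lambda x: x in (2, 3))
p_greater_1 = 1 - pmf[1]          # complement rule
p_greater_1_direct = prob(pmf, lambda x: x > 1)

print(is_valid_pmf(pmf), p_2_or_3, p_greater_1)
```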

13. Continuous Random Variables
A continuous random variable is one which takes an uncountably infinite number of possible values,
such as every value in an interval. Continuous
random variables are usually measurements. Examples include height, weight, the amount of sugar
in an orange, and the time required to run a mile. Probabilities for a continuous random variable
are not assigned to specific values. Instead, they are assigned to intervals of values and are
represented by the area under a curve (computed as an integral). The probability of observing any
single value is equal to 0, since the number of values which may be assumed by the random variable
is infinite.

Suppose a random variable X may take all values over an interval of real numbers. Then the
probability that X is in the set of outcomes A, P(A), is defined to be the area above A and under a
curve. The curve, which represents a function p(x), must satisfy the following:
1: The curve has no negative values (p(x) ≥ 0 for all x)
2: The total area under the curve is equal to 1.
A curve meeting these requirements is known as a density curve.
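
To make "probability as area" concrete, the sketch below uses a hypothetical density p(x) = 2x on the interval [0, 1] (zero elsewhere) and approximates areas with a simple midpoint rule. The density, the function names and the number of steps are all assumptions made for illustration.

```python
import math

def density(x):
    """Hypothetical density curve: p(x) = 2x on [0, 1], and 0 elsewhere."""
    return 2 * x if 0 <= x <= 1 else 0.0

def area(f, a, b, steps=10_000):
    """Approximate the area under f between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

total_area = area(density, 0, 1)        # requirement 2: total area is 1
p_upper_half = area(density, 0.5, 1)    # P(0.5 <= X <= 1)
p_single_point = area(density, 0.3, 0.3)  # an interval of zero width has zero area

print(total_area, p_upper_half, p_single_point)
```

For this density the exact value of P(0.5 ≤ X ≤ 1) is 1 - 0.25 = 0.75, which the numerical area should match closely.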
All random variables (discrete and continuous) have a cumulative distribution function. It is a
function giving the probability that the random variable X is less than or equal to x, for every value x.
For a discrete random variable, the cumulative distribution function is found by summing up the
probabilities.
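
For a discrete random variable, the cumulative distribution function can be sketched as a running sum. The outcomes and probabilities below are those of the earlier four-outcome example; the function name `cdf` is our own.

```python
import math

outcomes = [1, 2, 3, 4]
probabilities = [0.1, 0.3, 0.4, 0.2]

def cdf(x):
    """P(X <= x): add up the probabilities of all outcomes not exceeding x."""
    return sum(p for value, p in zip(outcomes, probabilities) if value <= x)

for x in (0, 1, 2, 2.5, 3, 4):
    print(x, cdf(x))
```

The function is 0 below the smallest outcome, jumps at each outcome, and reaches 1 at the largest outcome.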

14. Normal Probability Distribution
The Bell-Shaped Curve
The bell-shaped curve is commonly called the normal curve and is mathematically referred to as the
Gaussian probability distribution. Unlike distributions built on Bernoulli trials, which describe discrete counts, the
normal distribution describes the probabilities of a continuous random variable.

The normal or Gaussian probability distribution is the most popular and important distribution because its
mathematical properties make it applicable to a very wide range of real-world problems.
The constants μ and σ² are its parameters:

● μ is the true population mean (or expected value) of the phenomenon characterized by
the continuous random variable X;
● σ² is the true population variance of X;
● hence σ is the population standard deviation of X;
● the points located at μ−σ and μ+σ are the points of inflection, that is, where the graph changes
between curving downward (between the two points) and curving upward (outside them).
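
The role of the two parameters is easiest to see in the standard normal density formula, f(x) = 1/(σ√(2π)) · exp(−(x−μ)²/(2σ²)). The sketch below evaluates it and checks the inflection points numerically; the values μ = 10 and σ = 2 and the helper names are assumptions made for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def curvature(f, x, h=1e-3):
    """Central-difference estimate of the second derivative of f at x."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

mu, sigma = 10.0, 2.0   # hypothetical parameter values

def f(x):
    return normal_pdf(x, mu, sigma)

# The curvature changes sign at mu - sigma and mu + sigma.
inside = curvature(f, mu + 0.5 * sigma)    # negative: curving downward
outside = curvature(f, mu + 1.5 * sigma)   # positive: curving upward

peak = f(mu)   # highest point of the curve, located at the mean
print(inside, outside, peak)
```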

The normal curve (the graph of the normal probability distribution) is symmetric with respect to the mean
μ as the central position. That is, the area between μ and k units to the left of μ is equal to the area
between μ and k units to the right of μ.

There is no single normal probability distribution: each choice of μ and σ² gives a different curve.

[Figure: normal distributions for a fixed value of μ with varying σ².]

[Figure: normal distributions for a fixed value of σ² with varying μ.]
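
Symmetry and the effect of changing each parameter can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The particular values of μ, σ and k below are hypothetical.

```python
import math
from statistics import NormalDist

mu, k = 0.0, 1.5   # hypothetical centre and distance from the centre

narrow = NormalDist(mu, 1.0)         # sigma = 1
wide = NormalDist(mu, 2.0)           # same mu, larger sigma
shifted = NormalDist(mu + 3.0, 1.0)  # same sigma, different mu

# Symmetry: equal areas k units to the left and to the right of mu.
left_area = narrow.cdf(mu) - narrow.cdf(mu - k)
right_area = narrow.cdf(mu + k) - narrow.cdf(mu)

# Fixed mu, varying sigma: a larger sigma gives a lower, flatter curve.
peak_narrow = narrow.pdf(mu)
peak_wide = wide.pdf(mu)

# Fixed sigma, varying mu: the same shape, moved along the axis.
peak_shifted = shifted.pdf(mu + 3.0)

print(left_area, right_area)
print(peak_narrow, peak_wide, peak_shifted)
```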

15. SAMPLING
Sampling is a process used in statistical analysis in which a predetermined number of observations are
taken from a larger population. It helps us to make statistical inferences about the population. A
population can be defined as the whole set of items, with all their characteristics, that is under
study. However, gathering information on the entire population is time consuming and costly. We therefore make
inferences about the population with the help of samples.

Random sampling:
Every individual observation has an equal probability of being selected into the
sample. In random sampling, there should be no pattern in how the sample is drawn.

Probability sampling:
It is the sampling technique in which every individual unit of the population has a greater
than zero probability of being selected into the sample.

Non-probability sampling:
It is the sampling technique in which some elements of the population have no probability
of being selected into the sample.
Cluster sampling:
Divide the population into groups (clusters). Then a random sample of clusters is chosen.

Systematic sampling:
Select sample elements at a regular interval from an ordered sampling frame. A sampling frame is just a
list of the participants that we want to get a sample from.

Stratified sampling:
Sample each subpopulation independently. First, divide the population into homogeneous (very
similar) subgroups. Each population member belongs to only one group.
Then apply simple random or systematic sampling within each group to choose the sample.
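
The sampling schemes described above can be sketched with the standard `random` module. The sampling frame, the two strata and all sizes below are hypothetical, and the helper functions are our own illustrations rather than a standard API.

```python
import random

rng = random.Random(42)

# Hypothetical sampling frame: an ordered list of 100 participants.
frame = [f"person_{i:03d}" for i in range(100)]

# Simple random sampling: every unit has the same chance of selection.
simple = rng.sample(frame, 10)

def systematic_sample(frame, size, rng):
    """Pick a random start, then take every step-th element of the ordered frame."""
    step = len(frame) // size
    start = rng.randrange(step)
    return frame[start::step][:size]

def stratified_sample(strata, per_stratum, rng):
    """Draw a simple random sample independently within each stratum."""
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

systematic = systematic_sample(frame, 10, rng)

# Hypothetical strata: each member belongs to exactly one group.
strata = {"group_a": frame[:60], "group_b": frame[60:]}
stratified = stratified_sample(strata, 3, rng)

print(simple)
print(systematic)
print(stratified)
```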

16. Sampling Distribution
A sampling distribution is the probability distribution of a statistic. It is the distribution of all
possible values taken by the statistic when all possible samples of a fixed size n are taken from the
population; in practice it is approximated by drawing a large number of samples from the population.

Sampling distributions are important for inferential statistics. In textbook examples, a population is specified and the
sampling distribution of a statistic such as the mean or the range is then determined. In practice, the process proceeds the
other way: the sample data are collected and from these data we estimate parameters of the sampling
distribution. This knowledge of the sampling distribution can be very useful. For example, knowing the degree to
which means from different samples would differ from each other and from the population mean
gives an idea of how close a particular sample mean is likely to be to the population mean.

The most common measure of how much sample means differ from each other is the standard
deviation of the sampling distribution of the mean. This standard deviation is called the standard error
of the mean.

If all the sample means were very close to the population mean, then the standard error of the mean
would be small. On the other hand, if the sample means varied considerably, then the standard error of
the mean would be large.
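
The standard error of the mean can be approximated by simulation: draw many samples, compute each sample mean, and take the standard deviation of those means. The sketch below assumes a hypothetical normal population with μ = 100 and σ = 15 and samples of size n = 25; it compares the simulated value with the well-known result σ/√n, which is not derived in this document.

```python
import math
import random
import statistics

rng = random.Random(1)
mu, sigma, n = 100.0, 15.0, 25   # hypothetical population and sample size

def one_sample_mean():
    """Mean of one random sample of size n from the population."""
    return statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])

# Approximate the sampling distribution of the mean with 5,000 samples.
sample_means = [one_sample_mean() for _ in range(5000)]

empirical_se = statistics.pstdev(sample_means)   # SD of the sample means
theoretical_se = sigma / math.sqrt(n)            # sigma / sqrt(n)

print(empirical_se, theoretical_se)
```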
17. Sampling distribution of the sample mean

17. Application
