
Basic Statistics Overview

Statistics is the science of collecting, analyzing, interpreting, and presenting data, divided into descriptive and inferential statistics. Key concepts include populations and samples, types of data (qualitative and quantitative), scales of measurement, and measures of central tendency and dispersion. Additional topics covered are moments, skewness, kurtosis, and the theory of attributes, which help in understanding data distributions and classifications.

Uploaded by

dhritimedigeshi

STATISTICS

Introduction to statistics
Statistics is the science concerned with developing and studying methods for collecting,
analyzing, interpreting and presenting empirical data. It is the art of learning from data.
1. Descriptive statistics: The part of statistics concerned with the description and
summarization of data is called descriptive statistics. It includes methods like
classification, tabulation, and measures of central tendency (such as averages) and
variability (such as the range and standard deviation).
2. Inferential statistics: The part of statistics concerned with the drawing of conclusions
from data is called inferential statistics.
Population: A population is the entire set of individuals, items, or data points that share
certain characteristics and are the subject of a study.
• In a study about college students' study habits, the population could be all college students.
• In a medical study on the effects of a new drug, the population could be all individuals
with a specific health condition.
Sample: A sample is a subgroup or subset of the population that is selected to represent it and
is studied in detail. Ideally, it reflects the characteristics of the entire population.

PYQ: Give examples of:


i. finite population and its sample
ii. Infinite population and its sample
Ans: A finite population is a population whose members can all be identified and
counted, e.g. all the employees of a company or all the students in a school.
Possible samples: a group of 100 students drawn from all grades; a random selection of 50 employees.
An infinite population is a population that is either too large to count or cannot be measured
precisely.
Examples of theoretically infinite populations: all possible throws of a die; all potential
customers of a product.
Possible samples: rolling the die 100 times and recording the outcomes; a survey of 1000
randomly chosen individuals.

Data: data refers to individual pieces of factual information collected through observation,
measurement, or experimentation.
1. Qualitative Data: Non-numerical information that describes qualities or
characteristics. The color of a ball (e.g., red, green, blue) or the breed of a dog (e.g.,
collie, shepherd, terrier) would be examples of qualitative or categorical variables.
2. Quantitative Data: Numerical information that represents quantities or amounts, like
the population of a city or the amount of grain produced per year on a
farm. Quantitative variables can be further classified as discrete or continuous. If a
variable can take on any value between its minimum value and its maximum value, it
is called a continuous variable; otherwise, it is called a discrete variable.
Scales of Measurement
1. Nominal scale: The nominal scale is the most basic level of measurement, used for
labeling or categorizing data without implying any quantitative value or order.
Examples: Gender (male, female), eye color (blue, green, brown), types of fruits
(apple, banana, orange).
2. Ordinal scale: The ordinal scale categorizes data into ordered or ranked categories, where the
order matters but the intervals between categories are not equal or defined. It is generally
used to measure attitudes such as the level of liking, satisfaction, preference, etc.
Examples: Class rankings (1st, 2nd, 3rd), levels of satisfaction (satisfied, neutral,
dissatisfied)
Arithmetic operations are not meaningful on the nominal and ordinal scales.
3. Interval scale: The interval scale represents ordered data with equal distances
(intervals) between values, but it lacks a true zero point. The difference between two
values is meaningful, so values can be added and subtracted, but ratios are not
meaningful, so values cannot be multiplied or divided.
Examples: Temperature in Celsius or Fahrenheit (where 0 doesn’t mean “no
temperature”)
4. Ratio scale: The ratio scale is the most advanced scale of measurement, including
ordered data with equal intervals and an absolute zero. All mathematical operations
(addition, subtraction, multiplication, division) are meaningful. Zero represents the
absence of the property measured. Examples: Height, weight, age, distance, income.
For instance, 0 kg means no weight, and a weight of 10 kg is twice as much as 5 kg.
Frequency: The number of times a value occurs in a data set.
Frequency distribution: A tabular representation in which data is organised into
categories or intervals, along with their respective frequencies.
Relative frequency: The proportion of times a particular value occurs in relation
to the total number of observations.
$$\text{Relative frequency} = \frac{\text{frequency of a value}}{\text{total number of observations}}$$
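As a small illustration of frequency and relative frequency, the sketch below tabulates a made-up list of marks (the data are hypothetical, not from these notes) using Python's standard `collections.Counter`.

```python
from collections import Counter

# Hypothetical data set: marks scored by ten students.
marks = [4, 7, 7, 5, 4, 7, 9, 5, 7, 4]

frequency = Counter(marks)   # value -> number of times it occurs
n = len(marks)               # total number of observations

# Relative frequency = frequency of a value / total number of observations
relative_frequency = {value: count / n for value, count in frequency.items()}

for value in sorted(frequency):
    print(value, frequency[value], relative_frequency[value])
```

The relative frequencies of all distinct values add up to 1, which is a quick check on any frequency table.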
Class interval: A class interval is a range of values used to group continuous or large sets of
data into smaller, more manageable categories or "classes".
A class interval may be:
(a) exclusive – the upper limit of a class interval is not included in that class
(b) inclusive – both the upper and the lower limits are included in the class
Data that is arranged in class intervals is called grouped data.
Grouped data can be visualised by a histogram, a frequency polygon or an ogive curve.

Cumulative frequency: running total of frequencies up to a certain class interval in a dataset.

Ogive curve - An ogive is a type of line graph that represents cumulative frequencies. It can
be used to display either "less than" or "more than" cumulative frequency distributions.
• For a "less than" ogive, plot the cumulative frequency against the upper boundary of each
class interval and connect the points with a smooth curve.
• For a "more than" ogive, plot the cumulative frequency against the lower boundary of each
class interval and connect the points.
The "less than" and "more than" ogives intersect at a cumulative frequency of N/2; the
x-coordinate of the point of intersection is the MEDIAN.
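The points plotted on the two ogives can be computed as running totals. The sketch below uses a hypothetical grouped data set (five classes of width 10) and only prepares the numbers; it does not draw the curves.

```python
from itertools import accumulate

# Hypothetical grouped data: class boundaries and class frequencies.
boundaries = [0, 10, 20, 30, 40, 50]   # classes 0-10, 10-20, ..., 40-50
frequencies = [5, 8, 12, 10, 5]
total = sum(frequencies)               # N

# "Less than" cumulative frequencies, plotted against the UPPER boundaries.
less_than = list(accumulate(frequencies))

# "More than" cumulative frequencies, plotted against the LOWER boundaries.
more_than = [total - c for c in [0] + less_than[:-1]]

less_than_points = list(zip(boundaries[1:], less_than))
more_than_points = list(zip(boundaries[:-1], more_than))
print(less_than_points)
print(more_than_points)
```

The "less than" series rises to N while the "more than" series falls from N, so the two curves must cross where both equal N/2.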

PYQ: Write a short note on histogram. Which average can be obtained from it? Explain
the method of finding out this average from a histogram.
Ans. Histogram – It is a graphical representation of data that organizes a group of
data points into specified ranges and is used to display the frequency distribution of a dataset.
It is like a bar graph in which the adjacent bars touch, to emphasize the continuous nature of
the data. The x-axis represents the class intervals and the y-axis represents the frequency of
occurrences within each interval.
The mode can be obtained from a histogram. The mode is the value or range of values that
occurs most frequently in a dataset. In a histogram, it lies in the tallest bar, which
represents the class interval with the highest frequency (the modal class). To locate it, join
the top right corner of the tallest bar to the top right corner of the preceding bar, and the
top left corner of the tallest bar to the top left corner of the following bar. The x-coordinate
of the point where these two lines intersect is the mode.
Frequency polygon – It is created by joining the mid-points of the tops of the bars of a histogram.

Measures of Central Tendency


A measure of central tendency is a statistical value that represents the center or typical value
of a dataset. It provides a single value that summarizes the data by indicating where most
values in the dataset cluster. In statistics, the three most common measures of central
tendency are mean, median and mode.
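1. Mean – It is the arithmetic average. The sample mean is defined as the sum of all values in a
dataset divided by the number of values.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The mean is affected by extreme scores. Sometimes the mean is a value which is not present in the
series.
For continuous (grouped) data, with class mid-points $x_i$, frequencies $f_i$ and $N = \sum f_i$;
$$\bar{x} = \frac{\sum f_i x_i}{N}$$
Shortcut method (Assumed mean method)
$$\bar{x} = A + \frac{\sum f_i d_i}{N}$$
where $d_i = x_i - A$ and A is any arbitrary value of x (the assumed mean)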
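A quick numerical check that the direct method and the assumed mean method agree. The class mid-points, frequencies and the choice A = 25 are hypothetical values chosen for illustration.

```python
# Hypothetical grouped data: class mid-points x_i and frequencies f_i.
midpoints = [5, 15, 25, 35, 45]
frequencies = [5, 8, 12, 10, 5]
N = sum(frequencies)

# Direct method: mean = sum(f_i * x_i) / N
direct_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / N

# Assumed mean method: mean = A + sum(f_i * d_i) / N, where d_i = x_i - A
A = 25  # any convenient value of x; the result does not depend on it
shortcut_mean = A + sum(f * (x - A) for f, x in zip(frequencies, midpoints)) / N

print(direct_mean, shortcut_mean)
```

Choosing A near the middle of the data keeps the deviations small, which is the whole point of the shortcut when working by hand.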

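Properties of arithmetic mean

1. The algebraic sum of the deviations of a set of values from their mean is zero. The
deviations are the differences between the data values and the sample mean. The
value of the ith deviation is $x_i - \bar{x}$, so $\sum (x_i - \bar{x}) = 0$.
2. The sum of the squared deviations of a set of values from a point A, $\sum (x_i - A)^2$,
is minimum when $A = \bar{x}$.
3. Mean of a composite series (Combined mean): if two series of sizes $n_1$ and $n_2$ have
means $\bar{x}_1$ and $\bar{x}_2$, the combined mean is
$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$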
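The three properties can be checked numerically. The values and group sizes below are hypothetical, and the combined mean uses the standard weighted formula (n1·x̄1 + n2·x̄2)/(n1 + n2).

```python
values = [2, 4, 6, 8, 10]
mean = sum(values) / len(values)

# Property 1: deviations from the mean sum to zero.
deviation_sum = sum(x - mean for x in values)

# Property 2: the sum of squared deviations about A is smallest at A = mean.
def sum_of_squares_about(a):
    return sum((x - a) ** 2 for x in values)

at_mean = sum_of_squares_about(mean)
at_five = sum_of_squares_about(5)
at_seven = sum_of_squares_about(7)

# Property 3: combined mean of two groups with sizes n1, n2 and means mean1, mean2.
n1, mean1 = 40, 25.5
n2, mean2 = 60, 30.5
combined_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)

print(deviation_sum, at_mean, at_five, at_seven, combined_mean)
```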

Geometric mean – It is the nth root of the product of n positive values.
$$GM = (x_1 x_2 \cdots x_n)^{1/n}$$
Harmonic mean – It is the reciprocal of the arithmetic mean of the reciprocals of the values
in a data set. The harmonic mean is a measure of central tendency for data expressed as rates
such as kilometers per hour, tonnes per day, kilometers per litre etc.
$$HM = \frac{n}{\sum (1/x_i)}$$
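A minimal sketch of both means, written out by hand and compared with Python's `statistics` module. The speeds and growth factors are hypothetical; the speed example assumes the two legs of the journey cover equal distances, which is the case where the harmonic mean gives the true average speed.

```python
from math import prod
from statistics import geometric_mean, harmonic_mean

def geometric_mean_manual(values):
    # nth root of the product of n positive values
    return prod(values) ** (1 / len(values))

def harmonic_mean_manual(values):
    # reciprocal of the arithmetic mean of the reciprocals
    return len(values) / sum(1 / x for x in values)

speeds = [40, 60]                  # km/h over two equal distances
average_speed = harmonic_mean_manual(speeds)

growth_factors = [2, 8]            # successive multiplicative changes
typical_growth = geometric_mean_manual(growth_factors)

print(average_speed, harmonic_mean(speeds))
print(typical_growth, geometric_mean(growth_factors))
```

The average speed comes out at about 48 km/h, lower than the arithmetic mean of 50, because more time is spent on the slower leg.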

Median - It is the value that divides the data set into two equal halves.
For discrete data; Order the data values from smallest to largest. If the number of data values
is odd, then the sample median is the middle value in the ordered list; if it is even, then the
sample median is the average of the two middle values
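For continuous data;
$$M_d = l + \frac{h}{f}\left(\frac{N}{2} - C\right)$$
where l = lower limit of the median class (m.c.)
h = magnitude (width) of the median class
f = frequency of the median class
C = cumulative frequency (c.f.) of the class preceding the median class
N = total frequency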
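The grouped-data median formula M_d = l + (h/f)(N/2 − C) can be sketched as a small function. The function name and the data are hypothetical, and the sketch assumes equal class widths with the median class taken as the first class whose cumulative frequency reaches N/2.

```python
from itertools import accumulate

def grouped_median(lower_limits, width, frequencies):
    """Median of grouped data with equal class widths."""
    N = sum(frequencies)
    cumulative = list(accumulate(frequencies))
    # Median class: first class whose cumulative frequency reaches N/2.
    k = next(i for i, c in enumerate(cumulative) if c >= N / 2)
    l = lower_limits[k]                      # lower limit of the median class
    f = frequencies[k]                       # frequency of the median class
    C = cumulative[k - 1] if k > 0 else 0    # c.f. preceding the median class
    return l + (width / f) * (N / 2 - C)

# Classes 0-10, 10-20, 20-30, 30-40, 40-50 with hypothetical frequencies.
median = grouped_median([0, 10, 20, 30, 40], 10, [5, 8, 12, 10, 5])
print(median)
```

Here N/2 = 20 falls in the class 20-30 (C = 13, f = 12), so the median is 20 + (10/12)(7), a little under 26.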

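Mode - defined as the value that occurs most frequently in the data
For continuous data;
$$\text{Mode} = l + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$$
where l = lower limit of the modal class, h = width of the modal class, $f_1$ = frequency of the
modal class, and $f_0$, $f_2$ = frequencies of the classes preceding and following it.
For a moderately skewed distribution, the mode can also be estimated from the empirical relationship
$$\text{Mode} = 3\,\text{Median} - 2\,\text{Mean}$$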
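The sketch below computes the mode of grouped data with the usual textbook formula Mode = l + (f1 − f0)/(2·f1 − f0 − f2) × h and compares it with the empirical estimate 3 Median − 2 Mean. The grouped formula is assumed here because the notes' own formula did not survive extraction; the data, mean and median values are hypothetical.

```python
def grouped_mode(lower_limits, width, frequencies):
    """Mode of grouped data with equal class widths (assumes one modal class)."""
    k = max(range(len(frequencies)), key=lambda i: frequencies[i])  # modal class
    f1 = frequencies[k]
    f0 = frequencies[k - 1] if k > 0 else 0                      # preceding class
    f2 = frequencies[k + 1] if k < len(frequencies) - 1 else 0   # following class
    return lower_limits[k] + (f1 - f0) / (2 * f1 - f0 - f2) * width

lower_limits = [0, 10, 20, 30, 40]
frequencies = [5, 8, 12, 10, 5]
mode = grouped_mode(lower_limits, 10, frequencies)

# Empirical relationship, using the mean and median of the same data.
mean = 25.5
median = 20 + 10 / 12 * 7
empirical_mode = 3 * median - 2 * mean

print(mode, empirical_mode)
```

The two results (about 26.67 and 26.5) are close but not equal, which is expected: the empirical relationship is only an approximation.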


Measures of Dispersion
A measure of dispersion (also called a measure of variability or spread) is a statistical value
that describes how much the data values in a dataset vary or spread out from the central value
(mean, median, or mode). There are two basic kinds of measures of dispersion: (i) absolute
measures and (ii) relative measures. Absolute measures express the variability of a data set
in the same units as the data, while relative measures are unit-free and are used
to compare the variability of two or more sets of observations.

PYQ: Explain the term ‘dispersion’


Ans. Dispersion refers to the degree to which data points in a dataset are spread out or
scattered around a central value. It measures the variability or diversity within the dataset. A
smaller dispersion indicates that the data points are closely clustered around the central value,
while a larger dispersion suggests they are spread out.

1. Range - The range is the simplest measure of dispersion, calculated as the difference
between the highest and lowest values in the dataset.
merit = simplicity of calculation
demerit = It utilizes only the maximum and the minimum values of variable in the
series and gives no importance to other observations
2. Quartile deviation
Quartiles are values that divide a complete set of ordered observations into
four equal parts.
The quartile deviation is also known as the semi-interquartile range and is given
by:
$$QD = \frac{Q_3 - Q_1}{2}$$
For ungrouped data;
Q1 = value of the (n+1)/4 th observation
Q3 = value of the 3(n+1)/4 th observation

For grouped data;
$$Q_1 = L_1 + \frac{N/4 - C_1}{f_1} \times h$$
$$Q_3 = L_3 + \frac{3N/4 - C_3}{f_3} \times h$$
where L = lower limit of the quartile class, C = cumulative frequency of the class preceding
it, f = frequency of the quartile class and h = class width.
Quartile deviation is a better measure than the range as it makes use of the middle 50% of the data.
But since it ignores the other 50% of the data, it cannot be regarded as a reliable measure.
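For ungrouped data, the (n+1)/4 rule can be sketched as follows. The helper name and data are hypothetical, and linear interpolation between neighbouring observations is assumed when the position is not a whole number (a common convention, not stated in the notes).

```python
def value_at_position(sorted_values, position):
    """Value of the position-th observation (1-based), interpolating linearly."""
    whole = int(position)
    fraction = position - whole
    if whole < 1:
        return sorted_values[0]
    if whole >= len(sorted_values):
        return sorted_values[-1]
    below = sorted_values[whole - 1]
    above = sorted_values[whole]
    return below + fraction * (above - below)

data = sorted([12, 7, 3, 15, 9, 21, 18])     # ordered: 3, 7, 9, 12, 15, 18, 21
n = len(data)

q1 = value_at_position(data, (n + 1) / 4)        # 2nd observation
q3 = value_at_position(data, 3 * (n + 1) / 4)    # 6th observation
quartile_deviation = (q3 - q1) / 2

print(q1, q3, quartile_deviation)
```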

3. Mean deviation
It is the average of the absolute values of the deviations from any arbitrary value A, viz. the mean,
median, mode, etc.
$$MD = \frac{1}{N}\sum f_i \,|x_i - A|$$
Since mean deviation is based on all the observations, it is a better measure of dispersion than
range or quartile deviation. But ignoring the signs of the deviations ($x_i - A$) is
artificial and makes the measure unsuitable for further mathematical treatment.
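A short sketch of the mean deviation taken about the mean and about the median, on hypothetical ungrouped data. It also illustrates the known result that the mean deviation is smallest when measured from the median.

```python
from statistics import fmean, median

def mean_deviation(values, about):
    # Average of the absolute deviations from the chosen central value.
    return sum(abs(x - about) for x in values) / len(values)

data = [3, 7, 9, 12, 15, 18, 21]

md_about_mean = mean_deviation(data, fmean(data))
md_about_median = mean_deviation(data, median(data))

print(md_about_mean, md_about_median)
```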

MEAN DEVIATION                                      | STANDARD DEVIATION
Algebraic signs are ignored                         | Signs are taken into account
Can be computed either from the mean or the median  | Always computed from the arithmetic mean

4. Standard deviation
For grouped data:
$$\sigma = \sqrt{\frac{\sum f_i X^2}{N}}$$
For ungrouped data:
$$\sigma = \sqrt{\frac{\sum X^2}{n}}$$
* Note: $X = (x_i - \bar{x})$

Standard deviation is also known as root mean square deviation because it is the
square root of the mean of the squared deviations from the arithmetic mean. The standard
deviation measures the absolute dispersion or variability of a distribution.
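Mathematical properties of standard deviation

Standard deviation of combined series:
$$\sigma = \sqrt{\frac{n_1\sigma_1^2 + n_2\sigma_2^2 + n_1 d_1^2 + n_2 d_2^2}{n_1 + n_2}}$$
where $d_1 = (\bar{x}_1 - \bar{x})$ and $d_2 = (\bar{x}_2 - \bar{x})$,
$\bar{x}_1$ and $\bar{x}_2$ are the means of the two series,
and $\bar{x}$ is the combined mean of the series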
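The combined standard deviation formula, σ² = (n1σ1² + n2σ2² + n1d1² + n2d2²)/(n1 + n2), can be verified by pooling the raw data. The two groups below are hypothetical; `pstdev` is the population standard deviation (divisor N), which matches the definition used in these notes.

```python
from math import sqrt
from statistics import fmean, pstdev

group1 = [2, 4, 6]
group2 = [10, 12, 14, 16]

n1, n2 = len(group1), len(group2)
mean1, mean2 = fmean(group1), fmean(group2)
sd1, sd2 = pstdev(group1), pstdev(group2)

combined_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)
d1 = mean1 - combined_mean
d2 = mean2 - combined_mean

combined_sd = sqrt(
    (n1 * sd1**2 + n2 * sd2**2 + n1 * d1**2 + n2 * d2**2) / (n1 + n2)
)

# Direct calculation on the pooled data, for comparison.
pooled_sd = pstdev(group1 + group2)
print(combined_sd, pooled_sd)
```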

Variance
It is defined as the square of the standard deviation.
$$\text{Var} = \sigma^2 = \frac{1}{N}\sum f_i (x_i - \bar{x})^2$$
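A minimal sketch of the variance and standard deviation of grouped data, following Var = (1/N) Σ f_i (x_i − x̄)². The mid-points and frequencies are hypothetical.

```python
from math import sqrt

# Hypothetical grouped data: class mid-points x_i and frequencies f_i.
midpoints = [5, 15, 25, 35, 45]
frequencies = [5, 8, 12, 10, 5]
N = sum(frequencies)

mean = sum(f * x for f, x in zip(frequencies, midpoints)) / N

# Variance: frequency-weighted mean of the squared deviations from the mean.
variance = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, midpoints)) / N
standard_deviation = sqrt(variance)

print(mean, variance, standard_deviation)
```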

MOMENTS
Let $x_i - \bar{x}$ denote the deviation of any item in a distribution from
the arithmetic mean of that distribution. The arithmetic means of the various powers of
these deviations are called the moments of the distribution. The
moments about the mean are called central moments and are denoted by μ:
$$\mu_r = \frac{1}{N}\sum f_i (x_i - \bar{x})^r$$
In a symmetrical distribution all odd central moments, i.e. μ1, μ3 etc., are zero
(μ1 is zero for every distribution).

Moments about arbitrary origin


Moments about an arbitrary origin (A) are also called ‘raw moments’ and are denoted by μ′:
$$\mu'_r = \frac{1}{N}\sum f_i (x_i - A)^r$$

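For the sake of simplicity, moments are first calculated about an arbitrary origin. If we want
to obtain moments about the mean, we can do so with the help of the following relationships:
$$\mu_2 = \mu'_2 - (\mu'_1)^2$$
$$\mu_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2(\mu'_1)^3$$
$$\mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2(\mu'_1)^2 - 3(\mu'_1)^4$$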
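The conversion from raw moments to central moments can be checked on a small example. The data and the origin A = 5 are hypothetical; the relationships used are the standard ones, μ2 = μ′2 − μ′1², μ3 = μ′3 − 3μ′2μ′1 + 2μ′1³ and μ4 = μ′4 − 4μ′3μ′1 + 6μ′2μ′1² − 3μ′1⁴.

```python
def moment(values, origin, r):
    # r-th moment about the given origin: mean of (x - origin)^r
    return sum((x - origin) ** r for x in values) / len(values)

data = [2, 3, 7, 8, 10]
mean = sum(data) / len(data)
A = 5  # arbitrary origin

# Raw moments about A.
m1, m2, m3, m4 = (moment(data, A, r) for r in (1, 2, 3, 4))

# Central moments obtained from the raw moments.
mu2 = m2 - m1**2
mu3 = m3 - 3 * m2 * m1 + 2 * m1**3
mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1**2 - 3 * m1**4

# Central moments computed directly about the mean, for comparison.
direct = [moment(data, mean, r) for r in (2, 3, 4)]
print((mu2, mu3, mu4), direct)
```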

SKEWNESS
It refers to lack of symmetry. When a distribution is asymmetrical, it is called a
skewed distribution.
In a symmetrical distribution: mean=median=mode – Bell shaped curve
[Figure: negatively skewed curve, normal curve and positively skewed curve]
Measures of Skewness
1. Absolute measures:
Sk = mean – median
Sk = mean – mode
Sk = Q3 + Q1 – 2 Median (when skewness is based on quartiles, absolute skewness is
given by this formula)
2. Relative measures of skewness
a. Karl Pearson’s coefficient
b. Bowley’s coefficient
c. Kelly’s coefficient
d. Measure of skewness based on moments
These measures of skewness are mainly used for making comparisons between two or more
distributions.
a. Karl Pearson’s coefficient of skewness:
$$Sk_P = \frac{\text{Mean} - \text{Mode}}{\sigma}$$
or, when the mode is ill-defined,
$$Sk_P = \frac{3(\text{Mean} - \text{Median})}{\sigma}$$
b. Bowley’s coefficient of skewness:
$$Sk_B = \frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$$

PYQ: Define Bowley’s coefficient of skewness. What are its limits?


Ans. Bowley’s measure of skewness is based on quartiles. In a symmetrical
distribution the first and third quartiles are equidistant from the median, i.e. in a
symmetrical distribution, the third quartile is the same distance above the median as the
first quartile is below it. If the distribution is positively skewed, the top 25% of values
will tend to be farther from the median than the bottom 25%, i.e. Q3 will be farther
from the median than Q1 is from the median, and the reverse for negative skewness.
Bowley’s coefficient is limited to values between -1 and +1.
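Bowley's coefficient, (Q3 + Q1 − 2 Median)/(Q3 − Q1), and its limits can be illustrated with a short function. The function name and the quartile values are hypothetical.

```python
def bowley_skewness(q1, median, q3):
    # (Q3 - Median) - (Median - Q1), scaled by the interquartile range
    return (q3 + q1 - 2 * median) / (q3 - q1)

symmetric = bowley_skewness(10, 15, 20)    # quartiles equidistant from median
positive = bowley_skewness(7, 12, 18)      # Q3 farther from the median than Q1
upper_limit = bowley_skewness(10, 10, 20)  # median coincides with Q1
lower_limit = bowley_skewness(10, 20, 20)  # median coincides with Q3

print(symmetric, positive, upper_limit, lower_limit)
```

Because the median always lies between Q1 and Q3, the numerator can never exceed the interquartile range in size, which is why the coefficient stays between -1 and +1.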

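c. Kelly’s coefficient of skewness:
$$Sk_K = \frac{D_9 + D_1 - 2D_5}{D_9 - D_1}$$
(D = decile)
$$Sk_K = \frac{P_{90} + P_{10} - 2P_{50}}{P_{90} - P_{10}}$$
(P = percentile)
d. Based on moments
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$$
$$\gamma_1 = \pm\sqrt{\beta_1}$$
and the sign of $\gamma_1$ is the sign of $\mu_3$, i.e. $\gamma_1 = \mu_3 / \mu_2^{3/2}$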

KURTOSIS
In Greek, kurtosis means ‘bulginess’. In statistics, it refers to the degree of flatness or
peakedness of a frequency curve, usually taken relative to a normal
distribution.
If a curve is more peaked than the normal curve = leptokurtic
If a curve is more flat-topped than the normal curve = platykurtic
Normal curve = mesokurtic
Measures of kurtosis
$$\beta_2 = \frac{\mu_4}{\mu_2^2}, \qquad \gamma_2 = \beta_2 - 3$$
if $\beta_2 = 3$, $\gamma_2 = 0$, the curve is normal, i.e. mesokurtic
if $\beta_2 > 3$, $\gamma_2 > 0$, the curve is leptokurtic
if $\beta_2 < 3$, $\gamma_2 < 0$, the curve is platykurtic
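The moment-based coefficients can be computed and the curve classified as in the sketch below. The function names, the tolerance used to decide "equal to zero" and the data are hypothetical; the formulas assumed are β1 = μ3²/μ2³, γ1 = μ3/μ2^(3/2), β2 = μ4/μ2² and γ2 = β2 − 3.

```python
def central_moment(values, r):
    mean = sum(values) / len(values)
    return sum((x - mean) ** r for x in values) / len(values)

def shape_coefficients(values):
    mu2 = central_moment(values, 2)
    mu3 = central_moment(values, 3)
    mu4 = central_moment(values, 4)
    beta1 = mu3**2 / mu2**3
    gamma1 = mu3 / mu2**1.5      # carries the sign of mu3
    beta2 = mu4 / mu2**2
    gamma2 = beta2 - 3
    return beta1, gamma1, beta2, gamma2

def kurtosis_type(gamma2, tolerance=1e-9):
    if abs(gamma2) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if gamma2 > 0 else "platykurtic"

data = [2, 3, 7, 8, 10]
beta1, gamma1, beta2, gamma2 = shape_coefficients(data)
print(beta1, gamma1, beta2, gamma2, kurtosis_type(gamma2))
```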

THEORY OF ATTRIBUTES
A qualitative characteristic of an individual is called an attribute. Attributes are denoted by
capital letters A, B, C and their absence by Greek letters α, β, γ.
When the variable x is not a measurable numeric quantity, we use the theory of attributes.

Classification:
1. If only one attribute is studied, the population is divided into two classes according to
its presence or absence and such classification is termed as dichotomous
classification.
2. If the population is divided into more than two classes, such classification is termed as
manifold classification.

Class frequency: The number of observations within a particular class is known as the
frequency of that class. It is denoted by enclosing the class symbol in brackets: the frequency
of class A is denoted by (A), and similarly the frequencies of AB and α by (AB) and (α) respectively.

If the number of attributes is n, the total number of class frequencies will be $3^n$.

Order of Class: The order of a class is the number of attributes that specify it, so a class
specified by n attributes is called an nth order class. For example, A, B, α, β etc. are first
order classes; AB, αβ are second order classes; ABC, αβγ are third order classes.

The frequencies of the classes of highest order are called as ultimate class frequencies.
The frequency of a lower order class can always be expressed in terms of the higher order
class frequencies.
N = (A) + (α) = (B) + (β)
(A) = (AB) + (Aβ)
(α) = (αB) + (αβ)
(B) = (AB) + (αB)
(β) = (Aβ) + (αβ)

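Consistency of attributes
$$\text{Coefficient of variation (CV)} = \frac{\sigma}{\bar{x}} \times 100$$
For two series A and B with coefficients of variation $CV_A$ and $CV_B$:
If $CV_A > CV_B$, A is less consistent
If $CV_A < CV_B$, A is more consistent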
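Comparing consistency with the coefficient of variation, CV = (σ / x̄) × 100, might look like the sketch below. The two series are hypothetical, and the population standard deviation (`pstdev`) is assumed.

```python
from statistics import fmean, pstdev

def coefficient_of_variation(values):
    # Standard deviation as a percentage of the mean (unit-free).
    return pstdev(values) / fmean(values) * 100

series_a = [48, 50, 52]
series_b = [30, 50, 70]

cv_a = coefficient_of_variation(series_a)
cv_b = coefficient_of_variation(series_b)

# The series with the smaller CV is the more consistent one.
more_consistent = "A" if cv_a < cv_b else "B"
print(cv_a, cv_b, more_consistent)
```

Both series have the same mean of 50, but B varies ten times as much around it, so A is the more consistent series.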
For class frequencies, (AB) cannot be greater than (A): $(AB) \ngtr (A)$

Theorem: A necessary and sufficient condition for the consistency of data is that no ultimate
class frequency is negative.
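For two attributes, the consistency check can be sketched by deriving the four ultimate class frequencies from N, (A), (B) and (AB) and testing that none is negative. The function names and figures are hypothetical; lower-case `a` and `b` stand in for α and β in the dictionary keys.

```python
def ultimate_frequencies(N, A, B, AB):
    """Ultimate (second order) class frequencies for two attributes."""
    return {
        "AB": AB,
        "Ab": A - AB,           # (Aβ) = (A) - (AB)
        "aB": B - AB,           # (αB) = (B) - (AB)
        "ab": N - A - B + AB,   # (αβ) = N - (A) - (B) + (AB)
    }

def is_consistent(N, A, B, AB):
    # Data are consistent when no ultimate class frequency is negative.
    return all(value >= 0 for value in ultimate_frequencies(N, A, B, AB).values())

good = ultimate_frequencies(100, 40, 30, 12)
print(good, is_consistent(100, 40, 30, 12))

# (AB) = 45 exceeds (A) = 40, so (Aβ) would be negative.
print(ultimate_frequencies(100, 40, 30, 45), is_consistent(100, 40, 30, 45))
```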

Independence of Attributes
Two attributes are said to be independent if the presence or absence of one does not
affect the presence or absence of the other.
Criterion 1: Proportion of attribute A is equal in B and β. Proportion of B is equal in A and α.
$$\frac{(AB)}{(B)} = \frac{(A\beta)}{(\beta)} \quad \text{and} \quad \frac{(AB)}{(A)} = \frac{(\alpha B)}{(\alpha)}$$
Criterion 2: If A and B are independent, the proportion of AB’s in the total population N is equal
to the product of the proportions of the individual attributes.
$$\frac{(AB)}{N} = \frac{(A)}{N} \cdot \frac{(B)}{N} \;\Rightarrow\; (AB) = \frac{(A)(B)}{N}$$

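Criterion 3: The cross products of the ultimate class frequencies are equal.
$$(AB)(\alpha\beta) = (A\beta)(\alpha B)$$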

Association of attributes
Two attributes A and B are said to be associated if they are not independent, i.e. they are
related in some way or the other.
If $(AB) > \frac{(A)(B)}{N}$ then they are positively associated
If $(AB) < \frac{(A)(B)}{N}$ then they are negatively associated
$$\delta = (AB) - \frac{(A)(B)}{N}$$
if $\delta > 0$, A and B are positively associated
if $\delta < 0$, A and B are negatively associated

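Yule’s coefficient of association of attributes:
$$Q = \frac{(AB)(\alpha\beta) - (A\beta)(\alpha B)}{(AB)(\alpha\beta) + (A\beta)(\alpha B)}$$
Q lies between -1 and +1, and Q = 0 when the attributes are independent.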
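A minimal sketch of measuring association for two attributes, computing δ = (AB) − (A)(B)/N and Yule's coefficient Q = [(AB)(αβ) − (Aβ)(αB)] / [(AB)(αβ) + (Aβ)(αB)]. The function name and figures are hypothetical; lower-case `a` and `b` stand in for α and β.

```python
def association(N, A, B, AB):
    """Return (delta, Q) for two attributes from N, (A), (B) and (AB)."""
    Ab = A - AB            # (Aβ)
    aB = B - AB            # (αB)
    ab = N - A - B + AB    # (αβ)
    delta = AB - A * B / N
    Q = (AB * ab - Ab * aB) / (AB * ab + Ab * aB)
    return delta, Q

# Independent attributes: (AB) equals (A)(B)/N = 12.
delta_ind, q_ind = association(100, 40, 30, 12)

# Positive association: (AB) = 20 is larger than the expected 12.
delta_pos, q_pos = association(100, 40, 30, 20)

# Negative association: (AB) = 6 is smaller than the expected 12.
delta_neg, q_neg = association(100, 40, 30, 6)

print(delta_ind, q_ind)
print(delta_pos, q_pos)
print(delta_neg, q_neg)
```

The sign of Q always agrees with the sign of δ, since N·δ equals (AB)(αβ) − (Aβ)(αB), the numerator of Q.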
