Stats Full Notes
Introduction
1.1 Definitions of Statistics:
1. Statistics is the science which deals with the collection, classification, tabulation, analysis
and interpretation of numerical data.
Statistics is used in agricultural research for efficient planning of experiments and for
interpreting experimental data. e.g.
iv) To study the effect of pest and disease incidence on crop yields.
i) Statistics can be applied only when there is a variability in the experimental material.
ii) Statistics deals only with aggregates or groups and not with individual objects.
iii) The laws of statistics are true on average, in the long run.
1.4 Types of data: There are two types of data, based on the nature of the characteristic
observed.
2. Qualitative data (attribute data): When the characteristic observed is not
measured but is classified into categories, the data are called qualitative data.
1.5 Classification of data: Arrangement of data according to some common features is known
as classification.
The classification is done according to the type of data i.e. measurement or attribute data. In
case of measurement data they are classified according to classes and in case of attribute
data, we count the items with a specified character and put them in the appropriate categories.
Types of classification:
2. Qualitative classification: Data are classified according to the attributes of the subjects or
items. E.g. qualification, colour etc.
3. Geographical classification: Data are classified according to the place, area, region.
4. Chronological classification: Data are classified according to the lapse of time. E.g.
weekly, monthly, yearly etc.
Frequency: The number of times a category or class occurs; it is denoted by ‘f’.
Frequency distribution: Sorting data into categories or classes leads to the formation of a
frequency distribution, i.e. a frequency distribution gives the number of times each category or class
occurs.
The boundary figures of the classes 2-3, 3-4, 4-5 etc. are called class limits.
The lower bound values are called lower limits and the upper bound values are called upper
limits of the classes.
The difference between the lower and upper limits of a class is called the class interval.
Table no.2 Frequency distribution of number of seeds germinated in 50 plots.
4. To find out mutual relationship between certain measures and their effects.
1.8 Tabulation:
It is the process of presenting data collected through a survey, experiment, or record
in rows and columns so that it can be more easily understood and can be used for further
statistical analysis.
1.9 Objectives of Tabulation:
Tabulation is the logical and systematic arrangement of data in rows and columns. In
a table, data may be presented in modified form as well, e.g. in percent, proportion, total or
average values etc.
Lecture. 2
Measures of Central Tendency
A measure of central tendency is a typical value around which the other figures congregate.
The averages are representative of the entire series. The five measures of central tendency are mean,
median, mode, geometric mean and harmonic mean.
Characteristics of ideal measures :
1. It should be rigidly defined
2. It should be easy to calculate
3. It should be based on all observations.
4. It should be suitable for further mathematical treatment
5. It should not be affected much by extreme values.
In the study of a population with respect to a characteristic in which we are interested we may get a
large number of observations. It is not possible to grasp any idea about the characteristic when
we look at all the observations. So it is better to get one number for one group. That number
must be a good representative one for all the observations to give a clear picture of that
characteristic. Such representative number can be a central value for all these observations. This
central value is called a measure of central tendency or an average or a measure of locations.
There are five averages. Among them mean, median and mode are called simple averages and
the other two averages geometric mean and harmonic mean are called special averages.
Grouped Data
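The mean for grouped data is obtained from the following formula:
$$\bar{x} = A + \frac{\sum f d}{n} \times c, \qquad d = \frac{x - A}{c}$$
Where
x = mid value of the class
f = frequency of the class
A = any value in x (the assumed mean)
n = total frequency
c = width of the class interval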
2. Median
The median is the middle most item that divides the group into two equal parts, one part
comprising all values greater, and the other, all values less than that item.
Ungrouped or Raw data
Arrange the given values in ascending order. If the number of values is odd, the median
is the middle value.
If the number of values is even, the median is the mean of the middle two values.
By formula
Grouped data
In a grouped distribution, values are associated with frequencies.
Grouping can be in the form of a discrete frequency distribution or a continuous
frequency distribution. Whatever may be the type of distribution, cumulative
frequencies have to be calculated.
Cumulative frequency (cf)
Cumulative frequency of each class is the sum of the frequency of the class and the
frequencies of the previous classes, i.e. adding the frequencies successively, so that the last
cumulative frequency gives the total number of items.
Discrete Series
Step1: Find cumulative frequencies.
Step3: See in the cumulative frequencies the value just greater than
Continuous Series
The steps given below are followed for the calculation of median in continuous series.
Step1: Find cumulative frequencies.
Step2: Find $n/2$.
Step3: See in the cumulative frequencies the value first greater than $n/2$. The corresponding
class interval is called the median class. Then apply the formula
$$\text{Median} = L + \left( \frac{\frac{n}{2} - c.f.}{f} \times C \right)$$
Where L = lower limit of the median class
n = total frequency
f = frequency of the median class
C = width of the median class
c.f. = cumulative frequency preceding the median class
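The continuous-series steps above can be sketched in a few lines of Python. This is only an illustration: the class limits and frequencies are hypothetical, the function name `grouped_median` is invented for this sketch, and equal class widths are assumed.

```python
def grouped_median(lower_limits, freqs, width):
    """Median of a continuous frequency distribution (equal class widths assumed)."""
    n = sum(freqs)                      # total frequency
    if n == 0:
        raise ValueError("the distribution has no observations")
    half = n / 2
    cf_before = 0                       # cumulative frequency preceding the current class
    for lower, f in zip(lower_limits, freqs):
        # The median class is the first class whose cumulative frequency reaches n/2.
        if cf_before + f >= half:
            return lower + (half - cf_before) / f * width
        cf_before += f


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 10, 5]
median = grouped_median(lower_limits, freqs, width=10)
print(round(median, 2))  # 25.83
```

Here n = 40, so n/2 = 20; the cumulative frequencies are 5, 13, 25, ..., so the median class is 20-30 with c.f. = 13 and f = 12.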
Merits of Median
• Median is not influenced by extreme values because it is a positional average.
• Median can be calculated in case of distribution with open-end intervals.
• Median can be located even if the data are incomplete.
Demerits of Median
• A slight change in the series may bring drastic change in median value.
• It is not suitable for further mathematical treatment except its use in calculating
mean deviation.
• It does not take into account all the observations.
3. Mode
The value which occurs most frequently in a set of observations is
called the mode.
It is an actual value, which has the highest concentration of items in and
around it. It shows the centre of concentration of the frequency around a given
value. Therefore, where the purpose is to know the point of the highest
concentration it is preferred. It is, thus, a positional measure.
It is very important in agriculture, for example to find the typical height of a
crop variety, the most common source of irrigation in a region, or the most disease-prone
paddy variety. Thus the mode is an important measure in the case of qualitative data.
Grouped Data
For Discrete distribution, see the highest frequency and corresponding value of x is
mode.
Continuous distribution
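Locate the highest frequency; the class corresponding to that frequency is called the
modal class.
Then apply the formula
$$\text{Mode} = L + \left( \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times C \right)$$
Where L = lower limit of the modal class
f1 = frequency of the modal class
f0 = frequency of the class preceding the modal class
f2 = frequency of the class succeeding the modal class
C = class interval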
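As a rough illustration of the modal-class formula, the sketch below computes the mode of a continuous distribution. The data are hypothetical and `grouped_mode` is a name made up for this example; it assumes equal class widths, takes f0 or f2 as 0 when the modal class is the first or last class, and does not handle the degenerate case 2f1 = f0 + f2.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode of a continuous frequency distribution (equal class widths assumed)."""
    # Index of the modal class: the first class with the highest frequency.
    i = max(range(len(freqs)), key=lambda k: freqs[k])
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0               # class preceding the modal class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0  # class succeeding the modal class
    return lower_limits[i] + (f1 - f0) / (2 * f1 - f0 - f2) * width


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 10, 5]
mode = grouped_mode(lower_limits, freqs, width=10)
print(round(mode, 2))  # 26.67
```

The modal class is 20-30 (f1 = 12, f0 = 8, f2 = 10), giving 20 + (4 / 6) × 10.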
Merits :
1. Mode is easy to calculate
2. Mode is not at all affected by extreme values.
Demerits :
1. Mode is ill-defined
2. It is not based on all the observations
3. It is not capable of further mathematical treatment
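4. Geometric mean :
The geometric mean of a series containing n observations is the nth root of the product of the
values.
If x1, x2, …, xn are the observations, then
$$G.M. = \sqrt[n]{x_1 x_2 x_3 \cdots x_n}$$
$$G.M. = \text{Antilog}\left( \frac{\sum \log x_i}{n} \right)$$
For grouped data
$$G.M. = \text{Antilog}\left( \frac{\sum f_i \log x_i}{n} \right), \qquad n = \sum f_i$$
Merits :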
1. It is rigidly defined
2. It is based on all the observations
3. It is capable of further mathematical treatment
4. It is not affected much by fluctuation of sampling.
5. It gives comparatively more weights to small items / values
Demerits :
1. G.M. is not easy to understand
2. It is not easy to calculate
3. If any one of the observations is zero, the G.M. is zero.
4. If any one of the observations is negative, the G.M. is not defined (it may be imaginary).
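The logarithmic (antilog) form of the geometric mean can be sketched as below. The function name and the sample values are hypothetical; the handling of zero and negative observations follows the two demerits just listed.

```python
import math


def geometric_mean(values):
    """Geometric mean via logarithms: Antilog(sum(log x) / n)."""
    if any(v < 0 for v in values):
        raise ValueError("G.M. is not defined when an observation is negative")
    if any(v == 0 for v in values):
        return 0.0  # one zero observation makes the whole product zero
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log  # antilog to base 10


values = [2, 4, 8]                     # hypothetical observations
gm = geometric_mean(values)
am = sum(values) / len(values)
print(round(gm, 4), round(am, 4))      # 4.0 4.6667
```

The product 2 × 4 × 8 = 64 has cube root 4, which is smaller than the arithmetic mean of the same values.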
Merits of H.M
1. It is rigidly defined.
2. It is defined on all observations.
3. It is amenable to further algebraic treatment.
4. It is the most suitable average when it is desired to give greater weight to smaller
observations and less weight to the larger ones.
Demerits of H.M
1. It is not easily understood.
2. It is difficult to compute.
3. It is only a summary figure and may not be the actual item in the series
4. It gives greater importance to small items and is therefore, useful only when
small items have to be given greater weightage.
5. It is rarely used in grouped data.
Weighted mean :
In calculating the arithmetic mean we suppose that all the items in the distribution have
equal importance. If some items in a distribution are more important than others,
proper weightage is given to the various items.
e.g. If we want to have an idea of the change in living of a certain group of people, then the simple
mean cannot work, because all the commodities are not equally important: wheat, rice and pulses
are more important than tea, confectionery etc.
If X1, X2, …, Xn are the items and W1, W2, …, Wn are the weights attached to each
item respectively, then the weighted mean is given by the formula
$$W.M. = \frac{\sum w_i x_i}{\sum w_i}$$
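A small sketch of the weighted mean formula, with made-up prices and weights (neither appears in the notes), shows how it differs from the simple mean:

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical prices of three commodities and the weights attached to them
prices = [20, 30, 100]
weights = [5, 3, 2]

wm = weighted_mean(prices, weights)
simple = sum(prices) / len(prices)
print(wm, simple)  # 39.0 50.0
```

Because the cheaper items carry larger weights, the weighted mean (39) lies well below the simple mean (50).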
Quartiles :
The quartiles divide the distribution in four parts. There are three quartiles.
The second quartile divides the distribution into two halves and therefore is the
same as the median. The first (lower) quartile (Q1) marks off the first one-fourth,
the third (upper) quartile (Q3) marks off the first three-fourths.
Step2: Find
Step3: See in the cumulative frequencies, the value just greater than , then the
corresponding value of x is Q1
Step4: Find
Step5: See in the cumulative frequencies, the value just greater than ,then the
corresponding value of x is Q3
Continuous series
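Step1: Find cumulative frequencies
Step2: Find $n/4$
Step3: See in the cumulative frequencies the value just greater than $n/4$; the
corresponding class interval is called the 1st quartile class.
Step4: Find $3n/4$. See in the cumulative frequencies the value just greater than $3n/4$;
the corresponding class interval is called the 3rd quartile class. Then apply the respective
formulae.
$$Q_1 = L + \left( \frac{\frac{n}{4} - c.f.}{f} \times C \right)$$
$$Q_3 = L + \left( \frac{\frac{3n}{4} - c.f.}{f} \times C \right)$$
where L, f and c.f. refer to the respective quartile class.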
Percentile:
The ninety-nine points which divide a series into 100 equal parts are
called percentiles.
The xth percentile is that value below which x percent of the values in the
distribution fall. It may be noted that the median is the 50th percentile.
For raw data, first arrange the n observations in increasing order. Then the
xth percentile is given by
$$P_x = \left( \frac{x(n+1)}{100} \right)\text{th item}$$
For grouped data
$$P_x = L + \left( \frac{\frac{x \cdot n}{100} - c.f.}{f} \times C \right)$$
Where
L = lower limit of the percentile class, i.e. the class which contains the xth percentile value (x·n/100)
c.f. = cumulative frequency up to the class preceding the percentile class
f = frequency of the percentile class
C = class interval
n = total number of observations
Deciles:
The nine points which divide a series into 10 equal parts are called deciles.
For raw data, first arrange the n observations in increasing order. Then the xth decile is
given by
$$D_x = \left( \frac{x(n+1)}{10} \right)\text{th item}$$
For grouped data
$$D_x = L + \left( \frac{\frac{x \cdot n}{10} - c.f.}{f} \times C \right)$$
Where
L = lower limit of the decile class, i.e. the class which contains the xth decile value (x·n/10)
c.f. = cumulative frequency of the class preceding the decile class
f = frequency of the decile class
C = class interval
n = total number of observations
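The grouped-data formulas for quartiles, deciles and percentiles share one pattern: locate the class containing the (x·n/parts)th value and interpolate within it. A minimal sketch of that pattern follows. The name `grouped_quantile`, its arguments and the data are all hypothetical, and equal class widths are assumed.

```python
def grouped_quantile(lower_limits, freqs, width, x, parts):
    """x-th of `parts` quantile of a continuous distribution.

    parts = 4 gives quartiles, 10 deciles and 100 percentiles.
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("the distribution has no observations")
    target = x * n / parts              # position x.n/parts in the cumulative frequencies
    cf_before = 0                       # cumulative frequency preceding the current class
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cf_before + f >= target:
            return lower + (target - cf_before) / f * width
        cf_before += f
    raise ValueError("x must not exceed parts")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 10, 5]

q1 = grouped_quantile(lower_limits, freqs, 10, 1, 4)
d3 = grouped_quantile(lower_limits, freqs, 10, 3, 10)
q3 = grouped_quantile(lower_limits, freqs, 10, 3, 4)
p50 = grouped_quantile(lower_limits, freqs, 10, 50, 100)
print(q1, d3, q3)  # 16.25 18.75 35.0
```

The 50th percentile returned by the same function coincides with the median of the distribution, as the notes state.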
IMPORTANT
1. The relation between H.M., G.M. and A.M. for positive observations is HM ≤ GM ≤ AM (the three are equal only when all the observations are equal).
2. If the distribution is perfectly symmetrical, the Mean, Median and
Mode coincide, i.e. Mean = Median = Mode
3. If the distribution is moderately skewed (non-symmetrical), then
Mode = 3 Median – 2 Mean
4. Mathematical averages: AM, GM, HM, Quadratic mean
5. Positional averages: Median and Mode. These averages can also be located / determined
graphically.
Lecture-3
Measures of Dispersion
The averages are representatives of a frequency distribution. But they fail to give a
complete picture of the distribution. They do not tell anything about the scatter of
observations within the distribution.
Suppose that we have the distribution of the yields (kg per plot) of two
paddy varieties from 5 plots each. The distribution may be as follows
Variety I 45 42 42 41 40
Variety II 54 48 42 33 30
It can be seen that the mean yield for both varieties is about 42 kg (42.0 kg for Variety I and 41.4 kg for Variety II), but we cannot say that
the performances of the two varieties are the same. There is greater uniformity of yields in
the first variety whereas there is more variability in the yields of the second variety. The
first variety may be preferred since it is more consistent in yield performance.
From the above example it is obvious that a measure of central tendency alone is not
sufficient to describe a frequency distribution. In addition to it we should have a measure of
the scatter of observations. The scatter or variation of observations from their
average is called the dispersion. There are different measures of dispersion like the
range, the quartile deviation, the mean deviation and the standard deviation.
Absolute measures of dispersion are expressed in the same unit in which the original data are
given such as rupees, kilograms, grams, centimeter, meter etc.
1. Range
This is the simplest possible measure of dispersion and is defined as the difference
between the largest and smallest values in the set of observations.
In symbols, Range = L – S.
Where L = Largest value. and S = Smallest value.
In individual observations and discrete series, L and S are easily identified.
In continuous series, the following two methods are followed.
Method 1
L = Upper boundary of the highest class
S = Lower boundary of the lowest class.
Method 2
L = Mid value of the highest class.
S = Mid value of the lowest class.
Merits and Demerits of Range
Merits
• It is simple to understand.
• It is easy to calculate.
• In certain types of problems like quality control, weather forecasts, share
price analysis, etc., range can be used.
Demerits
• It is very much affected by the extreme items.
• It is based on only two extreme observations.
• It cannot be calculated from open-end class intervals.
• It is not suitable for mathematical treatment.
• It is a very rarely used measure.
$$\text{Quartile deviation} = \frac{Q_3 - Q_1}{2}$$
Where, Q3 = Third quartile
Q1 = First quartile
3. Mean deviation:
Mean deviation is the mean of the absolute deviations of the individual values from their average.
In the case of raw data (ungrouped data) the mean deviation from the median is the least.
Formula for ungrouped data:
$$M.D. = \frac{\sum |x - \bar{x}|}{n}$$
Formula for grouped data:
$$M.D. = \frac{\sum f\,|x - \bar{x}|}{N}$$
Where N = ∑f
4. Standard Deviation:
It is defined as the positive square root of the arithmetic mean of the squares of the
deviations of the given observations from their arithmetic mean.
The standard deviation is denoted by ‘s’ in the case of a sample and by the Greek letter sigma (σ) in the case of a
population.
The formulae for calculating the standard deviation for raw data, i.e.
for ungrouped data, are
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \qquad \sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$$
5. Variance:
The square of the standard deviation is called variance.
6. Coefficient of Variation:
The Standard deviation is an absolute measure of dispersion. It is expressed in
terms of units in which the original figures are collected and stated. The standard deviation of
heights of plants can not be compared with the standard deviation of weights of the grains, as
both are expressed in different units, i.e heights in centimeter and weights in kilograms.
Therefore the standard deviation must be converted into a relative measure of dispersion for
the purpose of comparison. The relative measure is known as the coefficient of variation.
The coefficient of variation is obtained by dividing the standard deviation
by the mean and expressed in percentage. Symbolically,
$$\text{Coefficient of Variation (C.V.)} = \frac{S.D.}{\text{Mean}} \times 100$$
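To see these measures side by side, the sketch below computes the mean, range, sample standard deviation (divisor n − 1) and C.V. for the two paddy varieties from the earlier example, using Python's standard `statistics` module. The helper name `summarize` is just for illustration.

```python
import statistics

# Yields (kg per plot) of the two paddy varieties from the earlier example
yields = {
    "Variety I": [45, 42, 42, 41, 40],
    "Variety II": [54, 48, 42, 33, 30],
}


def summarize(values):
    """Mean, range, sample SD and coefficient of variation (in percent)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # sample SD, divisor n - 1
    return {
        "mean": mean,
        "range": max(values) - min(values),
        "sd": sd,
        "cv": sd / mean * 100,
    }


summary = {name: summarize(values) for name, values in yields.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The means come out as 42.0 and 41.4 kg, while the C.V. is roughly 4.5% for Variety I against roughly 24.3% for Variety II, which is why Variety I is the more consistent one.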
Q.2 What do you understand by dispersion ? Write important measures of dispersion and
describe standard deviation with its characteristics. (2014-15)
Q.4 Write various relative and absolute measures of dispersion and describe importance of
Coefficient of Variation. ( 2017-18)
Q.5 Explain various measures of dispersion in details along with examples. ( 2019-20)
Objective :
3. If the CV of variety I is 30% and that of variety II is 25%, then variety II is more consistent.
Ans: True
4. In the set of data 5, 5, 5, 5, 5, 5 the standard deviation is zero.
Ans: True
It has been shown that for any distribution, (M-Md) / σ lies between ±1. Hence
the limits for the co-efficient of skewness are ±3. In practice these limits are
rarely attained.
Skewness is positive if M>M0 or M>Md and negative if M<M0 or M<Md.
2. Prof. Bowley's coefficient of skewness is based on quartiles, using
(Q3 – Md) – (Md – Q1) = Q3 + Q1 – 2Md
Remark 1. Bowley's coefficient of skewness is also known as the quartile coefficient of
skewness. It is especially useful in the following situations:
i) When the mode is ill-defined and extreme observations are present in the data.
ii) When the distribution has open-end classes or unequal class intervals.
In these situations Pearson's coefficient of skewness cannot be used.
Remark 2. Limits of Bowley's coefficient of skewness: –1 to +1
Remark 3. Skewness is positive if Q3 + Q1 > 2Md and
negative if Q3 + Q1 < 2Md
Kurtosis :– If we know the measures of central tendency, dispersion and skewness, we still
cannot form a complete idea about the distributions as will be clear from the figure in which
all the three curves A, B and C are symmetrical about the mean ‘m’ and have the same range.
In addition to these measures we should know one more measure which Prof. Karl Pearson
calls as the ‘Convexity of a curve or Kurtosis’.
Kurtosis enables us to have an idea about the flatness or peakedness of the curve.
It is measured by the co-efficient β2 or its deviation γ2 .
Diagram:
Curve of the type ‘A’ which is neither flat nor peaked is called normal curve
or mesokurtic curve and for such a curve β2 = 3, i.e., γ2 = 0.
Curve of the type ‘B’ which is flatter than the normal curve is known as
platykurtic and for such a curve β2 < 3, i.e., γ2 < 0.
Curve of the type ‘C’ which is more peaked than the normal curve is called
leptokurtic and for such a curve β2 > 3, i.e., γ2 > 0.
4. Probability Theory
Probability
The concept of probability is difficult to define in precise terms. In ordinary
language, the word probable means likely (or) chance. Generally the word, probability, is
used to denote the happening of a certain event, and the likelihood of the occurrence of
that event, based on past experiences. By looking at the clear sky, one will say that there
will not be any rain today. On the other hand, by looking at the cloudy sky or overcast
sky, one will say that there will be rain today. In the former statement, we expect that there
will be no rain and in the latter we expect rain. On the other hand a mathematician says
that the probability of rain is ‘0’ in the first case and that the probability of rain is ‘1’ in
the second case. In between 0 and 1, there are fractions denoting the chance of the event
occurring. In ordinary language, the word probability means uncertainty about
happenings. In Mathematics and Statistics, a numerical measure of uncertainty is provided
by the important branch of statistics – called theory of probability. Thus we can say, that
the theory of probability describes certainty by 1 (one), impossibility by 0 (zero) and
uncertainties by the co-efficient which lies between 0 and 1.
Trial and Event
Consider an experiment which, though repeated under essentially identical (or)
same conditions, does not give unique results but may result in any one of several
possible outcomes. Performing such an experiment is known as a trial and the outcomes of the
experiment are known as events.
Example:
1. Seed germination – a seed either germinates or does not germinate; these are events.
2. In a lot of 5 seeds none may germinate (0), or 1, 2, 3, 4 or all 5 may germinate.
Exhaustive Events
The total number of possible outcomes in any trial is known as exhaustive events
(or) exhaustive cases.
Example
When pesticide is applied a pest may survive or die. There are two exhaustive
cases namely ( survival, death)
In throwing a die, there are six exhaustive cases, since any one of the 6
faces 1, 2, 3, 4, 5, 6 may come uppermost.
In drawing 2 cards from a pack of cards the exhaustive number of cases is
52C2, since 2 cards can be drawn out of 52 cards in 52C2 ways.
(3) When three coins are tossed, the exhaustive number of cases is $2^3 = 8$:
{HHH, HHT, HTT, TTT, TTH, THT, THH, HTH}
Favourable Events
The number of cases favourable to an event in a trial is the number of outcomes
which entail the happening of the event.
Example
When a seed is sown, if we are interested in the non-germination of the seed, then non-germination is the
favourable event. If we are interested in germination of the seed then
germination is the favourable event.
Mutually Exclusive Events
Events are said to be mutually exclusive (or) incompatible if the happening of any
one of the events excludes (or) precludes the happening of all the others, i.e. if no two or
more of the events can happen simultaneously in the same trial (the joint
occurrence is not possible).
Example
In observation of seed germination the seed may either germinate or it will not
germinate. Germination and non germination are mutually exclusive events.
Equally Likely Events
Outcomes of a trial are said to be equally likely if taking in to consideration all the
relevant evidences, there is no reason to expect one in preference to the others. (i.e.) Two
or more events are said to be equally likely if each one of them has an equal chance of
occurring.
Independent Events
Several events are said to be independent if the happening of an event is not
affected by the happening of one or more events.
Example
When two seeds are sown in a pot, one seed germinates. It would not
affect the germination or non germination of the second seed. One event
does not affect the other event.
Dependent Events
If the happening of one event is affected by the happening of one or more events,
then the events are called dependent events.
Example
If we draw a card from a pack of well shuffled cards, if the first card drawn is not
replaced then the second draw is dependent on the first draw.
Note: In the case of independent (or) dependent events, the joint occurrence is possible.
Definition of Probability
1. Mathematical (or) Classical (or) a-priori Probability
If an experiment results in ‘n’ exhaustive cases which are mutually exclusive and
equally likely, out of which ‘m’ cases are favourable to the happening of an event
‘A’, then the probability ‘p’ of happening of ‘A’ is given by
$$p = P(A) = \frac{m}{n} = \frac{\text{number of favourable cases}}{\text{number of exhaustive cases}}$$
Note
If m = 0 ⇒ P(A) = 0, then ‘A’ is called an impossible event. This is also written as P(φ) = 0.
If m = n ⇒ P(A) = 1, then ‘A’ is called a sure (or) certain event.
The probability is a non-negative real number and cannot exceed unity, i.e. it
lies between 0 and 1.
The probability of non-happening of the event ‘A’, i.e. P(Ā), is denoted by ‘q’.
q = 1 – p
p + q = 1
(or) P(A) + P(Ā) = 1.
2. Statistical (or) Empirical Probability (or) a-posteriori Probability
If an experiment is repeated a number (n) of times and an event ‘A’ happens ‘m’
times, then the statistical probability of ‘A’ is given by
$$P(A) = \lim_{n \to \infty} \frac{m}{n}$$
Total Probability, =
If A and B are mutually exclusive (or) disjoint events then the probability
of occurrence of either A (or) B denoted by P(AUB) shall be given by
P(A∪B) = P(A) + P(B)
P(E1∪E2∪….∪En) = P (E1) + P (E2) + …… + P (En)
-----------------------------------------------------------------------------------------------------------------------------
Example 1: Two dice are tossed. What is the probability of getting (i) Sum 6 (ii) Sum 9
Solution
When 2 dice are tossed, the exhaustive number of cases is 36.
Sum 6 = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}
Favourable number of cases = 5
P (Sum 6) = 5/36
Sum 9 = {(3, 6), (4, 5), (5, 4), (6, 3)}
Favourable number of cases = 4
P (Sum 9) = 4/36 = 1/9
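The counting in Example 1 can be checked by listing all 36 outcomes. The short sketch below does this with the standard library; the helper name `prob_sum` is chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes when two dice are tossed
outcomes = list(product(range(1, 7), repeat=2))


def prob_sum(total):
    """Classical probability: favourable cases / exhaustive cases."""
    favourable = [outcome for outcome in outcomes if sum(outcome) == total]
    return Fraction(len(favourable), len(outcomes))


print(prob_sum(6), prob_sum(9))  # 5/36 1/9
```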
Example 2: A card is drawn from a pack of cards. What is the probability of getting (i) a
king (ii) a spade (iii) a red card?
Solution
There are 52 cards in a pack.
One can be selected in 52C1 ways.
Exhaustive number of cases is = 52C1 = 52.
(i) A king
There are 4 kings in a pack.
One king can be selected in 4C1 ways.
Favourable number of cases is = 4C1 = 4
Example 3: What is the probability of getting 53 Sundays when a leap year is selected at
random?
Solution
A leap year consists of 366 days.
This has 52 full weeks and 2 days remaining.
The remaining 2 days have the following possibilities.
(i) Sun, Mon (ii) Mon, Tues (iii) Tues, Wed (iv) Wed, Thurs (v) Thurs, Fri (vi) Fri, Sat
(vii) Sat, Sun.
In order that a leap year selected at random should contain 53 Sundays, one of the
2 extra days must be a Sunday.
Exhaustive number of cases is = 7
Favourable number of cases is = 2
Required Probability is = 2/7
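The same reasoning can be written as a short enumeration of the seven possible pairs of extra days. The variable names below are illustrative only.

```python
from fractions import Fraction

DAYS = ["Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"]

# The 2 extra days of a leap year are consecutive days of the week.
pairs = [(DAYS[i], DAYS[(i + 1) % 7]) for i in range(7)]

# Favourable cases: pairs in which one of the 2 extra days is a Sunday.
favourable = [pair for pair in pairs if "Sun" in pair]

p_53_sundays = Fraction(len(favourable), len(pairs))
print(favourable, p_53_sundays)  # [('Sun', 'Mon'), ('Sat', 'Sun')] 2/7
```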
Conditional Probability
Two events A and B are said to be dependent, when B can occur only when A is
known to have occurred (or vice versa). The probability attached to such an event is
called the conditional probability and is denoted by P (A/B) (read it as: A given B) or, in
other words, probability of A given that B has occurred.
If two events A and B are dependent, then the conditional probability of B given A is
$$P(B/A) = \frac{P(A \cap B)}{P(A)}$$
Theorems of Probability
Proof
P (A∪B) =
=
P (A∪B) = P (A) + P (B)
Note
In the case of 3 events (not mutually exclusive events),
P (A or B or C) = P (A∪B∪C) = P (A + B + C)
= P (A) + P (B) + P (C) – P (A∩B) – P (B∩C) – P (A∩C) + P (A∩B∩C)
In the case of 3 events (mutually exclusive events),
P (A or B or C) = P (A∪B∪C) = P (A + B + C) = P (A) + P (B) + P (C)
Example
Using the additive law of probability we can find the probability that in one roll of a
die, we will obtain either a one-spot or a six-spot. The probability of obtaining a one-spot is
1/6. The probability of obtaining a six-spot is also 1/6. The probability of rolling a die and
getting a side that has both a one-spot and a six-spot is 0. There is no side on a
die that has both these events. So substituting these values into the equation gives the
following result:
P (1 or 6) = 1/6 + 1/6 – 0 = 2/6 = 1/3
Example: Finding the probability of drawing a 4 of hearts or a 6 of any suit using the
additive law of probability would give the following:
P (4 of hearts or 6) = 1/52 + 4/52 – 0 = 5/52
There is only a single 4 of hearts, there are 4 sixes in the deck and there isn't a single
card that is both the 4 of hearts and a six of any suit.
Now using the additive law of probability, you can find the probability of drawing
either a king or any club from a deck of shuffled cards. The equation would be completed
like this:
P (king or club) = 4/52 + 13/52 – 1/52 = 16/52 = 4/13
There are 4 kings, 13 clubs, and obviously one card is both a king and a club. We
don't want to count that card twice, so we must subtract one of its occurrences to
obtain the result.
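The additive law in these card examples can be verified by brute-force counting over a 52-card deck. In the sketch below the deck representation and the helper names (`prob`, `is_king`, `is_club`) are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(RANKS, SUITS))      # 52 (rank, suit) pairs


def prob(event):
    """Favourable cards / total cards for an event given as a predicate."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


def is_king(card):
    return card[0] == "K"


def is_club(card):
    return card[1] == "clubs"


# Direct count of "king or club" versus the additive law
p_union = prob(lambda c: is_king(c) or is_club(c))
p_by_law = prob(is_king) + prob(is_club) - prob(lambda c: is_king(c) and is_club(c))
print(p_union, p_by_law)                # 4/13 4/13

# Mutually exclusive case: the 4 of hearts or a six of any suit
p_four_hearts_or_six = prob(lambda c: c == ("4", "hearts") or c[0] == "6")
print(p_four_hearts_or_six)             # 5/52
```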
P (A∩B) =
Note
(i) In the case of 3 events, (dependent)
P (A∩B∩C) = P (A). P (B/A). P (C/AB)
Example
So in finding the probability of drawing a 4 and then a 7 from a well shuffled
deck of cards, this law would state that we need to multiply those separate probabilities
together. Completing the equation above gives:
----------------------------------------------------------------------------------------------------------------------------------------------------------
Questions
Q.1 Write short note on ‘ Addition theorem of Probability.( 2013-14, 2015-16, 2017-18, 5 marks)
Q.2 Define mathematical , statistical and the axioms of probability. Explain addition theorem on
probability for any two events. (2014-15)
Q.3. Define probability and conditional probability. State and prove the addition theorem of probability
for any two events. (2016-17)
Objective Questions
1. Probability is expressed as
(a) Ratio (b) percentage (c) Proportion (d) all the above
Ans: all the above
2. Probability can take values from
(a) - ∞ to +∞ (b) - ∞ to 1 (c) 0 to +1 (d) –1 to +1
Ans: 0 to +1
3. The probability of a sure event is One.
Ans: True
4. If A and B are mutually exclusive events, then P (AUB) = ------------
Ans: P (A) + P (B)
5. An integer is chosen from 1 to 20. The probability that the number is divisible
by 4 is ¼.
Ans: True
------------------------------------------------------------------------------------------------------------------------
5. Theoretical Distributions
Random variable – By a random variable we mean a real number ‘X’ connected with the outcome of a
random experiment.
e.g. consider an experiment of tossing of 2 coins, we get outcome
S = { HH, TH, HT, TT}
Outcome – HH TH HT TT
Value of X – 2 1 1 0
(Here no. of heads)
Then,
X: 0 1 2 Frequency distribution
F: 1 2 1
Discrete random variable – if a random variable takes at most a countable number of values, it
is called a discrete random variable. In other words, a real-valued function defined on a discrete
sample space is called a discrete random variable.
Example: 1. No. of outcomes in tossing of coin.
Continuous random variable – a random variable X is said to be continuous if it can take all
possible values between certain limits.
Example: 1. Height of plant.
2. Grain yield in kg/plot
Discrete distributions
1. Binomial distribution
2. Poisson distribution
Bernoulli distribution
A random variable x which takes two values 0 and 1, with probabilities q and p, i.e.
P(x = 1) = p and P(x = 0) = q, q = 1 – p, is called a Bernoulli variate and is said to follow the Bernoulli
distribution, where p and q are the probabilities of success and failure. It was given by the Swiss
mathematician James Bernoulli (1654-1705).
Example
• Tossing a coin(head or tail)
• Germination of seed(germinate or not)
1. Binomial distribution
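But x successes in n trials can occur in $^nC_x$ ways. The probability for each of these ways is $p^x q^{n-x}$:
P(sss…ff…fsf…f) = p(s) p(s) … p(f) p(f) …
= p · p … q · q …
= (p · p … p)(q · q … q)
(x times) (n–x times)
= $p^x q^{n-x}$
Hence the probability of x successes in n trials is given by
$$^nC_x \, p^x q^{n-x}$$
Definition
$$P(X = x) = p(x) = \begin{cases} {}^nC_x \, p^x q^{n-x}, & x = 0, 1, 2, \ldots, n; \; q = 1 - p \\ 0, & \text{otherwise} \end{cases}$$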
The two independent constants n and p in the distribution are known as the parameters of
the distribution.
Condition for Binomial distribution
• Mean = E(x) = np
and
If the random variable X denotes the number of heads, then the probability of x successes in
n trials is given by
$$P(X = x) = {}^nC_x \, p^x q^{n-x}, \quad x = 0, 1, 2, \ldots, n$$
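As an illustration of this formula, the sketch below tabulates the binomial probabilities for the number of heads in 3 tosses of a fair coin. The choice n = 3, p = 0.5 is an assumption for the example, and `binomial_pmf` is a name invented here; `math.comb` requires Python 3.8 or later.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) = nCx * p^x * q^(n-x) for x = 0..n, and 0 otherwise."""
    if not 0 <= x <= n:
        return 0.0
    q = 1 - p
    return comb(n, x) * p**x * q ** (n - x)


n, p = 3, 0.5                                   # assumed: 3 tosses of a fair coin
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]
mean = sum(x * px for x, px in enumerate(probs))
print(probs)   # [0.125, 0.375, 0.375, 0.125]
print(mean)    # 1.5
```

The probabilities add up to 1 and the mean equals np = 1.5, in line with the result Mean = E(x) = np quoted in the notes.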
The Poisson distribution differs from the binomial distribution in the sense that in the binomial distribution we count the number of
successes and the number of failures, while in the Poisson distribution we count the average number of
successes in a given unit of time or space.
Definition
The probability that exactly x events will occur in a given time is as follows
$$P(x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots$$
λ = np
λ=2
The Poisson distribution is
= 0.036
IMPORTANT
$$p(x+1) = \frac{\lambda}{x+1} \, p(x)$$
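The recurrence relation gives a convenient way to build up Poisson probabilities one after another, starting from p(0) = e^(−λ). The sketch below compares it with the direct formula for λ = 2. The function names are hypothetical, and the notes do not say which value of x their figure 0.036 refers to; with λ = 2 it happens to match P(5).

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Direct formula: P(x) = e^(-lam) * lam^x / x!"""
    return exp(-lam) * lam**x / factorial(x)


def poisson_table(lam, max_x):
    """P(0), ..., P(max_x) built with p(x+1) = lam / (x+1) * p(x)."""
    probs = [exp(-lam)]                 # p(0) = e^(-lam)
    for x in range(max_x):
        probs.append(probs[-1] * lam / (x + 1))
    return probs


lam = 2
table = poisson_table(lam, 5)
print([round(p, 3) for p in table])     # last entry, P(5), is about 0.036
```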
3. Normal distribution:
The most important and widely used probability distribution is the normal distribution. It is also
known as the Gaussian distribution.
The Binomial, Poisson and other distributions can be approximated by the Normal distribution.
Many sampling distributions, like Student's t, F and χ², tend to normality for large
samples, i.e. for sample size ≥ 30.
Definition
$$f(x) = \begin{cases} \dfrac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, & -\infty < x < \infty \\ 0, & \text{otherwise} \end{cases}$$
Where, π = constant (≈ 3.1416)
µ = population mean
σ = standard deviation of population
x = random variable
v. The entire theory of small sample tests, viz. t, F and χ², is based on the fundamental
assumption that the parent population from which the samples have been drawn
follows the normal distribution.
vi. Normal distribution finds large applications in statistical quality control in
industry for setting of control limits.
Let X be random variable which follows normal distribution with mean m and
$$f(x) = \begin{cases} \dfrac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}x^2}, & -\infty < x < \infty \\ 0, & \text{otherwise} \end{cases}$$
The advantage of the above function is that it doesn’t contain any parameter. This enables us to
compute the area under the normal probability curve.
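Areas under the standard normal curve are usually read from tables, but they can also be computed. The sketch below uses the error function `math.erf` from the Python standard library. The standardisation z = (x − µ)/σ, the function names and the numerical values are assumptions for illustration and are not taken from the notes.

```python
from math import erf, sqrt


def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def normal_area(a, b, mu=0.0, sigma=1.0):
    """P(a < X < b) for X normal with mean mu and SD sigma, via z = (x - mu) / sigma."""
    za = (a - mu) / sigma
    zb = (b - mu) / sigma
    return standard_normal_cdf(zb) - standard_normal_cdf(za)


central_95 = normal_area(-1.96, 1.96)
within_one_sd = normal_area(25, 35, mu=30, sigma=5)   # hypothetical mean 30, SD 5
print(round(central_95, 4), round(within_one_sd, 4))  # 0.95 0.6827
```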
Note
Property of
1.
2.
Solution
=X1=6.05
Q.1 State Normal distribution and explain its properties. ( 2013-14) ,( 2014-15), (2015-16)
Q.2 Define Normal distribution and standard normal distribution. Write properties and applications /
Importance of normal distribution.( 2016-17) ( 2017-18)
Q. 3 Define Binomial and Poisson Distribution and write their properties.( 2017-18)
-------------------------------------------------------------------------------------------------------------------
Objective Questions
Ans: True
Ans: True
Ans: False
Ans: True
$$P(X = x) = \frac{1}{5\sqrt{2\pi}} \, e^{-\frac{1}{2} \cdot \frac{(x-30)^2}{25}}, \quad -\infty < x < \infty$$
Definitions :
1. Population – the aggregate of all units is known as the population.
e.g. i) A population may consist of all paddy plants in a field.
ii) All the fields under a specified crop.
iii) All farms in a village.
iv) All houses in a village.
There are two types of populations:
Sampling Distribution
By drawing all possible samples of the same size from a population we can calculate a
statistic, for example the sample mean x̄, for all samples. Based on this we can construct a frequency
distribution and the probability distribution of x̄. Such a probability distribution of a statistic is
known as the sampling distribution of that statistic.
Standard Error
The standard deviation (SD) of the sampling distribution of a statistic is
called the standard error of that statistic.
It may be noted that the standard deviation is for units whereas the standard
error is for the statistic.
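The idea of a sampling distribution and its standard error can be illustrated by enumerating every possible sample from a very small population. The following is a minimal Python sketch under stated assumptions: the population of four values is hypothetical, samples are drawn with replacement, and the function name `sampling_distribution_of_mean` is illustrative.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev


def sampling_distribution_of_mean(population, n):
    """Means of all possible samples of size n drawn with replacement."""
    return [mean(sample) for sample in product(population, repeat=n)]


if __name__ == "__main__":
    population = [2, 4, 6, 8]            # hypothetical population of 4 units
    means = sampling_distribution_of_mean(population, 2)
    standard_error = pstdev(means)       # SD of the sampling distribution
    # For sampling with replacement this equals sigma / sqrt(n)
    print(standard_error, pstdev(population) / sqrt(2))
```

The standard deviation of the population describes the units, while the standard deviation of the list `means` describes the statistic, which is the distinction made above.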
Hypothesis
Hypothesis is a quantitative statement or assumption about population.
Statistical Hypothesis
When the assumption or statement that occurs under certain conditions is
formulated as scientific hypothesis, we can construct criteria by which a scientific
hypothesis is either rejected or provisionally accepted. For this purpose, the
For example:
The yield of a new paddy variety will be 3500 kg per hectare – scientific hypothesis.
In statistical language it may be stated as: the random variable (yield of
paddy) is distributed normally with mean 3500 kg/ha.
Simple Hypothesis:
When a hypothesis specifies all the parameters of a probability distribution,
it is known as a simple hypothesis.
For example: 'the random variable x is distributed normally with mean µ = 0 and SD = 1' is a simple
hypothesis, because it specifies all the parameters (µ and σ) of a normal distribution.
Composite Hypothesis:
If the hypothesis specifies only some of the parameters of the probability
distribution, it is known as a composite hypothesis. In the above example, if only
µ is specified or only σ is specified, it is a composite hypothesis.
Null Hypothesis - Ho
The hypothesis which is under test is called the null hypothesis.
Consider, for example, a hypothesis put in the form 'paddy variety A
will give the same yield per hectare as variety B', or 'there is no difference
between the average yields of paddy varieties A and B'. These hypotheses are in
definite terms and thus form a basis to work with. Such a working
hypothesis is known as the null hypothesis. It is called the null hypothesis because it
nullifies the original hypothesis, that variety A will give more yield than variety B.
The null hypothesis is stated as 'there is no significant difference between
the effects of two treatments' or 'there is no association between two attributes', i.e. the
two attributes are independent. The null hypothesis is denoted by H0.
Example:
There is no significant difference between the yields of two paddy varieties
(or) they give the same yield per unit area. Symbolically, H0: µ1 = µ2.
Alternative Hypothesis:
Any hypothesis which is complementary to null hypothesis is called
alternative hypothesis, usually denoted by H1.
Testing of Hypothesis :
Once the hypothesis is formulated we have to make a decision on it. A
statistical procedure by which we decide to accept or reject a statistical hypothesis
is called testing of hypothesis.
Sampling Error
From sample data, the statistic is computed and the parameter is estimated
through the statistic. The difference between the parameter and the statistic is
known as the sampling error.
Test of Significance:
Based on the sampling error the sampling distributions are derived. The
observed results are then compared with the expected results on the basis of the
sampling distribution. If the difference between the observed and expected results is
more than a specified multiple of the standard error of the statistic, it is said to be
significant at a specified probability level. The process up to this stage is known as
the test of significance.
Decision Errors:
By performing a test we make a decision on the hypothesis by accepting or
rejecting the null hypothesis Ho. In the process we may make a correct decision on
Ho or commit one of two kinds of error.
Type I error: We may reject H0 based on sample data when in fact it is true. This
error in decision is known as Type I error, i.e. rejecting the null hypothesis H0
when it is true.
Type II error: We may accept H0 based on sample data when in fact it is not true. This is known
as Type II error, i.e. accepting the null hypothesis H0 when it is not true.
                 Accept H0           Reject H0
H0 is true       Correct decision    Type I error
H0 is false      Type II error       Correct decision
For a fixed sample size, type I and type II errors are related in such a way that if the
probability of one is decreased, the probability of the other increases. The probability of
type I error is denoted by α. The probability of type II error is denoted by β. The
probability of correctly rejecting the null hypothesis when it is false is known as
the power of the test and is given by 1-β.
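The trade-off between α and β can be seen numerically for a simple case. The following is a minimal Python sketch, assuming a right-tailed Z test of H0: µ = mu0 against a single alternative value mu1 with known σ; the function name `type_two_error` and all figures in the example (3500, 3600, 200, 25) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def type_two_error(mu0, mu1, sigma, n, alpha):
    """Beta for a right-tailed Z test of H0: mu = mu0 when the true mean is mu1 > mu0."""
    se = sigma / sqrt(n)                                  # standard error of the mean
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se   # reject H0 above this value
    return NormalDist(mu1, se).cdf(cutoff)                # P(accept H0 | H0 is false)


if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        beta = type_two_error(3500, 3600, 200, 25, alpha)
        print(alpha, round(beta, 3), "power =", round(1 - beta, 3))
```

With the sample size held fixed, lowering α moves the cutoff further into the tail, so β rises and the power 1-β falls.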
Critical Region:
The testing of a statistical hypothesis involves the choice of a region on the
sampling distribution of the statistic. If the statistic falls within this region, the null
hypothesis is rejected; otherwise it is accepted. This region is called the critical region
(rejection region).
Test statistic:
A statistic such as Z, t or χ2, whose sampling distribution is known and on which the
decision is based, is known as a test statistic.
Generally, in case of quantitative data
Note
The choice of the test statistic depends on the nature of the variable (i.e. qualitative or
quantitative), the statistic involved (i.e. mean or variance) and the sample size (i.e. large or small).
Level of Significance:
The probability that the statistic will fall in the critical region is α; in a two tailed test
it is split between the two ends as α/2 + α/2. This α is
nothing but the probability of committing type I error. Technically the probability of
committing type I error is known as the level of significance.
One and two tailed test:
The nature of the alternative hypothesis determines the position of the
critical region. For example, if H1 is µ1 ≠ µ2 it does not show the direction and hence
the critical region falls on either end of the sampling distribution. If H1 is µ1 < µ2 or
µ1 > µ2 the direction is known. In the first case the critical region falls on the left of
the distribution, whereas in the second case it falls on the right side.
One tailed test – When the critical region falls on one end of the sampling
distribution, it is called one tailed test.
Two tailed test – When the critical region falls on either end of the sampling
distribution, it is called two tailed test.
For example, consider the case where the mean yield of a new paddy variety (µ2) is compared with
that of a ruling variety (µ1). Unless the new variety is more promising than the
ruling variety in terms of yield, we are not going to accept the new variety. In this
case H1: µ1 < µ2, for which a one tailed test is used. If both the varieties are new, our
interest will be to choose the best of the two. In this case H1: µ1 ≠ µ2, for which we
use a two tailed test.
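The effect of the alternative hypothesis on the critical value can be shown for a large-sample Z test. The following is a minimal Python sketch; the function name `z_critical` is illustrative, and the 5% level is an assumed example value.

```python
from statistics import NormalDist


def z_critical(alpha, tails):
    """Critical Z value: tails=2 splits alpha over both ends, tails=1 puts it in one end."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    return NormalDist().inv_cdf(1 - alpha / tails)


if __name__ == "__main__":
    print("two tailed, 5% level:", round(z_critical(0.05, 2), 3))
    print("one tailed, 5% level:", round(z_critical(0.05, 1), 3))
```

For a left-sided alternative the same value is used with a negative sign, since the standard normal curve is symmetric.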
Degrees of freedom:
The number of degrees of freedom is the number of observations that are
free to vary after certain restrictions have been placed on the data. If there are n
observations in the sample, for each restriction imposed upon the original
observations the number of degrees of freedom is reduced by one.
The number of independent observations in a set is called the degrees
of freedom.
Steps in testing of hypothesis:
The process of testing a hypothesis involves following steps.
1. Formulation of null & alternative hypothesis.
2. Specification of level of significance.
3. Selection of test statistic and its computation.
4. Finding out the critical value from tables using the level of
significance, sampling distribution and its degrees of freedom.
5. Determination of the significance of the test statistic.
6. Decision about the null hypothesis based on the significance of the test statistic.
7. Writing the conclusion in such a way that it answers the question on hand.
Student’s t test
Let x1, x2, …, xn be a random sample drawn from a normal distribution having mean µ
and standard deviation σ (unknown). Then Student's t statistic is given by
$$t = \frac{|\bar{x} - \mu|}{s/\sqrt{n}}$$
where x̄ = sample mean
s = sample standard deviation
µ = population mean
This follows the t distribution with (n-1) degrees of freedom, which can be written as t(n-1) d.f.
This fact was brought out by W. S. Gosset and Prof. R. A. Fisher. Gosset
published his discovery in 1908 under the pen name 'Student',
and it was later developed and extended by Prof. R. A. Fisher. The resulting test is known as the
t-test.
where $\bar{x} = \frac{\sum x_i}{n}$
5. Find the table value of t corresponding to (n-1) d.f. and the specified level of
significance.
6. Inference
If tcal < ttab we accept the null hypothesis H0 and conclude that there is no
significant difference between the sample mean and the population mean.
If tcal > ttab we reject the null hypothesis H0, i.e. we accept the alternative
hypothesis and conclude that there is a significant difference between the sample mean
and the population mean.
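The calculation of t and the inference rule can be sketched in a few lines. This is a minimal Python illustration rather than the notes' own worked example: the data are hypothetical, s is assumed to be computed with divisor n-1, the table value is typed in by hand from a t table, and the names `one_sample_t` and `decide` are illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """t = |x_bar - mu0| / (s / sqrt(n)), with s computed using divisor n - 1."""
    n = len(sample)
    return abs(mean(sample) - mu0) / (stdev(sample) / sqrt(n))


def decide(t_cal, t_tab):
    """Inference rule: reject H0 when the calculated t exceeds the table value."""
    return "reject H0" if t_cal > t_tab else "accept H0"


if __name__ == "__main__":
    yields = [12, 14, 11, 13, 15, 12, 14, 13, 12, 14]   # hypothetical data
    t_cal = one_sample_t(yields, mu0=12)
    t_tab = 2.262   # table value for 9 d.f. at the 5% level (two tailed)
    print(round(t_cal, 3), decide(t_cal, t_tab))
```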
Test statistic
and the combined variance
$$s^{2} = \frac{n_1 s_1^{2} + n_2 s_2^{2}}{n_1 + n_2 - 2}$$
where $s_1^{2}$ = first sample variance
$s_2^{2}$ = second sample variance
$$t = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\frac{s_1^{2}}{n_1} + \frac{s_2^{2}}{n_2}}}$$
This statistic follows neither the t nor the normal distribution; it follows the Behrens-Fisher d
distribution. The Behrens-Fisher test is a laborious one. An alternative simple method
has been suggested by Cochran & Cox. In this method the critical value of t is altered
to tw, i.e. weighted t,
$$t_w = \frac{t_1 \frac{s_1^{2}}{n_1} + t_2 \frac{s_2^{2}}{n_2}}{\frac{s_1^{2}}{n_1} + \frac{s_2^{2}}{n_2}}$$
where t1 is the critical value of t with (n1-1) d.f. at the specified level of significance
and t2 is the critical value of t with (n2-1) d.f. at the specified level of significance.
Inference:
If t > tw, we reject the null hypothesis H0.
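The unequal-variance statistic and the weighted critical value can be sketched as below. This is a minimal Python illustration under a stated assumption: the weighted value is taken in the form usually quoted for the Cochran and Cox approximation, tw = (t1·s1²/n1 + t2·s2²/n2) / (s1²/n1 + s2²/n2). The function names and all figures in the example are hypothetical, and the table values t1 and t2 are typed in by hand.

```python
from math import sqrt


def unequal_variance_t(mean1, mean2, var1, var2, n1, n2):
    """t = |x1_bar - x2_bar| / sqrt(s1^2/n1 + s2^2/n2)."""
    return abs(mean1 - mean2) / sqrt(var1 / n1 + var2 / n2)


def weighted_t(t1, t2, var1, var2, n1, n2):
    """Weighted critical value tw (assumed Cochran-Cox form, see the text above)."""
    w1, w2 = var1 / n1, var2 / n2
    return (t1 * w1 + t2 * w2) / (w1 + w2)


if __name__ == "__main__":
    # Hypothetical samples: n1 = 10, n2 = 15
    t_cal = unequal_variance_t(20, 17, 4, 9, 10, 15)
    # Table values at the 5% level: 2.262 for 9 d.f., 2.145 for 14 d.f.
    t_w = weighted_t(2.262, 2.145, 4, 9, 10, 15)
    print(round(t_cal, 3), round(t_w, 4), "reject H0" if t_cal > t_w else "accept H0")
```

The weighted value always lies between t1 and t2, so it can be sanity-checked against the two table entries.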
the mean may be positive or negative. Hence we take its absolute value.
The test statistic t follows the t distribution with (n-1) d.f.
Inference:
If t cal. > t(n-1), we reject the null hypothesis H0.
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Let s1² and s2² be the estimates of σ1² and σ2² based on samples of size n1 and n2 respectively.
Objective questions:
1. In paired t test with n observations in each group the degrees of freedom is
(a) n (b) n-1 (c) n-2 (d) n+1
Ans: n-1
2. Student t- test is applicable in case of small samples.
Ans: True
3. F test is also known as variance ratio test.
Ans: True
4. In case of comparing the equality of two variances the greater variance
should be taken in the numerator.
Ans: True
5. While comparing the means of two independent samples the variances of the
two samples will be always equal.
Ans: False
6. A hypothesis may be classified as
a) Simple b) Composite (c) Null (d) All the above
Ans: All the above
7. Area of the critical region depends on
a) Size of type I error b) Size of type II error
c) Value of the statistics d) Number of observations
Ans: Size of type I error
8. If the calculated test statistic is greater than the critical value, the null hypothesis is rejected.
Ans: True
9. The standard error of mean is given by
Ans: True
10. If the alternative hypothesis is µ1 ≠ µ2 then the test is known as one tailed test.
Ans: False
In case of attributes we cannot employ parametric tests such as F and t. Instead
we have to apply the χ2 test. It is used when we want to test whether a set of observed values is in
agreement with those expected on the basis of some theory or hypothesis.
$$\chi^{2} = \sum \frac{(O_i - E_i)^{2}}{E_i}$$
where Oi is the observed and Ei the expected frequency of the i-th class.
It follows a χ2 distribution with n-1 d.f. In case of χ2 only a one tailed test is used.
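The χ2 statistic is a direct sum over the classes, as the following minimal Python sketch shows. The observed and expected frequencies are hypothetical, the function name `chi_square` is illustrative, and the table value is typed in by hand from a χ2 table.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


if __name__ == "__main__":
    observed = [30, 20, 10]    # hypothetical counts in 3 classes
    expected = [20, 20, 20]    # expected under an assumed 1:1:1 ratio
    chi2_cal = chi_square(observed, expected)
    chi2_tab = 5.991           # table value for 3 - 1 = 2 d.f. at the 5% level
    print(chi2_cal, "reject H0" if chi2_cal > chi2_tab else "accept H0")
```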
---------------------------------------------------------------------------------------------------------------
When the number of rows and the number of columns are both equal to 2, the table is termed a 2 x 2
contingency table. It will be in the following form.
                 B1           B2           Row Total
A1               a            b            a+b = r1
A2               c            d            c+d = r2
Column Total     a+c = c1     b+d = c2     a+b+c+d = N
Where a, b, c and d are cell frequencies, c1 and c2 are column totals, r1 and r2 are row totals and N
is the total number of observations.
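In case of a 2 x 2 contingency table χ2 can be found directly using the short cut formula,
$$\chi^{2} = \frac{N(ad - bc)^{2}}{c_1 c_2 r_1 r_2}$$
With Yates' correction for continuity the formula becomes
$$\chi^{2} = \frac{N\left(|ad - bc| - \frac{N}{2}\right)^{2}}{c_1 c_2 r_1 r_2}$$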
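Both 2 x 2 formulas, without and with the N/2 term, can be computed from the four cell frequencies alone. The following is a minimal Python sketch; the function name `chi_square_2x2`, the `yates` flag and the cell frequencies in the example are illustrative assumptions.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Short cut chi-square for a 2 x 2 table; yates=True subtracts N/2 from |ad - bc|."""
    n = a + b + c + d
    r1, r2 = a + b, c + d      # row totals
    c1, c2 = a + c, b + d      # column totals
    diff = abs(a * d - b * c)
    if yates:
        diff = diff - n / 2
    return n * diff ** 2 / (c1 * c2 * r1 * r2)


if __name__ == "__main__":
    # Hypothetical cell frequencies a, b, c, d
    print(round(chi_square_2x2(10, 20, 30, 40), 4))
    print(round(chi_square_2x2(10, 20, 30, 40, yates=True), 4))
```

The uncorrected value agrees with the general Σ(O-E)²/E calculation when the expected frequencies are taken as (row total × column total) / N.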
Broad questions:
Q.1 χ2 test for testing the independences of attributes. ( Short note- 5 marks) (2013-14, 2017-18)
Q.2 Explain χ2 test and its applications. ( 10 marks) ( 2014-15)
Q.3 χ2 test for testing the goodness of fit. ( Short note- 5 marks) (2015-16)
Q.4 χ2 test ( Short note- 5 marks) (2016-17)
Objective questions:
1. The condition for application of χ2 test is that no cell frequency should be less than five.
Ans: True
Lecture.9
Correlation
Correlation is the study of the relationship between two or more variables. Whenever
we conduct any experiment we gather information on two or more related variables. When there are
two related variables their joint distribution is known as a bivariate distribution, and if
there are more than two variables their joint distribution is known as a multivariate
distribution (bivariate normal and multivariate normal when the variables are jointly normal).
In case of a bivariate or multivariate normal distribution, we are interested in
discovering and measuring the magnitude and direction of the relationship between 2 or more
variables. For this we use the tool known as correlation.
Suppose we have two continuous variables X and Y and if the change in X affects Y,
the variables are said to be correlated. In other words, the systematic relationship between the
variables is termed as correlation.
When only 2 variables are involved the correlation is known as simple correlation
and when more than 2 variables are involved the correlation is known as multiple
correlation.
When the variables move in the same direction, these variables are said to be
positively correlated and if they move in the opposite direction they are said to be
negatively correlated.
Types of correlation :
1. Positive 2. Negative 3. Simple 4. Multiple
5. Partial 6. Linear 7. Non-linear
Scatter Diagram:
To investigate whether there is any relation between the variables X and Y we use
scatter diagram. Let (x1,y1), (x2,y2)….(xn,yn) be n pairs of observations. If the variables X and
Y are plotted along the X-axis and Y-axis respectively in the x-y plane of a graph sheet the
resultant diagram of dots is known as scatter diagram. From the scatter diagram we can say
whether there is any correlation between x and y and whether it is positive or negative or
the correlation is linear or curvilinear.
[Figure: scatter diagrams showing curvilinear (non-linear) correlation and no correlation]
Symbolically,
$$r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\sum x^{2} - \frac{(\sum x)^{2}}{n}}\;\sqrt{\sum y^{2} - \frac{(\sum y)^{2}}{n}}}$$
The denominator in the above formula is always positive. The numerator may be
positive or negative making r to be either positive or negative.
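The computational formula for r uses only the sums Σx, Σy, Σxy, Σx² and Σy². The following is a minimal Python sketch; the function name `pearson_r` and the five pairs of observations are illustrative assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Correlation coefficient from the raw sums of x, y, xy, x^2 and y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    numerator = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    ss_x = sum(a * a for a in x) - sum_x ** 2 / n   # always positive unless x is constant
    ss_y = sum(b * b for b in y) - sum_y ** 2 / n
    return numerator / (sqrt(ss_x) * sqrt(ss_y))


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]        # hypothetical observations
    y = [2, 4, 5, 4, 5]
    print(round(pearson_r(x, y), 4))
```

Only the numerator can change sign, which is why it alone decides whether r is positive or negative.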
Assumptions in correlation analysis:
Correlation coefficient r is used under certain assumptions,
1. The variables under study are continuous random variables and they are normally distributed.
2. The relationship between the variables is linear.
3. Each pair of observations is unconnected with other pair(independent).
Properties
1. The correlation coefficient r value ranges between –1 and +1.
2. Correlation coefficient is a pure number i.e it has no unit.
3. The correlation coefficient is not affected by change of origin or scale or both.
4. The relationship between the correlation coefficient r and the regression coefficients byx and bxy is
$r = \pm\sqrt{b_{yx}\, b_{xy}}$, where r takes the sign of the regression coefficients.
5. If r > 0 it denotes positive correlation
r < 0 it denotes negative correlation between the two variables x and y.
6. If the two variables are independent , the correlation coefficient between them is 0.(zero)
7. If r = +1 the correlation is perfect positive;
if r = -1 the correlation is perfect negative.
Test of significance of r (t follows the t distribution with n-2 d.f.):
$$t = \frac{|r|}{\sqrt{\frac{1 - r^{2}}{n - 2}}}$$
coefficient of alienation. If r² is 0.72, it implies that, on the basis of the sample, 72% of the
variation in one variable is explained by the variation of the other variable.
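The test statistic for r and the value of r² can be computed as follows. This is a minimal Python sketch; the function name `t_for_correlation` and the values r = 0.85 and n = 12 are hypothetical, and the resulting t would be compared with the table value for n-2 d.f.

```python
from math import sqrt


def t_for_correlation(r, n):
    """t = |r| / sqrt((1 - r^2) / (n - 2)), to be compared with t at n - 2 d.f."""
    if n <= 2:
        raise ValueError("at least 3 pairs of observations are needed")
    if abs(r) >= 1:
        raise ValueError("the statistic is undefined for |r| = 1")
    return abs(r) / sqrt((1 - r ** 2) / (n - 2))


if __name__ == "__main__":
    r, n = 0.85, 12            # hypothetical sample correlation and sample size
    print("r squared =", round(r ** 2, 4))
    print("t =", round(t_for_correlation(r, n), 3))
```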
-------------------------------------------------------------------------------------------------------
Rank Correlation:
When the variables are not normally distributed, the linear correlation coefficient procedure is not
applicable and we have to use rank correlation. There are two methods of computing rank
correlation, one proposed by Spearman and another by Kendall. Spearman's rank
correlation procedure starts with ranking the values of X and Y
separately. The difference (d) between the ranks of each of the n pairs is then found. Then
Spearman's rank correlation is computed by the formula below and denoted by ρ or rs
$$\rho = 1 - \frac{6\sum d_i^{2}}{n(n^{2} - 1)}$$
where di = Rank(Xi) - Rank(Yi)
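Spearman's procedure, ranking each variable separately and then applying the formula to the rank differences, can be sketched as below. This is a minimal Python illustration that assumes there are no tied values (ties need average ranks, which are not handled here); the function names and the data are hypothetical.

```python
def ranks(values):
    """Rank 1 for the smallest value; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), with d the difference in ranks."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


if __name__ == "__main__":
    x = [10, 20, 30, 40, 50]   # hypothetical measurements
    y = [1, 3, 2, 5, 4]
    print(ranks(y), spearman_rho(x, y))
```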
$$t = \frac{b}{SE(b)} \sim t \text{ with } (n-2) \text{ d.f.}, \qquad SE(b) = \sqrt{\frac{s_e^{2}}{SS(X)}}$$
Uses of Regression
1. The regression analysis is useful in predicting the value of one variable from the
given value of another variable. Such predictions are useful when it is very
difficult or expensive to measure the dependent variable, Y.
2. The other use of the regression analysis is to find out the causal relationship
between variables. Suppose we manipulate the variable X and obtain a significant
regression of variables Y on the variable X. Thus we can say that there is a causal
relationship between the variable X and Y. The causal relationship between
nitrogen content of soil and growth rate in a plant, or the dose of an insecticide and
mortality of the insect population may be established in this way.
Comparison ( distinguish / difference ) between Correlation and Regression
    Correlation                                 Regression
1   Correlation study is concerned with the     Regression study is concerned with the
    relationship between variables.             average relationship between variables.
2   Correlation indicates the extent of         Regression analysis helps in estimating or
    relationship between variables.             predicting the value of the dependent variable.
3   Cause and effect relationship may not be    The cause and effect relationship is clearly
    identified separately.                      indicated through the regression equation.
4   The correlation coefficients are equal or   The regression coefficients are different, i.e.
    symmetric, i.e. ryx = rxy.                  byx ≠ bxy.
5   Range: it lies between -1 and +1.           Range: it lies between -∞ and +∞.
6   It is a unit-free measure.                  It is not a unit-free measure.
Lines of regression –
If the variables in a bivariate distribution are related, the points in the scatter
diagram will cluster round some curve called the curve of regression. If the curve is a straight
line, it is called the line of regression and the regression between the variables is said to be
linear; otherwise the regression is said to be curvilinear.
The lines of regression of Y on X and X on Y
Since byx is the slope of the line of regression of Y on X, the equation of the line of
regression of Y on X is given by
$$Y - \bar{y} = b_{yx}(X - \bar{x}), \quad \text{where } b_{yx} = r\,\frac{\sigma_y}{\sigma_x}$$
Similarly, the equation of the line of regression of X on Y is given by
$$X - \bar{x} = b_{xy}(Y - \bar{y}), \quad \text{where } b_{xy} = r\,\frac{\sigma_x}{\sigma_y}$$
1. If the angle between the two lines of regression is 90°, then the correlation coefficient r = 0.
2. If the angle between the two lines of regression is 0° or 180°, then the correlation coefficient r = ±1.
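The two regression coefficients, the fitted line of Y on X, and the relation between r and the coefficients can be checked together. The following is a minimal Python sketch; it uses byx = Sxy/Sxx and bxy = Sxy/Syy, which are algebraically the same as r·σy/σx and r·σx/σy, and the function names and data are hypothetical.

```python
from math import sqrt


def regression_coefficients(x, y):
    """Return (byx, bxy) from the corrected sums of squares and products."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    return sxy / sxx, sxy / syy


def predict_y(x, y, x0):
    """Line of regression of Y on X: Y - y_bar = byx * (X - x_bar)."""
    byx, _ = regression_coefficients(x, y)
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    return y_bar + byx * (x0 - x_bar)


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]        # hypothetical observations
    y = [2, 4, 5, 4, 5]
    byx, bxy = regression_coefficients(x, y)
    sign = 1 if byx >= 0 else -1
    r = sign * sqrt(byx * bxy)  # r is the geometric mean of byx and bxy, with their sign
    print(byx, bxy, round(r, 4), predict_y(x, y, 6))
```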
Broad Questions:
Q.1 Describe the linear regression and derive regression coefficients with its application.
(2013-14, 2015-16)
Q.2 Describe the linear regression and derive regression coefficients. State the properties of
regression coefficient and its applications. ( 2014-15)
Q.3 Describe the simple linear regression and derive regression coefficients. State the
properties of regression coefficient and its applications. ( 2016-17)
Q.4 Write a short note on ‘Non linear regression’ (2016-17, 2019-20, 5 marks)
Q.5 Describe the linear regression and derive formula for constant and regression
coefficients. (2017-18)
Q.6 Describe the multiple linear regression equation along with its derivation .(2019-20)
--------------------------------------------------------------------------------------------------------------
Objective questions:
1. If the correlation coefficient r = +1, then the two regression lines
Ans: True
2. Dependent variable is also called the predicted variable.
Ans: True
3. Correlation coefficient is the geometric mean of two regression coefficients.
Ans: True
4. Regression gives the functional relationship between two variables.
Ans: True