Stats Full Notes

The document provides an overview of statistics, including definitions, applications in agriculture, limitations, types of data, classification, frequency distribution, and measures of central tendency. It discusses various statistical concepts such as mean, median, mode, geometric mean, harmonic mean, and weighted mean, along with their merits and demerits. Additionally, it explains the importance of classification and tabulation in presenting data for analysis.

1. Introduction

1.1 Definitions of Statistics:

1. Statistics is the science which deals with the collection, classification, tabulation, analysis
and interpretation of numerical data.

2. Statistics may be called the science of counting.

3. Statistics is the science of estimates and probabilities.

1.2 Applications of statistics in Agriculture:

Statistics is used in agricultural research for efficient planning of experiments and for
interpreting experimental data, e.g.

i) To study the effect of fertilizer application on yield of crops.

ii) To study the effect of cross-fertilization of different crop varieties.

iii) To study the effect of cultural practices on yield of crops.

iv) To study the effect of pest and disease incidence on crop yields.

v) To study the effect of climate changes on yield of crops.

1.3 Limitations of statistics:

i) Statistics can be applied only when there is variability in the experimental material.

ii) Statistics deals only with aggregates or groups and not with individual objects.

iii) The laws of statistics are true only on an average and in the long run.

1.4 Types of data: There are two types of data, based on the nature of the characteristic
observed.

1. Quantitative data (measurement data): When the characteristic observed is measured
in some unit, the data are called quantitative data.

2. Qualitative data (attribute data): When the characteristic observed is not measured
but the items are classified into categories, the data are called qualitative data.

1.5 Classification of data: Arrangement of data according to some common features is known
as classification.

The classification is done according to the type of data, i.e. measurement or attribute data. In
case of measurement data, the observations are grouped into classes; in case of attribute
data, we count the items with a specified character and put them in appropriate categories.
Types of classification:

1. Quantitative classification: Data are classified according to the magnitude of the numerical
values. E.g. age, income, height, weight etc.

2. Qualitative classification: Data are classified according to the attributes of the subjects or
items. E.g. qualification, colour etc.

3. Geographical classification: Data are classified according to the place, area, region.

4. Chronological classification: Data are classified according to the lapse of time. E.g.
weekly, monthly, yearly etc.

1.6 Frequency distribution:

Frequency: The number of times a category or class occurs. It is denoted by ‘f’.

Frequency distribution: Sorting data into categories or classes leads to the formation of a
frequency distribution, i.e. a frequency distribution gives the number of times each category
or class occurs.

Examples of frequency distribution:

Table No. 1 Frequency distribution of seed yield of 200 plants.

Seed yield, g (X)    Number of plants (f)
2-3                  8
3-4                  12
4-5                  20
5-6                  52
6-7                  48
7-8                  30
8-9                  20
9-10                 10
Total                200

In the above table, seed yield in grams is denoted by ‘X’ (classes).

The boundary figures of the classes 2-3, 3-4, 4-5 etc. are called class limits.

The lower bound value of a class is called its lower limit and the upper bound value is called
its upper limit.

The difference between the upper and lower limits of a class is called the class interval. For
example, the class 2-3 has lower limit 2, upper limit 3 and class interval 3 - 2 = 1.

Table No. 2 Frequency distribution of number of seeds germinated in 50 plots.

No. of seeds germinated (X)    No. of plots (f)
0                              4
1                              13
2                              16
3                              9
4                              5
5                              3
Total                          50

Table No. 3 Frequency distribution of petal colour in linseed flowers.

Colour    No. of plants
Blue      174
White     58
Pink      18
Total     250

1.7 Objectives of classification:

1. To present facts in a simple manner.

2. To highlight items which possess or do not possess certain attributes or qualities.

3. To provide help in making comparisons between items.

4. To find out mutual relationship between certain measures and their effects.

5. To present the data in a manner which is suitable for further treatment.

6. To provide basis for tabulation.

1.8 Tabulation:

It is the process of presenting data collected through a survey, experiment or record
in rows and columns, so that the data can be understood more easily and used for further
statistical analysis.
1.9 Objectives of Tabulation:

1. To clarify the object of investigation.

2. To reduce complexity of data.

3. To depict the relation among data, if it exists.

4. To facilitate analysis of data.

1.10 Difference between classification and tabulation:

Classification arranges the data into groups according to some characteristic, where
each group has the number of items attached to it. In case of variables, it is given in the form
of a frequency distribution.

Tabulation is the logical and systematic arrangement of data in rows and columns. In
a table, data may also be presented in modified form, e.g. as percentages, proportions, totals or
average values.
Lecture 2
Measures of Central Tendency

A measure of central tendency is a typical value around which the other figures congregate.
The averages are representative of the entire series. The five measures of central tendency are
mean, median, mode, geometric mean and harmonic mean.
Characteristics of ideal measures:
1. It should be rigidly defined.
2. It should be easy to calculate.
3. It should be based on all observations.
4. It should be suitable for further mathematical treatment.
5. It should not be affected much by extreme values.

In the study of a population with respect to a characteristic in which we are interested, we may
get a large number of observations. It is not possible to grasp any idea about the characteristic
by looking at all the observations. So it is better to get one number for one group. That number
must be a good representative of all the observations, to give a clear picture of that
characteristic. Such a representative number can be a central value for all these observations. This
central value is called a measure of central tendency, an average or a measure of location.
There are five averages. Among them mean, median and mode are called simple averages and
the other two, geometric mean and harmonic mean, are called special averages.

1. Arithmetic mean or mean

Arithmetic mean or simply the mean of a variable is defined as the sum of the
observations divided by the number of observations. It is denoted by the symbol $\bar{x}$. If the
variable x assumes n values x1, x2 … xn then the mean is given by

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum x_i}{n}$$

This formula is for the ungrouped or raw data.

Grouped Data
The mean for grouped data is obtained from the following formula:

$$\bar{x} = \frac{\sum f x}{n}$$

Where x = the mid-point of individual class
f = the frequency of individual class
n = the sum of the frequencies or total frequencies in a sample.

Short-cut method

$$\bar{x} = A + \frac{\sum f d}{n} \times c, \qquad d = \frac{x - A}{c}$$

Where
A = any value in x (the assumed mean)
n = total frequency
c = width of the class interval
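
As a rough check of the grouped-data formulas, the sketch below applies both the direct and the short-cut method to the seed-yield figures of Table No. 1. The origin A = 5.5 is an arbitrary choice made for this illustration; any mid-point should give the same mean.

```python
# Seed-yield classes (g) and plant counts, copied from Table No. 1.
classes = [(2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10)]
freqs = [8, 12, 20, 52, 48, 30, 20, 10]

mids = [(lo + hi) / 2 for lo, hi in classes]  # class mid-points x
n = sum(freqs)                                 # total frequency

# Direct method: mean = sum(f * x) / n
mean_direct = sum(f * x for f, x in zip(freqs, mids)) / n

# Short-cut method: mean = A + (sum(f * d) / n) * c, with d = (x - A) / c
A = 5.5  # assumed origin (illustrative choice)
c = 1.0  # class width
mean_shortcut = A + sum(f * (x - A) / c for f, x in zip(freqs, mids)) / n * c

print(mean_direct, mean_shortcut)  # both about 6.2 g
```

Both methods agree, which is the point of the short-cut: it only shifts and rescales the mid-points to make hand calculation lighter.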

Merits and demerits of Arithmetic mean


Merits
• It is rigidly defined.
• It is easy to understand and easy to calculate.
• If the number of items is sufficiently large, it is more accurate and more reliable.
• It is a calculated value and is not based on its position in the series.
• It is possible to calculate even if some of the details of the data are lacking.
• Of all averages, it is affected least by fluctuations of sampling.
• It provides a good basis for comparison.
Demerits
• It cannot be obtained by inspection nor located through a frequency graph.
• It cannot be used in the study of qualitative phenomena not capable of numerical measurement, e.g.
intelligence, beauty, honesty etc.
• It can ignore any single item only at the risk of losing its accuracy.
• It is affected very much by extreme values.
• It cannot be calculated for open-end classes.

2. Median
The median is the middle-most item that divides the group into two equal parts, one part
comprising all values greater, and the other, all values less than that item.
Ungrouped or Raw data
Arrange the given values in ascending order. If the number of values is odd, the median
is the middle value.
If the number of values is even, the median is the mean of the middle two values.
By formula

When n is odd, Median = Md = value of the $\left(\frac{n+1}{2}\right)$th item

When n is even, Median = average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th items

Grouped data
In a grouped distribution, values are associated with frequencies.
Grouping can be in the form of a discrete frequency distribution or a continuous
frequency distribution. Whatever may be the type of distribution, cumulative
frequencies have to be calculated.
Cumulative frequency (cf)
Cumulative frequency of each class is the sum of the frequency of the class and the
frequencies of the previous classes, i.e. the frequencies are added successively, so that the last
cumulative frequency gives the total number of items.
Discrete Series
Step1: Find cumulative frequencies.

Step2: Find $\frac{n+1}{2}$

Step3: See in the cumulative frequencies the value just greater than $\frac{n+1}{2}$

Step4: Then the corresponding value of x is the median.

Continuous Series
The steps given below are followed for the calculation of median in continuous series.
Step1: Find cumulative frequencies.

Step2: Find $\frac{n}{2}$

Step3: See in the cumulative frequencies the value first greater than $\frac{n}{2}$. The
corresponding class interval is called the median class. Then apply the formula

Median: Grouped data

$$\text{Median} = L + \left(\frac{\frac{n}{2} - c.f.}{f}\right) \times C$$

Where L = lower limit of the median class
f = frequency of the median class
C = width of the median class
c.f. = cumulative frequency preceding the median class
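
The continuous-series steps can be sketched in a few lines of Python. The function name `grouped_median` is made up for this illustration, and the data are the seed-yield classes of Table No. 1 (class width 1 g).

```python
def grouped_median(lower_limits, freqs, width):
    """Median = L + ((n/2 - c.f.) / f) * C for a continuous series."""
    n = sum(freqs)
    cum = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if cum + f > n / 2:  # first cumulative frequency greater than n/2
            return lower + (n / 2 - cum) / f * width
        cum += f
    raise ValueError("empty distribution")

lower_limits = [2, 3, 4, 5, 6, 7, 8, 9]
freqs = [8, 12, 20, 52, 48, 30, 20, 10]
median = grouped_median(lower_limits, freqs, width=1)
print(round(median, 4))  # 6.1667
```

Here n/2 = 100, the cumulative frequencies run 8, 20, 40, 92, 140, ..., so the median class is 6-7 with c.f. = 92 and f = 48.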

Merits of Median
• Median is not influenced by extreme values because it is a positional average.
• Median can be calculated in case of distribution with open-end intervals.
• Median can be located even if the data are incomplete.

Demerits of Median
• A slight change in the series may bring drastic change in median value.
• It is not suitable for further mathematical treatment except its use in calculating
mean deviation.
• It does not take into account all the observations.

3. Mode
The value which occurs most frequently in a set of observations is
called the mode.
It is an actual value, which has the highest concentration of items in and
around it. It shows the centre of concentration of the frequency in and around a given
value. Therefore, where the purpose is to know the point of the highest
concentration, it is preferred. It is, thus, a positional measure.
It is very important in agriculture, for example to find the typical height of a
crop variety, the most common source of irrigation in a region or the most disease-prone
paddy variety. Thus the mode is an important measure in case of qualitative data.

Grouped Data
For a discrete distribution, see the highest frequency; the corresponding value of x is the
mode.
Continuous distribution
Locate the highest frequency. The class corresponding to that frequency is called the
modal class.
Then apply the formula.

$$\text{Mode} = L + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times C$$

where L = lower limit of the modal class

f0 = the frequency of the class preceding the modal class

f1 = the frequency of the modal class

f2 = the frequency of the class succeeding the modal class

C = class interval
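
A minimal sketch of the modal-class formula, again using the seed-yield classes of Table No. 1. The helper name `grouped_mode` is hypothetical, and treating a missing neighbouring class as frequency 0 is an assumption of this sketch rather than something the notes state.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * C for a continuous series."""
    i = max(range(len(freqs)), key=freqs.__getitem__)  # index of the modal class
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                # assumption: no class before -> 0
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0   # assumption: no class after -> 0
    return lower_limits[i] + (f1 - f0) / (2 * f1 - f0 - f2) * width

lower_limits = [2, 3, 4, 5, 6, 7, 8, 9]
freqs = [8, 12, 20, 52, 48, 30, 20, 10]
mode = grouped_mode(lower_limits, freqs, width=1)
print(round(mode, 4))  # 5.8889
```

The modal class is 5-6 (f1 = 52, f0 = 20, f2 = 48). If the modal class and both neighbours have equal frequencies the denominator is zero and the formula cannot be applied.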

Merits :
1. Mode is easy to calculate
2. Mode is not at all affected by extreme values.

Demerits :
1. Mode is ill-defined
2. It is not based on all the observations
3. It is not capable of further mathematical treatment

4. Geometric mean :
The geometric mean of a series containing n observations is the nth root of the product of the
values.
If x1, x2…, xn are observations then

$$G.M. = \sqrt[n]{x_1 x_2 x_3 \dots x_n}$$

$$G.M. = \text{Antilog}\left(\frac{\sum \log x_i}{n}\right)$$

For grouped data

$$G.M. = \text{Antilog}\left(\frac{\sum f \log x}{n}\right), \qquad n = \sum f$$

GM is used in studies like bacterial growth, cell division, etc.
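
The log form of the formula is the one that is practical to compute. The sketch below uses natural logarithms (any base gives the same answer) and made-up counts that double at each step, in the spirit of the bacterial-growth use mentioned above; the function name and the data are illustrative only.

```python
import math

def geometric_mean(values):
    """G.M. = antilog(sum(log x) / n); returns 0 if any observation is 0."""
    if any(v < 0 for v in values):
        raise ValueError("geometric mean is not defined for negative observations")
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))

counts = [2, 4, 8]           # hypothetical counts, doubling each period
gm = geometric_mean(counts)  # cube root of 2 * 4 * 8 = 64
print(round(gm, 6))          # 4.0
```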


Merits :

1. It is rigidly defined
2. It is based on all the observations
3. It is capable of further mathematical treatment
4. It is not affected much by fluctuation of sampling.
5. It gives comparatively more weights to small items / values
Demerits :
1. G.M. is not easy to understand
2. It is not easy to calculate
3. If one of the observations is zero, the G.M. is zero.
4. If one of the observations is negative, the G.M. is not defined (it may be imaginary).

5. Harmonic mean (H.M.)

Harmonic mean of a set of observations is defined as the reciprocal of the arithmetic
average of the reciprocals of the given values. If x1, x2…..xn are n observations,

$$H.M. = \frac{n}{\sum \frac{1}{x_i}}$$

For a frequency distribution

$$H.M. = \frac{N}{\sum \frac{f}{x}}, \qquad N = \sum f$$

H.M. is used when we are dealing with speeds, rates, etc.
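
To see why the harmonic mean suits speeds, consider a hypothetical trip covered at 40 km/h one way and 60 km/h on the way back over the same distance. The sketch below (function names are illustrative) gives the average speed for the whole trip, which is lower than the arithmetic mean of 50 km/h.

```python
def harmonic_mean(values):
    """H.M. = n / sum(1/x) for raw data."""
    return len(values) / sum(1 / v for v in values)

def harmonic_mean_grouped(values, freqs):
    """H.M. = N / sum(f/x) for a frequency distribution, N = sum(f)."""
    return sum(freqs) / sum(f / v for v, f in zip(values, freqs))

speeds = [40, 60]                                # hypothetical speeds, km/h
hm = harmonic_mean(speeds)                       # about 48 km/h
hm_weighted = harmonic_mean_grouped(speeds, [2, 1])  # about 45 km/h
print(round(hm, 4), round(hm_weighted, 4))
```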

Merits of H.M
1. It is rigidly defined.
2. It is defined on all observations.
3. It is amenable to further algebraic treatment.
4. It is the most suitable average when it is desired to give greater weight to smaller
observations and less weight to the larger ones.
Demerits of H.M
1. It is not easily understood.
2. It is difficult to compute.
3. It is only a summary figure and may not be the actual item in the series
4. It gives greater importance to small items and is therefore, useful only when
small items have to be given greater weightage.
5. It is rarely used in grouped data.

Weighted mean :
In calculating the arithmetic mean we suppose that all the items in the distribution have
equal importance. If some items in a distribution are more important than others, proper
weightage is given to the various items.
E.g. if we want to have an idea of the change in the cost of living of a certain group of people,
the simple mean will not work, because all the commodities are not equally important: wheat,
rice and pulses are more important than tea, confectionery etc.
If X1, X2, …, Xn are the items and W1, W2, …, Wn are the weights attached to each
item respectively, then the weighted mean is given by the formula

$$W.M. = \frac{\sum w_i x_i}{\sum w_i}$$
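
A small sketch of the weighted-mean formula follows. The prices and weights are invented purely for illustration (they are not from the notes); the weights stand for the relative importance of each commodity.

```python
def weighted_mean(values, weights):
    """W.M. = sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

prices = [20, 30, 10]  # hypothetical prices of three commodities
weights = [5, 3, 2]    # hypothetical importance of each commodity

wm = weighted_mean(prices, weights)  # (100 + 90 + 20) / 10 = 21.0
simple = sum(prices) / len(prices)   # 20.0, every item counted equally
print(wm, simple)
```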

Quartiles :
The quartiles divide the distribution into four parts. There are three quartiles.
The second quartile divides the distribution into two halves and is therefore the
same as the median. The first (lower) quartile (Q1) marks off the first one-fourth;
the third (upper) quartile (Q3) marks off the first three-fourths.

Raw or ungrouped data

$$Q_1 = \left(\frac{n+1}{4}\right)\text{th item} \quad \text{and} \quad Q_3 = \left(\frac{3(n+1)}{4}\right)\text{th item}$$
Discrete Series
Step1: Find cumulative frequencies.

Step2: Find $\frac{n+1}{4}$

Step3: See in the cumulative frequencies the value just greater than $\frac{n+1}{4}$; then the
corresponding value of x is Q1.

Step4: Find $\frac{3(n+1)}{4}$

Step5: See in the cumulative frequencies the value just greater than $\frac{3(n+1)}{4}$; then the
corresponding value of x is Q3.

Continuous series
Step1: Find cumulative frequencies.

Step2: Find $\frac{n}{4}$

Step3: See in the cumulative frequencies the value just greater than $\frac{n}{4}$; then the
corresponding class interval is called the first quartile class.

Step4: Find $\frac{3n}{4}$. See in the cumulative frequencies the value just greater than
$\frac{3n}{4}$; then the corresponding class interval is called the third quartile class. Then
apply the respective formulae.

First quartile: Grouped data

$$Q_1 = L + \left(\frac{\frac{n}{4} - c.f.}{f}\right) \times C$$

Where L = lower limit of the first quartile class
f = frequency of the first quartile class
C = width of the first quartile class
c.f. = cumulative frequency preceding the first quartile class

Third quartile: Grouped data

$$Q_3 = L + \left(\frac{\frac{3n}{4} - c.f.}{f}\right) \times C$$

Where L = lower limit of the third quartile class
f = frequency of the third quartile class
C = width of the third quartile class
c.f. = cumulative frequency preceding the third quartile class
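
The grouped-data quartile formulas differ only in the target position (n/4 for Q1, 3n/4 for Q3), so one helper can serve both. The sketch below is illustrative; `grouped_quantile` is a hypothetical name and the data are the seed-yield classes of Table No. 1.

```python
def grouped_quantile(lower_limits, freqs, width, fraction):
    """L + ((fraction*n - c.f.) / f) * C for a continuous series."""
    n = sum(freqs)
    target = fraction * n
    cum = 0  # cumulative frequency preceding the current class
    for lower, f in zip(lower_limits, freqs):
        if cum + f > target:  # first cumulative frequency greater than the target
            return lower + (target - cum) / f * width
        cum += f
    raise ValueError("fraction must be below 1 and the distribution non-empty")

lower_limits = [2, 3, 4, 5, 6, 7, 8, 9]
freqs = [8, 12, 20, 52, 48, 30, 20, 10]

q1 = grouped_quantile(lower_limits, freqs, 1, 0.25)  # class 5-6, c.f. = 40
q3 = grouped_quantile(lower_limits, freqs, 1, 0.75)  # class 7-8, c.f. = 140
print(round(q1, 4), round(q3, 4))  # 5.1923 7.3333
```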

Percentile:
The ninety-nine points which divide a series into 100 equal parts are
called percentiles.

The xth percentile is that value below which x percent of the values in the
distribution fall. It may be noted that the median is the 50th percentile.

For raw data, first arrange the n observations in increasing order. Then the
xth percentile is given by

$$P_x = \left(\frac{x(n+1)}{100}\right)\text{th item}$$

For a continuous frequency distribution the xth percentile is given by

$$P_x = L + \left(\frac{\frac{x \cdot n}{100} - c.f.}{f}\right) \times C$$

Where
L = lower limit of the percentile class, i.e. the class which contains the (x.n/100)th value
c.f. = cumulative frequency up to the class preceding the percentile class
f = frequency of the percentile class
C = class interval
n = total number of observations
Deciles:
The nine points which divide a series into 10 equal parts are called deciles.
For raw data, first arrange the n observations in increasing order. Then the xth decile is
given by

$$D_x = \left(\frac{x(n+1)}{10}\right)\text{th item}$$

For a continuous frequency distribution the xth decile is given by

$$D_x = L + \left(\frac{\frac{x \cdot n}{10} - c.f.}{f}\right) \times C$$

Where
L = lower limit of the decile class, i.e. the class which contains the (x.n/10)th value
c.f. = cumulative frequency of the class preceding the decile class
f = frequency of the decile class
C = class interval
n = total number of observations
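
For raw data the positional rules can be sketched as below. The data are invented for illustration, and the linear interpolation used when the position is fractional is an assumption of this sketch; the notes only say "th item".

```python
def item_at(sorted_values, position):
    """Value at a 1-based, possibly fractional, position (linear interpolation)."""
    position = min(max(position, 1), len(sorted_values))  # keep inside the series
    lower = int(position)      # whole part of the position
    frac = position - lower    # fractional part, 0 when the position is whole
    if frac == 0:
        return sorted_values[lower - 1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + frac * (above - below)

def decile(values, x):
    """xth decile: the (x(n+1)/10)th item of the ordered data."""
    ordered = sorted(values)
    return item_at(ordered, x * (len(ordered) + 1) / 10)

def percentile(values, x):
    """xth percentile: the (x(n+1)/100)th item of the ordered data."""
    ordered = sorted(values)
    return item_at(ordered, x * (len(ordered) + 1) / 100)

data = [12, 15, 11, 18, 20, 14, 17, 13, 19]  # hypothetical observations, n = 9
print(decile(data, 5), percentile(data, 50), percentile(data, 25))
```

With n = 9 the 5th decile and the 50th percentile both fall on the 5th ordered item, which is also the median.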

IMPORTANT
1. The relation between H.M., G.M. and A.M. for positive observations is HM ≤ GM ≤ AM;
the three are equal only when all the observations are equal.
2. If the distribution is perfectly symmetrical, the Mean, Median and
Mode coincide, i.e. Mean = Median = Mode
3. If the distribution is moderately asymmetrical (skewed), then approximately
Mode = 3 Median – 2 Mean
4. Mathematical averages: AM, GM, HM, Quadratic mean
5. Positional averages: Median and Mode. These averages can be located (determined)
graphically.
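
The ordering of the three mathematical averages is easy to check numerically. The sketch below relies on Python's standard `statistics` module (`geometric_mean` needs Python 3.8 or later); the sample values are arbitrary positive numbers chosen for illustration.

```python
import math
from statistics import geometric_mean, harmonic_mean, mean

values = [2, 4, 8]  # arbitrary positive observations

am = mean(values)            # about 4.667
gm = geometric_mean(values)  # about 4.0
hm = harmonic_mean(values)   # about 3.429
print(hm, gm, am)

# When every observation is the same, the three averages coincide.
equal = [5, 5, 5]
all_equal = (math.isclose(mean(equal), geometric_mean(equal))
             and math.isclose(mean(equal), harmonic_mean(equal)))
```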
Lecture-3
Measures of Dispersion
The averages are representatives of a frequency distribution. But they fail to give a
complete picture of the distribution. They do not tell anything about the scatter of
observations within the distribution.
Suppose that we have the distribution of the yields (kg per plot) of two
paddy varieties from 5 plots each. The distribution may be as follows

Variety I     45  42  42  41  40
Variety II    54  48  42  33  30

It can be seen that the mean yields of the two varieties are nearly the same (42 kg for
Variety I and 41.4 kg for Variety II), but we cannot say that the performances of the two
varieties are the same. There is greater uniformity of yields in
the first variety whereas there is more variability in the yields of the second variety. The
first variety may be preferred since it is more consistent in yield performance.
From the above example it is obvious that a measure of central tendency alone is not
sufficient to describe a frequency distribution. In addition to it we should have a measure of
the scatter of observations. The scatter or variation of observations from their
average is called dispersion. There are different measures of dispersion like the
range, the quartile deviation, the mean deviation and the standard deviation.
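
A quick computation on the two varieties makes the contrast concrete. Computed from the yields as listed above, the means come out close to each other (42.0 and 41.4 kg) while the spreads differ widely; the range is used here simply as the quickest measure of spread.

```python
variety_1 = [45, 42, 42, 41, 40]  # yields in kg per plot, from the example
variety_2 = [54, 48, 42, 33, 30]

summary = {}
for name, yields in (("Variety I", variety_1), ("Variety II", variety_2)):
    mean_yield = sum(yields) / len(yields)
    spread = max(yields) - min(yields)  # range = largest - smallest
    summary[name] = (mean_yield, spread)
    print(name, "mean =", round(mean_yield, 1), "range =", spread)
```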

Characteristics of a good measure of dispersion


An ideal measure of dispersion is expected to possess the following properties
• It should be rigidly defined
• It should be based on all the items.
• It should not be unduly affected by extreme items.
• It should lend itself for algebraic manipulation.
• It should be simple to understand and easy to calculate
Objectives:
1. To determine the reliability of an average.
2. To compare two or more series with regard to their variability.
3. To facilitate the use of other statistical measures.
4. To serve as basis for the control of variability
I) Absolute Measures of Dispersion:

Absolute measures of dispersion are expressed in the same unit in which the original data are
given such as rupees, kilograms, grams, centimeter, meter etc.

1. Range
This is the simplest possible measure of dispersion and is defined as the difference
between the largest and smallest values in the set of observations.
In symbols, Range = L – S,
where L = largest value and S = smallest value.
In individual observations and discrete series, L and S are easily identified.
In continuous series, the following two methods are followed.
Method 1
L = Upper boundary of the highest class
S = Lower boundary of the lowest class.
Method 2
L = Mid value of the highest class.
S = Mid value of the lowest class.
Merits and Demerits of Range
Merits
• It is simple to understand.
• It is easy to calculate.
• In certain types of problems like quality control, weather forecasts, share
price analysis, etc., range can be used.
Demerits
• It is very much affected by the extreme items.
• It is based on only two extreme observations.
• It cannot be calculated from open-end class intervals.
• It is not suitable for mathematical treatment.
• It is a very rarely used measure.

2. Quartile deviation: The quartile deviation is given by the following formula

$$\text{Quartile deviation} = \frac{Q_3 - Q_1}{2}$$

Where Q3 = Third quartile
Q1 = First quartile
3. Mean deviation:
Mean deviation is the mean of the absolute deviations of individual values from their average
(mean or median). In case of raw data (ungrouped data) the mean deviation taken from the median is the least.

Formula for ungrouped data:

$$M.D. = \frac{\sum |x - \bar{x}|}{n}$$

Formula for grouped data:

$$M.D. = \frac{\sum f|x - \bar{x}|}{N}$$

Where N = ∑ f
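
The sketch below computes the mean deviation of the Variety II yields from the dispersion example, once about the mean and once about the median, to illustrate the remark that the deviation from the median is the smaller of the two. The function name `mean_deviation` is illustrative.

```python
from statistics import mean, median

def mean_deviation(values, centre):
    """Mean of the absolute deviations of the values from the chosen centre."""
    return sum(abs(v - centre) for v in values) / len(values)

yields = [54, 48, 42, 33, 30]  # Variety II yields, kg per plot

md_about_mean = mean_deviation(yields, mean(yields))      # about 7.92
md_about_median = mean_deviation(yields, median(yields))  # about 7.8
print(round(md_about_mean, 2), round(md_about_median, 2))
```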

4. Standard Deviation:
It is defined as the positive square root of the arithmetic mean of the squares of the
deviations of the given observations from their arithmetic mean.
The standard deviation is denoted by ‘s’ in case of a sample and by the Greek letter sigma (σ) in case of
a population.
The formulas for calculating the standard deviation are as follows.
For ungrouped (raw) data

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}, \qquad \sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$$

For grouped data

$$s = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{n - 1}}, \qquad \sigma = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{n}}$$

where f is frequency and n = ∑ f
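
The following sketch works both sets of formulas by hand and, for the raw data, compares the result with `stdev` and `pstdev` from Python's standard `statistics` module. The raw data are the Variety I yields from the dispersion example and the grouped data are the seed-yield classes of Table No. 1.

```python
import math
from statistics import pstdev, stdev

# Ungrouped data: Variety I yields (kg per plot)
yields = [45, 42, 42, 41, 40]
n = len(yields)
x_bar = sum(yields) / n
ss = sum((x - x_bar) ** 2 for x in yields)  # sum of squared deviations = 14
s = math.sqrt(ss / (n - 1))                 # sample SD, divisor n - 1
sigma = math.sqrt(ss / n)                   # population SD, divisor n

# Grouped data: class mid-points and frequencies of Table No. 1
mids = [2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]
freqs = [8, 12, 20, 52, 48, 30, 20, 10]
N = sum(freqs)
g_mean = sum(f * x for f, x in zip(freqs, mids)) / N
g_ss = sum(f * (x - g_mean) ** 2 for f, x in zip(freqs, mids))  # about 550
g_sigma = math.sqrt(g_ss / N)

print(round(s, 4), round(sigma, 4), round(g_sigma, 4))
```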


Characteristics of Standard Deviation:

i. It takes all the items into consideration.
ii. It is more stable compared to other measures.
iii. It is inflated by extreme items.
iv. It is not affected by adding or subtracting a constant value to each observed value.
v. It is affected by multiplying or dividing each observation by a constant.
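
Properties iv and v can be checked directly. In the sketch below the constants 10 and 2 are arbitrary choices for illustration, and `pstdev` from the standard `statistics` module is used for the standard deviation.

```python
import math
from statistics import pstdev

yields = [45, 42, 42, 41, 40]        # Variety I yields, kg per plot
shifted = [y + 10 for y in yields]   # add a constant to every observation
scaled = [y * 2 for y in yields]     # multiply every observation by a constant

sd = pstdev(yields)
sd_shifted = pstdev(shifted)  # unchanged by the shift
sd_scaled = pstdev(scaled)    # multiplied by the same constant, here 2
print(round(sd, 4), round(sd_shifted, 4), round(sd_scaled, 4))
```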
Merits and Demerits of Standard Deviation:
Merits:
▪ It is rigidly defined and its value is always definite and based on all the
observations and the actual signs of deviations are used.
▪ As it is based on arithmetic mean, it has all the merits of arithmetic mean.
▪ It is the most important and widely used measure of dispersion.
▪ It is amenable to further algebraic treatment.
▪ It is less affected by the fluctuations of sampling and hence stable.
▪ It is the basis for measuring the coefficient of correlation and sampling.
Demerits
▪ It is not easy to understand and it is difficult to calculate.
▪ It gives more weight to extreme values because the deviations are squared.
▪ As it is an absolute measure of variability, it cannot be used to compare
series expressed in different units.

5. Variance:
The square of the standard deviation is called variance.

i.e. Variance = (S.D.)²

If the original data are in metres (m), then the variance is in m².
II) Relative Measures of Dispersion:
A relative measure of dispersion is sometimes called a coefficient of dispersion, because a
coefficient is a pure number that is independent of the unit of measurement (a unitless
measure). Different relative measures of dispersion are as follows.
1. Coefficient of Range:
It is obtained by the following formula:

$$\text{Coefficient of Range} = \frac{L - S}{L + S}$$

where L is the largest value in a data set and
S is the smallest value in a data set.
2. Coefficient of Mean Deviation:
It is obtained by dividing the mean deviation by the particular average used in computing
the mean deviation. Thus if the mean deviation has been computed from the median, the coefficient of mean
deviation is obtained by dividing the mean deviation by the median.

$$\text{Coefficient of Mean Deviation} = \frac{\text{Mean Deviation}}{\text{Median}}$$

If the mean deviation has been computed from the mean, the coefficient of mean deviation
is obtained by dividing the mean deviation by the mean.

$$\text{Coefficient of Mean Deviation} = \frac{\text{Mean Deviation}}{\text{Mean}}$$

3. Coefficient of Quartile Deviation:

The coefficient of quartile deviation is calculated as follows:

$$\text{Coefficient of Quartile Deviation} = \frac{\frac{Q_3 - Q_1}{2}}{\frac{Q_3 + Q_1}{2}} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

4. Coefficient of Variation:
The standard deviation is an absolute measure of dispersion. It is expressed in
terms of the units in which the original figures are collected and stated. The standard deviation of
heights of plants cannot be compared with the standard deviation of weights of grains, as
the two are expressed in different units, i.e. heights in centimetres and weights in kilograms.
Therefore the standard deviation must be converted into a relative measure of dispersion for
the purpose of comparison. The relative measure is known as the coefficient of variation.
The coefficient of variation is obtained by dividing the standard deviation
by the mean and expressing the result as a percentage. Symbolically,

$$\text{Coefficient of Variation (C.V.)} = \frac{S.D.}{\text{Mean}} \times 100$$

Importance of Coefficient of Variation (C.V.):

If we want to compare the variability of two or more series expressed in different units, we can use
the C.V. The series or group of data for which the C.V. is greater is
more variable, less stable, less uniform, less consistent or less homogeneous. The series for which
the C.V. is smaller is less variable, more stable, more uniform, more
consistent or more homogeneous.
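
As an illustration, the sketch below computes the C.V. for the two paddy varieties from the dispersion example. Using the sample standard deviation (divisor n - 1) is an assumption of this sketch; the notes do not say which form of the S.D. to use.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """C.V. = (S.D. / mean) * 100, using the sample S.D. (assumption)."""
    return stdev(values) / mean(values) * 100

variety_1 = [45, 42, 42, 41, 40]  # yields in kg per plot
variety_2 = [54, 48, 42, 33, 30]

cv1 = coefficient_of_variation(variety_1)  # about 4.45 %
cv2 = coefficient_of_variation(variety_2)  # about 24.25 %
print(round(cv1, 2), round(cv2, 2))
```

The smaller C.V. of Variety I agrees with the earlier observation that it is the more consistent of the two.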
Questions
Q.1 Write important measures of dispersion and describe standard deviation in detail.
( 2013-14) , (2015-16)

Q.2 What do you understand by dispersion ? Write important measures of dispersion and
describe standard deviation with its characteristics. (2014-15)

Q.3 What is the importance of studying dispersion? Explain important measures of dispersion
along with formulas. Write characteristics of standard deviation. (2016-17)

Q.4 Write various relative and absolute measures of dispersion and describe importance of
Coefficient of Variation. ( 2017-18)

Q.5 Explain various measures of dispersion in details along with examples. ( 2019-20)

Objective :

1. Which measure is affected most by the presence of extreme values?
a) Range b) Standard Deviation c) Quartile Deviation d) Mean deviation
Ans: Standard Deviation
2. Variance is square of ____________
a) Range b) Standard Deviation c) Quartile Deviation d) Mean deviation
Ans: Standard Deviation

3. If the CV of variety I is 30% and variety II is 25% then Variety II is more consistent.

Ans: True

4. For the set of data 5, 5, 5, 5, 5, 5 the standard deviation value is zero.

Ans: True

5. The absolute measures of dispersion will have the original units.


Ans: True
Concept of measures of Skewness and Kurtosis.
Skewness : Skewness means ‘lack of symmetry’. We study skewness to have an idea
about the shape of the curve which we can draw with the help of the given data. A
distribution is said to be skewed if
i) Mean, Median and Mode fall at different points,
i.e., Mean ≠ Median ≠ Mode.
ii) Quartiles are not equidistant from median, and
iii) The curve drawn with the help of the given data is not symmetrical but stretched
more to one side than the other.

Measures of Skewness: The various measures of skewness are

1) Sk = M - Md
2) Sk = M - M0
Where M is the mean, Md is the median and M0 is the mode of the
distribution
3) Sk = (Q3 – Md) - (Md – Q1)
These are the absolute measures of skewness. As in dispersion, for comparing two
series we do not calculate these absolute measures but we calculate the relative
measures called the co-efficients of skewness which are pure numbers
independent of units of measurement. The following are the co-efficients of
skewness :
1. Prof. Karl Pearson's Co-efficient of skewness.

$$S_k = \frac{M - M_0}{\sigma}$$

Where σ is the standard deviation of the distribution.
If the mode is ill defined, then using the relation M0 = 3Md - 2M for a moderately
asymmetrical distribution, we get

$$S_k = \frac{3(M - M_d)}{\sigma}$$

It has been shown that for any distribution, (M - Md)/σ lies between ±1. Hence
the limits for this co-efficient of skewness are ±3. In practice these limits are
rarely attained.
Skewness is positive if M > M0 or M > Md and negative if M < M0 or M < Md.
2. Prof. Bowley's Co-efficient of skewness based on quartiles,

$$S_k = \frac{(Q_3 - M_d) - (M_d - Q_1)}{(Q_3 - M_d) + (M_d - Q_1)} = \frac{Q_3 + Q_1 - 2M_d}{Q_3 - Q_1}$$

Remark 1. Bowley's Co-efficient of skewness is also known as the Quartile Co-efficient of
skewness. It is especially useful in the following situations.
i) When the mode is ill defined and extreme observations are present in the data.
ii) When the distribution has open-end classes or unequal class intervals.
In these situations Pearson's coefficient of skewness cannot be used.
Remark 2. Limits of Bowley's Co-efficient of skewness: -1 to +1
Remark 3. Skewness is positive if Q3 + Q1 > 2Md and
negative if Q3 + Q1 < 2Md
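
The coefficients can be sketched as three small functions. The input values below are the summary figures that the grouped-data formulas of Lecture 2 give for the seed-yield distribution of Table No. 1 (mean 6.2, median 6 + 8/48, mode 5 + 32/36, population variance 2.75, Q1 = 5 + 10/52, Q3 = 7 + 10/30); the function names are illustrative.

```python
import math

def pearson_skewness_mode(mean, mode, sd):
    """Karl Pearson's coefficient: (M - M0) / sigma."""
    return (mean - mode) / sd

def pearson_skewness_median(mean, median, sd):
    """Form used when the mode is ill defined: 3(M - Md) / sigma."""
    return 3 * (mean - median) / sd

def bowley_skewness(q1, median, q3):
    """Bowley's coefficient: (Q3 + Q1 - 2Md) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

mean, median, mode = 6.2, 6 + 8 / 48, 5 + 32 / 36
sd = math.sqrt(2.75)
q1, q3 = 5 + 10 / 52, 7 + 10 / 30

sk_mode = pearson_skewness_mode(mean, mode, sd)
sk_median = pearson_skewness_median(mean, median, sd)
sk_bowley = bowley_skewness(q1, median, q3)
print(round(sk_mode, 4), round(sk_median, 4), round(sk_bowley, 4))
```

All three come out small and positive for this table. The two Pearson forms need not agree exactly, because the relation M0 = 3Md - 2M is only approximate.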
Kurtosis: If we know the measures of central tendency, dispersion and skewness, we still
cannot form a complete idea about the distribution, as will be clear from the figure in which
all the three curves A, B and C are symmetrical about the mean ‘m’ and have the same range.
In addition to these measures we should know one more measure, which Prof. Karl Pearson
calls the ‘Convexity of a curve’ or Kurtosis.
Kurtosis enables us to have an idea about the flatness or peakedness of the curve.
It is measured by the co-efficient β2 or its derived measure γ2 = β2 - 3.

[Diagram: three curves A, B and C, symmetrical about the mean ‘m’ and with the same range]
A curve of type ‘A’, which is neither flat nor peaked, is called the normal curve
or mesokurtic curve, and for such a curve β2 = 3, i.e., γ2 = 0.
A curve of type ‘B’, which is flatter than the normal curve, is known as
platykurtic, and for such a curve β2 < 3, i.e., γ2 < 0.
A curve of type ‘C’, which is more peaked than the normal curve, is called
leptokurtic, and for such a curve β2 > 3, i.e., γ2 > 0.
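
The notes name β2 without giving its formula. The sketch below assumes the usual moment definition, β2 = μ4 / μ2², where μ2 and μ4 are the second and fourth moments about the mean, and γ2 = β2 - 3. The data are invented for illustration.

```python
def beta2(values):
    """beta2 = mu4 / mu2**2, with moments taken about the mean (assumed definition)."""
    n = len(values)
    m = sum(values) / n
    mu2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    mu4 = sum((v - m) ** 4 for v in values) / n  # fourth central moment
    return mu4 / mu2 ** 2

data = [1, 2, 3, 4, 5]  # hypothetical, evenly spread observations
b2 = beta2(data)        # mu2 = 2, mu4 = 6.8, so beta2 = 1.7
g2 = b2 - 3             # negative, i.e. flatter than the normal curve
print(round(b2, 4), round(g2, 4))
```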
4. Probability Theory

Probability
The concept of probability is difficult to define in precise terms. In ordinary
language, the word probable means likely (or) chance. Generally the word probability is
used to denote the likelihood of the occurrence of a certain event, based on past
experience. By looking at a clear sky, one will say that there
will not be any rain today. On the other hand, by looking at a cloudy or overcast
sky, one will say that there will be rain today. In the first case we expect that there
will not be rain and in the second we expect rain. A mathematician says
that the probability of rain is ‘0’ in the first case and that the probability of rain is ‘1’ in
the second case. Between 0 and 1 there are fractions denoting the chance of the event
occurring. In ordinary language, the word probability means uncertainty about
happenings. In Mathematics and Statistics, a numerical measure of uncertainty is provided
by the important branch of statistics called the theory of probability. Thus we can say that
the theory of probability describes certainty by 1 (one), impossibility by 0 (zero) and
uncertainties by a co-efficient which lies between 0 and 1.

Trial and Event: Consider an experiment which, though repeated under essentially identical (or)
same conditions, does not give a unique result but may result in any one of several
possible outcomes. Performing such an experiment is known as a trial and the outcomes of the
experiment are known as events.

Example:
1. Seed germination: a seed either germinates or does not germinate; these are the events.
2. In a lot of 5 seeds none may germinate (0), or 1 or 2 or 3 or 4 or all 5 may germinate.

Sample space (S)


The set of all possible outcomes of an experiment is called the sample space. For
example, if a set of five seeds is sown in a plot, none may germinate, or 1, 2, 3, 4 or all five
may germinate, i.e. the possible outcomes are {0, 1, 2, 3, 4, 5}. This set of numbers is called
a sample space. Each possible outcome (or) element in a sample space is called a sample
point.

Exhaustive Events
The total number of possible outcomes in any trial is known as exhaustive events
(or) exhaustive cases.
Example
When a pesticide is applied a pest may survive or die. There are two exhaustive
cases, namely (survival, death).
In throwing of a die, there are six exhaustive cases, since any one of the 6 faces
1, 2, 3, 4, 5, 6 may come uppermost.
In drawing 2 cards from a pack of cards the exhaustive number of cases is
52C2, since 2 cards can be drawn out of 52 cards in 52C2 ways.

Trial                        Total number of outcomes   Sample Space
(1) One coin is tossed       2^1 = 2                    {H, T}
(2) Two coins are tossed     2^2 = 4                    {HH, HT, TH, TT}
(3) Three coins are tossed   2^3 = 8                    {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

Favourable Events
The number of cases favourable to an event in a trial is the number of outcomes
which entail the happening of the event.
Example
When a seed is sown, if we are interested in non-germination of the seed,
then non-germination is the favourable event. If we are interested in
germination of the seed, then germination is the favourable event.
Mutually Exclusive Events
Events are said to be mutually exclusive (or) incompatible if the happening of any
one of the events excludes (or) precludes the happening of all the others, i.e. if no two
or more of the events can happen simultaneously in the same trial. The joint
occurrence is not possible.
Example
In observation of seed germination the seed may either germinate or it will not
germinate. Germination and non germination are mutually exclusive events.
Equally Likely Events
Outcomes of a trial are said to be equally likely if, taking into consideration all the
relevant evidence, there is no reason to expect one in preference to the others, i.e. two
or more events are said to be equally likely if each one of them has an equal chance of
occurring.
Independent Events
Several events are said to be independent if the happening of an event is not
affected by the happening of any of the other events.
Example
When two seeds are sown in a pot, the germination of one seed does not
affect the germination or non-germination of the second seed. One event
does not affect the other event.
Dependent Events
If the happening of one event is affected by the happening of one or more other events,
then the events are called dependent events.
Example
If we draw two cards in succession from a well shuffled pack and the first card drawn is
not replaced, then the second draw is dependent on the first draw.

Note: In the case of independent (or) dependent events, the joint occurrence is possible.

Definition of Probability
1. Mathematical (or) Classical (or) a-priori Probability
If an experiment results in ‘n’ exhaustive cases which are mutually exclusive and
equally likely, out of which ‘m’ cases are favourable to the happening of an event
‘A’, then the probability ‘p’ of happening of ‘A’ is given by

p = P(A) = m / n = (number of favourable cases) / (number of exhaustive cases)

Note
If m = 0 ⇒ P(A) = 0, then ‘A’ is called an impossible event; this is also written P(φ) = 0.
If m = n ⇒ P(A) = 1, then ‘A’ is called a sure (or) certain event.
The probability is a non-negative real number and cannot exceed unity, i.e. it
lies between 0 and 1.
The probability of non-happening of the event ‘A’, i.e. P(Ā), is denoted by ‘q’.

q = 1 – p
p + q = 1

(or) P(A) + P(Ā) = 1.
2. Statistical (or) Empirical Probability (or) a-posteriori Probability
If an experiment is repeated a large number (n) of times under the same conditions and an
event ‘A’ happens ‘m’ times, then the statistical probability of ‘A’ is given by

P(A) = lim (n → ∞) m / n

3. Axioms for Probability


The probability of an event ranges from 0 to 1. If the event cannot take
place its probability is ‘0’; if it is certain, its probability is ‘1’.
Let E1, E2, …, En be any events, then P (Ei) ≥ 0.
The probability of the entire sample space is ‘1’, i.e. P(S) = 1.

Total probability: if E1, E2, …, En are mutually exclusive and exhaustive events, then
P (E1) + P (E2) + … + P (En) = 1.

If A and B are mutually exclusive (or) disjoint events then the probability
of occurrence of either A (or) B, denoted by P(A∪B), is given by
P(A∪B) = P(A) + P(B)
and, if E1, E2, …, En are mutually exclusive events,
P(E1∪E2∪…∪En) = P (E1) + P (E2) + … + P (En)

-----------------------------------------------------------------------------------------------------------------------------
Example 1: Two dice are tossed. What is the probability of getting (i) Sum 6 (ii) Sum 9

Solution
When 2 dice are tossed, the exhaustive number of cases is 36:

(1,1), (1,2), (1,3), (1,4), (1,5), (1,6)
(2,1), (2,2), (2,3), (2,4), (2,5), (2,6)
(3,1), (3,2), (3,3), (3,4), (3,5), (3,6)
(4,1), (4,2), (4,3), (4,4), (4,5), (4,6)
(5,1), (5,2), (5,3), (5,4), (5,5), (5,6)
(6,1), (6,2), (6,3), (6,4), (6,5), (6,6)

Sum 6 = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}
Favourable number of cases = 5
P (Sum 6) = 5/36
Sum 9 = {(3, 6), (4, 5), (5, 4), (6, 3)}
Favourable number of cases = 4
P (Sum 9) = 4/36 = 1/9

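The counting in Example 1 can be checked by listing the sample space mechanically. The following is a minimal Python sketch of the classical definition P(A) = m/n; the names outcomes and prob_of_sum are chosen here for illustration and are not part of the notes.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes when two dice are tossed.
outcomes = list(product(range(1, 7), repeat=2))

def prob_of_sum(target):
    """Classical probability m/n that the two faces add up to target."""
    favourable = [pair for pair in outcomes if sum(pair) == target]
    return Fraction(len(favourable), len(outcomes))

p_sum_6 = prob_of_sum(6)  # 5 favourable cases out of 36
p_sum_9 = prob_of_sum(9)  # 4 favourable cases out of 36
print(p_sum_6, p_sum_9)
```

Fraction keeps the answers exact, so 4/36 is shown in its reduced form 1/9.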
Example 2: A card is drawn from a pack of cards. What is a probability of getting (i) a
king (ii) a spade (iii) a red card?

Solution
There are 52 cards in a pack.
One can be selected in 52C1 ways.
Exhaustive number of cases is = 52C1 = 52.
(i) A king
There are 4 kings in a pack.
One king can be selected in 4C1 ways.
Favourable number of cases is = 4C1 = 4
Hence the probability of getting a king = 4/52 = 1/13

(ii) A spade
There are 13 spades in a pack.
One spade can be selected in 13C1 ways.
Favourable number of cases is = 13C1 = 13
Hence the probability of getting a spade = 13/52 = 1/4
(iii) A red card
There are 26 red cards in a pack.
One red card can be selected in 26C1 ways.
Favourable number of cases is = 26C1 = 26
Hence the probability of getting a red card = 26/52 = 1/2

Example 3: What is the probability of getting 53 Sundays when a leap year is selected at
random?

Solution
A leap year consists of 366 days.
This has 52 full weeks and 2 days remaining.
The remaining 2 days have the following possibilities:

(i) Sun, Mon (ii) Mon, Tues (iii) Tues, Wed (iv) Wed, Thurs (v) Thurs, Fri (vi) Fri, Sat
(vii) Sat, Sun.
In order that a leap year selected at random should contain 53 Sundays, one of the
2 extra days must be a Sunday.
Exhaustive number of cases is = 7
Favourable number of cases is = 2
Required Probability is = 2/7
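
The seven possible pairs of extra days in Example 3 can also be generated and counted by a short program. This is only an illustrative Python sketch; the variable names are assumptions made for the example.

```python
from fractions import Fraction

days = ["Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"]

# The two extra days of a leap year are consecutive days of the week,
# so there are 7 equally likely pairs (the last one wraps round to Sunday).
extra_pairs = [(days[i], days[(i + 1) % 7]) for i in range(7)]

# A pair is favourable if one of its two days is a Sunday.
favourable = [pair for pair in extra_pairs if "Sun" in pair]

p_53_sundays = Fraction(len(favourable), len(extra_pairs))
print(favourable, p_53_sundays)
```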
Conditional Probability
Two events A and B are said to be dependent, when B can occur only when A is
known to have occurred (or vice versa). The probability attached to such an event is
called the conditional probability and is denoted by P (A/B) (read it as: A given B) or, in
other words, probability of A given that B has occurred.

If two events A and B are dependent, then the conditional probability of B given A is

P (B/A) = P (A∩B) / P (A), provided P (A) > 0

Theorems of Probability

There are two important theorems of probability namely,


The addition theorem on probability
The multiplication theorem on probability.

I. Addition Theorem on Probability


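(i) Let A and B be any two events which are not mutually exclusive. Then
P (A or B) = P (A∪B) = P (A + B) = P (A) + P (B) – P (A∩B) (or)
= P (A) + P (B) – P (AB)
Proof
We know that n (A∪B) = n (A) + n (B) – n (A∩B), because the cases common to A and B
are counted once in n (A) and once again in n (B). Dividing by n, the total number of
exhaustive cases,
P (A∪B) = n (A∪B) / n = n (A) / n + n (B) / n – n (A∩B) / n
P (A∪B) = P (A) + P (B) – P (A∩B)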
(ii) Let A and B be any two events which are mutually exclusive

P (A or B) = P (A∪B) = P (A + B) = P (A) + P (B)

Proof

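We know that n (A∪B) = n (A) + n (B). If n is the total number of exhaustive cases,

P (A∪B) = n (A∪B) / n
        = [n (A) + n (B)] / n
        = n (A) / n + n (B) / n
P (A∪B) = P (A) + P (B)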

Note
In the case of 3 events, (not mutually exclusive events)
P (A or B or C) = P (A∪B∪C) = P (A + B + C)
= P (A) + P (B) + P (C) – P (A∩B) – P (B∩C) – P (A∩C) + P (A∩B∩C)
In the case of 3 events, (mutually exclusive events)
P (A or B or C) = P (A∪B∪C) = P (A + B + C) = P (A) + P (B) + P (C)

Example
Using the additive law of probability we can find the probability that in one roll of a
die we will obtain either a one-spot or a six-spot. The probability of obtaining a one-spot
is 1/6. The probability of obtaining a six-spot is also 1/6. The probability of rolling a
die and getting a side that has both a one-spot and a six-spot is 0, since no side of a
die shows both. Substituting these values into the equation gives the following result:

P (1 or 6) = 1/6 + 1/6 – 0 = 2/6 = 1/3

Example: Finding the probability of drawing a 4 of hearts or a 6 of any suit using the
additive law of probability would give the following:

P (4 of hearts or 6) = 1/52 + 4/52 – 0 = 5/52

There is only a single 4 of hearts, there are 4 sixes in the deck, and there isn't a single
card that is both the 4 of hearts and a six of any suit.

Now using the additive law of probability, you can find the probability of drawing
either a king or any club from a deck of shuffled cards. The equation would be completed
like this:

P (king or club) = 4/52 + 13/52 – 1/52 = 16/52 = 4/13

There are 4 kings and 13 clubs, and one card is both a king and a club. We don't want
to count that card twice, so we must subtract one of its occurrences to obtain the
result.
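
The addition theorem for events that are not mutually exclusive can be verified on a full pack of cards by treating events as sets. The sketch below is an illustration only; the deck representation and the helper prob are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = set(product(ranks, suits))  # 52 equally likely cards

def prob(event):
    """Classical probability of an event given as a set of cards."""
    return Fraction(len(event), len(deck))

kings = {card for card in deck if card[0] == "K"}
clubs = {card for card in deck if card[1] == "clubs"}

# Left side: count the union directly. Right side: addition theorem.
p_union = prob(kings | clubs)
p_theorem = prob(kings) + prob(clubs) - prob(kings & clubs)
print(p_union, p_theorem)
```

Both sides come to 16/52 = 4/13, because the king of clubs is counted once in kings, once in clubs, and is then subtracted once.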

II. Multiplication Theorem on Probability


(i) If A and B are any two events which are not independent (i.e. dependent), then
P (A and B) = P (A∩B) = P (AB) = P (A). P (B/A) (I)
= P (B). P (A/B) (II)
where P (B/A) and P (A/B) are the conditional probabilities of B given A and A
given B respectively.
Proof

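Let n be the total number of exhaustive cases,
n (A) the number of cases favourable to A,
n (B) the number of cases favourable to B,
n (A∪B) the number of cases favourable to (A∪B),
n (A∩B) the number of cases favourable to (A∩B).

P (A∩B) = n (A∩B) / n = [n (A) / n] × [n (A∩B) / n (A)]
P (A∩B) = P (A). P (B/A) (I)

P (A∩B) = n (A∩B) / n = [n (B) / n] × [n (A∩B) / n (B)]
P (A∩B) = P (B). P (A/B) (II)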

(ii) If A and B are any two events which are independent, then


P (B/A) = P (B) and P (A/B) = P (A)
P (A and B) = P (A∩B) = P (AB) = P (A) . P (B)

Note
(i) In the case of 3 events, (dependent)
P (A∩B∩C) = P (A). P (B/A). P (C/AB)

(ii) In the case of 3 events, (independent)


P (A∩B∩C) = P (A). P (B). P (C)

Example
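In finding the probability of drawing a 4 and then a 7 from a well shuffled
deck of cards, this law states that we need to multiply the separate probabilities
together. If the first card is replaced before the second draw, the draws are
independent and
P (4 and then 7) = 4/52 × 4/52 = 1/169
If the first card is not replaced, the second draw depends on the first and
P (4 and then 7) = P (4). P (7/4) = 4/52 × 4/51 = 4/663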
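
The multiplication theorem for drawing a 4 and then a 7 can be checked by enumeration. Whether the first card is put back is an assumption that changes the answer, so this Python sketch works out both cases; the deck representation is hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import permutations, product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))

# Case 1: first card replaced, so the draws are independent.
p_four = Fraction(sum(1 for rank, _ in deck if rank == "4"), len(deck))
p_seven = Fraction(sum(1 for rank, _ in deck if rank == "7"), len(deck))
p_with_replacement = p_four * p_seven  # P(A) . P(B)

# Case 2: first card not replaced, so the draws are dependent.
# Enumerate every ordered pair of distinct cards (52 x 51 pairs).
ordered_pairs = list(permutations(deck, 2))
favourable = [(a, b) for a, b in ordered_pairs if a[0] == "4" and b[0] == "7"]
p_without_replacement = Fraction(len(favourable), len(ordered_pairs))

# The theorem P(A and B) = P(A) . P(B/A) gives the same value,
# with P(7 on second draw / 4 on first draw) = 4/51.
p_by_theorem = p_four * Fraction(4, 51)
print(p_with_replacement, p_without_replacement, p_by_theorem)
```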

----------------------------------------------------------------------------------------------------------------------------------------------------------

Questions

Q.1 Write a short note on ‘Addition theorem of Probability’. (2013-14, 2015-16, 2017-18, 5 marks)
Q.2 Define mathematical , statistical and the axioms of probability. Explain addition theorem on
probability for any two events. (2014-15)
Q.3. Define probability and conditional probability. State and prove the addition theorem of probability
for any two events. (2016-17)
Objective Questions
1. Probability is expressed as
(a) Ratio (b) percentage (c) Proportion (d) all the above
Ans: all the above
2. Probability can take values from
(a) - ∞ to +∞ (b) - ∞ to 1 (c) 0 to +1 (d) –1 to +1
Ans: 0 to +1
3. The probability of a sure event is One.
Ans: True
4. If A and B are mutually exclusive events, then P (AUB) = ------------
Ans: P (A) + P (B)
5. An integer is chosen from 1 to 20. The probability that the number is divisible
by 4 is ¼.
Ans: True
------------------------------------------------------------------------------------------------------------------------
5. Theoretical Distributions

Random variable – By a random variable we mean a real number ‘X’ connected with the outcome
of a random experiment.
e.g. consider an experiment of tossing 2 coins; the sample space is
S = {HH, TH, HT, TT}
Outcome    – HH  TH  HT  TT
Value of X –  2   1   1   0
(Here X is the number of heads)
Then the frequency distribution of X is
X: 0 1 2
F: 1 2 1
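
The way a random variable attaches a number to each outcome can be sketched in Python for the two-coin experiment. The names below are illustrative assumptions; dividing each frequency by the 4 equally likely outcomes gives the probability distribution of X.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for tossing two coins: HH, HT, TH, TT.
sample_space = ["".join(toss) for toss in product("HT", repeat=2)]

# The random variable X maps each outcome to its number of heads.
x_value = {outcome: outcome.count("H") for outcome in sample_space}

# Frequency of each value of X, and the corresponding probabilities.
frequency = Counter(x_value.values())
probability = {x: Fraction(f, len(sample_space))
               for x, f in sorted(frequency.items())}
print(dict(sorted(frequency.items())), probability)
```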

Discrete random variable – if a random variable takes at most a countable number of values, it
is called a discrete random variable. In other words, a real valued function defined on a discrete
sample space is called a discrete random variable.
Example: 1. Number of heads obtained in tossing of coins.
Continuous random variable – a random variable X is said to be continuous if it can take all
possible values between certain limits.
Example: 1. Height of plant.
2. Grain yield in kg/plot

Theoretical distributions are

1. Binomial distribution (discrete distribution)
2. Poisson distribution (discrete distribution)
3. Normal distribution (continuous distribution)

Discrete Probability distribution

Bernoulli distribution

A random variable X which takes two values 0 and 1, with probabilities q and p
respectively, i.e. P(X = 1) = p and P(X = 0) = q, where q = 1 − p, is called a Bernoulli
variate and is said to follow the Bernoulli distribution, where p and q are the
probabilities of success and failure. It was given by the Swiss mathematician James
Bernoulli (1654-1705).

Example
• Tossing a coin(head or tail)
• Germination of seed(germinate or not)

1. Binomial distribution

Binomial distribution was discovered by James Bernoulli (1654-1705). Let a
random experiment be performed repeatedly, and let the occurrence of an event in a trial
be called a success and its non-occurrence a failure. Consider a set of n independent
trials (n being finite), in which the probability p of success is constant for each trial.
Then q = 1 − p is the probability of failure in any trial.
The probability of x successes, and consequently n − x failures, in n independent trials
in one specified order is
P(ss…s ff…f) = P(s) P(s) … P(s) × P(f) P(f) … P(f)
             = (p · p … p) (q · q … q)     [p taken x times, q taken n − x times]
             = p^x q^(n−x)
But x successes in n trials can occur in nCx ways, and the probability for each of these
ways is p^x q^(n−x).
Hence the probability of x successes in n trials is given by
nCx p^x q^(n−x)

Definition

A random variable X is said to follow the binomial distribution if it assumes only
non-negative integer values and its probability mass function is given by

P(X = x) = p(x) = nCx p^x q^(n−x),   x = 0, 1, 2, …, n;   q = 1 − p
                = 0,                  otherwise

The two independent constants n and p in the distribution are known as the parameters of
the distribution.
Condition for Binomial distribution

We get the binomial distribution under the following experimentation conditions


• The number of trials n is finite

• The trials are independent of each other.

• The probability of success p is constant for each trial.

• Each trial must result in a success or failure.

• The events are discrete events.


Properties

If p and q are equal, the given binomial distribution will be symmetrical. If p
and q are not equal, the distribution will be skewed.

• Mean = E(x) = np

• Variance =V(x) = npq (mean>variance)


Application

• Quality control measures and sampling processes in industries to classify
items as defective or non-defective.

• Medical applications such as success or failure, cure or no-cure.


-------------------------------------------------------------------------------------------------------------------------------------------------
Example 1
Eight coins are tossed simultaneously. Find the probability of getting at least six heads.
Solution
Here the number of trials n = 8, and p denotes the probability of getting a head, so

p = 1/2 and q = 1/2
If the random variable X denotes the number of heads, then the probability of x successes
in n trials is given by
P(X = x) = nCx p^x q^(n−x), x = 0, 1, 2, ..., n
         = 8Cx (1/2)^x (1/2)^(8−x) = 8Cx (1/2)^8

Probability of getting at least six heads is given by

P(X ≥ 6) = P(X = 6) + P(X = 7) + P(X = 8)
         = (8C6 + 8C7 + 8C8) (1/2)^8
         = (28 + 8 + 1) / 256
         = 37/256 ≈ 0.1445
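
The binomial probabilities in this example can be evaluated directly from the probability mass function nCx p^x q^(n−x). The following is a minimal Python sketch using exact fractions; the function name binomial_pmf is an assumption made for illustration.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, x, p):
    """P(X = x) = nCx * p^x * q^(n-x) for a binomial variate."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

n = 8                # eight coins
p = Fraction(1, 2)   # probability of a head on each coin

# At least six heads means X = 6, 7 or 8.
p_at_least_six = sum(binomial_pmf(n, x, p) for x in range(6, n + 1))
print(p_at_least_six, float(p_at_least_six))
```

The same function also confirms that the probabilities for x = 0, 1, …, n add up to 1.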
2. Poisson distribution
The Poisson distribution is named after Simeon Denis Poisson (1781-1840).
It is a discrete distribution which describes random events that occur
rarely over a unit of time or space.

It differs from the binomial distribution in the sense that in the binomial distribution
we count the number of successes and the number of failures, while in the Poisson
distribution we know only the average number of successes in a given unit of time or space.

Definition

The probability that exactly x events will occur in a given time is as follows

P(x) = e^(−λ) λ^x / x!,   x = 0, 1, 2, …

which is called the probability mass function of the Poisson distribution,

where λ is the average number of occurrences per unit of time:

λ = np

Condition for Poisson distribution

Poisson distribution is the limiting case of binomial distribution under the
following assumptions.

• The number of trials n should be indefinitely large, i.e. n → ∞
• The probability of success p for each trial is indefinitely small.
• np = λ should be finite, where λ is constant.
Properties

Poisson distribution is defined by single parameter λ.


• Mean = λ
• Variance = λ. Mean and Variance are equal.
Application
• It is used in quality control statistics to count the number of defects of an item.
• In biology, to count the number of bacteria.
• In determining the number of deaths in a district in a given period, by rare
disease.
• The number of errors per page in typed material.
• The number of plants infected with a particular disease in a plot of a field.
• The number of weeds of a particular species in different plots of a field.
Example: Suppose on an average 1 house in 1000 in a certain district has a fire during a
year. If there are 2000 houses in that district, what is the probability that exactly 5 houses
will have a fire during the year? [given that e^(−2) = 0.13534]
Solution:

Mean λ = np, with n = 2000 and p = 1/1000, so

λ = 2000 × 1/1000 = 2
The Poisson distribution is

P(X = x) = e^(−λ) λ^x / x!
P(X = 5) = e^(−2) 2^5 / 5! = (0.13534 × 32) / 120 = 0.036
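
The arithmetic of the fire example can be reproduced with the Poisson probability mass function e^(−λ) λ^x / x!. This is a small Python sketch for checking the value, with the function name poisson_pmf assumed for illustration.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam^x / x! for a Poisson variate."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

n = 2000      # houses in the district
p = 1 / 1000  # chance that a given house has a fire in a year
lam = n * p   # mean number of fires per year

p_five = poisson_pmf(5, lam)
print(round(lam, 3), round(p_five, 3))
```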

IMPORTANT

Recurrence Relation for the Probabilities of Binomial Distribution

( used for Fitting of Binomial Distribution )

p (x+1) = [(n − x) / (x + 1)] (p / q) p (x)

Recurrence Relation for the Probabilities of Poisson Distribution

( used for Fitting of Poisson Distribution )

p (x+1) = [λ / (x + 1)] p (x)
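
The two recurrence relations can be used to build up all the probabilities from the first one. The sketch below assumes the starting values p(0) = q^n for the binomial and p(0) = e^(−λ) for the Poisson distribution, which follow from the probability mass functions; the function names are illustrative and 0 < p < 1 is assumed so that q is not zero.

```python
import math

def binomial_by_recurrence(n, p):
    """Return [p(0), ..., p(n)] using p(x+1) = ((n-x)/(x+1)) * (p/q) * p(x)."""
    q = 1 - p
    probs = [q ** n]  # starting value p(0) = q^n
    for x in range(n):
        probs.append(probs[-1] * (n - x) / (x + 1) * p / q)
    return probs

def poisson_by_recurrence(lam, max_x):
    """Return [p(0), ..., p(max_x)] using p(x+1) = (lam/(x+1)) * p(x)."""
    probs = [math.exp(-lam)]  # starting value p(0) = e^(-lam)
    for x in range(max_x):
        probs.append(probs[-1] * lam / (x + 1))
    return probs

binomial_probs = binomial_by_recurrence(8, 0.5)
poisson_probs = poisson_by_recurrence(2.0, 5)
print([round(v, 4) for v in binomial_probs])
print([round(v, 4) for v in poisson_probs])
```

In fitting a distribution, each probability is then multiplied by the total frequency N to get the expected frequencies.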
3. Normal distribution:

The most important and widely used probability distribution is the normal distribution. It is
also known as the Gaussian distribution.
Binomial, Poisson and other distributions can be approximated by the normal distribution.
Many sampling distributions, like Student's ‘t’, F and χ², tend to normality for large
samples, i.e. for sample size ≥ 30.

Definition

A continuous random variable X is said to follow a normal distribution with
parameters µ and σ² if its probability density function is given by

f(x) = (1 / (σ√(2π))) e^(−(x − µ)² / (2σ²)),   −∞ < x < ∞, −∞ < µ < ∞, σ > 0

Where, π = constant (≈ 3.1416)
e = constant (≈ 2.7183)
µ = population mean
σ = standard deviation of population
x = value of the random variable

Properties of Normal distribution:


1. The curve of the normal distribution is bell shaped and symmetric about the mean.
2. The shape of the curve depends on µ and σ.
3. The variable varies continuously.
4. The majority of observations have a tendency to cluster about the mean.
5. Mean, median and mode coincide, i.e. mean = median = mode.
6. The area under the normal curve within its range −∞ to ∞ is always unity.
7. Area property
P (µ − σ < X < µ + σ) = 0.6826, i.e. 68.26% of the area lies within the µ ± σ limits
P (µ − 2σ < X < µ + 2σ) = 0.9544, i.e. 95.44% of the area lies within the µ ± 2σ limits
P (µ − 3σ < X < µ + 3σ) = 0.9973, i.e. 99.73% of the area lies within the µ ± 3σ limits
8. Pearson's coefficients are β1 = 0 and β2 = 3.
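
The area property can be checked numerically without tables. The sketch below uses the relation P(µ − kσ < X < µ + kσ) = erf(k / √2), which holds for any normal distribution; the function name area_within is an assumption for illustration. The computed areas agree with the table values quoted above to about three decimal places (the tables are rounded).

```python
import math

def area_within(k):
    """Area under the normal curve between mu - k*sigma and mu + k*sigma."""
    return math.erf(k / math.sqrt(2))

areas = {k: area_within(k) for k in (1, 2, 3)}
for k, area in areas.items():
    print(f"within {k} sigma: {area:.4f}")
```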
Importance / Applications of Normal distribution:
i. Most of the distributions occurring in practice, e.g. Binomial, Poisson, etc.,
can be approximated by the normal distribution. Many of the sampling
distributions, e.g. Student's t, Snedecor's F, Chi-square etc., tend
to normality for large samples.
ii. Even if a variable is not normally distributed, it can sometimes be brought to
normal form by a simple transformation of the variable, e.g. if the distribution of X is
skewed, the distribution of √X might come out to be normal.
iii. If X ~ N (µ, σ²) then,
P (µ − 3σ < X < µ + 3σ) = 0.9973
P (−3 < Z < 3) = 0.9973
P (|Z| < 3) = 0.9973
P (|Z| > 3) = 0.0027
This property of the normal distribution forms the basis of the entire large sample theory.
iv. Many of the distributions of sample statistics, e.g. sample mean, sample variance etc., tend
to normality for large samples, and as such they can best be studied with the help of the
normal curve.

v. The entire theory of small sample tests, viz. t, F, χ², is based on the fundamental
assumption that the parent population from which the samples have been drawn
follows the normal distribution.
vi. The normal distribution finds wide application in statistical quality control in
industry for the setting of control limits.

4. Standard Normal distribution

Let X be a random variable which follows the normal distribution with mean µ and
variance σ². The standard normal variate is defined as Z = (X − µ) / σ, which follows the
standard normal distribution with mean 0 and standard deviation 1, i.e. Z ~ N(0, 1). The
standard normal distribution is given by

f(z) = (1 / √(2π)) e^(−z² / 2),   −∞ < z < ∞

The advantage of the above function is that it doesn't contain any parameter. This enables
us to compute the area under the normal probability curve.

Note

Property of

1.

2.

Example 6: In a normal distribution whose mean is 12 and standard deviation is 2,
find the probability for the interval from x = 9.6 to x = 13.8.

Solution

Given that X ~ N (12, 4), so that Z = (X − 12) / 2

P (9.6 ≤ X ≤ 13.8) = P ((9.6 − 12) / 2 ≤ Z ≤ (13.8 − 12) / 2)
= P (−1.2 ≤ Z ≤ 0.9)
= P (−1.2 ≤ Z ≤ 0) + P (0 ≤ Z ≤ 0.9)
= P (0 ≤ Z ≤ 1.2) + P (0 ≤ Z ≤ 0.9) [by using symmetric property]
= 0.3849 + 0.3159
= 0.7008

When converted to a percentage, about 70% of the observations lie
between 9.6 and 13.8.
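
The table look-up in Example 6 can be cross-checked with the NormalDist class from Python's standard statistics module (available from Python 3.8). This is only a checking sketch; the variable names are chosen for illustration.

```python
from statistics import NormalDist

# X ~ N(12, 4): mean 12 and standard deviation 2.
x_dist = NormalDist(mu=12, sigma=2)
p_interval = x_dist.cdf(13.8) - x_dist.cdf(9.6)

# Same calculation through the standard normal variate Z = (X - 12) / 2,
# split at zero as in the worked solution.
z_dist = NormalDist()  # N(0, 1)
p_via_z = (z_dist.cdf(1.2) - 0.5) + (z_dist.cdf(0.9) - 0.5)

print(round(p_interval, 4), round(p_via_z, 4))
```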

Example: For a normal distribution whose mean is 2 and standard deviation is 3,
find the value of the variate such that the probability of the variate lying between the
mean and that value is 0.4115.

Solution: Given that X ~ N (2, 9)

To find X1:

We have P (2 ≤ X ≤ X1) = 0.4115

P (0 ≤ Z ≤ Z1) = 0.4115, where Z1 = (X1 − 2) / 3

[The value of Z in the normal table against which the area 0.4115 lies is the
value of Z1.] From the normal table we have Z1 = 1.35

⇒ (X1 − 2) / 3 = 1.35
⇒ X1 = 3 (1.35) + 2
⇒ X1 = 6.05

(i.e.) about 41% of the observations lie between 2 and 6.05
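
Reading the normal table "backwards" corresponds to the inverse of the cumulative distribution function. A minimal Python sketch of this example, using statistics.NormalDist.inv_cdf from the standard library (variable names are illustrative):

```python
from statistics import NormalDist

mu, sigma = 2, 3
area_from_mean = 0.4115  # P(0 <= Z <= Z1)

# The table gives the area from 0 to Z1, so the total area to the left
# of Z1 is 0.5 + 0.4115.
z1 = NormalDist().inv_cdf(0.5 + area_from_mean)

# Convert back from Z to the original variate: X1 = mu + sigma * Z1.
x1 = mu + sigma * z1
print(round(z1, 2), round(x1, 2))
```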


-------------------------------------------------------------------------------------------------------------------
Questions:

Q.1 State Normal distribution and explain its properties. ( 2013-14) ,( 2014-15), (2015-16)
Q.2 Define Normal distribution and standard normal distribution. Write properties and applications /
Importance of normal distribution.( 2016-17) ( 2017-18)
Q. 3 Define Binomial and Poisson Distribution and write their properties.( 2017-18)
-------------------------------------------------------------------------------------------------------------------

Objective Questions

1. For a Poisson distribution


(a) mean > variance (b) mean = variance
(c) mean < variance (d) mean < variance

Ans: mean = variance

2. In normal distribution, skewness is


(a) one (b) zero
(c) greater than one (d) less than one
Ans: zero

3. Poisson distribution is a distribution for rare events

Ans: True

4. The total area under normal probability curve is one.

Ans: True

5. Poisson distribution is for continuous variable.

Ans: False

6. In a symmetrical curve mean, median and mode will coincide.

Ans: True

7. The probability density function of a normal variable is

f(x) = (1 / (5√(2π))) e^(−(1/2) (x − 30)² / 25),   −∞ < x < ∞

Then what are the mean and variance?



Lecture.6 Test of Significance

Definitions :
1. Population – the aggregate of all units is known as the population.
e.g. i) a population may consist of all paddy plants in a field.
ii) all the fields under a specified crop.
iii) all farms in a village.
iv) all houses in a village.
There are two types of populations:

i) Finite population: e.g. the population of farmers or farms in a region

ii) Infinite population: e.g. the population of plants or insects in a region

2. Frame – A convenient list of all units in the population is called as frame.


3. Sample – A part of population is known as sample.
4. Parameter: – Population quantities are called parameter viz., mean, variance etc.
5. Statistic: - Sample quantities are called statistic. viz mean, variance etc. calculated from samples.

Sampling Distribution
By drawing all possible samples of the same size from a population we can calculate a
statistic, for example the mean x̄, for every sample. Based on this we can construct a
frequency distribution and the probability distribution of x̄. Such a probability
distribution of a statistic is known as the sampling distribution of that statistic.

Standard Error
The Standard deviation (SD) of the sampling distributions of the statistic is
called standard error of the Statistic.

It may be noted that the standard deviation is for units whereas the standard
error is for the statistic.

Hypothesis
Hypothesis is a quantitative statement or assumption about population.
Statistical Hypothesis
When the assumption or statement that occurs under certain conditions is
formulated as a scientific hypothesis, we can construct criteria by which the scientific
hypothesis is either rejected or provisionally accepted. For this purpose, the
scientific hypothesis is translated into statistical language. If the hypothesis is given
in statistical language it is called a statistical hypothesis.

For e.g.:
The yield of a new paddy variety will be 3500 kg per hectare – scientific hypothesis.
In statistical language it may be stated as: the random variable (yield of
paddy) is distributed normally with mean 3500 kg/ha.
Simple Hypothesis:
When a hypothesis specifies all the parameters of a probability distribution,
it is known as a simple hypothesis.
For e.g.: ‘The random variable x is distributed normally with mean µ = 0 and SD σ = 1’ is a
simple hypothesis, because it specifies all the parameters (µ and σ) of the normal
distribution.

Composite Hypothesis:
If the hypothesis specifies only some of the parameters of the probability
distribution, it is known as a composite hypothesis. In the above example, if only
µ is specified or only σ is specified, it is a composite hypothesis.
Null Hypothesis - Ho
The hypothesis which is under test is called null hypothesis.

Consider, for example, that the hypothesis is put in the form ‘paddy variety A
will give the same yield per hectare as variety B’ or ‘there is no difference
between the average yields of paddy varieties A and B’. These hypotheses are in
definite terms and thus form a basis to work with. Such a working
hypothesis is known as the null hypothesis. It is called the null hypothesis because it
nullifies the original hypothesis that variety A will give more yield than variety B.
The null hypothesis is stated as ‘there is no significant difference between
the effects of two treatments’ or ‘there is no association between two attributes’ (i.e. the
two attributes are independent). The null hypothesis is denoted by Ho.
Eg:-
There is no significant difference between the yields of two paddy varieties
(or) they give same yield per unit area. Symbolically, Ho: µ 1=µ2.

Alternative Hypothesis:
Any hypothesis which is complementary to the null hypothesis is called an
alternative hypothesis, usually denoted by H1.
E.g.: There is a significant difference between the yields of two paddy varieties.
Symbolically,
H1: µ1 ≠ µ2 (two sided or directionless alternative)
If the statement is that A gives significantly less yield than B, or that A gives
significantly more yield than B, then symbolically,
H1: µ1 < µ2 (one sided alternative, left tailed)
H1: µ1 > µ2 (one sided alternative, right tailed)

Testing of Hypothesis :
Once the hypothesis is formulated we have to make a decision on it. A
statistical procedure by which we decide to accept or reject a statistical hypothesis
is called testing of hypothesis.
Sampling Error
From sample data, the statistic is computed and the parameter is estimated
through the statistic. The difference between the parameter and the statistic is
known as the sampling error.

Test of Significance:
Based on the sampling error the sampling distributions are derived. The
observed results are then compared with the expected results on the basis of
sampling distribution. If the difference between the observed and expected results is
more than specified quantity of the standard error of the statistic, it is said to be
significant at a specified probability level. The process up to this stage is known as
test of significance.
Decision Errors:
By performing a test we make a decision on the hypothesis by accepting or
rejecting the null hypothesis Ho. In the process we may make a correct decision on
Ho or commit one of two kinds of error.
Type I error: We may reject Ho based on sample data when in fact it is true. This
error in decision is known as Type I error, i.e. rejecting the null hypothesis Ho
when it is true.
Type II error: We may accept Ho based on sample data when in fact it is not true. It is known
as Type II error, i.e. accepting the null hypothesis Ho when it is not true.

               Accept Ho          Reject Ho
Ho is true     Correct decision   Type I error
Ho is false    Type II error      Correct decision

The relationship between Type I and Type II errors is that, for a given sample size, if
one increases the other will decrease. The probability of Type I error is denoted by α. The
probability of Type II error is denoted by β. The probability of making the correct
decision of rejecting the null hypothesis when it is false is known as the power of the
test, and is given by 1 − β.

Critical Region:
The testing of statistical hypothesis involves the choice of a region on the
sampling distribution of statistic. If the statistic falls within this region, the null
hypothesis is rejected: otherwise it is accepted. This region is called critical region
( rejection region).
Test statistic:
Statistics such as Z, t and χ², whose sampling distributions are known, are used as test
statistics. Generally, in case of quantitative data,

Test statistic = (Statistic − Parameter) / Standard error of the statistic

Note
The choice of the test statistic depends on the nature of the variable (ie) qualitative or
quantitative, the statistic involved (i.e) mean or variance and the sample size, (i.e) large or small.
Level of Significance:
The probability that the statistic will fall in the critical region is α; in a two tailed
test it is split between the two tails as α/2 + α/2 = α. This α is
nothing but the probability of committing Type I error. Technically, the probability of
committing Type I error is known as the level of significance.
One and two tailed test:
The nature of the alternative hypothesis determines the position of the
critical region. For example, if H1 is µ1 ≠ µ2 it does not show the direction, and hence
the critical region falls on both ends of the sampling distribution. If H1 is µ1 < µ2 or
µ1 > µ2 the direction is known. In the first case the critical region falls on the left of
the distribution, whereas in the second case it falls on the right side.

One tailed test – When the critical region falls on one end of the sampling
distribution, it is called a one tailed test.

Two tailed test – When the critical region falls on both ends of the sampling
distribution, it is called a two tailed test.

For example, suppose the mean yield of a new paddy variety (µ2) is compared with
that of a ruling variety (µ1). Unless the new variety is more promising than the
ruling variety in terms of yield, we are not going to accept the new variety. In this
case H1: µ2 > µ1, for which a one tailed test is used. If both the varieties are new, our
interest will be to choose the better of the two. In this case H1: µ1 ≠ µ2, for which we
use a two tailed test.

Degrees of freedom:
The number of degrees of freedom is the number of observations that are
free to vary after certain restriction have been placed on the data. If there are n
observations in the sample, for each restriction imposed upon the original
observation the number of degrees of freedom is reduced by one.
The number of independent observations in a set is called as the degrees
of freedom.
Steps in testing of hypothesis:
The process of testing a hypothesis involves following steps.
1. Formulation of null & alternative hypothesis.
2. Specification of level of significance.
3. Selection of test statistic and its computation.
4. Finding out the critical value from tables using the level of
significance, sampling distribution and its degrees of freedom.
5. Determination of the significance of the test statistic.
6. Decision about the null hypothesis based on the significance of the test statistic.
7. Writing the conclusion in such a way that it answers the question on hand.

Student’s t test

Let x1, x2, …, xn be a random sample drawn from a normal distribution having mean µ
and standard deviation σ (unknown). Then Student's ‘t’ statistic is given by

t = |x̄ − µ| / (s / √n)

where x̄ = sample mean
s = sample standard deviation
µ = population mean

This follows the t distribution with (n-1) degrees of freedom, which can be written as t(n-1) d.f.
This result is due to W. S. Gosset, who published it in 1908 under the pen name
‘Student’; it was later developed and extended by Prof. R. A. Fisher. The test based on it
is known as the t-test.

Applications (or) uses


1. To test the single mean in single sample case.
2. To test the equality of two means in double sample case.
(i) Independent samples(Independent t test)
(ii) Dependent samples (Paired t test)
3. To test the significance of observed correlation coefficient.
4. To test the significance of observed partial correlation coefficient.
5. To test the significance of observed regression coefficient.

I) Student t Test for single Mean:


1. Set up the null hypothesis H0: x̄ = µ
(i.e.) There is no significant difference between the sample mean and the population mean.

2. Set up the alternative hypothesis H1: x̄ ≠ µ (or µ > x̄ or µ < x̄)
(i.e.) There is a significant difference between the sample mean and the population
mean.
3. Level of Significance
The level may be fixed at either 5% or 1%
4. Test statistic

$$ t = \frac{|\bar{x} - \mu|}{s/\sqrt{n}}, \qquad \text{where } \bar{x} = \frac{\sum x_i}{n} $$

which follows t distribution with (n-1) degrees of freedom.

5. Find the table value of t corresponding to (n-1) d.f. and the specified level of
significance.
6. Inference
If tcal < ttab we accept the null hypothesis H0. We conclude that there is no
significant difference between sample mean and population mean
(or) if tcal > ttab we reject the null hypothesis H0. (ie) we accept the alternative
hypothesis and conclude that there is significant difference between the sample mean
and the population mean.
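
The steps above can be sketched in Python. This is only an illustration, not part of the original notes: the function name `one_sample_t`, the yield figures and the hypothesised mean of 12 are made up, and s is assumed to be computed with divisor n - 1. The value 2.776 is the standard two-tailed 5% table value of t for 4 d.f.

```python
import math

def one_sample_t(sample, mu):
    """Return (t, d.f.) for the single-mean test, t = |mean - mu| / (s / sqrt(n))."""
    n = len(sample)
    mean = sum(sample) / n
    # Assumption: s^2 uses divisor n - 1 (the notes only say "sample standard deviation").
    s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)
    t = abs(mean - mu) / math.sqrt(s2 / n)
    return t, n - 1

# Hypothetical plot yields tested against a hypothesised population mean of 12.
yields = [12, 14, 15, 13, 16]
t_cal, df = one_sample_t(yields, mu=12)
t_tab = 2.776  # t table value for 4 d.f. at the 5% level (two-tailed)
print(f"t = {t_cal:.3f} with {df} d.f.")
print("reject H0" if t_cal > t_tab else "accept H0")
```

With these made-up figures t cal. is about 2.83, which exceeds the table value, so H0 would be rejected at the 5% level.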

II. Test for equality of two Means (Independent Samples)


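Given two sets of sample observations x11, x12, x13, ..., x1n1 and x21, x22, x23, ..., x2n2
of sizes n1 and n2 respectively from normal populations.
1. Using the F-test, test the equality of their variances.
(i) Variances are equal
H0: µ1 = µ2   H1: µ1 ≠ µ2 (or µ1 < µ2 or µ1 > µ2)

Test statistic

$$ t = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} $$

where x̄1 = mean of first sample
x̄2 = mean of second sample

and the combined variance is

$$ s^2 = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2} $$

where s1² = first sample variance and s2² = second sample variance, each computed with
divisor n1 and n2 respectively. (If the sample variances are computed with divisors n1 - 1
and n2 - 1, the numerator becomes (n1 - 1)s1² + (n2 - 1)s2².)

The test statistic t follows a t distribution with (n1 + n2 - 2) d.f.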
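
A minimal Python sketch of the equal-variance case, assuming the usual pooled form t = |x̄1 - x̄2| / √(s²(1/n1 + 1/n2)). The function name `pooled_t` and the two samples are hypothetical. The combined variance is built from the sums of squared deviations, which gives the same value whichever divisor is used for the individual sample variances.

```python
import math

def pooled_t(x1, x2):
    """Return (t, d.f.) for two independent samples with equal variances."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    # Sums of squared deviations; each equals n_i * s_i^2 when s_i^2 uses divisor n_i.
    ss1 = sum((v - m1) ** 2 for v in x1)
    ss2 = sum((v - m2) ** 2 for v in x2)
    s2 = (ss1 + ss2) / (n1 + n2 - 2)  # combined variance
    t = abs(m1 - m2) / math.sqrt(s2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

variety_a = [10, 12, 14]      # hypothetical yields
variety_b = [13, 15, 17, 19]
t_cal, df = pooled_t(variety_a, variety_b)
print(f"t = {t_cal:.3f} with {df} d.f.")
```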

(ii) Variances are unequal:

$$ t = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} $$

This statistic follows neither the t nor the normal distribution; it follows the Behrens-Fisher d
distribution. The Behrens-Fisher test is a laborious one. An alternative simple method
has been suggested by Cochran & Cox. In this method the critical value of t is altered
to tw (i.e.) weighted t:

$$ t_w = \frac{t_1 \frac{s_1^2}{n_1} + t_2 \frac{s_2^2}{n_2}}{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} $$

where t1 is the critical value of t with (n1-1) d.f. at the specified level of significance
and t2 is the critical value of t with (n2-1) d.f. at the specified level of significance.
Inference:
If t > tw, we reject the null hypothesis H0.
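
The Cochran & Cox procedure can be sketched as below. This is an illustration under stated assumptions: the weighted critical value is taken as tw = (t1·s1²/n1 + t2·s2²/n2) / (s1²/n1 + s2²/n2), the sample variances use divisor n - 1, and the function name and data are hypothetical. The critical values t1 and t2 are read from the t table and passed in by hand (4.303 and 3.182 are the 5% two-tailed values for 2 and 3 d.f.).

```python
import math
import statistics

def cochran_cox(x1, x2, t1, t2):
    """Return (t, tw): the unequal-variance statistic and the weighted critical value."""
    n1, n2 = len(x1), len(x2)
    # Assumption: s_i^2 are sample variances with divisor n_i - 1.
    w1 = statistics.variance(x1) / n1
    w2 = statistics.variance(x2) / n2
    t = abs(statistics.mean(x1) - statistics.mean(x2)) / math.sqrt(w1 + w2)
    tw = (t1 * w1 + t2 * w2) / (w1 + w2)
    return t, tw

sample_a = [10, 12, 14]      # hypothetical data
sample_b = [13, 15, 17, 19]
t_cal, t_w = cochran_cox(sample_a, sample_b, t1=4.303, t2=3.182)
print(f"t = {t_cal:.3f}, tw = {t_w:.3f}")
print("reject H0" if t_cal > t_w else "accept H0")
```

Because tw is a weighted average, it always lies between t1 and t2.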

III ) Equality of two means (Dependant samples) / paired samples


Paired t test
In the t-test for difference between two means, the two samples were
independent of each other. Let us now take particular situations where the samples are
not independent.
In agricultural experiments it may not be possible to get required number of
homogeneous experimental units. For example, required number of plots which are
similar in all characteristics may not be available. In such cases each plot may be
divided into two equal parts and one treatment is applied to one part and second
treatment to another part of the plot. The results of the experiment will result in two
correlated samples. In such situations we apply paired t test.
Suppose the observation before treatment is denoted by x and the observation
after treatment is denoted by y. For each experimental unit we get a pair of
observations (x, y). In case of n experimental units we get n pairs of observations:
(x1,y1), (x2,y2), ..., (xn,yn). In order to apply the paired t test we find the differences
(x1 - y1), (x2 - y2), ..., (xn - yn) and denote them as d1, d2, ..., dn. We then apply the t test
procedure for one sample to these differences, (i.e.)
Hypothesis: H0: µ1 = µ2 v/s H1: µ1 ≠ µ2 (or µ1 < µ2 or µ1 > µ2)
Test statistic:

$$ t = \frac{|\bar{d}|}{\sqrt{s^2/n}} = \frac{|\bar{d}|}{s/\sqrt{n}} $$

where d̄ = Σdi/n is the mean of the differences and s² = Σ(di - d̄)²/(n-1) is their variance.
The mean d̄ may be positive or negative. Hence we take the absolute value |d̄|.
The test statistic t follows t distribution with (n-1) d.f.
Inference:
If t cal. > t(n-1), we reject the null hypothesis H0.
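
A short Python sketch of the paired t test, treating the differences as a single sample. The function name `paired_t` and the before/after figures are hypothetical, and the variance of the differences is assumed to use divisor n - 1.

```python
import math

def paired_t(x, y):
    """Return (t, d.f.) for paired observations, t = |d_bar| / (s / sqrt(n))."""
    d = [a - b for a, b in zip(x, y)]  # differences d_i = x_i - y_i
    n = len(d)
    d_bar = sum(d) / n
    # Assumption: variance of the differences uses divisor n - 1.
    s2 = sum((v - d_bar) ** 2 for v in d) / (n - 1)
    t = abs(d_bar) / math.sqrt(s2 / n)
    return t, n - 1

before = [20, 22, 19, 24, 25]  # hypothetical half-plot yields, treatment 1
after = [22, 25, 20, 27, 26]   # hypothetical half-plot yields, treatment 2
t_cal, df = paired_t(before, after)
print(f"t = {t_cal:.3f} with {df} d.f.")
```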
------------------------------------------------------------------------------------------------------------------------------------------------------------------

F test for equality of variances:


If there are two independent random samples from normal populations, we may test
the hypothesis that the population variances σ1² and σ2² are the same.
H0: σ1² = σ2² v/s H1: σ1² ≠ σ2²

Let s1² and s2² be the estimates of σ1² and σ2² based on samples of size n1 and n2 respectively.

H0 can be tested by the F test as follows:

$$ \text{(i)}\ F = \frac{s_1^2}{s_2^2}\ \text{ if } s_1^2 > s_2^2 \qquad \text{or} \qquad \text{(ii)}\ F = \frac{s_2^2}{s_1^2}\ \text{ if } s_2^2 > s_1^2 $$

The greater variance is always placed in the numerator. The table value of F is read at the
specified level of significance (5%, 1% etc.) with the d.f. of the numerator first:
(n1 - 1, n2 - 1) d.f. in case (i) and (n2 - 1, n1 - 1) d.f. in case (ii).
Inference: If F cal. > F table value, we reject the null hypothesis H0.
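
The variance ratio can be sketched in Python as follows. The function name `variance_ratio` and the samples are hypothetical, and the variance estimates are assumed to use divisor n - 1 (as `statistics.variance` does). The function returns the d.f. in the order needed to read the F table.

```python
import statistics

def variance_ratio(x1, x2):
    """Return (F, numerator d.f., denominator d.f.) with the greater variance on top."""
    v1, v2 = statistics.variance(x1), statistics.variance(x2)
    if v1 >= v2:
        return v1 / v2, len(x1) - 1, len(x2) - 1
    return v2 / v1, len(x2) - 1, len(x1) - 1

sample_1 = [10, 12, 14]      # hypothetical data
sample_2 = [13, 15, 17, 19]
f_cal, df_num, df_den = variance_ratio(sample_1, sample_2)
print(f"F = {f_cal:.3f} with ({df_num}, {df_den}) d.f.")
```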
-----------------------------------------------------------------------------------------------------------------------------
Broad questions:
Q.1 Explain various t- tests and its applications. (2013-14), (2015-16), (2017-18),(2019-20)---10 M
Q.2 Write short note on ‘paired t-test’ ( 2014-15), 5 marks.
Q.3. What are steps of testing of hypothesis.? Explain t-test along with its application.(2016-17)--10 M
Q.4 Write short note on ‘F -test’ ( 2019-20), 5 marks.

Objective questions:
1. In paired t test with n observations in each group the degrees of
freedom is (a) n (b) n-1 (c) n-2 (d) n+1
Ans: n-1
2. Student t- test is applicable in case of small samples.
Ans: True
3. F test is also known as variance ratio test.
Ans: True
4. In case of comparing the equality of two variances the greater variance
should be taken in the numerator.
Ans: True
5. While comparing the means of two independent samples the variances of the
two samples will be always equal.
Ans: False
6. A hypothesis may be classified as
a) Simple b) Composite (c) Null (d) All the above
Ans: All the above
7. Area of the critical region depends on
a) Size of type I error b) Size of type II error
c) Value of the statistics d) Number of observations
Ans: Size of type I error
8. If the calculated test statistic is greater than the critical value, the null hypothesis is rejected.
Ans: True
9. The standard error of mean is given by σ/√n.
Ans: True
10. If the alternative hypothesis is µ1 ≠ µ2 then the test is known as one tailed test.

Ans: False

Lecture. 7 Chi-square ( χ2 ) Tests

In case of attributes we cannot employ parametric tests such as F and t. Instead
we apply the χ2 test, which is used when we want to test whether a set of observed values is in
agreement with those expected on the basis of some theory or hypothesis.

The Chi-square ( χ2 ) test has a number of applications. It is used to:

1. Test the goodness of fit.


2. Test the independence of attributes.

1. Chi-square ( χ2 ) test for goodness of fit :


A very powerful test for testing the significance of the discrepancy between theory and
experiment was given by Prof. Karl Pearson in 1900 and is known as Chi-square test of
goodness of fit.
If Oi (i = 1, 2, ..., n) is a set of observed (experimental) frequencies and Ei (i = 1, 2, ..., n)
is the corresponding set of expected (theoretical or hypothetical) frequencies, then

$$ \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} $$

It follows a χ2 distribution with n-1 d.f. In case of χ2 only one tailed test is used.

Conditions for the validity of χ2 test:


1. The sample observations should be independent.
2. Constraints on the cell frequencies, if any, should be linear, e.g. ∑Oi = ∑Ei.
3. N, the total frequency, should be reasonably large, say greater than 50.
4. No theoretical cell frequency should be less than 5. If any theoretical cell frequency is less than 5,
then for the application of the χ2 test it is pooled with the preceding or succeeding frequency so
that the pooled frequency is more than 5, and the degrees of freedom are adjusted for those lost in
pooling.
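
The goodness-of-fit statistic can be sketched in Python as below. The counts are hypothetical and the 9:3:3:1 ratio is only an example hypothesis; the function name `chi_square_gof` is not from the notes. The value 7.815 is the standard 5% table value of χ2 for 3 d.f. Every expected frequency here is at least 5, so no pooling is needed.

```python
def chi_square_gof(observed, expected):
    """Return (chi-square, d.f.) where chi-square = sum((O - E)^2 / E)."""
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1

observed = [90, 30, 28, 12]  # hypothetical counts in four classes, N = 160
total = sum(observed)
expected = [total * part / 16 for part in (9, 3, 3, 1)]  # 9:3:3:1 hypothesis
chi2_cal, df = chi_square_gof(observed, expected)
chi2_tab = 7.815  # chi-square table value for 3 d.f. at the 5% level
print(f"chi-square = {chi2_cal:.3f} with {df} d.f.")
print("reject H0" if chi2_cal > chi2_tab else "accept H0")
```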

---------------------------------------------------------------------------------------------------------------

2. Chi-square ( χ2 ) test of independence of attributes in 2x2 – contingency table

When the number of rows and number of columns are equal to 2 it is termed as 2 x2
contingency table . It will be in the following form.

                B1          B2          Row Total
A1              a           b           a+b = r1
A2              c           d           c+d = r2
Column Total    a+c = c1    b+d = c2    a+b+c+d = N

where a, b, c and d are cell frequencies, c1 and c2 are column totals, r1 and r2 are row totals and N
is the total number of observations.

In case of a 2 x 2 contingency table χ2 can be found directly using the short cut formula

$$ \chi^2 = \frac{N(ad - bc)^2}{c_1 c_2 r_1 r_2} $$

and the degrees of freedom for this statistic are (2-1) x (2-1) = 1.
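
A minimal Python sketch of the short cut formula χ2 = N(ad - bc)² / (c1 c2 r1 r2); the function name `chi_square_2x2` and the cell counts are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2 x 2 table by the short cut formula (1 d.f.)."""
    n = a + b + c + d
    r1, r2 = a + b, c + d  # row totals
    c1, c2 = a + c, b + d  # column totals
    return n * (a * d - b * c) ** 2 / (c1 * c2 * r1 * r2)

# Hypothetical table: rows A1, A2 and columns B1, B2.
chi2_cal = chi_square_2x2(20, 10, 10, 20)
print(f"chi-square = {chi2_cal:.3f} with 1 d.f.")
```

For this table every expected frequency is 15, so the long formula Σ(O - E)²/E gives 4 x 25/15, the same value.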

Yates' correction for continuity:

If any one of the cell frequencies is less than 5, we use Yates' correction to make χ2
continuous. Yates' correction is made by adding 0.5 to the least cell frequency and
adjusting the other cell frequencies so that the column and row totals remain the same.
Suppose the first cell frequency is to be corrected; then the contingency table will be as
follows:

                B1          B2          Row Total
A1              a + 0.5     b - 0.5     a+b = r1
A2              c - 0.5     d + 0.5     c+d = r2
Column Total    a+c = c1    b+d = c2    a+b+c+d = N

Then the χ2 test statistic becomes

$$ \chi^2 = \frac{N\left(|ad - bc| - \frac{N}{2}\right)^2}{c_1 c_2 r_1 r_2} $$

and the degrees of freedom (d.f.) for this χ2 statistic are (2-1) x (2-1) = 1.
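
The corrected statistic χ2 = N(|ad - bc| - N/2)² / (c1 c2 r1 r2) can be sketched in Python as follows. The function name and the cell counts (which include frequencies below 5) are hypothetical.

```python
def chi_square_2x2_yates(a, b, c, d):
    """Chi-square for a 2 x 2 table with Yates' correction for continuity (1 d.f.)."""
    n = a + b + c + d
    r1, r2 = a + b, c + d  # row totals
    c1, c2 = a + c, b + d  # column totals
    return n * (abs(a * d - b * c) - n / 2) ** 2 / (c1 * c2 * r1 * r2)

# Hypothetical table with small cell frequencies.
corrected = chi_square_2x2_yates(3, 7, 8, 2)
uncorrected = 20 * (3 * 2 - 7 * 8) ** 2 / (11 * 9 * 10 * 10)
print(f"corrected = {corrected:.3f}, uncorrected = {uncorrected:.3f}")
```

The correction reduces the value of χ2, which makes the test more conservative when cell frequencies are small.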



Broad questions:
Q.1 χ2 test for testing the independences of attributes. ( Short note- 5 marks) (2013-14, 2017-18)
Q.2 Explain χ2 test and its applications. ( 10 marks) ( 2014-15)
Q.3 χ2 test for testing the goodness of fit. ( Short note- 5 marks) (2015-16)
Q.4 χ2 test ( Short note- 5 marks) (2016-17)

Objective questions:

1. The calculated value of χ 2 is


a) always positive b) always negative
c) can be either positive or negative d) none of these
Ans: always positive
2. Degrees of freedom for Chi-square in case of contingency table of order (4 ×3) are
(a) 12 (b) 9 (c) 8 (d) 6
Ans: 6

3. The condition for application of χ2 test is that no cell frequency should be less than five.
Ans: True

4. If observed and expected frequencies completely coincide χ2 will be zero.
Ans: True

Lecture.9
Correlation
Correlation is the study of the relationship between two or more variables. Whenever
we conduct any experiment we gather information on two or more related variables. When there are
two related variables their joint distribution is known as bi-variate normal distribution and if
there are more than two variables their joint distribution is known as multivariate normal
distribution.
In case of bi-variate or multivariate normal distribution, we are interested in
discovering and measuring the magnitude and direction of relationship between 2 or more
variables. For this we use the tool known as correlation.
Suppose we have two continuous variables X and Y and if the change in X affects Y,
the variables are said to be correlated. In other words, the systematic relationship between the
variables is termed as correlation.
When only 2 variables are involved the correlation is known as simple correlation
and when more than 2 variables are involved the correlation is known as multiple
correlation.
When the variables move in the same direction, these variables are said to be
positively correlated and if they move in the opposite direction they are said to be
negatively correlated.
Types of correlation :
1. Positive 2. Negative 3. Simple 4. Multiple
5. Partial 6. Linear 7. Non-linear
Scatter Diagram:
To investigate whether there is any relation between the variables X and Y we use
scatter diagram. Let (x1,y1), (x2,y2)….(xn,yn) be n pairs of observations. If the variables X and
Y are plotted along the X-axis and Y-axis respectively in the x-y plane of a graph sheet the
resultant diagram of dots is known as scatter diagram. From the scatter diagram we can say
whether there is any correlation between x and y and whether it is positive or negative or
the correlation is linear or curvilinear.
[Figure: scatter diagrams showing positive correlation, negative correlation, curvilinear (non-linear) correlation and no correlation]

Pearson’s Correlation coefficient


The measure of the degree of relationship between two continuous variables is
called the correlation coefficient. It is denoted by r (in case of a sample) and ρ (in case of a
population). The correlation coefficient r is known as Pearson's correlation coefficient as
it was developed by Karl Pearson. It is also called the product moment correlation.
The correlation coefficient r is given as the ratio of the covariance of the variables x and
y to the product of the standard deviations of x and y.

$$ r = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y} $$

Symbolically,

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} $$

which can be simplified as

$$ r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\sum x^2 - \frac{(\sum x)^2}{n}}\,\sqrt{\sum y^2 - \frac{(\sum y)^2}{n}}} $$

This correlation coefficient r is known as Pearson's correlation coefficient. The
numerator is termed the sum of products of x and y, abbreviated as SP(XY). In the
denominator the first term is the square root of the sum of squares of X, i.e. SS(X), and the
second term is the square root of the sum of squares of Y, i.e. SS(Y), so that
r = SP(XY) / √(SS(X) SS(Y)).

The denominator in the above formula is always positive. The numerator may be
positive or negative making r to be either positive or negative.
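
The computational form r = SP(XY) / √(SS(X) SS(Y)) can be sketched in Python as below; the function name `pearson_r` and the five pairs of observations are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson's r from the sum of products and the sums of squares."""
    n = len(x)
    sp_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_x = sum(a * a for a in x) - sum(x) ** 2 / n
    ss_y = sum(b * b for b in y) - sum(y) ** 2 / n
    return sp_xy / math.sqrt(ss_x * ss_y)

x = [1, 2, 3, 4, 5]  # hypothetical observations
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

For these figures SP(XY) = 6, SS(X) = 10 and SS(Y) = 6, so r = 6/√60, a positive correlation.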
Assumptions in correlation analysis:
Correlation coefficient r is used under certain assumptions,
1. The variables under study are continuous random variables and they are normally distributed.
2. The relationship between the variables is linear.
3. Each pair of observations is unconnected with other pair(independent).

Properties
1. The correlation coefficient r value ranges between –1 and +1.
2. Correlation coefficient is a pure number i.e it has no unit.
3. The correlation coefficient is not affected by change of origin or scale or both.
4. The relationship between correlation coefficient r and regression coefficients byx and bxy is
r = √𝒃𝒚𝒙 𝒃𝒙𝒚
5. If r > 0 it denotes positive correlation
r < 0 it denotes negative correlation between the two variables x and y.
6. If the two variables are independent , the correlation coefficient between them is 0.(zero)
7. r = +1 then the correlation is perfect positive
r = -1 then the correlation is perfect negative.

Testing the significance of r


The significance of r can be tested by Student's t test. The test statistic is given by

$$ t = \frac{|r|}{\sqrt{\frac{1 - r^2}{n - 2}}} $$

This t has Student's t distribution with (n-2) degrees of freedom.
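
A small Python sketch of this test, using t = |r| / √((1 - r²)/(n - 2)). The function name `t_for_r` and the values r = 0.8, n = 10 are hypothetical; 2.306 is the standard two-tailed 5% table value of t for 8 d.f.

```python
import math

def t_for_r(r, n):
    """Return (t, d.f.) for testing the significance of a correlation coefficient."""
    t = abs(r) / math.sqrt((1 - r ** 2) / (n - 2))
    return t, n - 2

t_cal, df = t_for_r(r=0.8, n=10)  # hypothetical r from 10 pairs
t_tab = 2.306  # t table value for 8 d.f. at the 5% level (two-tailed)
print(f"t = {t_cal:.3f} with {df} d.f.")
print("r is significant" if t_cal > t_tab else "r is not significant")
```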


The relationship between the variables is interpreted by the square of the correlation
coefficient (r2) which is called the coefficient of determination. The value 1-r2 is called the
coefficient of non-determination, and its square root √(1-r2) the coefficient of alienation.
If r2 is 0.72, it implies that on the basis of the sample 72% of the
variation in one variable is explained by the variation of the other variable.
-------------------------------------------------------------------------------------------------------
Rank Correlation:
When the variables are not normally distributed, the linear correlation coefficient procedure is not
applicable and we have to use rank correlation. There are two methods of computing rank
correlation, one proposed by Spearman and another by Kendall. Spearman's rank
correlation procedure starts with ranking the values of X and Y
separately. The difference (d) between the ranks of each of the n pairs is found. Then
Spearman's rank correlation, denoted by ρ or rs, is computed by the formula

$$ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \qquad \text{where } d_i = \mathrm{Rank}(X_i) - \mathrm{Rank}(Y_i) $$

In case of tied ranks it is given by

$$ \rho = 1 - \frac{6}{n(n^2 - 1)} \left[ \sum d_i^2 + \frac{1}{12} \sum (t^3 - t) \right] $$

where t = the number of ties at a given rank for variables x and y.

The value of ρ ranges from -1 to +1.
Testing the significance of ρ:
The significance of ρ can be tested by Student's t test. The test statistic is given by

$$ t = \frac{|\rho|}{\sqrt{\frac{1 - \rho^2}{n - 2}}} $$

which follows Student's t distribution with (n-2) degrees of freedom.
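
Spearman's formula for untied ranks, ρ = 1 - 6Σd²/(n(n² - 1)), can be sketched in Python as below. The helper names `ranks` and `spearman_rho` and the data are hypothetical, and the sketch assumes there are no tied values (the tie correction is not implemented).

```python
def ranks(values):
    """Rank values from 1 (smallest) to n. Assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """Spearman's rank correlation for data without ties."""
    n = len(x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

x = [10, 20, 30, 40, 50]  # hypothetical scores
y = [1, 3, 2, 5, 4]
rho = spearman_rho(x, y)
print(f"rho = {rho:.2f}")
```

Here the rank differences are 0, -1, 1, -1, 1, so Σd² = 4 and ρ = 1 - 24/120.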
--------------------------------------------------------------------------------------------------------------
Lecture.10
Regression
The term ‘regression’ literally means “stepping back towards the average”. It was
first used by a British biometrician Sir Francis Galton.
Definition-1: Regression is the mathematical measure of the average relationship
between two or more variables in terms of the original units of the data.
Definition-2 : The relationship between the dependent and the independent variable may be
expressed as a function and such functional relationship is termed as regression.
In case of two variables one may represent cause and the other may represent
effect. The variable representing cause is known as independent variable and is denoted
by X also known as predictor variable or regressor.
The variable representing effect is known as dependent variable and is denoted by
Y also known as predicted variable.
When there are only two variables the functional relationship is known as simple
regression and if the relation between the two variables is a straight line it is known as simple
linear regression.
When there are more than two variables and one of the variables is dependent upon others,
the functional relationship is known as multiple regression.
The regression equation is of the form y = a + bx where a is constant or intercept and b
is the regression coefficient or the slope. The values of ‘a’ and ‘b’ can be calculated by using
the method of least squares.
The regression coefficient of y on x is given by

$$ b = \frac{SP(xy)}{SS(x)} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} $$

and then putting the value of b in the equation ȳ = a + b x̄ we get the value of a, i.e. a = ȳ - b x̄.
The regression line indicates the average value of the dependent variable Y associated with a
particular value of independent variable X.
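
The least squares estimates b = SP(xy)/SS(x) and a = ȳ - b x̄ can be sketched in Python as below. The function name `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Return (a, b) for the fitted line y = a + b x by least squares."""
    n = len(x)
    sp_xy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n
    ss_x = sum(u * u for u in x) - sum(x) ** 2 / n
    b = sp_xy / ss_x                  # regression coefficient (slope)
    a = sum(y) / n - b * sum(x) / n   # intercept, from y_bar = a + b * x_bar
    return a, b

dose = [1, 2, 3, 4, 5]      # hypothetical values of X
response = [2, 4, 5, 4, 5]  # hypothetical values of Y
a, b = fit_line(dose, response)
print(f"y = {a:.2f} + {b:.2f} x")
```

The fitted line can then be used to estimate the average Y for a given X, for example a + b * 6 for X = 6.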
Assumptions
1. The x’s are non-random or fixed constants
2. At each fixed value of X the corresponding values of Y have a normal
distribution about a mean.
3. For any given x, the variance of Y is same.
4. The values of y observed at different levels of x are completely independent.
Properties of Regression coefficients

1. The range of a regression coefficient is -∞ to +∞.

2. The correlation coefficient is the geometric mean of the two regression
coefficients, i.e. r = √(byx bxy), taken with the common sign of the two regression
coefficients. This is known as the fundamental property.
3. Regression coefficients are independent of change of origin but not of scale.
4. If one regression coefficient is greater than unity, then the other must be less than
unity. This is known as the magnitude property.
5. The signs of the regression coefficients and the correlation coefficient are always the same.
This is known as the signature property.

Testing the significance of regression co-efficient:


To test the significance of the regression coefficient we can apply either a t test or
analysis of variance (F test). The ANOVA table for testing the regression
coefficient will be as follows:

Sources of variation          d.f.    SS               MSS     F
Due to regression             1       SS(b)            Sb²     F = Sb² / Se²
Deviation from regression     n-2     SS(y) - SS(b)    Se²
Total                         n-1     SS(y)

In case of t test the test statistic is given by

$$ t = \frac{b}{SE(b)} \sim t \text{ with } (n-2) \text{ d.f.}, \qquad SE(b) = \sqrt{\frac{S_e^2}{SS(X)}} $$
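
Both tests can be sketched in Python as below. The notes do not define SS(b); the sketch assumes the usual expression SS(b) = b · SP(xy) = SP(xy)²/SS(x). The function name and data are hypothetical.

```python
import math

def regression_tests(x, y):
    """Return (F, t) for testing the regression coefficient of y on x."""
    n = len(x)
    sp_xy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n
    ss_x = sum(u * u for u in x) - sum(x) ** 2 / n
    ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
    b = sp_xy / ss_x
    ss_b = b * sp_xy                 # assumption: SS(b) = b * SP(xy)
    se2 = (ss_y - ss_b) / (n - 2)    # mean square for deviation from regression
    f = ss_b / se2                   # regression has 1 d.f., so Sb^2 = SS(b)
    t = b / math.sqrt(se2 / ss_x)    # t = b / SE(b)
    return f, t

f_cal, t_cal = regression_tests([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"F = {f_cal:.3f} with (1, 3) d.f., t = {t_cal:.3f} with 3 d.f.")
```

With one d.f. for regression the two tests are equivalent, since F equals t².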

Uses of Regression
1. The regression analysis is useful in predicting the value of one variable from the
given value of another variable. Such predictions are useful when it is very
difficult or expensive to measure the dependent variable, Y.

2. The other use of the regression analysis is to find out the causal relationship
between variables. Suppose we manipulate the variable X and obtain a significant
regression of variables Y on the variable X. Thus we can say that there is a causal
relationship between the variable X and Y. The causal relationship between
nitrogen content of soil and growth rate in a plant, or the dose of an insecticide and
mortality of the insect population may be established in this way.
Comparison (difference) between Correlation and Regression

     Correlation                                   Regression
1    Correlation study is concerned with the       Regression study is concerned with the
     relationship between variables.               average relationship between variables.
2    Correlation indicates the extent of           Regression analysis helps in estimating or
     relationship between variables.               predicting the value of the dependent variable.
3    Cause and effect relationship may not be      The cause and effect relationship is clearly
     identified separately.                        indicated through the regression equation.
4    The correlation coefficients are equal or     The regression coefficients are different, i.e.
     symmetric, i.e. ryx = rxy.                    byx ≠ bxy.
5    Range: it lies between -1 and +1.             Range: it lies between -∞ and +∞.
6    It is a unit-free measure.                    It is not a unit-free measure.

Lines of regression –
If the variables in a bivariate distribution are related, the points in the scatter
diagram will cluster round some curve called the curve of regression. If the curve is a straight
line, it is called the line of regression and the regression between the variables is said to be linear;
otherwise the regression is said to be curvilinear.
The lines of regression of Y on X and X on Y
Since byx is the slope of the line of regression of Y on X, the equation of the line of
regression of Y on X is given by

$$ Y - \bar{y} = b_{yx}(X - \bar{x}), \qquad \text{where } b_{yx} = r\,\frac{\sigma_y}{\sigma_x} $$

The equation of the line of regression of X on Y is given by

$$ X - \bar{x} = b_{xy}(Y - \bar{y}), \qquad \text{where } b_{xy} = r\,\frac{\sigma_x}{\sigma_y} $$

1. If the angle between the two lines of regression is 90°, then the correlation coefficient r = 0.
2. If the angle between the two lines of regression is 0° or 180°, then the correlation coefficient r = ±1.
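
The two regression coefficients and their link to r can be sketched in Python as below. The function name `regression_lines` and the data are hypothetical; r is recovered as the geometric mean of byx and bxy, carrying their common sign.

```python
import math

def regression_lines(x, y):
    """Return (byx, bxy, x_bar, y_bar) for the two lines of regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sp_xy = sum((u - x_bar) * (v - y_bar) for u, v in zip(x, y))
    ss_x = sum((u - x_bar) ** 2 for u in x)
    ss_y = sum((v - y_bar) ** 2 for v in y)
    return sp_xy / ss_x, sp_xy / ss_y, x_bar, y_bar

byx, bxy, x_bar, y_bar = regression_lines([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
# r is the geometric mean of the two coefficients, with their common sign.
r = math.copysign(math.sqrt(byx * bxy), byx)
# Estimate of Y at X = 6 from the line of Y on X: Y - y_bar = byx (X - x_bar).
y_at_6 = y_bar + byx * (6 - x_bar)
print(f"byx = {byx:.2f}, bxy = {bxy:.2f}, r = {r:.4f}, Y at X = 6: {y_at_6:.2f}")
```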
Broad Questions:

Q.1 Describe the linear regression and derive regression coefficients with its application.
(2013-14, 2015-16)
Q.2 Describe the linear regression and derive regression coefficients. State the properties of
regression coefficient and its applications. ( 2014-15)
Q.3 Describe the simple linear regression and derive regression coefficients. State the
properties of regression coefficient and its applications. ( 2016-17)
Q.4 Write a short note on ‘Non linear regression’ (2016-17, 2019-20, 5 marks)
Q.5 Describe the linear regression and derive formula for constant and regression
coefficients. (2017-18)
Q.6 Describe the multiple linear regression equation along with its derivation .(2019-20)
--------------------------------------------------------------------------------------------------------------
Objective questions:

1. If the correlation coefficient r = +1, then the two regression lines
a) are perpendicular to each other b) coincide
c) are parallel to each other d) none of these
Ans: coincide
2. If one regression coefficient is greater than unity then the other must be
a) greater than unity b) equal to unity
c) less than unity d) none of these
Ans: less than unity
3. If the correlation between the two variables is positive the regression coefficient will be
positive.
Ans: True
4. The dependent variable is also called as predicted variable.
Ans: True
5. Correlation coefficient is the geometric mean of two regression coefficients.
Ans: True
6. Regression gives the functional relationship between two variables.
Ans: True
