
Business Statistics Fundamentals Explained

Uploaded by

farzanafaiz222
Copyright
© All Rights Reserved

Business Statistics and Research methods

Fundamentals of statistics

Meaning of Statistics
In the plural sense, statistics refers to numerical facts. In the singular sense, statistics refers to
the body of theories, methods, and principles used to collect, analyse, and interpret such facts.

Basic Words of Statistics

 Population
A population is a collection of all elements under statistical investigation about which we are
trying to draw some conclusion.
 Sample
A representative part of the population selected for analysis.
 Parameter
A numerical/quantitative measure that describes the characteristics of a population.
 Statistic
A numerical/quantitative measure that describes the characteristics of a sample.
 Variable
A variable is any value or characteristic that can change or vary from one person to
another or from one situation to another.
 Measurement
The process of assigning numbers to objects according to a set of rules.

Scales
A scale is a tool or mechanism by which individuals are distinguished as to how they
differ from one another on the variables of interest of study.

Scales of measurement
 Data can be measured on one of four scales of measurement.
 Harvard psychologist S. S. Stevens coined the terms nominal, ordinal, interval, and
ratio scales.

Levels of Measurement

Nominal: Classification
Ordinal: Classification, Order
Interval: Classification, Order, Distance
Ratio: Classification, Order, Distance, Natural Origin

Nominal Scale
 It is the most elementary form of measurement.
 A nominal scale is one that allows the researcher to assign subjects to certain
categories or groups.
 Numbers are assigned to objects, where different numbers indicate different objects.
 The numbers have no real ‘meaning’ other than differentiating between objects.
 It is the least powerful of the four scales.
 Mode is the only measure of central tendency.
 The chi-square test is used for testing.
E.g.: Gender (male, female), religion.

Ordinal Scale
 An ordinal scale not only categorizes the variables in such a way as to denote differences
among the various categories, it also rank-orders the categories in some meaningful way.
 Numbers are assigned to objects (as in the nominal scale), but here the numbers also have a meaningful order.
 The number indicates placement or order: first is better than second, and so on.
 An ordinal scale does not give any indication of the magnitude of the difference among the
ranks.
 Median is the most popularly used average, and quartile deviation is the best measure of dispersion.
E.g.: Students' letter grading, preference by rank.

Interval Scale
 It has the characteristics of the nominal and ordinal scales plus the concept of equality of interval.
 An interval scale allows us to perform certain mathematical operations on the data
collected from the respondents.
 Numbers have order (as in the ordinal scale), but there are also equal intervals between adjacent
categories.
 There is equal distance between numbers.
 The origin can be an arbitrary number.
 Arithmetic mean is the most appropriate average, and standard deviation is the most appropriate
measure of dispersion.
E.g.: Temperature in Fahrenheit, time series.

Ratio Scale
 The ratio scale overcomes the disadvantage of the arbitrary origin point of the interval scale,
in that it has an absolute zero point, which is a meaningful measurement point.
 It is the most powerful of the four scales.
 The ratio scale has all the properties of the nominal, ordinal, and interval scales.
 Geometric mean and harmonic mean can be used as measures of central tendency, and variance
can be used as a measure of dispersion.
E.g.: Height, weight, etc.
Scales of Measurement
 Scales of measurement refer to how the properties of numbers can change with different uses.
All scales of measurement are characterized by three properties: order, distance, and origin.
1. Order: - Does a larger number indicate a greater value than a smaller number?
2. Distance: - Does subtracting two numbers represent some meaningful value?
3. Zero Origin: - Does dividing (taking the ratio of) two numbers represent some meaningful
value?

Scale of Measurement

Property      Nominal   Ordinal   Interval   Ratio

Order         No        Yes       Yes        Yes
Distance      No        No        Yes        Yes
Zero Origin   No        No        No         Yes

• Whenever it is possible to use a more powerful scale rather than a less powerful one, it is wise to do so.

Types of variables
• Dependent or Independent variable
• Continuous or Discrete variable
• Qualitative or Quantitative variable
Dependent or Independent variables
• A dependent variable is one that depends on, or is a consequence of, another variable; that other
variable is known as the independent variable. Independent variables have a direct effect on the
dependent variable. The dependent variable is the explained variable.
• Independent variables are those variables which are manipulated by the researcher to ascertain
their relationship to an observed phenomenon. The independent variable is the explanatory variable.

E.g.: Sales (dependent) and advertisement (independent).

Continuous and Discrete variables

• A continuous variable is measured along a continuum at any place beyond the decimal point,
meaning that it can be measured in whole units or fractional units.
E.g.: Height and weight
• A discrete variable is measured in whole units or categories.
Quantitative and Qualitative Variables
• A quantitative variable varies by amount. It is measured as a numeric value and is often
collected by measuring or counting. So both discrete and continuous variables can
be quantitative.
• A qualitative variable varies by class. It is often a category or label for the
behaviors and events researchers observe, and so describes non-numeric aspects of phenomena. So
only discrete variables can be categorized as qualitative.

Variable                                Continuous vs Discrete  Qualitative vs Quantitative  Scale of Measurement

Gender (male, female)                   Discrete                Qualitative                  Nominal
Seasons (spring, summer, winter)        Discrete                Qualitative                  Nominal
Number of errors                        Discrete                Quantitative                 Ratio
Duration of drug abuse (in years)       Continuous              Quantitative                 Ratio
Ranking of favorite foods               Discrete                Quantitative                 Ordinal
Ratings of satisfaction (1 to 7)        Discrete                Quantitative                 Interval
Number of students in a class           Discrete                Quantitative                 Ratio
Temperature (degrees Fahrenheit)        Continuous              Quantitative                 Interval
Time (in seconds) to memorize a list    Continuous              Quantitative                 Ratio
Number of heart beats per minute        Discrete                Quantitative                 Ratio
Weight (in kilograms) of infants        Continuous              Quantitative                 Ratio
A letter grade (A, B, C, D)             Discrete                Qualitative                  Ordinal
Score (from 0 to 100%) on an exam       Continuous              Quantitative                 Ratio
Type of distraction (auditory, visual)  Discrete                Qualitative                  Nominal

Branches of Statistics

Descriptive statistics
 The branch of statistics that focuses on collecting, summarizing, and presenting a set of data. It
describes the characteristics of the data.

 Three different analysis techniques are used:
1. Univariate analysis

2. Bivariate analysis

3. Multivariate analysis

1. Univariate Analysis

Univariate analysis is the most basic form of statistical data analysis. When the data
contain only one variable and the analysis does not deal with cause-and-effect relationships, a univariate
analysis technique is used.

Statistical Techniques to conduct univariate analysis

• Measures of central tendency

• Measures of Dispersion, Skewness and Kurtosis

• Histograms, Frequency polygon

• Pie charts and Bar charts

2. Bivariate analysis

Bivariate analysis is slightly more analytical than univariate analysis. When the data set
contains two variables and researchers aim to compare or relate the two,
bivariate analysis is the right type of analysis technique.

Statistical Techniques to conduct Bivariate analysis

• Correlation coefficients

• Regression analysis

3. Multivariate analysis
Multivariate analysis is a more complex form of statistical analysis and is used
when there are more than two variables in the data set.

Statistical Techniques to conduct Multivariate analysis

• Factor analysis

• Variance analysis

• Multiple Regression analysis

 Inferential statistics

The branch of statistics that analyses sample data to draw conclusions about a population. It involves
making forecasts, estimates, or judgments about the population. It is further classified into:

• Estimation

• Hypothesis testing

 Estimation

• Statistical estimation is concerned with the methods by which population characteristics are
estimated from sample information.

• Statistical estimation procedures provide us with the means of obtaining estimates of
population parameters with a desired degree of precision.

 Hypothesis testing

• A statistical hypothesis is some assumption or statement which may or may not be true, about a
population or equivalently about the probability distribution characterizing the given
population which we want to test on the basis of the evidence from a random sample.

Division of Statistics

1. Descriptive statistics
Meaning: It deals with numerical data.
Techniques used: Averages, dispersion, skewness, etc.

2. Applied statistics
Meaning: It covers those methods and techniques which are applied to specific problems of real-life situations.
Techniques used: Sample survey, quality control, index numbers, etc.

3. Inferential statistics
Meaning: It covers those methods and techniques which are used to draw conclusions and inferences about parameters of a population on the basis of estimates derived from a sample.
Techniques used: Chi-square test, F-test, etc.

4. Inductive statistics
Meaning: It covers those methods and techniques which are used to estimate a particular phenomenon on the basis of random observation.
Techniques used: Forecasting techniques like extrapolation.

5. Mathematical statistics
Meaning: It is concerned with the application of various mathematical theories and techniques to develop statistical techniques.
Techniques used: Integration, differentiation, trigonometry, matrix, etc.

6. Analytical statistics
Meaning: It covers those methods and techniques which are used in establishing functional relationships between variables.
Techniques used: Correlation, regression, association of attributes.

Collection of Data
The information collected from various sources which can be expressed in quantitative form
for specific purpose is called Data.

• Types of Data

1) Primary Data
Primary data are those collected by the investigator himself for the first time,
and thus they are original in character.
i. Direct personal investigation.
ii. Indirect oral investigation.
iii. By schedules and Questionnaires.
iv. Local correspondents.
2) Secondary Data

Secondary data are those which have already been collected by some other
agency and which have already been processed. They may be published or
unpublished. The source is important when collecting such data.
Census Method and Sample Survey

• Survey is a process that uses questionnaire or similar means to gather values for the responses
from a set of participants.
• Census is a count of all elements in a population

Sampling

• The basic idea of sampling is that by selecting some of the elements in a population, we may
draw conclusions about the entire population.
• A population is the total collection of elements about which we wish to make some inferences.
• A population element is the subject on which the measurement is taken. It is the unit of study.

Why Samples

• Availability of elements
• Greater speed
• Lower cost
• Greater Accuracy

Principles of Sampling

1) Law of statistical regularity


2) Principle of inertia of large number
3) Principle of persistence of small numbers
4) Principle of validity
5) Principle of optimization

1) Law of statistical regularity: - According to this law, a group of objects chosen at random
from a larger group tends to possess the characteristics of that larger group.
2) Principle of inertia of large numbers: - It states that as the sample size increases, the results
tend to be more reliable and accurate, other things being constant.
3) Principle of persistence of small numbers: - According to this principle, if some of the items
in a population possess markedly distinct characteristics from the remaining items, this
tendency will be revealed in the sample values as well. The tendency persists even if the
population size is increased or a large sample is drawn.
4) Principle of validity: - A sample design is termed valid if it enables us to obtain valid tests
and estimates about the population parameters.
5) Principle of optimization: - This principle stresses the need to obtain optimum results in
terms of the efficiency and cost of the sample design with the resources available at our disposal.
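
The principle of inertia of large numbers can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population (the integers 1 to 10,000), the sample sizes, the number of repetitions, and the function name spread_of_sample_means are all hypothetical choices, not part of these notes. It draws many random samples of a given size and reports how much the sample means vary from one sample to the next.

```python
import random
import statistics

def spread_of_sample_means(population, sample_size, repetitions=500, seed=0):
    """Standard deviation of the means of repeated simple random samples."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(repetitions)
    ]
    return statistics.pstdev(means)

# Hypothetical population: the integers 1..10_000
population = list(range(1, 10_001))

small = spread_of_sample_means(population, sample_size=10)
large = spread_of_sample_means(population, sample_size=1000)

# The means of the larger samples are expected to vary far less
print(small, large)
```

With other things held constant, the spread of the sample means for the larger samples should come out much smaller, which is what "more reliable" means in this principle.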

What is a Good (Valid) Sample?

• The ultimate test of a sample design is how well it represents the characteristics of the
population it purports to represent.
• In measurement terms, a good sample is valid, and the validity of a sample depends on two
considerations: accuracy and precision.

Accuracy

• It is the degree to which bias is absent from the sample. An accurate (unbiased) sample is one
in which the under estimators and over estimators are balanced among the members of sample.
There is no systematic variation with an accurate sample.
• When the sample is drawn properly, the measures of behavior, attitude, or knowledge of some
sample elements will be less than the measures of those same variables drawn from the
population. The measures of other elements will be more than the population values. Variations
in these sample values offset each other, resulting in a sample value that is close to the
population value. For these offsetting effects to occur, there must be enough elements in the
sample and they must be drawn in a way that favors neither overestimation nor
underestimation. Increasing the sample size can reduce systematic variance as a cause of error.
Systematic variance is a variation that causes measurements to skew in one direction or
another.

Precision

• Precision of estimate is the second criterion of a good sample design. The numerical
descriptors that describe samples may be expected to differ from those that describe
populations because of random fluctuation inherent in the sampling process. This is called
sampling error and reflects the influence of chance in drawing the sample members. Sampling
error is what is left after all known sources of systematic variance have been accounted for.
• Precision is measured by the standard error of estimate, a type of standard deviation
measurement. The smaller the standard error of the estimate, the higher is the precision of the
sample.

Sample Design

• A sample design is a definite plan for obtaining a sample from a given population.
• It refers to the procedure, adopted by researcher for selecting items for a sample.

Sample Design Process

• Define the target population


• Determine the sampling frame
• Select sampling techniques
• Determine the sample size
• Execute the sampling process

Define the Target Population

• The target population is the collection of elements or objects that possess the information
sought by the researcher and about which inferences are to be made.
• The target population should be defined in terms of elements, sampling units, extent, and
time.
• An Element is an object about which or from which the information is desired, e.g., the
respondent.
• A Sampling unit is an element, or a unit containing the element, that is available for selection
at some stage of the sampling process.
• Extent refers to the geographical boundaries.
• Time is the time period under consideration.

Sampling Frame

• A sampling frame is a list or quasi-list of the units composing a population, from which the sample
is actually drawn.
• The sampling frame is also called working population because it provides the list that can be
worked with operationally.

Sampling Units

• A sampling unit is a single element or group of elements subject to selection in the sample.
1. Primary Sampling Units (PSUs): - A unit selected in the first stage of sampling.
2. Secondary Sampling Units: - A unit selected in the second stage of sampling.
3. Tertiary Sampling Units: - A unit selected in the third stage of sampling.
Random Sampling Error/ Non Systematic Error

• The difference between the sample results and the result of a census conducted using identical
procedures.
• A statistical fluctuation that occurs because of chance variation in the elements for a sample.

Systematic Error/ Non Random Sampling Error

• Errors resulting from some imperfection of the research design that causes response error or
from mistake in the execution of the research; error that comes from such sources as sample
bias, mistakes in recording responses, and non-responses from persons who were not contacted
or who refuse to participate.
• These errors are not due to chance fluctuations.

Standard Error

• Standard Error is a measure of the statistical accuracy of an estimate, equal to the standard
deviation of the theoretical distribution of a large population of such estimates.
• Standard Error of mean is the standard deviation of sampling distribution of mean.
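
In practice the standard error of the mean is usually estimated from a single sample as s / sqrt(n), where s is the sample standard deviation and n the sample size. The following is a minimal sketch of that estimate; the function name and the measurements are hypothetical and serve only as an illustration.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Estimate the standard error of the mean as s / sqrt(n)."""
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    return s / math.sqrt(len(sample))

# Hypothetical measurements: mean 13, sample standard deviation 2
sample = [12, 15, 11, 14, 13, 16, 10, 13]
se = standard_error_of_mean(sample)
print(round(se, 4))  # 2 / sqrt(8), about 0.7071
```

A smaller value indicates a more precise estimate of the population mean.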

Types of Sample Designs

• Probability Sampling: -
The general term for samples selected in accord with probability theory.
Probability sampling is based on the concept of random selection – a controlled procedure
that assures that each population element is given a known nonzero chance of selection.
• Non-Probability Sampling: -
Any technique in which samples are selected in some way not suggested
by probability theory. Each member does not have a known nonzero chance of being included.

Non- Probability Samples

• With a subjective approach like non-probability sampling, the probability of selecting
population elements is unknown.
• There is a greater opportunity for bias to enter the sample and distort findings. We cannot
estimate any range within which to expect the population parameter.
• Despite these disadvantages, there are practical reasons to use non-probability samples.

Convenience Sampling
• Convenience sampling attempts to obtain a sample of convenient elements. Often, respondents
are selected because they happen to be in the right place at the right time.
• Use of students and members of social organizations.
• People-on-the-street interviews.

Strength of Convenience Sampling

• Least Expensive.
• Least time consuming.
• Most convenient.

Weaknesses

• Selection bias.
• Sample not representative.
• Not recommended for descriptive or Causal research.

Judgmental Sampling

• Judgmental sampling is a form of convenience sampling in which the population elements are
selected based on the judgment of the researcher.

Strengths of judgmental sampling

• Low Cost.
• Least time consuming
• Useful for some type of forecasting.

Weaknesses

• Does not allow generalization.


• Subjective, may make sample unrepresentative.

Quota Sampling

• Quota sampling may be viewed as two-stage restricted judgmental sampling.
• The first stage consists of developing control categories, or quotas, of population elements.
• In the second stage, sample elements are selected based on convenience or judgment.

Strengths of quota sampling

• Sample can be controlled for certain characteristics.


Weaknesses

• Selection bias.
• No assurance of representativeness.
• Projecting data beyond the sample is inappropriate.

Snowball Sampling

• In Snowball Sampling, an initial group of respondents is selected, usually at random.


• After being interviewed, these respondents are asked to identify others who belong to the target
population.
• Subsequent respondents are selected based on the information provided by the initial
respondents.

Strengths

• Can estimate rare characteristics.

Weaknesses

• Time Consuming.
• Projecting data beyond sample inappropriate.

Probability Sampling

Simple Random Sampling

Simple random sampling is also known as unrestricted random sampling; the other three
probability sampling techniques are known as restricted sampling.

• Each element in the population has a known and equal probability of selection.
• Each possible sample of a given size (n) has a known and equal probability of being the sample
actually selected.
• This implies that every element is selected independently of every other element.
• A simple random sample may be selected by (a) the lottery method or (b) a table of random numbers.
Strengths

• Easily understood.
• Results projectable.

Weaknesses

• Difficult to construct sampling frame.


• Expensive. Often due to dispersed respondents.
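
As an illustration of simple random sampling, the sketch below uses a pseudo-random number generator in place of the lottery method or a table of random numbers. The sampling frame of 100 labelled units and the function name are hypothetical.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n elements without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame of 100 units
frame = [f"unit-{i}" for i in range(1, 101)]
sample = simple_random_sample(frame, n=10, seed=42)
print(sample)
```

Note that the whole sampling frame must exist before the draw, which is the practical difficulty listed under weaknesses.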

Systematic Sampling

• The sample is chosen by selecting a random starting point and then picking every ith element in
succession from the sampling frame.
• The sampling interval, i, is determined by dividing the population size N by the sample size n
and rounding to the nearest integer.

Strengths

• Can increase representativeness.
• Easier to implement than SRS.

Weaknesses

• Can decrease representativeness. Must be particularly concerned with possible ordering in the
population.
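
The systematic sampling procedure can be sketched as follows. This is an illustrative sketch only: the frame of 100 numbered units and the function name are assumptions, and Python's built-in round is used for "rounding to the nearest integer" (it rounds exact halves to the nearest even number).

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick a random start, then every i-th element, where i = N / n (rounded)."""
    rng = random.Random(seed)
    interval = max(1, round(len(frame) / n))  # sampling interval i
    start = rng.randrange(interval)           # random start within the first interval
    # If rounding makes the interval too wide, fewer than n elements may be returned
    return frame[start::interval][:n]

# Hypothetical frame: units numbered 1..100, so i = 100 / 10 = 10
frame = list(range(1, 101))
sample = systematic_sample(frame, n=10, seed=7)
print(sample)
```

If the frame is ordered in a cycle that matches the interval, every selected element falls at the same point of the cycle, which is the ordering concern noted under weaknesses.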

Stratified Sampling

• A two-step process in which the population is partitioned into subpopulations, or strata.


• The strata should be mutually exclusive and collectively exhaustive, in that every population
element should be assigned to one and only one stratum and no population element should be
omitted.
• Next, elements are selected from each stratum by a random procedure, usually SRS.
• A major objective of stratified sampling is to increase precision without increasing cost.
• The elements within a stratum should be as homogeneous as possible, but the elements in different
strata should be as heterogeneous as possible.
• The stratification variables should also be closely related to the characteristics of interest.

Kinds of Stratified Sampling


• Proportionate stratified sampling: - The number of items taken from each stratum is
proportional to the size of the stratum.
• Disproportionate stratified sampling: - An equal number of items is selected from
each stratum, irrespective of its size.
• Stratified weighted sampling: - An equal number of items is selected from each
stratum, but weights are given to each stratum on the basis of its size.
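
Proportionate stratified sampling can be sketched as below. The three strata, their sizes, and the function name are hypothetical, and SRS within each stratum is done with Python's random.sample. In general the rounded allocations may not add up exactly to n; in this example they do.

```python
import random

def proportionate_stratified_sample(strata, n, seed=None):
    """Allocate n across strata in proportion to stratum size, then use SRS within each."""
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        k = round(n * len(units) / total)  # allocation proportional to stratum size
        sample[name] = rng.sample(units, k)
    return sample

# Hypothetical strata of sizes 600, 300 and 100 (N = 1000)
strata = {
    "urban": [f"u{i}" for i in range(600)],
    "suburban": [f"s{i}" for i in range(300)],
    "rural": [f"r{i}" for i in range(100)],
}
sample = proportionate_stratified_sample(strata, n=50, seed=1)
sizes = {name: len(units) for name, units in sample.items()}
print(sizes)  # → {'urban': 30, 'suburban': 15, 'rural': 5}
```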

Strengths

• Includes all important subpopulations.


• Precision.

Weaknesses

• Expensive.
• Difficult to select relevant stratification variables.

Cluster Sampling

• The target population is first divided into mutually exclusive and collectively exhaustive
subpopulations, or clusters.
• Then a random sample of clusters is selected, based on a probability sampling technique such
as SRS.
• For each selected cluster, either all the elements are included in the sample (one-stage) or a
sample of elements is drawn probabilistically (two-stage).
• Elements within a cluster should be as heterogeneous as possible, but the clusters themselves
should be as homogeneous as possible. Ideally, each cluster should be a small-scale
representation of the population.

Strengths

• Easy to implement.
• Cost effective.

Weaknesses

• Less accurate

Determine the Sample Size

• Neither too large nor too small.

Factors in Determining Sample Size


• Variance (Standard Deviation – how homogeneous is the population)
• Confidence level (sets the probability that the true population parameter is captured by the
estimate).
• Standard error (indicates how precise the estimate must be).

Sample Size Formula

$$n = \left( \frac{Z S}{E} \right)^{2}$$

Z = Confidence level (standard normal value for the chosen confidence level)
S = Standard deviation
E = Standard error
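
As a worked illustration of the sample size formula, the sketch below computes n = (ZS/E)^2 and rounds up to a whole number of elements. The input values (Z = 1.96 for 95% confidence, S = 10, E = 2) and the function name are hypothetical, and rounding up is an assumption made here so that the required precision is not undershot.

```python
import math

def sample_size(z, s, e):
    """Sample size n = (Z * S / E) ** 2, rounded up to a whole element."""
    return math.ceil((z * s / e) ** 2)

# Hypothetical inputs: (1.96 * 10 / 2) ** 2 = 9.8 ** 2 = 96.04
n = sample_size(z=1.96, s=10, e=2)
print(n)  # → 97
```

Halving E roughly quadruples the required sample size, because E enters the formula squared.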

PRESENTING DATA IN CHART AND TABLE

Classification of Data
Classification is the process of arranging the data in groups or classes according to
resemblances and similarities. It is a grouping of related facts into classes.

Methods of Classification
Measurable characteristics are known as variable and non-measurable characteristics are
known as Attributes.

• Classification according to attributes.


1. Simple Classification - One attribute.
2. Manifold classification – More than one attribute
• Classification according to variable.
• Geographical classification.
• Chronological classification.

Frequency

The number of times each value of a variable is repeated in the data.

Frequency Distribution

An orderly arrangement of data classified according to the magnitude of observation.

Class limits: - The limits within which a class interval lies. Lower limit, Upper limit.
Class Interval: - The difference between the upper limit and lower limit.

Exclusive Method: - Upper limit of one class interval is lower limit of the next class.

Inclusive Method: - The upper limit of one class is included in that class itself.

Cumulative Series: - A cumulative series is one in which the frequencies (or values) against each class
interval of a continuous series are progressively totalled and shown in aggregate form.

Less than cumulative frequency: - The cumulative frequency for any value or class is obtained by
successively adding the frequencies of all previous values or classes, including the one against which it
is written. The cumulation runs from the lowest to the highest size.

More than cumulative frequency: - The distribution is obtained by finding the cumulative total of
frequencies starting from the highest and moving to the lowest.
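
The two kinds of cumulative frequency can be computed as in this short sketch. The class intervals and frequencies are hypothetical values chosen only for illustration.

```python
from itertools import accumulate

# Hypothetical continuous series (exclusive class intervals)
classes = ["0-10", "10-20", "20-30", "30-40"]
frequencies = [4, 7, 6, 3]

# Less than: running total from the lowest class upward
less_than = list(accumulate(frequencies))

# More than: running total from the highest class downward
more_than = list(accumulate(reversed(frequencies)))[::-1]

for row in zip(classes, frequencies, less_than, more_than):
    print(*row)
# less_than → [4, 11, 17, 20]
# more_than → [20, 16, 9, 3]
```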

Statistical series

• Statistical series are prepared to present the collected and classified data in properly
arranged way.
• Statistical series may be time series, spatial series, and condition series.
• A series of data arranged chronologically or in relation to time is called time series
• If the data is arranged in relation to place, then it is called spatial series.
• Condition series: - If data are arranged to some condition the series called condition
series.

On the basis of construction

• Individual series: - Each value of variable occurs only once.


• Discrete series: - An arrangement of items in groups showing definite breaks from one
point to another, where the items are exactly measurable.
• Continuous series: - The different values of the variable are stated in a continuous
manner along with their respective frequencies. The items cannot be measured exactly, but
are placed within certain limits.

Presentation of Data

Two important methods of presenting data: -

• Tabular presentation
• Graphic and diagrammatic presentation

Tabulation
• It means a systematic presentation of numerical data in columns and rows in accordance with
some salient features.
• It is an orderly arrangement of data in rows and columns.

• Stubs:- Row heading


• Caption: - Column heading

Types of table

1. One way table


2. Two way table
3. Three way table /Manifold table

Two – Way cross classification table

• A multicolumn table that presents the count or percentage of responses to two categorical
variables. In two-way tables, the categories of one variable form the rows of the table,
while the categories of the second variable form the columns. Cross-classification tables are
also known as cross-tabulation tables.

Diagrammatic Presentation

• A diagram is a visual form of presentation of statistical data.


Types of Diagrams

1. One dimensional
2. Two dimensional
3. Three dimensional
4. Pictograms & Cartograms

One dimensional diagrams

• One dimensional diagrams are those which have only length.


1. Line diagram
2. Simple Bar diagram
3. Sub- divided bar diagram
4. Multiple bar diagram
5. Percentage bar diagram
Two dimensional diagrams

• In two dimensional diagrams, the length as well as the width of the bars is taken into consideration.
1. Rectangle
2. Square
3. Circle
4. Pie diagram

Three dimensional diagrams

• Three dimensional diagrams, also known as volume diagrams, consist of cubes, cylinders, blocks, etc.
• In such diagrams, length, breadth, and height are taken into consideration.

Pictograms & Cartograms

• Pictogram is a device of representing statistical data in pictures.


• Cartograms or Maps are used to give information on geographical basis.

Graphic representation of data

• A graph is a sort of chart through which statistical data are represented in the form of lines or
curves.
• Graphs of a frequency distribution
1. Histogram
2. Frequency polygon
3. Cumulative frequency curves/ Ogives
1) Less than ogive
2) More than ogive
• A histogram can be defined as a set of rectangles, each proportional in width to the range of
values within a class and proportional in height to the frequency of the respective class
interval.
• A frequency polygon is a graphical device for understanding the shape of the distribution. A
dot is plotted for the frequency value at the mid-point of each class interval, and the dots are joined by straight lines.
• An ogive is a cumulative frequency curve.

False base line (Zig-Zag line)


• If the differences in the values of the various items are very small and the smallest value is much
above zero, a false base line is used.

Sturges rule

• A rule for determining the desirable number of groups into which a distribution of
observations should be classified; the number of groups or classes is 1 + 3.3 log n, where n is
the number of observations.
K = 1 + 3.3 Log n
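
Sturges' rule can be applied as in the sketch below. Two assumptions are made here that the notes do not state: the logarithm is taken to base 10, and the result is rounded up to a whole number of classes. The function name is hypothetical.

```python
import math

def sturges_classes(n):
    """Number of classes K = 1 + 3.3 * log10(n), rounded up to a whole class."""
    k = 1 + 3.3 * math.log10(n)
    return math.ceil(k)

# For 100 observations: 1 + 3.3 * 2 = 7.6, rounded up to 8 classes
print(sturges_classes(100))   # → 8
print(sturges_classes(1000))  # → 11
```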

Pareto Diagram

• The Pareto diagram is named after Italian economist, Vilfredo Pareto. A special type of bar
chart in which the counts, amounts, or percentages of each category are presented in
descending order left to right, along with a superimposed plotted line that represents a running
cumulative percentage.
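
The numbers behind a Pareto diagram can be prepared as in this sketch: categories are sorted in descending order and a running cumulative percentage is attached to each. The complaint categories and counts are hypothetical.

```python
from itertools import accumulate

# Hypothetical complaint counts by category
counts = {"Wrong item": 15, "Late delivery": 45, "Other": 10, "Damaged item": 30}

# Bars: categories in descending order of count
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Line: running cumulative percentage over the ordered bars
total = sum(counts.values())
running = accumulate(value for _, value in ordered)
pareto = [
    (name, value, 100 * cumulative / total)
    for (name, value), cumulative in zip(ordered, running)
]

for name, value, cumulative_pct in pareto:
    print(name, value, cumulative_pct)
# cumulative percentages → 45.0, 75.0, 90.0, 100.0
```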
Stem and leaf plot

• It is a well-known technique for organizing raw data into groups. A stem and leaf plot can be
constructed by separating the digits of each number into two groups, one as the stem and the other
as the leaf. After separating the data, the leftmost digit is termed the stem and is the higher-valued
digit. The rightmost digit is termed the leaf and is the lower-valued digit.

Data: 15, 16, 21, 23, 23, 26, 26, 30, 32, 41

Stem  Leaf
1     5 6
2     1 3 3 6 6
3     0 2
4     1
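
The same stem and leaf plot can be built programmatically. The sketch below assumes two-digit whole numbers, so the tens digit is the stem and the units digit is the leaf; the function name is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem), listing units digits (leaves)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

data = [15, 16, 21, 23, 23, 26, 26, 30, 32, 41]
for stem, leaves in stem_and_leaf(data).items():
    print(stem, "|", " ".join(map(str, leaves)))
# 1 | 5 6
# 2 | 1 3 3 6 6
# 3 | 0 2
# 4 | 1
```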

Scatter plot

• The scatter plot is graphical presentation of the relationship between two numerical variables.

Dot scale diagram

• A chart in which each response is represented as a dot above a horizontal line that extends through the
range of all values. Should two or more response values be identical, the dots for these
responses are stacked (placed vertically) above each other.

Box and Whisker plot

• The five-number summary of a set of data values for a variable consists of the smallest value,
the first quartile Q1, the median, the third quartile Q3, and the largest value.
• The five-number summary concisely summarizes the shape of a set of data values for a
variable. This method determines the degree of symmetry (or skewness) based on the distances
that separate the five numbers. To compare these distances effectively, you can create a box
and whisker plot.
• To choose an appropriate table or chart type, your starting point is always to determine whether
your data are categorical or numerical.

If your data are categorical

• Determine whether you have one or two variables to present.


• If one variable, use a summary table and/or a bar chart, pie chart, or Pareto diagram.
• If two variables, use a two-way cross classification table.
MEASURES OF CENTRAL TENDENCY
• An average is an attempt to find a single figure that describes a whole group of figures.
• An average is frequently referred to as a measure of central tendency.
• Measures of central value are popularly known as measures of central tendency because their
value lies between the two extreme values of the series.

Objectives of an Average

• To determine a single value that may be used to describe the characteristics of the entire series.
• To facilitate comparison.
• To facilitate statistical inference: an average obtained from a sample is used in estimating the
average of the population.
• To help in decision making process.

Types of Averages

• Mathematical Average
1) Arithmetic Mean
2) Geometric Mean
3) Harmonic Mean
4) Quadratic Mean
• Positional
1) Median
2) Mode
• Commercial
1) Moving average
2) Progressive average
3) Composite average

Arithmetic Mean

• The arithmetic mean is the value obtained by adding together all the items and dividing this total
by the number of items.
• $AM = \frac{X_1 + X_2 + X_3 + \dots + X_n}{N}$, or $\bar{X} = \frac{\sum X}{N}$
• To correct an incorrectly computed arithmetic mean, deduct the wrong items from the incorrect ∑X,
add the correct items, and then divide the corrected total by the number of observations N.
• The median and mode are better suited to open-end distributions: because the lower and upper
limits of the open classes are difficult to ascertain, the arithmetic mean should not be used
for such distributions.
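
A short sketch of the mean and of the correction rule may help. The marks and the mis-recorded item (53 entered instead of 35) are hypothetical figures chosen for illustration, as are the function names.

```python
def arithmetic_mean(values):
    """Sum of the items divided by the number of items."""
    return sum(values) / len(values)

def corrected_mean(reported_mean, n, wrong_items, right_items):
    """Rebuild the total from the reported mean, swap wrong items for right ones, divide by n."""
    incorrect_total = reported_mean * n
    correct_total = incorrect_total - sum(wrong_items) + sum(right_items)
    return correct_total / n

marks = [40, 50, 60, 70, 80]                      # hypothetical marks
mean_marks = arithmetic_mean(marks)               # (40+50+60+70+80)/5
deviation_sum = sum(m - mean_marks for m in marks)  # deviations from the mean cancel out

# Hypothetical case: mean of 10 items reported as 50, but 35 was recorded as 53
fixed_mean = corrected_mean(50, 10, wrong_items=[53], right_items=[35])
```

Here the corrected total is 500 - 53 + 35 = 482, so the corrected mean is 48.2.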

Properties of Arithmetic Mean

1) The sum of deviations of the items from the arithmetic mean is always Zero.
2) Mean is characterized as point of balance i.e. the sum of the positive deviations from it is equal
to the sum of negative deviations from it.
3) The sum of the squared deviations of the items from arithmetic mean is minimum that is less
than the sum of the squared deviations of the items from any value.
4) The sum of the given values of a series is equal to the product of their arithmetic mean and
the number of items in the series:
$N\bar{X} = \sum X$

Limitations of AM

1) Arithmetic mean is affected by extreme values.


2) For the open end distribution, Mean cannot be computed with accuracy.
3) Mean is not used for studying the qualitative phenomena.
4) For non-homogeneous data, average may give misleading conclusion.

Combined Mean

• If we have the arithmetic means and the numbers of items of two or more related groups,
we can compute the combined mean of these groups by applying this formula:

$\bar{X}_{12} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2}$
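
As a sketch of the combined-mean formula, the function below weights each group mean by its group size. The two groups (40 items with mean 50, 60 items with mean 70) are hypothetical.

```python
def combined_mean(sizes, means):
    """Weighted average of group means, each weighted by its number of items."""
    total_items = sum(sizes)
    weighted_sum = sum(n * m for n, m in zip(sizes, means))
    return weighted_sum / total_items

# Hypothetical groups: N1 = 40 with mean 50, N2 = 60 with mean 70
combined = combined_mean([40, 60], [50.0, 70.0])  # (2000 + 4200) / 100
```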

MEDIAN

• The median is the middle value of the distribution when the items are arranged in order of magnitude.
• It is the 50th percentile, the value below which 50% of the values in the sample fall.
• Median is called the positional average.
• If N is odd, the median is an actual value of the series, with the remaining items in two equal parts on
either side of it.
• If N is even, the median is a derived figure: half the sum of the two middle values.
• N odd: Median = value of the $\frac{N+1}{2}$th item.
• N even: Median = average of the $\frac{N}{2}$th and $\left(\frac{N}{2}+1\right)$th items.
• The sum of the deviations of the items from median, ignoring signs is least.
• Uses of median: in open-end distributions it is a more satisfactory measure of central
tendency than the mean.
• It is the appropriate average for dealing with qualitative data that can be ranked.
• Related positional values: quartiles divide the series into 4 equal parts, octiles into 8, deciles into 10,
and percentiles into 100.
• The median can also be determined graphically, using ogives.
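
The odd and even cases can be sketched as follows. The function is an illustrative implementation for ungrouped data; the first series is hypothetical and the second is the ten-value series used earlier for the stem and leaf display.

```python
def median(values):
    """Middle item for odd N; half the sum of the two middle items for even N."""
    ordered = sorted(values)          # the median needs the data in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([7, 3, 9, 1, 5])                                # ordered: 1 3 5 7 9
even_case = median([15, 16, 21, 23, 23, 26, 26, 30, 32, 41])      # middle items 23 and 26
```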

Limitation

• It requires the arrangement of the data in ascending or descending order.


• It is not based on all items in the series.
• It is not capable of further algebraic treatment.

Mode

• The mode, or modal value, is the value in a series of observations that occurs with the
greatest frequency.
• Mode is the value which has the greatest frequency density in its immediate neighborhood. For
this reason the mode is called the most typical or fashionable value of the distribution.
• To determine the mode, count the number of times each value occurs; the
value occurring the maximum number of times is the modal value.
• When two or more values share the same maximum frequency, one cannot say which
is the modal value, and the mode is said to be ill-defined. Such a series is known as
bimodal or multimodal.
• Where the mode is ill-defined, its value may be ascertained from the formula based on the relationship
between mean, median and mode:

Mode = 3 Median – 2 Mean

This measure is called Empirical mode.
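
A minimal sketch of the empirical mode, using Python's standard `statistics` module. The series is hypothetical and deliberately bimodal (3 and 5 each occur twice), so the mode is ill-defined and the empirical relationship is applied.

```python
from statistics import mean, median, multimode

values = [2, 3, 3, 5, 5, 6, 8]     # hypothetical series with two modal values

modes = multimode(values)          # every value sharing the highest frequency
mode_is_ill_defined = len(modes) > 1

# Empirical mode = 3 Median - 2 Mean
empirical_mode = 3 * median(values) - 2 * mean(values)
```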

• Uses of mode: it is used in open-end distributions and for qualitative phenomena.
• Mode is the most meaningful measure of central tendency in the case of highly skewed or non-normal
distributions.
• The mode can be located graphically using a histogram or a frequency polygon.


Limitations of Mode

• In the case of a bimodal or multimodal series, a single mode cannot be determined.


• It is not capable of further algebraic treatment.
• It is not based on all items of series.
• Its value is affected significantly by the size of class intervals.

Geometric Mean

$GM = \sqrt[n]{X_1 \times X_2 \times X_3 \times \dots \times X_n}$

• Geometric mean is defined as the nth root of the product of n items or values. If there are two
items, we take the square root; if there are three items, the cube root; and so on.

Properties of Geometric Mean

1) The product of the values of a series remains unchanged when the geometric
mean is substituted for each individual value.
E.g.: the geometric mean of 2, 4, 8 is GM = 4,
i.e. $2 \times 4 \times 8 = 64 = 4 \times 4 \times 4$
2) The product of the ratios of the geometric mean to the items at or below it equals the
product of the ratios of the items above it to the geometric mean.
E.g.: for 2, 4, 8 (GM = 4), $\frac{4}{2} \times \frac{4}{4} = \frac{8}{4} = 2$; equivalently, taking all items on each side,
$\frac{4}{2} \times \frac{4}{4} \times \frac{4}{8} = \frac{2}{4} \times \frac{4}{4} \times \frac{8}{4} = 1$

• Because of this property, the geometric mean is especially suited to averaging
ratios, rates of change and logarithmically distributed series.
• Uses: to find the average percentage increase in sales, production or population, and in the
construction of index numbers.
• The geometric mean is not computed when a series contains both negative and positive values,
or when one or more of the values is zero.
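
The definition can be sketched in Python by averaging logarithms, which is equivalent to taking the nth root of the product. The yearly growth rates (10%, 20%, 30%) are hypothetical, and the function name is illustrative.

```python
import math

def geometric_mean(values):
    """nth root of the product of n strictly positive values, computed via logarithms."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean is not computed for zero or negative values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

gm = geometric_mean([2, 4, 8])                    # cube root of 64

# Hypothetical yearly growth of 10%, 20% and 30%, expressed as growth factors
factors = [1.10, 1.20, 1.30]
average_growth_pct = (geometric_mean(factors) - 1) * 100
```

Compounding the average factor three times gives the same overall growth as the three actual factors, which is why the geometric mean suits rates of change.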

Harmonic Mean

• The harmonic mean is based on the reciprocals of the numbers averaged. It is defined as the
reciprocal of the arithmetic mean of the reciprocals of the individual observations.

$HM = \frac{N}{\frac{1}{X_1} + \frac{1}{X_2} + \dots + \frac{1}{X_n}}$

• Uses – It is useful for computing the average rate of increase in profits of a concern or average
speed at which a journey has been performed or the average price at which an article has been
sold. The rate usually indicates the relation between two different types of measuring units that
can be expressed reciprocally.
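• Weighted harmonic mean $= \frac{\sum w}{\sum (w/x)}$
• Relationship among the averages: $AM \ge GM \ge HM$ for positive values.
• They are equal only when all observations are equal.
• GM not only lies between AM and HM; for two positive observations it is also their geometric mean:
$GM^2 = AM \times HM$

i.e. $GM = \sqrt{AM \times HM}$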
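
A sketch of the harmonic mean and of the relationship among the three averages. The example assumes a hypothetical journey covered at 40 km/h one way and 60 km/h back over the same distance.

```python
import math

def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / v for v in values)

speeds = [40, 60]                       # hypothetical speeds over equal distances

am = sum(speeds) / len(speeds)          # arithmetic mean
gm = math.sqrt(speeds[0] * speeds[1])   # geometric mean of two values
hm = harmonic_mean(speeds)              # average speed for the whole journey
```

With these figures AM = 50, GM is about 48.99 and HM = 48, so AM > GM > HM, and GM² = 2400 = AM × HM.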


Quadratic Mean

• The quadratic mean is the average obtained by taking the square root of the sum of the squares of the item
values divided by their number.

$QM = \sqrt{\frac{a^2 + b^2 + c^2 + \dots + n^2}{N}}$

Choice of a suitable average

I. Purpose

Purpose | Suitable Average
To give equal importance to all items of a series | Arithmetic Mean
To locate the position of an item in relation to other items | Median
To find out the most common or most fashionable item | Mode
To give more importance to small items than big items | Geometric Mean
To give greatest importance to small items | Harmonic Mean

Nature of data | Suitable Average
For open-ended distributions | Median
To describe qualitative data | Mode
To compute average rates of increase or decrease, average ratios, average percentages | Geometric Mean
To compare the value of a variable with another variable which is constant, for example varying speed with constant distance, or varying quantities bought/sold per rupee | Harmonic Mean
In other cases | Arithmetic Mean

• Amenability to further algebraic treatment – Arithmetic Mean.


• Special purpose – for calculating trend in time series analysis, the moving average would be
the most suitable average.

Commercial Averages

• Moving average: a special type of arithmetic average, calculated over a fixed period
(such as 3 years or 5 years) that moves forward through the series one item at a time.
• Progressive average: also known as cumulative average. For items a, b, c, d the successive progressive averages are

$\frac{a}{1},\ \frac{a+b}{2},\ \frac{a+b+c}{3},\ \frac{a+b+c+d}{4}$

• Composite average: an arithmetic average calculated
from several different averages.
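
The moving and progressive averages can be sketched as below. The yearly sales figures are hypothetical, and the 3-year period is just one of the periods mentioned above.

```python
def moving_average(values, period):
    """Arithmetic mean of each run of `period` consecutive items."""
    return [sum(values[i:i + period]) / period
            for i in range(len(values) - period + 1)]

def progressive_average(values):
    """Cumulative averages: a/1, (a+b)/2, (a+b+c)/3, ..."""
    averages, running_total = [], 0
    for count, value in enumerate(values, start=1):
        running_total += value
        averages.append(running_total / count)
    return averages

sales = [10, 20, 30, 40, 50]                 # hypothetical yearly sales
three_year = moving_average(sales, 3)
cumulative = progressive_average(sales)
```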

Measures of Dispersion

• Dispersion is the measure of the variation of items.
• A measure of dispersion or variation measures the extent to which there are
differences between individual observations and some central or average value. In measuring
variation we are interested in the amount or degree of the variation, not in its
direction.
• It is important for judging the reliability of an average.

Methods of Studying Dispersions are as follows: -

1) The Range
2) Interquartile range or the quartile deviation.
3) The Mean deviation or the average deviation.
4) Standard deviation
5) Lorenz curve.

Range

• It is the difference between the largest item and the smallest item.

Range = Highest – Lowest

• Coefficient of range $= \frac{\text{Highest} - \text{Lowest}}{\text{Highest} + \text{Lowest}}$

Inter quartile range

• It represents the difference between the third quartile and the first quartile.

• Inter quartile range $= Q_3 - Q_1$, Quartile deviation $= \frac{Q_3 - Q_1}{2}$

• Coefficient of quartile deviation $= \frac{Q_3 - Q_1}{Q_3 + Q_1}$

• Percentile range is also used as a measure of dispersion.

• Percentile range $= P_{90} - P_{10}$

• Range and quartile deviation do not show the scatter of the items around an average.
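
The positional measures can be sketched with the standard `statistics.quantiles` function. Quartile conventions differ between textbooks; this sketch assumes the "inclusive" interpolation method, so the quartiles may differ slightly from those obtained with the (N+1)/4 rule. The data are the ten values used earlier for the stem and leaf display.

```python
from statistics import quantiles

data = [15, 16, 21, 23, 23, 26, 26, 30, 32, 41]

highest, lowest = max(data), min(data)
value_range = highest - lowest
coefficient_of_range = (highest - lowest) / (highest + lowest)

# Assumption: "inclusive" quartile method; other conventions give slightly different Q1, Q3
q1, _, q3 = quantiles(data, n=4, method="inclusive")
interquartile_range = q3 - q1
quartile_deviation = (q3 - q1) / 2
coefficient_of_qd = (q3 - q1) / (q3 + q1)
```
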
Interpretation of Range

• If the average of the two distributions are almost same, the distribution with smaller range is
said to have less dispersion.

Interpretation of QD

Value of QD | Interpretation
A small QD | Small variation (or high uniformity) among the central 50% of the items of the distribution.
A high QD | High variation (or low uniformity) among the central 50% of the items of the distribution.

Mean Deviation

• The mean deviation is also known as the average deviation. It is the average of the absolute differences
between the items in a distribution and the median or mean of that series.
• There is an advantage in taking the deviations from the median, because the sum of the deviations of items
from the median is a minimum when signs are ignored.
• In practice the arithmetic mean is more frequently used in calculating the average deviation,
which is why it is also called the mean deviation.

$M.D. = \frac{\sum |D|}{N}$

• D = X – Mean or X – Median.
• Coefficient of mean deviation $= \frac{MD}{\text{Median}}$ if the deviations are taken from the median, and
$\frac{MD}{\text{Mean}}$ if they are taken from the mean.
• The greatest drawback of this method is that algebraic signs are ignored while taking the
deviations, which makes the method non-algebraic. Its simplicity, however, makes it especially effective in
reports presented to the general public or to groups not familiar with statistical methods.
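
A sketch of the mean deviation taken from the mean and from the median. The series is hypothetical and includes one extreme value to show that the deviation from the median is the smaller of the two.

```python
from statistics import mean, median

def mean_deviation(values, about):
    """Average of the absolute deviations of the items from the chosen average."""
    return sum(abs(v - about) for v in values) / len(values)

data = [2, 4, 6, 8, 30]                             # hypothetical series, mean 10, median 6

md_from_mean = mean_deviation(data, mean(data))     # (8+6+4+2+20)/5
md_from_median = mean_deviation(data, median(data)) # (4+2+0+2+24)/5
coefficient_of_md = md_from_median / median(data)
```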
Standard Deviation

• This concept was introduced by Karl Pearson in 1893. It is also known as the root mean square
deviation, because it is the square root of the mean of the squared deviations from the
arithmetic mean. It is denoted by the small Greek letter σ (sigma).

$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$

• The standard deviation measures the absolute dispersion or variability of a distribution: the
greater the dispersion or variability, the greater the standard deviation, because the
deviations of the values from their mean are larger. A small standard deviation
means a high degree of uniformity of the observations as well as homogeneity of the series; a
large standard deviation means just the opposite.
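
The root mean square deviation can be computed step by step as below. The series is hypothetical, chosen so that the mean is 5 and the squared deviations add up to 32.

```python
import math

def population_sd(values):
    """Square root of the mean of the squared deviations from the arithmetic mean."""
    n = len(values)
    mean_value = sum(values) / n
    squared_deviations = [(v - mean_value) ** 2 for v in values]
    return math.sqrt(sum(squared_deviations) / n)

data = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical series: mean 5, sum of squared deviations 32
sigma = population_sd(data)         # sqrt(32 / 8)
```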

Difference between Mean deviation and Standard deviation

1) Algebraic signs are ignored while calculating mean deviation whereas in the calculation of
standard deviation sign are taken in to account.
2) Mean deviation can be computed either from median or mean but standard deviation is always
computed from the arithmetic mean because the sum of the squares of the deviations of items
from arithmetic mean is least.

• Population standard deviation is denoted by σ, whereas sample standard deviation is denoted by s.

Mathematical Properties of Standard Deviation

• Just as it is possible to compute the combined mean of two or more groups, we can
also compute the combined standard deviation of two or more groups.
• The sum of the squares of the deviations of the items from their arithmetic mean is a
minimum. The sum of squares of deviations from any other value
would always be greater; this is why the standard deviation is
always computed from the arithmetic mean.
• For a symmetrical (normal) distribution, the following proportions of items lie within the stated limits:

Mean ± 1σ = 68.27%, Mean ± 2σ = 95.45%, Mean ± 3σ = 99.73%

• In a normal distribution there is a fixed relationship between the three most commonly used
measures of dispersion. The QD is smallest, the MD next, and the SD greatest:

$QD = \frac{2}{3}\sigma$, $MD = \frac{4}{5}\sigma$ (approximately)

SD > MD > QD.
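
These percentages and ratios follow from the normal curve and can be checked numerically. The sketch below uses the error function and `statistics.NormalDist` from the Python standard library; it is a verification aid, not part of the original derivation.

```python
import math
from statistics import NormalDist

def share_within(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

shares = {k: round(100 * share_within(k), 2) for k in (1, 2, 3)}

# Exact ratios behind the rounded 2/3 and 4/5 rules
qd_over_sd = NormalDist().inv_cdf(0.75)    # quartile deviation / sigma
md_over_sd = math.sqrt(2 / math.pi)        # mean deviation / sigma
```

The exact ratios are about 0.6745 and 0.7979, which the notes round to 2/3 and 4/5.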

Coefficient of variation

• The relative measure of standard deviation is known as the coefficient of variation. This measure was developed by Karl
Pearson.
• The series for which the coefficient of variation is greater is said to be more variable, or less consistent, less uniform,
less stable or less homogeneous.
• On the other hand, the series for which the coefficient of variation is smaller is said to be less variable,
or more consistent, more uniform, more stable or more homogeneous. It is given by

$C.V. = \frac{\sigma}{\bar{X}}$ (usually multiplied by 100 and expressed as a percentage)

• Variance = square of the standard deviation, $\sigma^2$. The smaller the value of $\sigma^2$, the lesser the variability,
or the greater the uniformity, in the population.
• Standard deviation is regarded as the best measure of variation.
• To correct an incorrectly computed SD: from the incorrect $\sum X^2$ deduct the square of the wrong value and add the
square of the right value, correct the mean in the same way, and recompute $\sigma = \sqrt{\frac{\sum X^2}{N} - \bar{X}^2}$ with the corrected figures.
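
A sketch of how the coefficient of variation compares the consistency of two series with the same mean. The two batsmen and their scores are hypothetical; `pstdev` is the population standard deviation from Python's `statistics` module.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Standard deviation relative to the mean, expressed as a percentage."""
    return pstdev(values) / mean(values) * 100

batsman_a = [50, 52, 48, 51, 49]    # hypothetical scores, mean 50, tightly grouped
batsman_b = [10, 90, 30, 70, 50]    # hypothetical scores, mean 50, widely spread

cv_a = coefficient_of_variation(batsman_a)
cv_b = coefficient_of_variation(batsman_b)
more_consistent = "A" if cv_a < cv_b else "B"
```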

Lorenz Curve

• It was devised by Max O. Lorenz. It is a graphic method of studying dispersion, first
used by him to measure the distribution of wealth and income.
• The most common use of this curve is to study the degree of inequality in the distribution
of income and wealth between countries or between different periods of time.
• It is a cumulative percentage curve in which the cumulative percentage of items is plotted against the
cumulative percentage of some other quantity, such as wealth, profits or turnover.
• The graph contains a line OP, known as the line of equal
distribution, which makes an angle of 45° with the axes.
• For any given distribution the curve never crosses the line of equal distribution. It always
lies below OP unless the distribution is uniform, in which case it coincides with OP. The
greater the variability, the greater the distance of the curve from OP.
• Thus a measure of the variability of the distribution is provided by the distance of the curve of cumulated
percentages from the line of equal distribution.

Skewness, Moments & Kurtosis

• When a series is not symmetrical it is said to be asymmetrical or skewed.
• Skewness refers to the lack of symmetry.
• In a symmetrical distribution the values of mean, median and mode coincide. The spread of the
frequencies is the same on both sides of the center point of the curve.
• A distribution which is not symmetrical is called a skewed distribution; it
may be either positively skewed or negatively skewed.
• Symmetrical distribution: Mean = Median = Mode.
• Positively skewed distribution: Mean > Median > Mode.
• Negatively skewed distribution: Mode > Median > Mean.
• In a moderately asymmetrical distribution the interval between the mean and the median is
approximately $\frac{1}{3}$ of the interval between the mean and the mode. This relationship
provides a means of measuring the degree of skewness.
• Dispersion is concerned with the amount of variation rather than with its direction.
Skewness tells us about the direction of the variation, or the departure from symmetry.
• Measures of skewness depend upon the amount of dispersion.

Skewness is present if:

1) Mean, median and mode do not coincide.
2) When plotted on a graph, the data do not give the normal bell-shaped curve.
3) The sum of the positive deviations from the median is not equal to the sum of the negative deviations
from the median.
4) Quartiles are not equidistant from the median.
5) Frequencies are not equally distributed at points of equal deviation from the mode.

Measures of Skewness

• They tell us the direction and extent of asymmetry in a series. A measure of skewness may be absolute or
relative.

Four important measures of skewness are as follows.

1) First Measure of Skewness.
2) Second Measure of Skewness.
3) Third Measure of Skewness.
4) Measure of Skewness based on Moments.

First Measure of Skewness

• It is based on the difference between mean and mode.
• Mean > Mode: positive skewness.
• Mode > Mean: negative skewness.
• Mean – Mode is an absolute measure of skewness.
• $\frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$ is a relative measure of skewness. This is known as Karl Pearson's
Coefficient of Skewness.
• If the mode is ill-defined, the coefficient is $\frac{3\,\text{Mean} - 3\,\text{Median}}{\text{Standard deviation}}$.
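
A sketch of Karl Pearson's coefficient in its median form, which avoids the problem of an ill-defined mode. The incomes are hypothetical, with one large value producing a long right tail; the population standard deviation (`pstdev`) is assumed.

```python
from statistics import mean, median, pstdev

def pearson_skewness(values):
    """Karl Pearson's coefficient in its median form: 3(Mean - Median) / SD."""
    return 3 * (mean(values) - median(values)) / pstdev(values)

incomes = [2, 3, 3, 4, 4, 4, 5, 20]     # hypothetical incomes, long right tail
symmetric = [1, 2, 3, 4, 5]             # mean and median coincide

sk_incomes = pearson_skewness(incomes)      # mean 5.625 > median 4, so positive
sk_symmetric = pearson_skewness(symmetric)  # no skewness
```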

Second Measure of Skewness

• It is based on the difference between the quartiles and the median.

• (Q3 – Median) – (Median – Q1) is an absolute measure of skewness.

• $\frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$ is a relative measure of skewness. This is known as Bowley's
Measure of Skewness.
• Bowley's measure is limited to values between –1 and +1. It is useful for open-end distributions and
where extreme values are present.

Third Measure of Skewness

• It is based on the difference between percentiles/deciles and the median.

• (D9 – Median) – (Median – D1) is an absolute measure of skewness.

• $\frac{D_9 + D_1 - 2\,\text{Median}}{D_9 - D_1}$ or $\frac{P_{90} + P_{10} - 2\,\text{Median}}{P_{90} - P_{10}}$ is a relative measure of skewness.
This is known as Kelley's Coefficient of Skewness.

Measure of Skewness based on Moments

• It is based on the third moment.
• $\mu_3$ is an absolute measure of skewness.
• $\frac{\mu_3}{\sqrt{\mu_2^{3}}}$ is a relative measure of skewness.

Moments

• In statistics, the term moment is used by analogy with mechanics: class frequencies are treated as forces and the deviations of the
various items from the mean as distances. In other words, a moment is the mean of the
first, second, third, fourth, etc. powers of the deviations from the mean.

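Moment | Formula | Measures
First | $\frac{\sum (X - \bar{X})}{N}$ | Mean
Second | $\frac{\sum (X - \bar{X})^2}{N}$ | Variance
Third | $\frac{\sum (X - \bar{X})^3}{N}$ | Skewness
Fourth | $\frac{\sum (X - \bar{X})^4}{N}$ | Kurtosis

Note: taken about the mean, the first moment is always zero; the mean itself is the first moment about the origin, $\frac{\sum X}{N}$.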

Kurtosis

• Kurtosis comes from a Greek word meaning bulginess.
• It refers to the degree of flatness or peakedness of a frequency curve in the region about its
mode.
• If a curve is more peaked than the normal curve: leptokurtic, $\beta_2 > 3$.
• If a curve is more flat-topped than the normal curve: platykurtic, $\beta_2 < 3$.
• Normal curve: mesokurtic, $\beta_2 = 3$.
• Here $\beta_2 = \frac{\mu_4}{\mu_2^2}$ is the moment measure of kurtosis.

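
The moments about the mean, and the skewness and kurtosis measures built from them, can be sketched as follows. The series is hypothetical; the kurtosis measure used is the ratio of the fourth moment to the square of the second, which is the quantity compared with 3.

```python
def central_moment(values, order):
    """Mean of the deviations from the arithmetic mean raised to the given power."""
    n = len(values)
    mean_value = sum(values) / n
    return sum((v - mean_value) ** order for v in values) / n

data = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical series with mean 5

mu1 = central_moment(data, 1)             # always zero about the mean
mu2 = central_moment(data, 2)             # variance
mu3 = central_moment(data, 3)
mu4 = central_moment(data, 4)

moment_skewness = mu3 / mu2 ** 1.5        # mu3 / sqrt(mu2^3)
kurtosis_measure = mu4 / mu2 ** 2         # compared with 3
shape = ("leptokurtic" if kurtosis_measure > 3
         else "platykurtic" if kurtosis_measure < 3
         else "mesokurtic")
```

For this series the second moment is 4, the third is 5.25 and the fourth is 44.5, giving a skewness of 0.65625 and a kurtosis measure of 2.78125.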
CORRELATION & REGRESSION

• Correlation is the relationship that exists between two or more variables.


• If two variables are related to each other in such a way that change in one creates a
corresponding change in the other, the variables are said to be correlated.
• The measure of correlation is called coefficient of correlation.
• Correlation is sometimes termed as ‘covariation’.
• Correlation analysis is a statistical technique used to measure the degree and direction of
relationship between the variables.
• Correlation does not necessarily imply causation or functional relationship though the
existence of causation always implies correlation. It establishes only covariation.

Types of Correlation

1) Positive or Negative correlation


2) Simple, partial, multiple correlation.
3) Linear and Nonlinear correlation.

Positive or Negative Correlation

• If both the variables are varying in the same direction i.e. if as one variable is increasing the
other on an average is also increasing it is known as positive correlation.
• On the other hand, if the variables are varying in opposite directions, i.e. as one variable is
increasing the other is decreasing or vice versa, the correlation is said to be negative.
Simple, Partial or Multiple Correlation

• When only two variables are studied, it is a problem of simple correlation.
• Partial correlation involves three or more variables, but considers only two of them
to be influencing each other, the effect of the other influencing variables being
held constant.
• When three or more variables are studied simultaneously, it is a problem of multiple correlation.

Linear or Nonlinear (Curvilinear) Correlation

• Linear: the amount of change in one variable tends to bear a constant ratio to the amount of change in
the other variable.
• Nonlinear (curvilinear): the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable.

Method of Studying Correlation

1) Scatter diagram method


2) Graphic Method
3) Karl Pearson coefficient of correlation.
4) Rank Correlation.
5) Concurrent deviation method.

Scatter diagram method

• The simplest device for ascertaining whether two variables are related is to prepare a dot chart
called a scatter diagram.
• The greater the scatter of the plotted points on the chart, the weaker the relationship between
the two variables.
• The more closely the points come to a straight line, the higher the degree of relationship.
• If all the points lie on a straight line rising from the lower left-hand corner to the upper right-hand
corner, the correlation is said to be perfect positive correlation, r = +1.
• If all the points lie on a straight line falling from the upper left-hand corner to the
lower right-hand corner, the correlation is said to be perfect negative correlation, r = -1.
• If the plotted points lie on a straight line parallel to the X-axis or in a haphazard manner, it shows the
absence of any relationship between the variables and is called no correlation, r = 0.
• Perfect positive: r = +1, perfect negative: r = -1, positive: r > 0, negative: r < 0, no
correlation: r = 0.
• The closer the value of r comes to zero, the weaker the correlation (low degree of
correlation).
• The closer the value of r comes to ±1, the stronger the correlation (high degree of
correlation).

Graphic Method

• When values are plotted on a graph paper we obtain two curves, one for x variable and another
for y variable.
• If both the curves drawn on the graph are moving in the same direction (either up or down)
correlation is said to be positive.
• On the other hand, if the curves are moving in opposite directions, the correlation is said to be
negative.

Covariance

• Given a set of N pairs of observations relating to two variables X and Y, the covariance of X and
Y is usually represented by

$Cov(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N} = \frac{\sum xy}{N}$

where $x = X - \bar{X}$ and $y = Y - \bar{Y}$.

Difference between variance and covariance

• Variance can never be negative, whereas covariance may be positive, negative or zero.

Properties of covariance

• Independent of change of origin: the value of the covariance is not affected if each
individual value of X and Y is increased or decreased by some constant.
• Not independent of change of scale: if each individual value of X and Y is multiplied or divided by some
non-zero constant, the covariance is multiplied or divided by the product of those constants. (The coefficient
of correlation r, by contrast, is unaffected by a change of scale with positive constants.)
• Covariance can vary from –∞ to +∞.

Limitation

• Covariance indicates the direction of the relationship between two variables, but because its size depends on
the units of measurement it cannot be used for measuring the strength of the relation between them.

Karl Pearson Coefficient of Correlation

$r = \frac{\sum xy}{N \sigma_x \sigma_y}$

• The coefficient of correlation is based on the covariance between the two series x and y.
• To find the value of the correlation coefficient, we first calculate the covariance and then, in
order to convert it to a relative measure, divide the covariance by the product of the standard deviations of
the two series. The ratio so obtained is called Karl Pearson's coefficient of correlation.
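
The two steps (covariance first, then division by the standard deviations) can be sketched as below. The advertising and sales figures are hypothetical, and the function names are illustrative.

```python
import math

def deviations(values):
    """Deviations of each item from the arithmetic mean of its series."""
    mean_value = sum(values) / len(values)
    return [v - mean_value for v in values]

def covariance(xs, ys):
    """Mean of the products of paired deviations: sum(xy) / N."""
    return sum(a * b for a, b in zip(deviations(xs), deviations(ys))) / len(xs)

def pearson_r(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

advertising = [1, 2, 3, 4, 5]     # hypothetical X series
sales = [2, 4, 5, 4, 5]           # hypothetical Y series

cov_xy = covariance(advertising, sales)   # sum(xy) = 6, N = 5
r = pearson_r(advertising, sales)         # 6 / sqrt(10 * 6)
```

Rescaling or shifting either series (for example, measuring advertising in different units) changes the covariance but leaves r unchanged.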

Properties of Karl Pearson’s coefficient

1. The coefficient of correlation always lies between –1 and +1.
2. It is based on a suitable measure of variation, as it takes into account all items of the variables.
3. It measures both the direction and the degree of change.
4. The coefficient of correlation does not prove causation; it is simply a measure of covariation.
5. It is independent of change of scale and origin of the variables X and Y.
6. It is independent of the unit of measurement.
7. It is the geometric mean of the two regression coefficients:
$r = \pm\sqrt{b_{xy} \times b_{yx}}$, taking the sign of the regression coefficients.
8. It is symmetric in the two variables: $r_{xy} = r_{yx}$

Nonsensical/ Spurious Correlation

• A correlation can be computed between two variables even when there is no logical
basis for a relationship between them. Such an illogical correlation is referred to as
nonsensical or spurious correlation.

Standard Error

• Standard error of the correlation coefficient is

$SE = \frac{1 - r^2}{\sqrt{N}}$

where r = coefficient of correlation, N = number of pairs of observations.
• Standard error of coefficient of correlation is used for ascertaining the probable error of
coefficient of correlation.

Probable Error

• Probable error of the coefficient of correlation is an amount which, if added to and subtracted from
the value of r, gives the upper and lower limits within which the coefficient of correlation in the
population can be expected to lie. It is 0.6745 times the standard error of r.

$PE = 0.6745 \times SE = 0.6745 \times \frac{1 - r^2}{\sqrt{N}}$
• Probable error is used to determine the reliability of value ‘r’ in so far as it depends on the
conditions of random sampling.
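
The standard error and probable error can be sketched as follows. The values r = 0.8 and N = 64 are hypothetical, chosen so that the square root is a whole number.

```python
import math

def standard_error_r(r, n):
    """SE of the coefficient of correlation: (1 - r^2) / sqrt(N)."""
    return (1 - r ** 2) / math.sqrt(n)

def probable_error_r(r, n):
    """PE = 0.6745 times the standard error."""
    return 0.6745 * standard_error_r(r, n)

r, n = 0.8, 64                       # hypothetical coefficient and number of pairs
se = standard_error_r(r, n)          # 0.36 / 8
pe = probable_error_r(r, n)
limits = (r - pe, r + pe)            # range within which the population r is expected to lie
```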

Coefficient of Determination

• The coefficient of determination gives the ratio of explained variance to total variance.

$r^2 = \frac{\text{Explained variance}}{\text{Total variance}}$

• The square of the coefficient of correlation is called the coefficient of determination.
• It shows the percentage of variation in the dependent variable Y which is explained by the
independent variable.

Properties

• The coefficient of determination varies from 0 to 1.
• As the coefficient of correlation decreases, the coefficient of determination decreases more rapidly,
except at r = 0 and r = 1, where the two are equal.
• One minus the coefficient of determination is called the coefficient of non-determination.

Coefficient of non-determination $= 1 - r^2$.

Rank Correlation Coefficient

• It was developed by Charles Edward Spearman.
• Sometimes we are required to examine the extent of association between two ordinally scaled
variables, such as two rank orderings.
• A measure of the degree of association between the ranks of the two variables x and y
is called rank correlation.

$r = 1 - \frac{6 \sum D^2}{N^3 - N}$, where D = rank difference and N = number of pairs.

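
A sketch of the rank correlation formula for untied ranks. The rankings by two judges are hypothetical.

```python
def rank_correlation(ranks_x, ranks_y):
    """Spearman's formula for untied ranks: 1 - 6 * sum(D^2) / (N^3 - N)."""
    n = len(ranks_x)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d_squared / (n ** 3 - n)

judge_1 = [1, 2, 3, 4, 5]      # hypothetical ranks given by the first judge
judge_2 = [2, 1, 4, 3, 5]      # hypothetical ranks given by the second judge

rho = rank_correlation(judge_1, judge_2)   # sum(D^2) = 4, so 1 - 24/120
```

Identical rankings give +1 and exactly reversed rankings give -1.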
In case of Tied Ranks

• When more than one item in a series has the same value, the average rank is usually allotted to
each of them, and for each such group of tied items the following factor is added to $\sum D^2$,
where m is the number of items sharing the rank:

$\frac{m^3 - m}{12}$

• Thus in case of tied ranks the formula is

$r = 1 - \frac{6\left(\sum D^2 + \frac{m^3 - m}{12} + \dots\right)}{N^3 - N}$

Concurrent Deviation Method

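• It is based only on the direction of change of the x variable and the y variable from one item to the next.

$r_c = \pm\sqrt{\pm\left(\frac{2c - n}{n}\right)}$

c = number of concurrent deviations

n = N – 1 (number of pairs of deviations compared)

N = total number of items.

• r is +ve when 2c > n, and r is –ve when 2c < n.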
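
The method can be sketched as below. One assumption is made explicit: a pair of changes counts as concurrent only when both series move in the same direction (both up or both down); textbooks differ on how to treat items that show no change. Both series are hypothetical.

```python
import math

def direction_of_change(series):
    """+1, -1 or 0 for each item compared with the item before it."""
    return [(b > a) - (b < a) for a, b in zip(series, series[1:])]

def concurrent_deviation_r(xs, ys):
    """r_c = +/- sqrt(+/- (2c - n) / n), with n = N - 1 pairs of deviations."""
    signs_x, signs_y = direction_of_change(xs), direction_of_change(ys)
    n = len(signs_x)
    # Assumption: concurrent means both series changed in the same direction
    c = sum(1 for a, b in zip(signs_x, signs_y) if a * b > 0)
    ratio = (2 * c - n) / n
    return math.copysign(math.sqrt(abs(ratio)), ratio)

x_series = [1, 2, 3, 2, 4, 5]      # hypothetical: up, up, down, up, up
y_series = [2, 3, 4, 5, 6, 7]      # hypothetical: up every time

r_c = concurrent_deviation_r(x_series, y_series)   # c = 4, n = 5
```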


Kendall Coefficient

A measure of the degree of association among the rankings of 'n' objects by 'm'
respondents or judges.

Lag and lead correlation

• Sometimes two variables are correlated and stand in a cause and effect relationship.
This does not mean that a change in one variable will immediately cause a change in the other
variable; there can be a time lag between cause and effect.
• When two series are given, one may be leading and the other lagging. The series that
moves first is known as the leading series and the one that follows is known as the
lagging series.

Regression Analysis

• Regression is the measure of the average relationship between two or more variables in terms of
the original units of the data.
• After having established that two variables (say sales and advertising expenditure) are
correlated, one may find the average relationship between the two in order to estimate the unknown
values of the dependent variable (sales) from the known values of the independent variable (advertising
expenditure).
• The literal meaning of the term regression is the act of returning or going back.
• The term regression was first used by Francis Galton in 1877.
• Its purpose is to study the functional relationship between two or more variables and thereby provide a
mechanism for prediction and forecasting.
• Regression means the estimation or prediction of the unknown value of one variable from the
known value of the other variable.

Classification of Regression Analysis

Change in proportion

• Linear regression
• Nonlinear regression

Number of variables

• Simple regression: only two variables.
• Multiple regression: more than two variables.

Regression lines

• A regression line is a graphic technique to show the functional relationship between the two
variables X and Y, i.e. the dependent and independent variables.
• It is called an estimating line, as it gives the average estimated value of the dependent variable for
any given value of the independent variable.
• In the simple linear regression model there are two regression lines: the regression line of X on Y and
the regression line of Y on X.

• Regression line of X on Y:

$X = a + bY$

• The values of the two constants a and b can be calculated from the normal equations

$\sum X = Na + b\sum Y$

$\sum XY = a\sum Y + b\sum Y^2$

• Another way of expressing the regression line of X on Y:

$(X - \bar{X}) = b_{xy}(Y - \bar{Y})$

or

$(X - \bar{X}) = r\,\frac{\sigma_x}{\sigma_y}(Y - \bar{Y})$
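
The two normal equations for the line of X on Y can be solved directly for a and b. This is a sketch with hypothetical data; the closed-form expressions are simply the solution of the two simultaneous equations.

```python
def regression_x_on_y(xs, ys):
    """Solve sum(X) = N*a + b*sum(Y) and sum(XY) = a*sum(Y) + b*sum(Y^2) for a and b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_yy = sum(y * y for y in ys)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_yy - sum_y ** 2)
    a = (sum_x - b * sum_y) / n
    return a, b

xs = [1, 2, 3, 4, 5]          # hypothetical X values
ys = [2, 4, 5, 4, 5]          # hypothetical Y values

a, b = regression_x_on_y(xs, ys)     # sums: X = 15, Y = 20, XY = 66, Y^2 = 86
estimated_x = a + b * 6              # estimate of X when Y = 6
```

For these figures the equations give b = 1 and a = -1, so the line is X = -1 + Y.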

Methods of drawing regression lines

• Free hand curve method
• The method of least squares

Free hand curve method

• According to this method, Original data are plotted on a graph paper. Usually original data
when plotted on a graph gives a wave like curve, but it depicts a general tendency of the data.
• Independent variable is taken along the horizontal axis and dependent variables along the
vertical axis.
• We draw a smooth free hand line in such a way that it clearly indicates the tendency of the
original data. This line is fitted by inspection.
Method of Least squares

• According to this method the line should be drawn through the plotted points in such a way
that the sum of squares of the deviations of the actual Y values from the computed Y values
is a minimum. A line fitted by this method is called the line of best fit.
• In the line of best fit, the sum of the deviations above the line is equal to the sum of the deviations below the line.

Regression Equations

• Regression equations are the equations of the lines of best fit.
• There are two regression equations: one is known as X on Y and the other as
Y on X.
• The regression equation of Y on X is the equation of the best-fitting line drawn on the scatter
diagram with the X variable along the x-axis and the Y variable along the y-axis; it is used to estimate Y from X.
The regression equation of X on Y is the reverse: it is used to estimate X from Y.

Line of best fit / Least square line

• A line fitted by the method of least squares is the line of best fit. It has the following
characteristics, where $Y_c$ is the value of Y computed from the line.
1. The sum of squared deviations, $\sum (Y - Y_c)^2$, is less than the sum of squared
deviations from any other line.
2. The sum of the deviations is zero: $\sum (Y - Y_c) = 0$.
3. The line passes through the point of means of the variables, $(\bar{X}, \bar{Y})$.
4. The least squares line is the best estimate of the population regression line when the data
represent a sample from a large population.

Standard Error

• A regression equation provides only likely values; prediction with
100% accuracy is not possible.
• The standard error of estimate measures the dispersion about an average line, called the
regression line. It indicates how precise the prediction of Y from X, or of X from Y, is.

$S_{xy} = \sqrt{\frac{\sum (X - X_c)^2}{N}}$

$S_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{N}}$

• The difference between actual value & predicted value is the error in prediction.
• Standard Error of the estimate is the square root of the mean of square of these errors.
• Just as the standard deviation measures variability around the mean, the standard error of the
estimate measures variability around the fitted line of regression.
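
A sketch of the standard error of estimate for the line of Y on X, fitted by least squares. The data are hypothetical; the residuals also illustrate that the deviations from a least squares line sum to zero.

```python
import math

def fit_y_on_x(xs, ys):
    """Least squares line Y = a + b*X, with b = sum(xy) / sum(x^2) in deviation form."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]          # hypothetical X values
ys = [2, 4, 5, 4, 5]          # hypothetical Y values

a, b = fit_y_on_x(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]            # Y - Yc, errors in prediction
s_yx = math.sqrt(sum(e ** 2 for e in residuals) / len(xs))       # root mean square error
```

Here the fitted line is Y = 2.2 + 0.6X, the squared errors add up to 2.4, and the standard error of estimate is the square root of 0.48.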

Properties of Regression Coefficient

• Both regression coefficients have the same sign, i.e. either both are positive or both are negative.
It is never possible for one regression coefficient to be negative and the other positive.
• Since the coefficient of correlation cannot exceed 1 in magnitude and $r^2 = b_{xy} \cdot b_{yx}$, both
regression coefficients cannot be greater than 1: if one of them exceeds 1, the other must be
less than 1.
• The coefficient of correlation has the same sign as the regression coefficients, i.e. if the
regression coefficients are negative, r is also negative, and if the regression coefficients
are positive, r is positive.
• The correlation coefficient is the geometric mean of the regression coefficients: $r = \pm\sqrt{b_{xy} \cdot b_{yx}}$.
• Regression coefficients are affected by a change of scale but are independent of a change of origin.
• The arithmetic mean of $b_{xy}$ and $b_{yx}$ is greater than or equal to the coefficient of correlation
(stated for positive coefficients; in general the inequality holds in absolute value).

$r \le \frac{b_{xy} + b_{yx}}{2}$

• Since $b_{xy} = r \frac{\sigma_x}{\sigma_y}$, we can find any of these four values given the other three.
• If $\sigma_x = \sigma_y$, then the coefficient of correlation is equal to the regression coefficients:
$r = b_{xy} = b_{yx}$
• If r = 0, $b_{xy}$ and $b_{yx}$ are both zero.
• If $b_{xy} = b_{yx}$, then each is equal to the coefficient of correlation.
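These properties can be illustrated with a short calculation. The data below are made up, and the variable names are assumptions introduced only for this sketch.

```python
import math

# Hypothetical paired observations, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

b_yx = sxy / sxx                  # regression coefficient of Y on X
b_xy = sxy / syy                  # regression coefficient of X on Y
r = sxy / math.sqrt(sxx * syy)    # coefficient of correlation

geometric_mean = math.sqrt(b_xy * b_yx)
arithmetic_mean = (b_xy + b_yx) / 2
```

With this data both coefficients are positive, `geometric_mean` equals `r`, and `arithmetic_mean` is not smaller than `r`.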
Difference between Correlation & Regression

Each point below gives the statement for correlation first, then the matching statement for regression.

1) Correlation: It simply tells the relationship between two or more variables which vary together.
   Regression: It means stepping back or returning to the average value, i.e. it tells the average
   relationship between two variables.

2) Correlation: The correlation coefficient tells the degree of relationship between two variables.
   Regression: Regression analysis aims at establishing the functional relationship between two
   variables, expressed through the regression coefficients $b_{xy}$ and $b_{yx}$.

3) Correlation: It need not imply a cause and effect relationship between two variables.
   Regression: It clearly indicates the cause and effect relationship.

4) Correlation: The correlation coefficient is a relative measure of the linear relationship between
   x and y and is independent of the unit of measurement. Its value lies between -1 and 1.
   Regression: The regression coefficients $b_{xy}$ and $b_{yx}$ are absolute measures representing the
   change in the value of one variable for a unit change in the value of the other. A regression
   coefficient is not restricted to a fixed range; only the product $b_{xy} \cdot b_{yx} = r^2$ lies between 0 and 1.

5) Correlation: There may be nonsense correlation between two variables, called spurious correlation.
   Regression: There is no such thing as nonsense regression.

6) Correlation: Correlation analysis is confined to the study of linear relationships between variables.
   Regression: Regression analysis includes linear as well as nonlinear relationships between variables.

7) Correlation: The correlation coefficient is independent of both change of scale and change of origin.
   Regression: Regression is independent of change of origin but not of scale.
Probability Theory

• The word probability literally denotes ‘Chance’.


• The theory of probability deals with laws governing the chances of occurrence of phenomenon
which are unpredictable in nature.

Definition of Probability

Probability is the likelihood or chance that a particular event will occur.

• The theory of probability provides a quantitative measure of uncertainty or likelihood of
occurrence of different events, resulting from a random experiment, in terms of quantitative
measures ranging from 0 to 1.
• This means that the probability of a certain event is 1 and the probability of an impossible
event is 0.

Basic Concepts

• A random experiment is a process which produces outcomes that cannot be predicted in
advance and are said to 'depend on chance'. E.g. if we toss a fair coin, we may obtain either
a Head or a Tail.
• The word experiment is used to describe an act which can be repeated under some given
conditions.
• Random experiments are those experiments whose results depend on chance.

The following are examples.

• Tossing a coin
• Throwing a die
• Outcome: the result of a random experiment is called an outcome.
E.g. In the random experiment of tossing a coin, there are two possible outcomes:
Head and Tail.
• Event: It is used to denote any phenomenon which occurs in a random experiment. In effect,
one or more outcomes are said to constitute an ‘event’.
• Events may be ‘elementary’ or ‘composite’.
• An event is said to be elementary if it cannot be decomposed into simpler events.
E.g. When a coin is tossed, we may speak of the events 'Head' or 'Tail', each of which is
an elementary event.
• A composite event is an aggregate of several elementary events.
E.g. When 2 coins are tossed, 'both heads' (HH) is an elementary event, but 'one
head and one tail' is a composite event.
• Two or more events are said to be dependent if the occurrence of one event influences the
occurrence of the other.
E.g. If we draw a card from a well-shuffled pack and do not replace it, then the
result of drawing a second card from the pack depends on the first event of
drawing the card and not replacing it.
• Two events are said to be independent when the occurrence of one in no way affects the
occurrence of the other.
• Two or more events are said to be mutually exclusive if the occurrence of one implies that the
other cannot occur.
• A list of events is termed collectively exhaustive when it includes all possible events
that can occur in the experiment.
• Two or more events are said to be equally likely if each has an equal chance of occurrence.
• The complement event of an event is the set of all the outcomes in a sample space that are not
included in the event.

Different Concept of Probability

• Classical (or a priori or mathematical) probability
• Statistical (or empirical or a posteriori) probability or relative frequency theory of probability
• Personalistic (subjective) view of probability
• Axiomatic approach to probability

Classical theory of probability

• The French mathematician Laplace defined probability as the ratio of the number of
favorable cases to the total number of equally likely cases.
• It is often called a priori probability because if we keep using orderly examples such as a fair
coin, unbiased dice, and standard decks of cards, we can state the answer in advance (a priori)
without tossing a coin, rolling a die, or drawing a card.
• The classical definition of probability states that if there are n exhaustive, mutually exclusive and
equally likely cases, out of which m cases are favorable to the happening of event A, then the
probability of A is defined as

$P(A) = \frac{m}{n}$

m = number of favorable cases

n = total number of equally likely cases.


• The probabilities of an event happening and not happening always add up to 1. This also
indicates that the probability of the occurrence of any event always lies between 0 and 1, i.e.
$0 \le P(E) \le 1$.
• The probability of a certain event is always equal to 1 and the probability of an impossible event is equal
to 0.
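A small worked example of the classical definition, assuming a fair six-sided die and the event "an even number turns up" (the event is chosen here only for illustration):

```python
from fractions import Fraction

# Sample space of a fair die: six exhaustive, mutually exclusive,
# equally likely cases.
sample_space = [1, 2, 3, 4, 5, 6]
favorable = [face for face in sample_space if face % 2 == 0]

m = len(favorable)       # number of favorable cases
n = len(sample_space)    # total number of equally likely cases

p_even = Fraction(m, n)  # P(A) = m / n
p_not_even = 1 - p_even  # probability of the event not happening
```

`p_even` and `p_not_even` are both 1/2, and together they add up to 1.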

Limitations

• When the cases are not equally likely.
• When the total number of cases in a trial is infinite.
• When events are not mutually exclusive.

Relative frequency theory

• If trials are repeated a great number of times under essentially the same (homogeneous and
identical) conditions, then the limit of the ratio of the number of times that an event happens to
the total number of trials, as the number of trials becomes infinitely large, is called the
probability of the happening of that event. It is assumed that the ratio approaches a finite and
unique limit.
• If, in n trials, an event A happens m times, then the probability of the happening of A is

$P(A) = \lim_{n \to \infty} \frac{m}{n}$

m = number of trials favorable to A

n = total number of trials
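The limiting idea can be illustrated by simulation. The sketch below tosses a simulated fair coin and records the relative frequency of heads for increasing numbers of trials; the seed and the trial counts are arbitrary assumptions.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

def relative_frequency_of_heads(n_trials):
    """Ratio m / n, where m is the number of heads in n simulated tosses."""
    heads = sum(1 for _ in range(n_trials) if rng.random() < 0.5)
    return heads / n_trials

# Relative frequency m / n for a growing number of trials n.
frequencies = {n: relative_frequency_of_heads(n) for n in (10, 1000, 100000)}
```

For small n the ratio can be far from 0.5, while for the largest n it settles close to 0.5, which is the sense in which the relative frequency approaches the probability.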

Subjective approach to probability

• Subjective probability is defined as the probability assigned to an event by an individual
based on whatever evidence is available. Thus a probability which is calculated on the basis of
personal experience or opinion is known as subjective probability.

Axiomatic or modern approach to probability

• Introduced by the Russian mathematician A. N. Kolmogorov in 1933.
• In this approach some concepts are laid down and certain properties or postulates,
generally called axioms, are defined; from these axioms alone the whole theory is developed
through logical deduction.
• A number P (A) is called the probability of an event A if it satisfies the following axioms.
1. P (A) lies between 0 and 1 both inclusive.
2. P (entire sample space) = 1.
3. P (A U B) = P (A) + P (B) if A and B are mutually exclusive.

These axioms are called the axioms of positiveness, certainty and union.

Theorems of probability

• Addition theorem
• Multiplication theorem

Addition theorem

• When events are mutually exclusive:
P (A or B) = P (A) + P (B)
• When events are not mutually exclusive:
P (A or B) = P (A) + P (B) – P (A and B)
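As an illustration of the addition theorem for events that are not mutually exclusive, the sketch below uses a standard 52-card pack and the events "king" and "heart" (an example chosen here, not taken from the notes). The formula is compared with a direct count.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 equally likely cards

def prob(event):
    """Classical probability: favorable cards / total cards."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_king(card):
    return card[0] == "K"

def is_heart(card):
    return card[1] == "hearts"

p_both = prob(lambda c: is_king(c) and is_heart(c))
p_formula = prob(is_king) + prob(is_heart) - p_both   # addition theorem
p_counted = prob(lambda c: is_king(c) or is_heart(c))  # direct count
```

Both routes give 16/52, i.e. 4/13. Without subtracting `p_both`, the king of hearts would be counted twice.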

Multiplication theorem

• Independent events:
P (A and B) = P (A) × P (B)
P (A, B and C) = P (A) × P (B) × P (C)
• Dependent events (conditional probability):
P (A/B) = P (A and B) / P (B), P (B) > 0.
P (B/A) = P (A and B) / P (A), P (A) > 0.
so that
P (A and B) = P (B) × P (A/B) = P (A) × P (B/A).
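A short sketch of the multiplication theorem, assuming two cards are drawn from a standard 52-card pack and both events are "the card is an ace" (an illustrative choice):

```python
from fractions import Fraction

# Dependent events: the first card is NOT replaced before the second draw.
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)  # P(B/A): one ace already removed
p_two_aces_without_replacement = p_first_ace * p_second_ace_given_first

# Independent events: the first card IS replaced, so P(B) is unchanged.
p_two_aces_with_replacement = Fraction(4, 52) * Fraction(4, 52)
```

Without replacement the result is 1/221; with replacement it is 1/169.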

Permutations

• A permutation of 'n' different objects taken 'r' at a time is an arrangement of 'r' out of the 'n'
objects with attention given to the order of arrangement.

$^{n}P_{r} = \frac{n!}{(n-r)!}$

Combinations

• A combination of 'n' different objects taken 'r' at a time is a selection of 'r' out of the 'n' objects
with no attention given to the order of arrangement.

$^{n}C_{r} = \frac{n!}{r!\,(n-r)!}$

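The two formulas can be written directly with factorials. The function names below are hypothetical and n = 5, r = 2 is an arbitrary example.

```python
import math

def permutations_count(n, r):
    """nPr = n! / (n - r)!  (order matters)."""
    return math.factorial(n) // math.factorial(n - r)

def combinations_count(n, r):
    """nCr = n! / (r! (n - r)!)  (order does not matter)."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

n, r = 5, 2
n_p_r = permutations_count(n, r)
n_c_r = combinations_count(n, r)
```

For 5 objects taken 2 at a time there are 20 arrangements but only 10 selections, since each selection of 2 objects can be arranged in 2! ways.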
Bayes Theorem

• Also called posterior probability, revised probability, inverse probability, etc.
• Introduced by Thomas Bayes in 1763.
• The theorem finds the probability of an event by taking into account given sample information.
• It is based on the proposition that probabilities should be revised when new information becomes
available.
• The need for revising probabilities arises from a need to make better use of available
information.

$P(A/B) = \frac{P(A)\,P(B/A)}{P(A)\,P(B/A) + P(\text{Not } A)\,P(B/\text{Not } A)}$

Similarly,

$P(B/A) = \frac{P(B)\,P(A/B)}{P(B)\,P(A/B) + P(\text{Not } B)\,P(A/\text{Not } B)}$

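A worked illustration of the first formula, with made-up numbers: suppose machine A makes 60% of the items and 2% of them are defective, while the other machine ("Not A") makes 40% with 5% defective. B is the event "the item is defective". All figures are assumptions for the sketch.

```python
from fractions import Fraction

p_a = Fraction(60, 100)             # prior P(A)
p_not_a = 1 - p_a                   # P(Not A)
p_b_given_a = Fraction(2, 100)      # P(B/A)
p_b_given_not_a = Fraction(5, 100)  # P(B/Not A)

# Bayes' theorem: revised (posterior) probability of A given B.
numerator = p_a * p_b_given_a
p_a_given_b = numerator / (numerator + p_not_a * p_b_given_not_a)
```

The prior probability of 3/5 is revised to 3/8 once the item is known to be defective.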
Theoretical/ Probability distribution

• Theoretical or probability distribution refers to mathematical models of expected frequencies
of a finite number of observations of a variable with associated probabilities.
• Theoretical distributions are based on mathematical functions whereas observed frequency
distributions are based on actual observed frequencies.

Usefulness of theoretical distribution

• With known parameters such as the mean and standard deviation of the population, or the number of trials
and the chance of success, the probabilities of various values of the variate can be found.
• An empirical model can be tested for goodness of fit against an available theoretical distribution. Such
testing is recommended only when a proper choice from amongst the various
theoretical distributions has been made.
• An empirical distribution based on sample observations may not fully agree with the theoretical
distribution, but if the number of experiments is increased sufficiently, the empirical
distribution may approach the theoretical distribution very closely.

Types of theoretical distributions

• Discrete
1) Binomial
2) Poisson
• Continuous
1) Normal

Binomial Distribution

• Associated with the name of James Bernoulli.
• It is a discrete probability distribution that applies in situations where there is a fixed number
of repeated trials of an experiment under identical conditions, and only one of two
mutually exclusive outcomes, success or failure, can result in each trial.
• The probability of success in any trial is the same, and the trials are independent.
• When 'p' is small (0.1), the binomial distribution is skewed to the right. As p increases, the
skewness is less noticeable. When p = 0.5, the binomial is symmetrical; when p is larger than
0.5, the distribution is skewed to the left.
• By convention, generally, the value of 'n' is kept less than or equal to 20.

Conditions under which binomial distribution is used.

• The experiment is repeated a finite and fixed number of times.
• Each trial has two outcomes, success or failure.
• The trials are independent in the sense that the outcome of one trial has no effect on the outcome
of the other trials.
• The probabilities of the outcomes do not change from trial to trial, regardless of what has happened in
the previous trials.

Characteristics of Binomial distribution

1. Distribution: Discrete probability distribution

2. Parameters: 'p' (probability of success in a single trial) and 'n' (number of trials)

3. Restriction on parameters: 'p' must be greater than zero but less than 1.

4. Mean: np

5. Variance: npq, where q = 1 - p

6. Probability function: $P(x = r) = \frac{n!}{r!\,(n-r)!}\, p^{r} (1-p)^{n-r}$
   or $P(x = r) = {}^{n}C_{r}\, p^{r} q^{n-r}$
   where p = probability of success in a single trial
   q = 1 – p
   n = number of trials
   r = number of successes in 'n' trials.

7. Expected frequency: $N \cdot P(r) = N \cdot {}^{n}C_{r}\, p^{r} q^{n-r}$
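The probability function, mean, variance and expected frequencies can be computed as in the sketch below. The values n = 6, p = 0.5 and N = 64 repetitions are illustrative assumptions.

```python
import math

def binomial_pmf(r, n, p):
    """P(x = r) = nCr * p^r * q^(n - r), with q = 1 - p."""
    return math.comb(n, r) * p ** r * (1 - p) ** (n - r)

n, p = 6, 0.5          # six tosses of a fair coin
N = 64                 # number of times the whole experiment is repeated

pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]
mean = sum(r * prob for r, prob in enumerate(pmf))
variance = sum((r - mean) ** 2 * prob for r, prob in enumerate(pmf))
expected_frequencies = [N * prob for prob in pmf]
```

The mean and variance computed from the probabilities agree with np = 3 and npq = 1.5, and the expected frequencies are 1, 6, 15, 20, 15, 6, 1.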

Properties of Binomial distribution

• As 'p' increases for a fixed 'n', the binomial distribution shifts to the right.
• As 'p' increases for a fixed 'n', both the mean and the mode increase. The mode is the integer
part of (n + 1)p; the values below are consistent with n = 6.

              if p = 0.3    if p = 0.5    if p = 0.9
Mode          2             3             6

• If 'n' is large and if neither 'p' nor 'q' is too close to zero, the binomial distribution can be
closely approximated by a normal distribution with a standardized variable given by

$Z = \frac{X - np}{\sqrt{npq}}$

• Shape of binomial distribution

Value of p          Shape of binomial distribution
1. If p = 0.5       Symmetrical
2. If p < 0.5       Skewed to right
3. If p > 0.5       Skewed to left

Poisson distribution

• Originated by Siméon Denis Poisson in 1837.
• It is used as a limiting form of the binomial distribution.
• It is a discrete probability distribution and is defined by the probability function of the Poisson
variable.
• It applies in situations where the probability of success (p) is very small and that of failure (q)
is very high, almost equal to 1.

Conditions under which Poisson is used

• Large 'n': the number of trials is very large.
• Small 'p': the probability of success is very small.
• Finite mean: np, the mean of the distribution, is finite and moderate.
• Discrete variable
• Independent trials

Characteristics of Poisson distribution.

1. Type of distribution: Discrete probability distribution

2. Parameter: m

3. Restriction on parameter: 'm' greater than zero

4. Mean: 'm'

5. Variance: 'm'

6. Probability function: $P(r) = \frac{e^{-m}\, m^{r}}{r!}$

7. Expected frequency function: $N \cdot P(r) = N \cdot \frac{e^{-m}\, m^{r}}{r!}$

8. Skewness: It is positively skewed (skewed to the right). As 'm'
increases, the distribution shifts to the right.

In practice, the Poisson distribution may be used in place of the binomial distribution where n >
20 and p < 0.1.
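The closeness of the two distributions under these conditions can be checked numerically. The sketch below assumes n = 100 and p = 0.02 (so m = np = 2); these values are chosen only for illustration.

```python
import math

def binomial_pmf(r, n, p):
    """Binomial probability of exactly r successes in n trials."""
    return math.comb(n, r) * p ** r * (1 - p) ** (n - r)

def poisson_pmf(r, m):
    """Poisson probability P(r) = e^(-m) * m^r / r!."""
    return math.exp(-m) * m ** r / math.factorial(r)

n, p = 100, 0.02   # large n, small p
m = n * p          # Poisson parameter (mean)

# Largest gap between the two probability functions for r = 0, 1, ..., 10.
max_gap = max(abs(binomial_pmf(r, n, p) - poisson_pmf(r, m)) for r in range(11))
```

For these values the two sets of probabilities differ by well under 0.01 at every r considered.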

Normal distribution

• The normal distribution, also called the normal probability distribution, is a continuous
probability distribution and happens to be the most useful theoretical distribution for continuous
variables.
• It was first discovered by De Moivre. It was also known to Laplace, and it has been credited to
Carl Friedrich Gauss.
• The normal distribution is also known as the Gaussian distribution.
• Topography of the normal distribution is given by W. J. Youden.
• The type of random variable which can take an infinite number of values is called a
continuous random variable, and the probability distribution of such a variable is called a continuous
probability distribution.
• The normal distribution is one of the most versatile continuous probability distributions.

Characteristics of Normal distribution

1. Type of distribution: Continuous distribution

2. Parameters: $\mu$ and $\sigma$

3. Restriction on parameters: $-\infty < x < +\infty$

4. Mean: $\mu$

5. Variance: $\sigma^{2}$

6. Density function in terms of the standard normal variable:
   $P(z) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2} z^{2}}$, where $Z = \frac{X - \mu}{\sigma}$
   For the standard normal variable, $\mu = 0$ and $\sigma = 1$.
Properties of Normal distribution

• The normal curve is perfectly symmetrical about the mean (Skewness = 0).
• It is bell shaped in which two halves are exactly same.
• It is unimodal, only one mode since there is only one maximum point.
• Mean = Median = Mode.
• Asymptotic base, i.e. the two tails of the curve extend indefinitely and never touch the
horizontal axis.
• Its height is maximum at the mean value.
• Area under the normal curve:

Mean ± 1$\sigma$ = 68.27%, Mean ± 2$\sigma$ = 95.45%, Mean ± 3$\sigma$ = 99.73%

Mean ± 1.96$\sigma$ = 95%, Mean ± 2.58$\sigma$ = 99%.

• Its mathematical equation is completely determined if the mean and standard deviation are known.
• Its total area is 1.
• Q1 and Q3 are equidistant from the median.
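The areas quoted above can be reproduced with the error function from the standard library. This is a minimal sketch; the helper name `area_within` is an assumption introduced for illustration.

```python
import math

def area_within(k):
    """Area under the normal curve between mean - k*sigma and mean + k*sigma."""
    return math.erf(k / math.sqrt(2))

# Percentage of the total area within k standard deviations of the mean.
areas_percent = {k: 100 * area_within(k) for k in (1, 2, 3, 1.96, 2.58)}
```

Rounded to two decimals, the first three entries are 68.27, 95.45 and 99.73, and the entries for 1.96 and 2.58 are approximately 95 and 99.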

Importance of Normal distribution.

• It has the property stated in the central limit theorem: as the sample size n increases, the distribution of
the mean of a random sample taken from any population approaches a normal distribution.
• As 'n' becomes large, the normal distribution serves as a good approximation of many discrete
distributions.

Standard normal probability distribution

• A random variable that has a normal distribution with mean ($\mu$) 0 and standard deviation
($\sigma$) 1 is said to have the standard normal distribution.
Fundamentals of Hypothesis Testing

Statistical inference

• The primary objective of sample study is to draw inferences about population by examining
only a part of the population known as sample.
• Two branches
1. Estimation: i.e. to use ‘statistic’ obtained from the sample as estimate of the
unknown parameter of population from which the sample is drawn.
2. Test of hypothesis: i.e. to test some hypothesis about parent population from
which the sample is drawn.

Formulating Hypothesis

• As researchers, we do not know the exact outcome but have a hunch about it, and thus we make
some prediction about the outcome. This hunch or prediction about the outcome is called a
hypothesis.
• It may also be termed an educated guess or assumption about some phenomenon.

Features

• It is a tentative proposition.
• The validity of the hypothesis is unknown.
• In most cases, formulating hypothesis specifies the logical relationship between two variables.
• It must be generalizable.
• It should be simple.

Hypothesis

• A statistical hypothesis is some assumption or statement which may or may not be true, about a
population or equivalently about the probability distribution characterizing the given
population which we want to test on the basis of the evidence from a random sample.

Null Hypothesis

• The statement that a population parameter is equal to a specific value, or that the population
parameters from two or more groups are equal.
• The null hypothesis always expresses an equality, either between a population parameter and a
specific value or between two or more population parameters, and is always paired with
another statement, the alternative hypothesis. We use the symbol $H_0$.
• A null hypothesis is considered true until evidence indicates otherwise. If you can conclude
that the null hypothesis is false, then the Alternative Hypothesis must be true.

Alternative Hypothesis

• The statement paired with a null hypothesis that is mutually exclusive to the null hypothesis.
• Use the symbol H1 to identify the alternative hypothesis and write an alternative hypothesis
using either a not-equal sign or less than or greater than sign, along with the symbol for the
population parameter.
• The alternative hypothesis represents the conclusion reached by rejecting the null hypothesis.
You reject the null hypothesis if evidencing from the sample statistic indicates that the null
hypothesis is unlikely to be true.

Level of significance

• The confidence with which a null hypothesis is accepted or rejected depends on what is called the
significance level. The probability of rejecting the null hypothesis when it is true
is called the level of significance.
• So when the level of significance is 0.05, it means that in the long run the statistician is rejecting
a true null hypothesis 5 times out of every 100 times.
• Therefore the level of significance is the risk a statistician is running in this decision.

Degree of Freedom

• Degrees of freedom is defined as the number of independent observations, obtained by
subtracting the number of constraints from the total number of observations.

Test of Hypothesis (statistical test)

• A statistical test of hypothesis is a process or procedure under which a statistical hypothesis is
laid down and is accepted or rejected on the basis of a sample drawn from the population.
• The tests conducted to accept or reject the hypothesis are known as statistical tests of
hypothesis.
• Common tests are the Z-test, t-test, F-test, Chi-square test, etc.
Procedure for testing hypothesis

• Set up a null and an alternative hypothesis.
• Specify a suitable level of significance.
• Decide the statistical test.
• Calculate the value of the test statistic.
• Obtain the table value corresponding to the level of significance and the degrees of freedom.
• Make a decision about accepting or rejecting the null hypothesis.

Decision-Making Risks

• In hypothesis testing, you always face the possibility that either you will wrongly reject the
null hypothesis or wrongly not reject the null hypothesis.
• These possibilities are labeled type I and type II errors, respectively.

Type I Error

• The Error that occurs if the null hypothesis H0 is rejected when in fact it is true and should not
be rejected.
• The risk, or probability, of a type I error occurring is identified by the Greek lowercase
alpha,𝜶𝜶. Alpha is also known as the level of significance of the statistical test. Traditionally,
you control the probability of a type I error by deciding the risk level 𝜶𝜶 you are willing to
tolerate of rejecting the null hypothesis when it is true. Because you specify the level of
significance before performing the hypothesis test, the risk of committing a type I error is
directly under your control.
• The most common values are 0.01, 0.05 and 0.10, and researchers traditionally select a value
of 0.05 or smaller.
• When you specify the value for 𝜶𝜶, you determine the rejection region, and using the
appropriate sampling distribution, the critical value or values that divide the rejection and non-
rejection regions are determined.

Type II Error

• The error that occurs if the null hypothesis H0 is not rejected when in fact it is false and should
be rejected.
• The risk, or probability, of a type II error occurring is identified by the Greek lowercase beta,𝜷𝜷.
The probability of a type II error depends on the size of the difference between the value of the
population parameter stated in the null hypothesis and the actual population value.
• The arithmetic complement of beta, 1- 𝜷𝜷, is known as the power of the test and represents the
probability of rejecting the null hypothesis when it is false and should be rejected.
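                                   Reality
Statistical decision      H0 true                     H1 true

H0 (do not reject H0)     1 - $\alpha$ (correct)      $\beta$ (Type II error)

H1 (reject H0)            $\alpha$ (Type I error)     1 - $\beta$ (power of the test)

Note: - $\alpha$ = Probability of a Type I Error

$\beta$ = Probability of a Type II Error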

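To see how the two risks relate, the sketch below computes the probability of a Type II error and the power for a one-sided Z-test. Every number in it (null mean 100, true mean 105, standard deviation 15, sample size 36, level of significance 0.05) is a made-up assumption for illustration.

```python
from statistics import NormalDist

mu_null, mu_true = 100, 105   # value under H0 and the assumed actual mean
sigma, n = 15, 36             # population standard deviation, sample size
alpha = 0.05                  # chosen level of significance (Type I risk)

standard_error = sigma / n ** 0.5

# Critical sample mean: reject H0 when the sample mean exceeds this value.
z_critical = NormalDist().inv_cdf(1 - alpha)
critical_mean = mu_null + z_critical * standard_error

# beta: probability of NOT rejecting H0 when the true mean is mu_true.
beta = NormalDist(mu_true, standard_error).cdf(critical_mean)
power = 1 - beta
```

With these figures the power comes out at roughly 0.64, so a false null hypothesis of this size would be missed in about a third of samples.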
Parametric Tests

T test
• The test was developed by William Sealy Gosset, who published under the pseudonym 'Student'. If we
want to compare values obtained from two different groups, and if the groups are independent of
each other and the data are normally distributed in each group, then the t-test can be used.

Properties of t-distribution

1. Like the standard normal distribution, the 't' distribution is symmetrical and has mean zero.
2. The variance of the 't' distribution is greater than one but approaches one as the number of degrees
of freedom, and therefore the sample size, becomes large.
3. The variable 't' ranges from minus infinity to plus infinity.

Uses of ‘t’ distribution

• It is used in many tests of hypothesis.
• It is used to test a given population mean when the sample is small.
• It is used to test whether two samples have the same mean when the samples are small.
• It is used to test the significance of the mean of a random sample.
• It is used to test the difference between the means of two dependent samples.
• It is used to test the significance of an observed correlation coefficient.

One sample t- test


• The one sample t-test is used for comparing sample results with a known value. It is used to
compare the mean of a single sample with the population mean.
• The purpose of the one-sample t-test is to determine whether there is sufficient evidence to
conclude that the mean of the population from which the sample is taken is different from the
specific value.

$t = \frac{\bar{X} - \mu}{s}\sqrt{n}$, where $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$

$\bar{X}$ = sample mean
$\mu$ = population mean
n = sample size
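A minimal sketch of the one sample t statistic, using made-up observations and a hypothesized population mean of 50:

```python
import math

sample = [52, 48, 51, 53, 50, 49, 54, 51]  # hypothetical observations
mu = 50                                    # hypothesized population mean

n = len(sample)
sample_mean = sum(sample) / n
# Sample standard deviation with n - 1 in the denominator.
s = math.sqrt(sum((x - sample_mean) ** 2 for x in sample) / (n - 1))

t = (sample_mean - mu) / s * math.sqrt(n)
degrees_of_freedom = n - 1
```

Here the sample mean is 51 and s is 2, so t is about 1.41 with 7 degrees of freedom. The value is then compared with the table value of t at the chosen level of significance.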

Independent samples t-test

• If two independent random samples of sizes $n_1$ and $n_2$ with means $\bar{X}_1$, $\bar{X}_2$ and standard deviations
$s_1$ and $s_2$ are given, then the 't' test is used to test the hypothesis that the samples come from the
same normal population. The test statistic is

$t = \frac{\bar{X}_1 - \bar{X}_2}{S} \times \sqrt{\frac{n_1 n_2}{n_1 + n_2}}$

where S is the combined standard deviation of the two samples.

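The statistic can be sketched as follows. The notes do not spell out how S is computed; this example assumes the usual pooled standard deviation with $n_1 + n_2 - 2$ in the denominator, and the two samples are made up.

```python
import math

sample_1 = [10, 12, 14, 16, 18]  # hypothetical data
sample_2 = [9, 11, 13, 15, 12]   # hypothetical data

n1, n2 = len(sample_1), len(sample_2)
mean_1 = sum(sample_1) / n1
mean_2 = sum(sample_2) / n2

# Assumption: S is the pooled standard deviation of the two samples.
ss_1 = sum((x - mean_1) ** 2 for x in sample_1)
ss_2 = sum((x - mean_2) ** 2 for x in sample_2)
pooled_sd = math.sqrt((ss_1 + ss_2) / (n1 + n2 - 2))

t = (mean_1 - mean_2) / pooled_sd * math.sqrt(n1 * n2 / (n1 + n2))
degrees_of_freedom = n1 + n2 - 2
```

With these numbers the means are 14 and 12, and t is about 1.15 on 8 degrees of freedom.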
Paired sample t-test (Dependent sample)

• Two samples are said to be dependent when the elements in one sample are related to those in
the other in any significant or meaningful manner. The 't' test based on paired observations is
expressed as follows.

$t = \frac{\bar{d}\,\sqrt{n}}{S}$, where $S = \sqrt{\frac{\sum d^{2} - n(\bar{d})^{2}}{n - 1}}$

$\bar{d}$ = the mean of the differences
S = standard deviation of the differences

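A short sketch of the paired calculation, using hypothetical before and after scores for the same five subjects:

```python
import math

before = [70, 68, 75, 80, 72]  # hypothetical scores
after = [74, 70, 78, 81, 77]   # same subjects, measured again

differences = [a - b for a, b in zip(after, before)]
n = len(differences)
mean_d = sum(differences) / n

# S = sqrt((sum(d^2) - n * mean_d^2) / (n - 1))
s = math.sqrt((sum(d ** 2 for d in differences) - n * mean_d ** 2) / (n - 1))

t = mean_d * math.sqrt(n) / s
degrees_of_freedom = n - 1
```

The differences are 4, 2, 3, 1 and 5, giving a mean difference of 3 and t of about 4.24 on 4 degrees of freedom.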
Difference between large samples and small samples


• If a sample consists of more than 30 items, it is called a large sample; otherwise it is treated as a
small sample.

Assumptions

• The sampling distribution of a statistic is approximately normal.
• Sample values are close to the parameter values.

Z-test

$Z = \frac{\text{Difference between means}}{SE} = \frac{\bar{X} - \mu}{SE_{\bar{x}}}$

F-test

• The F-test is named after R. A. Fisher. The F-test is used to determine whether two independent
estimates of population variance differ significantly, or to establish
whether both samples have come from the same universe. For this purpose the ratio of the variances
given by the two samples is used. This ratio is called the F-ratio.

$F = \frac{S_1^{2}}{S_2^{2}}$, i.e. $F = \frac{\text{Greater variance}}{\text{Smaller variance}}$

$S_1^{2} = \frac{\sum (X_1 - \bar{X}_1)^{2}}{n_1 - 1}$, $S_2^{2} = \frac{\sum (X_2 - \bar{X}_2)^{2}}{n_2 - 1}$

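The F-ratio can be sketched as below with two made-up samples. The helper name `sample_variance` is an assumption introduced for illustration.

```python
def sample_variance(values):
    """S^2 = sum((X - mean)^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

sample_1 = [10, 12, 14, 16, 18]  # hypothetical data
sample_2 = [9, 11, 13, 15, 12]   # hypothetical data

var_1 = sample_variance(sample_1)
var_2 = sample_variance(sample_2)

# The greater variance goes in the numerator, so F is never below 1.
f_ratio = max(var_1, var_2) / min(var_1, var_2)
```

Here the variances are 10 and 5, so the F-ratio is 2, to be compared with the table value of F for (4, 4) degrees of freedom.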
Assumption

1. The population for each sample is normally distributed, with the same mean and
variance.
2. All observations are randomly selected.
3. The ratio of $\sigma_1^{2}$ to $\sigma_2^{2}$ should be equal to or greater than 1.
4. The F distribution is always formed by a ratio of squared values. Therefore it can never be a
negative number.

ANOVA

• Analysis of variance is a statistical method used to test whether the means of several samples
differ significantly or not.
• The statistical test used in ANOVA is the F-test.
• In one-way ANOVA, the hypothesis is tested by partitioning the total variation
in the sample data into two components, namely variance between the samples
and variance within the samples.
• Variance between the samples is attributed to differences among the sample means. This
variance is due to some assignable causes.
• Variance within the samples is attributed to chance or experimental error.

Assumptions

• Normality
• Homogeneity
• Independence of error

ANOVA Table

• In ANOVA, the F value is obtained by dividing the treatment variance (MSC) by the error
variance (MSE).

$F = \frac{MSC}{MSE}$

Source of variation    Sum of squares    Degrees of freedom    Mean squares               F value
Between columns        SSC               k – 1                 MSC = SSC / (k – 1)        F = MSC / MSE
Within columns         SSE               n – k                 MSE = SSE / (n – k)
Total                  SST               n – 1
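The entries of the table can be computed as in the following sketch, which uses three small made-up samples (columns) of equal size.

```python
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical samples (columns)

k = len(groups)                  # number of samples
n = sum(len(g) for g in groups)  # total number of observations
grand_mean = sum(sum(g) for g in groups) / n
group_means = [sum(g) / len(g) for g in groups]

# Sum of squares between columns, within columns, and total.
ssc = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
sst = sum((x - grand_mean) ** 2 for g in groups for x in g)

msc = ssc / (k - 1)
mse = sse / (n - k)
f_value = msc / mse
```

For this data SSC = 26, SSE = 6 and SST = 32, so MSC = 13, MSE = 1 and F = 13 with (2, 6) degrees of freedom. SSC and SSE add up to SST, which is the partition of total variation described above.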

Non Parametric test

CHI SQUARE TEST

• The Chi-square test is commonly used as a non-parametric test.
• It tests the significance of the difference between the observed frequencies and the theoretical
frequencies of a distribution.
• Developed by Karl Pearson in 1900.
Purpose

• Chi-square describes the magnitude of the difference between theory and observation.
• Useful to find whether two or more attributes are associated or not.
• Used to ascertain how appropriately the theoretical distributions fit empirical distributions.
• It is designed to determine whether two or more independent random samples are drawn from
the same population or not.

Conditions for applying Chi square test

• The total frequency (N) must be reasonably large, at least 50.
• Any expected frequency of less than 5 is pooled with the preceding or succeeding frequency so
that no expected frequency is less than 5; the degrees of freedom are then based on the resulting
number of frequencies.
• The data should not be proportions or percentages. They should be frequencies in original units.

Types of Chi-square tests

1. Chi-square test of goodness of fit
• If we have a set of observations obtained through an experiment and we are
interested in knowing whether these values are consistent with the values which
may be obtained under some theoretical distribution (Binomial, Poisson, or
Normal), we use the chi-square test of goodness of fit.
• This test is applied when there is one categorical variable from a single
population. It is used to determine whether sample data are consistent with a
hypothesized distribution.

$\chi^{2} = \sum \frac{(O - E)^{2}}{E}$

O = observed frequencies

E = expected frequencies

2. Test of Independence
• This Chi-square test is used to find out whether two or more attributes are
associated or not. It is applied when there are two categorical variables from a
single population. This test is also called the chi-square test of association.
3. Test of Homogeneity
• It is an extension of the test of independence. Here we have more than one
sample unlike the test of independence where there is one sample.
• To test whether two or more independent random samples are drawn from the
same population or from different population.
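As a sketch of the goodness of fit test, suppose a die is rolled 60 times and we ask whether the observed counts are consistent with a fair die. The observed counts are made up; the statistic used is the usual sum of (O - E)^2 / E over the classes.

```python
observed = [8, 12, 9, 11, 10, 10]  # hypothetical counts for faces 1 to 6
total = sum(observed)              # N = 60, above the minimum of 50

# Under the hypothesis of a fair die, each face is expected N / 6 times.
expected = [total / len(observed)] * len(observed)

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1
```

Every expected frequency is 10, so no pooling is needed, and the statistic works out to 1.0 on 5 degrees of freedom. It is then compared with the table value of chi-square at the chosen level of significance.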

Contingency Table

• A contingency table is a frequency table in which a sample from the population is classified
according to two attributes, each divided into two or more classes. When there are only two
divisions for each attribute, the contingency table is known as a 2 × 2 contingency table.

            Graduates    Non-Graduates
Male        50           40
Female      20           10

Degree of freedom

• When the observed frequencies are given in the shape of contingency table, the degree of
freedom is (r-1) (c-1).
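Taking the 2 × 2 table of graduates and non-graduates shown earlier, the test of independence can be sketched as follows. Expected frequencies are computed as (row total × column total) / grand total, which is the usual rule and is assumed here since the notes do not state it.

```python
from fractions import Fraction

# Rows: Male, Female. Columns: Graduates, Non-Graduates.
table = [[50, 40],
         [20, 10]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

chi_square = Fraction(0)
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        # Expected frequency if the two attributes were independent.
        expected = Fraction(row_totals[i] * col_totals[j], grand_total)
        chi_square += (observed - expected) ** 2 / expected

degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
```

The statistic is 8/7 (about 1.14) with (2 - 1)(2 - 1) = 1 degree of freedom.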

Non- parametric tests

• Run test
• Mann-Whitney U test
• Wilcoxon signed rank test
• Kruskal-Wallis H test
• Friedman test
• Spearman’s rank correlation.

Run test

• The Run test procedure tests whether the order of occurrence of values of variables in a sample
is random. A run is a sequence of like observation.
• The procedure first classifies each value of variable as falling above or below a cut point and
then test to ensure that there is no order to the resulting sequence.
• The cut point is based either on a measure of central tendency or a custom value.

Mann-Whitney U test

• This is the non-parametric equivalent of the independent samples t-test. It is designed to test whether
two independent samples are from the same population.
• ‘U’, is the number of times a value in the first group precedes a value in the second group,
when values are sorted in ascending order.
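The counting definition of U can be sketched directly. The two groups below are made up and contain no tied values; handling of ties is not covered in this sketch.

```python
group_1 = [1, 4, 6]     # hypothetical first sample
group_2 = [2, 3, 5, 7]  # hypothetical second sample

# U: number of (first, second) pairs in which the first-group value
# precedes (is smaller than) the second-group value.
u_1 = sum(1 for a in group_1 for b in group_2 if a < b)

# The complementary count, with the roles of the groups swapped.
u_2 = sum(1 for a in group_1 for b in group_2 if b < a)
```

Here `u_1` is 7 and `u_2` is 5; with no ties the two counts add up to the number of pairs, 3 × 4 = 12.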

Wilcoxon Signed Rank Test

• This test is the non-parametric equivalent of the paired sample t-test. It is used to test two related
samples or repeated measurements on a single sample.
• The test statistic is based on the ranks of the absolute values of the difference between the two
variables.

Kruskal-Wallis H Test

• This test is the non-parametric equivalent of the one-way ANOVA. It is used to test whether several
independent samples are from the same population.
• Kruskal-Wallis test is a one-way analysis of variance by ranks.

Friedman test

• The Friedman test is the non-parametric alternative to the one-way ANOVA with repeated
measures. It is used to test for differences between two or more groups when the dependent
variable being measured is ordinal.

Spearman’s rank correlation

• It is the non-parametric equivalent of the Pearson correlation coefficient. The test is designed to
determine the degree of association between two variables.

Goal                                        Normally distributed continuous data    Ordinal or non-normal data
Describe one group                          Mean, SD                                Median, interquartile range
Compare one group to a hypothetical value   One sample t-test                       Wilcoxon test
Compare 2 unpaired groups                   Independent sample t-test               Mann-Whitney U test
Compare 2 paired groups                     Paired sample t-test                    Wilcoxon test
Compare 3 or more unmatched groups          One-way ANOVA                           Kruskal-Wallis H test
Compare 3 or more matched groups            Repeated measures ANOVA                 Friedman test
Quantify association between 2 variables    Pearson correlation                     Spearman rank correlation
Predict values from another measured        Simple linear regression                Non-parametric regression
variable
