Introduction to Advanced Statistics
LESSON 1
PREVIEW OF STATISTICS
One of the peculiar characteristics that set man apart from other animals is his
capacity to capitalize on the achievements of his ancestors. Statistics offers a striking
example of man's attempt to employ the records of his past in solving the problems of the
present and plotting the course of the future.
There is a peculiar logic associated with the study of statistics, but ultimately it is
really common sense. Although it is difficult to dissociate the purely mathematical
aspects from statistical activities, there is no reason why it should involve complex and
sophisticated mathematical skills. A basic background in mathematics is sufficient.
Moreover, with the availability of calculators, computational mistakes, where most of
the errors lie, have largely been eliminated.
Statistics may be defined as the science that deals with the collection,
organization, presentation, analysis and interpretation of collections of data in order to
draw judgments or conclusions that help in the decision-making process.
There are two main divisions of statistics. These are: Descriptive Statistics and
Inferential Statistics.
A variable is a quantity that may assume any of a set of values. Examples are
monthly income, average grade, volume, price and so forth.
A constant is a quantity that does not change its value. For example, the
mathematical symbol π (the Greek letter pi) is a constant because its value does not
change; it is always approximately 3.1416. Likewise, the equivalence of an inch to 2.54
centimeters is constant.
Ungrouped Data are data which are not organized in any specific way. They are
simply the collection of data as they are gathered.
Statistic is any statistical characteristic of a sample such as the Mean and Standard
Deviation. The Sample Mean is a sample statistic.
Knowing what the present situation is and what is happening, as derived from the mass of
information gathered, is useful in many fields of human endeavor such as
business, medicine, politics, and others. But the ultimate utility of statistics is as an aid in
planning and decision making. The businessman may want to know how he fares in
relation to his competitors, what goods are most salable, and how many shoes of which sizes
to order in anticipation of a peak season. These he can decide with the aid of statistics.
Some notable business tycoons insist that they rely more on gut feeling, intuition and hunches,
but it is doubtful whether they are really sincere about this; for all we know, they may have
unconsciously digested statistical information in their heads and analyzed it.
In fact, in the field of medicine many discoveries have been made, and causes and
cures found, by the experimental method, which relies heavily on statistical formulas.
Statistics can handle only quantitative data, that is, numbers. The quantity may be
derived from counting, which gives exact values, or from measurement, which
gives approximate values because a measured quantity can never be exact.
The values may be continuous (allowing values in between) or dichotomous
(allowing only two categories: absent-present; wrong-correct; male-female). In
several instances the study will involve qualitative or descriptive data, which by
themselves cannot be subjected to mathematical treatment. The procedure is to translate
these qualitative data into quantitative data by assigning weights or values to the
corresponding quality or description. For example: Always = 4; Often = 3; Seldom = 2;
Never = 1; or Excellent = 5; Very Good = 4; Good = 3; Fair = 2; and Poor = 1.
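The translation of descriptive answers into weights can be sketched in a few lines of Python. This is only an illustration: the weights come from the scale quoted above, while the names `FREQUENCY_SCALE`, `encode` and the five sample answers are assumptions made for the example.

```python
# Weights from the scale in the text; the dictionary name is illustrative.
FREQUENCY_SCALE = {"Always": 4, "Often": 3, "Seldom": 2, "Never": 1}

def encode(responses, scale):
    """Translate qualitative answers into their assigned numeric weights."""
    return [scale[answer] for answer in responses]

# Hypothetical answers from five respondents (not data from the text).
answers = ["Always", "Seldom", "Often", "Never", "Often"]
weights = encode(answers, FREQUENCY_SCALE)

# Once coded, the answers can be treated mathematically, e.g. averaged.
weighted_mean = sum(weights) / len(weights)
print(weights, weighted_mean)
```

Once the answers are coded this way, any of the quantitative treatments discussed in these notes can be applied to them.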
1. central tendencies
2. degree of variability, dispersion or scatter
3. proportions and percentages
4. trends and drifts and tendencies
5. skewness, kurtosis
6. degree of reliability
7. variations, fluctuations, cycles
These can be further used for projections, predictions and prognosis applying the
theories of regression analysis, which in turn can be the basis of decision making.
Textual Form. Findings are described and verbally explained, although figures
may be cited in the text.
Tabular Form. This method uses statistical tables in presenting data.
The table consists of a number of columns with headings and several
rows of figures. Usually, the three basic columns are those of
category, quantity and the percentage.
Graphical Form. This method usually goes together with the tabular
presentation of data. The graphs employed are usually a combination
of the bar graphs, pictographs and pie charts.
Classification of Data
These data are best recorded in tabular form for easy reference and treatment.
LESSON 2
Data collected from different sources are usually unorganized and in a
form unsuitable for immediate interpretation. In any statistical investigation, once the
pertinent data have been gathered, the next step is to present them in organized form
using appropriate tables and graphs.
Steps:
2. Get the value of the range. The range, denoted by R, is the difference
between the highest and the lowest value in the distribution.
3. Determine the interval size by dividing the range by the desired number of classes.
Divide the range by 10 and by 20 so that the number of classes is not less
than 10 and not more than 20, provided that the classes cover the total range of the
observations. This will meet the requirements of most sets of data. The rule says
that we should prefer not less than 10 and not more than 20 classes, and the ideal
number of classes is between 10 and 15, inclusive. In choosing the class interval, an odd
number is preferable.
4. Determine the class limits of the class intervals. The bottom interval must include the
lowest score.
5. Tally the frequencies for each class interval. The tally should be carefully checked to see
that the sum is equal to the total number of scores (cases). At the bottom of Column F, write
the symbol N or ∑f, in which ∑ (capital Greek sigma) stands for "the sum of", that is, the
total number of cases (N).
6. Get the sum of the frequency column and check it against the total number of
observations or cases.
Illustrative Example:
80 85 55 75 61 64 66 89
77 56 53 72 82 57 70 96
76 54 60 84 77 52 62 95
75 84 88 59 75 84 65 87
60 63 76 62 92 72 90 92
Solution:
Range = Highest Score – Lowest Score
Range = 96 – 52
      = 44
Interval size: i = 44 / 16 = 2.75, or 3
_______________________________________________________
Class Interval     Tally      Frequency (f)     Cf<
_______________________________________________________
95-97              II              2             40
92-94              I               1             38
89-91              II              2             37
86-88              II              2             35
83-85              IIII            4             33
80-82              II              2             29
77-79              II              2             27
74-76              IIIII           5             25
71-73              III             3             20
68-70              I               1             17
65-67              II              2             16
62-64              IIII            4             14
59-61              IIII            4             10
56-58              II              2              6
53-55              III             3              4
50-52              I               1              1
_______________________________________________________
Total                           n = 40
_______________________________________________________
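The steps for building the frequency distribution can be sketched in Python as follows. This is a minimal sketch, assuming 16 classes and a bottom class starting at 50 as in the solution; the function name `frequency_distribution` and its parameters are illustrative. Run on the 40 scores exactly as printed, the tally gives f = 2 for both 71-73 and 92-94, whereas the table shows 3 and 1, so one of the printed scores may be a misprint.

```python
import math

# The 40 scores of the illustrative example, as printed.
scores = [
    80, 85, 55, 75, 61, 64, 66, 89,
    77, 56, 53, 72, 82, 57, 70, 96,
    76, 54, 60, 84, 77, 52, 62, 95,
    75, 84, 88, 59, 75, 84, 65, 87,
    60, 63, 76, 62, 92, 72, 90, 92,
]

def frequency_distribution(data, n_classes, start):
    """Return (width, rows); rows are (lower, upper, f, cumulative f), bottom class first."""
    value_range = max(data) - min(data)          # R = highest - lowest
    width = math.ceil(value_range / n_classes)   # round the interval size up
    rows, running = [], 0
    lower = start
    while lower <= max(data):
        upper = lower + width - 1
        f = sum(lower <= x <= upper for x in data)  # tally for this class
        running += f
        rows.append((lower, upper, f, running))
        lower += width
    return width, rows

width, table = frequency_distribution(scores, n_classes=16, start=50)

# Print from the top class down, as in the table above.
for lower, upper, f, cf in reversed(table):
    print(f"{lower}-{upper}  f={f}  Cf<={cf}")
```

The last cumulative frequency should equal the number of cases, which is the check described in step 6.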
LESSON 3
There are three measures of central tendency, namely, the Mean, the Median
and the Mode.
The Mean
The mean is the most frequently used measure of central tendency because it is
subject to less error; it is rigidly defined; and it is also easily calculated. Moreover, it
lends itself to algebraic manipulation; its standard error is less than that of the median; and
the sum of the deviations of the cases about the mean is zero.
The mean of ungrouped data is determined as the sum of all scores divided by the
number of cases. Consider the following scores: 8, 5, 10, 9, 7, 10, 8, 11, 15, 14, 10, 5, 4,
19, and 7. The mean of these scores is 9.47.
In general, if the scores are represented by the symbols X1, X2, X3, . . . XN, the
mean in algebraic language is
\[ \bar{X} = \frac{\sum X}{N} \]
where
X̄ = mean
∑X = sum of all the scores
N = number of cases
 X
 8
 5
10
 9
 7
10
 8
11
15
14
10
 5
 4
19
 7
---
142
∑X = 142
N = 15
\[ \bar{X} = \frac{\sum X}{N} = \frac{142}{15} = 9.47 \]
The mean is 9.47. It is well to note that ∑X is equal to N × X̄. This information is useful
in a variety of situations.
Moreover, data are generally treated as ungrouped when the number of cases is less than 30.
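The computation of the ungrouped mean can be checked with a short Python sketch. It uses the 15 scores listed in the X column above; the variable names are illustrative.

```python
# The 15 scores from the X column of the worked example.
scores = [8, 5, 10, 9, 7, 10, 8, 11, 15, 14, 10, 5, 4, 19, 7]

total = sum(scores)   # sum of all scores
n = len(scores)       # number of cases
mean = total / n      # mean = sum of scores / number of cases

print(total, n, round(mean, 2))
```

Multiplying the mean back by the number of cases recovers the sum of the scores, which is the relationship noted in the text.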
There are two methods of determining the mean of grouped data, namely, (a)
the Midpoint Method, and (b) the Class-Deviation Method.
Steps:
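1. Get the midpoint (M) of each class interval from its lower limit (LL) and upper limit (UL):
\[ \text{Midpoint} = \frac{LL + UL}{2} \]
2. Multiply the Midpoint by the Frequency.
3. Sum the products of Midpoints times Frequencies.
4. Divide the sum by the total number of cases (N) to obtain the mean.
5. Apply the formula
\[ \bar{X} = \frac{\sum fM}{N} \]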
_______________________________________________________
Class Interval     Midpoint (M)      f          fM
_______________________________________________________
95-97                  96            2          192
92-94                  93            1           93
89-91                  90            2          180
86-88                  87            2          174
83-85                  84            4          336
80-82                  81            2          162
77-79                  78            2          156
74-76                  75            5          375
71-73                  72            3          216
68-70                  69            1           69
65-67                  66            2          132
62-64                  63            4          252
59-61                  60            4          240
56-58                  57            2          114
53-55                  54            3          162
50-52                  51            1           51
_______________________________________________________
Total                             n = 40       2904
_______________________________________________________
\[ \bar{X} = \frac{\sum fM}{N} = \frac{2904}{40} = 72.6 \]
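The Midpoint Method can be sketched in Python using the class limits and frequencies of the table above. The variable names are illustrative, and the classes are listed from the bottom interval (50-52) upward.

```python
# Lower class limits (bottom class first) and frequencies from the table.
width = 3
lower_limits = list(range(50, 96, width))   # 50, 53, ..., 95
frequencies = [1, 3, 2, 4, 4, 2, 1, 3, 5, 2, 2, 4, 2, 2, 1, 2]

# Step 1: midpoint = (lower limit + upper limit) / 2
midpoints = [(ll + (ll + width - 1)) / 2 for ll in lower_limits]

# Steps 2 and 3: multiply each midpoint by its frequency and sum the products.
sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))

# Step 4: divide by the total number of cases.
n = sum(frequencies)
grouped_mean = sum_fm / n

print(sum_fm, n, grouped_mean)
```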
Steps:
1. Choose a temporary arbitrary origin from any of the class intervals, either
at the center, at the bottom or at the top.
2. Assign coded values to the class intervals, starting with zero at the origin
and with positive values above the zero and negative values below.
This deviation appears in Column d.
\[ \bar{X} = M_0 + C \left( \frac{\sum fd}{N} \right) \]
where M0 is the midpoint of the class chosen as origin and C is the class interval size.
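________________________________________________________________
Class Interval     M       f       d       fd
________________________________________________________________
95-97              96      2       7       14
92-94              93      1       6        6
89-91              90      2       5       10
86-88              87      2       4        8
83-85              84      4       3       12
80-82              81      2       2        4
77-79              78      2       1        2      (sum of positive fd = 56)
74-76              75      5       0        0
71-73              72      3      -1       -3
68-70              69      1      -2       -2
65-67              66      2      -3       -6
62-64              63      4      -4      -16
59-61              60      4      -5      -20
56-58              57      2      -6      -12
53-55              54      3      -7      -21
50-52              51      1      -8       -8      (sum of negative fd = -88)
________________________________________________________________
Total                     40              -32
________________________________________________________________
M0 = 75, C = 3, N = 40
\[ \bar{X} = M_0 + C \left( \frac{\sum fd}{N} \right) = 75 + 3 \left( \frac{-32}{40} \right) = 75 + \frac{-96}{40} = 72.6 \]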
The mean value of 72.6 is the same whether the Midpoint Method or the Class-Deviation
Method is used.
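The Class-Deviation Method can be sketched the same way. The sketch below assumes the origin is the class 74-76 (midpoint 75), as in the worked solution; the variable names are illustrative.

```python
# Lower class limits (bottom class first) and frequencies from the table.
width = 3                                   # C, the class interval size
lower_limits = list(range(50, 96, width))   # 50, 53, ..., 95
frequencies = [1, 3, 2, 4, 4, 2, 1, 3, 5, 2, 2, 4, 2, 2, 1, 2]

origin = 75                                 # M0, midpoint of the origin class 74-76

# Midpoints are whole numbers here: (LL + UL) / 2 with UL = LL + 2.
midpoints = [(ll + (ll + width - 1)) // 2 for ll in lower_limits]

# Coded deviation d: number of class steps above (+) or below (-) the origin.
deviations = [(m - origin) // width for m in midpoints]

sum_fd = sum(f * d for f, d in zip(frequencies, deviations))
n = sum(frequencies)
mean = origin + width * sum_fd / n

print(sum_fd, mean)
```

Any other class could serve as the origin; the coded deviations change but the resulting mean does not.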
The Computation of the Median
Another measure of central tendency that is commonly used is the median. The
median is defined as the point on a scale above and below which 50 percent of
the cases lie. It may or may not coincide with an actual score.
a) The median from ungrouped data. The median of a set of ungrouped data is
obtained by arranging the scores from highest to lowest and picking out the middlemost
score if the number of scores is odd. When the number of scores is even, the
median is obtained by computing the midpoint of the two middle scores. To illustrate,
consider the set of scores below:
97
95
92
90 Median
88
85
80
In the above example, there are 7 scores. Locate a point such that 3 scores fall
above the median and 3 scores below. Thus, the median is 90.
When the number of scores is even, compute the median by getting the average of the two
middlemost scores in order of magnitude. To illustrate, consider the even
set of scores below:
85
84
82
80
75
73
72
70
\[ \text{Median} = \frac{80 + 75}{2} = \frac{155}{2} = 77.5 \]
The above set contains 8 scores. The median lies between 80 and 75. Thus, 80 plus 75 equals
155, which divided by 2 equals 77.5. Hence, the median of the foregoing set of scores is 77.5.
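Both cases, odd and even, can be handled by one small Python function. This is a sketch of the procedure described above; the function name `median_ungrouped` is illustrative.

```python
def median_ungrouped(scores):
    """Median of ungrouped scores, arranged from highest to lowest."""
    ordered = sorted(scores, reverse=True)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of scores: the middlemost score is the median.
        return ordered[mid]
    # Even number of scores: average the two middlemost scores.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_set = [97, 95, 92, 90, 88, 85, 80]
even_set = [85, 84, 82, 80, 75, 73, 72, 70]
print(median_ungrouped(odd_set), median_ungrouped(even_set))
```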
∑Cf< = the "less than" cumulative frequency up to, but not including,
the median class
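_______________________________________________________
Class Interval        f        Cf<
_______________________________________________________
95-97                 2        40
92-94                 1        38
89-91                 2        37
86-88                 2        35
83-85                 4        33
80-82                 2        29
77-79                 2        27
74-76                 5        25
71-73                 3        20      (median class, Fc = 3)
68-70                 1        17      (∑Cf< = 17)
65-67                 2        16
62-64                 4        14
59-61                 4        10
56-58                 2         6
53-55                 3         4
50-52                 1         1
_______________________________________________________
Total                40
_______________________________________________________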
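\[ \text{Median} = 70.5 + 3 \left( \frac{20 - 17}{3} \right) = 70.5 + 3 (1) = 70.5 + 3 \]
Median = 73.5
Here 70.5 is the lower boundary of the median class 71-73, 3 is the class size, 20 is N/2,
17 is ∑Cf<, and the divisor 3 is Fc, the frequency of the median class.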
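The grouped median can be sketched in Python as follows. The general formula is not shown in the text, so the sketch assumes the usual interpolation rule, lower boundary + C × (N/2 − ∑Cf<) / Fc, which reproduces the figures 70.5, 3 and 73.5 used in the solution. The function name `grouped_median` is illustrative.

```python
def grouped_median(lower_limits, frequencies, width):
    """Interpolated median of a frequency distribution (bottom class first)."""
    n = sum(frequencies)
    half = n / 2
    cum_below = 0                       # "less than" cumulative frequency
    for ll, f in zip(lower_limits, frequencies):
        if cum_below + f >= half:
            # This is the median class: interpolate inside it.
            lower_boundary = ll - 0.5
            return lower_boundary + width * (half - cum_below) / f
        cum_below += f
    raise ValueError("empty distribution")

width = 3
lower_limits = list(range(50, 96, width))   # 50, 53, ..., 95
frequencies = [1, 3, 2, 4, 4, 2, 1, 3, 5, 2, 2, 4, 2, 2, 1, 2]

median = grouped_median(lower_limits, frequencies, width)
print(median)
```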
The mode from ungrouped data. In ungrouped data, the mode can easily be
determined by inspection. A distribution is classified as unimodal, bimodal, trimodal, or
polymodal according to the number of modes it has.
Illustrative Example:
In the set of scores 90, 95, 89, 89, 80, 92, 87, 91, 89, 92, 93, 92, the scores that
appear three times are 89 and 92. These scores are the modes, having an equal number of
occurrences. Such a set is called bimodal.
In the group of scores 89, 92, 89, 90, 87, 92, 91, 92, 91, 89, 90, 90, there are
three scores having equal frequencies: 89, 90 and 92. Thus, they are the modal scores and
the set is called trimodal.
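Finding the mode by inspection amounts to counting how often each score occurs. A minimal Python sketch, with the illustrative function name `modes`, is:

```python
from collections import Counter

def modes(scores):
    """Return every score that occurs with the highest frequency, in ascending order."""
    counts = Counter(scores)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

bimodal_set = [90, 95, 89, 89, 80, 92, 87, 91, 89, 92, 93, 92]
trimodal_set = [89, 92, 89, 90, 87, 92, 91, 92, 91, 89, 90, 90]
print(modes(bimodal_set), modes(trimodal_set))
```

The length of the returned list tells whether the set is unimodal, bimodal, trimodal or polymodal.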
SAMPLING
Sampling Design
This refers to the scheme of arriving at the sample which involves specification of
the target, the respondent population and the method of selecting them.
Population – refers to the entire group or set of individuals or items which is the
focus of an investigation. It is also called the universe. A population is further
distinguished by its role in the study, thus, the following types:
a. Topic Population – may be people, things, plants or animals. It is the group or
set about which generalization will be made.
b. Respondent Population – refers to a group or set of individuals who will
furnish the needed information on which the generalization is based.
c. Target Population – refers to the group or set of individuals or items from
which or about which representative information is originally desired.
There are basically two types of sampling, namely: (1) probability sampling and
(2) non-probability sampling.
Probability Sampling
b. Quota Sampling. This method involves the taking of the desired number of
respondents with the required characteristics proportionate to the population
under study. An example is when a researcher would like to document the
experience of male and female scientists who have been involved in the
establishment of a marine station. He/she should look for these scientists until
the desired number of the respondent is met.
Formula:
\[ n = \frac{N}{1 + N e^2} \]
Where n = sample size
N = population size
e = margin of error
Illustrative Example:
In your study, the size of the population is 10,000. What is the sample size if you
allow a 5% margin of error? Using the above formula, the sample size can be computed as
follows:
\[ n = \frac{N}{1 + N e^2} \]
\[ = \frac{10{,}000}{1 + (10{,}000)(0.05)^2} \]
\[ = \frac{10{,}000}{1 + (10{,}000)(0.0025)} \]
\[ = \frac{10{,}000}{1 + 25} \]
\[ = \frac{10{,}000}{26} \]
n = 384.615 or 385
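The sample-size formula translates directly into Python. In this sketch the function name `sample_size` and its parameter names are illustrative, and the result is rounded up to the next whole respondent as in the example.

```python
import math

def sample_size(population, margin_of_error):
    """n = N / (1 + N * e^2), before rounding."""
    return population / (1 + population * margin_of_error ** 2)

n_exact = sample_size(10_000, 0.05)
n = math.ceil(n_exact)    # a fraction of a respondent is rounded up

print(round(n_exact, 3), n)
```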
Steps:
\[ P = 1 - \frac{6 \sum d^2}{N (N^2 - 1)} \]
  X       Y      Rank of X   Rank of Y      d        d²
 24      18         1           1           0        0
 21      16         2           2.5        -0.5      0.25
 18      10         3           6          -3        9
 15      16         5           2.5         2.5      6.25
 15      12         5           4           1        1
 15      10         5           6          -1        1
 12      10         7.5         6           1.5      2.25
 12       8         7.5         8          -0.5      0.25
  9       6         9.5         9           0.5      0.25
  9       4         9.5        10          -0.5      0.25
                                               ----------------
                                               ∑d² = 20.5
Solution:
\[ P = 1 - \frac{6 \sum d^2}{N (N^2 - 1)} \]
\[ = 1 - \frac{6 (20.5)}{10 (10^2 - 1)} \]
\[ = 1 - \frac{123}{10 (100 - 1)} \]
\[ = 1 - \frac{123}{10 (99)} \]
\[ = 1 - \frac{123}{990} \]
= 1 - 0.12
= 0.88
DF = N – 2
   = 10 – 2
   = 8
Critical Value of P at .05 = .643
The computed P of 0.88 is greater than the Critical Value of .643, thus significant. The
null hypothesis is rejected.
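The rank-difference computation can be reproduced with a short Python sketch. It assumes, as in the table, that rank 1 goes to the highest score and that tied scores share the average of the ranks they occupy; the function names are illustrative.

```python
def average_ranks(values):
    """Rank 1 = highest value; tied values share the mean of their rank positions."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1          # first rank position held by v
        last = first + ordered.count(v) - 1   # last rank position held by v
        ranks.append((first + last) / 2)
    return ranks

def rank_correlation(x, y):
    """Return (sum of d^2, P) with P = 1 - 6 * sum(d^2) / (N * (N^2 - 1))."""
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return sum_d2, 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# The ten pairs of scores from the table.
x = [24, 21, 18, 15, 15, 15, 12, 12, 9, 9]
y = [18, 16, 10, 16, 12, 10, 10, 8, 6, 4]

sum_d2, p = rank_correlation(x, y)
print(sum_d2, round(p, 2))
```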
When there are two sets of scores and the researcher would like to find out whether the
two sets are correlated, the Pearson Product-Moment correlation is used. The
correlation is also called co-variation because the analysis concentrates mainly on
how the two sets vary together. It is used to determine the relationship between two
variables with interval-type data. An example is finding out how scores in an
achievement test correlate with scores in a mental ability test.
Steps:
\[ r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{[N \sum X^2 - (\sum X)^2][N \sum Y^2 - (\sum Y)^2]}} \]
\[ = \frac{10 (1824) - (150)(110)}{\sqrt{[N \sum X^2 - (\sum X)^2][N \sum Y^2 - (\sum Y)^2]}} \]
\[ = \frac{1740}{\sqrt{[2160][1860]}} \]
\[ = \frac{1740}{\sqrt{4017600}} \]
\[ = \frac{1740}{2004.40} \]
= 0.868
DF = N - 2
   = 10 - 2
   = 8
Critical value of r at .05 = .549
The computed r of 0.868 is greater than the required tabular value of .549 at the .05 level
of significance. Thus, it is significant. The null hypothesis is rejected.
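The raw-score formula for r can be sketched in Python. The sketch assumes that X and Y are the same ten pairs of scores used in the rank-correlation table, since those pairs reproduce ∑X = 150, ∑Y = 110 and ∑XY = 1824; the function name `pearson_r` is illustrative.

```python
import math

def pearson_r(x, y):
    """r = (N*SXY - SX*SY) / sqrt((N*SXX - SX^2) * (N*SYY - SY^2))."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Assumed data: the ten pairs from the rank-correlation table.
x = [24, 21, 18, 15, 15, 15, 12, 12, 9, 9]
y = [18, 16, 10, 16, 12, 10, 10, 8, 6, 4]

r = pearson_r(x, y)
print(round(r, 3))
```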
The t-test for Independent Sample Means is used to determine if the observed
difference between the means of two groups is statistically significant. It is,
therefore, a test for the observed difference between two sample means that are not
correlated with each other. It is used to compare the averages of
control and experimental groups and to determine if there
is a difference between the averages of two intact groups.
The t-test for Dependent Sample Means is a more precise test, with its use
limited to scores that are correlated, such as those of a pre-test and post-test.
The tabular t-value is obtained from the table of critical t-values using the appropriate
degrees of freedom. If the computed t is greater than the tabular t, the
hypothesis of no difference between the pre-test and post-test is rejected.
7. F-Test
A. NON-PARAMETRIC STATISTICS