Unit – II DESCRIBING DATA
Types of Data - Types of Variables - Describing Data with Tables and Graphs - Describing Data with
Averages - Describing Variability - Normal Distributions and Standard (z) Scores
THREE TYPES OF DATA
Qualitative data consist of words (Yes or No), letters (Y or N), or numerical codes (0 or 1) that
represent a class or category.
Ranked data consist of numbers (1st, 2nd, . . . 40th place) that represent relative standing within a
group.
Quantitative data consist of numbers (weights of 238, 170, . . . 185 lbs) that represent an amount
or a count.
To determine the type of data, focus on a single observation in any collection of observations.
TYPES OF VARIABLES
A variable is a characteristic or property that can take on different values.
The weights can be described not only as quantitative data but also as observations for a
quantitative variable, since the various weights take on different numerical values.
By the same token, the replies can be described as observations for a qualitative variable, since
the replies to the Facebook profile question take on different values of either Yes or No.
Given this perspective, any single observation can be described as a constant, since it takes on
only one value.
Discrete and Continuous Variables
Quantitative variables can be further distinguished as discrete or continuous.
A discrete variable consists of isolated numbers separated by gaps. Discrete variables can only
assume specific values that cannot be subdivided. Typically, you count discrete values, and the
results are integers.
Examples
Counts, such as the number of children in a family (1, 2, 3, etc., but never 1.5). These variables
cannot have fractional or decimal values: you can have 20 or 21 cats, but not 20.5.
The number of heads in a sequence of coin tosses.
The result of rolling a die.
The number of patients in a hospital.
The population of a country.
While discrete variables have no decimal places, the average of these values can be fractional. For
example, families can have only a discrete number of children: 1, 2, 3, etc. However, the average
number of children per family can be 2.2.
A continuous variable consists of numbers whose values, at least in theory, have no restrictions.
Continuous variables can assume any numeric value and can be meaningfully split into smaller parts.
Consequently, they have valid fractional and decimal values. In fact, continuous variables have an
infinite number of potential values between any two points. Generally, you measure them using a
scale.
Examples of continuous variables include weight, height, length, time, and temperature; durations,
such as the reaction times of grade school children to a fire alarm; and standardized test scores,
such as those on the Scholastic Aptitude Test (SAT).
Independent and Dependent Variables
Independent Variable
In an experiment, an independent variable is the treatment manipulated by the investigator.
Independent variables (IVs) are the ones that you include in the model to explain or predict
changes in the dependent variable.
Independent indicates that they stand alone and other variables in the model do not influence
them.
Independent variables are also known as predictors, factors, treatment variables, explanatory
variables, input variables, x-variables, and right-hand variables—because they appear on the right
side of the equals sign in a regression equation.
It is a variable that stands alone and isn't changed by the other variables you are trying to
measure. For example, someone's age might be an independent variable. Other factors (such as what
they eat, how much they go to school, how much television they watch) do not change a person's age.
The impartial creation of distinct groups, which differ only in terms of the independent variable,
has a most desirable consequence. Once the data have been collected, any difference between the
groups can be interpreted as being caused by the independent variable.
Dependent Variable
When a variable is believed to have been influenced by the independent variable, it is called a
dependent variable. In an experimental setting, the dependent variable is measured, counted, or
recorded by the investigator.
The dependent variable (DV) is what you want to use the model to explain or predict. The values
of this variable depend on other variables.
It’s also known as the response variable, outcome variable, and left-hand variable. Graphs place
dependent variables on the vertical, or Y, axis.
A dependent variable is exactly what it sounds like: something that depends on other factors.
For example, the result of a blood sugar test depends on what food you ate, when you ate it, and so
on. Unlike the independent variable, the dependent variable isn’t manipulated by the investigator.
Instead, it represents an outcome: the data produced by the experiment.
Confounding Variable
An uncontrolled variable that compromises the interpretation of a study is known as a confounding
variable. Sometimes a confounding variable occurs because it’s impossible to assign subjects
randomly to different conditions.
Describing Data with Tables and Graphs
Frequency Distributions for Quantitative Data
A frequency distribution is a collection of observations produced by sorting observations into
classes and showing their frequency (f) of occurrence in each class. When observations are sorted
into classes of single values, as in Table 2.1, the result is referred to as a frequency distribution for
ungrouped data.
The frequency distribution shown in Table 2.1 is only partially displayed because there are more
than 100 possible values between the largest and smallest observations. A frequency distribution
for ungrouped data is much more informative when the number of possible values is less than 20.
When there are more possible values, grouped data are used instead.
Grouped Data
When observations are sorted into classes of more than one value, and each class is listed with its
frequency of occurrence, the result is referred to as a frequency distribution for grouped data
(shown in Table 2.2).
In this frequency distribution, the data are grouped into class intervals with 10 possible values
each.
The frequency ( f ) column shows the frequency of observations in each class and, at the bottom,
the total number of observations in all classes.
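The grouping described above can be sketched in a few lines of Python. This is only an illustration: the weights are hypothetical (they are not the data behind Table 2.2), the function name `grouped_frequency` is made up for this example, and the scores are assumed to be whole numbers no smaller than the chosen starting value.

```python
from collections import Counter

def grouped_frequency(scores, width, start):
    """Count how many scores fall in each class interval of the given width.

    Assumes integer scores that are all >= start.
    """
    # Class index 0 covers start..start+width-1, index 1 the next interval, etc.
    counts = Counter((score - start) // width for score in scores)
    table = []
    for k in range(max(counts) + 1):
        lower = start + k * width
        upper = lower + width - 1
        table.append((lower, upper, counts.get(k, 0)))
    return table

# Hypothetical weights in lbs (illustration only).
weights = [132, 145, 147, 151, 158, 158, 160, 163, 172, 185]
table = grouped_frequency(weights, width=10, start=130)

# Print the highest class first, then the total at the bottom.
for lower, upper, f in reversed(table):
    print(f"{lower}-{upper}: {f}")
print("Total:", sum(f for _, _, f in table))
```

Each class interval here holds 10 possible values (130-139, 140-149, and so on), and the frequencies add up to the total number of observations.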
OUTLIERS
An outlier is an extremely high or extremely low data point relative to the nearest data points and
the rest of the values in the dataset or graph you are working with. Outliers are extreme values
that stand out greatly from the overall pattern of values in a dataset or graph.
RELATIVE FREQUENCY DISTRIBUTIONS
Relative frequency distributions show the frequency of each class as a part or fraction of the
total frequency for the entire distribution. This type of distribution is especially helpful when
you must compare two or more distributions based on different total numbers of observations. The
conversion to relative frequencies allows a direct comparison of the shapes of two distributions
without having to adjust for their different total numbers of observations.
Constructing Relative Frequency Distributions
To convert a frequency distribution into a relative frequency distribution, divide the frequency
for each class by the total frequency for the entire distribution.
Table 2.5 illustrates a relative frequency distribution based on the weight distribution of Table 2.2.
Percentages or Proportions
Some people prefer to deal with percentages rather than proportions because percentages usually
lack decimal points. A proportion always varies between 0 and 1, whereas a percentage always
varies between 0 percent and 100 percent. To convert relative frequencies from proportions to
percentages, multiply each proportion by 100; that is, move the decimal point two places to the right.
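As a rough sketch of the conversion just described, the Python snippet below divides each class frequency by the total to get proportions, then multiplies by 100 to get percentages. The frequencies are hypothetical and the function name `relative_frequencies` is chosen only for this example.

```python
def relative_frequencies(frequencies):
    """Divide each class frequency by the total frequency."""
    total = sum(frequencies.values())
    return {label: f / total for label, f in frequencies.items()}

# Hypothetical grouped frequencies (illustration only).
freqs = {
    "130-139": 1, "140-149": 2, "150-159": 3,
    "160-169": 2, "170-179": 1, "180-189": 1,
}

proportions = relative_frequencies(freqs)

# Multiply each proportion by 100 to express it as a percentage.
percentages = {label: round(100 * p, 1) for label, p in proportions.items()}

for label in freqs:
    print(f"{label}: f={freqs[label]}  proportion={proportions[label]:.2f}  "
          f"percent={percentages[label]}")
```

The proportions sum to 1 and the percentages to 100, apart from rounding.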
CUMULATIVE FREQUENCY DISTRIBUTIONS
Cumulative frequency distributions show the total number of observations in each class and in all
lower-ranked classes. Cumulative frequencies are usually converted, in turn, to cumulative
percentages. Cumulative percentages are often referred to as percentile ranks.
Constructing Cumulative Frequency Distributions
To convert a frequency distribution into a cumulative frequency distribution, add to the frequency
of each class the sum of the frequencies of all classes ranked below it.
Cumulative Percentages
As has been suggested, if relative standing within a distribution is particularly important, then
cumulative frequencies are converted to cumulative percentages. To obtain a cumulative percentage,
divide the cumulative frequency of the class by the total frequency of the entire distribution and
multiply by 100.
Percentile Ranks
When used to describe the relative position of any score within its parent distribution, cumulative
percentages are referred to as percentile ranks. The percentile rank of a score indicates the
percentage of scores in the entire distribution with similar or smaller values than that score. Thus a
weight has a percentile rank of 80 if equal or lighter weights constitute 80 percent of the entire
distribution.
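The following Python sketch illustrates cumulative frequencies, cumulative percentages, and percentile ranks on a small hypothetical set of weights. The data and the helper name `percentile_rank` are assumptions made for this example.

```python
from itertools import accumulate

# Hypothetical class frequencies, lowest-ranked class first.
frequencies = [1, 2, 3, 2, 1, 1]

# Add to each class frequency the frequencies of all classes ranked below it.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]

# Divide each cumulative frequency by the total and express it as a percentage.
cumulative_percent = [100 * c / total for c in cumulative]

def percentile_rank(scores, score):
    """Percentage of scores with values equal to or smaller than `score`."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

# Hypothetical weights in lbs (illustration only).
weights = [132, 145, 147, 151, 158, 158, 160, 163, 172, 185]

print(cumulative)                     # running totals per class
print(cumulative_percent)             # the last entry is always 100.0
print(percentile_rank(weights, 163))  # 8 of the 10 weights are 163 or lighter
```

In this made-up data set a weight of 163 has a percentile rank of 80, because equal or lighter weights make up 80 percent of the distribution.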
FREQUENCY DISTRIBUTIONS FOR QUALITATIVE (NOMINAL) DATA
Frequency distributions for qualitative data are easy to construct. Simply determine the frequency
with which observations occupy each class, and report these frequencies as shown in Table 2.7 for
the Facebook profile survey. When qualitative data have an ordinal level of measurement, because
observations can be ordered from least to most, that order should be preserved in the frequency
table.
Relative and Cumulative Distributions for Qualitative Data
Frequency distributions for qualitative variables can always be converted into relative frequency
distributions. If measurement is ordinal, because observations can be ordered from least to most,
cumulative frequencies (and cumulative percentages) can also be used.
GRAPHS
Data can be described clearly and concisely with the aid of a well-constructed frequency
distribution. And data can often be described even more vividly by converting frequency
distributions into graphs.
GRAPHS FOR QUANTITATIVE DATA
Histograms
A histogram is a bar-type graph for quantitative data. It uses rectangles to show the frequency of
data items in successive numerical intervals of equal size. The common boundaries between adjacent
bars emphasize the continuity of the data, as with continuous variables.
Frequency Polygon
An important variation on a histogram is the frequency polygon, or line graph. Frequency polygons
may be constructed directly from frequency distributions.
Stem and Leaf Displays
Another technique for summarizing quantitative data is a stem and leaf display. Stem and leaf
displays are ideal for summarizing distributions, such as that for weight data, without destroying the
identities of individual observations.
Constructing Stem and Leaf Display
The leftmost panel of the table re-creates the weights. To construct the stem and leaf display for the
table given below, first note that, when counting by tens, the weights range from the 130s to the
240s. Arrange a column of numbers, the stems, beginning with 13 (representing the 130s) and
ending with 24 (representing the 240s). Draw a vertical line to separate the stems, which represent
multiples of 10, from the space to be occupied by the leaves, which represent multiples of 1.
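The construction can be sketched in Python as follows. The weights are hypothetical (not the data in the table referred to above), the function name `stem_and_leaf` is made up for this example, and the scores are assumed to be non-negative whole numbers.

```python
from collections import defaultdict

def stem_and_leaf(scores):
    """Return the display as text lines: stems are tens, leaves are units."""
    leaves = defaultdict(list)
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)  # e.g. 158 -> stem 15, leaf 8
        leaves[stem].append(leaf)
    lines = []
    # List every stem from the smallest to the largest, even if it has no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}")
    return lines

# Hypothetical weights in lbs (illustration only).
weights = [132, 145, 147, 151, 158, 158, 160, 163, 172, 185]
lines = stem_and_leaf(weights)
print("\n".join(lines))
```

Because every leaf is kept, each original weight can still be read back from the display, for example stem 15 with leaves 1, 8, 8 stands for 151, 158, and 158.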
MISLEADING GRAPHS
Graphs can be constructed in an unscrupulous manner to support a particular point of view. Popular
sayings include “Numbers don’t lie, but statisticians do” and “There are three kinds of lies:
lies, damned lies, and statistics.”
Describing Data with Averages
MODE
The mode reflects the value of the most frequently occurring score. In other words, the mode is
the value that has the highest frequency in a given set of values, that is, the value that appears
the greatest number of times.
Example: In the data set 2, 4, 5, 5, 6, 7, the mode is 5, since it appears twice and every other
value appears only once.
Types of Modes
Bimodal, Trimodal & Multimodal (More than one mode)
When there are two modes in a data set, the set is called bimodal. For example, the modes of
Set A = {2,2,2,3,4,4,5,5,5} are 2 and 5, because both 2 and 5 are repeated three times in the set.
When there are three modes in a data set, the set is called trimodal. For example, the modes of
Set A = {2,2,2,3,4,4,5,5,5,7,8,8,8} are 2, 5 and 8.
When there are four or more modes in a data set, the set is called multimodal.
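A minimal Python sketch for finding the mode or modes, using the example sets from this section. The function name `modes` is chosen for this illustration. Note one assumption: if every value occurs equally often, this sketch returns all of them, whereas many textbooks would say such a set has no mode.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 4, 5, 5, 6, 7]))                        # one mode
print(modes([2, 2, 2, 3, 4, 4, 5, 5, 5]))               # bimodal
print(modes([2, 2, 2, 3, 4, 4, 5, 5, 5, 7, 8, 8, 8]))   # trimodal
```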
MEDIAN
The median reflects the middle value when observations are ordered from least to most. The median
splits a set of ordered observations into two equal parts, the upper and lower halves.
Finding the Median
Order scores from least to most.
If the total number of observations, n, is odd, the median is the middle score:
Median = {(n+1)/2}th term
If the total number of observations is even, the median is the average of the two middle scores:
Median = 1/2 [(n/2)th term + {(n/2)+1}th term]
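The two cases can be sketched in Python as below. The function name `median` is chosen for this example; keep in mind that Python lists are indexed from 0, so the {(n+1)/2}th term sits at index n // 2 when n is odd.

```python
def median(scores):
    """Middle value of the ordered scores (average of the two middle values if n is even)."""
    ordered = sorted(scores)          # order scores from least to most
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th term, which is index n // 2 when counting from 0.
        return ordered[middle]
    # Even n: average the (n / 2)th and ((n / 2) + 1)th terms.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([2, 5, 8, 10, 3]))    # odd n: ordered scores are 2, 3, 5, 8, 10
print(median([2, 4, 5, 5, 6, 7]))  # even n: average of the 3rd and 4th terms
```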
MEAN
The mean is found by adding all scores and then dividing by the number of scores.
Types of means
Sample mean
Population mean
Sample Mean
The sample mean is a measure of central tendency. It is the arithmetic average of a sample, that is,
of values taken (often at random) from the population. It is computed as the sum of all the sample
values divided by the number of values in the sample.
Population Mean
The population mean is the sum of all values in the population divided by the total number of
values in the population.
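The calculation is the same for a sample and for a population; only the set of scores differs. The Python sketch below uses made-up numbers, and the names `population`, `sample`, and `children_per_family` are assumptions for this illustration.

```python
def mean(scores):
    """Add all scores, then divide by the number of scores."""
    return sum(scores) / len(scores)

# Hypothetical population of eight scores, and a sample of five taken from it.
population = [2, 5, 8, 10, 3, 6, 7, 7]
sample = population[:5]

print(mean(population))  # population mean
print(mean(sample))      # sample mean, usually close to but not equal to the population mean

# A mean of discrete counts can be fractional, e.g. children per family.
children_per_family = [1, 2, 2, 3, 3]
print(mean(children_per_family))
```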
Describing Variability
RANGE
The range is the difference between the largest and smallest scores in a data set. For example, if
the given data set is {2,5,8,10,3}, then the range will be 10 – 2 = 8.
VARIANCE
Variance is a measure of how far a set of data points is spread out from their mean (average)
value. It is the mean of all squared deviations from the mean.
Formula
$$\sigma^2 = \frac{\sum (X - \mu)^2}{n}$$
where $X$ is a score, $\mu$ is the mean, and $n$ is the number of scores. Equivalently,
variance = (standard deviation)$^2$.
STANDARD DEVIATION
The standard deviation is the square root of the mean of all squared deviations from the mean, that is,
Standard deviation = √variance
The standard deviation is a rough measure of the average (or standard) amount by which scores
deviate from their mean.
Standard Deviation: A Measure of Distance
The mean is a measure of position, but the standard deviation is a measure of distance (on either
side of the mean of the distribution).
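As a worked illustration, the Python sketch below computes the range, variance, and standard deviation for the data set {2,5,8,10,3} used in the range example, treating it as a complete population (so the divisor is the number of scores). The variable names are chosen only for this example.

```python
import math

scores = [2, 5, 8, 10, 3]
n = len(scores)

# Range: largest score minus smallest score.
value_range = max(scores) - min(scores)

# Variance: mean of the squared deviations from the mean.
mu = sum(scores) / n
variance = sum((x - mu) ** 2 for x in scores) / n

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(value_range)          # 10 - 2
print(round(variance, 2))   # squared deviations sum to about 45.2; divided by 5
print(round(std_dev, 2))
```

Here the mean is 5.6, the squared deviations sum to about 45.2, the variance is about 9.04, and the standard deviation is about 3.01.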
Sum of Squares (SS)
Calculating the standard deviation requires that we first obtain a value for the variance.
Calculating the variance requires, in turn, that we obtain the sum of the squared deviation scores,
or more simply the sum of squares, symbolized by SS:
$$SS = \sum (X - \mu)^2$$
“The sum of squares equals the sum of all squared deviation scores.” You can reconstruct this
formula by remembering the following three steps:
1. Subtract the population mean, μ, from each original score, X, to obtain a deviation score, X − μ.
2. Square each deviation score, $(X - \mu)^2$, to eliminate negative signs.
3. Sum all squared deviation scores, $\sum (X - \mu)^2$.
Sum of Squares Formulas for Sample
Sample notation can be substituted for population notation in the sum of squares formula without
causing any essential changes:
$$SS = \sum (X - \bar{X})^2$$
where $\bar{X}$ is the sample mean.
DEGREES OF FREEDOM (df)
Degrees of freedom (df) refers to the number of values that are free to vary, given one or more
mathematical restrictions, in a sample being used to estimate a population characteristic.
Degrees of freedom are the number of independent values that can vary in a statistical analysis.
These values are without constraint, although they do impose restrictions on the remaining values
if the data set is to comply with the estimated parameters.
Formula
df = n - 1
where n is the sample size. One degree of freedom is lost because the n deviations from the sample
mean must sum to zero, so only n - 1 of them are free to vary.
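The Python sketch below illustrates the restriction behind df = n - 1, assuming the scores are a sample used to estimate the population variance. The data are the small example set used earlier in these notes, treated here as a sample; the variable names are assumptions for this illustration. The standard library function `statistics.variance` also divides by n - 1, so it serves as a cross-check.

```python
import statistics

sample = [2, 5, 8, 10, 3]
n = len(sample)
x_bar = sum(sample) / n

# Deviations from the sample mean always sum to zero (apart from rounding),
# so once n - 1 of them are known, the last one is fixed.
deviations = [x - x_bar for x in sample]
print(round(sum(deviations), 10))

# Sum of squares in sample notation, then divide by the degrees of freedom.
ss = sum(d ** 2 for d in deviations)
df = n - 1
sample_variance = ss / df

print(df)
print(round(sample_variance, 2))
print(round(statistics.variance(sample), 2))  # same value from the standard library
```

With these five scores the sum of squares is about 45.2, so the sample variance is about 45.2 / 4 = 11.3, larger than the value obtained when dividing by n.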
INTERQUARTILE RANGE (IQR)
The interquartile range (IQR) is simply the range for the middle 50 percent of the scores. More
specifically, the IQR equals the distance between the third quartile (or 75th percentile) and the first
quartile (or 25th percentile), that is, after the highest quarter (or top 25 percent) and the lowest
quarter (or bottom 25 percent) have been trimmed from the original set of scores. Since most
distributions are spread more widely in their extremities than their middle, the IQR tends to be less
than half the size of the range.
Put simply, the IQR describes the middle 50% of values when ordered from lowest to highest. To find
the IQR, first find the median (middle value) of the lower half and of the upper half of the
ordered data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference
between Q3 and Q1.
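The procedure above (median of the lower half, median of the upper half, then subtract) can be sketched in Python as follows. The data are hypothetical and the function names are chosen for this example. One assumption to note: when the number of scores is odd, this sketch leaves the middle score out of both halves; other quartile conventions exist and can give slightly different values.

```python
def median(ordered):
    """Median of a list that is already ordered from least to most."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def interquartile_range(scores):
    """Q3 - Q1, where Q1 and Q3 are the medians of the lower and upper halves."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    ordered = sorted(scores)
    half = len(ordered) // 2
    q1 = median(ordered[:half])                  # lower half
    q3 = median(ordered[len(ordered) - half:])   # upper half
    return q3 - q1

# Hypothetical scores (illustration only).
scores = [1, 3, 4, 6, 7, 9, 11, 15]
print(interquartile_range(scores))  # Q1 = 3.5, Q3 = 10.0
```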
Normal Distributions and Standard (z) Scores
THE NORMAL CURVE
The normal distribution is a continuous probability distribution that is symmetrical on both sides of
the mean, so the right side of the center is a mirror image of the left side.
Properties of the Normal Curve
The normal curve is a theoretical curve defined for a continuous variable, as described in Section
1.6, and noted for its symmetrical bell-shaped form.
Because the normal curve is symmetrical, its lower half is the mirror image of its upper half.
The normal curve peaks above a point midway along the horizontal spread and then tapers off
gradually in either direction from the peak (without actually touching the horizontal axis, since, in
theory, the tails of a normal curve extend infinitely far).
The values of the mean, median (or 50th percentile), and mode, located at a point midway along
the horizontal spread, are the same for the normal curve.
Properties of a normal distribution
The mean, mode and median are all equal.
The curve is symmetric at the center (i.e. around the mean, μ).
Exactly half of the values are to the left of center and exactly half the values are to the right.
The total area under the curve is 1.
z SCORES
A z score is a unit-free, standardized score that, regardless of the original units of measurement,
indicates how many standard deviations a score is above or below the mean of its distribution.
A z score can be defined as a measure of the number of standard deviations by which a score is
below or above the mean of a distribution. In other words, it is used to determine the distance of a
score from the mean. If the z score is positive it indicates that the score is above the mean. If it is
negative then the score will be below the mean. However, if the z score is 0 it denotes that the data
point is the same as the mean.
A z score is computed as
$$z = \frac{X - \mu}{\sigma}$$
where $X$ is the original score and $\mu$ and $\sigma$ are the mean and the standard deviation,
respectively, for the normal distribution of the original scores.
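A short Python sketch of the z score calculation. The mean of 100 and standard deviation of 15 are hypothetical values chosen for illustration, and the function name `z_score` is an assumption of this example.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical distribution with mean 100 and standard deviation 15.
mu, sigma = 100, 15

print(z_score(130, mu, sigma))  # positive: two standard deviations above the mean
print(z_score(85, mu, sigma))   # negative: one standard deviation below the mean
print(z_score(100, mu, sigma))  # zero: the score equals the mean
```

Because the units cancel in the division, the result is unit-free regardless of whether the original scores were in pounds, seconds, or test points.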
STANDARD NORMAL CURVE
If the original distribution approximates a normal curve, then the shift to standard or z scores will
always produce a new distribution that approximates the standard normal curve. This is the one
normal curve for which a table is actually available.
Although there is an infinite number of different normal curves, each with its own mean and
standard deviation, there is only one standard normal curve, with a mean of 0 and a standard
deviation of 1.
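In place of a printed table, areas under the standard normal curve can be looked up with Python's standard library class `statistics.NormalDist` (available from Python 3.8). The sketch below is only an illustration; the original distribution with mean 100 and standard deviation 15 is hypothetical.

```python
from statistics import NormalDist

# The one standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of the area below z = 0 and below z = 1.
below_mean = standard_normal.cdf(0)
below_plus_one = standard_normal.cdf(1)
print(below_mean)                 # half of the area lies below the mean
print(round(below_plus_one, 4))

# Any normal curve can be referred to the standard one through z scores.
original = NormalDist(mu=100, sigma=15)       # hypothetical original scores
z = (130 - original.mean) / original.stdev    # z score for an original score of 130
print(z)
print(round(original.cdf(130), 4), round(standard_normal.cdf(z), 4))  # same proportion
```

The last line shows that the proportion of scores below 130 in the original distribution matches the proportion below the corresponding z score on the standard normal curve.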