Chapter 2 : Data and Statistics
1. Data :
Elements, Variables, and Observations
a) Elements are the entities on which data are collected.
b) A variable is a characteristic of interest for the elements.
c) Measurements collected on each variable for every element in a study
provide the data. The set of measurements obtained for a particular
element is called an observation.
Example
• The data set in Table 1.1 includes the following five
variables: Fund Type, Net Asset Value ($), 5-Year
Average Return (%), Expense Ratio, and Morningstar
Rank.
• The set of measurements for the first observation
(American Century Intl. Disc) is IE, 14.37, 30.53,
1.41, and 3-Star.
• The set of measurements for the second observation
(American Century Tax-Free Bond) is FI, 10.73, 3.34,
0.49, and 4-Star, and so on.
• A data set with 25 elements contains 25 observations.
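The relationship between elements, variables, and observations can be sketched as a small data structure. This is only an illustration: the dictionary keys and the name `observations` are choices made here, and only the two observations quoted in the example are included because the rest of Table 1.1 is not reproduced in these notes.

```python
# Each dict is one observation: the set of measurements for one element (a fund).
# Only the two observations quoted in the text are included; the remaining
# rows of Table 1.1 are not reproduced in these notes.
observations = [
    {
        "Fund Name": "American Century Intl. Disc",
        "Fund Type": "IE",
        "Net Asset Value ($)": 14.37,
        "5-Year Average Return (%)": 30.53,
        "Expense Ratio": 1.41,
        "Morningstar Rank": "3-Star",
    },
    {
        "Fund Name": "American Century Tax-Free Bond",
        "Fund Type": "FI",
        "Net Asset Value ($)": 10.73,
        "5-Year Average Return (%)": 3.34,
        "Expense Ratio": 0.49,
        "Morningstar Rank": "4-Star",
    },
]

# The fund name identifies the element; every other key is a variable.
variables = [key for key in observations[0] if key != "Fund Name"]

print(variables)
print(len(observations), "observations")
```

Each element contributes exactly one observation, which is why a data set with 25 elements contains 25 observations.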
Scales of Measurement
• Data collection requires one of the following scales of measurement:
nominal, ordinal, interval, or ratio.
• The scale of measurement determines the amount of information contained in
the data and indicates the most appropriate data summarization and statistical
analyses.
• Nominal scale: the data for a variable consist of labels or names used to identify
an attribute of the element.
• Ordinal scale: the data have the properties of nominal data, and the order or rank of the
data is meaningful.
• Interval scale: the data have all the properties of ordinal data, and the interval
between values is expressed in terms of a fixed unit of measure. Interval data are always
numeric.
• Ratio scale: the data have all the properties of interval data, and the ratio of two values
is meaningful.
Categorical and Quantitative Data
• Data that can be grouped by specific categories are referred to as categorical data.
Categorical data use either the nominal or ordinal scale of measurement.
• Data that use numeric values to indicate how much or how many are referred to as
quantitative data.
• Quantitative data are obtained using either the interval or ratio scale of measurement.
• A categorical variable is a variable with categorical data, and a quantitative variable
is a variable with quantitative data.
Cross-Sectional and Time Series Data
• Cross-sectional data are data collected at the same or approximately the
same point in time.
• Time series data are data collected over several time periods.
Statistical Inference
• A population is the set of all elements of interest in a particular study.
• A sample is a subset of the population.
• The process of conducting a survey to collect data for a sample is called a sample
survey.
• Statistics uses data from a sample to make estimates and test hypotheses about the
characteristics of a population through a process referred to as statistical inference.
1. Descriptive Statistics :
Summarizing Categorical Data
• A frequency distribution is a tabular summary of data showing the number (frequency) of items in each of
several nonoverlapping classes.
Example : To develop a frequency distribution for these data, we count the number of times each soft drink
appears in Table 2.1. Coke Classic appears 19 times, Diet Coke appears 8 times, Dr. Pepper appears 5 times,
Pepsi appears 13 times, and Sprite appears 5 times.
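The counting step in this example can be sketched in a few lines of Python. Table 2.1 itself is not reproduced in these notes, so the list of 50 purchases below is rebuilt from the counts quoted above; the variable names and the order of the purchases are assumptions made for illustration.

```python
from collections import Counter

# Assumption: the 50 purchases are rebuilt from the counts quoted in the text,
# because Table 2.1 is not reproduced here (the original order is unknown).
purchases = (
    ["Coke Classic"] * 19
    + ["Diet Coke"] * 8
    + ["Dr. Pepper"] * 5
    + ["Pepsi"] * 13
    + ["Sprite"] * 5
)

# Counter tallies how many times each class (soft drink) appears.
frequency = Counter(purchases)

for drink, count in frequency.most_common():
    print(f"{drink:<13}{count:>3}")
```

The classes are nonoverlapping because each purchase is exactly one soft drink, so the frequencies add up to the number of purchases.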
• A percent frequency distribution summarizes the percent frequency of the data for each class.
• A bar chart is a graphical device for depicting categorical data summarized in a frequency, relative
frequency, or percent frequency distribution.
• The pie chart provides another graphical device for presenting relative frequency and percent frequency
distributions for categorical data.
Summarizing Quantitative Data
• The three steps necessary to define the classes for a frequency distribution with quantitative
data are:
1. Determine the number of nonoverlapping classes.
2. Determine the width of each class.
3. Determine the class limits.
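The three steps can be sketched in Python using the 12 monthly starting salaries that appear in the numerical measures part of these notes. The choices below (five classes, a width of 150, a first lower limit of 3300) are assumptions made for illustration, not values given in the source; the approximate width is taken as the data range divided by the number of classes and then rounded up to a convenient number.

```python
salaries = [3310, 3355, 3450, 3480, 3480, 3490,
            3520, 3540, 3550, 3650, 3730, 3925]

# Step 1: choose the number of nonoverlapping classes (assumed: 5).
num_classes = 5

# Step 2: approximate width = (largest - smallest) / number of classes,
# then round up to a convenient value (assumed: 150).
approx_width = (max(salaries) - min(salaries)) / num_classes
class_width = 150

# Step 3: choose class limits so every value falls in exactly one class
# (assumed first lower limit: 3300; the data are whole dollars).
first_lower = 3300
class_limits = [
    (first_lower + k * class_width, first_lower + (k + 1) * class_width - 1)
    for k in range(num_classes)
]

# Count the observations that fall inside each pair of limits.
frequency = [sum(lo <= x <= hi for x in salaries) for lo, hi in class_limits]

for (lo, hi), count in zip(class_limits, frequency):
    print(f"{lo}-{hi}: {count}")
```

A different number of classes or a different starting limit would give a different, equally valid frequency distribution; the only requirement is that the classes do not overlap and together cover all the data.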
• We define the relative frequency and percent frequency distributions for quantitative data in
the same manner as for categorical data. First, recall that the relative frequency is the
proportion of the observations belonging to a class. With n observations,
Relative frequency of a class = Frequency of the class / n
• The percent frequency of a class is the relative frequency multiplied by 100.
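As a small worked illustration, the same calculation can be applied to the soft drink frequencies quoted earlier (19, 8, 5, 13, and 5 out of n = 50). The dictionary names are chosen here for illustration; the arithmetic is the same for quantitative classes.

```python
# Frequencies quoted in the soft drink example (n = 50 purchases).
frequency = {
    "Coke Classic": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi": 13,
    "Sprite": 5,
}

n = sum(frequency.values())

# Relative frequency: the proportion of the n observations in each class.
relative_frequency = {name: count / n for name, count in frequency.items()}

# Percent frequency: the relative frequency multiplied by 100.
percent_frequency = {name: 100 * count / n for name, count in frequency.items()}

for name in frequency:
    print(f"{name:<13}{relative_frequency[name]:>6.2f}{percent_frequency[name]:>6.0f}%")
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check on the arithmetic.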
• One of the simplest graphical summaries of data is a dot plot. A horizontal axis shows the
range for the data. Each data value is represented by a dot placed above the axis.
• A common graphical presentation of quantitative data is a histogram. A histogram is
constructed by placing the variable of interest on the horizontal axis and the frequency,
relative frequency, or percent frequency on the vertical axis.
• A variation of the frequency distribution that provides another tabular summary of quantitative
data is the cumulative frequency distribution.
• The cumulative frequency distribution uses the number of classes, class widths, and class
limits developed for the frequency distribution.
• A cumulative percent frequency distribution shows the percentage of data items with values
less than or equal to the upper limit of each class.
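A minimal sketch of a cumulative frequency distribution follows, using the 12 monthly starting salaries from the numerical measures part of these notes. The upper class limits (classes of width 150 starting at 3300) are an assumption made for illustration, not classes given in the source.

```python
salaries = [3310, 3355, 3450, 3480, 3480, 3490,
            3520, 3540, 3550, 3650, 3730, 3925]

# Assumed upper class limits for classes 3300-3449, 3450-3599, and so on.
upper_limits = [3449, 3599, 3749, 3899, 4049]

n = len(salaries)

# Cumulative frequency: number of values less than or equal to each upper limit.
cumulative_frequency = [sum(x <= upper for x in salaries) for upper in upper_limits]

# Cumulative percent frequency: the same counts expressed as a percentage of n.
cumulative_percent = [100 * count / n for count in cumulative_frequency]

for upper, count, pct in zip(upper_limits, cumulative_frequency, cumulative_percent):
    print(f"<= {upper}: {count:>2}  ({pct:.1f}%)")
```

The last cumulative frequency always equals n, and the last cumulative percent frequency always equals 100.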
• A graph of a cumulative distribution, called an ogive, shows data values on the horizontal axis
and either the cumulative frequencies, the cumulative relative frequencies, or the cumulative
percent frequencies on the vertical axis.
1. Descriptive Statistics: Numerical Measures
Measures of Location
• The mean provides a measure of central location for the data. If the data are
for a sample, the mean is denoted by x̄; if the data are for a population, the
mean is denoted by the Greek letter μ.
• The median is another measure of central location.
• MEDIAN
• Arrange the data in ascending order (smallest value to largest value).
a) For an odd number of observations, the median is the middle value.
b) For an even number of observations, the median is the average of the two middle values.
• A third measure of location is the mode.
• The mode is the value that occurs with greatest frequency.
• Percentile : the pth percentile is a value such that at least p percent of the observations
are less than or equal to this value and at least (100 - p) percent of the observations are
greater than or equal to this value.
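The mean, median, and mode described above can be checked with Python's standard `statistics` module. The data are the 12 monthly starting salaries used in the percentile example of these notes; the variable names are chosen here for illustration.

```python
import statistics

salaries = [3310, 3355, 3450, 3480, 3480, 3490,
            3520, 3540, 3550, 3650, 3730, 3925]

# Mean: the sum of the values divided by the number of values.
mean_salary = statistics.mean(salaries)

# Median: with 12 (an even number of) observations, the average of the
# two middle values after sorting, here the 6th and 7th.
median_salary = statistics.median(salaries)

# Mode: the value that occurs with greatest frequency (3480 appears twice).
mode_salary = statistics.mode(salaries)

print(mean_salary, median_salary, mode_salary)
```

Here the mean (3540) is pulled above the median (3505) by the single large value 3925, which is one reason the median is often preferred when the data contain extreme values.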
• Calculating the pth percentile
• Step 1. Arrange the data in ascending order (smallest value to largest value).
• Step 2. Compute an index i
i = (p/100) n
where p is the percentile of interest and n is the number of observations.
• Step 3.
a) If i is not an integer, round up. The next integer greater than i denotes the position of the
pth percentile.
b) If i is an integer, the pth percentile is the average of the values in positions i and i + 1.
Example. As an illustration of this procedure, let us determine the 85th percentile for the starting salary data.
Step 1. Arrange the data in ascending order.
3310 3355 3450 3480 3480 3490 3520 3540 3550 3650 3730 3925
Step 2. i = (p/100) n = (85/100) 12 = 10.2
Step 3. Because i is not an integer, round up. The position of the 85th percentile is the next integer greater than
10.2, the 11th position.
Returning to the data, we see that the 85th percentile is the data value in the 11th position, or 3730.
As another illustration of this procedure, let us consider the calculation of the 50th
percentile for the starting salary data. Applying step 2, we obtain
i = (50/100) 12 = 6
Because i is an integer, step 3(b) states that the 50th percentile is the average of the sixth and seventh data values;
thus the 50th percentile is (3490 + 3520)/2 = 3505. Note that the 50th percentile is also the median.
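The three-step procedure can be written as a short function. This is a sketch of the rule exactly as stated in these notes; the function name `percentile` is chosen here, and statistical software often uses other interpolation rules, so library results may differ slightly. Integer arithmetic is used for the index so that the "is i an integer?" test is exact.

```python
def percentile(data, p):
    """Return the pth percentile using the three-step rule from the notes.

    Assumes p is a whole number with 0 < p < 100.
    """
    values = sorted(data)                 # Step 1: ascending order
    n = len(values)
    # Step 2: i = (p/100) n, split into whole part and remainder so the
    # integer test below involves no floating-point rounding.
    whole, remainder = divmod(p * n, 100)
    if remainder != 0:
        # Step 3(a): i is not an integer, so round up; the position is
        # whole + 1 (1-based), which is index `whole` (0-based).
        return values[whole]
    # Step 3(b): i is an integer, so average positions i and i + 1.
    return (values[whole - 1] + values[whole]) / 2


salaries = [3310, 3355, 3450, 3480, 3480, 3490,
            3520, 3540, 3550, 3650, 3730, 3925]

print(percentile(salaries, 85))   # value in the 11th position
print(percentile(salaries, 50))   # average of the 6th and 7th values
```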
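• Quartiles: It is often desirable to divide data into four parts, with each part containing approximately
25% of the observations. The division points are referred to as the quartiles and are defined as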
• Q1 = first quartile, or 25th percentile
• Q2 = second quartile, or 50th percentile (also the median)
• Q3 = third quartile, or 75th percentile.
• Figure 3.1 shows a data distribution divided into four parts.
Example. The starting salary data are again arranged in ascending order. We already identified Q2, the second
quartile (median), as 3505.
3310 3355 3450 3480 3480 3490 3520 3540 3550 3650 3730 3925
The computations of quartiles Q1 and Q3 require the use of the rule for finding the 25th and 75th percentiles.
For Q1,
i = (p/100) n = (25/100) 12 = 3
Because i is an integer, step 3(b) indicates that the first quartile, or 25th percentile, is the average of the third and
fourth data values; thus, Q1 = (3450 + 3480)/2 = 3465.
For Q3,
i = (p/100) n = (75/100) 12 = 9
Again, because i is an integer, step 3(b) indicates that the third quartile, or 75th percentile, is the average of the
ninth and tenth data values; thus, Q3 = (3550 + 3650)/2 = 3600. The quartiles divide the starting salary data into
four parts, with each part containing 25% of the observations.
Measures of Variability
• Range: the simplest measure of variability is the range.
Range = Largest value - Smallest value
• Interquartile Range: This measure of variability is the difference between the third
quartile, Q3, and the first quartile, Q1. In other words, the interquartile range is the range
for the middle 50% of the data.
IQR = Q3 - Q1
For the data on monthly starting salaries, the quartiles are Q3 = 3600 and Q1 = 3465. Thus,
the interquartile range is 3600 - 3465 = 135.
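A short sketch of both measures for the starting salary data follows. The quartiles are computed the way the text describes for this data set (each is the average of two adjacent sorted values); the variable names are chosen here for illustration.

```python
salaries = sorted([3310, 3355, 3450, 3480, 3480, 3490,
                   3520, 3540, 3550, 3650, 3730, 3925])

# Range = largest value - smallest value.
data_range = max(salaries) - min(salaries)

# For these 12 observations the quartile indexes are whole numbers, so
# Q1 is the average of the 3rd and 4th values and Q3 is the average of
# the 9th and 10th values (list indexes are 0-based).
q1 = (salaries[2] + salaries[3]) / 2
q3 = (salaries[8] + salaries[9]) / 2

# Interquartile range: the range of the middle 50% of the data.
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```

The range (615) depends only on the two extreme values, so the single high salary of 3925 inflates it; the interquartile range (135) is unaffected by that value.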
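• The variance is a measure of variability that utilizes all the data. The variance is based on
the difference between the value of each observation (xi) and the mean.
• For a sample of n observations, the sample variance is s^2 = Σ(xi - x̄)^2 / (n - 1); for a population of N
observations, the population variance is σ^2 = Σ(xi - μ)^2 / N.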
• The standard deviation is defined to be the positive square root of the variance.
• Coefficient of variation: indicates how large the standard deviation is relative to the mean.
Coefficient of variation = (Standard deviation / Mean × 100)%
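The three measures can be computed step by step for the starting salary data. This sketch treats the 12 salaries as a sample, so the sum of squared deviations is divided by n - 1; the variable names are chosen here for illustration.

```python
import math

salaries = [3310, 3355, 3450, 3480, 3480, 3490,
            3520, 3540, 3550, 3650, 3730, 3925]

n = len(salaries)
mean = sum(salaries) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
squared_deviations = [(x - mean) ** 2 for x in salaries]
variance = sum(squared_deviations) / (n - 1)

# Standard deviation: the positive square root of the variance.
std_dev = math.sqrt(variance)

# Coefficient of variation: standard deviation as a percentage of the mean.
coefficient_of_variation = std_dev / mean * 100

print(f"variance = {variance:.2f}")
print(f"standard deviation = {std_dev:.2f}")
print(f"coefficient of variation = {coefficient_of_variation:.1f}%")
```

The standard deviation is expressed in the same units as the data (dollars), which makes it easier to interpret than the variance (dollars squared).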
Exploratory Data Analysis
• In a five-number summary, the following five numbers are used to summarize the data:
1. Smallest value
2. First quartile (Q1)
3. Median (Q2)
4. Third quartile (Q3)
5. Largest value
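Example: The monthly starting salaries shown in Table 3.1 for a sample of 12 business school graduates are
repeated here in ascending order.
3310 3355 3450 3480 3480 3490 3520 3540 3550 3650 3730 3925
The five-number summary for the salary data is 3310, 3465, 3505, 3600, 3925.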
• A box plot is a graphical summary of data that is based on a five-number summary. A key to the
development of a box plot is the computation of the median and the quartiles, Q1 and Q3. The
interquartile range, IQR = Q3 - Q1, is also used.
• The steps used to construct the box plot follow.
o A box is drawn with the ends of the box located at the first and third quartiles. For the salary
data, Q1 = 3465 and Q3 = 3600. The box contains the middle 50% of the data.
o A vertical line is drawn in the box at the location of the median (3505 for the salary data).
o By using the interquartile range, IQR = Q3 - Q1, limits are located. The limits for the box plot
are 1.5(IQR) below Q1 and 1.5(IQR) above Q3.
o For the salary data, IQR = Q3 - Q1 = 3600 - 3465 = 135. Thus, the limits are 3465 - 1.5(135)
= 3262.5 and 3600 + 1.5(135) = 3802.5. Data outside these limits are considered outliers.
o The dashed lines in Figure 3.5 are called whiskers. The whiskers are drawn from the ends of
the box to the smallest and largest values inside the limits computed in step 3. Thus, the
whiskers end at salary values of 3310 and 3730.
o Finally, the location of each outlier is shown with the symbol *. In Figure 3.5 we see one
outlier, 3925.
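The numbers behind the box plot can be reproduced with a few lines of Python. This sketch computes only the limits, whisker ends, and outliers, not the drawing itself; the quartiles are the values stated in the text, and the variable names are chosen here for illustration.

```python
salaries = [3310, 3355, 3450, 3480, 3480, 3490,
            3520, 3540, 3550, 3650, 3730, 3925]

# Quartiles for the salary data, as computed earlier in the notes.
q1, q3 = 3465, 3600
iqr = q3 - q1

# Limits are located 1.5(IQR) below Q1 and 1.5(IQR) above Q3.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the limits are outliers; whiskers extend to the smallest
# and largest values that lie inside the limits.
inside = [x for x in salaries if lower_limit <= x <= upper_limit]
outliers = [x for x in salaries if x < lower_limit or x > upper_limit]
whisker_ends = (min(inside), max(inside))

print("limits:", lower_limit, upper_limit)
print("whiskers end at:", whisker_ends)
print("outliers:", outliers)
```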
Measures of Association Between Two Variables
• Covariance is a descriptive measure of the linear association between two variables.
• For a sample of size n with the observations (x1, y1), (x2, y2), and so on, the sample covariance
is defined as follows:
s_xy = Σ(xi - x̄)(yi - ȳ) / (n - 1)
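A minimal sketch of the sample covariance calculation follows: the products of the paired deviations from the two sample means are summed and divided by n - 1. The function name and the five (x, y) pairs are hypothetical, made up here for illustration; they do not come from the source.

```python
def sample_covariance(x, y):
    """Sample covariance of paired observations (x_i, y_i)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum the products of paired deviations, then divide by n - 1.
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


# Hypothetical paired data, invented for this illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = sample_covariance(x, y)
print(s_xy)
```

A positive value indicates that larger x values tend to occur with larger y values; a negative value indicates the opposite. The magnitude depends on the units of x and y.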
Correlation Coefficient