Univariate EDA of Numerical Data

Chapter 3 focuses on univariate exploratory data analysis (EDA) of numerical variables, particularly in the context of COVID-19 data and HDB resale flats in Singapore. It discusses techniques for summarizing and visualizing data distributions, including the use of frequency tables and histograms, while emphasizing the importance of understanding the shape, center, and spread of distributions. The chapter also addresses outliers and their impact on measures of central tendency, highlighting the need for careful analysis of data variability.

Uploaded by

yuminhokie

Chapter 3

Dealing with Numerical Data

Section 3.1 Univariate EDA

In Chapter 1, we introduced two main types of variables that we will be focussing on, namely categorical
variables and numerical variables. Categorical variables were discussed extensively in Chapter 2 and in
this chapter, we will turn our attention to numerical variables and how they can be analysed.
Consider the following table that shows a portion of a data set relating to COVID-19 cases in
Singapore.

An example of a numerical variable in this data set is Age. Can you identify another numerical
variable? The analysis of data, more precisely, Exploratory Data Analysis (or EDA) is a process of
summarising or understanding the data and extracting insights or main characteristics of the data. This
is a critical part of the “Analysis” step of the PPDAC problem solving cycle. In this chapter, we will
discuss how numerical variables can be summarised and understood. To begin, the focus of this section
will be on data exploration techniques for one variable, or univariate exploratory data analysis.

Example 3.1.1 In Chapter 2, the recurring data set that was used to drive the discussion on categorical
variables was the patients with kidney stones data set. In this chapter, we will be using a data set closer
to home.

The data set (Microsoft Excel file partially shown above) that we will be looking at in this chapter
corresponds to sales of Housing Development Board (HDB) resale flats within the period of January 2017
to June 2021. The entire data set contains 99,236 rows and 11 columns. Note that each transaction is
a row of the Excel file and each transaction contains information on variables (the columns) like month
(of sale), flat’s floor area (in square metres), resale price, etc.
The PPDAC cycle starts off with
1. Problem. So what is the problem that we are considering and attempting to answer? If you are
a potential buyer, perhaps a question that you may be interested in investigating could be

What factors may affect the pricing of resale flats sold in Singapore?

2. Plan. Here, we need to decide what are some of the variables that are relevant and possible factors
that answer the question. Suppose these variables were determined to be the 11 columns of the
data set. Some of these variables are

– “Month” - this is the month/year of the resale transaction;

– “Town” - this is the town that the resale flat belongs to;
– “Floor area sqm” - this is the floor size of the resale flat;
– “Resale price” - this is how much the flat was sold for.

3. Data. In this stage, data is collected and prepared as shown in the table above.

4. Analysis. We are now at this stage where the data is going to be analysed in attempting to answer
the Problem.

Definition 3.1.2 A distribution is an orientation of data points, broken down by their observed number
or frequency of occurrence.

Example 3.1.3 Let us look at our HDB resale flats data set. The first few rows of the data set for
transactions from January to June 2021 are reproduced in the table below.

Month Floor area sqm Age Resale price

1/1/2021 45 35 225000
1/1/2021 45 35 211000
1/1/2021 73 45 275888
1/1/2021 67 43 316800
1/1/2021 67 43 305000
1/1/2021 68 40 260000
1/1/2021 73 44 351000
1/1/2021 73 44 343000
1/1/2021 75 41 306000

We would like to investigate the distribution of the Age¹ variable. To do this, we would need to
collate the number of flats with the same age when the resale transaction was made and put them in a
frequency table. For example, the first two rows of the data indicate that the first two HDB flats in the
data set had the same age of 35 years when they were sold, while the third flat was 45 years old, and so
on. Suppose the frequency table collated for the entire data set is as follows:

Age Frequency
2 9
3 8
4 583
5 1105
6 884
7 295
8 255
... ...

If we simply look at the frequency values in the table, it would be hard to observe any patterns or
gain insights into how the frequencies are distributed across the different age values. We will introduce
two different graphs to present the distribution in a better way.
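The collation step in Example 3.1.3 can be sketched in a few lines of Python. This is only an illustration, assuming the nine sample rows shown in the table are the whole input; the variable names are hypothetical and the full data set is not reproduced here.

```python
from collections import Counter

# Ages from the nine sample rows shown in Example 3.1.3 (not the full data set).
ages = [35, 35, 45, 43, 43, 40, 44, 44, 41]

# Count how many flats were sold at each age.
frequency = Counter(ages)

# Print the frequency table in increasing order of age.
print("Age  Frequency")
for age in sorted(frequency):
    print(f"{age:<4} {frequency[age]}")
```

The same idea scales to the full data set: each distinct age becomes one row of the frequency table.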

Example 3.1.4 (Histograms for Univariate EDA) A histogram is a graphical representation that
organises data points into ranges or bins. It is particularly useful when we have large data sets. Let
us see what the histogram looks like when we use Microsoft Excel to create one based on the “Age”
frequency from Example 3.1.3. To create a histogram, the variable values are “grouped” into equal-size
intervals called bins. For our “Age” variable, we can use bins with a width of 2 years. The number of
flats in each bin is counted and tabulated.

Bins Frequency
0-2 9
2-4 591 (8 + 583)
4-6 1989 (1105 + 884)
6-8 550 (295 + 255)
8-10 336 (219 + 47)
... ...

You may notice that for the 2-4 Bin, the frequency is obtained by adding the number of flats sold
at Age 3 and Age 4 and excludes those sold at Age 2. Thus, the left-end point of the interval 2-4 is
excluded. The same is observed for the rest of the bins. The histogram created using Radiant is shown
below:

¹ The data set, which can be downloaded from [Link], actually does not contain
the “Age” variable. The “Age” variable was created by subtracting lease commence date from the year the flat was sold.

With the height of each bar representing the frequency for that bin range, the highest bar would
represent the most frequently occurring range of values.
From the histogram above, we see that the range 4-6 years has the highest frequency as it accounts
for 1989 out of the total 11644 transactions, or about 17% of the flats sold.
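The binning rule described in Example 3.1.4 (left end point excluded, right end point included) and the 17% figure can be checked with a short sketch. It assumes only the frequencies for ages 2 to 8 listed in Example 3.1.3 and the total of 11644 transactions quoted above; the helper `bin_label` is a hypothetical name introduced for illustration.

```python
import math
from collections import defaultdict

# Frequencies for ages 2 to 8, copied from the table in Example 3.1.3.
age_frequency = {2: 9, 3: 8, 4: 583, 5: 1105, 6: 884, 7: 295, 8: 255}

BIN_WIDTH = 2  # years, as in Example 3.1.4


def bin_label(age, width=BIN_WIDTH):
    """Return the bin 'low-high' containing age, with low excluded and high included."""
    high = math.ceil(age / width) * width
    return f"{high - width}-{high}"


# Add up the frequencies of all ages that fall into the same bin.
bin_frequency = defaultdict(int)
for age, count in age_frequency.items():
    bin_frequency[bin_label(age)] += count

for label, count in bin_frequency.items():
    print(label, count)

# Share of all transactions that fall in the 4-6 bin.
share = bin_frequency["4-6"] / 11644
print(f"{share:.1%}")
```

Rounding up to the next multiple of the bin width is one simple way to make the right end point inclusive, so that Age 4 lands in the 2-4 bin and Age 2 lands in the 0-2 bin.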

Remark 3.1.5 You may wonder how we came to the decision to have bin widths of 2 years rather
than 3 years (or any bigger number). There is no correct answer for this. Normally, we would construct
several histograms with different bin widths before deciding which one is most appropriate.

Once we have obtained and visualised the distribution of a numerical variable, we would like to
describe the overall pattern of the distribution as well as whether there are any deviations from the
overall pattern. To describe the overall pattern of the distribution, we will focus on the

1. Shape;

2. Center; and

3. Spread of the distribution.

Deviations from the overall pattern usually refer to outliers, which will be discussed
later in this chapter. Let us start by looking at how we can describe the shape of a distribution.

Discussion 3.1.6 (Shape - peaks and skewness). There are two important descriptors when we
discuss the shape of a distribution, namely the peaks and the skewness. Let us look at another histogram
plot obtained from the HDB resale data set. Rather than the age of the flat at the point of resale, we
consider another numerical variable of interest, which is the “Resale Price”. The following histogram
was obtained when we set a bin size of 25,000.

There is a peak in the interval [455000, 480000]. The distribution is unimodal, which means that it
has one distinct peak. This tells us that the most frequent resale flat prices lie between 455,000 and
480,000.
Distributions are not always unimodal. Looking at the histogram we plotted earlier for the Age of
the resale flats, we see that there is more than one distinct peak. In such a situation, we say that the
distribution is multimodal. If a distribution has exactly two distinct peaks, we say it is bimodal.

In the histogram above, we see the highest peak in the 4-6 years range and the second highest peak
occurring in the 34-36 years range. It should be noted that we say these are peaks because they occur
most frequently in their immediate neighbourhoods of age ranges.
For a unimodal distribution, we can use another descriptor to describe the shape of the distribution,
that is, whether the distribution is symmetrical or skewed.

In a symmetrical distribution (middle picture above), the left and right halves of the distribution are
approximate mirror images of each other, with the peak in the middle.
For the picture on the left, the distribution is left skewed, with the peak shifted to the right and a
relatively long “tail” on the left.
The picture on the right shows a distribution that is right skewed. Such a distribution has the peak
shifted to the left and a relatively long “tail” on the right. Referring back to the distribution of resale
prices of HDB flats, we see that the distribution is right skewed, meaning that there are some (but few)
flats sold at very high prices. These data points gave rise to the long tail to the right of the peak.

Example 3.1.7 (Symmetrical distribution - Bell curve) One of the most well-known symmetrical
distributions is the normal distribution or what is commonly known as the bell curve. A famous example
of the normal distribution is that of the IQ scores in a population, based on the Wechsler Intelligence
scale.

From the figure, we see that the peak happens at 100, which means that the average IQ of a person in
the population is 100. We also see that about 68% of the population has IQ scores in the range between
85 and 115.
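The 68% figure can be checked numerically. The sketch below assumes that IQ scores are normally distributed with mean 100 and standard deviation 15; the standard deviation is not stated in the text and is an assumption made here so that 85 and 115 sit one standard deviation either side of the mean.

```python
from statistics import NormalDist

# Assumption: IQ scores follow a normal distribution with mean 100 and
# standard deviation 15 (the standard deviation is not stated in the text).
iq = NormalDist(mu=100, sigma=15)

# Proportion of the population with IQ between 85 and 115.
proportion = iq.cdf(115) - iq.cdf(85)
print(f"{proportion:.1%}")
```

Under this assumption the proportion comes out at roughly 68%, in line with the figure.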

Discussion 3.1.8 (Central tendency - mean, median and mode). Besides describing the shape of
the distribution, we can also describe the characteristics of a distribution more precisely using measures
of central tendency. The three most common measures of central tendency are mean, median and mode,
which were all introduced in Chapter 1.
The three possible shapes of a distribution have different relative positions of the mean, median and
mode.

1. For a symmetrical distribution, the mean, median and mode will be very close to each other near
the peak of the distribution.

2. For a left skewed distribution, we usually (but not always) have

mean < median < mode .

To see why this is the case, notice that the small number of extremely small values which contribute
to the long tail on the left will push down the mean/average, as compared to the median which is
less affected by these extremely small values. The mode, found at the peak of the distribution, is
naturally the largest among the three measures of central tendency.

3. For a right skewed distribution, we usually have the opposite of the left skewed distribution, which is

mode < median < mean .

In this case, there are a small number of extremely big values which contribute to the long tail on
the right. These big values will push up the mean/average as opposed to the median which is less
affected by these extremely large values. The mode in such a distribution would be the smallest
among the three measures of central tendency.
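The ordering of the three measures can be seen on a small example. The data below are made up purely for illustration (they are not taken from the HDB data set): a right skewed sample, and its mirror image, which is left skewed.

```python
from statistics import mean, median, mode

# Made-up right skewed sample: most values are small, with a few large ones.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]

# Mirroring the values about 21 turns the right tail into a left tail.
left_skewed = [21 - x for x in right_skewed]

for name, data in [("right skewed", right_skewed), ("left skewed", left_skewed)]:
    print(name, "mode =", mode(data), "median =", median(data), "mean =", mean(data))
```

For the right skewed sample the mode is the smallest and the mean the largest; for the mirrored sample the order is reversed. As noted above, this ordering is typical rather than guaranteed.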

Example 3.1.9 Referring again to the resale prices distribution, we have seen the shape of the distribution
and concluded that the distribution is right skewed.
The mean, median and mode of this distribution were found to be 496,870.40, 468,000 and 420,000
respectively. This indeed agrees with

mode < median < mean .

Discussion 3.1.10 (Spread - standard deviation and range). Besides the shape and center of the
distribution, we can also describe the spread of a distribution. This refers to how the data vary around
the central tendency.

Take a look at the two distributions above, both of which have the same central tendencies. In fact,
the mean, median and mode of both distributions are 10. However, the top distribution has a relatively
lower variability compared to the distribution below. This means that the data in the top distribution
are all relatively close to the center while the data in the bottom distribution are more spread out, or
have more variability. We can also say that the data in the bottom distribution are spread across a much
wider range.
The most commonly used measure of variability is standard deviation which was introduced in Section
1.5. For the two distributions shown here, the top distribution has standard deviation 1.69 while the
bottom distribution has standard deviation 4.30.
A simpler measure of variability is the range of the distribution. This is defined to be the difference
between the largest and the smallest data points in the distribution. The range is simple to compute
but sometimes it can be misleading. For example, if we look at the range of the HDB resale prices data,
we obtain

$$\text{Range} = \text{Highest resale price} - \text{Lowest resale price} = \$1{,}250{,}000 - \$180{,}000 = \$1{,}070{,}000.$$

The range is very large and is due to the existence of a few extremely high resale prices. It is not really
the case that there is great variability in resale prices as we see that most of the resale prices are actually
much lower and the variability is not as big as the range indicates it to be.

Definition 3.1.11 An outlier is an observation that falls well above or below the overall bulk of the
data.

Consider the data set with 11 data points shown above. We can consider 75 and 85 as outliers since
they are way larger than the rest of the data points. At this point, we use our judgement to identify
values that appear to be exceptions to the general trend in the data. Later on, we will be introducing a
more precise method (boxplot) to identify outliers.
Identifying outliers can be useful when we wish to identify any strong skewness in a distribution.
Sometimes the outliers are caused by erroneous data collection or data entry but this may not always be
the case. It is also possible that outliers are legitimate data points that provide us interesting insights into
the behaviour of the data. A general rule when we investigate a data set is that outliers should not be
removed unnecessarily as they do tell us something about the behaviour of the variable and prompt us
to investigate further why such extreme values can happen.

Example 3.1.12 Consider the data set below:

4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 300.

It is not difficult to be convinced that 300 is an outlier in the data set. The table below shows the three
different central tendencies as well as the standard deviation for the entire set and also when the outlier
is removed from the data set.

                       Mean   Median   Mode   Standard deviation
Without removing 300   30     5.5      5      85.03
With 300 removed       5.45   5        5      1.04

We see that among the three central tendencies, the mean seems to be the most affected by the
removal of the outlier, while both the median and the mode either remained the same or only changed
slightly. Without removing the outlier, the mean is pulled away in the direction of the skew (in this
example, the distribution is skewed to the right). In such cases, the mean may no longer be a good measure
of the central tendency of the distribution. We call the median and the mode robust statistics.
In addition, the standard deviation also increases greatly from 1.04 to 85.03 because of the outlier.
This is expected because the standard deviation measures the spread of the data points and with the
outlier being far away from the other data points, the variability of the distribution is understandably
high.
As mentioned above, we need to treat outliers with care. If they have minimal effect on the conclusions
and if we cannot figure out why they are there, such outliers may possibly be removed. However, if they
substantially affect the results, then we should not drop them without justification.
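The figures in the table of Example 3.1.12 can be reproduced with Python's standard `statistics` module. This is a sketch rather than the method used in the text; it assumes the standard deviation is the sample standard deviation (divisor n - 1), which is what `statistics.stdev` computes, and the helper name `summarise` is hypothetical.

```python
from statistics import mean, median, mode, stdev

# Data set from Example 3.1.12, with and without the outlier 300.
data = [4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 300]
without_outlier = [x for x in data if x != 300]


def summarise(values):
    """Return (mean, median, mode, sample standard deviation), rounded to 2 d.p."""
    return (round(mean(values), 2), median(values), mode(values), round(stdev(values), 2))


print(summarise(data))
print(summarise(without_outlier))
```

Comparing the two printed summaries shows the mean and the standard deviation moving a great deal while the median and the mode barely change.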

Example 3.1.13 Suppose we are interested to find out if there are significant differences in the distribution
of HDB resale prices for different time periods. For example, would the distributions differ significantly
if we compare the period July to December 2020 with January to June 2021? The two distributions are
shown below.

The distribution on the left corresponds to the period of resale from July to December 2020. The
distribution for January to June 2021 is shown on the right. We observe that both distributions have
a similar shape which is right skewed with a single peak. Taking it one step further, we compare the
central tendencies and variabilities of the data points in both periods. The values in the table can be
computed using the Microsoft Excel Data Analysis Toolpak.

                        Mean      Median    Mode      Range       Standard deviation
July to December 2020   462,827   435,000   400,000   1,098,000   155,955
January to June 2021    496,870   468,000   420,000   1,070,000   162,107

Observe that all measures of mean, median and mode are higher in the time period January to June
2021 compared to those in the time period July to December 2020. The range of the resale prices is lower
in January to June 2021 while the standard deviation is actually higher. In conclusion, we can say that
resale prices in January to June 2021 are higher, but more spread out (in terms of standard deviation)
compared to the resale prices in July to December 2020.

Example 3.1.14 In Example 3.1.4, we described the setting of bin widths when creating a histogram.
The choice of bin width can have a big impact on how the histogram looks and thus affect our
observation and conclusion on the shape of the distribution.

The three histograms above are constructed using the same data set of 233 students’ final exam scores
with the only difference being the bin width settings. The histogram on the left has a bin width of 20,
while the one in the middle has a bin width of 10. The last histogram has its bin width set at 5. What
conclusions can be made on the distribution based on these histograms?
Based on the first histogram, we may conclude that most students scored between 60 and 80
marks, and the distribution is rather symmetric. However, with a slightly smaller bin width, the second
histogram reveals that most students actually scored between 70 and 80 marks. This does not contradict
the observation made earlier based on the first histogram but because of the smaller bin width, we are
able to narrow the range of marks that are scored by most students. With an even smaller bin width, the
third histogram suggests that most students scored between 65 and 75 marks. How do you rationalise
this conclusion with the one from the second histogram?
In general, we should bear in mind the following when determining bin widths for histograms.

1. Avoid histograms with bin widths that are too large. This will result in only a few bins and
information in the data will be lost when data points are grouped together into a small number of
groups/bins.

2. Avoid histograms with bin widths that are too small. If we do this, there may be bins that have
very few data points (or none), which does not give us a sense of the distribution.

3. Our initial choice of bin width may not be the most appropriate. Different histograms with various
bin widths should be created before deciding which one is the most useful and informative.
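The effect of the bin width can be explored with a rough text histogram. The scores below are made up for illustration only (they are not the 233 exam scores of Example 3.1.14), and `histogram_counts` is a hypothetical helper; bins exclude their left end point, following the convention of Example 3.1.4.

```python
import math
from collections import Counter

# Made-up exam scores for illustration only (not the 233 scores in the example).
scores = [42, 55, 58, 61, 63, 66, 67, 68, 70, 71, 72, 72, 73, 74, 74, 76, 78, 81, 85, 93]


def histogram_counts(values, width):
    """Count values per bin (high - width, high], keyed by the upper end of the bin."""
    return Counter(math.ceil(v / width) * width for v in values)


# Draw the same data with three different bin widths.
for width in (20, 10, 5):
    counts = histogram_counts(scores, width)
    print(f"bin width {width}:")
    for high in sorted(counts):
        print(f"  {high - width}-{high}: {'#' * counts[high]}")
```

With a width of 20 almost everything collapses into one bar, while a width of 5 produces many short bars; trying several widths side by side makes it easier to judge which picture is most informative.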

Remark 3.1.15 We should not confuse histograms with bar graphs introduced in Chapter 2. A
histogram shows the distribution of a numerical variable across a number line. So one of the axes
(usually the horizontal) will display the range of values taken on by the numerical variable. On the other
hand, the horizontal axis of a bar graph will show the different categories of a categorical variable.
In addition, the ordering of the bars in a histogram cannot be changed, as it progresses through the
range of values, usually in an ordered manner, taken on by the numerical variable. On the other hand,
the ordering of the bars in a bar graph can be switched around with little consequence. There are also
usually no gaps between the bars in a histogram.

Discussion 3.1.16 (Boxplots for Univariate EDA) Besides a histogram, another way to visualise
the distribution of a numerical variable is to use a boxplot. To construct a boxplot, we will use the
five-number summary, consisting of

1. Minimum;

2. Quartile 1 (Q1);

3. Median (Q2);

4. Quartile 3 (Q3);

5. Maximum.

The median and quartiles have already been introduced in Definition 1.6.1 and Definition 1.6.5. Furthermore,
we have also introduced the Interquartile range

$$\text{IQR} = Q_3 - Q_1.$$

While the median can be viewed as the center of a data set, the IQR is a way to quantify the spread of a
data set. We have defined an outlier in Definition 3.1.11 but did not provide an explicit way to classify
a data point as an outlier. For our purpose we will adopt the following consideration to classify a data
point as an outlier.

A data point is considered an outlier if it satisfies one of the following conditions:

– The value of the data point is greater than $Q_3 + 1.5 \times \text{IQR}$;

– The value of the data point is less than $Q_1 - 1.5 \times \text{IQR}$.

To construct a boxplot, we do the following:

1. Draw a box from Q1 to Q3.

2. Draw a vertical line in the box where the median (Q2) is located.

3. Identify all the outliers by using the consideration above.

4. Extend a line from Q1 to the smallest value that is not an outlier and another line from Q3 to the
largest value that is not an outlier. These lines are called whiskers.

5. Mark each of the outliers with dots or asterisks.

Example 3.1.17 Consider the following data set, with the data points already sorted in increasing
order.
18, 44, 47, 55, 61, 62, 78, 79, 83, 145.
There are 10 data points. The median (Q2) is the average of the fifth and sixth data points, so

$$Q_2 = \frac{1}{2}(61 + 62) = 61.5.$$

The first quartile is the median of the first five data points: 18, 44, 47, 55, 61, so Q1 = 47. The third
quartile is the median of the last five data points: 62, 78, 79, 83, 145, so Q3 = 79. Following Remark
1.6.9, it should be pointed out that you may encounter slightly different ways of finding quartiles for a
data set in other texts. For this course, we will adopt what is presented here.

The Interquartile Range is

$$\text{IQR} = Q_3 - Q_1 = 79 - 47 = 32.$$

To determine if we have outliers, note that $1.5 \times \text{IQR} = 48$. Since there are no data points smaller than
$Q_1 - 48 = 47 - 48 = -1$, there are no small-valued outliers. On the other end, since $145 > Q_3 + 48 = 79 + 48 = 127$, we
see that 145 is the only big-valued outlier.
The boxplot constructed is shown below.
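The calculation in Example 3.1.17 can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name `boxplot_summary` and the dictionary keys are hypothetical, and the treatment of an odd number of data points (leaving the middle value out of both halves) is an assumption, since the example only shows the even case.

```python
from statistics import median

# Data set from Example 3.1.17.
data = [18, 44, 47, 55, 61, 62, 78, 79, 83, 145]


def boxplot_summary(values):
    """Five-number summary, outlier limits, outliers and whisker ends for a boxplot."""
    ordered = sorted(values)
    n = len(ordered)
    # Quartiles as medians of the lower and upper halves. For an odd number of
    # points the middle value is left out of both halves (an assumption; the
    # text only shows the even case).
    q1 = median(ordered[: n // 2])
    q2 = median(ordered)
    q3 = median(ordered[(n + 1) // 2 :])
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < lower_limit or x > upper_limit]
    inside = [x for x in ordered if lower_limit <= x <= upper_limit]
    return {
        "min": ordered[0], "q1": q1, "median": q2, "q3": q3, "max": ordered[-1],
        "iqr": iqr, "limits": (lower_limit, upper_limit),
        "outliers": outliers, "whiskers": (inside[0], inside[-1]),
    }


summary = boxplot_summary(data)
print(summary)
```

The whisker ends are the smallest and largest values that are not outliers, so here the upper whisker stops at 83 and 145 is marked separately.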

Example 3.1.18 Let us return to the HDB resale flats data set. The boxplot below is based on the
resale prices of flats sold in January to June 2021.

The boxplot confirms our earlier conclusion that there are outliers that correspond to very high resale
prices. Note that the cross in the box, just above the median line represents the mean resale price. Recall
that we have discussed the shape, center and spread of the distribution using a histogram. What can we
say based on the boxplot?

1. (Shape) From the boxplot, we see that the variability in the upper half of the data, given by
$(\text{Max} - \text{Median})$, is significantly larger than the variability in the lower half of the data, which is equal
to $(\text{Median} - \text{Min})$. This confirms our earlier observation that the distribution is skewed to the
right and there is a relatively long tail to the upper end of the distribution due to the existence of
outliers.

2. (Center) The center, described by the median is easily observed in the boxplot, unlike in a
histogram. We can also compare the relative positions of the median and the mean from the
boxplot.

3. (Spread) The IQR of 204,000 gives us an idea of the spread for the middle 50% of the data set.
On its own it may not be immediately informative but this would be a meaningful measure to
compare across different distributions (see next example).

Example 3.1.19 The three boxplots below show the distributions of resale flat prices in three different
time periods, namely January to June 2020 (call this period P1), July to December 2020 (call this period
P2) and January to June 2021 (call this period P3). What can we say about the three distributions after
comparing the three boxplots?

1. All three distributions are right skewed as the upper halves of the data have greater variability than
the lower halves, due to (large-valued) outliers. However, upon a closer look, it is also apparent
that the upper half variability in period P1 is greater than the upper half variability in P2 which
in turn is greater than the upper half variability in P3.

2. The box covering the middle 50% of resale prices (from Q1 to Q3) is lowest in P1, followed by P2 and then
P3. Hence, the overall resale prices have increased over time. The spread (given by the height of
the boxes) appears to be similar between P1 and P2 while slightly higher in P3.

3. There appear to be more outliers in P1 and P2 compared to P3.

To conclude this section, we summarise the comparison between using histograms and boxplots to
represent a distribution.

1. A histogram typically gives a better sense of the shape of the distribution of a variable, compared to
a boxplot. When there are great differences among the frequencies of the data points, a histogram
will be able to illustrate this difference better than a boxplot.

2. If we wish to compare the distributions of different data sets, putting the different boxplots side
by side is more illustrative than using histograms.

3. To identify and indicate outliers, boxplots do a better job than histograms.

4. The number of data points we have in a data set is better shown in a histogram than in a boxplot.
In fact, two distributions with very different numbers of data points can have almost identical
boxplots. On the other hand, this difference is apparent by comparing the histograms.

The bottom line is that different graphics and summary statistics have their advantages and disadvantages
and they are often used together to complement each other.
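The point that a boxplot hides the number of data points can be illustrated with a small sketch. It reuses the data of Example 3.1.17 and a made-up larger data set in which every value is repeated 50 times; the helper `five_number_summary` is a hypothetical name, with quartiles taken as medians of the two halves.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with quartiles as medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    return (
        ordered[0],
        median(ordered[: n // 2]),
        median(ordered),
        median(ordered[(n + 1) // 2 :]),
        ordered[-1],
    )


small = [18, 44, 47, 55, 61, 62, 78, 79, 83, 145]  # data from Example 3.1.17
large = small * 50  # made-up: every value repeated 50 times, 500 data points

print(len(small), five_number_summary(small))
print(len(large), five_number_summary(large))
```

Both data sets have the same five-number summary and hence the same boxplot, even though one has 50 times as many points; histograms of the two would show very different bar heights.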

Section 3.2 Bivariate EDA

In this section, we will focus on how we can investigate a relationship between two variables in a
population.

Discussion 3.2.1 We start off with a relationship between two variables that is deterministic. This
means that the value of one variable can be determined exactly if we know the value of the other variable.
Perhaps the most common type of deterministic relationship is the one that involves the conversion of
units of measurement from one metric to another. For example:

1. The relationship between degrees Fahrenheit (F) and degrees Celsius (C) in the measurement of temperature.
We know that F and C are related by

$$C = (F - 32) \times \frac{5}{9}.$$

This is a deterministic relationship between F and C. For example, if the temperature in the oven
now is 450 degrees Fahrenheit (so F = 450), then the temperature in the oven now, measured in
degrees Celsius, is

$$C = (450 - 32) \times \frac{5}{9} \approx 232.22.$$

2. Meters (M) and Feet (F) are both measurements of length (or height) and they are related
(approximately) by

$$F = 3.2808 \times M.$$

So, if Johnny’s height is 5.9 Feet (so F = 5.9), then his height in meters will be

$$M = \frac{F}{3.2808} = \frac{5.9}{3.2808} \approx 1.8 \text{ meters}.$$
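Because these relationships are deterministic, each can be written as a function that always returns the same output for the same input. The sketch below uses the two formulas above; the function names are hypothetical.

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius: C = (F - 32) * 5 / 9."""
    return (f - 32) * 5 / 9


def feet_to_meters(feet):
    """Convert feet to meters: M = F / 3.2808 (approximate factor from the text)."""
    return feet / 3.2808


# The two worked examples above: a 450 degree Fahrenheit oven and a height of 5.9 feet.
print(round(fahrenheit_to_celsius(450), 2))
print(round(feet_to_meters(5.9), 1))
```

Knowing the input pins down the output exactly, which is what separates these conversions from the statistical relationships discussed next.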

Discussion 3.2.2 The main focus of this section is on a relationship between two variables that is not
deterministic in nature. We say such a relationship is statistical or non-deterministic. Recall that in a
deterministic relationship, given the value of one variable, we can find a unique value of the other variable.
However, this is not possible for a statistical relationship, where given the value of one variable, we can
only describe the average value of the other variable. Such relationships between variables, called associations,
occur quite often in our daily life.

Example 3.2.3 In a Medical News Today article² published in November 2020, it was reported that in
a study involving more than 150,000 participants, a clear link was observed between low physical fitness
and the risk of experiencing symptoms of depression, anxiety, or both.

This association between physical fitness and mental health may not be surprising but we wonder if
it could be due to other factors, like a confounder. More interestingly, does having better fitness make
a person mentally healthier or having better mental health make a person exercise more resulting in
better physical fitness? We will not only measure the association (if one exists) between variables but
also attempt to interpret any observed associations.
Bivariate data is data involving two variables. For example, in the HDB resale flat data set, we can
study the two variables Age and Resale Price.

² [Link]

Month Floor area sqm Age Resale price

1/1/2021 45 35 225000
1/1/2021 45 35 211000
1/1/2021 73 45 275888
1/1/2021 67 43 316800
1/1/2021 67 43 305000
1/1/2021 68 40 260000
1/1/2021 73 44 351000
1/1/2021 73 44 343000
1/1/2021 75 41 306000

In Section 3.1, we saw two ways to display univariate data, using either a histogram or a boxplot. For
bivariate data, it is clear that using a table like the one above is not really useful if we wish to investigate
if the two variables are associated. Instead, we will use a scatter plot to give us an idea of the pattern
formed by the data between the two variables in question. After looking at the scatter plot, we use a
quantitative measure called the correlation coefficient to quantify the level of linear association (if any)
between the two variables. Finally, we will attempt to fit a line or a curve through the points in the
data set which will enable us to make predictions on the values of the variables. This process is known
as regression analysis. For now, we will focus on scatter plots and defer the discussion on correlation
coefficients and regression analysis to the next few sections.

Example 3.2.4 Returning to our HDB resale flats prices data set, we will focus on the bivariate data
with the variables Age and Resale price. Suppose we wish to know if the age of the flat affects the resale
price, with the ultimate intention to make a prediction, based on the past resale prices, of how much
a 38-year-old resale flat is likely to cost. In this case, we can treat age as the independent (or
explanatory) variable and resale price as the dependent (or response) variable.

Our scatter plot shown above has the age (independent) variable on the x-axis and the resale price
(dependent) variable on the y-axis. Each resale transaction would be represented by an ordered pair

(x, y)

where x is the age of the resale flat and y is the resale price of that flat. For example, the ordered pair
(35, 225000) corresponds to the first resale flat listed in the table above. With a point plotted for each
ordered pair, since there are 11,644 resale transactions in the data set, there will be 11,644 points on the
scatter plot. Observe that in the scatter plot, each value of x (age of flat) corresponds to many different
values of y (the resale price). This is to be expected because there are many different transactions
involving flats of the same age and these transactions are generally made at different resale prices.
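The idea that one value of x can correspond to several values of y can be seen even in the nine sample rows of the table. The sketch below builds the ordered pairs (x, y) and groups the resale prices by age; the variable names are hypothetical and only the sample rows are used, not the full data set.

```python
from collections import defaultdict

# (age, resale price) for the nine sample rows in the table above.
pairs = [
    (35, 225000), (35, 211000), (45, 275888), (43, 316800), (43, 305000),
    (40, 260000), (44, 351000), (44, 343000), (41, 306000),
]

# Collect every resale price observed at each age.
prices_by_age = defaultdict(list)
for age, price in pairs:
    prices_by_age[age].append(price)

for age in sorted(prices_by_age):
    print(age, prices_by_age[age])
```

Ages 35, 43 and 44 each appear with two different prices, which is why the points in the scatter plot stack vertically above a single age.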
How do we describe the relationship between two numerical variables using a scatter plot?

Univariate data                                   Bivariate data

Overall pattern   Deviation from the pattern      Overall pattern   Deviation from the pattern
1) Shape          Outliers                        1) Direction      Outliers
2) Center                                         2) Form
3) Spread                                         3) Strength

We have seen that for univariate data, we discussed the shape (symmetrical or skewed), center
(median, mean and mode) and spread (interquartile range, standard deviation and range) of the distribution.
For bivariate data, we will use descriptors like the direction, form and strength to describe the relationship
between the two variables. For both univariate and bivariate data, data points that deviate significantly
from the pattern of the main bulk of data points are called outliers.

Definition 3.2.5 The direction of the relationship can be either positive, negative or neither. We say
that there is a positive relationship between two variables when an increase in one of the variables is
associated with an increase in the other variable.
On the other hand, a negative relationship between two variables means that an increase in one
variable is associated with a decrease in the other.
Not all relationships can be classified as either positive or negative and there are those that do not
behave in one way or the other.

The form of the relationship describes the general shape of the scatter plot. In general, we can classify
the form of the relationship as either linear or non-linear. The form of the relationship is linear when
the data points appear to scatter about a straight line. Later in the chapter, we will use a mathematical
equation to describe the straight line when the form of the relationship between two variables is linear.
When the data points appear to scatter about a smooth curve, we say that the form of the relationship
is non-linear. It is beyond the scope of this course to summarise curve patterns in the data but it is
useful to note that quadratic and exponential equations are examples of non-linear forms of relationship.

The two scatter plots on the left show a linear form of the relationship between the two variables
while the two scatter plots on the right show non-linear forms.
The strength of the relationship indicates how closely the data follow the form of the relationship.

Both scatter plots above suggest that there is a positive, linear relationship between the two variables.
However, the scatter plot on the left shows the data points lying very close to the straight line. This
indicates that the strength of the relationship is strong. The scatter plot on the right shows the data
points scattered loosely around the straight line and thus the strength of the relationship is weaker than
that in the scatter plot on the left.

Example 3.2.6 Let us look at the scatter plot from the HDB resale flats data again. The scatter plot
below is similar to the one from Example 3.2.4 except for an additional trendline drawn in black.

The trendline suggests that as the age of the HDB flat increases, the resale price decreases linearly
on average, in the period of January to June 2021. Is this relationship strong or weak? In fact, one can
argue that without the trendline, one may not even observe that there is a linear relationship between
age and resale price.
At this point, we cannot really tell if there is indeed a linear relationship and if there is, whether the
relationship is strong or weak. Nevertheless, in the next section, we will discuss a more precise measure
of the strength of a relationship.
As mentioned earlier, outliers are data points that deviate significantly from the pattern of the
relationship. Consider the scatter plot shown below that plots the resale price against the floor area of
the HDB resale flats. Do you observe any outliers?

Recall that for univariate data, using a boxplot, we can determine if a data point is an outlier by
checking if its value is greater than Q3 + 1.5 × IQR or smaller than Q1 − 1.5 × IQR. What about for
bivariate data? We will discuss more about outliers in the next section.
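
As a reminder of how the univariate rule works in practice, the following is a minimal Python sketch of the boxplot check. The floor areas used here are made-up values for illustration only (they are not taken from the HDB data set), and the quartile convention (`method="inclusive"`) is an assumption, since different software may compute Q1 and Q3 slightly differently.

```python
import statistics


def boxplot_outliers(values):
    """Return the values lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    # The "inclusive" convention is an assumption; other conventions exist.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical floor areas in square metres (illustrative only).
floor_areas = [67, 68, 70, 73, 82, 90, 92, 93, 104, 110, 121, 215]

print(boxplot_outliers(floor_areas))
```

Note that this check looks at one variable at a time, which is why it does not carry over directly to a scatter plot of two variables.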

Section 3.3 Correlation coefficient

In the previous section, using the HDB resale flats data set, we have observed that a flat’s resale price
is associated with the age of the flat. From the scatter plot, we concluded that the relationship between
the age of the flat and the resale price of the flat was negative. This means that flats whose ages were
higher tended to have a lower resale price. This is not surprising. However, can we say anything about
whether this relationship is strong or weak? If possible, can we measure the strength of this relationship
using a number?
More generally, given two numerical variables, is it possible for us to measure the relationship between
the two variables quantitatively?

Definition 3.3.1 The correlation coefficient between two numerical variables is a measure of the linear
association between them. The correlation coefficient, denoted by r, always ranges between −1 and 1. We
can use this number to summarise the direction and strength of linear association between two variables.
The sign of r tells us about the direction of the linear association. If r > 0, then the association
is positive, which means that when one of the variables increases, the other variable will tend to increase
as well. On the other hand, if r < 0, then the association is negative, which means that when one of
the variables increases, the other variable will tend to decrease. In the event that r = 1 (resp. r = −1),
we say that there is perfect positive association (resp. perfect negative association). When r = 0, we say there
is no linear association. Thus, while the sign of r tells us the direction of the linear association, the
magnitude of r (that is, how close r is to −1 or 1) will tell us the strength of the linear association
between two numerical variables.

Example 3.3.2 The two scatter plots below are examples of positive linear association between two
variables.

The plot on the left plots the price index of HDB flats against the price index of condominiums. We
observe that there is positive linear association between the two indices, which means that as the price
of HDB flats increases, it is likely that the price of condominiums would increase as well. The value of r
in this case is 0.95 which indicates that the association is strong.
The plot on the right shows the midterm mark of students against the final mark. Again, we observe
that there is positive linear association between the two marks and in this case, r was found to be 0.75.
The next two scatter plots are examples of negative linear association between two variables.

The plot on the left shows the price of oil against the price of gold. In this case, we observe that the
trend is that when the price of gold increases, the price of oil tends to decrease. The value of r was found
to be −0.67 and this indicates that there is negative linear association between gold and oil prices.
The plot on the right shows the amount of financial aid received by students against the students’
family income. It is not surprising to find that as the family income increases, the amount of financial
aid received by students would tend to decrease. The value of r in this case is −0.49 and there is negative
linear association between the two variables.

The two scatter plots above are examples where r = 0. This means that there is no linear association
between the two variables. However, note that while r = 0 for the second plot, we can see that the data
points fit very well onto a curve and there is a clear non-linear relationship between X and Y . More
generally, no linear association between variables does not necessarily mean no association between
variables.
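
To see how r = 0 can coexist with a clear non-linear relationship, here is a small Python sketch. The data are hypothetical (they are not the data behind the plots above): every point lies exactly on the curve y = x², yet the positive and negative products of standard units cancel out. The helper name `pearson_r` is our own and simply follows the standard-units recipe, dividing by n − 1.

```python
import statistics


def pearson_r(xs, ys):
    """Correlation coefficient via standard units, dividing by n - 1."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs
    total = sum(
        ((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(xs, ys)
    )
    return total / (len(xs) - 1)


# Hypothetical data lying exactly on the curve y = x^2.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r = pearson_r(xs, ys)
print(f"r = {r:.4f}")  # essentially zero, up to floating-point rounding
```

Although y is completely determined by x here, the correlation coefficient does not detect the relationship because it is not linear.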

The two plots above show situations where there is perfect (positive or negative) linear correlation
between the two variables. In such cases, all the data points are connected by (and thus lie on) a straight
line. There is, however, one exception, which is when the straight line joining all the data points is
actually a horizontal (or vertical) line. In such instances, the value of r is taken to be 0 (strictly
speaking, the formula for r is undefined here because one of the standard deviations is 0) and there is no
association between the two variables. This is because when the data points are connected by a vertical
or horizontal line, a change of value in one of the variables does not relate to a change in the other
variable.
When describing the strength of a linear relationship, we usually follow the rule of thumb as given
in the diagram below.

When the magnitude of r is between 0.7 and 1, we say that the two variables have a strong linear
association. If the magnitude is between 0.3 and 0.7, the two variables have a moderate linear association.
If the magnitude is between 0 and 0.3, the two variables have a weak linear association. Do note that
other sources may differentiate strong/moderate/weak linear associations at “cut-off” points other than
0.3 and 0.7.
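
The rule of thumb can be written as a short Python function. This is only a sketch: the function name `describe_strength` is our own, and the treatment of the boundary values (a magnitude of exactly 0.3 or 0.7) is an assumption, since the rule of thumb does not say which category the boundaries belong to.

```python
def describe_strength(r):
    """Classify the strength of linear association using the 0.3 / 0.7 cut-offs."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    magnitude = abs(r)  # strength depends on the magnitude, not the sign
    if magnitude >= 0.7:  # assumption: exactly 0.7 is counted as strong
        return "strong"
    if magnitude >= 0.3:  # assumption: exactly 0.3 is counted as moderate
        return "moderate"
    return "weak"  # includes r = 0, where there is no linear association


for value in (0.95, 0.75, -0.67, -0.49, 0.22):
    print(value, describe_strength(value))
```

The sign of r is ignored by the function because the sign describes the direction, not the strength, of the association.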
In general, as the value of r gets closer to −1 or 1, the data points will lie closer and closer
to a straight line. Scatter plots where the data points are loosely dispersed typically mean that
correlation is weak (or non-existent). We will now discuss how to compute the value of r numerically.

Example 3.3.3 We will go through the steps required to compute the correlation coefficient using an
example. Consider the following table that shows a total of 10 data points of bivariate data (x, y):

x 9 4 5 10 6 3 7 2 8 1
y 41 17 28 50 39 26 30 6 4 10

1. First compute the mean and standard deviation of x and y. (Refer to Definition 1.4.1 and Definition
1.5.1 if you have forgotten how these are computed.) For this data set, we find the mean and
standard deviation of x to be 5.5 and 3.03 respectively while the mean and standard deviation of
y are 25.1 and 15.65 respectively.

2. Convert each value of x and y into standard units. To convert x (resp. y) into its standard unit,
we compute
$$\frac{x - \bar{x}}{s_x} \quad \left(\text{resp. } \frac{y - \bar{y}}{s_y}\right),$$
where $\bar{x}$ and $\bar{y}$ are the means of x and y, and $s_x$ and $s_y$ are the standard deviations of x and y respectively. The table below shows the
values of x and y after they have been converted to standard units.

x 1.16 −0.50 −0.17 1.49 0.17 −0.83 0.50 −1.16 0.83 −1.49
y 1.02 −0.52 0.19 1.59 0.89 0.06 0.31 −1.22 −1.35 −0.96

3. Compute the product xy in their standard units for each data point. The table below has an
additional row for the value xy for each data point.

x 1.16 −0.50 −0.17 1.49 0.17 −0.83 0.50 −1.16 0.83 −1.49
y 1.02 −0.52 0.19 1.59 0.89 0.06 0.31 −1.22 −1.35 −0.96
xy 1.17 0.26 −0.03 2.36 0.15 −0.05 0.15 1.41 −1.11 1.43

4. Sum the products xy obtained in the previous step over all the data points and then divide the
sum by n − 1, where n is the number of data points. The result is the correlation coefficient r. For
the data set above,
$$r = \frac{1}{9}(1.17 + 0.26 - 0.03 + 2.36 + 0.15 - 0.05 + 0.15 + 1.41 - 1.11 + 1.43) = 0.64.$$
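
The four steps can also be carried out in a few lines of Python. The sketch below uses only the standard library and the ten data points from the table; the helper name `standard_units` is our own. We assume the sample standard deviation (dividing by n − 1), which is what `statistics.stdev` computes and which agrees with the values 3.03 and 15.65 quoted in step 1.

```python
import statistics

x = [9, 4, 5, 10, 6, 3, 7, 2, 8, 1]
y = [41, 17, 28, 50, 39, 26, 30, 6, 4, 10]


def standard_units(values):
    """Steps 1 and 2: subtract the mean, then divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (divides by n - 1)
    return [(v - mean) / sd for v in values]


zx = standard_units(x)
zy = standard_units(y)

# Step 3: product of the standard units for each data point.
products = [a * b for a, b in zip(zx, zy)]

# Step 4: sum the products and divide by n - 1.
r = sum(products) / (len(x) - 1)

print(round(r, 2))  # 0.64
```

Because the code works with unrounded standard units, individual products may differ in the last decimal place from those obtained by multiplying the rounded table entries, but the final value of r agrees to two decimal places.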

Remark 3.3.4 For the purpose of this module, you will not be required to compute r manually, instead
you should be familiar with the method of how r is computed and thereby develop some basic intuition
on the properties of r.

Example 3.3.5 Let us revisit Example 3.2.6, where the scatter plot of HDB resale flat prices against
the ages of the flat shown below does indeed suggest that these two variables are negatively associated.

Indeed, upon computing the correlation coefficient between these two variables, we find that r =
−0.356, confirming that there is moderate negative linear association between the age and resale price
of HDB flats in the period January to June 2021.
We will now present three properties of correlation coefficients.

1. From the “Age” vs. “Price” of HDB resale flats example, we saw that r = −0.36 when we consider
the scatter plot with Age as the x-axis and Resale price as the y-axis. What would happen to r if
we had done the plot with Resale price as the x-axis and Age as the y-axis? In other words, what
happens to r when we interchange the x and y variables? If we revisit the process that describes
how r is computed from a bivariate data set, we realise that the products of standard units are the same
regardless of which variable is called x (or y), so the computation of r is not affected in any way.

The correlation coefficient r is not affected by interchanging the x and y variables.

2. What would happen to the value of r if we add a constant to all the values of a variable? For
example, suppose it was discovered that there was an error in the recording of all the resale prices
of HDB flats and that the actual resale prices were all 1000 higher than what was given in the
data set. To correct this error, we would have to add 1000 to all the resale prices in the data set.
It turns out that such a change does not affect the value of r.

The correlation coefficient r is not affected by adding a number to all values of a variable.

While this may not be immediately obvious, you are encouraged to verify this result by using the
data set in Example 3.3.3 and adding some number to all the values of x (or y).

3. Instead of adding the same number to all the values of a variable, what would happen to the value
of r if we multiply all the values of a variable by a positive number? For example, what if the
resale prices were converted to US dollars? This means that we have to multiply all the resale
prices in the data set by a factor of 0.73 (assuming an exchange rate of 1 Singapore dollar to 0.73 US
dollars). It turns out that such a change again does not affect the value of r.

The correlation coefficient r is not affected by multiplying all values of a variable by a positive number.

You are again encouraged to verify this result by adjusting the data set in Example 3.3.3 and
recalculating the correlation coefficient.
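
One way to carry out the suggested verification is with a short Python sketch like the one below. It reuses the data from Example 3.3.3; the helper name `pearson_r` is our own, and the constants 1000 and 0.73 simply mirror the illustrations used for properties 2 and 3.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Correlation coefficient via standard units, dividing by n - 1."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs
    total = sum(
        ((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(xs, ys)
    )
    return total / (len(xs) - 1)


x = [9, 4, 5, 10, 6, 3, 7, 2, 8, 1]
y = [41, 17, 28, 50, 39, 26, 30, 6, 4, 10]

r_original = pearson_r(x, y)
r_swapped = pearson_r(y, x)                      # property 1: interchange x and y
r_shifted = pearson_r(x, [v + 1000 for v in y])  # property 2: add a constant
r_scaled = pearson_r(x, [0.73 * v for v in y])   # property 3: multiply by a positive number

for label, value in [
    ("original", r_original),
    ("swapped", r_swapped),
    ("shifted", r_shifted),
    ("scaled", r_scaled),
]:
    print(f"{label}: {value:.4f}")
```

All four printed values agree (up to floating-point rounding), because adding a constant or multiplying by a positive number leaves each value's standard unit unchanged.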

While the correlation coefficient between two numerical variables is insightful, there are certain
limitations.

Discussion 3.3.6

1. Association is not causation. To confuse association with causation is a common mistake that
is made by many. Very often when there is a strong association between two variables, with a
correlation coefficient r that is close to −1 or 1, it is mistakenly concluded that any change in
the explanatory variable, say x, will cause the response variable y to change. This is incorrect as
what we can conclude is only a statistical relationship between x and y and not a causal relationship.

Consider the example above of a scatter plot that came from a data set containing information
on the percentage of people that earned a Bachelor’s Degree in 2017 across 3142 counties in the
United States, as well as the per capita income of these counties in 2017.³ Each data point in the
scatter plot represents a county. The x-axis is the per capita income in the past 12 months while
the y-axis is the percentage of the population in the county that earned a Bachelor’s Degree in
2017. The correlation coefficient for the two variables is 0.79, which indicates that there is strong
and positive association between the two variables.
It would be tempting to conclude that the higher the per capita income of a county, the higher
the percentage of the county’s population would have earned a Bachelor’s Degree. This is not
necessarily true. The data here merely suggests association of the two variables and does not
establish any causal relationship.
2. r does not tell us anything about non-linear association. The correlation coefficient r,
as defined and described in this section, measures the degree of linear association between two
numerical variables. Whatever the computed value of r is, it does not give any indication of
whether the two variables could be associated in a non-linear way.

The correlation coefficients for the three scatter plots above are small but yet there is actually a
strong relationship between the variables. The value of r is small because the relationship between
the variables is not a linear one. It is always a good practice to look at a scatter plot of the data
set and not just deduce any relationship between the variables from the computed value of r.
³ Data set can be downloaded from [Link]/data/?data=county complete.

3. Outliers can affect the correlation coefficient significantly. Outliers are observations that
lie far away from the overall bulk of the data. How do outliers affect the value of the correlation
coefficient? The removal of outliers from a data set can have different effects on the correlation
coefficient, depending on how the outlier is positioned in relation to the rest of the data points.

Consider the scatter plot on the left, where the outlier is circled. The correlation coefficient is 0.22
based on the data set that includes the outlier. However, when we remove the outlier, we see that
there is a strong positive linear association between the remaining data points. Thus, in this case,
the presence of the outlier decreases the strength of the correlation, compared to when the outlier
is removed.
Consider the scatter plot on the right where again the outlier is circled. In this case, the correlation
coefficient is 0.75 based on the data set that includes the outlier. When the outlier is removed,
the remaining data points give a correlation coefficient of 0.01. Thus, in this case, the presence
of the outlier actually increases the strength of the correlation, compared to when the outlier is
removed.
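
Both effects can be reproduced with small made-up data sets, as in the Python sketch below. The numbers are hypothetical and are not the data behind the two plots above; they are chosen only so that one outlier weakens the correlation and another strengthens it. The helper name `pearson_r` is our own.

```python
import statistics


def pearson_r(xs, ys):
    """Correlation coefficient via standard units, dividing by n - 1."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs
    total = sum(
        ((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(xs, ys)
    )
    return total / (len(xs) - 1)


# Case A (hypothetical): a clear upward trend, plus one outlier at (9, 0).
xs_a = [1, 2, 3, 4, 5, 6, 7, 8]
ys_a = [2, 3, 5, 4, 6, 7, 9, 8]
r_a_without = pearson_r(xs_a, ys_a)
r_a_with = pearson_r(xs_a + [9], ys_a + [0])

# Case B (hypothetical): a 3-by-3 grid with no trend, plus one outlier at (10, 10).
xs_b = [1, 2, 3, 1, 2, 3, 1, 2, 3]
ys_b = [1, 1, 1, 2, 2, 2, 3, 3, 3]
r_b_without = pearson_r(xs_b, ys_b)
r_b_with = pearson_r(xs_b + [10], ys_b + [10])

print(f"Case A: without outlier {r_a_without:.2f}, with outlier {r_a_with:.2f}")
print(f"Case B: without outlier {r_b_without:.2f}, with outlier {r_b_with:.2f}")
```

In Case A the outlier lies far from the trend of the other points and pulls r towards 0, while in Case B the outlier lies far from a trendless cluster and by itself creates an apparent positive association.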

Example 3.3.7 For the HDB data set that we introduced earlier, the scatter plot below shows the
relationship between the resale price and the floor area of the flat. There are three outliers (circled) and
these are resale flats whose floor areas are larger than 200 square meters.

Using statistical software, it was found that the correlation coefficient was 0.626 before the outliers
were removed. After the outliers are removed, the correlation coefficient becomes 0.625, which is
practically the same as before.

Definition 3.3.8 So far, we have discussed correlation in the setting where individual data points are
considered. For example, the collection of data points could represent individuals from a population.
However, we can also examine the data at an aggregated level by grouping these individuals based
on factors like ethnic group or education level. An ecological correlation is computed based on the
aggregates rather than on the individuals. Thus, ecological correlation represents relationships observed
at the aggregate level, considering the characteristics of groups rather than individuals.

Example 3.3.9 Consider the scatter plot below for a data set consisting of individuals belonging to
three distinct groups. The three groups are represented by the symbols circle, cross and plus.

The correlation coefficient computed at the individual level is r = 0.55, indicating that there is a
moderate and positive linear association between the variables X and Y . Suppose we compute the group
averages (for X and Y ) for the three subgroups and obtain the three red dots as shown in the figure.
These three red dots, or aggregate points, align rather closely along a straight line. In fact, if we compute
the correlation coefficient based on these three aggregate points, the correlation coefficient would be 0.9.
Consequently, this example illustrates that the ecological correlation derived from group averages
suggests a more pronounced (since 0.9 is closer to 1 than 0.55) positive linear association compared to
correlation calculated at the individual level.
This phenomenon does not happen all the time. In general, when the associations for individuals
and for aggregates are in the same direction, the ecological correlation based on aggregates will typically
overstate the strength of the association in individuals. Without getting into details, the intuitive
explanation is that averaging within groups removes much of the variability among individuals, so
this variability is not reflected when correlation is computed based on group aggregates.
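
The effect of averaging can be illustrated with a small Python sketch. The three groups below are hypothetical (they are not the data behind the scatter plot in Example 3.3.9) and were constructed so that the group averages fall exactly on a straight line. The helper name `pearson_r` is our own.

```python
import statistics


def pearson_r(xs, ys):
    """Correlation coefficient via standard units, dividing by n - 1."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs
    total = sum(
        ((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(xs, ys)
    )
    return total / (len(xs) - 1)


# Hypothetical individuals, listed as (x, y) pairs within three groups.
groups = {
    "circle": [(1, 3), (2, 1), (3, 2)],
    "cross": [(4, 5), (5, 3), (6, 4)],
    "plus": [(7, 7), (8, 5), (9, 6)],
}

# Correlation at the individual level: pool every individual together.
all_x = [x for points in groups.values() for x, _ in points]
all_y = [y for points in groups.values() for _, y in points]
r_individual = pearson_r(all_x, all_y)

# Ecological correlation: one aggregate point (the averages) per group.
avg_x = [statistics.mean(x for x, _ in points) for points in groups.values()]
avg_y = [statistics.mean(y for _, y in points) for points in groups.values()]
r_ecological = pearson_r(avg_x, avg_y)

print(f"individual level: {r_individual:.2f}")
print(f"aggregate level:  {r_ecological:.2f}")
```

For this made-up data set the ecological correlation is larger than the individual-level correlation, because the scatter of individuals within each group disappears once each group is replaced by its averages.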

Definition 3.3.10 The previous example reminds us that correlation at the individual level and at
aggregate level may tell us a different story about our data set. We need to be careful not to make
any wrong deductions. Consider the scatter plot below that represents the relationship between two
variables.

There are clearly four distinct subgroups of individuals (grouped by the four ovals). If we consider the
subgroup averages, represented by the four red dots in the diagram, the correlation between these four
subgroup averages suggests that there is a positive linear association, as indicated by the blue regression
line. Can we now conclude that at the individual level, there is also a positive linear association between
the two variables?
This is not the case. If we look at the individual level within each subgroup, we notice a weak, but
nevertheless negative linear association between the two variables. Thus, we would have been wrong if
we drew conclusions about correlation at the individual level based on what we observe at the aggregate
level. If we do so, we would have committed what is known as the ecological fallacy.
The moral of the story is that we should not assume that correlations based on aggregates will hold
true for individuals. Ecological correlation and correlation based on individuals are not the same and
should not be confused.

Definition 3.3.11 Consider another scatter plot below, again representing the relationship between
two variables X and Y .

Again, there are clearly three distinct subgroups of individuals in the data set and within each
subgroup, we observe a strong positive linear association between the two variables. Can we now conclude
that at the aggregate level, there is also a positive linear association between the two variables?
The three subgroup averages, represented by the three red dots are shown. It turns out that there is
actually no clear correlation between the variables at the aggregate level. Based on the correlation we
observed at the individual level, if we had mistakenly concluded that the same correlation would exist
at the aggregate level, we would have committed what is known as the atomistic fallacy.
Differentiating between the two types of fallacies described above can be confusing initially. The following
table summarises them.

Fallacy      Using                                      To conclude

Ecological   Ecological correlation (aggregate level)   Individual level correlation
Atomistic    Individual level correlation               Ecological correlation (aggregate level)

Section 3.4 Linear regression

Now that we have seen that the age of a HDB resale flat is negatively associated with the resale price,
it is reasonable to wonder if we can make some predictions on the resale price of a flat given the age of
the flat. For example, for a flat that is 40 years old, what is our guess for its resale price?

Definition 3.4.1 If we believe that two variables X and Y are linearly associated, we may model the
relationship between the two variables by fitting a straight line to the observed data. This approach is
known as linear regression. Recall that the equation of a straight line is given by

Y = mX + b,

where b is the y-intercept and m is the slope or gradient of the line. The y-intercept is the value of
Y when the value of X is 0. The slope of the line is the amount of change in Y when the value of X
increases by 1.

In the figure above, the straight line in red is the regression line that is fitted to the observed data,
represented by the blue dots. Consider the i-th observation (Xi , Yi ). The “?” in the figure represents the
residual of the i-th observation, which is the observed value of Y for Xi (that is, Yi ) minus the predicted
value of Y for Xi (predicted by the straight line). This residual, denoted by ei , is sometimes also called
the error of the i-th observation as it measures how far the predicted value is from the observed value.

Example 3.4.2 Let us return to the question we posed at the beginning of this section. What is our
prediction for the resale price of a HDB flat that is 40 years old?

With X representing the age of the resale flat and Y being the resale price, the regression line obtained
from the data set is
Y = −4007X + 591857.

This means that when X = 40 (age of resale flat is 40),

Y = −4007 × 40 + 591857 = 431577.

So the predicted resale price of a 40 year old flat is 431,577. It is important to note that we are not
concluding that

A 40 year old resale flat will be sold at 431,577.

But instead our linear regression model predicts that

The average resale price of 40 year old HDB flats is 431,577.

Furthermore, as the correlation between resale flat price and age of the flat is only moderate
(r = −0.356), the prediction obtained from the linear regression above may not be as accurate as it
would be if the correlation were stronger.
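
The prediction step in this example can be sketched in Python as follows. The slope is taken to be negative (−4007), which is what the arithmetic above requires in order to give 431,577. The function name `predicted_average_price` is our own, and the age limits of 2 and 54 years are the range of ages in the data set reported in Remark 3.4.4; refusing to predict outside that range is a design choice of this sketch.

```python
SLOPE = -4007        # change in predicted average resale price per additional year of age
INTERCEPT = 591857   # value of Y when X = 0
MIN_AGE, MAX_AGE = 2, 54  # range of ages observed in the data set


def predicted_average_price(age):
    """Predicted average resale price for flats of the given age (in years)."""
    if not MIN_AGE <= age <= MAX_AGE:
        # The line was fitted only to ages within this range.
        raise ValueError("age is outside the range covered by the data set")
    return SLOPE * age + INTERCEPT


print(predicted_average_price(40))  # 431577
```

The returned number should be read as a predicted average over flats of that age, not as the price of any one particular flat.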
Now that we have seen how a regression line can be used, the question is how do we obtain such a
line given bivariate data? What method and principle is used to determine the regression line? Among
the many different straight lines that we can use to fit the data points, which one is the “best”?

Discussion 3.4.3 There are several ways to assess which straight line fits the observed data better.
One of the most common ways is the method of least squares. For this module, we will not go into the
technicalities of this method but instead we will briefly describe the idea behind this method.
Recall that when we fit a straight line through a set of observed data points (xi , yi ), the difference
between the observed value yi and the predicted outcome, predicted by the straight line, is known as
the residual of the i-th observation. This residual, denoted by ei , is also known as the error of the i-th
observation and measures how far the observed value is from the predicted value.

In the plot above, we see that each data point gives rise to an error term and it is reasonable to say
that a line of good fit is one that keeps the error terms (considered over all data points) small. However,
instead of looking at the overall error by summing up

e1 + e2 + · · · + en ,

where n is the total number of data points, the method of least squares seeks to find a straight line that
minimises the overall sum of squares of errors,

$$e_1^2 + e_2^2 + \cdots + e_n^2.$$

You may wonder why minimising $e_1^2 + e_2^2 + \cdots + e_n^2$ is more appropriate than minimising $e_1 + e_2 + \cdots + e_n$.
We will leave you to ponder about this question before having a discussion with your friends or instructor.

Remark 3.4.4

1. It is useful to note that the least squares regression line obtained from a set of observed data points
(xi , yi ) will always pass through the point of averages for that data set, that is, $(\bar{x}, \bar{y})$. This fact
can be established mathematically, but is beyond the scope of this course.

2. It is important to note that while we have obtained the least squares regression line that allows us
to predict the average resale price for a given age of the resale flat, the same regression line cannot
be used to predict the average age of resale flats for a given resale price. The reason is essentially
because of the way the regression line was obtained.
In obtaining the regression line with the independent variable (x) as age and the dependent variable
(y) as the resale price, the line was fitted to minimise the square of error terms between the observed
and predicted resale prices.
If the intention was to use a given resale price to predict the average age of the resale flats, then
we would be looking at another regression line that minimises the square of error terms between
the observed and predicted ages of resale flats.
The two regression lines are different and thus not interchangeable.

3. The correlation coefficient r between the variables X and Y is closely related to the regression line

Y = mX + b

obtained using the method of least squares. More precisely, we have

$$m = \frac{s_Y}{s_X}\, r,$$
where sX (resp. sY ) is the standard deviation of X (resp. Y ). With this relationship, we see that
if the correlation coefficient r is positive, then the gradient of the regression line is also positive.
Similarly, if the correlation coefficient is negative, then the gradient of the regression line will also
be negative. However, it is important to remember that the correlation coefficient is not necessarily
equal to the gradient of the regression line.

4. Another important point to note about the linear regression line obtained using a data set is with
regards to the range of the independent variable in the data set.

Recall that we have obtained the linear regression line for the purpose of predicting the average
resale price based on the age of the resale flat. From the data set, the value of the independent
variable (in this case, this is the age of the resale flat) ranges from 2 to 54 years. Thus the prediction
that can be arrived at using the regression line is only applicable for HDB flats whose age is between
2 and 54 years old. Outside this range, we should not use the regression line to make our prediction
as the best fit regression line may change outside this range. For example, we should not use the
regression line to predict the average resale price of flats that are 60 years old as our data set does
not contain any information on resale flats that are more than 54 years old.
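
Several of the points in this remark can be checked numerically. The Python sketch below is an illustration only: it uses the small data set from Example 3.3.3 rather than the HDB data, and the helper names `pearson_r` and `least_squares_line` are our own. The slope is computed from the relationship in point 3, and the intercept is chosen so that the line passes through the point of averages, as stated in point 1.

```python
import statistics


def pearson_r(xs, ys):
    """Correlation coefficient via standard units, dividing by n - 1."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs
    total = sum(
        ((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(xs, ys)
    )
    return total / (len(xs) - 1)


def least_squares_line(xs, ys):
    """Slope and intercept of the line used to predict ys from xs."""
    slope = pearson_r(xs, ys) * statistics.stdev(ys) / statistics.stdev(xs)
    # Intercept chosen so that the line passes through the point of averages.
    intercept = statistics.mean(ys) - slope * statistics.mean(xs)
    return slope, intercept


x = [9, 4, 5, 10, 6, 3, 7, 2, 8, 1]
y = [41, 17, 28, 50, 39, 26, 30, 6, 4, 10]

r = pearson_r(x, y)
m, b = least_squares_line(x, y)          # line for predicting y from x
m_rev, b_rev = least_squares_line(y, x)  # line for predicting x from y

# Rewrite the second line, x = m_rev * y + b_rev, in the form y = (slope) * x + ...
# so that the two slopes can be compared on the same axes.
slope_rev_as_y_on_x = 1 / m_rev

print(f"r = {r:.2f}")
print(f"predicting y from x: slope = {m:.2f}, intercept = {b:.2f}")
print(f"predicting x from y, rewritten with y as subject: slope = {slope_rev_as_y_on_x:.2f}")
```

The two slopes differ, which is why the line fitted to predict y from x should not be reused to predict x from y; the sign of the slope m also matches the sign of r even though the two numbers are not equal.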

Discussion 3.4.5 To conclude this section and also the chapter, we will describe a method to study
the relationship between two variables if the relationship is not linear. The following table shows part of
a data set that provides the total number of confirmed COVID-19 cases in South Africa since 5 March
2020.⁴
⁴ Data set can be downloaded from [Link]/sudalairajkumar/novel-corona-virus-2019-dataset.

t     Total confirmed cases

76    17200
77    18003
78    19137
79    20125
⋮     ⋮
95    48285
96    50879
⋮     ⋮

In this data set, t is the variable representing the number of days since 5 March 2020.
It can be computed using Microsoft Excel or other statistical software that the correlation coefficient
between the total number of confirmed cases and t is 0.812, which indicates that there is a strong
positive linear association between the two variables. Is this indeed the case? Perhaps we may make
such a conclusion but as stated earlier, correlation coefficient alone does not give the entire picture. We
should create a scatter plot using our bivariate data and verify if there is really a linear relationship.

Are the two variables associated linearly? It is quite clear visually that the total number of confirmed
cases increases exponentially when t increases. Thus, if we let y be the variable representing the total
number of confirmed cases, y and t are not linearly associated but instead the relationship between them
seems to be exponential. For such a situation, can we apply our linear regression technique to make
predictions on the total number of confirmed cases? The answer is yes, but it would have to be done
indirectly.
Now, if the relationship between y and t is indeed exponential in nature, we can model this relationship
using the equation
$$y = cb^t,$$
where c and b are some constants that we will determine. Using the properties of the logarithmic function,
we see that

$$y = cb^t \iff \ln y = \ln(cb^t) \iff \ln y = \ln c + t \ln b.$$

Thus, instead of making a scatter plot with y plotted against t, we will make a scatter plot with ln y
plotted against t. If there is indeed an exponential relationship between y and t, then we would expect
to see a linear relationship between ln y and t, as indicated by the equivalent equations above. Let us go
through the steps:

(a) Step 1: For each data point (t, y), compute (t, ln y). For our data set on COVID-19 cases in South
Africa, we have the following table:

t      Total confirmed cases (y)    ln(y)
76     17200                        9.753
77     18003                        9.798
78     19137                        9.859
79     20125                        9.910
...    ...                          ...
95     48285                        10.785
96     50879                        10.837
...    ...                          ...

We then plot ln y against t.

(b) Step 2: Find the linear regression line for ln y vs t. For our example, the regression line was found
to be
ln y = 4.287 + 0.066t.

This means that ln c = 4.287 and ln b = 0.066.

(c) Step 3: Since ln c = 4.287 and ln b = 0.066, we have

$$c = e^{4.287} \quad \text{and} \quad b = e^{0.066}.$$

We are now able to write down the exponential equation relating y and t:

$$y = cb^{t} = e^{4.287} e^{0.066t} = e^{4.287 + 0.066t}.$$
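
The three steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the computation used to obtain the figures above: it fits only the six rows visible in the table, so its intercept and slope will differ from 4.287 and 0.066, and the helper names `fit_line` and `predict` are hypothetical.

```python
import math

# Rows visible in the table: (t, total confirmed cases y).
# The full data set is not reproduced here, so the fitted values
# will not equal the 4.287 and 0.066 quoted in the text.
rows = [(76, 17200), (77, 18003), (78, 19137), (79, 20125),
        (95, 48285), (96, 50879)]

def fit_line(xs, ys):
    """Least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Step 1: transform each (t, y) into (t, ln y).
ts = [t for t, _ in rows]
log_ys = [math.log(y) for _, y in rows]

# Step 2: regress ln y on t; the intercept is ln c and the slope is ln b.
ln_c, ln_b = fit_line(ts, log_ys)

# Step 3: recover c and b, then predict with y = c * b**t.
c = math.exp(ln_c)
b = math.exp(ln_b)

def predict(t):
    """Predicted total confirmed cases on day t under y = c * b**t."""
    return c * b ** t

print(f"ln y = {ln_c:.3f} + {ln_b:.3f} t")
print(f"predicted y at t = 80: {predict(80):.0f}")
```

Predictions from such a fit are only sensible for values of t within the range of the data used to fit the line.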


Exercise 3

1. Outliers are observations that fall well above or below the overall bulk of the data. Consider a set
of 50 (univariate) data points with a single outlier. If the outlier is removed from the data
set, which of the following is/are always true? Select all that apply.

(A) The removal will cause the mean to decrease.

(B) The removal will cause the interquartile range to decrease.

(C) The removal will cause the standard deviation to decrease.

(D) The removal will cause the range to change.

2. The GEA1000 midterm results for the year 2050 Semester 1 are shown in the boxplot below. There
were 50 students who took the test, and the test scores are out of 100. No outliers were removed.

Which of the following can be derived from the boxplot? Select all that apply.

(A) There is at least one outlier.

(B) The range is 40.

(C) The interquartile range is 40.

(D) The standard deviation is 14.

3. In the scatter plot below, the dotted straight lines mark the average values of X and Y .

Which of the following statements is/are correct?

(I) The line Y = X cuts through the data points in half, with 50% of the data points on either
side of the line.
(II) The average of Y is larger than the average of X.

(A) Only (I).

(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

4. The five-number summary of a numerical variable with 47 values is:

Min Q1 Median Q3 Max

12.0 15.0 16.5 18.0 24.0

Which of the following statements must be true? Select all that apply.

(A) There are no outliers in the data.

(B) There is at least one low outlier in the data.
(C) There is at least one high outlier in the data.
(D) There are both low and high outliers in the data.

5. The five-number summary for a numerical variable X with 77 values is given as 57, 68, 70, 72, 77.
Define Y = 10 − 2X. What is the IQR of Y?

(A) −8.
(B) 2.
(C) 4.
(D) 8.

6. The boxplot below shows the distribution of the marks of 30 students.


Which of the following statements must be true?

(I) There is only one student who scored higher than 23.5 marks.
(II) The range of the marks of the 30 students is 17.5.

(A) Only (I).

(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

7. Professor X conducted a test for his class of 16 students, and tabulated the following five-number
summary for the test scores:

Minimum Q1 Median Q3 Maximum

41.20 45.00 50.75 54.12 58.90

Two days later, he discovered, to his horror, that he had made a mistake in the computation of
the test scores, and everyone should get 10 marks more.
The new (and correct) median score is (1) and the IQR is (2) .
Fill in the blanks for the statement above, give your answers correct to 2 decimal places.

8. In a Pokemon Bug-Catching Contest, Bug Catcher Aaron decides to record the dimensions of each
Pokemon he catches: height in metres and weight in kilograms. The results are summarised in the
following table.

Name Height(m) Weight(kg)

Beedrill 0.92 23.36
Scyther 1.43 52.86
Mothim 1.06 19.92
Whirlpede 1.36 74.74
Orbeetle 0.44 40.87

Which of the following visualisations correctly depict the information displayed in the table?
Select all that apply.

(A)

(B)

(D)

(E)

9. Consider the following data set, which we will refer to as set A:

{15, 23, 13, 17, 8, 42, 4, 37, 12, 16}.

A student decided to do a check for outliers, after which such value(s) was/were removed. Let us
designate the set of remaining data points as set B. Which of the following statements is/are true?
Select all that apply.

(A) The range of B is 19.

(B) The median of B is lower than the median of A.
(C) The median of B is greater than the mean of B.
(D) The median of B is lower than the mean of A.
(E) The standard deviation of B is lower than the standard deviation of A.

10. Suppose that the following are 10 data points for a numerical variable X:

150, 50, r, 8, 20, 1, 32, 70, 10, 5,


where r is an unknown number and r ≠ 150. Based on the definition of an outlier for a boxplot,
if 150 is the only outlier in this data set, the minimum possible value of r is

(A) 149.9.
(B) 45.5.
(C) 30.
(D) 14.5.

11. Which of the following is true about a non-zero correlation coefficient?

(A) The correlation coefficient changes when we add 3 to all the values of both variables.
(B) The correlation coefficient changes when we subtract 5 from all the values of one variable.
(C) The correlation coefficient changes when we multiply all the values of one variable by −1.
(D) The correlation coefficient changes when we interchange the two variables.

12. Which of the following statements is/are true?

(I) A correlation coefficient of 0 means that there is no linear association between the two
variables.
(II) A correlation coefficient of 0.8 indicates a weaker linear association than a correlation
coefficient of 0.7.

(A) Only (I).

(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

13. To find out if there is a relationship between exercising and intelligence, the researchers of a study
obtained the exercise duration and IQ scores of each participant. The data was plotted on a scatter
plot, with exercise duration on the x-axis and IQ scores on the y-axis. The correlation coefficient
was computed to be 0.8. From this information alone, what can you conclude?

(A) People should exercise more in order to improve IQ.

(B) The best fit regression line must have a positive slope.
(C) The intercept of the regression line must be positive.
(D) There is no relationship between exercise duration and IQ.

14. A scatter plot contains several hundred points. The horizontal and vertical axes represent variables
x and y respectively. Which of the following statements is/are always correct?

(I) If the correlation coefficient is 0.7, the gradient of the regression line will be 0.7.
(II) If the correlation coefficient is positive, the gradient of the regression line will be positive.
(III) Let r be the correlation coefficient obtained when we plot y against x. You may assume that
r is not equal to −1, 0 or 1. When we plot x against y, the correlation coefficient is 1/r.

(A) Only (I) and (II).

(B) Only (I).
(C) Only (II).
(D) Only (I) and (III).

15. Which of the following is/are always correct about the correlation coefficient r between two
numerical variables? Select all that apply.

(A) r has no unit of measurement.

(B) r will change if we add 3 to all the values of one variable.
(C) r will not change if we multiply all the values of one variable by 2.
(D) If r = 0, there is no association between the two variables.

16. A researcher examined the relationship between variables X and Y among 20 male subjects, and
he graphed a scatter plot as shown below. One of the lines in the graph (A, B, C or D) is the
actual best-fit regression line. Which one is it?

(A) Line A.
(B) Line B.
(C) Line C.
(D) Line D.

17. There are two primary six classes in a tuition center. Class A and Class B each have 100 students
and all students sat for a mathematics midterm test as well as a final examination. In Class A,
every student scores 1 point higher in the final examination than in the midterm. In Class B, every
student scores 1 point lower in the final examination than in the midterm. For the midterm test,
the average score is 50 and standard deviation is 20 for both Classes A and B. Suppose now Class
C is formed by combining all the students from Classes A and B. Which of the following statements
is/are correct?

(I) The correlation coefficient of Class C is smaller than the correlation coefficient of Class A.
(II) The correlation coefficient of Class C is larger than the correlation coefficient of Class B.

(A) Only (I).

(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

18. For the 40 students in a class, the results of their second English test are plotted against the results
of their first English test. It was found that for each student, the result of the second test is better
than that of the first test (i.e., the second test score is higher). Which of the following must be
true about the relationship between the students’ second test results and their first test results?

(I) If student A scores better than student B in the first test, then student A also scores better
than student B in the second test.
(II) The correlation between students’ second test results and first test results is positive.

(A) Only (I).

(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

19. There is a weak linear association between numerical variables X and Y , where X ranges from 0
to 5 (inclusive). Based on the data from X and Y , the regression line is given by the equation
Y = 0.25X + 2. Which of the following statements must be true? Select all that apply.

(A) We can obtain the exact value of Y when X = 4.

(B) The predicted average value of Y is 4 when X = 8.
(C) The correlation coefficient is 0.25.
(D) The correlation between X and Y is positive.

20. Dash is interested in finding out whether there is a correlation between the time spent playing computer
games (in hours) and a student's score (out of 100). After slicing the data according to gender, it
was found that the regression line for males is

score = 1.5 × hours spent gaming + 85

and the regression line for females is

score = 1.3 × hours spent gaming + 87.5.

Furthermore, Simpson’s Paradox is observed when the 2 subgroups are combined. Which of the
following is/are possible ranges of r, the correlation coefficient between the time spent gaming and
the students’ score when the subgroups are combined?

(A) 0 ≤ r ≤ 1.
(B) −1 ≤ r < 0.
(C) There is not enough information to guess the range of r.

21. There are 50 students in a class and the average (or mean) number of pens per student is 4. Not
all students in this class have the same number of pens. For a student with 4 pens, what is the
standard unit for the number of pens for this student?

(A) −1.
(B) 0.
(C) 1.
(D) The standard deviation is required to answer this question.

22. Jamus is interested in finding the association between the number of books read (R) by students,
and the number of books bought (B) within a month. He randomly sampled 1000 students from
University N, and constructed a regression line equation of R = 0.45B + 0.05, with a correlation
coefficient r of 0.87, and 0 ≤ B ≤ 10. Now, from the above information, we let D and C be the
mean daily number of books read and bought respectively. For simplicity, we let D = R/30 and
C = B/30. Which of the following statements must be true?

(A) The gradient of the regression line of D against C is 0.45.

(B) The intercept of the regression line of D against C is 0.05.
(C) The correlation coefficient r between D and C is different from 0.87.
(D) One can make a reliable prediction of the average daily number of books read for a student
who bought a book a day on average.

23. Which of the following gives the most likely correct values of the correlation coefficients obtained
from the four scatter plots? Here, r1 refers to the correlation coefficient of plot (1), r2 refers to
the correlation coefficient of plot (2) and so on.

(A) r1 = 0.8, r2 = 0.92, r3 = 0.45, r4 = 0.7.

(B) r1 = 0.06, r2 = 0.92, r3 = 0.9, r4 = 0.7.
(C) r1 = 0.06, r2 = 0.92, r3 = 0.45, r4 = 0.7.
(D) r1 = 0.06, r2 = 0.92, r3 = 0.45, r4 = 0.05.

24. A professor wants to know the percentage of right-handed students in NUS. Since he is teaching
a course in NUS this semester, he decides to do a survey in his class. From the single survey, he
concluded that eighty percent of students in NUS are right-handed. Which one of the following
fallacies was committed by the professor?

(A) Atomistic fallacy.

(B) Ecological fallacy.
(C) None of the other options.

25. There is a strong positive linear association between numerical variables X and Y , where X ranges
from 0 to 20. Based on the data from X and Y , the regression line for Y vs X is given by the
equation Y = 12X + 3. Which of the following statements must be true?

(A) We can obtain a valid prediction of the average value of Y when X = 22.
(B) The predicted average value of Y is 39 when X = 3.
(C) The correlation coefficient is 0.85.
(D) There is a deterministic relationship between X and Y .

26. The regression line for Y vs X is given by Y = 0.56X + 10. The range of the X values is between
10 and 30. Let U = 2X and V = X − 3. Which of the following statements must be true?
Select all that apply.

(A) The regression line for Y vs V is Y = 0.56V + 11.68.

(B) The regression line for Y vs U is Y = 0.28U + 10.
(C) The correlation coefficient between X and Y is 0.56.
(D) The gradient of the regression line for X vs Y is 1.79 (to 2 decimal places).

27. The following scatter plot has an outlier, which has both positive X and Y values. What will
happen to the correlation coefficient between X and Y if we remove the outlier from the plot?

(A) It will increase.

(B) It will decrease.
(C) It will remain the same.

28. There are at least two ways of measuring mass, namely kilograms (kgs) and pounds (lbs). Moreover,
the conversion between the two units of measurement is as follows:

1 kg = 2.20 lbs.

Within a data set, it is known that the correlation between numerical variables x and M is 0.72
where M is the mass, measured in kg. The equation of the regression line is M = 0.32x + 300 for
x ranging from 0 to 5 inclusive. Based on this information, if we change the scale of mass from
kilograms to pounds, which of the following statements is true?

(A) Both the correlation coefficient and gradient of the regression line remain unchanged.
(B) Only the correlation coefficient remains unchanged but the gradient of the regression line
changes.
(C) Only the gradient of the regression line remains unchanged but the correlation coefficient
changes.
(D) Both the correlation coefficient and the gradient of the regression line will change.

29. A team of researchers found that countries that mandated Bacillus Calmette-Guérin (BCG) vaccination
had lower COVID-19 death rates compared to countries that did not mandate BCG vaccination.
Based on this association alone, the researchers concluded that mandating BCG vaccinations causes
lower COVID-19 death rates. Which one of the following mistakes have they committed?

(A) Atomistic fallacy.

(B) Ecological fallacy.

(C) Confusing correlation and causation.

(D) None of the other options.

30. A researcher wishes to examine the association between the number of hours students engage in
gaming and their academic performance. He conducts his study in 15 schools, and calculates for
each school the average number of hours students spend gaming and the average academic score
of students. He notices that there is a correlation of 0.9 between the two sets of averages. He
concludes that

“The correlation between gaming hours and academic score for all students from the 15 schools is
0.9”.

What fallacy is the researcher committing here?

(A) Atomistic fallacy.

(B) Ecological fallacy.
