MODULE 3
Statistics and Machine Learning
In this module, we will see how some of the basic machine learning techniques using R could help us solve
data problems. Regarding machine learning, I would suggest reviewing the introductory part of the
previous chapter.
3.1 Basic Statistics
We will start with getting some descriptive statistics. Let us work with “[Link]” data, which you can
download from OA 6.4. This data contains 38 records of different people’s sizes in terms of height and
weight. Here is how we load it:
size = [Link]('[Link]',header=T,sep=',')
Once again, this assumes that the data is in the current directory. Alternatively, you can replace
“[Link]” with “[Link]()” to let you pick the file from your hard drive when you run this line. Also,
while you can run one line at a time on your console, you could type them and save as an “.r” file, so
that not only can you run line-by-line, but you can also store the script for future runs.
Either way, I am assuming that you have loaded the data. Now, we can ask R to give us some basic
statistics about it by running the summary command:
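A minimal sketch of that call follows. The data file is not reproduced here, so the three-row data frame size_demo is a hypothetical stand-in that only mimics the two columns ("Height" and "Weight"); with the real data loaded as above, the call would simply be summary(size).

```r
# Hypothetical stand-in for the loaded data (NOT the real 38 records)
size_demo <- data.frame(Height = c(62, 68, 74),
                        Weight = c(120, 150, 185))

# For each numeric column: minimum, quartiles, median, mean, and maximum
summary(size_demo)
```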
The output, as shown above, shows descriptive statistics for the two variables or columns we have here:
“Height” and “Weight.” We have seen such output before, so I will not bother with the details.
Let us visualize this data on a scatterplot. In the following line, “ylim” is for specifying minimum and
maximum values for the y-axis:
library(ggplot2)
ggplot(size, aes(x=Height,y=Weight)) + geom_point() + ylim(100,200)
3.2 Regression
Now that we have a scatterplot, we can start asking some questions. One straightforward question is:
What is the relationship between the two variables we just plotted? That is easy. With R, you can keep
the existing plotting information and just add a function to find a line that captures the relationship:
ggplot(size, aes(x=Height,y=Weight)) + geom_point() + stat_smooth(method="lm") +
ylim(100,200)
Compare this command to the one we used above for creating the plot in Figure 6.7. You will notice that
we kept all of it and simply added a segment that overlaid a line on top of the scatterplot. And that is
how easy it is to do basic linear regression in R, a form of supervised learning. Here, the “lm” method refers
to a linear model. The output is in Figure 6.8. You see that blue line? That is the regression line. It is also
a model that shows the connection between “Height” and “Weight” variables. What it means is that if
we know the value of “Height,” we could figure out the value of “Weight” anywhere on this line. Want
to see the line equation? Use the “lm” command to extract the coefficients:
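A minimal sketch of such a call follows. The five-row data frame size_demo is hypothetical stand-in data, so its coefficients will not match the ones quoted in the text; with the real data, the equivalent call would be lm(Weight ~ Height, data = size).

```r
# Hypothetical stand-in data (NOT the real 38 records)
size_demo <- data.frame(Height = c(62, 65, 68, 71, 74),
                        Weight = c(125, 140, 148, 162, 175))

# Fit Weight as a linear function of Height
model <- lm(Weight ~ Height, data = size_demo)

# Named vector: "(Intercept)" is the constant, "Height" is the slope
coef(model)
```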
You can see that the output contains coefficients for the independent or predictor variable (Height) and
the constant or intercept. The line equation becomes:
Weight=-130.354+4.113*Height
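The equation can be wrapped in a small helper for experimenting. The function name predict_weight and the sample heights are illustrative assumptions; only the two coefficients come from the text.

```r
# Line equation from the text: Weight = -130.354 + 4.113 * Height
predict_weight <- function(height) {
  -130.354 + 4.113 * height
}

# Estimated weights for a few example heights
predict_weight(c(60, 65, 70))
```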
Try plugging in different values of “Height” in this equation and see what values of “Weight” you get
and how close your predicted or estimated values are to reality. With linear regression, we managed to
fit a straight line through the data. But perhaps the relationship between “Height” and “Weight” is not
all that straight. So, let us remove that restriction of linear model:
ggplot(size, aes(x=Height,y=Weight)) + geom_point() + geom_smooth() +
ylim(100,200)
And here is the output (Figure 6.9). As you can see, our data fits a curved line better than a straight line.
Yes, the curved line fits the data better, and it may seem like a better idea than trying to draw a straight
line through this data. However, we may end up with the curse of overfitting and overlearning with a
curved shape for doing regression.
It means we were able to model the existing data really well, but in the process, we compromised so
much that we may not do so well for new data. Do not worry about this problem for now. We will come
back to these concepts in the machine learning chapters. For now, just accept that a line is a good idea
for doing regression, and whenever we talk about regression, we would implicitly mean linear
regression.
3.3 Classification
Let us start with classification using the KNN method. As you may recall, classification with KNN is an
example of supervised learning, where we have some training data with true labels, and we build a model
(classifier) that could then help us classify unseen data.
Before we use classification in R, let us make sure we have a library or package named “class” available
to us. You can find available packages in the “Packages” tab in RStudio (typically in the bottom-right
window where you also see the plots). If you see “class” there, make sure it is checked. If it is not there,
you need to install that package using the same method you did for the ggplot2 package.
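As a rough sketch of what KNN classification with the “class” package can look like, the example below uses R's built-in iris data. That dataset, the 100/50 split, the seed, and k = 3 are all assumptions chosen for illustration; they are not taken from the text.

```r
library(class)  # provides knn()

set.seed(42)                               # make the random split repeatable
train_idx <- sample(nrow(iris), 100)       # 100 rows for training, 50 for testing

train_x <- iris[train_idx, 1:4]            # four numeric features
test_x  <- iris[-train_idx, 1:4]
train_y <- iris$Species[train_idx]         # true labels for the training rows
test_y  <- iris$Species[-train_idx]

# Classify each test row by majority vote among its 3 nearest training rows
predicted <- knn(train = train_x, test = test_x, cl = train_y, k = 3)

mean(predicted == test_y)                  # fraction classified correctly
table(predicted, test_y)                   # confusion matrix
```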
3.4 Clustering
Now we will switch to the unsupervised learning branch of machine learning. Recall from Chapter 5
that this covers a class of problems where we do not have labels on our training data.
In other words, we do not have a way to know which data point should go to which class. Instead, we
are interested in somehow characterizing and explaining the data we encounter.
Perhaps there are some classes or patterns in them. Can we identify and explain these? Such a process
is often exploratory in nature. Clustering is the most widely used method for such exploration and we
will learn about it using a hands-on example.
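As one possible illustration, the sketch below runs k-means clustering on two columns of R's built-in iris data. The dataset, the choice of three clusters, and the seed are assumptions made for this example. The species labels are used only afterwards, to see how the discovered clusters line up with them.

```r
set.seed(42)                                   # k-means starts from random centers

features <- iris[, c("Petal.Length", "Petal.Width")]

# Group the rows into 3 clusters; nstart tries several random starts
clusters <- kmeans(features, centers = 3, nstart = 10)

clusters$centers                               # coordinates of each cluster center
table(clusters$cluster, iris$Species)          # clusters versus the held-back labels
```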
• Data frame: Data frame generally refers to “tabular” data, a data structure that represents cases
(represented by the rows), each of which consists of a number of observations or measurements
(represented by the columns). In R, it is a special case of a list in which each component is of equal length.
• Package: In R, packages are collections of functions and compiled code in a well-defined format.
• Library: In R, the directory where the packages are stored is called the library. Often,
“package” and “library” are used interchangeably.
• Integrated Development Environment (IDE): This is an application that contains various tools for
writing, compiling, debugging, and running a program. Examples include Eclipse, Spyder, and Visual
Studio.
• Correlation: This indicates how closely two variables are related and ranges from −1 (negatively
related) to +1 (positively related). A correlation of 0 indicates no relation between the variables.
• Linear regression: Linear regression is an approach to model the relationship between the outcome
variable and predictor variable(s) by fitting a linear equation to observed data.
• Machine learning: This is a field that explores the use of algorithms that can learn from the data and
use that knowledge to make predictions on data they have not seen before.
• Supervised learning: This is a branch of machine learning that includes problems where a model could
be built using the data and true labels or values.
• Unsupervised learning: This is a branch of machine learning that includes problems where we do not
have true labels for the data to train with. Instead, the goal is to somehow organize the data into some
meaningful clusters or densities.
• Predictor: A predictor variable is a variable that is being used to predict some other variable or
outcome. In an experiment, predictor variables are often independent variables, which are manipulated
by the researcher rather than just measured.
• Outcome or response: Outcome or response variables are in most cases the dependent variables, which
are observed and measured while the independent variables are changed.
MACHINE LEARNING FOR DATA SCIENCE
Machine learning is a very important part of doing data science, providing several crucial tools for
working on data problems. For instance, many data-intensive problems require us to do regression or
classification to develop decision-making insights.
This falls squarely within the machine learning realm. Then there are problems related to data mining
and data organization that call for various exploration techniques, such as clustering and density
estimation.
4.1 Machine Learning Introduction and Regression
So far, our work on data science problems has primarily involved applying statistical techniques to
analyze the data and derive some conclusions or insights. But there are times when it is not as simple as
that. Sometimes we want to learn something from that data and use that learning or knowledge to solve
not only the current problem but also future data problems.
We might want to look at shopping data at a grocery chain, combined with farming and poultry data,
and learn how supply and demand are related. This would enable us to make recommendations for
investments in both the grocery store and the food industries.
In addition, we want to keep updating the knowledge – often called a model – derived from analyzing
the data so far. Fortunately, there is a systematic way for tackling such data problems. In fact, we have
already seen this in the previous chapters: machine learning.
In this chapter, we will introduce machine learning with a few definitions and examples. Then, we will
look at a large class of problems in machine learning called regression. This is not the first time we have
encountered regression.
What Is Machine Learning?
Machine learning is a spin-off or a subset of artificial intelligence (AI), and in this book it is an
application of data science skills. Here, the goal, according to Arthur Samuel, is to give “computers the
ability to learn without being explicitly programmed.”
Tom Mitchell puts it more formally: “A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P if its performance at tasks in T, as
measured by P, improves with experience E.”
For a moment, let us forget about the machine and think about learning in general. The New Oxford
American Dictionary (third edition) defines “to learn” as: “To get knowledge of something by study,
experience, or being taught; to become aware by information or from observation; to commit to
memory; to be informed of or to ascertain; to receive instruction.”
These definitions are framed around knowledge and awareness, which are hard to test in a machine.
Therefore, it is important to come up with a new operational definition of learning in the context of the
machine, which we can formulate as:
Things learn when they change their behavior in a way that makes them perform better in the future.
This ties learning to performance rather than knowledge. You can test learning by observing present
behavior and comparing it with past behavior. This is a more objective kind of definition and is more
satisfactory for our purposes. Of course, the more comprehensive and formal definition based on this
idea is what we saw before by Tom Mitchell.
In the context of this definition, machine learning explores the use of algorithms that can learn from the
data and use that knowledge to make predictions on data they have not seen before – such algorithms
are designed to overcome strictly static program instructions by making data-driven predictions or
decisions through building a model from sample inputs.
Let us look at a few examples. The first is the heavily hyped, self-driving Google car (now rebranded as Waymo). As shown in Figure
4.1, this car is taking a real view of the road to recognize objects and patterns such as sky, road signs,
and moving vehicles in a different lane. This process itself is quite complicated for a machine to do. A
lot of things may look like a car (that blue blob in the bottom image is a car), and it may not be easy to
identify where a street sign is. The self-driving car needs not only to carry out such object recognition,
but also to make decisions about navigation. There is just so much unknown involved here that it is
impossible to come up with an algorithm (a set of instructions) for a car to execute. Instead, the car needs
to know the rules of driving, have the ability to do object and pattern recognition, and apply these to
making decisions in real time. In addition, it needs to keep improving. That is where machine learning
comes into play.
Fig: 4.1 Machine learning technology behind self-driving car. (Source: YouTube: Deep Learning:
Technology behind self-driving car.)
Fig: 4.2 Problem of optical character recognition.
Another classic example of machine learning is optical character recognition (OCR). Humans are good
at recognizing hand-written characters, but computers are not. Why? Because there are too many
variations in any one character that can be written, and there is no way we could teach a computer all
those variations. And then, of course, there may be noise – an unfinished character, joining with another
character, some unrelated stuff in the background, an angle at which the character is being read, etc. So,
once again, what we need is a basic set of rules that tells the computer what “A,” “a,” “5,” etc., look like,
and then have it make a decision based on pattern recognition. The way this happens is by showing several
versions of a character to the computer so it learns that character, just like a child will do through
repetitions, and then have it go through the recognition process (Figure 4.2).
Let us take an example that is perhaps more relevant to everyday life. If you have used any online services,
chances are you have come across recommendations. Take, for instance, services such as Amazon and
Netflix. How do they know what products to recommend? We understand that they are monitoring our
activities, that they have our past records, and that is how they are able to give us suggestions. But how
exactly? They use something called collaborative filtering (CF). This is a method that uses your past
behavior and compares its similarities with the behaviors of other users in that community to figure out
what you may like in the future.
Take a look at Table 8.1. Here, there are data about four people’s ratings for different movies. And the
objective for a system here is to figure out if Person 5 will like a movie or not based on that data as well
as her own movie likings from the past. In other words, it is trying to learn what kinds of things Person 5
likes (and dislikes), what others similar to Person 5 like, and uses that knowledge to make new
recommendations. On top of that, as Person 5 accepts or rejects its recommendations, the system extends
its learning to include knowledge about how Person 5 responds to its suggestions, and further corrects its
models.
Table 8.1 Machine learning-based collaborative filtering for movie recommendation.
Ratings by movie name:
            Sherlock   Avengers   Titanic   La La Land   Wall-E
Person 1    4          5          3         4            2
Person 2    3          2          3         4            4
Person 3    4          3          4         5            3
Person 4    3          4          4         5            2
Person 5    4          ?          4         ?            4
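To make the idea concrete, here is a minimal sketch of one simple user-based form of collaborative filtering applied to the ratings in Table 8.1. The specific choices (cosine similarity over the movies Person 5 has rated, then a similarity-weighted average of the other people's ratings) are assumptions made for illustration; production recommenders are considerably more elaborate.

```r
# Ratings from Table 8.1; NA marks the two ratings we want to estimate
ratings <- rbind(
  "Person 1" = c(4, 5, 3, 4, 2),
  "Person 2" = c(3, 2, 3, 4, 4),
  "Person 3" = c(4, 3, 4, 5, 3),
  "Person 4" = c(3, 4, 4, 5, 2),
  "Person 5" = c(4, NA, 4, NA, 4)
)
colnames(ratings) <- c("Sherlock", "Avengers", "Titanic", "La La Land", "Wall-E")

target <- "Person 5"
others <- setdiff(rownames(ratings), target)
seen   <- !is.na(ratings[target, ])          # movies Person 5 has already rated

# Cosine similarity between two rating vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# How similar is each other person to Person 5 on the co-rated movies?
sims <- sapply(others, function(p) cosine(ratings[target, seen], ratings[p, seen]))

# Estimate each missing rating as a similarity-weighted average of others' ratings
missing   <- colnames(ratings)[!seen]
predicted <- sapply(missing, function(m) sum(sims * ratings[others, m]) / sum(sims))

sims
predicted
```

Because the weights are all positive, each estimate necessarily falls between the lowest and highest rating that the other four people gave that movie.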
So, are you convinced that machine learning is a very important field of study? If the answer is “yes”
and you are wondering what it takes to create a good machine learning system, then the following list of
criteria from SAS may help:
a. Data preparation capabilities.
b. Algorithms – basic and advanced.
c. Automation and iterative processes.
d. Scalability.
e. Ensemble modeling.
In this chapter, we will primarily focus on the second criterion: algorithms. More specifically, we will
see some of the most important techniques and algorithms for developing machine learning applications.
We will note here that, in most cases, the application of machine learning is entwined with the
application of statistical analysis. Therefore, it is important to remember the differences in the
nomenclature of these two fields.
• In machine learning, a target is called a label.
• In statistics, a target is called a dependent variable.
• A variable in statistics is called a feature in machine learning.
• A transformation in statistics is called feature creation in machine learning.
Machine learning algorithms are organized into a taxonomy, based on the desired outcome of the
algorithm. Common algorithm types include:
• Supervised learning. When we know the labels on the training examples we are using to learn.
• Unsupervised learning. When we do not know the labels (or even the number of labels or
classes) from the training examples we are using for learning.
• Reinforcement learning. When we want to provide feedback to the system based on how it
performs with training examples.
4.2 Regression
Our first step is regression. Think about it as a much more sophisticated version of extrapolation. For
example, if we know the relationship between education and income (the more someone is educated,
the more money they make), we could predict someone’s income based on their education. Simply
speaking, learning such a relationship is regression.
In more technical terms, regression is concerned with modeling the relationship between variables of
interest. The model is refined iteratively, using some measure of error in its predictions. In other
words, regression is a process.
We can learn about two variables relating in some way (e.g., correlation), but if there is a relationship of
some kind, can we figure out if or how one variable could predict the other? Linear regression allows us
to do that. Specifically, we want to see how a variable X affects a variable y. Here, X is called the
independent variable or predictor; y is called the dependent variable or response. Take a note of the
notation here. The X is in uppercase because it could have multiple feature vectors, making it a feature
matrix. If we are dealing with only a single feature for X, we may decide to use the lowercase x. On the
other hand, y is in lowercase because it is a single value or feature being predicted.
Fig: 8.3 An example showing a relationship between annual return and excess return of stock using linear
regression from the stock portfolio dataset.
As mentioned previously, linear regression fits a line (or plane, or hyperplane) to the dataset. For
example, in Figure 8.3, we want to predict the annual return using excess return of stock in a stock
portfolio. The line represents the relation between these two variables. Here, it happens to be quite linear
(see most of the data points close to the line), but such is not always the case.
Some of the most popular regression algorithms are:
• Ordinary least squares regression (OLSR)
• Linear regression
• Logistic regression
• Stepwise regression
• Multivariate adaptive regression splines (MARS)
• Locally estimated scatterplot smoothing (LOESS)
Hands-On Example 8.1: Linear Regression
Before we move on to the gradient descent technique, let us see how we could solve linear regression in
the way described above. Below, in Table 8.2, is [Link], a completely made up dataset (you can
download it from OA 8.1). The attribute x is the input variable and y is the output variable that we are
trying to predict.
If we get more data (test set), we will only have x values and we would be interested in predicting y
values. To solve the problem of predicting y from x, we will start with a simple scatterplot of x versus y as
shown in Figure 8.4.
How did we create this plot? Think about your R training. First, we load the data using the following
command.
Table 8.2 Data for regression.
x y
1 3
2 4
3 8
4 4
5 6
6 9
7 8
8 12
9 15
10 26
11 35
12 40
13 45
14 54
15 49
16 59
17 60
18 62
19 63
20 68
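The steps described above can be sketched in R as follows. Since the data file name is not reproduced here, this sketch types the values of Table 8.2 directly into a data frame. The name regression and the use of base plot() rather than ggplot2 are illustrative choices.

```r
# Values copied from Table 8.2
regression <- data.frame(
  x = 1:20,
  y = c(3, 4, 8, 4, 6, 9, 8, 12, 15, 26,
        35, 40, 45, 54, 49, 59, 60, 62, 63, 68)
)

# Scatterplot of x versus y
plot(regression$x, regression$y, xlab = "x", ylab = "y")

# Fit y as a linear function of x and overlay the fitted line
model <- lm(y ~ x, data = regression)
abline(model)

# Slope is about 3.98 and intercept about -10.26 for this data
coef(model)
```

With new x values in hand, predict(model, newdata = data.frame(x = c(21, 22))) would give the corresponding estimates of y.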
4.3 Gradient Descent
It is possible to fit multiple lines to the same dataset, each represented by the same equation, y = mx + b, but with
different m and b values. Our job is to find the best one, which will represent the dataset better than the
other lines. In other words, we need to find the best set of m and b values.
A standard approach to solving this problem is to define an error function (sometimes also known as a
cost function) that measures how good a given line is. This function will take in a (m, b) pair and return
an error value based on how well the line fits our data. To compute this error for a given line, we will
iterate through each (x, y) point in our dataset and sum the square distances between each point’s y value
and the candidate line’s y value (computed at mx + b).
Formally, this error function looks like:
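A form consistent with the description above, for n data points (x_i, y_i), is given below. The symbol E and the averaging factor 1/n are assumptions made here; a plain sum of squared distances would be minimized by the same m and b.

```latex
\[
E(m, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2
\]
```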
We have squared the distance to ensure that it is positive and to make our error function differentiable.
Note that normally we will use m to indicate number of data points, but here we are using that letter to
indicate the slope, so we have made an exception and used n. Also note that often the intercept for a line
equation is represented using c instead of the b we have used here.
The error function is defined in such a way that the lines that fit our data better will result in lower error
values. If we minimize this function, we will get the best line for our data. Since our error function
consists of two parameters (m and b), we can visualize it as a 3D surface. Figure 8.7 depicts what it
looks like for our dataset.
Each point in this 3D space represents a line. Let that sink in for a bit. Each point in this 3D figure
represents a line. Can you see how? We have three dimensions: slope (m), y-intercept (b), and error.
Each point has values for these three, and that is what gives us the line (technically, just the m and the
b). In other words, this 3D figure presents a whole bunch of possible lines we could have to fit the data
shown in Table 8.2, allowing us to see which line is the best.
The height of the function at each point is the error value for that line. You can see that some lines yield
smaller error values than others (i.e., fit our data better).
A darker blue color indicates a lower error function value, and hence a line that fits our data better. We can
find the best m and b set that will minimize the cost function using gradient descent.
Gradient descent is an approach for looking for minima – points where the error is at its lowest. When
we run a gradient descent search, we start from some location on this surface and move downhill to find
the line with the lowest error.
To run gradient descent on this error function, we first need to compute its gradient or slope. The gradient
will act like a compass: it points uphill, so moving against it always takes us downhill. To compute it, we will need to differentiate our error
function. Since our function is defined by two parameters (m and b), we will need to compute a partial
derivative for each. These derivatives work out to be:
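Assuming the error function is the mean of the squared distances over n points, E(m, b) = (1/n) Σ (y_i − (m x_i + b))², where the symbol E and the 1/n factor are assumptions, applying the chain rule to each parameter gives:

```latex
\[
\frac{\partial E}{\partial m} = \frac{2}{n} \sum_{i=1}^{n} -x_i \bigl( y_i - (m x_i + b) \bigr)
\]
\[
\frac{\partial E}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} -\bigl( y_i - (m x_i + b) \bigr)
\]
```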
Now we know how to run gradient descent and get the smallest error. We can initialize our search to
start at any pair of m and b values (i.e., any line) and let the gradient descent algorithm march downhill
on our error function toward the best line. Each iteration will update m and b to a line that yields slightly
lower error than the previous iteration.
The direction to move in for each iteration is calculated using the two partial derivatives from the above
two equations.
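The procedure can be sketched in R as follows, using the data of Table 8.2. This is a minimal illustration that assumes the mean squared error form of the error function. The function name gradient_descent, the starting point (0, 0), the step size 0.005, and the iteration count are illustrative choices, not values from the text.

```r
# Data from Table 8.2
x <- 1:20
y <- c(3, 4, 8, 4, 6, 9, 8, 12, 15, 26,
       35, 40, 45, 54, 49, 59, 60, 62, 63, 68)

gradient_descent <- function(x, y, alpha = 0.005, iterations = 20000) {
  m <- 0                      # starting slope (arbitrary)
  b <- 0                      # starting intercept (arbitrary)
  n <- length(x)
  for (i in seq_len(iterations)) {
    residual <- y - (m * x + b)
    grad_m <- (2 / n) * sum(-x * residual)   # partial derivative w.r.t. m
    grad_b <- (2 / n) * sum(-residual)       # partial derivative w.r.t. b
    m <- m - alpha * grad_m                  # step against the gradient
    b <- b - alpha * grad_b
  }
  c(m = m, b = b)
}

fit <- gradient_descent(x, y)
fit               # gradient descent estimate of slope and intercept
coef(lm(y ~ x))   # closed-form least squares fit, for comparison
```

The two results should agree closely, since both minimize the same squared-error criterion. The step size has to be small here because the x values are large; a noticeably larger step makes the updates overshoot.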
Let us now generalize this. In the above example, m and b were the parameters we were trying to estimate.
But there could be many parameters in a problem, depending on the dimensionality of the data or the
number of features available. We will refer to these parameters as θ values, and it is the job of the
learning algorithm to estimate the best possible values of the θ.
Earlier we defined an error function using a model built with two parameters (mx_i + b). Now, let us
generalize it. Imagine that we have a model that could have any number of parameters. Since this model
is built using training examples, we would call it a hypothesis function and represent it using h. It can
be defined as
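A standard way to write such a hypothesis function, assuming n features x_1, …, x_n plus a constant input x_0, is:

```latex
\[
h_\theta(x) = \sum_{i=0}^{n} \theta_i x_i = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n
\]
```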
If we consider θ0 = b, θ1 = m, and assign x0 = 1, we can derive our line equation using the above hypothesis
function. In other words, a line equation is a special case of this function.
Now, just as we defined the error function using the line equation, we could define a cost function using
the above hypothesis function as in the following:
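A form consistent with the surrounding description, assuming m training examples (x^{(i)}, y^{(i)}) and a scaling factor of 1/2, is:

```latex
\[
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^2
\]
```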
Compare this to the error function defined earlier. Yes, we are now back to using m to represent number
of samples or data points. And we have also added a scaling factor of ½ in the mix, which is purely
out of convenience, as you will see soon.
And just as we did before, finding the best values for our parameters means chasing the slope for each
of them and trying to reach as low cost as possible. In other words, we are trying to minimize J(θ) and
we will do that by following its slope along each parameter. Let us say we are doing this for parameter
θj. That means we will take the partial derivative of J(θ) with respect to θj.
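Assuming a squared-error cost with the 1/2 scaling described above, J(θ) = (1/2m) Σ (h_θ(x^{(i)}) − y^{(i)})², and a hypothesis that is linear in θ, the partial derivative and the resulting update rule can be written as follows. The factor of 2 from differentiating the square cancels the 1/2, which is the convenience mentioned earlier.

```latex
\[
\frac{\partial J(\theta)}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr) \, x_j^{(i)}
\]
\[
\theta_j := \theta_j - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j}
\]
```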
This means we update θj (override its existing value) by subtracting a weighted slope or gradient from
it. In other words, we take a step against the slope, that is, downhill. Here, α is the learning rate, with value
between 0 and 1, which controls how large a step we take downhill during each iteration. If we take
too large a step, we may step over the minimum. However, if we take small steps, it will require many
iterations to arrive at the minimum.
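The effect of the learning rate can be seen in a small R sketch of the generalized update, again on the data of Table 8.2. It assumes a squared-error cost with the 1/2 scaling and a linear hypothesis with x0 = 1. The function names cost and descend and the two α values are illustrative choices.

```r
# Data from Table 8.2, with a column of ones so that theta[1] plays the role of b
x <- 1:20
y <- c(3, 4, 8, 4, 6, 9, 8, 12, 15, 26,
       35, 40, 45, 54, 49, 59, 60, 62, 63, 68)
X <- cbind(1, x)

# Cost J(theta): half the mean squared error of the linear hypothesis
cost <- function(theta) sum((X %*% theta - y)^2) / (2 * length(y))

# Batch gradient descent: update all theta values together on each iteration
descend <- function(alpha, iterations) {
  theta <- c(0, 0)
  for (i in seq_len(iterations)) {
    gradient <- as.vector(t(X) %*% (X %*% theta - y)) / length(y)
    theta <- theta - alpha * gradient
  }
  theta
}

theta_ok  <- descend(alpha = 0.01, iterations = 20000)  # small enough step
theta_big <- descend(alpha = 0.02, iterations = 50)     # step too large

cost(c(0, 0))    # cost at the starting point
cost(theta_ok)   # much lower: the search has settled near the minimum
cost(theta_big)  # higher than at the start: each step overshoots further
```

For this data, doubling α from 0.01 to 0.02 is enough to turn a converging search into a diverging one, which is why the learning rate usually needs some tuning.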