Module 3 Chapter 3
Module 3 Chapter 3
iteMo
Anuttöt
Basics of
Learning Theory
"Predicting the future isn't magic, it's artificial intelligence."
- Dave Waters
Learning Objectives
Understand the basics of learning theory and the key elements of
machine learning
Introduce Concept learning to obtain an abstraction and
Learn about hypothesis representation language arnd to generalization
from the data
thesis space using Find-S algorithm and its limitations explore searching the hvpo
Study about version spaces,
ation algorithm List-Then-Eliminate algorithm and Candidate Elimin
Introduce Inductive bias, aset of prior assumptions
considered by alearning algorithm
beyond the training data in order to perform induction
Discuss the tradeoff between the two factors
called bias and variance which evist
when modelling a machine learning
algorithm
Understand different challenges in model selection and
the
model evaluation
Study popular re-sampling model selection
LOOCV, etc.) to tune machine learning modelmethod called Cross-Validation (A-told.
Introduce the various learning frameworks and learn to
evaluate hvpothesis
INTRODUCTION TOLEARNINGAND ITS TYPES
3.1
through study, experience,
The
The process of acquiring knowledge and expertise
Generally, humans learn in different ways. To make
called as learning. machines bei
learn,
ng t
we
augh,
ne d
simulate the strategiesof human learning in machines. But, will the computers learn?
and logicians,
This
questi
has been raised over manycenturies by philosophers, mathematicians
First let us address the question-What sort of tasks can the computers learn? This depends
the nature of problems that the computers can solve. There are two kinds of problems -
and ill-posed. Computers can solve only well-posed problems, as these have
Specifications and have the following components inherent to it.
-wwelelll--dpefosin,
1. Class of learning tasks (T)
2. A measure of performance (P)
3. A source of experience (E)
The standard definition of learning proposed by Tom
from Efor the task T, andPimproves with experience E. LetMitchell is that a program can lear
us formalize the concept of learning
as follows:
Let x be the input and Xbe the input space,
space, which is the set of all possible outputs, thatwhich
is,
is the set of all inputs, and Y is the
output
Let Dbe the input dataset with yes/no.
Let the unknown target examples, (x,,y,).(x, y,),*.(x, y,) for n inputs.
The objective of the learning function isbe f: X ’ Y, that maps the input space to output spae
program to pick a function, g: 2 ’ Yto
All the possible formulae form a hypothesis space. In
which the learning algorithm chooses. short, let H be theapproximate hypothesis
set of all formulae from
samples. This is shown in Figure 3.1. The choice is good when the hypothesisg replicates ffor all
Ideal target
hypothesis f)
Training
samples
Learning Generated
algorithms hypothesis g(x)
Candidate Error
formulae Difference}
Hypothesis
space H
Let us assume a problem of predicting a label for a given input data. Let Dbe the input dataset
with bothpositive and negative examples. Let ybe the output with clas 0or 1. The simple learning
model can be given as:
2x,w, > Threshold, belongs toclass land
i=l
D
where, ,,x,",x,are the components of the input vector, ,w,...w, are the weights and +1
d
and -1 represent the class. This simple model is called perception model. One can simplify this by
making w, =band fixing it as 1, then the model can further be simplified as:
H(x) = sign(w' x).
This is called perception learning algorithm. The formal learning modelsare later discussed in
Section 3.7 of this chapter.
Learning Types
There are different types of learning. Some of the different learning methods are as follows:
1. Learn by memorization or learn by repetition also called as rote learning is done by
memorizing without understanding the logic or concept. Although rote learning is
basically learning by repetition, in machine learning perspective, the learning ocurs by
simply comparing with the existing knowledge for the same input data and producing the
output if present.
en
2. Learn byexamples also called as learn by experience or previous knowledge acquired
at some time, is like finding an analogy, which means pertorming inductive learning
from observations that formulate a general concept. Here, the learner learns by inferring a
general rule fromn the set of observations or examples. Therefore, inductive learning is also
called as discovery learning.
80 Machine Learning
expert or a teacher, generally called as passive learmi
3. Learn by being taught by an learning called active learning where the can learner
However, there is a special kind of data instances with the desired
interactively query a teacher/expert to label unlabelled
outputs.
critical thinking, also called as deductive learning, deduces new facte
4. Learning by
conclusion from related knownfacts and informatiorn.
Training Experience
Let us consider designing of a chess game. In direct experience, individual board states and correct
moves of the chess game are given directly. In indirect systerm, the move sequences and results are
only given. The training experience also depends on the presence of a supervisor who can label all
valid moves for a board state. In the absence of a supervisor, the game agent plays against itself and
learns the good moves, if the training samples cover all scenarios, or in other words, distributed
enough for performance computation. If the training samples and testing samples have the same
distribution, the results would be good.
Formally, Concept learning is defined as-"Given aset of hypotheses, the learner searches
through the hypothesis space to identify the best hypothesis that matches the target concept':.
Consider the following set of training instances shown in Table 3.1.
Table 3.1: Sample Training Instances
[Link]. Horns Tail Tusks Paws Fur Color Hooves Size Elephant
1. No Short Yes No No Black No B1g Yes
Yes Short No No No Brown Yes Mediumn No
4. No
Long No Yes Yes White No Medium No
5 No Short Yes Yes Yes Black No Big Yes
Here, in this set of training instances, the independent attributes considered are 'Horns', Tail',
Tusks',Paws', 'Fur', 'Color, Hooves' and 'Size'. The dependent attribute is 'Elephant'. The target
concept is to identify the animal to be an Elepharnt.
Let us now take this example and understand further the concept of hypothesis.
Target Concept: Predict the type of animal - For exampleElephant'.
Example 3.1: Explain Concept Learning Task of an Elephant from the dataset given in Table
Given,
Input: 5 instances each with 8 attributes
Target concept/function 'c: Elephant ’ (Yes, No}
Hypotheses H: Set of hypothesis each with conjunctions of literals as propositions [ie. ea
literal is represented as an attribute-value pair]
Solution: The hypothesis 'h' for the concept learning task of an Elephant is given as:
h= <No Short Yes ? ? Black No ?>
This hypothesis h is expressed in propositional logic form as below:
(Horns = No) a (Tail= Short) a (Tusks =Yes) a (Paws=?) A(Fur=)a (Color = Black) A(Hne
No) a (Size =?)
Output: Learn the hypothesis 'h to predict an 'Elephant' such that for agiven test instance
h(x) = c(r)
This hypothesis produced is also called as concept description which is a model that can
used to classify subsequent instances.
Thus, concept learning can also be called as Inductive Learning that tries to induce aget
function from specifictraining instances. This way of learning ahypothesis that can produe
approximate target function with a sufficiently large set of training instances Can also appr
ca
mately classify other unobserved instances and is called as inductive learning hypothesis, We
targt
only determine an approximate target function because it is very difficult to find an exact
approximate targe
function with the observed training instances. That is why a hypothesis is an
function that best maps the inputs to outputs.
85 as algorithm
besthypothesis
different
example, reduced
by classification. more set
hvpotheses
negative [Link]
cost in
function decision called of =384 including
defined or set empty solution
the a
Theory function is possible x2x2) or or
a instances oneas efficientfrom a function
be learning
target larger
a
in For whereas further when the to abe
Learning x3 instance
containing
distinct space
can space
variables. the representing ordered hypothesis/solution
target
machinea
thefunction be for following
training 2 hypotheses need
can
heuristic
represents x hypothesis
of
Basics approximates the hypothesis functioncan used 2x 8,749
every that
output algorithm are x2 hypothesis we hypotheses. are
targeta describe
space,
the
all-observed 2
=
classifies hernce hypotheses
hypothesis
linear that has distinct
x dataset. 1) hypothesis given
language
theto (2 x3+ the
the best hypotheses
3.1 are and searching optimized
a
that hypothesisrepresent a
variables
as
learning x4) anyit x3
there =81,920
x5x4
training of on possible
of would with Table is, x4onelarger
However, set the
approximations space
hypotheses hypothesis
representation tree. based
that
the attribute, x3and the wherein restrict
thethat would input a generated
aby consistent
hypothesis
as
only the instances; much
from an a
hypothesis/soluti
in >x3 finds generate
in hypothesis space the given in attribute. attribute to
order
hypotheses the White the instances x3 is hypothesis important that
possible algorithnm represents space
possible maps hypothesis isthat attribute of x4 of 3
Search will
the be each x4 the x the in
one
general
Black, set (3 strategy
of possible thatrepresents can space Big for 5 be of hypothesis methods
Spaceofall all set
3.4.2
Hypothesis learning space No
Short
No No
No
Brown, Medium,all 4
the 4 of empty
x each will each
best also
a the that
function the
hypothesis No Yes, values x the is Space the
search
set of this Generally,
best representsbias. Version
of
Yes, Long,Yes, Yes, (4
covering for the therefor ordering search
hypotheses generate thefor most improving
the set From machinealgorithm language each -Horns - - Yes, - -Hooves these
the -
Tail TusksPaws- Color - ol represents
'?"
Therefore, search Heuristic
a
is the space. Fur Size [?, only Thus, is Heuristic
spaewords, determine the of [Link], instances
Considering values the
Hypothesis search
[Link] algorithm
of a subset
set
The can to to
algorithms iteratively
Every regression including
instances. one
Hypothesis hypothesis specifyingVersion we symbols
other would manner The For values. distinct
So,
more instance., specific3.4.3Heuristicmeasure.
thespace.
tree two by of by
In fit a as
tested
[Link] the
86 Machine Learning hypothesis will be
initial [Link] hypothesis is a targu
the hypothesis space or a path [Link] tested
itis guaranteedto then
function or the goal state to see if it is a increases the efficiency because tough
solution, find
a
it will be selected. This method [Link] is useful for solving heuristic problems
whid
better hypothesis but may not bethe bestThe typical example problem solved by is search
could not solved by any other method.
the travelling salesman problem. climbing methods,
search methods are hill
Several commonly used heuristicsimulated-annealing, A* algorithm, andIgenetic algorithm
Constrairt
satisfaction problems, best-first search,
3.4.4 Generalization and Specialization
this concept hierarchy, let us apply this genei
In order to understand about how we construct
generalization of the most specific hypothes
principle of generalization/specialization relation. By the hypothesis space can be searched for
and by specialization of the most general hypothesis,
but does not match any negate
an approximate hypothesis that matches all positive instances
instance.
Now, after observing all the positive instances, an approximate hypothesis h5 is generated
which can now classify any subsequent positive instance to true.
larg: Specialization - General to Specific Learning This learning methodology will search through
the hypothesis space for an approximate hypothesis by specializing the most general hypothesis.
Example 3.3: lMlustrate learning by Specialization - General to Specific Learning for the data
instances shown in Table 3.1.
Solution: Start from the most general hypothesis which willmake true allpositive and negative
instances.
Initially,
h=<? ? ? ?>
roug
his more general to classify all instances to true.
hesis
I1: No Short Yes No No Black No Big Yes (Positive instance)
hl =<? 2 ?
<? ? No
<? ? ? Big
h2 imposes constraints so that it willnot classity a negative instance to true.
I3: No Short Yes No No Black No Mediunn Yes (Positive instance)
h3 = <No ? ?
tame
Yes .
88 Machine Learning
Black ? ?>
<? ?
No ?>
<? ?
Big>
<? ?
White No Medium No (Negative instance)
14: No No Yes Yes
Long ? ?>
h4 = <? 2 Yes 2
? ?>
<? Black
<? ? Big> ?
this negative instance.
Kemove any hypothesis inconsistent with
15: No Short Yes
Yes Yes Black No Big Yes (Positive instance)
? ?>
h5 = <? Yes
<? ? Black ?>
<? 2 ? ? ? Big>
Thus, h5 is the hypothesis space generated which will classify the positive instances to true and
negative instances to false.
Solution:
Step 1: Initialize 'h' to the most specific hypothesis. There are 6 attributes, so for each attribute,
we initially fill'ã inthe initial hypothesis ht.
h=<0
Step 2: Generalize the initial hypothesis for the first positive instance. I1 is a positive instance,
so generalize the most specific hypothesis 'h to include this positive instance. Hence,
I1: 29 Yes Excellent Good Fast Yes Positive instance
h=<9 Yes Excellent Good Fast Yes>
Step 3: Scan the next instance I2, since 2 is a positive instance. Generalize 'h' to include positive
instance 12. For each of the non-matching attribute value in h' put a ? to include this positive
instance. The third attribute value ismismatching in 'h with I2, so put a '?.
I2: 29 Yes Good Good Fast Yes Positive instance
h=<9 Yes Good Fast Yes>
Now,scan [Link] it is a negative instance, ignore it. Hence, the hypothesis remains the same
without any changeafter scanning I3.
I3: 28 No Good Good Fast No Negative instance
h=<9 Yes Good Fast Yes>
Now scan 14. Since it is a positive instance, check for mismatch in the
The 5th and 6th attribute value are mismatching, so add '? to those hypothesis 'h' with 14.
attributes in 'h'.
14: 9 Yes Good Good Slow No Positive instance
h=<9 Yes 2 Good ?>
Now, the final hypothesis generated with Find-S algorithm is:
h=<9 Yes Good ?>
It includes all positive instances and obviously ignores any negative instance.
List-Then-Eliminate Algorithm
The principle idea of this learning algorithm is to initialize the version space to contain all
and then eliminate any hypothesis that is found inconsistent with any hypotheses
the algorithm starts with a version space to contain all training instances. Initially
The hypotheses that are inconsistent with the hypotheses scanning each training instane
training
outputs the list of remaining hypotheses that are all instance are eliminated. Finally, the algorithm
consistent.
Algorithm 3.2: List-Then-Eliminate
Input: Version Space - a list of all
Output: Set of consistent hypotheses hypotheses
1. Initialize the
version space with alist of
2. For each training instance, hypotheses,
remove from version space any
hypothesis that is inconsistent.
This algorithm works fine
deploy this algorithm. Hence,, aif the
algorithm. variationhypotof this
hesisideaspaceis is finite but diticult '
Using the Candidate Elimination algorithm, we can compute the version space containing
all(and only those) hypotheses from H that are consistent with the given observed sequence of
training instances. The algorithm defines two boundariescalled 'general boundary which is a set of
all hypotheses that are the most general and 'specific boundary' which is a set of all hypotheses that
are the most specific. Thus, the algorithm limits the version space to contain only those hypotheses
that are most general and most specific. Thus, it provides a compact representation of List-thern
algorithm.
Example 3.4: Consider the same set of instances from the training dataset shown in Table 3.3
and generate version space as consistent hypothesis.
Solution:
Step 1: Initialize 'G' boundary to the maximally general
G=<? ? ?
hypotheses,
?>
Step 2: Initialize 'S boundary to the
each attribute, we initially fill '¢ in themaximally specific hypothesis. There are 6 attributes, so or
S= <0
hypothesis 'S.
Generalize the initial hypothesis for the
so generalize the most specific hypothesis 'S' to first positive instance. I1 is a positive instane
11: >9 Yes
Excellent Good
include this positive instance. Hence,
S1 =<9 Yes Fast Yes
G1 = <?
Excellent Good Fast
Positive instance
Yes>
?
?
Step 3: ?>
Iteration 1
Scan the next instance 12.
instance /2. For each of Since I2 is a
the positive instance,
instance. The third attribute non-mat
value isching attribute value in 'S1'.
generalize positive
'S1' to include Positivw
12:
S2 =<9
29 Yes
Yes
Good
?
miGoodsmatchingFastin 'S1' with I2, so include tni
Duta'?' to
put a '?'.
Prune G1 to
exclude all Good Fast Yes Positive instance
consistent
G2 = <?
withthis positive
?
iinstnconsiste nt hy Yes>
ance, there is nopotheses with the Since GI `
?
change. The resulpositive
?>
ting G2 instance.
is,
Basics of Learning Theory 93
Iteration 2
Now Scan 3,
No Good Good Fast No Negative instance
Since it is anegative instance, specialize G2 to exclude the negative example but stay
consistent with S. Generate hvpothesis for each of the non-matching attribute value in S2
and fill with the attribute value of S2. In those generated hypotheses, for all
matching attribute
values, put a ?. The first, second and 6th attribute values do not match, hence '3 hypotheses
are generated in G3.
There is no inconsistent hypothesis in S2 with the negative instance, hence S3
same.
remains the
G3= <9 ?
<? Yes ? ?
<? ? Yes>
S3 =<>9 Yes Good Fast Yes>
Iteration 3
Now Scan 14. Since it is a positive instance, check for mismatch in the
The 5th and 6h attribute value are mismatching, so add ? to those hypothesis 'S3 with 14.
attributes in 'S4.
14: >9 Yes Good Good Slow No Positive
instance
$4 =<9 Yes ? Good ? ?>
Prune G3 to exclude all inconsistent hypotheses with the positive
instance 14.
G3 =<9 ? ? ? ?>
<? Yes ? ? ?>
<? ? ? ? Yes> Inconsistent
Since the third hypothesis in G3 is inconsistent with this positive
one. The resulting G4 is, instance, remove the third
G4 = <9 ? ? ?>
<? Yes ? ? ?>
Using the two boundary sets, S4 and G4, the version space is
consistent hypotheses. converged to contain the set of
The final version space is,
<9 Yes ? ? ?>
<9 ? Good ? ?>
<? Yes ? Good ?
Thus, the algorithm finds the version space to contain
general and most specific. only those hypotheses that are most
The diagrammatic representation of deriving the
version space is shown in Figure 3.2.
94 " Machine Learning
Version space
Yes ? ? >9 ? Good ? Yes Good
G3: 29 ? ? ? ? ? Yes ? ?
? ? Yes
G2:? ? ? ?
G1:? ?
G:?
?
?
Figure 3.2: Deriving the
Version Space
3.5 INDUCTION BIASES
Induction is a process of
model. Inductive bias is learning a target
the set of prior function or
the training data, in
similar to prior knowledge used
order to
perform assumpt
for induction. iItons
general
consi
is also d izibyng atraining dataalgorithm bevo
ered
into agenerd