ADVANCED MACHINE LEARNING AND DEEP LEARNING
Syllabus
Introduction to Machine Learning: Introduction, Training, Rote Learning,
Learning Concepts, General-to-Specific, Ordering, Version Spaces, Candidate
Elimination, Inductive Bias, Decision-Tree Induction, The Problem of Overfitting, The
Nearest Neighbor Algorithm, Learning Neural Networks, Supervised Learning,
Unsupervised Learning, Reinforcement Learning.
Books
1. Artificial Intelligence Illuminated - Ben Coppin
2. Deep Learning - Ian Goodfellow, Yoshua Bengio, Aaron Courville
3. Fundamentals of Deep Learning – Nikhil Buduma
4. Neural Networks and Deep Learning – Charu Aggarwal
5. Hands-on Deep Learning Algorithms with Python – Sudharsan Ravichandran
Introduction to Machine Learning
Module-1
Dr. Veerabhadrappa S T
Associate Professor
Department of Electronics & Communication,
JSS Academy of Technical Education, Bengaluru
veerabhadrappast@[Link]
Introduction
Learning and intelligence are intimately related to each
other. It is usually agreed that a system capable of
learning deserves to be called intelligent; and
conversely, a system being considered as intelligent is,
among other things, usually expected to be able to
learn. Learning always has to do with the self
improvement of future behaviour based on past
experience.
—Sandip Sen and Gerhard Weiss, Learning in
Multiagent Systems
Veerabhadrappa S T, Department of Electronics & Communication, JSS Academy of Technical Education, Bengaluru
What is Machine Learning?
•Definition: Machine learning is a branch of artificial intelligence (AI)
that allows computers to learn from data and make decisions without
being explicitly programmed.
•Key Idea: Machines improve their performance as they are exposed
to more data over time.
ML Model Training Overview
•Training Process: The process of teaching a machine learning algorithm to recognize
patterns in data.
• Data Collection: Gathering relevant data.
• Model Selection: Choosing an appropriate algorithm (e.g., decision tree, neural
network).
• Training: Feeding data into the model and adjusting parameters to learn from the
data.
• Evaluation: Testing model accuracy on unseen data.
Training
Learning to classify unseen data clearly assumes that there is some
relationship between the data and the classifications. In other words,
some function f can be generated such that if a piece of data x belongs in
classification y, then f(x) = y.
Most real-world problems are not so simple, and producing a function that
approximates the correct mapping is one of the main challenges of
machine learning.
Training
For example, let us consider a system that is to learn how to evaluate static
chess positions.
First, we will consider a number of variables:
x1: Number of white pieces on the board
x2: Number of black pieces on the board
x3: Number of black pieces threatened by white pieces
x4: Number of white pieces threatened by black pieces
x5: Can white checkmate on the next go?
x6: Can black checkmate on the next go?
x7: Number of different moves white can make
x8: Number of different moves black can make
The evaluation function is a weighted sum of these variables:
f(x1, x2, x3, x4, x5, x6, x7, x8) = w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6 + w7x7 + w8x8
An item of training data might be
f(10, 2, 1, 0, true, false, 10, 1) = 100
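The weighted sum above can be sketched in a few lines of Python. This is only an illustration: the starting weights are hypothetical (the notes do not give any), and the Boolean variables x5 and x6 are assumed to count as 1 or 0. The gap between the prediction and the target value is the error that training would try to reduce by adjusting the weights.

```python
def evaluate(position, weights):
    """Weighted sum f(x1..x8) = w1*x1 + ... + w8*x8; booleans count as 1 or 0."""
    return sum(w * float(x) for w, x in zip(weights, position))

# The training item from the notes: f(10, 2, 1, 0, true, false, 10, 1) = 100
position = (10, 2, 1, 0, True, False, 10, 1)
target = 100

# Hypothetical starting weights (an assumption, not from the notes).
weights = [1.0, -1.0, 0.5, -0.5, 50.0, -50.0, 0.1, -0.1]

prediction = evaluate(position, weights)
error = target - prediction  # training adjusts the weights to shrink this gap
print(prediction, error)
```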
Rote Learning in AI
•Rote learning in AI refers to a form of memorization where a model simply memorizes
information without truly understanding its context.
•Rote Learning is a learning process where the AI system stores and reproduces data or
patterns without the ability to generalize and apply knowledge effectively.
Features of Rote Learning in AI
•Memorization: A rote learner learns by memorizing specific examples. For example, it
might memorize a large dataset of text or images without understanding the
context.
•Limited Generalization: Rote learning systems struggle to generalize their
knowledge to new, unseen situations. They may perform well on tasks similar to what
they've memorized but poorly on tasks outside their narrow scope.
•Lack of Adaptability: These systems typically do not adapt well to changes in data or
environment because they lack the ability to reason or adapt their knowledge.
•Limited Problem-Solving: Rote learning systems are not effective at problem-solving
or making decisions based on the information they've memorized.
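The limited generalization of a rote learner can be seen in a minimal sketch. The class name RoteLearner and the toy examples are hypothetical, chosen only for illustration: the learner stores each training pair in a lookup table and has no answer for any input it has not stored.

```python
class RoteLearner:
    """Memorizes (input, label) pairs and can only answer for inputs it has seen."""

    def __init__(self):
        self.memory = {}

    def train(self, examples):
        for x, y in examples:
            self.memory[x] = y  # store the pair verbatim, no generalization

    def predict(self, x):
        return self.memory.get(x)  # None for anything not memorized


learner = RoteLearner()
learner.train([((1, 1), "same"), ((2, 3), "different")])

seen = learner.predict((1, 1))    # memorized input
unseen = learner.predict((4, 4))  # never seen, so no answer is available
print(seen, unseen)
```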
Learning Concepts
Learning by Taking Advice: Expert Guidance
•Learning by taking advice involves seeking guidance from experts or mentors.
•This approach enables learners to tap into the experiences and insights of those who
have already mastered the subject matter.
•This method is particularly beneficial in fields that require practical skills and
knowledge, as it bridges the gap between theory and application.
Learning in Problem-Solving: Navigating Challenges
•Learning through problem-solving is a dynamic approach that encourages critical
thinking and creativity.
•This method places learners in scenarios where they must analyze, strategize, and
find solutions.
•Problem-solving learning enhances analytical skills, adaptability, and the ability to
approach complex issues with confidence.
Learning from Examples
•Induction is a learning approach where general principles are inferred from specific
examples.
•It's about recognizing patterns and making educated predictions based on observed
instances.
•Inductive learning is foundational in subjects like mathematics and science, where
hypotheses are formed through observation and experimentation.
Explanation-based Learning: Understanding the "Why"
• Explanation-based learning involves comprehending the underlying principles
behind concepts.
• Instead of memorizing isolated facts, learners grasp the rationale and mechanisms.
• This method promotes a deep understanding of subjects and fosters the ability to
apply learned knowledge in various contexts.
Inductive Learning / Concept Learning
Inductive learning: system tries to induce a “general rule” from a set of observed
instances.
Task:
• Learn (to imitate) a function f: X → Y
• Training Examples:
• The learning algorithm is given the correct value of the function for particular inputs
(the training examples).
• An example is a pair (x, f(x)), where x is the input and f(x) is the output of the
function applied to x.
• Goal:
• Learn a function h: X → Y that approximates
f: X → Y as well as possible.
Basic Requirements to Apply Inductive Learning Algorithm
1. List the examples in the form of a table 'T' where each row
corresponds to an example and each column contains an
attribute value.
2. Create a set of m training examples, each example
composed of k attributes and a class attribute with n possible
decisions.
3. Create a rule set, R, having the initial value false.
4. Initially, all rows in the table are unmarked.
Necessary Steps for Implementation
•Step 1: Divide the table 'T' containing m examples into n sub-tables (t1, t2, ..., tn), one for each possible value of the class
attribute. (Repeat steps 2-8 for each sub-table.)
•Step 2: Initialize the attribute combination count 'j' = 1.
•Step 3: For the sub-table being processed, divide the attribute list into distinct combinations, each with 'j'
distinct attributes.
•Step 4: For each combination of attributes, count the number of occurrences of attribute values that appear under that
combination in unmarked rows of the current sub-table and that do not appear under the same
combination in any other sub-table. Call the first combination with the maximum number of occurrences the
max-combination 'MAX'.
•Step 5: If 'MAX' == null, increase 'j' by 1 and go to Step 3.
•Step 6: Mark all rows of the current sub-table in which the values of 'MAX' appear as classified.
•Step 7: Add a rule (IF attribute = "XYZ" THEN decision is YES/NO) to R whose left-hand side contains the attribute names of 'MAX'
with their values, separated by AND, and whose right-hand side contains the decision attribute value associated with the sub-table.
•Step 8: If all rows are marked as classified, move on to the next sub-table and go to Step 2; otherwise, go to Step 4. If no
sub-tables remain, exit with the set of rules obtained so far.
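Steps 1 to 8 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name ila and the dictionary-based row format are hypothetical, each rule is returned as a (conditions, decision) pair, and the data is the four-row BigTip table from the Inductive Learning Example in these notes.

```python
from itertools import combinations

def ila(rows, attributes, class_attr):
    """Sketch of the rule-induction steps: returns a list of (conditions, decision)."""
    classes = []
    for r in rows:                                   # Step 1: one sub-table per class value
        if r[class_attr] not in classes:
            classes.append(r[class_attr])
    rules = []
    for decision in classes:
        sub = [r for r in rows if r[class_attr] == decision]
        others = [r for r in rows if r[class_attr] != decision]
        marked = [False] * len(sub)
        j = 1                                        # Step 2
        while not all(marked) and j <= len(attributes):
            best, best_count = None, 0
            for combo in combinations(attributes, j):    # Step 3
                # Value tuples that also occur in other sub-tables are not allowed.
                forbidden = {tuple(r[a] for a in combo) for r in others}
                counts = {}
                for r, m in zip(sub, marked):            # Step 4: unmarked rows only
                    values = tuple(r[a] for a in combo)
                    if not m and values not in forbidden:
                        counts[values] = counts.get(values, 0) + 1
                for values, c in counts.items():
                    if c > best_count:                   # strict '>' keeps the first maximum
                        best, best_count = dict(zip(combo, values)), c
            if best is None:                         # Step 5: try larger combinations
                j += 1
                continue
            for i, r in enumerate(sub):              # Step 6: mark the covered rows
                if all(r[a] == v for a, v in best.items()):
                    marked[i] = True
            rules.append((best, decision))           # Step 7
    return rules                                     # Step 8: all sub-tables processed

attributes = ["Food", "Chat", "Fast", "Price", "Bar"]
table = [
    dict(zip(attributes + ["BigTip"], row))
    for row in [
        ("great", "yes", "yes", "normal", "no", "yes"),
        ("great", "no", "yes", "normal", "no", "yes"),
        ("mediocre", "yes", "no", "high", "no", "no"),
        ("great", "yes", "yes", "normal", "yes", "yes"),
    ]
]
rules = ila(table, attributes, "BigTip")
for conditions, decision in rules:
    lhs = " AND ".join(f"{a} = {v}" for a, v in conditions.items())
    print(f"IF {lhs} THEN BigTip = {decision}")
```

On this table a single attribute is enough for each class, so the sketch never needs to raise j above 1.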
Inductive Learning Example
Food      Chat  Fast  Price   Bar  BigTip
(3)       (2)   (2)   (3)     (2)
great     yes   yes   normal  no   yes
great     no    yes   normal  no   yes
mediocre  yes   no    high    no   no
great     yes   yes   normal  yes  yes
• Instance Space X: Set of all possible objects described by attributes (often
called features).
• Target Function f: Mapping from Attributes to Target Feature
(often called label) (f is unknown)
• Hypothesis Space H: Set of all classification rules hi we allow.
• Training Data D: Set of instances labeled with Target Feature
Inductive Learning Algorithm
• Task:
• Given: collection of examples
• Return: a function h (hypothesis) that approximates f
• Inductive Learning Hypothesis:
Any hypothesis found to approximate the target function well over a sufficiently
large set of training examples will also approximate the target function well over
any other unobserved examples.
• Assumptions of Inductive Learning:
• The training sample represents the population
• The input features permit discrimination
Inductive Learning Setting
• Task: The learner (or inducer) induces a general rule h: X → Y from a set of observed examples that classifies new examples accurately.
• The learner is an algorithm that takes as input specific instances and produces a model that generalizes beyond these instances.
• Classifier - A mapping from unlabeled instances to (discrete) classes.
• Classifiers have a form (e.g., decision tree) plus an interpretation procedure (including how to handle unknowns, etc.)
Inductive learning method
• Fitting a function of a single variable to some data points
• Examples are (x, f(x)) pairs
• Hypothesis space H – set of hypotheses we will consider for
function f, in this case polynomials of degree at most k
• Construct/adjust h to agree with f on training set
• (h is consistent if it agrees with f on all examples)
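The idea of a consistent hypothesis can be illustrated with a short sketch. The helper names poly and is_consistent and the three training points are hypothetical: the points are assumed to come from f(x) = x² + 1, and two candidate polynomials from the hypothesis space are checked against them.

```python
def poly(coeffs):
    """Return h(x) = coeffs[0] + coeffs[1]*x + coeffs[2]*x**2 + ..."""
    return lambda x: sum(c * x**i for i, c in enumerate(coeffs))

def is_consistent(h, examples, tol=1e-9):
    """h is consistent if it agrees with f on every training example."""
    return all(abs(h(x) - fx) <= tol for x, fx in examples)

# Hypothetical training set of (x, f(x)) pairs, sampled from f(x) = x**2 + 1.
examples = [(0, 1), (1, 2), (2, 5)]

h_line = poly([1, 1])     # degree 1: h(x) = 1 + x
h_quad = poly([1, 0, 1])  # degree 2: h(x) = 1 + x**2

print(is_consistent(h_line, examples), is_consistent(h_quad, examples))
```

With k = 1 no hypothesis in H agrees with all three points; raising the maximum degree to k = 2 admits a consistent one.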
Version Spaces
A version space is a collection of hypotheses that are consistent with a set of
training examples.
Version space learning is a machine learning technique that uses version spaces to
learn binary classification.
A version space is a hierarchical representation of knowledge that enables you to
keep track of all the useful information supplied by a sequence of learning examples
without remembering any of the examples.
Candidate Elimination
The candidate elimination method stores not just a single hypothesis, but two sets of hypotheses.
Version Space Method Learning Algorithm: Candidate-Elimination
The version space method handles positive and negative examples symmetrically.
Given:
•A representation language.
•A set of positive and negative examples expressed in that language.
Compute: a concept description that is consistent with all the positive examples and none of the negative
examples.
Method:
•Initialize G, the set of maximally general hypotheses, to contain one element: the null description (all
features are variables).
•Initialize S, the set of maximally specific hypotheses, to contain one element: the first positive example.
•Accept a new training example.
• If the example is positive:
• Generalize all the specific models to match the positive example, but ensure the following:
• The new specific models involve minimal changes.
• Each new specific model is a specialization of some general model.
• No new specific model is a generalization of some other specific model.
• Prune away all the general models that fail to match the positive example.
Candidate Elimination
• If the example is negative:
• Specialize all general models to prevent match with the negative example, but ensure the
following:
• The new general models involve minimal changes.
• Each new general model is a generalization of some specific model.
• No new general model is a specialization of some other general model.
• Prune away all the specific models that match the negative example.
• If S and G are both singleton sets, then:
• if they are identical, output their value and halt.
• if they are different, the training cases were inconsistent. Output this result and halt.
• else continue accepting new training examples.
The algorithm stops when:
1. It runs out of data.
2. The number of hypotheses remaining is:
• 0: no consistent description for the data in the language.
• 1: answer (version space converges).
• 2+: all descriptions in the language are implicitly included.
Candidate Elimination
The method operates as follows: Two sets of hypotheses are maintained, hs and hg: hs
is initialized as {<∅, ∅, ∅, ∅, ∅, ∅, ∅, ∅>} and hg is initialized as {<?, ?, ?, ?, ?, ?, ?, ?>}.
When a positive training example is encountered, it is compared with the
hypotheses contained in hg.
If any of these hypotheses does not match the training example, it is removed from hg.
The positive training data are then compared with the hypotheses contained in hs.
If one of these hypotheses does not match the training data, it is replaced by the set of
slightly more general hypotheses that are consistent with the data, and such that
there is at least one hypothesis in hg that is more general.
This method is applied in reverse for negative training data. By applying
this method to each item of training data, the sets hg and hs move closer to
each other and eventually between them contain the full version space of
hypotheses that match all the training data.
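The update of hs and hg described above can be sketched for conjunctive hypotheses. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, None stands for ∅, the data is the four-row BigTip table from these notes (five attributes rather than eight), and each attribute's domain is assumed to contain only the values that appear in that table.

```python
def matches(h, x):
    """A hypothesis matches an instance if every constraint is '?' or equals the value."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(h1, h2):
    """True if h1 covers everything h2 covers (None means no value is accepted)."""
    return all(a == "?" or b is None or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    n = len(domains)
    hs = [tuple([None] * n)]   # most specific boundary
    hg = [tuple(["?"] * n)]    # most general boundary
    for x, positive in examples:
        if positive:
            hg = [g for g in hg if matches(g, x)]
            new_hs = []
            for s in hs:
                if not matches(s, x):
                    # Minimal generalization of s that covers x.
                    s = tuple(v if c is None else (c if c == v else "?")
                              for c, v in zip(s, x))
                if any(more_general_or_equal(g, s) for g in hg):
                    new_hs.append(s)
            hs = new_hs
        else:
            hs = [s for s in hs if not matches(s, x)]
            new_hg = []
            for g in hg:
                if not matches(g, x):
                    new_hg.append(g)
                    continue
                # Minimal specializations: fix one '?' to a value that excludes x.
                for i, c in enumerate(g):
                    if c != "?":
                        continue
                    for value in domains[i]:
                        g2 = g[:i] + (value,) + g[i + 1:]
                        if (value != x[i] and g2 not in new_hg
                                and any(more_general_or_equal(g2, s) for s in hs)):
                            new_hg.append(g2)
            # Keep only the maximally general members.
            hg = [g for g in new_hg
                  if not any(o != g and more_general_or_equal(o, g) for o in new_hg)]
    return hs, hg

# Attributes: Food, Chat, Fast, Price, Bar (domains limited to observed values).
domains = [("great", "mediocre"), ("yes", "no"), ("yes", "no"),
           ("normal", "high"), ("yes", "no")]
examples = [
    (("great", "yes", "yes", "normal", "no"), True),
    (("great", "no", "yes", "normal", "no"), True),
    (("mediocre", "yes", "no", "high", "no"), False),
    (("great", "yes", "yes", "normal", "yes"), True),
]
hs, hg = candidate_elimination(examples, domains)
print("hs:", hs)
print("hg:", hg)
```

After the four examples hs holds one hypothesis and hg holds three, so the version space has not yet converged to a single concept.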
Inductive Bias
All learning methods have an inductive bias. Inductive bias refers to the restrictions that are imposed
by the assumptions made in the learning method.
For example, in the above discussions we have been assuming that the solution to the problem of
road safety can be expressed as a conjunction of a set of eight concepts.
This does not allow for more complex expressions that cannot be expressed as a conjunction.
This inductive bias means that there are some potential solutions that we cannot explore, and which
are, therefore, not contained within the version space we examine. This may seem like an
unfortunate limitation, but in fact inductive bias is essential for learning. In order to have an
unbiased learner, the version space would have to contain every possible hypothesis that could
possibly be expressed.
This would impose a severe limitation: the solution that the learner produced could never be any
more general than the complete set of training data.
In other words, it would be able to classify data that it had previously seen (as the rote learner could)
but would be unable to generalize in order to classify new, unseen data.
The inductive bias of the candidate elimination algorithm is that it is only able to classify a new piece
of data if all the hypotheses contained within its version space give the data the same classification.
Hence, the inductive bias does impose a limitation on the learning method.
In the 14th century, William of Occam proposed his famous “Occam’s razor,” which simply states
that it is best to choose the simplest hypothesis to explain any phenomenon.
We can consider this to be a form of inductive bias, which states that the best hypothesis to fit a set
of training data is the simplest hypothesis. We will see later how this inductive bias can be useful in learning decision trees.
Decision-Tree Induction
A tree in which the leaf nodes are all Boolean values is called a decision
tree. A decision tree takes in a set of attribute values and outputs a Boolean decision.
A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch
represents an outcome of the test, and each leaf node holds a class label.
The best-known decision tree induction algorithm is ID3, which was developed by Quinlan in the
1980s.
[Figures omitted: decision-tree induction algorithm and worked example]
Design Issues of Decision Tree Induction
• How should training examples be split?
– Method for specifying test condition (Splitting Criterion)
• depending on attribute types
– Measure for evaluating the goodness of a test condition (Selecting Best Attribute)
• information gain, gain ratio, Gini index, misclassification error, statistical test, …
• How should the splitting procedure stop?
– Stop splitting if all the examples belong to the same class
– Early termination depending on the results of a statistical test.
Methods for Expressing Test Conditions
Splitting Criterion
• Depends on attribute types
– Binary
– Nominal
– Ordinal
– Continuous
• Depends on number of ways to split
– 2-way split (Binary split)
– Multi-way split
The Nearest Neighbor Algorithm
The nearest neighbor method is a widely used local approximation method
that uses the historical values close to the current state vector to estimate
the trajectory of the current state by using the trajectories of the
neighboring points.
Key characteristics of the method:
(1) The inclusion of irrelevant variables lowers the classification accuracy,
(2) the algorithm works primarily on numerical variables,
(3) categorical variables can be handled but must be specially treated by
the algorithm, and
(4) classification accuracy will be degraded if the scales of variables are
not in proportion to their importance.
A common distance measure is the Manhattan distance between two points (x1, y1) and (x2, y2):
Distance = |x1 − x2| + |y1 − y2|.
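A nearest-neighbor classifier built on this distance can be sketched as follows. The function names and the four labeled training points are hypothetical, chosen only to show the mechanics: the query receives the label of the closest stored point.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences, as in the distance formula above."""
    return sum(abs(p - q) for p, q in zip(a, b))

def nearest_neighbor(training, query):
    """Return the label of the training point closest to the query."""
    _, label = min(training, key=lambda item: manhattan(item[0], query))
    return label

# Hypothetical labeled points (not from the notes).
training = [
    ((1.0, 1.0), "A"),
    ((2.0, 1.5), "A"),
    ((6.0, 7.0), "B"),
    ((7.0, 6.5), "B"),
]

label_1 = nearest_neighbor(training, (1.5, 2.0))
label_2 = nearest_neighbor(training, (6.5, 6.0))
print(label_1, label_2)
```

Because every query is compared with every stored point, the cost of a prediction grows with the size of the training set.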
The Nearest Neighbor Algorithm
Advantages of Nearest-Neighbor Methods
They are easy to implement.
They can also give quite good results if the features are chosen carefully (and if
they are weighted carefully in the computation of the distance). There are several
serious disadvantages of the nearest-neighbor methods.
Disadvantages of Nearest-Neighbor Methods
The most serious shortcoming of nearest-neighbor methods is that they are very
sensitive to the presence of irrelevant parameters. Adding a single parameter
that has a random value for all objects (so that it does not separate the classes)
can cause these methods to fail miserably.
Similar to a neural network, nearest-neighbor methods do not simplify the
distribution of objects in parameter space to a comprehensible set of features.
Instead, the training set is retained in its entirety as a description of the object
distribution. There are some thinning methods that can be used on the training
set, but the result still may not provide a compact description of the object
distribution.
These methods are also rather slow if the training set has many examples.
Information Gain
The method used by ID3 to determine which features to use at each stage of the
decision tree is to select, at each stage, the feature that provides the greatest
information gain.
Information gain is defined as the reduction in entropy. The entropy of a set of training
data, S, is defined as
H(S) = -p1 log2 p1 - p0 log2 p0
where p1 is the proportion of the training data made up of positive examples, and
p0 is the proportion made up of negative examples.
The entropy of S is zero when all the examples are positive, or when all the examples
are negative.
The entropy reaches its maximum value of 1 when exactly half of the examples are
positive and half are negative.
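Entropy and information gain can be computed with a short sketch. The function names are hypothetical, and the gain is taken to be H(S) minus the size-weighted entropy of the subsets produced by splitting on an attribute, which is the usual reading of "reduction in entropy". The data is the four-row BigTip table from these notes.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum of p * log2(p) over the classes present in S."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the whole set minus the weighted entropy after the split."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

attributes = ["Food", "Chat", "Fast", "Price", "Bar"]
table = [
    dict(zip(attributes + ["BigTip"], row))
    for row in [
        ("great", "yes", "yes", "normal", "no", "yes"),
        ("great", "no", "yes", "normal", "no", "yes"),
        ("mediocre", "yes", "no", "high", "no", "no"),
        ("great", "yes", "yes", "normal", "yes", "yes"),
    ]
]

gains = {a: information_gain(table, a, "BigTip") for a in attributes}
best = max(attributes, key=gains.get)  # ties go to the first attribute listed
print(best, gains)
```

On this table Food, Fast, and Price each separate the classes perfectly, so they tie for the highest gain and the sketch picks the first of them.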