Understanding Deep Learning Basics

The document provides an overview of deep learning and its relationship with artificial intelligence and machine learning, explaining various paradigms such as supervised, unsupervised, and reinforcement learning. It details the structure and function of artificial neural networks, including simple and multi-layer perceptrons, as well as the process of training these models using loss functions and gradient descent. Additionally, it discusses the challenges of modeling complex problems like XOR, highlighting the need for multilayer perceptrons to achieve accurate predictions.

DEEP LEARNING

Artificial Intelligence is human-like behavior performed by machines. It is a scientific discipline concerned with creating computer programs that perform operations comparable to those of the human mind, such as learning or logical reasoning.

Machine Learning is a strategy for achieving artificial intelligence. It is the study and construction of algorithms that can learn from data and make predictions about it. If there is data, it is machine learning. In practice, machine learning solves problems of 'narrow' artificial intelligence (Narrow AI): models that solve one problem very well but not others. Today we have models that are very good at a single task; taken outside that task, the model does not know what to do and performs poorly. Image segmentation and classification, voice recognition, and prediction from tabular data are all problems that ML solves very well. In machine (or automatic) learning, the available data is used to train a model that then makes predictions on unseen data.
There are different paradigms within machine learning. In Supervised Learning there is a dataset with labels that supervise the learning process, and the trained model is then used to make predictions on new data. Unsupervised Learning algorithms try to learn the structure of the data. Semi-Supervised Learning algorithms use both labeled and unlabeled data during learning: the unlabeled data makes it possible to learn representations of the data and use them to make more accurate predictions (the distribution of the data on which predictions are made is better known). Reinforcement Learning is learning from interaction with the environment: an agent interacts with the environment by performing actions, and it receives a reward when it performs the action well. There are no labels; the reward is what reinforces the model.

There are two types of machine learning models: discriminative models and generative models. Discriminative Models learn a conditional probability given the data, P(Y|X): the probability that a given data point belongs to a certain class. A predictive, discriminative model is learned, where the aim is to separate the data into, for example, 2 sets. Generative Models, on the other hand, seek to generate data instead of classifying it: to create things that make sense. In this case, the model learns the distribution of the data, that is, P(X).
We will be interested in solving supervised discriminative problems (e.g. image classification).

Image Classification
It predicts, for example, the probability that there is a cat in a given image.
For Python, images are matrices of numbers (intensities). In general, the values range between 0 and 255, or between 0 and 1. A color image has 3 RGB channels: 3 "slices of bread" one after the other, that is, 3 matrices of the same size where each one represents a different color channel.
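As a small illustration of this representation, the sketch below builds a hypothetical 2x2 color image as three nested-list matrices (one per channel) and rescales its 0-255 intensities to the 0-1 range. The pixel values and the names `image` and `normalize` are invented for the example.

```python
# A tiny 2x2 RGB image stored as 3 matrices (channels) of the same size.
# All pixel values are invented for illustration.
red = [[255, 0], [128, 64]]
green = [[0, 255], [128, 64]]
blue = [[0, 0], [128, 255]]
image = [red, green, blue]  # 3 channels x 2 rows x 2 columns


def normalize(channel):
    """Rescale integer intensities in [0, 255] to floats in [0, 1]."""
    return [[pixel / 255 for pixel in row] for row in channel]


normalized = [normalize(channel) for channel in image]
shape = (len(image), len(image[0]), len(image[0][0]))
print(shape)             # channels, rows, columns
print(normalized[0][0])  # first row of the red channel
```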

Previously, features were extracted from images manually through hand-written routines. Once the relevant features for the problem were established, a classifier could say whether what is in the image is a dachshund or not.
EXAMPLE: given the image of a bicycle, other representations are generated through routines that perform feature extraction (e.g. the average color of the image, the standard deviation of its colors, or dividing the image into quadrants and taking the average color of each quadrant). The most relevant features for the problem are selected and fed into a classifier, which makes the prediction.
The problem was that the features were extracted manually, which made computer vision problems difficult to solve.
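A minimal sketch of such a hand-written extraction routine, assuming a grayscale image stored as a list of rows. The function name `handcrafted_features` and the 4x4 image are hypothetical; the features computed are the ones listed above (mean, standard deviation, mean of each quadrant).

```python
from statistics import mean, pstdev


def handcrafted_features(gray):
    """Return [mean, std, mean of each of the 4 quadrants] of a grayscale image."""
    rows, cols = len(gray), len(gray[0])
    pixels = [p for row in gray for p in row]
    half_r, half_c = rows // 2, cols // 2
    quadrants = []
    # Visit the quadrants in the order: top-left, top-right, bottom-left, bottom-right.
    for r0, r1 in ((0, half_r), (half_r, rows)):
        for c0, c1 in ((0, half_c), (half_c, cols)):
            block = [gray[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            quadrants.append(mean(block))
    return [mean(pixels), pstdev(pixels)] + quadrants


# Invented 4x4 image whose quadrants have constant intensity.
gray = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [20, 20, 30, 30],
    [20, 20, 30, 30],
]
features = handcrafted_features(gray)
print(features)
```

The resulting feature vector is what would be handed to a separate classifier in this older, decoupled pipeline.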
Deep Learning is a technique for implementing machine learning. Deep learning models are capable of learning representations of the training data at multiple levels of abstraction (layers), composing simple modules that successively transform those representations into others with a higher level of abstraction. The modules are very simple but, combined one after the other in different layers (levels of abstraction), they begin to specialize and learn features that are useful for solving the problem in question. The first layers specialize in low-level features (e.g., in images these can be edges, corners, etc.), and the subsequent layers combine those features to build more complex concepts, until we end up with very complex concepts built on the simpler ones. This is a model trained 'end to end', where everything is trained together: there are no longer decoupled stages. The fundamental component is the artificial neuron.

Artificial Neural Networks


An artificial neuron computes the inner product between an input data or feature vector (Xi) and a weight vector. The weights are learned during the training of the model. This sum is passed through a non-linear activation function. The neuron is therefore a function that takes input features and is parameterized by its weights.

In a neuron, there are inputs, Xi, and weights, Wi, that are learned. The product of each input with its weight is added up, and this sum is passed through an activation function, which is a nonlinear function.
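The computation just described can be sketched in a few lines. The input values, the weights and the choice of a sign activation (with z = 0 mapped to -1) are assumptions made only for this illustration.

```python
def sign(z):
    # Assumed convention: z == 0 is mapped to -1.
    return 1 if z > 0 else -1


def neuron(x, w, activation):
    """Weighted sum of the inputs followed by an activation function."""
    z = sum(xi * wi for xi, wi in zip(x, w))
    return activation(z)


x = [0.8, 0.3]   # hypothetical input features
w = [1.0, -2.0]  # hypothetical weights (normally learned during training)
prediction = neuron(x, w, sign)
print(prediction)
```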
Simple Perceptron
A single neuron is called a simple perceptron. That is, a simple perceptron is the simplest neural network. A simple example of a perceptron is predicting whether or not it will rain based on atmospheric data. The activation function is non-linear because it only takes the values 1 and -1 (it rains or it doesn't rain). A neuron has data as input and a prediction as output. These are not deep models.

Multi-Layer Perceptron
A single neuron can model only a limited set of problems; to model more complex problems we need to combine neurons. By combining them we build multilayer perceptrons, where many neurons are placed in parallel to take the inputs, process them and return an output, and this output is fed into a new layer that processes what came out of the previous layer. These are deep models (because they have several layers). The typical deep model, or at least the most basic one, has an input layer, a hidden layer and an output layer.
How do we learn the weights of the neural network?
With a loss function. For example, in a regression problem, the mean squared error could be the loss function. The loss function compares the output of the network with the ground truth or labels. The loss function can take many forms.
We are going to learn w with an optimization algorithm called gradient descent.

Gradient Descent
With gradient descent, we picture the loss as an error surface: the value of the cost function computed on the training data. The loss function is computed on the training data and, for a given value of w1 and w2 (with more features there would also be more weights), we obtain a loss value.

The axes of this graph hold the values of the different weights (w1 and w2). This is a model with 2 features; with more features the graph would have more dimensions.

We want to reach the minimum of the error surface (point B), since this is the set of parameters that minimizes the loss function on the training data.

To move from point A to point B we use the derivative (a mathematical tool that tells us in which direction to walk so that the function changes as much as possible). When we extend the derivative to functions of multiple variables we get the gradient, which is the vector of partial derivatives. In this case, they are the derivatives of the loss function with respect to each of the parameters of the neuron (the ws). The gradient vector points in the direction in which the function grows fastest but, since we are looking for the direction in which the function decreases fastest, we take the negative of the gradient.
How to calculate the gradient
Analytical derivation: derive by hand and write the code. An expression is written that states the form the derivative of the function takes. It is easy to do for one neuron, but not for more than one.
Numerical differentiation: finite differences. The value of the function is computed at two points and the slope of the line through them is taken. These methods have many stability problems and are expensive.
Symbolic derivation: carried out using the rules studied in Mathematical Analysis, but automated (e.g., Maple, Mathematica). It returns an analytical expression that can be very ugly.
Automatic derivation: derivatives are defined for 'primitive' operations such as a sum or a product (mathematical and control operations). A graph of operations is constructed and differentiated following the chain rule. There are frameworks that implement these automatic differentiation algorithms (TensorFlow, PyTorch).
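To make the first two options concrete, here is a small sketch that compares a hand-derived gradient with a finite-difference estimate. The one-weight loss, the example point (x = 3, d = 6) and the step h are assumptions chosen for illustration; the central difference used here is one common finite-difference scheme among several.

```python
X, D = 3.0, 6.0  # a single hypothetical training example (input, label)


def loss(w):
    """Squared error of a one-weight linear neuron on the example above."""
    return (D - w * X) ** 2


def analytical_grad(w):
    """Derivative written by hand: dL/dw = -2 * x * (d - w * x)."""
    return -2.0 * X * (D - w * X)


def numerical_grad(f, w, h=1e-5):
    """Central finite difference: slope of the line through two nearby points."""
    return (f(w + h) - f(w - h)) / (2 * h)


w = 1.0
exact = analytical_grad(w)
approx = numerical_grad(loss, w)
print(exact, approx)
```

The numerical version needs two evaluations of the loss per weight, which is why it becomes expensive for models with many parameters.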

Artificial Neuron: Simple Perceptron


In general, we ask whether the value that comes out of the neuron is greater than a certain threshold. This threshold is called the bias. An easier way to express the perceptron is to move this bias (b) into the sum of products. b is a parameter that we are also going to learn, so we can see it as one more weight of the network. It is placed inside the weight vector and paired with an input fixed at -1 so that it acts as a subtraction. This is called the bias trick.
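The equivalence behind the bias trick can be checked with a short sketch. The weights, the threshold value and the function names are hypothetical; the only point is that comparing the sum against b gives the same answer as appending a constant -1 input whose weight is b.

```python
def perceptron_threshold(x, w, b):
    # Original form: output 1 when the weighted sum exceeds the threshold b.
    z = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if z > b else -1


def perceptron_bias_trick(x, w_aug):
    # Bias trick: append a constant -1 input so that b becomes the last weight.
    x_aug = list(x) + [-1.0]
    z = sum(xi * wi for xi, wi in zip(x_aug, w_aug))
    return 1 if z > 0 else -1


w, b = [1.0, 1.0], 0.5   # hypothetical weights and threshold
w_aug = w + [b]          # the bias is now just one more weight
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
same = all(
    perceptron_threshold(x, w, b) == perceptron_bias_trick(x, w_aug)
    for x in inputs
)
print(same)
```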

What is a neuron? → The most basic operation that we can represent in a neuron is the
inner product between the input and the weights.
Activation Functions
●Sign function: we pass z (the result of the summation) through the sign function. If z < 0, the function is worth -1, and if z > 0, the function is worth 1.

●Linear sign function: returns -1 up to a certain value of a.

●Sigmoid function: maps any real number to the range (0,1). It can be useful to represent, for example, a probability (e.g. for a binary classifier). Logistic regression, for example, is a simple perceptron with a sigmoid activation function.
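A minimal sketch of the sign and sigmoid activations follows. The text does not say what the sign function returns at z = 0, so mapping it to -1 is an assumption; the sigmoid is written in a form that avoids overflow for large negative z.

```python
import math


def sign(z):
    # Assumed convention: z == 0 is mapped to -1.
    return 1 if z > 0 else -1


def sigmoid(z):
    """Map any real number into (0, 1) without overflowing math.exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


values = [sigmoid(z) for z in (-5.0, 0.0, 5.0)]
print(values)
print(sign(-2.0), sign(3.0))
```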

Decision Boundary (for simple perceptron)


We want to solve a problem with 2 input features (x1 and x2). The decision boundary of a classifier marks where the space stops corresponding to one class and starts corresponding to another: it separates the space between the different classes. To calculate it and obtain the equation of the line, we set the sum of the perceptron to 0 and solve for one variable as a function of the other. To draw the line we need to define w1 and w2. As we vary w1 and w2, we obtain different classifiers, that is, different decision boundaries. The key is how to find the values of w1 and w2.
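Written out, setting the perceptron's sum to zero and solving for x2 gives the line below. This sketch leaves out any bias term and assumes that w2 is not zero.

```latex
w_1 x_1 + w_2 x_2 = 0
\quad\Longrightarrow\quad
x_2 = -\frac{w_1}{w_2}\, x_1 , \qquad w_2 \neq 0
```

The slope of the boundary is therefore fixed by the ratio of the two weights, and without a bias the line always passes through the origin.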
Modeling the operation X1 OR X2
In logical operations we have boolean variables (they can only take the values 0 and 1), and the following is known as the truth table of the operation:
X1 X2 X1 OR X2
0 (false) 0 (false) 0
1 (true) 0 (false) 1
0 1 1
1 1 1

The dataset has 4 elements, that is, the 4 possible input combinations for the problem. The idea is to draw a line that acts as a decision boundary, leaving on one side the combinations that give 1 and on the other side the combinations that give -1. Only false/false gives a negative (false) output. The true and false classes remain separated.

In the decision boundary, we add the bias term (w0) to shift the line away from the origin.
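One possible choice of weights that separates the OR classes is sketched below. The values w0 = -0.5, w1 = 1 and w2 = 1 are picked by hand for illustration (many other choices work), and the bias is written as an additive term w0.

```python
def or_perceptron(x1, x2, w0=-0.5, w1=1.0, w2=1.0):
    """Simple perceptron with a sign output; the weights are hand-picked."""
    z = w0 + w1 * x1 + w2 * x2
    return 1 if z > 0 else -1


# Evaluate the perceptron on the 4 rows of the truth table.
table = {(x1, x2): or_perceptron(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
for inputs, output in table.items():
    print(inputs, output)
```

With these weights the decision boundary is the line x2 = 0.5 - x1, which leaves only the point (0, 0) on the negative side.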

How to train my perceptron?


So far we have set the weights manually, but now we are going to learn them from a training dataset consisting of pairs of data (x) and desired labels (d). For this, it is necessary to define a loss function.

Loss function (or cost function)


In the context of machine learning, it is a function that measures the quality of a prediction based on the agreement between that prediction and the label provided in the training dataset. It measures how good my model's prediction is.
- It returns a high value if our prediction is incorrect, and a low value if the prediction is correct.
- It serves as a guide in the search for the parameters of the model being trained.
- We will call the loss function L. L takes the data and the parameters of the neuron.
- The simplest loss function is that of the simple perceptron with linear output (without activation function). It is calculated as the squared error between the label and the product of weights and features.
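The squared-error loss of the linear perceptron can be sketched as follows. The example data point and the two weight vectors are invented to show a low loss for a correct prediction and a high loss for an incorrect one.

```python
def squared_error_loss(x, d, w):
    """L(x, d; w) = (d - w . x)^2 for a perceptron with linear output."""
    y = sum(xi * wi for xi, wi in zip(x, w))
    return (d - y) ** 2


x, d = [1.0, 2.0], 5.0                           # hypothetical data point and label
good = squared_error_loss(x, d, [1.0, 2.0])      # prediction is exactly 5.0
bad = squared_error_loss(x, d, [0.0, 0.0])       # prediction is 0.0
print(good, bad)
```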

Learning as an optimization problem


We are going to treat learning problems as optimization problems. First, we sum the value of the loss function computed for each training data point, with the current state of the parameters. The aim is to minimize the average loss over all the data in the dataset, searching among all possible ws. There are infinitely many ws for a perceptron, that is, the search space is infinite. This is why these are difficult problems. The argument that minimizes the average loss is the set of optimal parameters that solve the problem for the training data. This w will be the model, that is, what is kept in memory. This is called the principle of empirical risk minimization: it is about finding the best model using the limited data available, under the hypothesis that future data will behave similarly to the training data.
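In symbols, the principle can be written as below. The notation is an assumption made for this sketch: N is the number of training pairs (x_i, d_i), L is the loss function, and w* denotes the minimizing parameters.

```latex
w^{*} = \arg\min_{w} \; \frac{1}{N} \sum_{i=1}^{N} L\left(x_i, d_i; w\right)
```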

Strategies to find the optimal parameters


Random search
It is the simplest way: parameters are drawn at random. For the algorithm, we initialize a variable "the best value of the loss function" and iterate a certain number of times. In each iteration, we randomly sample the weights w, evaluate the loss function for the perceptron with that w, and check whether the loss is smaller than the best value we have so far. If it is, we save that loss value and keep those weights as the best so far. At each step we discard the previous candidate: we keep only the best one and do not use the rest of the information. The quality of the solution depends on the number of iterations. It is not the best algorithm, since we have no guarantee of finding anything.
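The procedure above can be sketched as follows for a linear perceptron with squared-error loss. The tiny dataset, the sampling range [-5, 5] and the fixed seed are assumptions made so the example is reproducible.

```python
import random


def mean_loss(w, data):
    """Average squared error of a linear perceptron over the dataset."""
    total = 0.0
    for x, d in data:
        y = sum(xi * wi for xi, wi in zip(x, w))
        total += (d - y) ** 2
    return total / len(data)


def random_search(data, n_features, iterations, seed=0):
    rng = random.Random(seed)
    best_loss, best_w = float("inf"), None
    for _ in range(iterations):
        # Sample a candidate weight vector at random.
        w = [rng.uniform(-5.0, 5.0) for _ in range(n_features)]
        current = mean_loss(w, data)
        # Keep the candidate only if it beats the best loss so far.
        if current < best_loss:
            best_loss, best_w = current, w
    return best_w, best_loss


data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w_few, loss_few = random_search(data, 2, iterations=20)
w_many, loss_many = random_search(data, 2, iterations=2000)
print(loss_few, loss_many)
```

Because both runs share the same seed, the longer run sees every candidate of the shorter one, so its best loss can only be equal or lower.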
Gradient descent method
We update the weights in the direction that ensures a reduction in the loss function. This direction is given by the gradient.

w is initialized randomly; then steps are taken in the direction opposite to the gradient, recomputing the gradient at each step and updating the weights. This method only guarantees convergence to local minima. If the function is convex, the local minimum is also the global minimum. The squared error used as the loss function of the simple perceptron in the linear case results in a convex function. That is, if we are training this perceptron, we are guaranteed to find the global minimum. Beyond this case, we generally do not have this guarantee.

Algorithm: the weights are initialized randomly and, given the dataset, we update w (the value of the weights) as the value w has at that moment minus the gradient multiplied by the step size. This scales the gradient so that there are no very large jumps.
EXAMPLE: gradient calculation in the linear case of a perceptron
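For the squared error L = (d - w·x)^2 of a linear perceptron, the partial derivative with respect to each weight w_j is -2 (d - w·x) x_j. The sketch below uses that expression inside a plain gradient descent loop. The dataset (generated from the weights 2 and -1), the step size and the number of steps are assumptions; the weights start at zero instead of at random so the run is reproducible.

```python
def predict(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))


def mean_loss(w, data):
    return sum((d - predict(x, w)) ** 2 for x, d in data) / len(data)


def gradient(w, data):
    """Gradient of the average squared error with respect to each weight."""
    grad = [0.0] * len(w)
    for x, d in data:
        error = d - predict(x, w)
        for j, xj in enumerate(x):
            grad[j] += -2.0 * error * xj / len(data)
    return grad


def gradient_descent(data, n_features, learning_rate=0.1, steps=500):
    w = [0.0] * n_features  # zero start (assumed) instead of random
    for _ in range(steps):
        grad = gradient(w, data)
        # Move against the gradient, scaled by the step size.
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad)]
    return w


# Labels generated by d = 2*x1 - 1*x2, so the optimum is w = [2, -1].
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0),
        ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
w = gradient_descent(data, n_features=2)
print(w, mean_loss(w, data))
```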

Modeling the XOR


XOR is the exclusive OR operation. That is, it is another logical operation where, for the output to be true, only one of the two inputs has to be true (when both are true, the output is false). This changes the layout of the points in the Cartesian plane.

We cannot establish a decision boundary that solves the XOR problem with a simple linear perceptron. It is not possible even with a simple perceptron that has a non-linear activation function. A multilayer perceptron (MLP) is needed. That is, the output of a neuron (which is a line like the one in the second graph above) must now enter another neuron that combines the outputs of several neurons from the previous layer. In a multilayer perceptron, many neurons are placed in parallel so that their outputs can be combined into a prediction. Lines start to be combined through the concatenation of neurons. Under certain conditions, the MLP is known as a universal approximator, since it can approximate any function.
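A minimal sketch of how two layers solve XOR: two hidden neurons each draw one line (one behaves like OR, the other like AND), and an output neuron combines them. All weights are hand-picked for illustration, and a 0/1 step activation is assumed here in place of the -1/1 sign function.

```python
def step(z):
    # Assumed 0/1 step activation.
    return 1 if z > 0 else 0


def xor_mlp(x1, x2):
    # Hidden layer: two neurons placed in parallel, each one a line.
    h_or = step(x1 + x2 - 0.5)    # fires when at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only when both inputs are 1
    # Output layer: combines the two lines ("OR but not AND").
    return step(h_or - h_and - 0.5)


table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
for inputs, output in table.items():
    print(inputs, output)
```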
A multilayer perceptron (MLP) is a structure that combines several neurons into a single model. We now start talking about neural network architecture. The architecture is given by: how many layers the perceptron has, how many neurons each layer has, which activation function is used in each layer, and whether the layer has a bias or not. In a supervised learning paradigm like ours, we train the model using labels. During training, each of the model's outputs has to be compared with the label we want for that data point. Everything is guided by the loss function.
In the MLP, we no longer have a single weight vector (before we had one weight vector w for the single neuron) but a matrix with one weight vector for each neuron in the layer, and each neuron has as many weights as there are input features. In the example of the following drawing, the first layer has 5 weight vectors arranged in a matrix, where each vector has 4 weights (because there are 4 input features).
In these feedforward networks, where information only flows forward, the output of the previous layer is what enters the next one (each arrow is an x). Therefore, each neuron of the second hidden layer has 5 weights, because 5 features enter the neurons of that layer. The last layer is called the output layer.
All the neurons in a layer have the same activation function, but 2 layers can have different activation functions.
Having many parameters can lead to overfitting.

Interpretation of the drawing: we have 3 outputs. We can think of, for example, a classification problem where we input a data point and classify it into 3 different classes. Each neuron then outputs, for example, the probability that the input belongs to class 1, 2, or 3. We could also be regressing 3 values, such as the age, height, and weight of a person, based on a set of that person's attributes.
Example: problem with 2 input features
For example, we may be estimating rain (whether it will rain or not) and tomorrow's temperature based on today's atmospheric pressure and humidity.
The first layer is a hidden layer with 3 neurons, and each neuron has 2 weights: one that multiplies the first input and another that multiplies the second. The weight matrix then has one column for each neuron (holding its weights). When the data arrives, we perform a product between the input vector (pink in the drawing) and the Wh matrix: we multiply a 1x2 matrix by a 2x3 matrix, so the output matrix is 1x3, which represents the outputs of each neuron.
Size of the weight matrix of a layer: number of input features x number of output neurons. The layer outputs are the result of computing X x W.
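The shapes in this example can be checked with a small matrix product written by hand. The numeric values of the input and of the matrix `w_h` are invented; only the 1x2, 2x3 and 1x3 shapes come from the example.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


x = [[1.0, 2.0]]            # 1x2: one data point with 2 input features
w_h = [[0.1, 0.2, 0.3],     # 2x3: one column of weights per hidden neuron
       [0.4, 0.5, 0.6]]
z_h = matmul(x, w_h)        # 1x3: one output per hidden neuron
print(len(z_h), len(z_h[0]))
print(z_h)
```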

Universal approximation theorem


It establishes that a feed-forward network with a single hidden layer is sufficient to approximate, with arbitrary precision, any function with a finite number of discontinuities, provided that the activation functions of the hidden neurons are nonlinear.
It is an existence theorem: it states that the solution exists but does not say that it is easy to find.

How to train my multilayer perceptron?


Simple perceptrons give us a convex error surface; however, when we train multilayer perceptrons we obtain non-convex error surfaces (with many local minima). That is, we have no guarantee of finding the global minimum of the function through gradient descent.
Non-convex loss function in MLP
Gradient descent does not guarantee finding the global minimum of non-convex functions. However, for large neural networks (with many parameters), most local minima are similar and show similar performance on test datasets (as long as the test data is distributed similarly to the training data). The probability of finding 'bad' local minima decreases with the size of the network. Focusing too much on finding the global minimum on the training dataset is not useful in practice, because we can overfit.

Gradient method for MLP


Given the dataset, the loss function is calculated for all elements of the dataset, the average of the loss over all elements is taken (it is summed and divided by the number of elements in the dataset), the derivative of this average is taken with respect to the ws, the weights are updated, and the process is repeated until some convergence criterion is satisfied (e.g., reaching a local minimum).

When we train our multilayer perceptron, we take the gradient of the loss function with respect to the model parameters (ws).

Now the Ws are matrices, no longer a single vector as before. The gradient of the loss function is taken with respect to the Ws, so the gradient vector has one component for each parameter. There is one weight matrix for each layer that has weights, and each of these matrices is different.

Variants of the gradient method


●Classical gradient descent → each element of the dataset is taken and passed through the perceptron, giving a prediction for each element. We then compute the loss function L for each of those elements against the ground truth associated with it, and take the average of the loss function; this number is a scalar. We compute the gradient of this number. This works well with a moderate number of elements, but when we have many (e.g. images), classical gradient descent does not scale, since every iteration goes through the entire dataset.
●Stochastic gradient descent → instead of summing the loss terms over all the data, only one data point is taken: the prediction is made, the loss function is calculated using only this data point, and the gradient is computed. This gradient indicates the direction in which we need to move so that the loss function decreases for that data point. With classical gradient descent we computed an average, so the gradient of the average was much more informative (it had information about all the elements of the dataset, and now only about one). An element of the dataset is sampled at random and the weights are updated. This gradient is called stochastic because the element is sampled at random (we take one element, compute the loss function and update; we take another, update, and so on). It is a way to make progress much faster, but it is also more erratic, because each gradient points in the direction that benefits a single element. It usually works quite well.
●Mini-batch gradient descent → an intermediate point (neither 10,000,000 images nor 1, but 32 images). A batch of 32 elements is sampled, the 32 elements are processed together, the prediction is made, the 32 terms are summed in the loss function, the gradient of this is computed, the weights are updated, and another 32 elements are sampled without replacement. This is repeated until the entire dataset is consumed; once the whole dataset has been consumed, we say that a training epoch was completed. A training epoch is one full pass over the dataset during the gradient descent algorithm. When we train with mini-batches, the convergence criterion is simply to train for a fixed number of epochs, for example 20, that is, to look at the dataset 20 times (but we update w many more times, because it is updated for each mini-batch). We usually keep the model that performs best on validation. All parameters (ws) are updated together.
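The epoch and mini-batch bookkeeping can be sketched as below for a one-weight linear perceptron. The dataset (labels generated as d = 3x), the batch size of 4, the step size and the seed are assumptions chosen to keep the example small; the loss is averaged over each batch.

```python
import random


def minibatch_gd(xs, ds, batch_size=4, epochs=100, learning_rate=0.05, seed=0):
    rng = random.Random(seed)
    w, updates = 0.0, 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        # Shuffle once per epoch, then consume the dataset without replacement.
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch only.
            grad = sum(-2.0 * (ds[i] - w * xs[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
            updates += 1
    return w, updates


xs = [0.5 * i for i in range(-4, 4)]   # 8 hypothetical inputs
ds = [3.0 * x for x in xs]             # labels from the "true" weight 3
w, updates = minibatch_gd(xs, ds)
print(w, updates)
```

With 8 elements and batches of 4 there are 2 updates per epoch, so 100 epochs produce 200 weight updates.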
Batches of size 3: 3 samples/data with a single feature each. The first layer has a weight matrix of size 1x2, because each of its 2 neurons has only one input feature. The green matrix has a size of 3x2 because each column corresponds to a neuron and each row corresponds to a data point. We process the 3 data points at the same time through a single product. When we have a batch, we put it inside a matrix, perform the product and process the entire batch together. The B matrices are the biases. The biases are also weights of the network. After computing the product of X and Wh, we obtain Zh and add a bias to each of the neuron outputs. Each neuron has a single bias, which is why the bias matrix is 1x2 (one bias for neuron 1 and another for neuron 2). Once we have computed the product, we add the bias and obtain the output of the layer. If we had to apply activation functions (e.g. ReLU), we would apply them in this step.

Batches of size 4: 4 samples/data with 2 input features each. The product of the input matrix and the weight matrix is 4x3, with one column for each neuron and one row for each observation. The size of the weight matrices of the perceptron does not depend on the number of elements in the batch; it depends on the number of input features and the number of neurons. That is, we can train a perceptron with a batch size of 40 and at test time feed it only 1 data point.
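The first case (3 samples, 1 feature, 2 neurons) can be sketched as a batched layer with bias and ReLU. The weight, bias and input values are invented; the same function is then called with a single data point to show that the weight shapes do not depend on the batch size.

```python
def dense_forward(x_batch, w, b):
    """Compute ReLU(X W + b) row by row; w is (features x neurons), b has one entry per neuron."""
    outputs = []
    for row in x_batch:
        z = [
            sum(row[k] * w[k][j] for k in range(len(w))) + b[j]
            for j in range(len(w[0]))
        ]
        outputs.append([max(0.0, zj) for zj in z])  # ReLU activation
    return outputs


w_h = [[1.0, -1.0]]                # 1x2: 1 input feature, 2 neurons
b_h = [0.5, 0.5]                   # 1x2: one bias per neuron
batch = [[1.0], [2.0], [-3.0]]     # 3x1: 3 samples with 1 feature each

a_batch = dense_forward(batch, w_h, b_h)      # 3x2 output
a_single = dense_forward([[2.0]], w_h, b_h)   # 1x2 output, same weights
print(a_batch)
print(a_single)
```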
How to calculate the gradient: Automatic Differentiation / Backpropagation
We express the mathematical operation as a graph of operations. If we know how to differentiate each of the operations independently, then we can compose the gradient through the chain rule and obtain the final gradient. Backpropagation is an algorithm for obtaining gradients; it gives us the gradient of a function with respect to certain parameters. Gradient descent is the optimization algorithm, which uses backpropagation to calculate the gradient.

The derivative of the function at a given point is the slope of the tangent line to the curve at that point, and it indicates how the function varies at that point. In this case, we see whether the function grows or decreases. When h is very small we can approximate the function using the derivative. The derivative then indicates the 'sensitivity' of the function to a change in the variable. The value of the derivative tells us how the function changes and lets us understand what happens to the function locally. The derivative is the rate of change.

We need to be able to differentiate in order to run the gradient descent algorithm. In some cases, such as the maximum function, there are points where the derivative is not defined, so we take a subgradient. The maximum function is not differentiable everywhere in its domain, but a generalization of the gradient, called the subgradient, is usually used for non-differentiable functions.
For f = max(X, Y), the derivative of f with respect to X is 1 when X is greater than or equal to Y, and 0 otherwise. When we move in X, if X is greater than Y the function increases, but if Y is greater than X the function does not change.

We use the idea of the chain rule to easily compute derivatives in neural networks, because we can think of each operation as a node in a graph. The chain rule optimizes the gradient calculation because it reuses values that were already computed (e.g. the derivative of f with respect to q is used first for the derivative of f with respect to x, and then for the derivative of f with respect to y). Once the operations are structured as a graph and we have a derivative for each operation, we can reuse them and obtain a much more efficient gradient calculation. Backpropagation is a flow of chained gradients in a circuit of arithmetic operations.
Gradient flow:

EXAMPLE 1

Graph that represents the operation:

To apply the backpropagation algorithm, we need to know at which point we want to evaluate the derivative. The algorithm computes the derivative of the function at one specific point; if we want the derivative at another point, we have to run the algorithm again. So we define a point by giving values to the parameters. In this case: 2, -1, -3, -2 and -3 (colored in red). These are the values at which we evaluate the derivative. We then do the forward pass. 0.73 is the value of the function evaluated at the input we provided. We now perform the backward pass (which gives the algorithm its name). The gradients of each of the operations are computed independently. We go from back to front and multiply the local gradients (those computed at each operation of the graph) by the gradient that we are backpropagating. We apply the chain rule step by step (in green).
1. We take the derivative of the operation with respect to its input variable.
2. We evaluate the result at the incoming value (if the result is a constant, there is nothing to evaluate).
3. We multiply by the gradient we have been backpropagating.
4. Whenever we have an addition operation, we copy the gradient to both inputs, because the two derivatives are the same (the derivative of (x+y) with respect to x is equal to the derivative of (x+y) with respect to y).

Finally, we obtain the value of the gradient vector of the function with respect to each of the variables. It is a gradient vector made of numbers; there are no mathematical expressions because everything is already evaluated. The backpropagation algorithm gives us the gradient of the function evaluated at a point. It is an automatic differentiation algorithm that computes the gradient efficiently.

When the output of a node is shared, the backpropagated gradients are summed before multiplying them by the local gradient.
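The graph itself is not reproduced in the text, so the following sketch rests on an assumption: that the function is a two-input sigmoid neuron, f = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2))), with the five quoted values assigned as w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3. This assignment is consistent with the forward value of 0.73 mentioned above, but the variable names and the assignment are not confirmed by the source.

```python
import math

# Point at which the gradient is evaluated (assignment of values is assumed).
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass: evaluate each primitive operation of the graph in order.
p0 = w0 * x0          # product node
p1 = w1 * x1          # product node
s = p0 + p1 + w2      # sum node
neg = -s              # multiply by -1
e = math.exp(neg)     # exponential
t = 1.0 + e           # add a constant
f = 1.0 / t           # reciprocal, final output

# Backward pass: local gradient times the gradient flowing in from the output.
df = 1.0
dt = (-1.0 / t ** 2) * df   # d(1/t)/dt = -1/t^2
de = 1.0 * dt               # adding a constant passes the gradient through
dneg = e * de               # d(exp(u))/du = exp(u)
ds = -1.0 * dneg            # d(-s)/ds = -1
dp0 = dp1 = dw2 = ds        # addition copies the gradient to every input
dw0, dx0 = x0 * dp0, w0 * dp0   # product: multiply by the other input
dw1, dx1 = x1 * dp1, w1 * dp1

print(round(f, 2))
print([round(g, 2) for g in (dw0, dx0, dw1, dx1, dw2)])
```

Every gradient here is a plain number, because the whole computation is tied to the chosen point; a different point requires a new forward and backward pass.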

Patterns in the operation graph


Addition distributes the gradient equally to all its inputs.
The max operation sends the gradient only to the largest input.
The product operation multiplies the backpropagated gradient by the other input.
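The three patterns can be written as tiny backward rules for two-input operations. This is only an illustrative sketch: the function names are hypothetical, and `upstream` stands for the gradient being backpropagated into the node.

```python
def add_backward(upstream, x, y):
    # Addition: both inputs receive the upstream gradient unchanged.
    return upstream, upstream

def max_backward(upstream, x, y):
    # Max: only the larger input receives the gradient, the other gets 0.
    return (upstream, 0.0) if x >= y else (0.0, upstream)

def mul_backward(upstream, x, y):
    # Product: each input receives the upstream gradient times the other input.
    return y * upstream, x * upstream

print(add_backward(2.0, 3.0, -4.0))  # (2.0, 2.0)
print(max_backward(2.0, 3.0, -4.0))  # (2.0, 0.0)
print(mul_backward(2.0, 3.0, -4.0))  # (-8.0, 6.0)
```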
PyTorch
PyTorch is an open-source deep learning framework designed to be flexible and modular for research, but with the stability and support needed for use in production.
●It implements a large set of operations between tensors (multidimensional arrays that generalize vectors and matrices to any number of dimensions).
●It allows optimization through dynamic computation graphs (dynamic because the graph is built as the code is executed). Autograd is the library that implements backpropagation.
●Supports accelerated computing on GPU. The difference between GPU and CPU is enormous in terms of speed.

To be able to request a gradient from PyTorch, it is essential that everything we have built is expressed as operations between PyTorch tensors. That is, the graph must never leave PyTorch. If we break the graph by using a function that is not a PyTorch operation, the gradient is cut off, there is no backpropagation and we will not be able to calculate the gradients.

Components of PyTorch
PyTorch is structured in classes. It offers a set of components that are always needed when implementing a deep learning algorithm:
●Data: to hold data, PyTorch provides dataset and tensor objects. The dataset loads the data and, once loaded, we put it inside tensors.
●Data Loaders: once the data is in tensors, we create a data loader and give it a dataset. We iterate through the data loader and it returns one batch at a time.
●Networks: PyTorch also lets us create neural networks without coding them from scratch (for example, the product between x and w). We can ask it for dense layers (dense because each neuron is connected to all the neurons of the previous layer; these are the layers we have studied so far). When we ask PyTorch for a linear/dense layer, we pass it the number of input features and the number of output features; the activation function is then applied to the layer's output.
●Training functions: training loops can be built where a forward pass is performed with the dataset data, the cost function is computed, backpropagation is run, the weights of the network are updated and the process iterates.
●TorchScript: used to pre-compile assembled models so that they run in a much more optimized way. When a model is created and intended for deployment, it is generally compiled (optimizations are applied) so that it runs faster.
Tensors in PyTorch
With tensors in PyTorch we can:
●Perform operations in parallel on GPU.
●Distribute operations across multiple machines.
●Keep track of the operation graph that gives rise to them. We can write operation graphs from code.

Each of these variables (x, w, b, y) is a tensor.
How do we process an image with a perceptron?
To process an image with a perceptron, we can extract features from the image by hand and feed them to the perceptron. Another option is to feed the pixels directly to the perceptron as if they were features.
For NxN images, we generally take the image and vectorize it (the rows are placed one next to the other), and that vector is what goes into the perceptron.

In a 1,024x1,024 image, for example, each neuron in the first layer will have about 1 million connections/weights. With, say, 1,000 neurons we would have about one billion parameters in the first layer alone, which is not scalable.
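The arithmetic behind these figures, as a short sketch (the helper name is illustrative and biases are ignored):

```python
def dense_layer_weights(height, width, neurons):
    # Every neuron has one weight per pixel of the vectorized image.
    inputs = height * width
    return inputs, inputs * neurons

inputs, weights = dense_layer_weights(1024, 1024, 1000)
print(inputs)   # 1048576     (about 1 million weights per neuron)
print(weights)  # 1048576000  (about 1 billion weights in the first layer)
```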

So processing images with a perceptron has the following disadvantages:

●The original structure of the data is lost. In the image, the pixels of the first row and those of the second row are neighbors, but that notion is lost when the image is vectorized. The natural structure of the data is broken.
●For large images, the number of connections (weights) grows very quickly, since every neuron needs one weight per pixel.
●We don't have a clear notion of multi-scale/multi-resolution analysis (something that is generally useful in image analysis). Multi-scale analysis means looking at the data at multiple scales (e.g. an image at multiple resolutions).
Neural networks inspired by the visual system
The neurons of the early visual cortex (those closest to the retina) are organized hierarchically: the first ones react to simple patterns such as lines, and subsequent layers respond to more complex patterns by combining the activations they receive. In the proposed model, the neurons in the upper layers have a larger receptive field and are less sensitive to the position from which the stimulus comes.

Suppose we are looking at a character: each neuron in the first layer looks at a very small area of the input. That is, the value of the neuron is affected by only a part of the input; these neurons have a limited receptive field. The receptive field is the area of the input that affects the output value of the neuron.
A neuron in the last layer combines the information coming from the previous layers (in this case, the 3 previous layers), and each previous neuron looked at a specific part of the data, so the receptive field of a neuron is always larger than that of the neurons that feed it. That is, upper layers have an increasingly broad receptive field and react to increasingly complex patterns.

Deep learning models perform analysis at multiple levels of abstraction and also integrate the classifier as part of the model. The classifier is a multilayer perceptron; the rest are convolutional neurons that extract features. All of this forms a single graph of operations that can be differentiated just as if it were a small, very simple perceptron. In summary: a data point enters, features are extracted, it goes through the classifier, a prediction is made, the loss function is computed and the gradient of the loss function with respect to the model parameters is backpropagated. The classifier and the feature extractor can be trained together, which is why we call it end-to-end training.
Convolutional Neural Networks
A convolutional neural network is any network that uses a convolutional layer somewhere.

1D Convolution
A convolution is a mathematical operation performed between two functions: it is the integral of the product of a function x with a translated and reflected version of another function w. We call x the input signal, w the filter, and the result the convolved signal. We have two functions, and the result is another function. In the continuous case, we shift the filter and, at each position, take as the output value of the convolved signal the area of overlap between the two functions (which is why, in the example, it reaches its maximum when the filter covers the entire area of the input signal).

In practice the signal is not usually a continuous function; it is generally a sampled function, since we have one number at each instant of time. Now the signal x is a vector of numbers where each position is a value of the signal, and w is also a vector. In the discrete case, we sum the products between the vector w and the part of the input signal over which the filter is currently positioned (e.g. the value of the first output cell is: (1x1)+(4x2)+(-1x0)+(0x-1) = 9).
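A short sketch of the discrete case. The first four samples of `x` and the filter `w` are chosen so that the first output reproduces the 9 of the example; which factor belongs to the signal and which to the filter, as well as the remaining samples, are assumptions made for illustration. Note that the sliding product described here does not reflect the filter (strictly speaking it is a cross-correlation); the second function applies the reflection required by the mathematical definition.

```python
def sliding_dot_product(x, w):
    # Slide w over x and sum the element-wise products at each position.
    n, k = len(x), len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(n - k + 1)]

def convolve_valid(x, w):
    # Convolution in the strict sense: the filter is reflected first.
    return sliding_dot_product(x, w[::-1])

x = [1, 4, -1, 0, 2, -2]   # sampled input signal (hypothetical values)
w = [1, 2, 0, -1]          # filter (hypothetical values)

print(sliding_dot_product(x, w))  # [9, 0, 1]
print(convolve_valid(x, w))       # [-3, -2, 3]
```

In a neural network the filter values are learned, so whether the filter is reflected or not makes no practical difference.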

2D Convolution
When we work with images, the signals are two-dimensional, so instead of a single index we work with two indices. i is the image and k is the convolutional filter. We move the filter along the image and take the inner product between the filter and the part of the image that lies beneath it.
The convolutional filter is also called a kernel. The kernel is itself a small image, and we slide it over the input image. What we obtain as the convolved signal depends on what is drawn in the kernel. In the context of neural networks, the output of a convolution operation is called a feature map. Very light values represent high values, while very dark values represent low values. When we convolve the image, the highest output values appear where something in the image looks very much like what is drawn in the kernel. This happens because we are computing a dot product and, for vectors of a given length, the dot product is maximal when the two vectors are aligned. It is a way of doing template matching, that is, of looking for things in images. A feature map is the image transformed into something that represents a specific characteristic of it (e.g. an edge map).
When we perform a convolution, we can think of each output pixel as the output of a neuron parameterized by the kernel parameters. That is, the weights of our neurons are now the kernels. The convolutional filter is precisely what we are going to learn: it is initialized with random values, but since those values are the parameters of the neuron, we can learn them. If we want to learn how to extract features, we leave the randomly initialized parameters free so that they are learned during training (exactly as seen so far).
This process is part of the forward pass of the image analysis.

This is the result of convolving the image in the background with a kernel. In this case we have an edge-enhancing filter, which highlights edges oriented like the filter (kernel). The edges are what has the highest values.
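To illustrate how a kernel responds to what is "drawn" in it, here is a sketch of the sliding inner product in 2D on a tiny synthetic image. The image, the vertical-edge kernel and the function name are all made up for the example; no padding is used and the stride is 1.

```python
def correlate2d(image, kernel):
    # Slide the kernel over the image and take the inner product at each position.
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# Dark left half, bright right half: a single vertical edge in the middle.
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]

# Kernel that looks like a dark-to-bright vertical edge.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = correlate2d(image, kernel)
print(feature_map)  # [[0, 27, 27, 0], [0, 27, 27, 0]]
```

The feature map is zero over the flat regions and high only where the window straddles the edge, which is the template-matching behavior described above.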

Fully connected layers (dense) vs Convolutional layers

Each line in the following drawing is a parameter. The green circles are the inputs and the blue ones are the neurons. In the fully connected (dense) case, each neuron has as many weights as there are inputs. In the convolutional case, each neuron has as many parameters as there are elements in the filter. In the example, each neuron has 3 parameters and those 3 parameters are the same in all the neurons. We went from having 30 parameters to only 3.
We may need to process an input image that is not grayscale but RGB (3 color channels: red, green, blue). The representation of such an image is really 3 images (one for the R channel, another for G and another for B). C is the number of input channels, H is the height of the image and W is its width. The kernel convolves in space, that is, along the W and H dimensions, but in depth a kernel usually has as many channels as the data we are analyzing. The kernel does not move along the C dimension. If we are processing 2D multichannel images, we have 3D kernels, because there is an additional dimension that corresponds to the channels. To create multiple 'slices of bread', that is, multiple feature maps in the output, we have to add kernels (we added the red one in the image). The number of output feature maps equals the number of kernels we have. Each feature map corresponds to one feature extractor.

The kernel connectivity is local in space and total in depth.
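As a rough sketch of how the parameter count of a convolutional layer follows from these ideas: each kernel has one weight per input channel and per spatial position of the kernel, and there is one kernel per output feature map. The helper below is illustrative; the optional bias (one per kernel) is an assumption not discussed in the text.

```python
def conv_layer_params(in_channels, kernel_h, kernel_w, num_kernels, bias=True):
    # Local in space (kernel_h x kernel_w), total in depth (in_channels).
    per_kernel = in_channels * kernel_h * kernel_w
    return num_kernels * (per_kernel + (1 if bias else 0))

# RGB input, 3x3 kernels, 2 output feature maps.
print(conv_layer_params(3, 3, 3, 2, bias=False))  # 54
print(conv_layer_params(3, 3, 3, 2))              # 56
```

The count does not depend on the height or width of the image, unlike a dense layer.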


Additionally, convolutional layers implement the weight-sharing mechanism: all the output neurons of the same feature map share the same weights. This not only reduces the number of parameters but also allows us to introduce invariance to translation, which is a desirable property when we are doing image classification.
Non-linear activation functions (e.g., ReLU) are applied to each element individually.
Hyperparameters of a convolutional layer
When we define a convolutional layer in PyTorch, we need to specify the following hyperparameters:
●Kernel: kernel size. We usually use 3x3.
●Stride: step size, i.e. how many pixels the kernel jumps at each move. A stride of 2 greatly reduces the output dimensionality (second image).
●Padding: adding extra values around the image so that the output does not shrink and the dimensions can be maintained. The size of the padding depends on the size of the kernel: a 3x3 kernel requires a padding of 1 and a 5x5 kernel a padding of 2. We can use zero padding, where we add a border of zeros, but there is also mirror padding, where the rows and columns next to the border are copied outward as a mirror image.
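The effect of these three hyperparameters on the output size along one dimension can be checked with the usual formula, floor((n + 2*padding - kernel) / stride) + 1. The sketch below assumes the same padding on both sides; the function name is illustrative.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    # Number of positions the kernel can take along one dimension.
    return (n + 2 * padding - kernel) // stride + 1

print(conv_output_size(32, kernel=3, padding=1))            # 32: size preserved
print(conv_output_size(32, kernel=5, padding=2))            # 32: size preserved
print(conv_output_size(32, kernel=3, padding=0))            # 30: the image shrinks
print(conv_output_size(32, kernel=3, stride=2, padding=1))  # 16: stride 2 halves it
```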

Max Pooling Layer

The pooling layer takes the input and reduces its dimensionality through a grouping operation. Generally, the maximum is used as the grouping operation: we take a window and keep its maximum. In this way we drastically reduce the resolution of the feature map, so the rest of the network has fewer values to process. In addition, we introduce a more aggressive invariance to translation than with an ordinary convolution (for example, after max pooling we keep the information that there was a cat in that window, but we forget where it was). It contributes to invariance with respect to small translations in the input images. Furthermore, this layer does not add parameters to the model. Max pooling is not usually used in the input layers but rather in the middle layers.
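A minimal sketch of max pooling with non-overlapping windows (window size equal to the stride, an assumption made for simplicity). The two small feature maps are made up: the second is the first with each activation moved to a different position inside its window.

```python
def max_pool2d(fmap, size=2):
    # Keep the maximum of each non-overlapping size x size window.
    out = []
    for r in range(0, len(fmap) - size + 1, size):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, size):
            row.append(max(fmap[r + i][c + j]
                           for i in range(size) for j in range(size)))
        out.append(row)
    return out

a = [[1, 0, 0, 0],
     [0, 0, 0, 2],
     [0, 0, 3, 0],
     [0, 4, 0, 0]]

b = [[0, 1, 0, 0],
     [0, 0, 2, 0],
     [0, 0, 0, 3],
     [4, 0, 0, 0]]

print(max_pool2d(a))  # [[1, 2], [4, 3]]
print(max_pool2d(b))  # [[1, 2], [4, 3]]
```

Both inputs give the same output: the layer remembers that something was present in each window but not where, and it has no parameters to learn.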
When building a convolutional neural network, we construct something like the following:
We have an input image and the first convolutional layer. The first convolutional layer has 4 output feature maps (that is, 4 kernels), so it adds the parameters of 4 kernels to the model. Then we apply sub-sampling such as max or average pooling, which reduces the dimensionality. We still have 4 feature maps, because pooling does not mix anything; it only groups (maintaining the number of feature maps). We move on to a new convolution step, this time with 6 feature maps. Each kernel of this layer has a depth of 4 because the previous layer had 4 feature maps (the depth of a kernel depends on the number of input feature maps). In the end we have several feature maps; we vectorize them and feed them into a multilayer perceptron. This MLP is our classifier.
So, with the feature extraction and classification stages, we input an image and obtain a prediction.

Advantages of Convolutional Networks

●Naturally adapted to the regular structure of images (through the convolution operation). The notion of spatiality is not lost.
●Invariant with respect to translations.
●End-to-end learning: they allow us to learn the features and the classifier together.
●Low memory requirements, because they implement weight sharing.
●Efficient at test time.
●Good degree of generalization if trained with enough data.

Some classic architectures for classification

LeNet-5 (so named because it has 5 layers): this network uses convolutions with 5x5 kernels and a stride of 1. The model has about 60,000 parameters.
AlexNet: unlike LeNet-5, each layer now has many more feature maps. Instead of 6 feature maps in the first layer we now have 96, in the second 256, in the third 384, etc. At the end we have a perceptron with about 4,000 neurons per hidden layer. It has a total of about 60 million parameters, of which 58 million correspond to the last dense layers. That is, convolutional layers save us a huge number of parameters.

VGG: they proposed making the networks deeper. To achieve this without increasing the number of parameters, they use very small kernels (3x3). Because the kernels are so small, the filters of each layer are hard to interpret visually.

GoogLeNet: at the end, an operation called global average pooling (GAP) is performed, where the average of each feature map is taken. That is, there is one value per feature map. The output of the GAP is what is fed to the MLP. This model has only about 4 million parameters.
Cross Entropy vs Accuracy
Each column is a data point and the rows represent the probability of it belonging to each class. The model prediction is above and the label, or ground truth, is below.
First classifier: in the first case, the model did not match the true label (the most probable class in the prediction was class 3 and the correct class was class 1). In both case 2 and case 3 it got it right. If we compute the accuracy we get 2/3, and if we compute the cross-entropy we get 4.14.
Second classifier: same accuracy but much lower cross-entropy.
If we had to decide which classifier is better, we would say the second, because it assigns more probability to the correct class. So, if we use cross-entropy as the loss function to guide the learning of the network, in a case like this example it will steer us towards the second classifier. If we used accuracy, on the other hand, both models would look equally good, since it is a discrete measure.
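The comparison can be reproduced with a few lines of Python. The probabilities below are not taken from the figure, which is not available here; they are hypothetical values chosen so that both classifiers are wrong only on the first case and the first one reaches a summed cross-entropy of about 4.14 (natural logarithm).

```python
import math

def accuracy(probs, labels):
    # A prediction is correct when the most probable class is the true one.
    hits = sum(1 for p, y in zip(probs, labels) if p.index(max(p)) == y)
    return hits / len(labels)

def total_cross_entropy(probs, labels):
    # Sum over the data points of -log(probability given to the true class).
    return -sum(math.log(p[y]) for p, y in zip(probs, labels))

labels = [0, 1, 2]  # true classes 1, 2 and 3, written as 0-based indices

first = [[0.1, 0.2, 0.7], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]
second = [[0.3, 0.4, 0.3], [0.1, 0.7, 0.2], [0.1, 0.2, 0.7]]

for name, probs in (("first", first), ("second", second)):
    print(name, round(accuracy(probs, labels), 2),
          round(total_cross_entropy(probs, labels), 2))
# first 0.67 4.14
# second 0.67 1.92
```

Accuracy only looks at which class wins, so it cannot tell the two apart; cross-entropy rewards the second classifier for being more confident when right and less confident when wrong.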

Binary Classification: in the binary case the perceptron has a single output, which represents the probability of the true class. The probability of the false class is 1 minus this value. In this case the loss is called binary cross-entropy.
Multi-Class Classification: when there are several classes and each example belongs to exactly one of them, we use categorical cross-entropy. The summation is performed over all the possible categories.
Multi-Label Classification: multiple labels can be present at the same time (e.g. in a chest X-ray, a person can have more than one pathology at the same time). In this case, we can use the sigmoid as the activation function of each output, because each label is independent of the others.

Activation functions
When a layer, whether convolutional or fully connected, processes a data point, many values come out, one for each neuron, and an activation function is applied to each of them. Basically, each neuron has an activation function.
In a fully connected layer, a function σ is applied to the inner product, and σ can be any activation function. In a convolutional layer it is the same, except that the inner product is computed only over the small window covered by the kernel.
The sigmoid activation function has certain problems. If we use it in a very deep multilayer perceptron, the gradient tends to 0 when x tends to ∞ or -∞. We end up with no gradient signal, so the weights of the network do not change; in other words, the network does not learn. We compute the gradients using the chain rule and backpropagate them from the end to the beginning of the neural network. If the activations are sigmoids and very positive or very negative values start to pass through them, the local gradients we get are very small, and the product of those gradients becomes smaller and smaller. The multiplication of small gradients through the chain rule makes the gradient vanish. This is known as the vanishing gradient problem.

The ReLU (rectified linear unit) activation function was proposed as an alternative. It is the maximum between 0 and the input of the function (x). That is, everything negative is turned into 0 and everything positive is left as is. This is a non-linear function, and it has a point (x = 0) where it is not differentiable. At all positive points the function is differentiable and has a non-zero gradient, so if what flows through the network is positive, we will have a gradient signal. The gradient is less likely to vanish than with the sigmoid activation function. In any case, the gradient is 0 for negative values, which can trigger the Dying ReLU problem. This is why Leaky ReLU was proposed: it avoids the Dying ReLU problem because it also produces non-zero gradients for negative values of x. That is, a function with a small slope is used for negative numbers.

Activation functions must always be non-linear, because otherwise the universal approximation theorem does not apply: a stack of linear layers is itself a linear function, so there is no additional modeling power.
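A small numerical sketch of the three activation functions and their gradients. The slope 0.01 for Leaky ReLU and the depth of 10 layers are arbitrary choices for illustration, and the "chain" below only multiplies local activation gradients, ignoring the weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25, reached at x = 0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # no gradient at all for negative inputs

def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope  # small but non-zero for negative inputs

depth = 10
# Product of local gradients along a chain of activations (chain rule).
sigmoid_chain = sigmoid_grad(0.0) ** depth   # best case for the sigmoid
relu_chain = relu_grad(2.0) ** depth         # positive activations

print(sigmoid_grad(10.0))  # tiny gradient when the sigmoid saturates
print(sigmoid_chain)       # 0.25 ** 10, below one millionth
print(relu_chain)          # 1.0
```

Even in its most favorable regime the sigmoid shrinks the gradient by a factor of 4 per layer, whereas ReLU passes it through unchanged as long as the activations are positive.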

Overfitting and regularization

Neural networks have many parameters, which makes them very prone to overfitting. In fact, when tackling a new problem we often try to overfit on purpose: if we cannot get the network to overfit, the model is not learning. It is not bad practice to start by training the network and checking that it overfits.
The prediction error is the value of the loss function, and the iterations are grouped into epochs. The loss function is computed both on the training data and on the validation data (we do not learn from the validation data; we look at it during training, but it is not used to train). There comes a point when the loss on the training data keeps decreasing while the loss on the validation data starts to increase. That is the moment at which we say the model starts to overfit, as it is specializing in the training data and starting to perform poorly on data it hasn't seen.

In cases of overfitting, we can use regularization mechanisms. Regularization is any modification we make to the learning algorithm in order to improve generalization.

The first thing we can do is early stopping: we monitor the loss function on the validation data and, as soon as we see that it rises during several consecutive iterations, we stop the training.
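A sketch of the early stopping rule. It uses a precomputed list of validation losses (hypothetical values) instead of a real training loop, and a `patience` parameter for the number of consecutive epochs without improvement that we tolerate; both are assumptions made for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    # Returns (epoch at which training stops, epoch with the best validation loss).
    best = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

In practice one usually keeps a copy of the weights from the best epoch and restores them after stopping.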

Another technique we can use is L2 norm regularization (weight decay). When we train a neural network we are exploring a parameter space: the space of possible ws. Since in principle the ws can take any value, for a single problem there are infinitely many possible neural networks. This gives the model a high complexity. Reducing the number of parameters of a model is, in general, a way to regularize: we give the model less freedom. But this is a very aggressive way to reduce the complexity of the model. L2 norm regularization can be seen as a 'soft' way of lowering the complexity of the model, since it adds to the loss function a regularization term that is the squared Euclidean norm of the weights. That is, we take all the weights of the network, put them into one long vector, take its squared norm, and add that to the loss function. In this way, the ws are pulled towards the origin. We are telling the model that we prefer values of w that are closer to the origin; we limit the parameter space the model can explore to values near the origin. We do this because if the value of some w is very large, the model is giving a lot of importance to that feature. Through this regularization we prevent a single feature from taking all the responsibility; we distribute the responsibility. We do not do it in an explicit, hard way; we only tell the model that the weights should be close to the origin (without saying how close). It is a soft way to narrow down the parameter space, and it indirectly reduces the complexity of the model.
There are many ws that can minimize a loss function for a given training dataset. We can reduce the complexity of the model by limiting the number of possible models that can be built. One way to do this is to discourage very large wi, that is, to add a restriction on the values that the weights wi can take.

Influence of the parameter λ: λ tells us how much attention we pay to the regularization term. If λ is too large, the model underfits: it does not learn because the data term no longer receives attention. If λ is very small, there is overfitting.
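A minimal sketch of how the regularization term enters the loss and the gradient. It assumes the regularized loss has the form data_loss + λ * ||w||², so the penalty contributes 2 * λ * w to the gradient; some formulations include a factor 1/2, which only rescales λ. All names and numbers are illustrative.

```python
def l2_penalty(weights):
    # Squared Euclidean norm of all the weights, seen as one long vector.
    return sum(w * w for w in weights)

def regularized_loss(data_loss, weights, lam):
    return data_loss + lam * l2_penalty(weights)

def regularized_grad(data_grad, weights, lam):
    # The penalty adds a term proportional to each weight, pulling it towards 0.
    return [g + 2.0 * lam * w for g, w in zip(data_grad, weights)]

weights = [3.0, -4.0]
print(regularized_loss(1.0, weights, lam=0.1))         # 1.0 + 0.1 * 25
print(regularized_grad([0.0, 0.0], weights, lam=0.1))  # points away from the origin
```

Since gradient descent moves against the gradient, the extra term shrinks large weights more than small ones, which is why the technique is also known as weight decay.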

Data augmentation works very well, especially on images. During training, the dataset is artificially enlarged by applying transformations to the data that preserve the labels, e.g. rotating, zooming, shifting, changing the color or adding Gaussian noise to the images (the cat is still a cat, but for the network it is new data). These networks need a lot of data to generalize reasonably well, so data augmentation during training is one of the most commonly used techniques in practice. It is always important to use data augmentation when working with images. At test time, it is also possible to generate N versions of the test image, compute the predictions and average them.
It is not only used with images; it can also be used with time series, for example, where we can apply transformations that do not completely destroy the data.

Data augmentation mechanisms allow us to introduce invariances to the transformations we apply to the image. Convolution as an operation is not invariant to rotation; however, if we keep rotating the images to different positions, the model learns to detect regardless of how the image is rotated. Once this invariance is gained, it does not matter to the model whether the images are rotated or not: it classifies them the same.

Another widely used regularization technique is dropout, which proposes randomly turning off some neurons during training so that the model does not assign too much responsibility to any particular neuron or feature. By turning off we mean ignoring the output of the neurons. At random, with probability (1-p), neurons are ignored in the forward pass during training. We turn off neurons with a given probability in each iteration of the gradient descent algorithm: one iteration has one structure, the following iteration a new structure, and so on. This prevents the model from trusting any particular neuron too much. It is an indirect way of training many models at the same time, because we keep changing the architecture, and training many models and making them predict together generally performs better than not doing so. Dropout is applied per layer.
We only turn off neurons during training. At test time (when we are about to make predictions) all the neurons are used and the output weights of each neuron are multiplied by p. Neurons are not turned off at random.
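A sketch of dropout as described here, where p is the probability of keeping a neuron. At test time the activations are scaled by p, which has the same effect as multiplying the outgoing weights by p. The function names and the explicit random generator are illustrative choices.

```python
import random

def dropout_train(activations, p_keep, rng):
    # Each neuron is kept with probability p_keep and ignored (set to 0) otherwise.
    return [a if rng.random() < p_keep else 0.0 for a in activations]

def dropout_test(activations, p_keep):
    # No neuron is turned off; outputs are scaled to match the training average.
    return [a * p_keep for a in activations]

rng = random.Random(0)
layer_output = [1.0, 2.0, 3.0, 4.0]

print(dropout_train(layer_output, p_keep=0.5, rng=rng))  # a different mask on each call
print(dropout_test(layer_output, p_keep=0.5))            # [0.5, 1.0, 1.5, 2.0]
```

Many current implementations instead divide by p during training ("inverted dropout") so that nothing needs to change at test time; the two conventions are equivalent on average.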

Another important thing to do is to normalize the model inputs. It generally helps when we have tabular data. The mean and the standard deviation are always calculated using only the training data (if we include the test data we create data leakage), but all the data (test and validation too) are normalized.
In the context of neural networks, it is good to normalize the data in order to have a 'nice' error surface on which it is easy to reach the minimum. If the data is not normalized, the features all live on different scales, so the error surface requires very big steps in some directions and small steps in others. This is hard to do while training, and if the minimum is hard to reach, the model's performance will be worse.
In the case of images, all the features (pixels) have the same range (0-255), so in general it is not necessary to divide by the standard deviation. It is usual to use a mean and a standard deviation calculated over all the pixels of all the training images, instead of one per feature (pixel). That is, the statistic is computed at the level of the image, unlike the tabular case where we compute the statistics of each feature over the complete dataset. In a sense, we normalize by image. In multi-channel images, each channel is usually normalized independently (one mean and one standard deviation per channel).
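A sketch of normalization for one tabular feature, making explicit that the statistics come only from the training split and are then reused for the other splits. The data values and function names are made up.

```python
import statistics

def fit_standardizer(train_column):
    # Statistics are estimated on the training data only (no data leakage).
    mean = statistics.mean(train_column)
    std = statistics.pstdev(train_column)
    return mean, std

def standardize(column, mean, std):
    # The same training statistics are applied to any split.
    return [(v - mean) / std for v in column]

train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_standardizer(train)
print(mean)                          # 5.0
print(standardize(train, mean, std))
print(standardize(test, mean, std))  # test data normalized with training statistics
```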

Variants of gradient descent

Gradient descent with momentum smooths the steps taken by gradient descent in the dimensions that oscillate the most. The idea is to 'build up momentum'. Without momentum, reaching the minimum is more costly. With momentum, we gain speed and keep going in the direction of the minimum, which is a way to find the minimum faster. When we see that we keep going in the same direction, we take bigger steps to get there faster. This accelerates the convergence of the method. To do this, instead of applying the gradient directly, we compute a velocity vector: an exponentially weighted average of the gradients controlled by a factor α. α tells us how much the current vector weighs compared to the previous ones, that is, how much memory we have. In general we use α = 0.9. Momentum damps the directions in which the gradient oscillates a lot.

Varying the learning rate (step size):

●High LR: the algorithm may oscillate chaotically and never converge.
●Low LR: the algorithm may stall or take a very long time to converge.

Something that is commonly done is to anneal/schedule the learning rate, for example starting with a very high learning rate to explore the error surface in large jumps and reducing it as the iterations progress. We start with a large step and decrease it as training advances through the gradient descent iterations.
●Step decay: reduce the learning rate (e.g. divide it by 2) every fixed number of epochs, or when the validation error does not improve.
●Exponential decay: the learning rate decays following an exponential with a negative exponent.
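The two schedules can be sketched as functions of the epoch. The initial learning rate, the drop factor, the interval of 10 epochs and the decay constant k are arbitrary illustrative values.

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    # Multiply the learning rate by `drop` once every `every` epochs.
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, epoch, k=0.1):
    # Smooth decay with a negative exponent.
    return lr0 * math.exp(-k * epoch)

for epoch in (0, 9, 10, 25):
    print(epoch, step_decay(0.1, epoch), round(exponential_decay(0.1, epoch), 4))
```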

An adaptive learning rate for each parameter of the vector W is also often used. The previous methods apply the same global LR to all parameters; adaptive methods scale the LR for each parameter:
- RMSProp
- Adadelta
- Adagrad
- Adam (Momentum + RMSProp)

Neural networks beyond image classification

These types of networks can also be used to solve problems other than classification or regression. In images in particular, in the field of computer vision, we have various problems such as classification (given an image, we want to say which category it belongs to), semantic segmentation (a category is assigned to each pixel), object detection (we put a bounding box around the object and say which class the object in the bounding box belongs to) and instance segmentation (similar to semantic segmentation, but besides the class of each pixel we also want to specify which instance it belongs to).

Classical image segmentation techniques

●Thresholding → intensities are observed and thresholds are set (e.g. everything with an intensity above 128 is tumor and everything below is not tumor; but then, for example, the skull and the tumor end up with the same label).

●Region growing → a seed is placed in the tumor and the region grows until it hits the edges. Where it hits an edge, it stops growing.
●Explicit deformable models → active contours that move until they hit edges.

Convolutional neural networks for image segmentation

●Binary image segmentation (e.g., segmenting swimming pools in satellite images)
●Multi-class image segmentation (e.g. segmenting a brain into anatomical structures)

A first approach to the problem, the most basic and inefficient of all, is the sliding window: the image is cut into many patches, a classification network is trained, and each patch is passed through the network. The problem is treated as if it were a classification problem. We can think of segmentation as a classification problem at the pixel level, where each pixel is classified independently. Patches of the image are taken, and the label assigned to each patch is the label of its central pixel. In the example of the following image, the first and second pixels are of class cow and the third pixel is of class grass. It is very inefficient because it does not reuse the features shared between overlapping patches.

In 2015, a new type of architecture was introduced that solves this problem more efficiently:
Fully Convolutional Networks, that is, networks without dense (fully connected) layers. These
networks contain only convolutions and pooling. They process the image as we have seen so far, and
when we reach a low resolution level we extract as many feature maps as there are classes in the
segmentation problem (e.g. if we have 21 possible classes, we extract 21 feature maps of the same
size as the input image), and we say that each feature map holds the probability that a pixel
belongs to the corresponding class. We treat the result as a probability map: each pixel now has a
vector of probabilities, and the label we assign to it is the class with the highest probability.
That is, through convolutions we end up transforming the image into a probability matrix in which
each entry is the probability that the pixel belongs to a particular class. The probabilities for a
single pixel always sum to 1. The last feature map has as many channels as there are classes. We
compute the loss function for each pixel and average: each pixel is treated as an independent
classification problem, cross-entropy is computed for it, and the results are averaged over all
pixels.
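
As a rough sketch of the last step, the snippet below turns a stack of class score maps into per-pixel probabilities and a label map. It assumes the network output is an array of shape (classes, height, width) and that a softmax over the class axis is used to obtain probabilities; the random scores stand in for a real network.

```python
import numpy as np

def pixelwise_softmax(scores):
    """scores: (num_classes, H, W) -> probabilities that sum to 1 at every pixel."""
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(21, 8, 8))      # 21 class maps for a toy 8x8 image
probs = pixelwise_softmax(scores)
segmentation = probs.argmax(axis=0)       # label map: most probable class per pixel
```
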
UNet uses an encoder-decoder architecture for image segmentation. The input image goes through
convolution-convolution-pooling blocks (each convolution leaves the size unchanged, while each
pooling halves it) until it reaches a kind of 'bottleneck'. From there the resolution is raised
again with convolution-convolution-up-convolution blocks until the original resolution is
recovered. This creates a U-shaped path in which the resolution is first reduced and then
increased with the inverse of the operations applied before. The image is first encoded (encoder)
and then decoded (decoder).
The gray arrows in the figure are called skip connections and are used to skip parts of the
network. The network remains feed-forward, but some connections bypass layers. This helps to
preserve fine details and also lets the gradient skip layers; that is, skip connections help the
earlier layers receive a gradient that still carries signal instead of a vanished one.

To increase the resolution of a feature map, interpolation is applied.
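
A minimal sketch of how resolution is halved and then restored, assuming 2x2 max pooling on the way down and nearest-neighbour interpolation on the way up (one simple choice of interpolation; real networks may use bilinear interpolation or learned up-convolutions). The function names are illustrative.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Halve the resolution by keeping the maximum of each 2x2 block."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

def upsample_nearest_2x(feature_map):
    """Double the resolution by repeating every value along both axes."""
    return feature_map.repeat(2, axis=0).repeat(2, axis=1)

x = np.arange(16, dtype=float).reshape(4, 4)
down = max_pool_2x2(x)            # shape (2, 2)
up = upsample_nearest_2x(down)    # shape (4, 4) again
```

The upsampled map has the original size but not the original detail, which is one reason skip connections from the encoder are useful.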

By passing the image through the UNet, we obtain probability maps. In the following example, where
the image has to be classified into 3 classes, we obtain a probability map with 3 slices (like
slices of sandwich bread), where a white value means high probability and a black value means low
probability. The probabilities of each pixel sum to 1. ArgMax is applied to the probability map (we
keep the label of the class with the highest probability), and with that we reconstruct the
segmentation. We compute the per-pixel loss function (e.g. cross-entropy) by comparing the output
probability maps with the one-hot version of the ground truth, and then average over all the pixels
of the image.
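
The per-pixel loss can be sketched like this, assuming probability maps of shape (classes, height, width) and an integer label map as ground truth. The small constant inside the logarithm is only there to avoid log(0) and is an implementation assumption.

```python
import numpy as np

def pixelwise_cross_entropy(probs, target_labels):
    """probs: (C, H, W) probability maps; target_labels: (H, W) integer ground truth."""
    num_classes = probs.shape[0]
    one_hot = np.eye(num_classes)[target_labels]        # (H, W, C)
    one_hot = np.moveaxis(one_hot, -1, 0)               # (C, H, W)
    per_pixel = -(one_hot * np.log(probs + 1e-12)).sum(axis=0)
    return per_pixel.mean()                             # average over all pixels

probs = np.full((3, 2, 2), 1.0 / 3.0)      # uniform prediction over 3 classes
target = np.array([[0, 1], [2, 1]])
loss = pixelwise_cross_entropy(probs, target)
```

With a uniform prediction over 3 classes, every pixel contributes log(3) to the loss, so the average is also log(3).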
Unlike the architectures we have already seen (AlexNet, LeNet, etc.), fully convolutional networks
can be trained with images of one size and then segment images of another size, because everything
adapts to the size of the input. A common practice is therefore training by patches (especially
when the images are very large): pieces are cut out, training is done on the pieces, and at
prediction time the full volume is fed in. The mini-batches are made up of patches, not complete
images. At test time, if the network is fully convolutional, it is enough to feed in the new image
and the prediction will have the appropriate size. Otherwise, tiling can be done outside the
network, and the segmentation is then reconstructed from the tiles.

If the network does not have any fully connected (dense) layer, it is said to be fully
convolutional. Dense layers can be implemented as convolutional layers. The great advantage of a
fully convolutional network is that it can process images of any size, and can be trained and
evaluated on images of different sizes.
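
To illustrate why a dense layer can be written as a convolution, the sketch below applies the same weights in both ways. It assumes a single dense unit over a flattened 4x4 input and a plain 'valid' cross-correlation (the operation CNN layers compute); all names are illustrative.

```python
import numpy as np

def valid_conv2d(image, kernel):
    """Plain 'valid' 2D cross-correlation, as used in CNN layers."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4))       # one dense unit over a flattened 4x4 input
small = rng.normal(size=(4, 4))
large = rng.normal(size=(6, 6))

dense_output = weights.ravel() @ small.ravel()   # dense layer on the 4x4 input
conv_small = valid_conv2d(small, weights)        # same weights used as a 4x4 kernel
conv_large = valid_conv2d(large, weights)        # larger input -> 3x3 map of outputs
```

On the 4x4 input both versions give the same number; on the 6x6 input the dense version would not accept the image at all, while the convolutional version simply produces a larger output.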

Basic architecture for localization


Given the following image, we must indicate where the dog is by drawing a box around it. The
dataset consists of photos of dogs, the box is the ground truth, and the architecture can be any
classification architecture, except that instead of only classifying we add a few extra outputs
(continuous values that we handle separately): the central coordinate (x, y) and the size of the
box. So, when we define the loss function for the model, it has 2 terms: a classification term
(cross-entropy on the label) and a regression term that compares the predicted box with the ground
truth (box coordinates and box size).
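
A two-term loss of this kind might look as follows. This is only a sketch: the notes do not say which regression loss is used, so mean squared error is assumed here, and the `box_weight` factor that balances the two terms is a hypothetical parameter.

```python
import numpy as np

def localization_loss(class_probs, true_class, pred_box, true_box, box_weight=1.0):
    """Cross-entropy on the class plus mean squared error on (x, y, width, height)."""
    classification_term = -np.log(class_probs[true_class] + 1e-12)
    regression_term = np.mean((np.asarray(pred_box) - np.asarray(true_box)) ** 2)
    return classification_term + box_weight * regression_term

loss = localization_loss(
    class_probs=np.array([0.1, 0.9]),   # e.g. [cat, dog]
    true_class=1,
    pred_box=[0.5, 0.5, 0.2, 0.2],      # predicted centre (x, y) and box size
    true_box=[0.5, 0.5, 0.2, 0.2],      # ground-truth box
)
```
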

Convolutional neural networks for time series processing


We are going to use neural networks to process time series. We may want to solve different
types of problems, such as predicting future events (e.g. dollar value/average temperature of AR in
1 year) or classifying events in a time series (e.g. deciding whether an electrocardiogram contains
an abnormal event or not). Both problems can be tackled with neural networks.
In a time series classifier, we use convolutional networks just as we have seen them so far, but
applying 1D convolutions (over vectors). We take our input vector, perform a convolution, and
obtain feature maps (which are no longer slices of sandwich bread but vectors). We can apply
pooling, more convolutions, etc. One problem with adopting a common architecture (e.g. LeNet) is
that the inputs always have to be the same size, so if we want to process time series of different
lengths we would in principle have to trim them or do something similar. If we remove the dense
part, we still need to arrive at a classification. This is usually done with a global pooling layer
at the end, which reduces each feature map to a single value (in the image, 4 feature maps arrive,
and applying global max pooling gives a vector of 4 elements). The size of the output vector of the
global pooling is determined by the number of feature maps that arrive, not by their length. This
is a way to convert something of arbitrary length into something that always has the same length,
which we can then connect to the perceptron to make the final classification.
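
The effect of global pooling on series of different lengths can be sketched as below, assuming 4 random filters of width 3 in place of learned ones and global max pooling as the reduction.

```python
import numpy as np

def conv1d_features(signal, kernels):
    """Apply each 1D kernel ('valid' mode) -> array of shape (num_kernels, length_out)."""
    return np.stack([np.correlate(signal, k, mode="valid") for k in kernels])

def global_max_pool(feature_maps):
    """Collapse each feature map to a single number."""
    return feature_maps.max(axis=1)

rng = np.random.default_rng(0)
kernels = rng.normal(size=(4, 3))               # 4 filters of width 3
short_series = rng.normal(size=50)
long_series = rng.normal(size=500)

short_vec = global_max_pool(conv1d_features(short_series, kernels))
long_vec = global_max_pool(conv1d_features(long_series, kernels))
```

Both vectors have 4 elements, one per feature map, even though the two series differ in length by a factor of ten.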

For prediction in time series, we can work with a perceptron and view our data simply as rows of
a table. We can have, for example, 4 features (e.g. data from the 4 previous days) and try to
predict the 5th value (day 5). The time series is split into pieces and a table is built where the
ground truth is tomorrow's value and the previous days are used as input features.
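
Building that table from a series can be sketched as follows; the helper name `make_windows` and the toy values are illustrative.

```python
import numpy as np

def make_windows(series, num_inputs):
    """Rows of num_inputs consecutive values; the target is the value that follows."""
    rows = [series[i:i + num_inputs] for i in range(len(series) - num_inputs)]
    targets = series[num_inputs:]
    return np.array(rows), np.array(targets)

series = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0])
X, y = make_windows(series, num_inputs=4)   # X: one row per window, y: next-day value
```
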

The interesting thing about using convolutions to predict time series is that it is not necessary
to cut the data at test time, because a fully convolutional network adapts the size of the output
to the size of the input. We could train a fully convolutional network to take 4 data points and
produce 1 (the prediction for tomorrow), and then feed it the entire dataset at test time. Being
fully convolutional, the network will output a value for each of the windows that were processed.
In any case, convolutional networks are not widely used to process time series. If we implement a
convolutional network to traverse a time series and make future predictions, the convolutions have
to be causal (the data entering the network must be prior to what we are predicting).
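
One common way to make a 1D convolution causal is to pad only on the left, so that the output at time t never sees later values. The sketch below assumes zero padding and a fixed averaging kernel instead of learned weights.

```python
import numpy as np

def causal_conv1d(signal, kernel):
    """Output at time t depends only on signal[t - k + 1 .. t] (left zero-padding)."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), signal])
    return np.array([padded[t:t + k] @ kernel for t in range(len(signal))])

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([0.5, 0.5])            # average of previous and current value
output = causal_conv1d(signal, kernel)

# Changing a future value must not change earlier outputs
changed = signal.copy()
changed[4] = 100.0
output_changed = causal_conv1d(changed, kernel)
```

Only the last output differs between `output` and `output_changed`, which is the property that makes the layer safe to use for forecasting.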
If we have to make several predictions from a time series, we can include the newly predicted
value as input, discard the oldest day used so far, and predict a new day again. However, doing
this greatly reduces accuracy with each step we take forward in time. The predictions become less
reliable because we are feeding in data that we ourselves predicted. In a model that generates
music (we give the model a song and it starts generating what follows), this would not be a
problem.
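
The feed-the-prediction-back loop can be sketched like this. The `mean_predictor` function is a hypothetical stand-in for a trained model; any function that maps a window to the next value would fit.

```python
import numpy as np

def rollout(predict_next, history, num_steps):
    """Repeatedly predict one step, append it, and drop the oldest input value."""
    window = list(history)
    predictions = []
    for _ in range(num_steps):
        next_value = predict_next(np.array(window))
        predictions.append(next_value)
        window = window[1:] + [next_value]   # slide the window forward
    return predictions

def mean_predictor(window):
    """Hypothetical stand-in for a trained model: predicts the window mean."""
    return float(window.mean())

forecast = rollout(mean_predictor, history=[1.0, 2.0, 3.0, 4.0], num_steps=3)
```

After a few steps the window contains mostly predicted values rather than observations, which is why errors accumulate.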

Recurrent neural networks are specifically designed to process time series (sequential data).
These networks have loops, meaning that at every time instant they consume a data point and update
a memory. We can therefore process data of any length, because we keep consuming data, updating
the memory, and at some point making the prediction. Just as fully convolutional networks can
scale to any image size, recurrent networks can in principle process sequences of any length. The
key to recurrent networks is that the weights are shared over time. Just as a convolutional network
has weights that are shared across space, these networks reuse the same weights at every time
step. We always use the same weights to process the whole sequence, but we keep updating the
memory.

With this, problems of different types can be solved:


One to one → a single input and a single output
One to many → consumes a single piece of data but produces many (e.g. image captioning, where
we go from an image to a sequence of words)
Many to one → several inputs and a single output, which is the label (e.g. sentiment analysis,
where we have a sequence of words and want to decide whether the text is aggressive or not)
Many to many → multiple inputs and multiple outputs, where the number of inputs is not the same
as the number of outputs (e.g. text translation, where a sequence of words is turned into
another sequence of words)
Many to many → multiple inputs and multiple outputs, with the same number of inputs and
outputs (e.g. classifying actions in a video: we say what happens in each frame of the video)
h is what we call the hidden state of the neuron. At an instant t, we consume one time step of
the data, x_t, and process it with a function that takes the data and the state from the previous
step. Combining the memory and the data generates the state at the current time. The function f is
parameterized by weights W (which are always the same for all time steps). The inputs of f are the
data x_t and the state at the previous time, h_{t-1}. x_t is multiplied by one weight matrix, while
h_{t-1} is multiplied by another weight matrix. Both products are summed and the result is passed
through an activation function (tanh, which ranges from -1 to 1), which gives the updated state at
the current time. Finally, whenever we want to make a prediction, it is obtained by multiplying the
state by another weight matrix. The weights of the recurrent network are therefore these 3
matrices: the matrix associated with the input, the matrix associated with the state, and the
matrix associated with the output (the one that multiplies the state to produce the output label).
The shapes of these matrices control the size of the output, of the state, etc. This is the most
basic recurrent model we can think of.
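
The update just described can be written as a short sketch. The names `W_xh`, `W_hh` and `W_hy` for the three matrices are notation chosen here, bias terms are omitted for brevity, and the random weights stand in for learned ones.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh):
    """h_t = tanh(W_xh x_t + W_hh h_{t-1})"""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

def rnn_forward(inputs, h0, W_xh, W_hh, W_hy):
    """Consume the sequence one step at a time, reusing the same weights."""
    h = h0
    outputs = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_xh, W_hh)   # update the memory
        outputs.append(W_hy @ h)           # prediction read out from the state
    return np.array(outputs), h

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 5, 2
W_xh = rng.normal(size=(hidden_size, input_size)) * 0.1
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1
W_hy = rng.normal(size=(output_size, hidden_size)) * 0.1

sequence = rng.normal(size=(7, input_size))      # 7 time steps
outputs, final_state = rnn_forward(sequence, np.zeros(hidden_size), W_xh, W_hh, W_hy)
```

The same three matrices are used at every step, so the function works unchanged for a sequence of 7 steps or 700.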

The state is a vector of numbers that is initialized to some value and is then updated by
mixing information from the data and the previous state through matrix products. The state is not
the prediction.

The distinctive property of these networks compared to the ones we have seen is that the weights
are shared over time. The state h starts at its initial value, the first data point enters, and we
apply the function f. The function f returns an updated state (a vector of updated values). This
vector enters the function again together with data point 2, but the weights of the function are
the same ones we applied before. The W matrices are the same at every time step within a single
gradient descent step.

Once we have produced the outputs, we use them to compute the loss function. We compute the
loss for each of the outputs and backpropagate it as we have seen so far. We call this
backpropagation through time because, although the algorithm is the same, the computation graph is
unrolled over time. When we take the derivative, we take the derivative of L (the loss function)
with respect to W.
Open problems in DL and ML
How to generalize to multiple data domains?
Often, when there is a domain change between the data we use for training and the data we
encounter in real life, models do not work as well. In these cases there are certain techniques
that can be applied. This relates to the robustness of the models we train and is often the key
difference between a successful model in production and an unsuccessful one.

When we train a model to solve a task (e.g. classifying images into different categories), it
may happen that we train on images from one domain (e.g. images taken from Amazon) and then test
on images that come from a different domain (e.g. images with a different kind of background).
A domain change is a change in the distribution of the data. In a problem with domain changes, the
task we want to solve stays the same (e.g. we want to predict the same classes); what changes is
the distribution of the data. Domain adaptation techniques allow models to be adapted from one
domain to another.

● Multi-site issues: a model works well in one hospital but not in another because a different
device is used to acquire the medical images (the distribution of the pixel values changes
slightly).

If we have labels in all domains: Supervised domain adaptation


Solution 1: train with data from all sites (if there are annotations for all of them). In
general, by doing this we not only avoid losing performance but actually gain it, because these
networks need to learn to extract features, and for that it helps to see a lot of diversity in the
data.
Solution 2: use transfer learning through fine-tuning. We transfer knowledge from one model to
another. That is, we take the model that was trained with data from the source domain and transfer
the learned weights W to a new model that will be trained with data from the target domain. We use
this W to initialize the new model and then train it on the target domain. By initializing W with
the values of a model already trained to solve the task at hand, we are very likely to start from a
much better place on the error surface and to converge to a better place as well. These weights are
already good feature extractors and can be used to initialize a new model and continue training
from there. If the architectures differ, we cannot transfer the weights.
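
The initialization step of fine-tuning can be sketched with plain dictionaries of arrays. This is an illustration only: the layer names and shapes are hypothetical, and copying just the layers whose shapes match (while leaving the rest freshly initialized) is one possible policy, not something prescribed by these notes.

```python
import numpy as np

def init_from_pretrained(new_params, pretrained_params):
    """Copy every pretrained array whose name and shape match the new model."""
    initialized = {}
    for name, value in new_params.items():
        source = pretrained_params.get(name)
        if source is not None and source.shape == value.shape:
            initialized[name] = source.copy()    # transferred weights
        else:
            initialized[name] = value            # keep the fresh initialization
    return initialized

rng = np.random.default_rng(0)
pretrained = {"conv1": rng.normal(size=(8, 3, 3)), "head": rng.normal(size=(10, 8))}
fresh = {"conv1": np.zeros((8, 3, 3)), "head": np.zeros((2, 8))}   # new task: 2 classes
params = init_from_pretrained(fresh, pretrained)
```

Training on the target domain would then continue from `params` instead of from a random starting point.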

If we have no labels in the target domain: Unsupervised domain adaptation


In this case, there is data in the target domain but there are no labels. One thing that can be
done is to use some form of self-supervised representation learning: we learn a representation
with the unlabeled data and then fine-tune with the data that does have labels.

Ideas to solve it:


1. Map the distribution of data from site 1 to site 2, and retrain our model
The data from one domain is transformed into the other domain. If we have a model that works very
well in the source domain (separating the dataset into class 1 and class 2) and the data arriving
from the target domain has a distribution shift, we can learn a function that transforms the data
from one domain to the other, so that we can either use the classifier we have already trained or
retrain the model with the transferred data. We learn mapping functions.
For example, the source domain is daytime images and the target domain is nighttime images. We
could train a model that transforms daytime images into nighttime images. Then, if our target
domain is nighttime images, we transform all the daytime images into nighttime images (we keep the
labels because we already had them) and train the model on nighttime images.
2. Learning domain-invariant features
We force the features that the model uses to solve the problem to be independent of the domain
changes that occur. One way to achieve this is through adversarial learning.
We have a network with a feature extraction part (green), which consists of the convolutional
layers, and a classifier part (blue), which is the multilayer perceptron. Even if there is no
labeled data in the target domain, we do know which data comes from one domain and which comes
from the other. So we train a domain classifier network (pink) that learns to tell which domain
the data comes from. Instead of deciding from the image itself, it decides from the feature maps
generated by the feature extraction part: we use the features to determine whether the data comes
from one domain or the other. The network has 2 heads: one that predicts the class and another
that says which domain the image comes from. We train the model so that it classifies well and, at
the same time, fools the domain classifier so that it cannot distinguish which domain the data
comes from. That is, we force the learned features to be useless for distinguishing the domain.
We enforce invariance in the learned features.
The classification part can only be trained with data from the source domain because it needs
the class label. However, the domain classifier can be trained with data from both domains,
because we do not need class labels to compute its loss function. We train this iteratively: we
pass images from the source domain to train the upper part, and from time to time we add images
from both domains (without class labels) to train the network to distinguish which domain they
come from. Across iterations, sometimes we do one thing and sometimes the other.
The domain classifier adds a second term to the loss function, related to classifying which
domain the image comes from. The interesting thing is that it does so by looking at the features,
not at the image. So we have a loss function with 2 terms: we try to maximize the performance of
the class classifier while minimizing the performance of the domain classifier, that is, the
feature extractor minimizes the classification loss and maximizes the domain classification loss.
The goal is to fool the domain classifier. By fooling it, the representation we are learning
becomes invariant to the domain. This helps to better classify images from the target domain.
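
One common way to write the two-term objective seen by the feature extractor is the classification loss minus a weighted domain loss. The sketch below assumes that formulation; the weighting factor `lam` and the example probabilities are hypothetical, and the domain head itself would be trained separately to minimize its own loss.

```python
import numpy as np

def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct label."""
    return -np.log(probs[label] + 1e-12)

def feature_extractor_objective(class_probs, class_label, domain_probs, domain_label, lam=1.0):
    """Minimized by the feature extractor: classify well, but fool the domain head."""
    class_loss = cross_entropy(class_probs, class_label)
    domain_loss = cross_entropy(domain_probs, domain_label)
    return class_loss - lam * domain_loss

# Same class prediction; the domain head is either confused or confident
confused = feature_extractor_objective(np.array([0.9, 0.1]), 0, np.array([0.5, 0.5]), 0)
exposed = feature_extractor_objective(np.array([0.9, 0.1]), 0, np.array([0.99, 0.01]), 0)
```

The objective is lower when the domain head is confused, so minimizing it pushes the features towards being uninformative about the domain.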
How to interpret the trained models?
What mechanisms do we have to ask the network why it is making the decisions it is making?

Saliency maps through occlusion


A saliency map is a kind of heat map over the image that indicates where the attention is in
that image. In this case, we want to generate maps that indicate where the network looks to decide
that the image belongs to a given class. The simplest known technique for this is occlusion
(covering a part of the image). We systematically cover parts of the image (we move a gray square
across the image) and record the probability of the class we are interested in each time. The
recorded probability values are displayed as a heat map. This technique is agnostic to the
architecture.
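
The occlusion procedure can be sketched as follows. Because the technique only needs the model's output probability, the model is passed in as a plain function; `toy_model` is a hypothetical stand-in for a trained network, and the patch size and gray value are arbitrary choices.

```python
import numpy as np

def occlusion_map(image, predict_prob, patch=2, fill=0.5):
    """Slide a gray square over the image and record the class probability each time."""
    height, width = image.shape
    heat = np.zeros((height // patch, width // patch))
    for i in range(0, height - patch + 1, patch):
        for j in range(0, width - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = fill
            heat[i // patch, j // patch] = predict_prob(occluded)
    return heat

def toy_model(image):
    """Hypothetical stand-in for a network: 'probability' driven by the top-left 2x2 block."""
    return float(image[:2, :2].mean())

image = np.ones((4, 4))
heat = occlusion_map(image, toy_model)
```

Low values in `heat` mark the regions whose occlusion hurts the prediction most, which are the regions the model relies on.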

Saliency maps through backpropagation


We perform the forward pass of the network and then compute the gradient of the score of the
class of interest with respect to the input pixels. This derivative has as many components as
there are pixels in the input. It indicates how the prediction changes when we slightly change the
value of each pixel. For example, we take the prediction for dog and compute the derivative of the
dog score with respect to the image x.
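
To make the idea of "derivative of the score with respect to each pixel" concrete without a deep learning framework, the sketch below approximates that gradient with finite differences. This is a stand-in for backpropagation, which computes the same quantity far more efficiently; the linear `dog_score` function is a hypothetical model.

```python
import numpy as np

def saliency_finite_differences(score_fn, image, eps=1e-4):
    """Approximate d(score)/d(pixel) for every pixel with central differences."""
    grad = np.zeros_like(image)
    for idx in np.ndindex(image.shape):
        plus, minus = image.copy(), image.copy()
        plus[idx] += eps
        minus[idx] -= eps
        grad[idx] = (score_fn(plus) - score_fn(minus)) / (2 * eps)
    return np.abs(grad)      # saliency: magnitude of the sensitivity

weights = np.array([[0.0, 2.0], [-3.0, 0.5]])

def dog_score(image):
    """Hypothetical class score: a linear function of the pixels."""
    return float((weights * image).sum())

saliency = saliency_finite_differences(dog_score, np.zeros((2, 2)))
```

For this linear score the saliency of each pixel is simply the magnitude of its weight, so the pixel with weight -3 is the most salient.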

Class activation maps


They also produce heat maps, which are used to verify that the network is making predictions
based on things that make sense and is not, for example, predicting that there is a dog by looking
at the upper corner where there is nothing.
We take the average of each feature map (global average pooling, GAP) and connect the result to a
layer with as many outputs as the model has classes. To build a class activation map, we take the
weighted sum of the feature maps (before applying GAP), weighting each one by the weight that
connects its GAP value to the class we are interested in. That is, we take the feature maps,
multiply each one by its corresponding weight, and add them up to reconstruct the activation map.
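
The construction can be sketched as below, assuming feature maps of shape (maps, height, width), a final dense layer without bias, and random values in place of a trained network.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """feature_maps: (K, H, W) before GAP; class_weights: (K,) dense weights for one class."""
    return np.tensordot(class_weights, feature_maps, axes=1)   # weighted sum -> (H, W)

rng = np.random.default_rng(0)
feature_maps = rng.random(size=(4, 7, 7))        # output of the last convolutional layer
dense_weights = rng.normal(size=(3, 4))          # (num_classes, num_feature_maps)

gap = feature_maps.mean(axis=(1, 2))             # global average pooling -> (4,)
class_scores = dense_weights @ gap               # one score per class
cam = class_activation_map(feature_maps, dense_weights[1])   # map for class 1
```

Under these assumptions the spatial average of the map equals the class score, so the map shows how each location contributes to that score.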

How to avoid bias in trained models?


We talk about bias when the model performs better in one population than in another.
Algorithmic bias is a systematic reproduction of errors by models in certain population groups
(e.g. lower performance in one population than in another).

Biases in AI models
● Gender bias in text translation → the training data probably contains an over-representation
of men.
● Geographical distribution bias in image classification (e.g. bride or costume) → when the
classification model is shown Western brides, it classifies them as brides. When it is shown
Indian brides, it classifies them as costumes.
● Bias in face recognition → facial recognition models perform better for people with fair skin
than for people with dark skin.
● Biased data → the bias in the data often reflects imbalances in institutional infrastructure
and social power relations.

Protected attributes: variables with respect to which we want to guard against algorithmic bias,
such as gender, ethnicity, country of origin, age, weight, skin color, etc. We do not want the
model to be biased with respect to these attributes.
For example, in an algorithm that recommends whether or not to grant credit to a person, gender
could be a protected variable. We would not want to grant or deny credit to someone based on
gender.

How do we know if our model is fair?


Demographic parity: the classification must be independent of the protected attribute (e.g. the
probability of granting credit to a female individual is the same as the probability of granting
credit to a male individual).
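
Demographic parity can be checked by comparing the rate of positive predictions across groups. The sketch below assumes binary predictions and a binary protected attribute; the data is invented for illustration.

```python
import numpy as np

def demographic_parity_gap(predictions, protected):
    """Absolute difference in positive-prediction rate between the two groups."""
    predictions = np.asarray(predictions)
    protected = np.asarray(protected)
    rate_group_0 = predictions[protected == 0].mean()
    rate_group_1 = predictions[protected == 1].mean()
    return abs(rate_group_0 - rate_group_1)

# 1 = credit granted; protected attribute encoded as 0 / 1 (illustrative data)
predictions = [1, 0, 1, 1, 0, 1, 0, 0]
protected   = [0, 0, 0, 0, 1, 1, 1, 1]
gap = demographic_parity_gap(predictions, protected)
```

A gap of 0 means both groups receive credit at the same rate; here one group is approved 75% of the time and the other 25%, so the model would not satisfy demographic parity.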
