Understanding Deep Learning Basics
Machine learning is one strategy for achieving artificial intelligence. It is the study and
construction of algorithms that can learn from data and make predictions about it. If there is data
involved, it is machine learning. In practice, machine learning solves 'narrow' artificial
intelligence problems (Narrow AI): the models can solve one problem very well but not a different
one. Today we have models that are very good at a single task; if we take them outside that task,
the model does not know what to do and performs poorly. Image segmentation and classification,
voice recognition, and prediction from tabular data are all problems that can be solved very well
with ML. In machine learning (also called automatic learning), the available data is used to train
a model that makes predictions on unseen data.
There are different paradigms within machine learning. In Supervised Learning there is a dataset
with labels that supervise the learning process, and the trained model is then used to predict on
new data. Unsupervised Learning algorithms try to learn the structure of the data. Semi-Supervised
Learning algorithms use both labeled and unlabeled data during learning: the unlabeled data helps
to learn representations of the data, which are then used to make more accurate predictions (more
is known about how the data on which predictions are made is distributed). Reinforcement Learning
is learning from interaction with the environment: an agent interacts with the environment by
performing actions, and when it performs an action well it is given a reward. It is not supervised
in the sense above, because there are no labels.
There are two types of machine learning models: discriminative models and generative models.
Discriminative models learn a conditional probability given the data, P(Y|X): the probability that
a given data point belongs to a certain class. A predictive model is learned whose aim is to
discriminate, separating the data into, for example, 2 sets. Generative models, on the other hand,
seek to generate data instead of classifying it; they create things that make sense. In this case
the model learns the distribution of the data, that is, P(X).
We will be interested in solving supervised discriminative problems (e.g. image classification).
Image Classification
The model predicts, for example, the probability that there is a cat in a given image.
In Python, images are matrices of numbers (intensities). In general the values range from 0 to 255,
or from 0 to 1 when normalized. A color image has 3 RGB channels, like 3 "slices of bread" one
after the other: 3 matrices of the same size, each representing a different color channel.
Previously, features were extracted from images manually through hand-written code routines. Once
the relevant features for the problem had been established, a classifier could say whether what is
in the image is, for example, a dachshund or not.
EXAMPLE: given the image of a bicycle, other representations are generated through feature
extraction routines (e.g. the average color of the image, the standard deviation of its colors, or
the average color of each quadrant after dividing the image into quadrants). The features most
relevant to the problem are selected and fed into a classifier, which makes the prediction.
The problem was that the features were designed by hand, which made computer vision problems
difficult to solve.
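The hand-crafted routines mentioned in the example (average, standard deviation, average per quadrant) can be sketched as follows. This is only an illustration under simplifying assumptions: the image is grayscale rather than color, it is stored as a list of rows, and the name `extract_features` and the 4x4 toy image are hypothetical.

```python
from statistics import mean, pstdev

def extract_features(image):
    """Return hand-crafted features for a grayscale image given as a list of rows."""
    height, width = len(image), len(image[0])
    pixels = [value for row in image for value in row]
    # Global statistics: average intensity and standard deviation.
    features = [mean(pixels), pstdev(pixels)]
    # Split the image into four quadrants and record the average of each one.
    half_h, half_w = height // 2, width // 2
    for row_range in (range(0, half_h), range(half_h, height)):
        for col_range in (range(0, half_w), range(half_w, width)):
            quadrant = [image[r][c] for r in row_range for c in col_range]
            features.append(mean(quadrant))
    return features

# Hypothetical 4x4 grayscale image with intensities in [0, 255].
toy_image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [100, 100, 50, 50],
    [100, 100, 50, 50],
]
features = extract_features(toy_image)
print(features)
```

The resulting feature vector (6 numbers here) is what would be handed to a classical classifier; deciding which numbers to compute is the manual step that deep learning replaces.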
Deep Learning is a technique for implementing machine learning. Deep learning models learn
representations of the training data at multiple levels of abstraction (layers), composing simple
modules that successively transform each representation into one with a higher level of
abstraction. The modules are very simple, but when they are stacked one after another in different
layers (levels of abstraction), they begin to specialize and learn features that are useful for
solving the problem in question. The first layers specialize in low-level features (e.g. in images,
edges and corners); subsequent layers combine those features to build more complex concepts, until
the last layers represent very complex concepts built on the simpler ones. The model is trained
'end to end': there are no longer decoupled stages, and everything is learned together. The
fundamental component is the artificial neuron.
Multi-Layer Perceptron
A single neuron can model only a limited set of problems; to model more complex problems we need
to combine neurons. Combining them gives multilayer perceptrons, where many neurons are placed in
parallel: they take the inputs, process them and return an output, and this output is fed into a
new layer that processes what came out of the previous one. These are deep models (because they
have several layers). The typical deep model, or at least the most basic one, has an input layer,
a hidden layer and an output layer.
How do we learn the weights of the neural network?
With a loss function. For example, in a regression problem the mean squared error could be the
loss function. The loss function compares the output of the network with the ground truth (the
labels), and it can take many forms.
We are going to learn w with an optimization algorithm called gradient descent.
Gradient Descent
With gradient descent we picture the loss surface, that is, the value of the cost function on the
training data. The loss function is computed over the training data and, for a given value of w1
and w2 (with more features there are more weights), we obtain a loss value.
To move from point A to point B we use the derivative, which tells us in which direction to walk so
that the function changes as much as possible. Extending the derivative to functions of several
variables gives the gradient, which is the vector of partial derivatives. In this case they are the
derivatives of the loss function with respect to each of the parameters of the neuron (the ws). The
gradient vector points in the direction in which the function grows fastest; since we are looking
for the direction in which the function decreases fastest, we take the negative of the gradient.
How to calculate the gradient
●Analytical derivation: derive by hand and write the code. An expression is written for the form
the derivative of the function takes. This is easy to do for one neuron, but not for more than one.
●Numerical differentiation: finite differences. The value of the function is computed at two points
and the slope of the line through them is taken. These methods have many stability problems and are
expensive.
●Symbolic derivation: carried out using the rules studied in Mathematical Analysis, but automated
(e.g. Maple, Mathematica). It returns an analytical expression that can be very ugly.
●Automatic derivation: derivatives are defined for 'primitive' operations such as a sum or a
product (mathematical and control operations). A graph of operations is built and differentiated
following the chain rule. There are frameworks that implement these automatic differentiation
algorithms (TensorFlow, PyTorch).
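To make the difference between the first two options concrete, the sketch below compares a hand-derived derivative with a finite-difference approximation. The function `f`, the evaluation point and the step `h` are arbitrary choices for illustration, and a central difference is used as one possible way of "computing the function at two points".

```python
def f(w):
    # Toy function chosen only for illustration.
    return w ** 2 + 3 * w

def analytical_derivative(w):
    # Derived by hand: d/dw (w^2 + 3w) = 2w + 3.
    return 2 * w + 3

def numerical_derivative(func, w, h=1e-5):
    # Central finite difference: slope of the line through two nearby points.
    return (func(w + h) - func(w - h)) / (2 * h)

w0 = 2.0
exact = analytical_derivative(w0)
approx = numerical_derivative(f, w0)
print(exact, approx)
```

The numerical version needs two function evaluations per parameter, which is why it becomes expensive for networks with many weights; it remains useful as a check on gradients obtained some other way.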
What is a neuron? → The most basic operation that a neuron represents is the inner product between
the input and the weights.
Activation Functions
●Sign function: we pass z (the result of the summation) through the sign function. If z < 0 the
function is worth -1, and if z > 0 it is worth 1.
●Sigmoid function: maps any real number into the range (0,1). It can be useful to represent, for
example, a probability (e.g. in a binary classifier). Logistic regression, for example, is a simple
perceptron with a sigmoid activation function.
x1          x2          x1 OR x2
0 (false)   0 (false)   0 (false)
1 (true)    0 (false)   1
0           1           1
1           1           1
The dataset has 4 elements, that is, the 4 possible combinations of truth values for the problem.
The idea is to draw a line that acts as a decision boundary, leaving on one side the combinations
that give 1 and on the other side the combinations that give -1. Only the false/false combination
gives a negative (false) output. The true and false classes are thus separated.
In the decision boundary we add the bias term (w0) so that the line can be shifted away from the
origin.
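A minimal sketch of a single neuron separating the four truth-value combinations with a line. The weights (1, 1) and the bias -0.5 are hand-picked for illustration, not learned, and mapping z = 0 to -1 is an assumption, since the sign function is only described for z < 0 and z > 0.

```python
def sign(z):
    # Sign activation; z == 0 is mapped to -1 here by assumption.
    return 1 if z > 0 else -1

def neuron(x, w, w0):
    # Inner product between input and weights, plus the bias term w0.
    z = sum(xi * wi for xi, wi in zip(x, w)) + w0
    return sign(z)

weights = [1.0, 1.0]   # hand-picked, hypothetical values
bias = -0.5            # shifts the line x1 + x2 = 0.5 away from the origin

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
predictions = [neuron(x, weights, bias) for x in inputs]
print(predictions)
```

Without the bias the boundary x1 + x2 = 0 would pass through the origin, exactly where the false/false point sits, so that point could not be placed strictly on the negative side.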
w is initialized randomly; then steps are taken in the direction opposite to the gradient,
recomputing the gradient at each step and updating the weights. This method only guarantees
convergence to a local minimum. If the function is convex, the local minimum is also the global
minimum. The squared error used as the loss function of the simple perceptron in the linear case is
a convex function. That is, if we train a model by optimizing this perceptron, we are guaranteed
to find the global minimum. In general, for more complex models, we do not have this guarantee.
Algorithm: the weights are initialized randomly and, given the dataset, we update w (the value of
the weights) as the value w has at that moment minus the gradient multiplied by the step size (the
learning rate). The step size scales the gradient so that the jumps are not too large.
EXAMPLE: gradient calculation in the linear case of a perceptron
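As a hedged illustration of the gradient calculation in the linear case, the sketch below assumes a linear perceptron with prediction w·x + w0 and the mean squared error as loss, so that the partial derivative with respect to w_j is (2/N) Σ (prediction - y) x_j. The dataset, the learning rate and the number of steps are hypothetical, and the weights start at zero (rather than randomly) only to keep the run reproducible.

```python
def predict(x, w, w0):
    # Linear perceptron: inner product plus bias, no activation.
    return sum(xi * wi for xi, wi in zip(x, w)) + w0

def mse(data, w, w0):
    return sum((predict(x, w, w0) - y) ** 2 for x, y in data) / len(data)

def gradient(data, w, w0):
    # Partial derivatives of the mean squared error with respect to w and w0.
    n = len(data)
    grad_w = [0.0] * len(w)
    grad_w0 = 0.0
    for x, y in data:
        error = predict(x, w, w0) - y
        for j, xj in enumerate(x):
            grad_w[j] += 2 * error * xj / n
        grad_w0 += 2 * error / n
    return grad_w, grad_w0

# Hypothetical data generated by y = 2*x1 - 1*x2 + 0.5.
data = [((0.0, 0.0), 0.5), ((1.0, 0.0), 2.5), ((0.0, 1.0), -0.5),
        ((1.0, 1.0), 1.5), ((2.0, 1.0), 3.5)]

w, w0 = [0.0, 0.0], 0.0
learning_rate = 0.1
initial_loss = mse(data, w, w0)
for _ in range(2000):
    grad_w, grad_w0 = gradient(data, w, w0)
    # Update rule: current value minus the gradient scaled by the step size.
    w = [wj - learning_rate * gj for wj, gj in zip(w, grad_w)]
    w0 = w0 - learning_rate * grad_w0
final_loss = mse(data, w, w0)
print(w, w0, final_loss)
```

Because this loss is convex in the weights, the run ends near the values used to generate the data regardless of the starting point, provided the step size is small enough.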
We cannot establish a decision boundary that solves the XOR problem with a simple linear
perceptron. It is not possible even with a simple perceptron that has a non-linear activation
function. A multilayer perceptron (MLP) is needed: the output of a neuron (which is a line, like
the one in the second graph above) must now enter another neuron that combines the outputs of
several neurons from the previous layer. In a multilayer perceptron many neurons are placed in
parallel so that their outputs can be combined into a prediction. Lines start to be combined
through the concatenation of neurons. When certain conditions are met, the MLP is a universal
approximator, meaning that it can approximate any continuous function arbitrarily well.
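The idea of combining lines can be sketched with a tiny two-layer network that computes XOR. All weights below are set by hand (an OR neuron, an AND neuron, and an output neuron that fires for "OR but not AND"); they are hypothetical values for illustration, and a 0/1 step activation is assumed.

```python
def step(z):
    # Step activation with outputs in {0, 1} (assumed for this sketch).
    return 1 if z > 0 else 0

def neuron(inputs, weights, bias):
    return step(sum(x * w for x, w in zip(inputs, weights)) + bias)

def xor_mlp(x1, x2):
    # Hidden layer: two neurons, each one draws a different line.
    h_or = neuron((x1, x2), (1.0, 1.0), -0.5)    # fires if at least one input is 1
    h_and = neuron((x1, x2), (1.0, 1.0), -1.5)   # fires only if both inputs are 1
    # Output neuron combines the two lines: OR and not AND.
    return neuron((h_or, h_and), (1.0, -1.0), -0.5)

xor_outputs = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
print(xor_outputs)
```

In a trained MLP these weights would be found by gradient descent rather than chosen manually; the sketch only shows that one hidden layer is enough to represent the function.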
A multilayer perceptron (MLP) is a structure that combines several neurons into a single model.
This is where we start talking about neural network architecture. The architecture is given by: how
many layers the perceptron has, how many neurons each layer has, which activation function is used
in each layer, and whether each layer has a bias or not. In a supervised learning paradigm like the
one we are working in, we train the model using labels. During training, each of the model's
outputs has to be compared with the label we want it to produce for that data point. Everything is
guided by the loss function.
In the MLP we no longer have a single weight vector (before, we had one weight vector w for the
neuron); each layer has a matrix with one weight vector per neuron in the layer, and each neuron
has as many weights as there are input features. Following the example in the drawing, the first
layer has 5 weight vectors arranged in a matrix, where each vector has 4 weights (because there are
4 input features).
In these feedforward networks, where information only flows forward, the output of one layer is
the input of the next (each arrow is an x). Therefore each neuron of the second hidden layer has 5
weights, because 5 features enter the neurons of that layer. The last layer is called the output
layer.
All the neurons in a layer have the same activation function, but 2 layers can have different
activation functions.
Having many parameters can lead to overfitting.
Interpretation of the drawing: we have 3 outputs. We can think of, for example, a classification
problem where we input a data point and classify it into 3 different classes; each output neuron
gives the probability that the input belongs to class 1, 2 or 3. We could also be regressing 3
values, such as the age, height and weight of a person, from a set of attributes of that person.
Example: problem with 2 input features
For example, we may be estimating the rain (whether it will rain or not) and the temperature for
tomorrow based on today's atmospheric pressure and humidity.
The first layer is a hidden layer with 3 neurons. Each neuron has 2 weights: one that multiplies
the first input and another that multiplies the second. The weight matrix therefore has one column
per neuron (holding that neuron's weights). When a data point arrives, we compute the product
between the input vector (in pink) and the matrix Wh: a 1x2 matrix times a 2x3 matrix gives a 1x3
output matrix, which holds the output of each neuron.
Size of the weight matrix of a layer: number of input features x number of output neurons. The
output of the layer is the result of the product X x W.
When we train our multilayer perceptron we take the gradient of the loss function with respect to
the model parameters (the ws). Now the Ws are matrices, no longer a single vector as before. The
loss function is differentiated with respect to every W, so the gradient has one component for
each parameter. There is one weight matrix per layer: each hidden layer (and the output layer) has
its own, different weight matrix.
Batches of size 4: 4 samples with 2 input features each. The product of the input matrix and the
weight matrix is 4x3, with one column per neuron and one row per observation. The size of the
weight matrices of the perceptron does not depend on the number of elements in the batch; it
depends on the number of input features and the number of neurons. That is, we can train a
perceptron with a batch size of 40 and, at test time, feed it a single data point.
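The shapes involved can be checked with a small sketch. The helper `matmul` and all numeric values are hypothetical; the only thing taken from the notes is the layout: 2 input features, 3 neurons, and inputs stored as rows.

```python
def matmul(a, b):
    # Plain matrix product for lists of rows: (n x k) times (k x m) gives (n x m).
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Weight matrix of a layer with 2 input features and 3 neurons: 2x3,
# one column per neuron (values are arbitrary).
W_h = [[0.1, 0.2, 0.3],
       [0.4, 0.5, 0.6]]

single = [[1.0, 2.0]]                       # 1 sample  -> 1x2
batch = [[1.0, 2.0], [0.0, 1.0],
         [3.0, -1.0], [2.0, 2.0]]           # 4 samples -> 4x2

single_out = matmul(single, W_h)            # 1x3: one output per neuron
batch_out = matmul(batch, W_h)              # 4x3: one row per observation
print(len(single_out), len(single_out[0]))
print(len(batch_out), len(batch_out[0]))
```

The same `W_h` is used in both calls, which is the point made above: the batch size changes the number of rows of the output, not the size of the weight matrix.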
How to calculate the gradient: Automatic Differentiation / Backpropagation
We express the mathematical operation as a graph of operations. If we know how to differentiate
each operation independently, we can compose the gradient through the chain rule and obtain the
final gradient. Backpropagation is an algorithm for obtaining gradients: it gives us the gradient
of a function with respect to certain parameters. Gradient descent is the optimization algorithm,
and it uses backpropagation to calculate the gradient.
The derivative of a function at a given point is the slope of the tangent line to the curve at that
point, and it indicates how the function varies there. In this case we see whether the function
grows or decreases. When h is very small we can approximate the function using the derivative. The
derivative therefore indicates the 'sensitivity' of the function to a change in the variable. Its
value tells us how the function changes and lets us understand what happens to the function
locally. The derivative is the rate of change.
We need to be able to differentiate in order to run the gradient descent algorithm. In some cases,
such as the maximum function, there are points where the derivative is not defined, so we take a
subgradient. The maximum function is not differentiable everywhere, that is, we cannot
differentiate it over its whole domain, so a generalization of the gradient for non-differentiable
functions, called the subgradient, is used.
For f = max(X, Y), the derivative of f with respect to X is 1 when X is greater than or equal to Y,
and 0 otherwise. When we move X, if X is greater than Y the function increases, but if Y is greater
than X the function does not change.
We use the chain rule to compute derivatives easily in neural networks, because we can think of
the operations as a graph. The chain rule makes the gradient calculation efficient because it
reuses results that have already been calculated (e.g. the derivative of f with respect to q is
used first for the derivative of f with respect to x and then for the derivative of f with respect
to y). Once the operations are structured as a graph and we have a derivative for each operation,
we can reuse them and obtain a much more efficient gradient calculation. Backpropagation is a flow
of chained gradients through a circuit of arithmetic operations.
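The reuse of the derivative of f with respect to q can be sketched on a small graph. The notes only name f, q, x and y; the concrete function below, f = q * z with q = x + y, and the input values are assumptions made for illustration.

```python
# Assumed graph: q = x + y, f = q * z (hypothetical input values).
x, y, z = -2.0, 5.0, -4.0

# Forward pass.
q = x + y          # 3.0
f = q * z          # -12.0

# Backward pass: local derivatives combined with the chain rule.
df_dq = z          # d(q*z)/dq = z, computed once...
df_dz = q          # d(q*z)/dz = q
dq_dx = 1.0        # d(x+y)/dx
dq_dy = 1.0        # d(x+y)/dy
df_dx = df_dq * dq_dx   # ...and reused here
df_dy = df_dq * dq_dy   # ...and here
print(df_dx, df_dy, df_dz)
```

`df_dq` is evaluated a single time and shared by both branches, which is the saving the chain rule provides when the operations are organized as a graph.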
Gradient flow:
EXAMPLE 1
To apply the backpropagation algorithm we need to know at which point the derivative will be
evaluated (instantiated). The algorithm computes the derivative of the function at one specific
point; if we want the derivative at another point, we have to run the algorithm again. So we define
a point by giving values to the parameters, in this case 2, -1, -3, -2 and -3 (colored in red).
These are the values at which we instantiate the derivative. We then do the forward pass: 0.73 is
the value of the function evaluated at the input we provided. Next we perform the backward pass
(which gives the algorithm its name). The gradient of each operation is computed independently. We
go from back to front, multiplying the local gradients (those calculated at each operation of the
graph) by the gradient being backpropagated. In this way we apply the chain rule step by step (in
green).
1. We take the derivative of the function with respect to the variable.
2. We evaluate the result at the incoming value (if the result is a constant, there is nothing to
evaluate).
3. We multiply by the gradient we have been backpropagating.
4. Whenever we have an addition operation, we copy the gradient to both input branches, because
the two derivatives are the same (the derivative of (x+y) with respect to x is equal to the
derivative of (x+y) with respect to y).
Finally, we obtain the value of the gradient vector of the function with respect to each of the
variables. It is a vector of numbers; there are no mathematical expressions because everything is
already instantiated. The backpropagation algorithm gives us the gradient of the function
instantiated at a point. It is an automatic differentiation algorithm that calculates the gradient
efficiently.
When the output of a node is shared, the backpropagated gradients are summed before multiplying
them by the local gradient.
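The forward and backward passes can be traced by hand in a few lines. The figure with the graph is not reproduced in these notes, so the function is an assumption: a single sigmoid neuron f = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2))) with w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3, which is consistent with the five values and with the forward result 0.73 quoted above.

```python
import math

# Assumed assignment of the five values to the inputs of a sigmoid neuron.
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass, one operation at a time.
p0 = w0 * x0                        # -2.0
p1 = w1 * x1                        #  6.0
s = p0 + p1 + w2                    #  1.0
out = 1.0 / (1.0 + math.exp(-s))    # about 0.73

# Backward pass, from the output back to the inputs.
d_out = 1.0                          # gradient of the output with respect to itself
d_s = d_out * out * (1.0 - out)      # local gradient of the sigmoid
d_p0 = d_s                           # addition copies the gradient to each branch
d_p1 = d_s
d_w2 = d_s
d_w0 = d_p0 * x0                     # product: multiply by the other input
d_x0 = d_p0 * w0
d_w1 = d_p1 * x1
d_x1 = d_p1 * w1
print(round(out, 2), d_w0, d_x0, d_w1, d_x1, d_w2)
```

The result is a vector of plain numbers valid only at this point; changing any input value requires running both passes again.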
To request a gradient for any computation in PyTorch, it is essential that the computation has
been built as operations between PyTorch tensors. That is, the graph must never leave PyTorch: all
the operations must be defined between PyTorch tensors. If we break the graph by using a function
that is not from PyTorch, the gradient is cut off, there is no backpropagation and we cannot
calculate the gradients.
Components of PyTorch
PyTorch is structured in classes. It offers a set of components that are always needed when
implementing a Deep Learning algorithm:
●Data: to handle data, PyTorch provides dataset and tensor objects. The dataset loads the data
and, once loaded, puts it inside tensors.
●Data Loaders: once the data is in tensors, we create a data loader and give it a dataset. We
iterate through the data loader and it returns one batch at a time.
●Networks: PyTorch also lets us create neural networks (we do not have to code the product between
x and w from scratch). We ask it for dense layers (dense because each neuron is densely connected
to all the neurons of the previous layer; these are the layers we have studied so far). We ask
PyTorch directly for a linear/dense layer, specifying the input features, the output features and
the activation function.
●Training functions: training loops can be built in which a forward pass is run with the dataset,
the cost function is computed, backpropagation is performed, the weights of the network are
updated, and the process iterates.
●TorchScript: used to pre-compile the assembled models so that they run in a much more optimized
way. When a model is created and is going to be deployed, it is generally compiled (optimizations
are applied) so that it runs faster.
Tensors in PyTorch
In PyTorch we can:
Perform operations in parallel on GPU
In an image of 1,024x1,024, for example, each neuron in a fully connected first layer has about 1
million connections/weights. With, for example, 1,000 neurons we have about one billion parameters
in the first layer alone, which is not scalable.
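The arithmetic behind these figures is short enough to spell out. The sketch assumes a single-channel image and ignores bias terms.

```python
# Fully connected first layer on a 1,024 x 1,024 single-channel image.
height, width = 1024, 1024
neurons = 1000

weights_per_neuron = height * width                  # one weight per input pixel
total_weights = weights_per_neuron * neurons         # bias terms not counted
print(f"{weights_per_neuron:,} weights per neuron")
print(f"{total_weights:,} weights in the first layer")
```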
Suppose we are looking at a character: each neuron in the first layer looks at a very small area
of the input. That is, the value the neuron outputs is affected by only a part of the input; these
neurons have a limited receptive field. The receptive field is the area of the input that affects
the output value of the neuron.
A neuron in the last layer combines the information coming from the previous layers (in this case,
from the 3 previous layers), and each previous neuron looked at a specific part of the data.
Therefore the receptive field of a neuron is always larger than that of a neuron in a previous
layer. That is, upper layers have an increasingly broad receptive field and react to increasingly
complex patterns.
Deep learning models perform analysis at multiple levels of abstraction and also integrate the
classifier as part of the model. The classifier is a multilayer perceptron; the rest are
convolutional neurons that extract features. All of this forms a graph of operations and can be
differentiated as if it were a small, very simple perceptron. In summary: a data point enters,
features are extracted, it goes through the classifier, a prediction is made, the loss function is
computed, and the gradient of the loss function with respect to the model parameters is
backpropagated. The classifier and the feature extractor can be trained together, which is why we
call it end-to-end training.
Convolutional Neural Networks
A convolutional neural network is any network that uses a convolutional layer somewhere.
1D Convolution
A convolution is a mathematical operation between two functions: it is the integral of the product
of a function x with a translated and reflected version of another function w. We call x the input
signal, w the filter, and the result the convolved signal. We have two functions, and the result is
another function. In the continuous case we shift the filter and, at each position, take as the
output value of the convolved signal the area of overlap between the 2 functions (which is why the
output reaches its maximum when the filter overlaps the input signal completely).
In practice the signal is usually not a continuous function but a sampled one: at each instant of
time we have a number. The signal x is now a vector of numbers where each position holds a value of
the signal, and w is also a vector. In the discrete case we sum the products between the vector w
and the part of the input signal where we are currently positioned (e.g. the computation for the
first cell is: (1x1)+(4x2)+(-1x0)+(0x-1) = 9).
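The discrete operation can be sketched as a sliding sum of products. The first window reproduces the worked computation above under the assumption that (1, 4, -1, 0) are signal values and (1, 2, 0, -1) is the filter; the last two signal values are hypothetical, added only so the filter has somewhere to slide. As in the worked example, the filter is not reflected, which is the convention deep learning libraries use for their "convolution" layers.

```python
def sliding_dot_product(signal, kernel):
    # At each position, sum the products between the kernel and the part of
    # the signal it currently covers (no padding, stride 1, no reflection).
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [1, 4, -1, 0, 2, -2]   # first four values assumed from the example
kernel = [1, 2, 0, -1]
output = sliding_dot_product(signal, kernel)
print(output)
```

With a signal of length 6 and a filter of length 4 there are 3 valid positions, so the convolved signal has 3 values.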
2D Convolution
If we work with images, the signals are two-dimensional, so instead of working with a single index
we work with two indices. i is the image and k is the convolutional filter. We move the filter
along the image, taking the inner product between the filter and the part of the image that lies
beneath it.
The convolutional filter is also called a kernel. In an image, we slide the kernel over the image.
The kernel is itself a small image, and what is drawn in the kernel determines what we obtain as
the convolved signal. In the context of neural networks, the result of a convolution operation is
called a feature map. Very white pixels represent high values, while very dark pixels represent low
values. When the convolution is applied to an image, the highest output values appear where
something in the image closely resembles what is drawn in the kernel. This happens because we are
calculating a dot product, and the dot product is maximal when the two vectors are aligned (they
are the same). It is a way of doing template matching, that is, looking for things in images. A
feature map is the image transformed into something that represents a specific characteristic of it
(e.g. an edge map). When we perform a convolution, we can think of each output pixel as the value
'spat out' by a neuron that is parameterized by the kernel. That is, the weights of our neurons are
now the kernels. The convolutional filter is precisely what we are going to learn: it is
initialized with random values and, since its values are the parameters of the neuron, they are
learned. If we want to learn how to extract features, we leave the randomly initialized parameters
free so that they can be learned through training (exactly as seen so far).
This process is part of the forward pass of the image analysis.
This is the result of convolving the background image with a kernel. In this case the kernel is an
edge-highlighting filter: it highlights the edges oriented like the filter (kernel). The highest
values are on the edges.
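A small sketch of a 2D convolution with an edge-highlighting kernel. The toy image (dark on the left, bright on the right) and the vertical-edge kernel are hypothetical; the kernel is applied without reflection, with no padding and stride 1.

```python
def conv2d(image, kernel):
    # Slide the kernel over the image and take the inner product between the
    # kernel and the patch of the image beneath it.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# Toy 4x6 image: dark (0) on the left half, bright (1) on the right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Kernel that responds to dark-to-bright transitions from left to right.
vertical_edge_kernel = [[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]]

feature_map = conv2d(image, vertical_edge_kernel)
for row in feature_map:
    print(row)
```

The feature map is zero over the flat regions and high only around the column where the image changes from dark to bright, which is where the image resembles what is drawn in the kernel.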
VGG: they proposed making the networks deeper. To achieve this without increasing the number of
parameters, they use very small kernels (3x3). Because the networks are deeper and the filters
(kernels) are so small, the filters are hard to interpret visually.
GoogLeNet: at the end, an operation called global average pooling (GAP) is performed, in which the
average of each feature map is taken. That is, there is one value per feature map. The output of
the GAP is what is fed into the MLP. This model has only 4 million parameters.
Cross Entropy vs Accuracy
Each column is a data point and the rows represent the probability of it belonging to each class.
The model prediction is above and the label, or ground truth, is below.
First classifier: in the first case the model did not match the real label (the most probable
predicted class was 3 and the correct class was class 1). In both case 2 and case 3 it got it
right. If we compute the accuracy we get 2/3, and if we compute the cross-entropy we get 4.14.
Second classifier: same accuracy but much lower cross-entropy.
If we had to decide which classifier is better, we would say the second, because it assigns more
probability to the correct class. So if we use cross-entropy as the loss function to guide the
learning of the network, in a case like this example it will lead to a model like the second
classifier, which is better. If we used accuracy, both models would look equally good, since
accuracy is a discrete measure.
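The comparison can be reproduced numerically. The table of probabilities is not included in these notes, so the values below are hypothetical: they were chosen so that the first classifier gives an accuracy of 2/3 and a cross-entropy of about 4.14, assuming natural logarithms and a sum (not an average) over the three data points. Classes are indexed from 0 in the code.

```python
import math

def accuracy(probabilities, labels):
    # A prediction is correct when the most probable class matches the label.
    hits = sum(1 for p, y in zip(probabilities, labels) if p.index(max(p)) == y)
    return hits / len(labels)

def cross_entropy(probabilities, labels):
    # Sum over data points of -log(probability assigned to the correct class).
    return -sum(math.log(p[y]) for p, y in zip(probabilities, labels))

labels = [0, 1, 2]   # correct class of each data point (hypothetical)

# First classifier: right twice, but never confident about the correct class.
classifier_a = [[0.1, 0.2, 0.7], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]
# Second classifier: same mistakes, more probability on the correct class.
classifier_b = [[0.3, 0.4, 0.3], [0.1, 0.7, 0.2], [0.1, 0.2, 0.7]]

acc_a, acc_b = accuracy(classifier_a, labels), accuracy(classifier_b, labels)
ce_a, ce_b = cross_entropy(classifier_a, labels), cross_entropy(classifier_b, labels)
print(acc_a, round(ce_a, 2))
print(acc_b, round(ce_b, 2))
```

Accuracy cannot tell the two classifiers apart, while cross-entropy changes smoothly with the predicted probabilities, which is what makes it usable as a loss for gradient descent.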
Binary Classification: in the binary case there is a single output from the perceptron, which
represents the probability of the true class. The probability of the false class is 1 minus this
value. In this case the loss is called binary cross-entropy.
Multi-Class Classification: when there are more than two classes we use categorical cross-entropy.
There are several categories and the sum runs over all possible categories.
Multi-Label Classification: several labels can be present at the same time (e.g. in a chest X-ray,
a person can have more than one pathology at once). In this case we can use the sigmoid as the
activation function, because each label is independent of the others.
Activation functions
When a layer, whether convolutional or fully connected, processes a data point, many values come
out, one for each neuron, and an activation function is applied to them. Basically, each neuron
has an activation function.
In a fully connected layer the function sigma, which can be any activation function, is applied to
the inner product. A convolutional layer is the same, except that the inner product is computed
only over the small square covered by the kernel.
The sigmoid activation function has certain problems when used in a very deep multilayer
perceptron: its gradient tends to 0 when x tends to ∞ or -∞. We then get no gradient signal, so the
weights of the network do not change; in other words, the network does not learn. We compute the
gradients using the chain rule, backpropagating them from the end to the beginning of the neural
network. If the network has sigmoid-type functions and very positive or very negative values pass
through them, the gradients we recover become smaller and smaller, and so does the product between
gradients. The multiplication of small gradients through the chain rule makes the gradient vanish.
This is known as the vanishing gradient problem.
To address this, the ReLU (rectified linear unit) activation function was proposed, which is the
maximum between 0 and the input of the function (x). That is, everything negative is transformed
into 0 and everything positive is left as is. It is a non-linear function with a point (x = 0)
where it is not differentiable. At all positive points the function is differentiable and has a
non-zero gradient, so if what flows through the network is positive, we have a gradient signal. The
gradient is less likely to vanish than with the sigmoid activation function. Even so, the gradient
is 0 for negative values, which can trigger the Dying ReLU problem. This is why the Leaky ReLU was
proposed: it avoids the Dying ReLU problem because it also produces non-zero gradients for negative
values of x. That is, a function with a small slope is used for negative numbers.
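The gradients of the three activation functions can be compared directly. In this sketch the slope 0.01 for the Leaky ReLU and the depth of 10 layers are assumed values, and the chain-rule product ignores the weights of the layers: it only multiplies the local gradients of the activations, which is a deliberate simplification.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25, reached at x = 0

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # subgradient 0 chosen at x = 0

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope

# Product of local gradients across 10 layers when each activation sees x = 5.
depth = 10
sigmoid_chain = sigmoid_grad(5.0) ** depth
relu_chain = relu_grad(5.0) ** depth
print(sigmoid_chain, relu_chain)
print(relu_grad(-2.0), leaky_relu_grad(-2.0))
```

The sigmoid product collapses towards zero while the ReLU product stays at 1 for positive inputs; for negative inputs the ReLU gradient is exactly 0, whereas the Leaky ReLU still lets a small gradient through.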
Activation functions must always be non-linear, because otherwise the universal approximation
theorem does not apply: a stack of linear layers collapses into a single linear model, so nothing
is gained in modeling power.
The first thing we can do is early stopping: we monitor the loss function on the validation set
and, as soon as we see the validation loss rise during several consecutive iterations, we stop the
training.
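Early stopping can be sketched as a check over the history of validation losses. The function name, the `patience` parameter and the loss values are hypothetical, and the stopping rule used here (no improvement over the best loss seen so far for `patience` consecutive checks) is one common variant of the idea.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop, or None."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Validation loss improved: remember it and reset the counter.
            best = loss
            bad_epochs = 0
        else:
            # No improvement: count consecutive bad epochs.
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None

# Hypothetical validation losses: they improve, then start to rise.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.57, 0.58, 0.60, 0.65]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)
```

In practice the weights from the epoch with the best validation loss (index 3 in this toy history) are usually the ones kept.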
Another technique we can use is L2 norm regularization (weight decay). When we train a neural
network we are exploring a parameter space: the space of possible ws. Since in principle the ws
can take any value, for a single problem there are infinitely many possible neural networks, which
gives the model a high complexity. Reducing the number of parameters of a model is, in general, a
way to regularize: we give the model less freedom. But this is a very aggressive way to reduce the
complexity of the model. L2 norm regularization can be seen as a "soft" way of lowering the
complexity of the model, since it adds to the loss function a regularization term that is the
squared Euclidean norm of the weights. That is, we take all the weights of the network, put them
into one long vector, take its squared norm, and add this to the loss function. In this way the ws
are pulled towards the origin: we are telling the model that we prefer values of w that are closer
to the origin, limiting the parameter space the model can use to values near the origin. We do this
because if a value of w is very large, the model is giving a lot of importance to that feature.
This regularization prevents one feature from taking all the responsibility; the responsibility is
distributed. We do not do it in an explicit, hard way; we only tell the model that the weights need
to be close to the origin (without saying how close). It is a soft way to narrow down the parameter
space, and it indirectly reduces the complexity of the model.
There are many w that can minimize a loss function for a given training dataset. We can reduce
the complexity of the model by limiting the number of possible models that can be built. One way
to do this is to prevent very large wi from existing, that is, to add a restriction on the values
that the weights wi can take.
Influence of the parameter λ: λ tells us how much attention we pay to the regularization term. If
λ is too large, the model underfits: it does not learn because the data term no longer receives
attention. If λ is very small, the regularization has almost no effect and the model can overfit.
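A minimal sketch of the regularized loss, assuming NumPy and a penalty of the form λ times the
squared Euclidean norm of all weights (some texts use λ/2 instead; the function names are
illustrative):

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, lam):
    """Total loss = data term + lam * squared Euclidean norm of all weights."""
    w = np.concatenate([p.ravel() for p in weights])  # all weights in one long vector
    return data_loss + lam * np.sum(w ** 2)

def l2_gradient(weight, lam):
    """Gradient of lam * ||w||^2 with respect to one weight array."""
    return 2.0 * lam * weight

W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([3.0])
total = l2_regularized_loss(data_loss=0.4, weights=[W1, W2], lam=0.1)
```

The gradient of the penalty is proportional to the weight itself, so each gradient descent step
shrinks every weight a little towards the origin; larger weights are shrunk more.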
Data augmentation works very well, especially on images. At training time, the dataset is
artificially enlarged by applying transformations to the data that preserve the labels, e.g.
rotating, scaling, shifting, changing the color, or adding Gaussian noise to the images (the cat
is still a cat, but for the network it is new data). These networks need a lot of data to
generalize reasonably well, so data augmentation during training is one of the most commonly used
techniques in practice. It is always important to use data augmentation when working with images.
At test time, it is also possible to generate N versions of the test image, compute the
prediction for each, and average them.
Data augmentation is not only used with images; it can also be used on time series, for example,
where we can apply transformations that do not completely destroy the data.
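Both uses can be sketched with NumPy. The transformations (horizontal flip and Gaussian noise), the
noise level, and `toy_model` are illustrative assumptions; `toy_model` is only a stand-in for a
trained classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, rng):
    """Return a randomly transformed copy of an (H, W) image; the label is unchanged."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                        # horizontal flip
    out = out + rng.normal(0.0, 0.05, out.shape)  # additive Gaussian noise
    return out

def predict_with_tta(model, image, rng, n_versions=8):
    """Average the model's predictions over N augmented versions of a test image."""
    preds = [model(augment(image, rng)) for _ in range(n_versions)]
    return np.mean(preds, axis=0)

# Hypothetical stand-in for a trained classifier: returns 2 class probabilities.
def toy_model(image):
    p = 1.0 / (1.0 + np.exp(-image.mean()))
    return np.array([p, 1.0 - p])

image = rng.random((4, 4))
probs = predict_with_tta(toy_model, image, rng)
```

During training, `augment` would be applied to each image as it is loaded, so the network rarely
sees exactly the same input twice.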
Another widely used regularization technique is dropout, which randomly turns off some neurons
during training so that the model does not assign too much responsibility to a particular neuron
or feature. By "turning off" we mean ignoring the output of the neuron. With probability (1-p),
each neuron is ignored in the forward pass during training. Neurons are turned off with this
probability at each iteration of the gradient descent algorithm, so the network has one structure
in one iteration, a new structure in the next, and so on. This keeps the model from relying too
much on any particular neuron. It is an indirect way of training many models at the same time,
because we keep changing the architecture, and training many models and making them predict
together generally performs better than using a single one. Dropout is applied per layer.
Neurons are only turned off during training. At test time (when we make predictions) all the
neurons are used and the outgoing weights of each neuron are multiplied by p; neurons are not
randomly turned off.
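A minimal sketch of this scheme for one layer, assuming NumPy and taking p as the probability of
keeping a neuron (named `p_keep` here). Scaling the outputs by p at test time is equivalent to
multiplying the outgoing weights by p.

```python
import numpy as np

def dropout_forward(activations, p_keep, training, rng):
    """Dropout with keep probability p_keep (each unit is dropped with prob. 1 - p_keep)."""
    if training:
        mask = rng.random(activations.shape) < p_keep  # a new mask at every iteration
        return activations * mask
    # Test time: every unit is active; outputs are scaled by p_keep so their
    # expected value matches what the next layer saw during training.
    return activations * p_keep

rng = np.random.default_rng(0)
h = np.ones((10000,))
train_out = dropout_forward(h, p_keep=0.8, training=True, rng=rng)
test_out = dropout_forward(h, p_keep=0.8, training=False, rng=rng)
```

Many libraries implement the "inverted" variant instead, which divides by p during training and
leaves the test-time forward pass unchanged; the two are equivalent in expectation.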
It is common to anneal/schedule the learning rate, for example starting with a very high
learning rate to explore the error surface in large jumps and then reducing it as the iterations
progress. We start with a large step and decrease it as training advances through the gradient
descent iterations.
Step decay: reduce the learning rate (e.g. divide it by 2) every fixed number of epochs, or when
the validation error stops improving.
Exponential decay: the learning rate decays exponentially (with a negative exponent).
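The two schedules can be sketched as plain functions of the epoch. The initial rate, the drop
factor, the interval and the decay constant k are illustrative values, not ones given in these
notes.

```python
import math

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return lr0 * drop ** (epoch // epochs_per_drop)

def exponential_decay(lr0, epoch, k=0.1):
    """Learning rate lr0 * exp(-k * epoch)."""
    return lr0 * math.exp(-k * epoch)

lrs_step = [step_decay(0.1, e) for e in (0, 9, 10, 25)]
lrs_exp = [exponential_decay(0.1, e) for e in (0, 10)]
```

Step decay keeps the rate constant within each interval and halves it at the boundaries, while
exponential decay lowers it smoothly at every epoch.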
An adaptive learning rate for each parameter of the vector W is also often used. The previous
methods apply the same global LR to all parameters; adaptive methods scale the LR for each
parameter.
- RMSProp
- Adadelta
- Adagrad
- Adam (Momentum + RMSProp)
● Region growing → a seed is placed in the tumor and grows until it hits the edges. Where it
meets an edge, it stops growing.
● Explicit deformable models → active contours that move until they collide with edges.
A first approach to the problem, the most basic and inefficient of all, is the sliding window:
the image is cut into many patches, a classification network is trained, and each patch is passed
through the network. The problem is treated as a classification problem; we can think of
segmentation as classification at the pixel level, where each pixel is classified independently.
Patches of the image are taken, and the label used for each patch is the label of its central
pixel. Continuing with the example of the following image, the first and second pixels are class
cow and the third pixel is class grass. This approach is very inefficient because it does not
reuse the features shared between overlapping patches.
In 2015, a new type of architecture was proposed that solves this problem more efficiently.
Fully Convolutional Networks are networks without dense (fully connected) layers: they contain
only convolutions and pooling. They process the image just as we have seen so far, and when we
reach some low-resolution level we extract as many feature maps as there are classes in the
segmentation problem (e.g. if we have 21 possible classes, we extract 21 feature maps of the same
size as the input image), and we say that each feature map gives the probability that each pixel
belongs to the corresponding class. The output is set up as a probability map: each pixel now has
a vector of probabilities, and the label we assign is the class with the highest probability.
That is, through convolutions we transform the image into a probability volume in which each
entry is the probability that a given pixel belongs to a given class. The probabilities for a
single pixel always sum to 1. The last feature map has as many channels as there are classes. We
compute the loss function for each pixel and average: each pixel is treated as an independent
classification problem, the cross-entropy is computed per pixel, and then all of these values are
averaged.
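The per-pixel computation can be sketched with NumPy, assuming the network outputs a score volume
of shape (classes, height, width). The array layout and function names are illustrative, and random
scores stand in for the output of a real network.

```python
import numpy as np

def pixelwise_softmax(logits):
    """Softmax over the class axis of a (C, H, W) score volume."""
    z = logits - logits.max(axis=0, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def pixelwise_cross_entropy(probs, labels):
    """Mean cross-entropy over all pixels; labels is an (H, W) array of class ids."""
    h, w = labels.shape
    rows, cols = np.indices((h, w))
    p_true = probs[labels, rows, cols]   # probability of the true class at each pixel
    return -np.mean(np.log(p_true + 1e-12))

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 5))        # 3 classes, 4x5 image
labels = rng.integers(0, 3, size=(4, 5))   # toy ground-truth segmentation
probs = pixelwise_softmax(logits)          # each pixel's probabilities sum to 1
segmentation = probs.argmax(axis=0)        # label of the most probable class
loss = pixelwise_cross_entropy(probs, labels)
```

Picking the probability of the true class at each pixel is the same as comparing against the
one-hot version of the ground truth, since all other terms of the cross-entropy are multiplied by
zero.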
U-Net uses an encoder-decoder architecture for image segmentation. The input image goes through
convolution-convolution-pooling blocks (each convolution keeps the size, each pooling halves it)
until it reaches a kind of 'bottleneck'; from there the resolution is raised again with
convolution-convolution-up-convolution blocks until the original resolution is reached. This
creates a U-shaped path in which the resolution is first reduced and then increased with the
inverse of the operations applied before. First the image is encoded (encoder) and then it is
decoded (decoder).
The gray arrows are called skip connections and are used to skip parts of the network. The
network remains feed-forward, but some connections skip layers. This helps preserve fine details
and also lets the gradient skip layers; that is, skip connections help the earlier layers receive
a gradient that carries signal and has not vanished.
Passing the image through the U-Net gives probability maps. In the following example, where the
image has to be classified into 3 classes, we obtain a probability map with 3 'slices' (like
slices of a loaf of bread), where a white value means high probability and a black value means
low probability. The probabilities of each pixel sum to 1. An argmax is applied to the
probability map (we keep the label of the class with the highest probability), and with that we
reconstruct the segmentation. We compute the per-pixel loss function (e.g. cross-entropy) between
the output probability maps and the one-hot version of the ground truth, and then average over
all the pixels of the image.
Unlike the architectures we have already seen (AlexNet, LeNet, etc.), fully convolutional
networks can be trained with images of one size and then segment images of another size, because
everything adapts to the size of the input. So something we do a lot is patch-based training
(especially when the images are very large): pieces are cut out of the image, training is done on
the pieces, and at prediction time the full volume is fed in. The mini-batches are made up of
patches, not complete images. At test time, if the network is fully convolutional, it is enough
to feed in the new image and the prediction will have the appropriate size. Otherwise, tiling can
be done outside the network, and the segmentation is then reconstructed from the tiles.
If the network does not have any fully connected (dense) layer, it is said to be fully
convolutional. Dense layers can be implemented as convolutional layers. The big advantage of a
fully convolutional network is that it can process images of any size, and can be trained and
evaluated on images of different sizes.
For time series prediction, we can work with a perceptron and treat our data as if each sample
were simply a row of a table. We can have, for example, 4 features (e.g. the data from the 4
previous days) and try to predict a 5th value (day 5). The time series is split into pieces and a
table is built in which the label (ground truth) is tomorrow's value and the previous days are
used as input features.
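Building that table can be sketched with NumPy, assuming a window of 4 previous values as in the
example above (the function name is illustrative):

```python
import numpy as np

def make_windows(series, n_inputs=4):
    """Turn a 1-D series into a table: n_inputs previous values -> next value."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + n_inputs] for i in range(len(series) - n_inputs)])
    y = series[n_inputs:]   # the value that follows each window
    return X, y

X, y = make_windows([1, 2, 3, 4, 5, 6, 7], n_inputs=4)
# X has one row per window: [1 2 3 4], [2 3 4 5], [3 4 5 6]
# y holds the value that follows each window: 5, 6, 7
```

Each row of X together with its entry in y is one training example for the perceptron.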
The interesting thing about using convolutions to predict time series is that we do not need to
cut the data at test time, because a fully convolutional network adapts the size of the output to
the size of the input. We could train a fully convolutional network to take 4 data points and
produce 1 (the prediction for tomorrow) and then feed it the entire dataset at test time. Being
fully convolutional, the network will output a value for each of the positions that were
processed. In any case, convolutional networks are not widely used for processing time series. If
we implement a convolutional network that runs over a time series and makes future predictions,
the convolutions have to be causal (the data entering the network must be prior to what we are
predicting).
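A causal convolution can be sketched with NumPy by padding the series on the left only. This is an
illustrative single-filter version with hand-picked weights, not a trained network; the output at
position t would be read as the prediction for time t+1, since it depends only on values up to and
including t.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D causal convolution: output[t] depends only on x[t], x[t-1], ...

    kernel[0] multiplies x[t], kernel[1] multiplies x[t-1], and so on.
    The input is left-padded with zeros so the output has the same length as x.
    """
    x = np.asarray(x, dtype=float)
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([
        sum(kernel[j] * padded[t + k - 1 - j] for j in range(k))
        for t in range(len(x))
    ])

x = np.array([1.0, 2.0, 3.0, 4.0])
y = causal_conv1d(x, kernel=[0.5, 0.5])   # average of the current and previous value
```

The same function works on a series of any length, which is the property that lets a fully
convolutional model process the whole test series at once.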
If we have to make predictions on a time series beyond the next day, we can include the newly
predicted value as input, discard the oldest day used so far, and predict a new day again.
However, doing this greatly reduces the accuracy at each step we take forward in time: the
predictions become less reliable because we are feeding the model data that we ourselves
predicted. In a model that generates music (we give the model a song and it starts to generate
what follows), by contrast, this would not be a problem.
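This feed-back loop can be sketched as follows. `toy_model` is a hypothetical stand-in for a
trained predictor (it simply returns the mean of its window), used only to make the example
runnable.

```python
import numpy as np

def rollout(model, last_window, n_steps):
    """Predict n_steps ahead by feeding each prediction back in as input."""
    window = list(last_window)
    preds = []
    for _ in range(n_steps):
        y_hat = model(np.array(window))
        preds.append(y_hat)
        window = window[1:] + [y_hat]   # drop the oldest day, append the prediction
    return preds

# Hypothetical stand-in for a trained model: predicts the mean of the window.
def toy_model(window):
    return float(window.mean())

preds = rollout(toy_model, last_window=[1.0, 2.0, 3.0, 4.0], n_steps=2)
```

From the second step onwards the window contains predicted values, so any error in an early
prediction is carried into all later ones.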
The state is a vector of numbers that is initialized to some value and then updated by mixing
information from the data and from the previous state through matrix products. The state is not
the prediction.
The distinctive property of these networks compared with the ones we have seen so far is that the
weights are shared over time. The state h starts at its initial value, the first data point
enters, and we apply the function f. The function f returns an updated state (an updated vector
of values). This vector enters the function again together with data point 2, but the weights
applied by the function are the same as before. The W matrices are the same at every point in
time within a given gradient descent step.
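The recurrence can be sketched with NumPy. The tanh non-linearity, the bias term and the names
`W_h` and `W_x` are assumptions of this sketch; the notes only state that f mixes the previous
state and the data through matrix products.

```python
import numpy as np

def rnn_forward(xs, h0, W_h, W_x, b):
    """Run a vanilla RNN over a sequence, reusing the same weights at every step."""
    h = h0
    states = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)   # f(previous state, current input)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
hidden, features, steps = 3, 2, 5
W_h = rng.normal(scale=0.5, size=(hidden, hidden))    # state-to-state weights
W_x = rng.normal(scale=0.5, size=(hidden, features))  # input-to-state weights
b = np.zeros(hidden)
xs = rng.normal(size=(steps, features))
states = rnn_forward(xs, np.zeros(hidden), W_h, W_x, b)
```

The loop applies the same `W_h` and `W_x` at every time step; a prediction would be computed from
each state by a separate output layer, which is omitted here.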
Once we have produced the outputs, we use them to compute the loss function. We compute the loss
for each of the terms and backpropagate it just as we have seen so far. We call this
backpropagation through time because, although the algorithm is the same, the computation graph
is built over time. When we take the derivative, we take the derivative of L (the loss function)
with respect to W.
Open problems in DL and ML
How to generalize to multiple data domains?
Often, when there is a domain change between the data we use for training and the data we see in
real life, the models do not work as well. In these cases there are certain techniques that can
be applied. This has to do with the robustness of the models we train and is often the key
difference between a successful model in production and an unsuccessful one.
When we train a model to solve a task (e.g. classifying images into different categories), it may
happen that we train on images from one domain (e.g. images taken from Amazon) and then test on
images that come from a different domain (e.g. images with a different background).
A domain change is a change in the distribution of the data. When we face a problem with domain
changes, the task we want to solve is the same (e.g. we want to predict the same classes); what
changes is the distribution of the data. Domain adaptation techniques allow models to be adapted
from one domain to another.
● Multi-site issues: a model works well in one hospital but not in another because a different
device is used to acquire the medical images (which slightly changes the distribution of the
pixels).
The idea is to fool the network so that it cannot distinguish which domain the data comes from.
That is, we force the learned features to be useless for distinguishing which domain the data
comes from. We enforce invariance in the learned features.
The feature extraction and class classification part (the upper part) can only be trained on the
classification task with data from the source domain, because that requires the class label.
However, the domain classifier can be trained with data from both domains, because we do not need
class labels to compute its loss function. We train this iteratively: we pass images from the
source domain to train the upper part, and from time to time we add images from both domains
(without class labels) to train the domain classifier to distinguish which domain each one comes
from. At each iteration, sometimes we do one thing and sometimes the other.
The domain classifier adds a second term to the loss function, related to classifying which
domain the image comes from. The interesting thing is that it does so by looking at the features,
not at the image. So we have a loss function with 2 terms: we try to do well on the
classification term while making the domain classifier fail on the domain term. The goal is to
fool the domain classifier. By fooling it, the representation we are learning becomes invariant
to the domain, which helps to better classify images from the target domain.
How to interpret the trained models?
What mechanisms do we have to ask the network why it is making the decisions it makes?
Biases in AI models
● Gender bias in text translation → probably men were over-represented in the training data.
● Geographical distribution bias in image classification (e.g. bride or costume) → when the
classification model is shown Western brides, it classifies them as brides. When it is shown
Indian brides, it classifies them as costumes.
● Bias in face recognition → face recognition models perform better for people with fair skin
than for people with dark skin.
● Biased data → the bias in the data often reflects imbalances in institutional infrastructure
and social power relations.
Protected attributes: variables with respect to which we want to guard against algorithmic bias,
such as gender, ethnicity, country of origin, age, weight, skin color, etc. We do not want the
model to be biased with respect to these attributes.
For example, in an algorithm that recommends whether or not to grant credit to a person, gender
could be a protected variable: we would not want to grant or deny credit to someone based on
gender.