Activation Functions in Neural Networks

The document is an assignment for an Introduction to Machine Learning course, covering various concepts such as gradient descent, activation functions, and maximum likelihood estimation. It includes multiple-choice questions with solutions provided for each question. Key topics discussed include the effects of activation functions in neural networks, transformations for linear separability, and the relationship between MLE and MAP.

Uploaded by

Vijay

Assignment 5

Introduction to Machine Learning


Prof. B. Ravindran
1. If the step size in gradient descent is too large, what can happen?
(a) Overfitting
(b) The model will not converge
(c) We can reach maxima instead of minima
(d) None of the above
Sol. (b)
Ref. lecture
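
A minimal sketch of this behaviour, assuming the toy objective f(x) = x^2 (gradient 2x), which is not part of the assignment. The function name and the step sizes 0.1 and 1.1 are illustrative choices only.

```python
def gradient_descent(step_size, x0=1.0, num_steps=20):
    """Run gradient descent on the toy objective f(x) = x^2 and return the final x."""
    x = x0
    for _ in range(num_steps):
        gradient = 2 * x              # derivative of x^2
        x = x - step_size * gradient  # standard gradient descent update
    return x

# Each update multiplies x by (1 - 2 * step_size).
small_step_result = gradient_descent(step_size=0.1)  # factor 0.8: iterates shrink towards 0
large_step_result = gradient_descent(step_size=1.1)  # factor -1.2: iterates oscillate and grow
```

With the small step the iterate approaches the minimum at 0; with the large step every update overshoots the minimum by more than the previous distance, so the iterates never settle.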

2. Recall the XOR example from class (tabulated below), where we transformed the features
to make the data linearly separable. Which of the following transformations also work?

X1 X2 Y
-1 -1 -1
1 -1 1
-1 1 1
1 1 -1

(a) X1′ = X1^2, X2′ = X2^2
(b) X1′ = 1 + X1, X2′ = 1 − X2
(c) X1′ = X1 X2, X2′ = −X1 X2
(d) X1′ = (X1 − X2)^2, X2′ = (X1 + X2)^2
Sol. (c), (d)
(c)

X1′ X2′ Y
1 -1 -1
-1 1 1
-1 1 1
1 -1 -1

(d)

X1′ X2′ Y
0 4 -1
4 0 1
4 0 1
0 4 -1

Under both transformations the two classes map to distinct points, so the transformed data is linearly separable.
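
The four options can be checked with a short sketch. It searches a small integer grid of weights (w1, w2, b) for a rule sign(w1·X1′ + w2·X2′ + b) that reproduces Y. The grid and the helper name `find_separator` are assumptions made for illustration; finding nothing on a grid is not a proof of non-separability, although for (a) and (b) it agrees with the solution.

```python
from itertools import product

# XOR data from the question, as (X1, X2, Y)
xor_data = [(-1, -1, -1), (1, -1, 1), (-1, 1, 1), (1, 1, -1)]

transforms = {
    "a": lambda x1, x2: (x1 ** 2, x2 ** 2),
    "b": lambda x1, x2: (1 + x1, 1 - x2),
    "c": lambda x1, x2: (x1 * x2, -x1 * x2),
    "d": lambda x1, x2: ((x1 - x2) ** 2, (x1 + x2) ** 2),
}

def find_separator(transform, grid=range(-2, 3)):
    """Return the first (w1, w2, b) on the grid that classifies all points, else None."""
    points = [(*transform(x1, x2), y) for x1, x2, y in xor_data]
    for w1, w2, b in product(grid, repeat=3):
        predictions_match = all(
            (1 if w1 * p1 + w2 * p2 + b > 0 else -1) == y
            for p1, p2, y in points
        )
        if predictions_match:
            return (w1, w2, b)
    return None

separators = {name: find_separator(t) for name, t in transforms.items()}
```

Option (a) maps all four points to (1, 1), and option (b) only shifts and flips the original XOR layout, which is why no separator exists for either.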

3. What is the effect of using the activation function f(x) = x for hidden layers in an ANN?

(a) No effect. It’s as good as any other activation function (sigmoid, tanh, etc.).
(b) The ANN is equivalent to doing multi-output linear regression.
(c) Backpropagation will not work.
(d) We can model highly complex non-linear functions.

Sol. (b)
Ref. lecture
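
A minimal sketch of why the network collapses to a linear model, assuming a two-layer network without biases and made-up weights (`W1`, `W2` and the input `x` are hypothetical values, not from the assignment). With f(x) = x the hidden layer is just a matrix product, so the two weight matrices can be merged into one.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Hypothetical weights: 3 inputs -> 2 hidden units -> 2 outputs
W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
W2 = [[2.0, -1.0], [1.0, 4.0]]
x = [[1.0, -2.0, 0.5]]  # a single input row

hidden = matmul(x, W1)                 # identity activation: f(h) = h
two_layer_output = matmul(hidden, W2)

W_combined = matmul(W1, W2)            # one equivalent linear map
single_layer_output = matmul(x, W_combined)
```

Both computations give the same two outputs, which is the multi-output linear regression form of option (b).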

4. Which of the following functions can be used on the last layer of an ANN for classification?

(a) Softmax
(b) Sigmoid
(c) Tanh
(d) Linear
Sol. (a), (b), (c)
Ref. lecture
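
As a rough illustration of how these output functions might be used, the sketch below applies each one to made-up scores. The scores and the ±1 label convention for tanh are assumptions for the example, not part of the assignment.

```python
import math

def softmax(scores):
    """Map a list of scores to probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Map a single score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Multi-class: softmax over hypothetical scores, predict the most probable class
class_probs = softmax([2.0, 0.5, -1.0])
predicted_class = class_probs.index(max(class_probs))

# Two-class: sigmoid gives the probability of the positive class
positive_prob = sigmoid(0.8)

# Two-class with labels -1/+1: tanh output in (-1, 1), thresholded at 0
tanh_label = 1 if math.tanh(-0.3) > 0 else -1
```

All three squash the output into a bounded range that can be mapped to class labels, whereas a linear output is unbounded.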

5. Statement: Threshold function cannot be used as activation function for hidden layers.
Reason: Threshold functions do not introduce non-linearity.

(a) Statement is true and reason is false.
(b) Statement is false and reason is true.
(c) Both are true and the reason explains the statement.
(d) Both are true and the reason does not explain the statement.
Sol. (a)
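The statement is true because the threshold function is non-differentiable at the threshold
and has zero gradient everywhere else, so we cannot compute a useful gradient for
backpropagation. The reason is false because the threshold function is itself non-linear.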
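
A small numerical sketch of both points, assuming a threshold at 0 with outputs 0 and 1 (the assignment does not fix these values). The helper names are illustrative.

```python
import math

def threshold(x):
    """Step function: 1 for x >= 0, else 0 (assumed convention)."""
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def numerical_derivative(f, x, eps=1e-4):
    """Central-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

sample_points = (-2.0, -0.5, 0.5, 2.0)  # all away from the jump at 0
threshold_grads = [numerical_derivative(threshold, x) for x in sample_points]
sigmoid_grads = [numerical_derivative(sigmoid, x) for x in sample_points]

# Non-linearity check: a linear function would satisfy f(1 + 1) == f(1) + f(1)
is_additive = threshold(1.0 + 1.0) == threshold(1.0) + threshold(1.0)
```

The threshold gradients are all zero, so backpropagation receives no signal, while the sigmoid gradients are positive. The failed additivity check shows that the threshold function is not linear.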

6. We use several techniques to ensure the weights of the neural network are small (such as
random initialization around 0 or regularisation). What conclusions can we draw if weights of
our ANN are high?
(a) Model has overfitted.
(b) It was initialized incorrectly.
(c) At least one of (a) or (b).
(d) None of the above.
Sol. (d)
High weights can be a symptom of overfitting, but the two do not always go together, so neither (a) nor (b) follows from high weights alone.

7. On different initializations of your neural network, you get significantly different values of loss.
What could be the reason for this?
(a) Overfitting
(b) Some problem in the architecture
(c) Incorrect activation function
(d) Multiple local minima
Sol. (d)
Ref. lecture
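
A minimal sketch of this effect, assuming a one-dimensional non-convex loss x^4 − 3x^2 + x chosen only because it has two local minima. The function, step size, and starting points are illustrative and not from the assignment.

```python
def loss(x):
    """Toy non-convex loss with two local minima."""
    return x ** 4 - 3 * x ** 2 + x

def loss_gradient(x):
    return 4 * x ** 3 - 6 * x + 1

def train(x_init, step_size=0.01, num_steps=2000):
    """Plain gradient descent from x_init; returns the final x and its loss."""
    x = x_init
    for _ in range(num_steps):
        x -= step_size * loss_gradient(x)
    return x, loss(x)

x_left, loss_left = train(-2.0)    # settles in the minimum at negative x
x_right, loss_right = train(2.0)   # settles in the minimum at positive x
```

The two runs use the same loss and optimizer but end in different minima with clearly different loss values, purely because of where they started.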

8. The likelihood L(θ|X) is given by:

(a) P(θ|X)
(b) P(X|θ)
(c) P(X) · P(θ)
(d) P(θ) / P(X)

Sol. (b)
Ref. lecture

9. You are trying to estimate the probability of it raining today using maximum likelihood
estimation. Given that in n days, it rained nr times, what is the probability of it raining today?
(a) nr / n
(b) nr / (nr + n)
(c) n / (nr + n)
(d) None of the above.

Sol. (a)
The question follows the same idea as the coin example discussed in class.
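
The estimate can be derived in a few lines, assuming each day is an independent Bernoulli trial with rain probability p (the symbol p and the independence assumption are introduced here for the derivation; n_r stands for nr):

```latex
\begin{align*}
L(p \mid X) &= P(X \mid p) = p^{n_r}\,(1-p)^{\,n-n_r} \\
\log L(p \mid X) &= n_r \log p + (n-n_r)\log(1-p) \\
\frac{d}{dp}\log L(p \mid X) &= \frac{n_r}{p} - \frac{n-n_r}{1-p} = 0
\;\Longrightarrow\; \hat{p}_{\mathrm{MLE}} = \frac{n_r}{n}
\end{align*}
```
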
10. Choose the correct statement (multiple may be correct):
(a) MLE is a special case of MAP when prior is a uniform distribution.
(b) MLE acts as regularisation for MAP.
(c) MLE is a special case of MAP when prior is a beta distribution.
(d) MAP acts as regularisation for MLE.
Sol. (a), (d)
Ref. lecture
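
A minimal sketch of both correct options, reusing the rain setting from question 9 and assuming a Beta(alpha, beta) prior on the rain probability. The counts and prior parameters are made up for illustration. The MAP estimate used here is the mode of the Beta posterior, (nr + alpha − 1) / (n + alpha + beta − 2).

```python
def mle_estimate(n_rain, n_days):
    """Maximum likelihood estimate of the rain probability."""
    return n_rain / n_days

def map_estimate(n_rain, n_days, alpha, beta):
    """Posterior mode under a Beta(alpha, beta) prior (assumes alpha, beta >= 1)."""
    return (n_rain + alpha - 1) / (n_days + alpha + beta - 2)

n_rain, n_days = 3, 4  # hypothetical observations

mle = mle_estimate(n_rain, n_days)
map_uniform = map_estimate(n_rain, n_days, alpha=1, beta=1)      # Beta(1, 1) is uniform
map_informative = map_estimate(n_rain, n_days, alpha=5, beta=5)  # prior centred on 0.5
```

With the uniform prior the MAP estimate equals the MLE. With the informative prior the estimate is pulled from 0.75 towards 0.5, which is the regularising effect described in option (d). A beta prior in general does not reduce MAP to MLE; only the uniform special case Beta(1, 1) does.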

Common questions

The probability of an event such as rain can be estimated with MLE by dividing the number of times the event occurred (nr) by the total number of observations (n), giving the estimate nr/n. This is the same frequency-based reasoning as in the coin-flipping example.

Different initializations can lead to different loss values because the loss landscape has multiple local minima. The starting point strongly influences the path the optimizer takes, so different runs can settle in different local minima. Better initialization methods and optimization techniques may help alleviate this issue.

MLE is a special case of MAP when the prior is a uniform distribution. MAP uses prior information to adjust the estimate, whereas MLE does not consider priors. With a non-informative (uniform) prior, MAP and MLE therefore yield the same result, and MLE acts as the baseline without any regularization effect.

If the step size in gradient descent is too large, the model will not converge. Instead of gradually approaching the minimum, the updates can overshoot it repeatedly and never settle.

For the final layer of an ANN used for classification, suitable functions include softmax, sigmoid, and tanh. Softmax and sigmoid map the logits to values that can be read as probabilities, while tanh gives a bounded output in (-1, 1) that can be thresholded for two-class labels.

High weights in a neural network do not necessarily indicate overfitting. High weights can accompany overfitting, but the two are not always associated. Overfitting may have other causes, and high weights might also result from improper initialization or architecture.

Threshold functions are non-differentiable and therefore unsuitable as activation functions in hidden layers of neural networks. Backpropagation relies on gradients to update the weights, and a threshold function provides no usable gradient, which prevents effective training of the network.

The XOR problem can be made linearly separable by using the transformations X1′ = X1 X2, X2′ = −X1 X2 or X1′ = (X1 − X2)^2, X2′ = (X1 + X2)^2. This shows that non-linear relationships in data can be handled by feature transformations under which the classes become linearly separable.

Using the activation function f(x) = x in hidden layers of an ANN makes the network equivalent to performing multi-output linear regression. Such a linear activation does not introduce non-linearity into the model, which limits the network's ability to capture complex patterns in the data.

In Bayesian inference, the likelihood represents how plausible the observed data is under different parameter values. The likelihood function L(θ|X) is defined as P(X|θ), the probability of the data given the parameters. It guides the updating of beliefs about the parameters in light of observed data.
