Activation Functions in Neural Networks
Under maximum likelihood estimation (MLE), the probability of an event such as rain is estimated by dividing the number of times the event occurred (nr) by the total number of observations (n), giving the estimate nr/n. This mirrors the coin-flipping example: the estimate is simply the observed relative frequency.
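The frequency estimate can be sketched in a few lines of Python. The function name `mle_probability` and the ten-day record are hypothetical, chosen only for illustration.

```python
def mle_probability(observations):
    """Return nr / n: the fraction of observations in which the event occurred."""
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")
    n_r = sum(1 for occurred in observations if occurred)
    return n_r / n


# Hypothetical record of 10 days; True means it rained on that day.
days = [True, False, False, True, True, False, False, False, True, False]
p_rain = mle_probability(days)
print(p_rain)  # 4 rainy days out of 10 -> 0.4
```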
Different initializations can lead to different final loss values because the loss landscape has multiple local minima. The starting point strongly influences the path the optimizer takes, so runs started from different points may settle in different local minima. Better initialization schemes and optimization techniques may help alleviate this issue.
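As a rough illustration, the sketch below runs plain gradient descent on a one-dimensional toy loss with two local minima. The loss f(x) = x^4 - 3x^2 + x, the learning rate, and the two starting points are assumptions made for this example, not values from the original material.

```python
def loss(x):
    # Toy non-convex loss with two local minima (assumed for illustration).
    return x**4 - 3 * x**2 + x


def grad(x):
    # Analytic derivative of the toy loss.
    return 4 * x**3 - 6 * x + 1


def descend(x0, lr=0.01, steps=500):
    """Plain gradient descent from the starting point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


x_left = descend(-2.0)   # settles in the minimum on the negative side
x_right = descend(2.0)   # settles in the minimum on the positive side
print(x_left, loss(x_left))
print(x_right, loss(x_right))
```

The two runs use the same loss and the same optimizer, yet they end at different points with different loss values, purely because of where they started.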
MLE is a special case of MAP in which the prior is a uniform distribution. MAP weights the likelihood by a prior, so prior information can shift the estimate; MLE ignores priors altogether. When the prior is non-informative (uniform), MAP and MLE therefore yield the same estimate, and MLE acts as a baseline free of the regularizing effect of a prior.
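A small sketch for a Bernoulli parameter makes the relationship concrete. It assumes a Beta(alpha, beta) prior, for which the MAP estimate is the mode of the Beta posterior; the counts and prior parameters below are hypothetical.

```python
def mle_bernoulli(k, n):
    """MLE for a Bernoulli parameter: k successes in n trials."""
    return k / n


def map_bernoulli(k, n, alpha, beta):
    """MAP estimate under a Beta(alpha, beta) prior.

    This is the mode of the Beta(k + alpha, n - k + beta) posterior,
    valid when both posterior parameters are greater than 1.
    """
    return (k + alpha - 1) / (n + alpha + beta - 2)


k, n = 7, 10                               # hypothetical data
mle = mle_bernoulli(k, n)
map_uniform = map_bernoulli(k, n, 1, 1)    # Beta(1, 1) is the uniform prior
map_informed = map_bernoulli(k, n, 5, 5)   # prior belief centered on 0.5
print(mle, map_uniform, map_informed)
```

With the uniform prior the MAP estimate coincides with the MLE, while the Beta(5, 5) prior pulls the estimate toward 0.5.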
If the step size in gradient descent is too large, the model may fail to converge. Instead of gradually approaching the minimum, each update can overshoot it, so the parameters oscillate or diverge and never settle at the minimum.
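The overshooting behavior can be seen on the simplest possible loss, f(x) = x^2, whose gradient is 2x. The learning rates 0.1 and 1.1 below are assumed values picked to contrast the two regimes.

```python
def gradient_descent(lr, x0=1.0, steps=20):
    """Minimize f(x) = x^2 with a fixed step size lr."""
    x = x0
    for _ in range(steps):
        # Gradient of x^2 is 2x, so each step multiplies x by (1 - 2 * lr).
        x = x - lr * 2 * x
    return x


small_step = gradient_descent(lr=0.1)  # factor 0.8 per step: shrinks toward 0
large_step = gradient_descent(lr=1.1)  # factor -1.2 per step: overshoots and grows
print(small_step, large_step)
```

With the small step the iterate shrinks toward the minimum at 0. With the large step every update jumps past the minimum to the other side and lands farther away than before.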
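For the final layer of an ANN used for classification, appropriate functions include softmax (multi-class) and sigmoid (binary or multi-label). These functions are preferred because they convert the logits into values that can be read as probabilities, which is crucial for classification tasks. Tanh is also non-linear, but its outputs lie in (-1, 1), so it does not produce probabilities directly and is better suited to hidden layers.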
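The following sketch shows how output-layer functions map logits to values. The logits are hypothetical; the max-subtraction in `softmax` is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Map a single logit to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


logits = [2.0, 1.0, -1.0]      # hypothetical scores for three classes
probs = softmax(logits)
p_binary = sigmoid(0.0)        # a logit of 0 corresponds to probability 0.5
tanh_out = math.tanh(-2.0)     # tanh ranges over (-1, 1) and can be negative
print(probs, sum(probs), p_binary, tanh_out)
```

The softmax outputs are all positive and sum to 1, and the sigmoid output lies in (0, 1), so both can be read as probabilities. A tanh output can be negative, which a probability cannot be.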
Large weights in a neural network do not necessarily indicate overfitting. Large weights can accompany overfitting, but the two are not always directly linked: overfitting may have other causes, and large weights can also result from improper initialization or from the architecture.
Threshold (step) functions are unsuitable as activation functions in the hidden layers of neural networks. They are non-differentiable at the threshold and have zero gradient everywhere else. Since backpropagation relies on gradients to update the weights, no useful gradient signal is available, which prevents effective training of the network.
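A quick numerical check illustrates the gradient problem. The helper `numerical_grad` is a hypothetical central-difference approximation used only for this comparison, with a sigmoid included as a differentiable reference.

```python
import math


def step(x):
    """Threshold activation: 1 at or above the threshold 0, else 0."""
    return 1.0 if x >= 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def numerical_grad(f, x, eps=1e-6):
    """Central-difference approximation of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


step_grad = numerical_grad(step, 0.5)        # flat region: no gradient signal
sigmoid_grad = numerical_grad(sigmoid, 0.5)  # smooth function: usable gradient
jump_grad = numerical_grad(step, 0.0)        # at the jump the estimate blows up
print(step_grad, sigmoid_grad, jump_grad)
```

Away from the threshold the step function gives a gradient of zero, so a weight update based on it would not move the weights. At the threshold itself the finite-difference estimate grows without bound as `eps` shrinks, reflecting that no derivative exists there.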
The XOR problem can be made linearly separable by using the transformations X'1 = X1X2, X'2 = -X1X2 or X'1 = (X1 - X2)^2, X'2 = (X1 + X2)^2. The product features separate the classes when the inputs are encoded as -1/+1, while the squared features also work for 0/1 inputs. This shows that nonlinear relationships in data can be handled by transformations that expose linearly separable features.
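The sketch below checks both transformations on the four XOR points. The decision thresholds (0.5 on the first squared feature, 0 on the product feature) and the input encodings are assumptions chosen for this example.

```python
def xor_label(a, b):
    """True XOR label for two binary inputs given as booleans."""
    return int(a != b)


# Squared features on 0/1 inputs: X'1 = (X1 - X2)^2, X'2 = (X1 + X2)^2.
squared_ok = True
for x1 in (0, 1):
    for x2 in (0, 1):
        f1, f2 = (x1 - x2) ** 2, (x1 + x2) ** 2
        # A single linear threshold on the first new feature is enough.
        pred = int(f1 > 0.5)
        squared_ok = squared_ok and pred == xor_label(x1 == 1, x2 == 1)

# Product features on -1/+1 inputs: X'1 = X1 * X2, X'2 = -X1 * X2.
product_ok = True
for x1 in (-1, 1):
    for x2 in (-1, 1):
        f1, f2 = x1 * x2, -x1 * x2
        # The product is -1 exactly when the two inputs differ.
        pred = int(f1 < 0)
        product_ok = product_ok and pred == xor_label(x1 == 1, x2 == 1)

print(squared_ok, product_ok)
```

In both cases a threshold on one transformed coordinate, which is a linear decision boundary in the new feature space, classifies all four points correctly.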
Using the activation function f(x) = x in the hidden layers of an ANN makes the network (assuming a linear output layer) equivalent to multi-output linear regression. A composition of linear layers is itself a single linear map, so the identity activation introduces no non-linearity and the network cannot capture complex, non-linear patterns in the data.
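The collapse of stacked linear layers into one can be verified directly. The weight matrices and input below are hypothetical, and biases are omitted to keep the sketch short; with biases the same argument holds for affine maps.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


W1 = [[1, 2], [0, -1], [3, 1]]   # hidden layer: 2 inputs -> 3 units
W2 = [[2, 0, 1], [-1, 1, 0]]     # output layer: 3 units -> 2 outputs
x = [4, -2]

# Two-layer network with identity activation f(h) = h in the hidden layer.
two_layer = matvec(W2, matvec(W1, x))

# Equivalent single linear map with weights W2 @ W1.
W = matmul(W2, W1)
one_layer = matvec(W, x)

print(two_layer, one_layer)  # both are [10, 2]
```

Whatever the depth, the network computes a single matrix product, which is exactly the form of a multi-output linear regression.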
In Bayesian inference, the likelihood measures how plausible the observed data are under different parameter values. The likelihood function L(θ|X) is defined as P(X|θ), the probability of the data given the parameters, viewed as a function of θ with the data X held fixed. It drives the updating of beliefs about the parameters in light of the observed data.
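To make the "function of θ" view concrete, the sketch below evaluates a Bernoulli likelihood on a grid of parameter values. The data (7 successes in 10 trials) and the grid resolution are hypothetical choices for this example.

```python
def bernoulli_likelihood(theta, k, n):
    """L(theta | X) = P(X | theta) for k successes in n independent trials.

    The binomial coefficient is omitted because it does not depend on theta.
    """
    return theta**k * (1 - theta) ** (n - k)


k, n = 7, 10                                # hypothetical observed data
grid = [i / 100 for i in range(101)]        # candidate values of theta
likelihoods = [bernoulli_likelihood(t, k, n) for t in grid]

# The most plausible parameter value on the grid is the one with the highest likelihood.
best_theta = grid[likelihoods.index(max(likelihoods))]
print(best_theta)
```

The data stay fixed while θ varies, and the curve peaks at the observed relative frequency k/n, which ties the likelihood back to the frequency-based estimate.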