Interview Questions

The document provides a comprehensive overview of machine learning and deep learning concepts, including model evaluation metrics, handling overfitting, and the differences between various algorithms. It covers topics such as activation functions, regression techniques, and the importance of feature transformation. Additionally, it discusses the use of different models and methods for classification, clustering, and dimensionality reduction.

Uploaded by

marwanabbas418
Copyright
© All Rights Reserved
Model Answers

Machine Learning
Q1
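By examining the learned weights, we can gain insight into how the model represents the
data. In linear models, each weight gives the change in the prediction per unit change in its
feature, so weight magnitudes reflect feature importance when the features are on comparable scales:
\[ \hat{y} = w^\top x + b \]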

Q2
The normal equation provides a direct, closed-form solution for the optimal weights in linear
regression:
\[ w = (X^\top X)^{-1} X^\top y \]
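
A minimal NumPy sketch of the normal equation, assuming a small noise-free toy dataset and a design matrix that already contains a bias column. The helper name `fit_normal_equation` is illustrative only, and the code solves the linear system rather than forming the inverse explicitly.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X^T X) w = X^T y for w; X must already include a bias column."""
    # Solving the system avoids computing the matrix inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data generated from y = 2x + 1 (an assumption, for illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # first column is the bias term
y = 2.0 * x + 1.0

w = fit_normal_equation(X, y)  # w[0] is the intercept, w[1] the slope
print(w)
```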

Q3
A poorly chosen model generally results in large prediction errors and fails to predict the
target accurately.

Q4
$R^2$ measures how much variance in the data is explained by the model:
\[ R^2 = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \]
Mean Squared Error (MSE) measures the average squared prediction error:
\[ \mathrm{MSE} = \frac{1}{n} \sum (y - \hat{y})^2 \]
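
Both metrics can be computed directly from their definitions. The sketch below uses made-up targets and predictions, and the function names `mse` and `r2` are illustrative.

```python
import numpy as np

def mse(y, y_hat):
    """Average squared prediction error."""
    return np.mean((y - y_hat) ** 2)

def r2(y, y_hat):
    """1 minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical targets and predictions.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 3.0, 3.5])

print(mse(y, y_hat), r2(y, y_hat))
```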

Q5
Non-linear relationships can be handled by feature transformation. For example:
\[ z = x^2 \]
The model becomes linear in z:
\[ y = wz + b \]
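
As a small sketch of this idea, the code below fits an ordinary least-squares line to the transformed feature z. The quadratic toy data (y = 3x² + 1) is an assumption chosen so the fit is exact.

```python
import numpy as np

# Toy data that is non-linear in x but linear in z = x**2.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 3.0 * x**2 + 1.0

z = x**2                                    # feature transformation
Z = np.column_stack([z, np.ones_like(z)])   # columns: z and a bias term

# Ordinary least squares on the transformed feature.
w, b = np.linalg.lstsq(Z, y, rcond=None)[0]
print(w, b)
```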

Q6
Overfitting occurs when the model fits training data very well but fails on unseen data.
Underfitting happens when the model performs poorly on both training and test sets.

Q7
Overfitting can be reduced using regularization:
L2 (Ridge): \[ \lambda \sum w^2 \]
or by increasing training data.
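
One way to see the effect of the L2 penalty is the closed-form ridge solution, w = (XᵀX + λI)⁻¹Xᵀy. The sketch below assumes synthetic data, no intercept term, and an illustrative helper name `fit_ridge`; it prints the weight norm for a few values of λ.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data (an assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

# Larger lam pulls the weights toward zero.
for lam in (0.0, 1.0, 100.0):
    w = fit_ridge(X, y, lam)
    print(lam, np.linalg.norm(w))
```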

Q8
Lasso regression (L1) drives some weights exactly to zero:
\[ \lambda \sum |w| \]
Ridge regression (L2) shrinks weights without eliminating them:
\[ \lambda \sum w^2 \]
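
The difference can be illustrated on a simplified one-dimensional problem. Assuming each weight is fitted independently by minimizing 0.5·(w − w_ols)² plus the penalty, the L1 solution is a soft-threshold of the unpenalized weight and the L2 solution is a rescaling of it. The function names and example values are hypothetical.

```python
import numpy as np

def lasso_shrink(w_ols, lam):
    """Minimizer of 0.5*(w - w_ols)**2 + lam*|w| (soft-thresholding)."""
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

def ridge_shrink(w_ols, lam):
    """Minimizer of 0.5*(w - w_ols)**2 + lam*w**2 (uniform rescaling)."""
    return w_ols / (1.0 + 2.0 * lam)

w_ols = np.array([3.0, -0.4, 0.2, -2.0])  # hypothetical unpenalized weights
lam = 0.5

w_l1 = lasso_shrink(w_ols, lam)  # small weights become exactly zero
w_l2 = ridge_shrink(w_ols, lam)  # every weight shrinks, none reaches zero
print(w_l1, w_l2)
```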

Q9
Mutual Information measures dependency between feature X and target Y:
\[ I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \]
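
For discrete variables the sum can be evaluated directly from a joint probability table. In this sketch the function name `mutual_information` and the two example tables are assumptions; zero-probability cells are skipped, and the natural logarithm is used, so the result is in nats.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) from a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x), shape (rows, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, cols)
    mask = joint > 0                        # cells with p(x, y) = 0 contribute nothing
    return np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask]))

independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells us nothing about Y
dependent = [[0.5, 0.0], [0.0, 0.5]]        # X determines Y

print(mutual_information(independent), mutual_information(dependent))
```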

Q10
The sigmoid activation function maps any real value into the open interval (0, 1):
\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

Q11
The Receiver Operating Characteristic (ROC) curve is a classification evaluation tool that
illustrates the trade-off between the True Positive Rate (TPR) and the False Positive Rate
(FPR) at different decision thresholds. It helps assess how well a model distinguishes between
classes. Because TPR and FPR are each computed within a single class, the curve does not
depend on the class ratio.
The True Positive Rate (also known as Recall) is defined as:
\[ \mathrm{TPR} = \frac{TP}{TP + FN} \]
The False Positive Rate is defined as:
\[ \mathrm{FPR} = \frac{FP}{FP + TN} \]
The Area Under the ROC Curve (AUC) provides a single scalar value that summarizes
the model’s performance. An AUC of 1 indicates perfect classification, while an AUC of 0.5
corresponds to random guessing.
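
The definitions above can be sketched in NumPy. The labels and scores are made up, and the helper names are illustrative. The AUC is computed here through its pairwise interpretation (the fraction of positive/negative pairs in which the positive example gets the higher score, with ties counted as one half), not by integrating the curve.

```python
import numpy as np

def tpr_fpr(y_true, scores, threshold):
    """TPR and FPR when predicting positive for scores >= threshold."""
    pred = scores >= threshold
    pos = y_true == 1
    tp = np.sum(pred & pos)
    fn = np.sum(~pred & pos)
    fp = np.sum(pred & ~pos)
    tn = np.sum(~pred & ~pos)
    return tp / (tp + fn), fp / (fp + tn)

def auc_pairwise(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return np.mean((diff > 0) + 0.5 * (diff == 0))

# Hypothetical labels and classifier scores.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

print(tpr_fpr(y_true, scores, threshold=0.5))
print(auc_pairwise(y_true, scores))
```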

Q12
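Dimensionality reduction can be achieved using PCA, which projects the centered data onto
the top k principal components (the columns of W):

\[ X_{\text{reduced}} = XW \]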
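
A minimal sketch of this projection, assuming the data is centered first and that W holds the top k right singular vectors of the centered data. The helper name `pca_reduce` and the random data are illustrative.

```python
import numpy as np

def pca_reduce(X, k):
    """Project X onto its top-k principal directions."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:k].T                     # shape (n_features, k)
    return X_centered @ W, W

# Random data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

X_reduced, W = pca_reduce(X, k=2)
print(X_reduced.shape)
```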

Q13
Multiclass classification can be solved using one-vs-rest binary logistic classifiers.

Q14
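The elbow method (plotting validation error against k) or bias–variance analysis can be used
to select k in KNN: a small k gives low bias but high variance, and a large k gives the opposite.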

Q15
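Hard margin requires every training point to be classified correctly and to lie outside the
margin, which is only possible for linearly separable data and can lead to overfitting and higher
sensitivity to noise, while soft margin tolerates some margin violations and misclassifications.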

Q16
The bias–variance trade-off describes the effect of model complexity: reducing complexity
lowers variance (and with it overfitting) but increases bias, which can lead to underfitting,
while more complex models reduce bias but increase variance.

Q17
Random Forest improves performance by aggregating multiple decision trees:
\[ \hat{y} = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t \]

Q18
For imbalanced data, suitable metrics include Precision, Recall, F1-score, and AUC.
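
To illustrate why these metrics matter, the sketch below uses hypothetical confusion-matrix counts for a dataset with 10 positives and 990 negatives. The function name and the counts are assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 10 positives, 990 negatives.
tp, fp, fn, tn = 5, 5, 5, 985

accuracy = (tp + tn) / (tp + fp + fn + tn)   # high despite missing half the positives
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(accuracy, precision, recall, f1)
```

Accuracy looks strong here only because negatives dominate; precision, recall, and F1 expose the weak performance on the positive class.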

Q19
Boosting trains models sequentially, where each new model focuses more on the samples
with higher errors from previous models, improving performance step by step. Because each
model depends on the previous ones, boosting cannot be trained in parallel the way bagging can,
and it mainly reduces bias, whereas bagging mainly reduces variance. Bagging trains multiple
models (usually decision trees) on different bootstrap samples of the data (with replacement)
and combines their predictions by averaging or voting. Stacking combines different types of
models by training a meta-model on their outputs to make the final prediction.

Q20
In bagging, each model is trained on a bootstrap sample drawn with replacement.
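
A bootstrap sample can be sketched by drawing row indices with replacement. The dataset size `n` below is an arbitrary assumption. For large n, roughly 1 − 1/e ≈ 63% of the original rows are expected to appear in each sample, and the rest are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # hypothetical dataset size

# Draw n row indices with replacement: some rows repeat, others are left out.
indices = rng.integers(0, n, size=n)

unique_fraction = np.unique(indices).size / n
print(unique_fraction)
```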

Q21
Silhouette score measures how close a point is to other points in its own cluster compared
to points in other clusters. A higher value indicates better clustering. Davies–Bouldin
index measures how similar clusters are to each other. Lower values indicate better cluster
separation.

Q22
DBSCAN works by grouping points that are close to each other based on a distance threshold
and a minimum number of points; it forms dense regions as clusters and labels sparse points
as noise.

Q23
A high learning rate usually causes poor validation performance because each update
overshoots the optimal weights; the loss oscillates or diverges, leading to instability and
failure to generalize.
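
The overshooting effect can be seen on a one-dimensional toy objective. The sketch assumes f(w) = w², whose minimum is at w = 0, and two illustrative learning rates.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Plain gradient descent on f(w) = w**2, starting from w0."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w        # derivative of w**2
        w = w - lr * grad
    return w

w_small = gradient_descent(lr=0.1)  # each step multiplies w by 0.8, so w shrinks
w_large = gradient_descent(lr=1.1)  # each step multiplies w by -1.2, so |w| grows
print(w_small, w_large)
```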

Q24
With Stochastic Gradient Descent, the training accuracy per epoch will fluctuate and look
noisy instead of smoothly increasing because updates are based on individual samples or
small batches.

Q25
Dimensionality reduction can be achieved using PCA.

Q26
MSE is very sensitive to outliers because it squares the errors, so a few large errors can
dominate the loss and distort training:
\[ \mathrm{MSE} = \frac{1}{n} \sum (y - \hat{y})^2 \]
MAE is less sensitive to outliers but has a constant gradient, which makes optimization
slower and less stable, especially near the minimum:
\[ \mathrm{MAE} = \frac{1}{n} \sum |y - \hat{y}| \]
We can solve these problems by using loss functions like Huber Loss, which behaves like
MSE for small errors and like MAE for large errors, or by handling outliers through data
preprocessing.
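
A sketch of the Huber loss in its common form, with a factor of one half on the quadratic branch and an assumed threshold `delta = 1.0`. The example predictions, which include one large outlier, are made up.

```python
import numpy as np

def huber(y, y_hat, delta=1.0):
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it."""
    err = y - y_hat
    abs_err = np.abs(err)
    quadratic = 0.5 * err**2
    linear = delta * (abs_err - 0.5 * delta)
    return np.mean(np.where(abs_err <= delta, quadratic, linear))

# Hypothetical targets and predictions; the last prediction is an outlier.
y = np.array([0.0, 0.0, 0.0, 0.0])
y_hat = np.array([0.5, -0.5, 0.0, 10.0])

mse = np.mean((y - y_hat) ** 2)     # dominated by the single outlier
mae = np.mean(np.abs(y - y_hat))
print(mse, mae, huber(y, y_hat))
```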

Q27
An AUC of 0.5 indicates random guessing.

Q28
In d dimensions, the decision boundary of a linear classifier is a hyperplane of dimension d − 1.

Q29
First, perform EDA to understand the data, identify useful features, check whether the
relationships are linear or non-linear, determine preprocessing needs, and detect class
imbalance. Then train the candidate models suggested by this step, compare them using
appropriate metrics, and either choose the best one or combine them with an ensemble method
like voting.

Deep Learning
Q1
The sigmoid function causes the vanishing gradient problem because its gradients become
very small for large positive or negative inputs, which makes learning very slow in deep
networks. It is also not zero-centered, so gradient updates can be inefficient, and it saturates
easily, causing neurons to stop learning.
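
The size of the sigmoid gradient can be checked numerically using the identity σ'(z) = σ(z)(1 − σ(z)), which peaks at 0.25 when z = 0. The ten-layer product below ignores the weight terms in the chain rule and is only a rough illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)     # largest possible value of the derivative
saturated = sigmoid_grad(10.0)   # nearly zero for a large input

# Product of ten best-case sigmoid derivatives, one per layer.
ten_layer_factor = max_grad ** 10
print(max_grad, saturated, ten_layer_factor)
```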

Q2
The main problem of ReLU is the “dying ReLU” issue: if a neuron's pre-activation becomes
negative for all inputs (for example after a large weight update), it outputs zero everywhere,
its gradient becomes zero, and it stops learning. ReLU can also be sensitive to large learning
rates, which may push many neurons into this inactive state. This can be mitigated by using
variants like Leaky ReLU or Parametric ReLU, which allow a small negative slope, or by using
proper weight initialization and smaller learning rates.
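
The difference in gradients can be sketched as follows. The slope `alpha = 0.01` is a commonly used value assumed here for illustration, and the input values are made up.

```python
import numpy as np

def relu_grad(z):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    """Gradient of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 2.0])  # hypothetical pre-activations

print(relu_grad(z))        # no gradient flows for the negative inputs
print(leaky_relu_grad(z))  # a small gradient still flows for them
```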

Q3
For regression tasks, the best activation function for the output layer is usually a linear
(identity) activation, because it allows the model to predict any real-valued number without
restriction.

Q4
The main difference between CNN and MLP is in how they handle data. An MLP connects
every neuron to all neurons in the next layer, so it ignores spatial structure and has a large
number of parameters. A CNN uses convolutional layers with shared weights and local
connections, which allows it to capture spatial patterns like edges and textures, makes it
more efficient, and works especially well for images and grid-like data.

Q5
We use convolution when the data has spatial or local structure, such as images, audio
signals, or time-series. Convolutions are useful because they focus on local patterns (like
edges in images), use shared weights which greatly reduce the number of parameters, and
preserve spatial relationships. This makes models more efficient, faster to train, and better
at generalizing compared to fully connected layers.

Q6
Convolution on images works by sliding a small matrix called a filter (or kernel) over the
image. At each position, the filter is placed on top of a small region of the image, the values
are multiplied element by element, and then summed to produce one number. This number
becomes a pixel in a new image called a feature map. By moving the filter across the whole
image, the network can detect local patterns such as edges, corners, or textures. Different
filters learn to detect different patterns.
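
The sliding-window procedure described above can be sketched with plain loops. The sketch assumes stride 1, no padding, and no kernel flip (the cross-correlation form used by most deep learning libraries). The image and the vertical-edge filter are hand-picked toy values.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide kernel over image (stride 1, no padding) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # multiply element-wise, then sum
    return out

# Toy image: dark left half, bright right half.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# Hand-picked filter that responds to a dark-to-bright vertical edge.
kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The feature map is non-zero only in the column where the window straddles the edge.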

Q7
Correct, but a large number of hidden layers can increase the risk of overfitting.

Q8
RNNs work by processing data step by step while keeping a memory of previous inputs. At
each time step, the RNN takes the current input and the hidden state from the previous
step, combines them, and produces a new hidden state. This hidden state carries information
from the past, which allows RNNs to model sequences like text, speech, or time-series data.
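
The recurrence can be sketched as h_t = tanh(W_x x_t + W_h h_{t-1} + b). The tanh non-linearity, the weight names, the layer sizes, and the random inputs are all assumptions made for illustration.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b, h0):
    """Run a simple RNN over a sequence and return every hidden state."""
    h = h0
    states = []
    for x_t in inputs:
        # Combine the current input with the previous hidden state.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5   # hypothetical sizes

W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)
inputs = rng.normal(size=(seq_len, input_size))

states = rnn_forward(inputs, W_x, W_h, b, h0=np.zeros(hidden_size))
print(states.shape)  # one hidden state per time step
```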

Q9
There is no absolute “better” one; it depends on the use case. PyTorch is generally preferred
for research and learning because it is more intuitive, uses dynamic computation graphs, and
is easier to debug. TensorFlow is often preferred in production and deployment because it
has strong tools for scaling, mobile and web deployment, and long-term production support.
In short, PyTorch is better for flexibility and experimentation, while TensorFlow is better
for large-scale and production-ready systems.

Q10
We use activation functions to introduce non-linearity into neural networks. Without
activation functions, the network would behave like a simple linear model no matter how
many layers it has. Activation functions allow the network to learn complex patterns, make
decisions, and approximate complicated relationships in data.
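
The collapse of stacked linear layers can be verified numerically. In the sketch below, with randomly chosen weights and layer sizes as assumptions, two layers without an activation produce the same output as a single layer with W = W2·W1 and b = W2·b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# Two layers with no activation function between them.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))
```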
