Predictive Analytics Exam Questions

Uploaded by

vidhu.cooky
Copyright
© All Rights Reserved

Predictive Analytics

BAZG512/MBAZG512/PDBAZG512
S2-24
EC3 – Comprehensive Exam (Regular)
Full Marks - 40

Q1. In KNN, discuss the impact of choosing a very small value for 'K' (e.g., K=1) versus a
very large value for 'K' (e.g., K close to the total number of training samples) on the model's
performance, touching upon potential issues like bias, variance, and sensitivity to noise.
(3 Marks)

Q2. Answer the following questions with respect to the Naive Bayes Classifier. (8 Marks)
(a) Clearly explain the "naive" assumption that is central to the Naive Bayes Classifier. (1 Mark)
(b) How does this "naive" assumption simplify the calculation of probabilities needed for
classification? (1 Mark)
(c) Despite this often unrealistic assumption, provide one reason why Naive Bayes can still
be an effective classifier in many real-world applications. (1 Mark)
(d) Imagine a simplified scenario where a Naive Bayes Classifier is used to determine if a
patient is likely to have a 'Flu' or 'No Flu', based on two symptoms: 'Fever' (Yes/No) and
'Cough' (Yes/No).
From historical data, the following probabilities have been estimated:
P(Fever=Yes | Flu) = 0.8
P(Cough=Yes | Flu) = 0.7
P(Flu) = 0.1 (Prior probability of having the Flu)
P(Fever=Yes | No Flu) = 0.1
P(Cough=Yes | No Flu) = 0.2
P(No Flu) = 0.9 (Prior probability of not having the Flu)
A new patient presents with both Fever=Yes AND Cough=Yes.
You want to calculate P(Flu | Fever=Yes, Cough=Yes) and P(No Flu | Fever=Yes,
Cough=Yes).
Without needing to calculate the final exact posterior probabilities, write down the
expressions for P(Flu | Fever=Yes, Cough=Yes) and P(No Flu | Fever=Yes, Cough=Yes)
using Bayes' theorem and the "naive" assumption. Which condition (Flu or No Flu)
appears more probable for this patient given the symptoms and priors, and briefly explain
your reasoning by comparing the numerators of these expressions. (5 Marks)
Q3.
In a retail environment, Principal Component Analysis (PCA) is applied to a dataset
containing customer behaviour data, focusing on aspects like online shopping frequency,
average order value, customer ratings, and subscription to promotional offers. The loading
vectors for the first two principal components (PC1 and PC2) are provided below. (7 marks)

Feature                       V1 (PC1)   V2 (PC2)
Online shopping frequency     0.6        0.2
Average order value           0.6        -0.3
Customer ratings              0.7        0.5
Subscription to promotions    0.3        0.8

Interpret the loading vectors concerning PC1 and PC2. Discuss the implications of a data
point with high PC1 and one with high PC2 regarding the customer segment it represents.
Provide insights into the similarities or differences among various characteristics of the
customers indicated by the features.

Q4. In the context of e-commerce, how can the K-means++ algorithm be applied to perform
customer segmentation and improve targeted marketing strategies? Explain which drawback
of the K-means algorithm is solved by K-means++ in this case. (4 marks)

Q5.
(4 + 4 = 8 marks)
ROC curves for four different machine learning techniques applied to an employee churn
prediction problem are given below. Comment on which model should be selected and why.
Consider the three threshold points marked in the figure as th1 (0.6), th2 (0.5) and th3 (0.4) on
the orange curve for the Random Forest model. Discuss the three threshold values and
their effect on the model result. Also, suggest your choice of threshold.
Q6.
Answer the following questions with respect to the following figure. (10 Marks)

a) Suppose the figure shows decision boundaries for a KNN model and a Logistic
Regression model applied to a 2D dataset. State which decision boundary (A or
B) belongs to which algorithm (KNN or Logistic Regression). Explain why. [4 Marks]
b) What function is denoted by the following equation? In which machine learning
algorithm (KNN or Logistic Regression) is it used? [1 Mark]

$$\frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2)}}$$

c) What would be the equation of the decision boundary for the ML algorithm referred
to in question b)? Explain [2 Marks]

d) With reference to question b), explain the function and why it is necessary in the
context of classification problems. [3 Marks]

Common questions

Using Bayes' theorem and the naive assumption of conditional independence, the posteriors given Fever=Yes and Cough=Yes are proportional to the following numerators: P(Flu | Fever=Yes, Cough=Yes) ∝ P(Fever=Yes | Flu) × P(Cough=Yes | Flu) × P(Flu) = 0.8 × 0.7 × 0.1 = 0.056, and P(No Flu | Fever=Yes, Cough=Yes) ∝ P(Fever=Yes | No Flu) × P(Cough=Yes | No Flu) × P(No Flu) = 0.1 × 0.2 × 0.9 = 0.018. Both expressions share the same denominator, P(Fever=Yes, Cough=Yes), so comparing the numerators is sufficient. Although the prior favours 'No Flu' (0.9 versus 0.1), the symptom likelihoods under 'Flu' are large enough to outweigh it: 0.056 > 0.018, so 'Flu' is the more probable condition for this patient.
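
The comparison of numerators can be checked with a few lines of Python. This is only a sketch that plugs in the probabilities given in Q2(d); the variable names are chosen here for illustration.

```python
# Probabilities copied from the exam question, Q2(d).
p_flu, p_no_flu = 0.1, 0.9
p_fever_given_flu, p_cough_given_flu = 0.8, 0.7
p_fever_given_no_flu, p_cough_given_no_flu = 0.1, 0.2

# Naive assumption: the joint likelihood is the product of per-symptom terms.
numerator_flu = p_fever_given_flu * p_cough_given_flu * p_flu
numerator_no_flu = p_fever_given_no_flu * p_cough_given_no_flu * p_no_flu

# The shared denominator P(Fever=Yes, Cough=Yes) is the sum of both numerators.
evidence = numerator_flu + numerator_no_flu
posterior_flu = numerator_flu / evidence
posterior_no_flu = numerator_no_flu / evidence

print(f"numerator (Flu)    = {numerator_flu:.3f}")
print(f"numerator (No Flu) = {numerator_no_flu:.3f}")
print(f"P(Flu | symptoms)    = {posterior_flu:.3f}")
print(f"P(No Flu | symptoms) = {posterior_no_flu:.3f}")
```

Normalising is optional for the decision itself, since dividing both numerators by the same evidence term does not change which one is larger.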

When selecting a machine learning model from ROC curves for employee churn prediction, compare the area under the curve (AUC), which measures how well a model separates the two classes across all thresholds. A larger AUC indicates better performance, so the model whose curve lies closest to the top-left corner is generally preferred. Evaluating specific threshold points such as th1 (0.6), th2 (0.5), and th3 (0.4) on a particular curve (here, the orange curve for the Random Forest model) shows the trade-off between the true positive rate and the false positive rate at each point. A good threshold balances sensitivity and specificity for the problem at hand. For instance, if false negatives are costly, a lower threshold may be preferred to capture more true positives, even at the expense of more false positives.
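
One way to make the AUC comparison concrete is its ranking interpretation: AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below assumes two hypothetical models with invented scores; none of these numbers come from the exam figure.

```python
def auc_by_pairs(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical churn scores, invented for illustration only.
model_a = {"leavers": [0.9, 0.8, 0.4], "stayers": [0.7, 0.3, 0.2]}
model_b = {"leavers": [0.6, 0.5, 0.4], "stayers": [0.7, 0.45, 0.3]}

auc_a = auc_by_pairs(model_a["leavers"], model_a["stayers"])
auc_b = auc_by_pairs(model_b["leavers"], model_b["stayers"])
print(f"AUC model A = {auc_a:.3f}")
print(f"AUC model B = {auc_b:.3f}")
```

On these invented scores model A ranks 8 of 9 pairs correctly and model B only 5 of 9, so model A would be preferred on AUC alone.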

The 'naive' assumption in the Naive Bayes classifier is that the features are conditionally independent given the class label. In other words, once the class is known, the value of one feature carries no information about the value of any other feature. This assumption simplifies probability calculations because the joint likelihood of the features becomes the product of the individual per-feature likelihoods, so only one conditional probability per feature value and class has to be estimated instead of a probability for every combination of feature values.

Choosing a very small value for 'K', such as K=1, makes KNN sensitive to noise and gives it high variance and low bias. The model effectively memorizes the training data, so a query is misclassified whenever its nearest neighbour happens to be a noisy or mislabelled point. Conversely, a very large 'K' produces a smoother decision boundary, which reduces variance but increases bias because subtle local patterns are averaged away. In the extreme, when 'K' is close to the number of training samples, almost every query is assigned the majority class, and the model underfits.
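
The effect of 'K' can be seen on a tiny example. The sketch below is a minimal 1-D KNN written for illustration; the helper `knn_predict` and the toy data (including one deliberately mislabelled point) are assumptions, not part of the exam.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to `query` (1-D)."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: class "A" lies in 0..3, class "B" in 8..10,
# plus one noisy "B" label sitting inside the "A" region at x = 1.5.
train = [(0.0, "A"), (1.0, "A"), (2.0, "A"), (3.0, "A"),
         (1.5, "B"), (8.0, "B"), (10.0, "B")]

for k in (1, 3, len(train)):
    near_noise = knn_predict(train, 1.6, k)
    in_b_region = knn_predict(train, 9.0, k)
    print(f"K={k}: query 1.6 -> {near_noise}, query 9.0 -> {in_b_region}")
```

With K=1 the query next to the noisy point inherits its wrong label, K=3 outvotes the noise, and K equal to the training set size predicts the majority class "A" everywhere, including deep inside the "B" region.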

The equation 1/(1 + e^(-(b0 + b1x1 + b2x2))) is the logistic (sigmoid) function, which maps any real number into the interval (0, 1). It is used in Logistic Regression, where it converts a linear combination of the input features into a value that can be interpreted as the probability of class membership. This mapping is what allows a linear model to be used for binary classification: the predicted probability is compared against a threshold to assign a class.
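
A short sketch of the function follows. The coefficient values are hypothetical, picked only so that the point (1, 1) lies exactly where b0 + b1x1 + b2x2 = 0; the code is not numerically hardened for very large negative inputs.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x1, x2, b0, b1, b2):
    """Probability of the positive class for a two-feature logistic model."""
    return sigmoid(b0 + b1 * x1 + b2 * x2)

# Hypothetical coefficients, chosen for illustration only.
b0, b1, b2 = -3.0, 1.0, 2.0

for x1, x2 in [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]:
    p = predict_proba(x1, x2, b0, b1, b2)
    label = 1 if p >= 0.5 else 0
    print(f"x=({x1}, {x2}) -> p={p:.3f}, class={label}")
```

The probability is exactly 0.5 where the linear term is zero, which is why a 0.5 threshold yields the straight-line decision boundary b0 + b1x1 + b2x2 = 0.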

For the customer behaviour dataset, PC1 loads positively on all four features and most heavily on online shopping frequency (0.6), average order value (0.6), and customer ratings (0.7), so a data point with a high PC1 score represents a customer who is engaged overall: shopping often, spending more, and rating highly. PC2 is dominated by 'Subscription to promotions' (0.8), with 'Customer ratings' (0.5) also contributing and 'Average order value' loading negatively (-0.3), so a high PC2 score points to customers who are particularly responsive to promotions, possibly discount seekers with smaller orders. This distinction can help retailers tailor marketing strategies to the purchasing profile and promotional responsiveness of each customer segment.
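
A principal component score is the dot product of a customer's standardised feature values with the loading vector. The sketch below uses the loadings exactly as printed in Q3; the two customers and their standardised values are hypothetical.

```python
FEATURES = ["Online shopping frequency", "Average order value",
            "Customer ratings", "Subscription to promotions"]

# Loading vectors as given in the question.
pc1_loading = [0.6, 0.6, 0.7, 0.3]
pc2_loading = [0.2, -0.3, 0.5, 0.8]

def pc_score(standardised_values, loading):
    """Projection of one customer onto a principal component."""
    return sum(v * w for v, w in zip(standardised_values, loading))

# Hypothetical customers, expressed as z-scores of the four features.
engaged_spender = [1.5, 1.5, 1.0, 0.0]
promo_seeker = [0.0, -1.0, 0.5, 2.0]

for name, z in [("engaged spender", engaged_spender),
                ("promo seeker", promo_seeker)]:
    pc1 = pc_score(z, pc1_loading)
    pc2 = pc_score(z, pc2_loading)
    print(f"{name}: PC1={pc1:.2f}, PC2={pc2:.2f}")
```

The first customer scores high on PC1 and low on PC2, while the second shows the opposite pattern, matching the two segments described above.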

Threshold values directly control the trade-off between sensitivity (true positive rate) and specificity (1 - false positive rate) in the Random Forest model. Each threshold, such as th1 (0.6), th2 (0.5), or th3 (0.4), corresponds to one point on the ROC curve and determines which predicted probabilities are classified as churn. A higher threshold such as 0.6 makes the model more conservative: specificity increases and false positives fall, but more true churners are missed. Conversely, a lower threshold such as 0.4 increases sensitivity, capturing more true positives at the cost of a higher false positive rate. The choice of threshold should therefore reflect the relative costs of the two types of error in employee churn prediction.
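
The trade-off can be illustrated by computing the true and false positive rates at the three thresholds. The labels and predicted probabilities below are invented for illustration and do not come from the exam figure; `rates_at_threshold` is a helper defined here.

```python
def rates_at_threshold(labels, scores, threshold):
    """Return (TPR, FPR) when score >= threshold is predicted as churn."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives

# Hypothetical data: 1 = employee left, 0 = employee stayed.
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.65, 0.55, 0.58, 0.45, 0.42, 0.3, 0.2, 0.1]

for name, th in [("th1", 0.6), ("th2", 0.5), ("th3", 0.4)]:
    tpr, fpr = rates_at_threshold(labels, scores, th)
    print(f"{name} ({th}): TPR={tpr:.1f}, FPR={fpr:.1f}")
```

On this toy data, lowering the threshold from 0.6 to 0.4 raises the true positive rate from 0.6 to 1.0 while the false positive rate rises from 0.0 to 0.4, which is the movement up and to the right along the ROC curve.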

The decision boundary of KNN is typically nonlinear and piecewise, because each prediction depends only on the labels of nearby training points, so the boundary bends around local groupings of samples. In contrast, the decision boundary of Logistic Regression is linear (a straight line in 2D, a hyperplane in general), because the predicted probability crosses the 0.5 threshold exactly where the linear combination of the features equals zero. KNN is therefore more flexible on data that requires fine local distinctions, whereas Logistic Regression assumes a global linear relationship and suits data that is roughly linearly separable.

The K-means++ algorithm improves customer segmentation in e-commerce by providing a better initialization of the cluster centres than standard K-means. Standard K-means picks its initial centroids at random, and a poor choice can lead to slow convergence or to a suboptimal clustering stuck in a local minimum. K-means++ mitigates this by choosing the first centroid at random and each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen, which tends to spread the initial centroids across the data. The resulting customer segments are usually more stable and better separated, which supports more reliable targeted marketing.
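
The seeding step can be sketched in plain Python as follows. This shows only the initialization, not the subsequent K-means iterations; the function name `kmeans_pp_init` and the toy customer points are hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centres using squared-distance-weighted sampling."""
    rng = random.Random(seed)
    centres = [rng.choice(points)]  # first centre: uniform at random
    while len(centres) < k:
        # Weight each point by its squared distance to the nearest centre,
        # so far-away points are much more likely to be picked next.
        weights = [min(squared_distance(p, c) for c in centres)
                   for p in points]
        centres.append(rng.choices(points, weights=weights, k=1)[0])
    return centres

# Hypothetical customers as (orders per month, average order value),
# forming three tight, well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (100.0, 100.0), (100.1, 100.0), (100.0, 100.1),
          (200.0, 0.0), (200.1, 0.0), (200.0, 0.1)]

centres = kmeans_pp_init(points, k=3, seed=0)
print(centres)
```

Because an already chosen point has weight zero and points in an already covered group have tiny weights, the three seeds are very likely to land in three different groups, whereas purely random seeding could easily place two of them in the same group.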

Naive Bayes can still be effective in many real-world scenarios because classification only requires that the correct class receive the highest score, not that the estimated probabilities be accurate, so the classifier often picks the right class even when the independence assumption is violated. It also has few parameters to estimate, which makes it resistant to overfitting in high-dimensional spaces with many features, and it is fast to train and can handle both continuous and categorical data.