Predictive Analytics Exam Questions
Using Bayes' theorem and the naive assumption of conditional independence, we express the probabilities for 'Flu' and 'No Flu' given Fever=Yes and Cough=Yes as follows: P(Flu | Fever=Yes, Cough=Yes) ∝ P(Fever=Yes | Flu) × P(Cough=Yes | Flu) × P(Flu), and P(No Flu | Fever=Yes, Cough=Yes) ∝ P(Fever=Yes | No Flu) × P(Cough=Yes | No Flu) × P(No Flu). Both posteriors share the same denominator, P(Fever=Yes, Cough=Yes), so it is enough to compare the numerators. Since P(Fever=Yes | Flu) and P(Cough=Yes | Flu) are both higher than their 'No Flu' counterparts, the likelihood terms favor 'Flu'. Provided the prior P(Flu) is not so small that P(No Flu) outweighs this advantage, 'Flu' is the more probable condition for the patient.
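The comparison of numerators can be sketched in a few lines of Python. All probabilities below are hypothetical values chosen for illustration (they are not taken from the exam's data table), and the function name naive_bayes_posterior is likewise an assumption.

```python
# Hypothetical probabilities, for illustration only (not the exam's data).
p_fever = {"Flu": 0.8, "No Flu": 0.2}   # P(Fever=Yes | class)
p_cough = {"Flu": 0.7, "No Flu": 0.3}   # P(Cough=Yes | class)

def naive_bayes_posterior(prior, *likelihoods):
    """Multiply the prior by each per-feature likelihood, then normalize."""
    scores = {}
    for cls, p in prior.items():
        score = p
        for table in likelihoods:
            score *= table[cls]
        scores[cls] = score
    total = sum(scores.values())  # plays the role of P(Fever=Yes, Cough=Yes)
    return {cls: s / total for cls, s in scores.items()}

# Moderate prior: the likelihood advantage decides the outcome.
post = naive_bayes_posterior({"Flu": 0.3, "No Flu": 0.7}, p_fever, p_cough)
predicted = max(post, key=post.get)

# Very small prior: the same likelihoods no longer suffice.
post_rare = naive_bayes_posterior({"Flu": 0.05, "No Flu": 0.95}, p_fever, p_cough)
predicted_rare = max(post_rare, key=post_rare.get)

print(predicted, predicted_rare)
```

With these assumed numbers the first case gives a posterior of about 0.8 for 'Flu', while the second case favors 'No Flu', which shows why the prior has to be part of the comparison.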
When selecting a machine learning model using ROC curves for employee churn prediction, compare the area under the curve (AUC) of each candidate to judge how well it distinguishes between the classes. A larger AUC indicates that the model ranks churners above non-churners more consistently across all thresholds. Additionally, evaluating specific threshold points like th1 (0.6), th2 (0.5), and th3 (0.4) on a particular curve (e.g., the orange curve for a Random Forest model) reveals the trade-off between the true positive rate and the false positive rate at each point. The best threshold balances sensitivity and specificity according to the costs of each error type. For instance, if the cost of false negatives (missed churners) is high, a lower threshold might be preferred to capture more true positives, even at the expense of more false positives.
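As a minimal sketch of comparing models by AUC, the function below computes AUC as the fraction of (positive, negative) pairs that a model ranks in the right order. The labels and the two score lists are made-up illustrations, not outputs of the models discussed in the question.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: 1 = employee left, 0 = employee stayed.
labels  = [1, 1, 1, 0, 1, 0, 0, 0]
model_a = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
model_b = [0.6, 0.4, 0.7, 0.8, 0.3, 0.5, 0.9, 0.2]

auc_a = auc(labels, model_a)
auc_b = auc(labels, model_b)
print(auc_a, auc_b)
```

On this toy data model_a scores 0.9375 and model_b scores 0.375, so model_a would be preferred on AUC alone. The pairwise definition is quadratic in the number of samples and is meant for illustration rather than large datasets.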
The 'naive' assumption in the Naive Bayes classifier is that the features are conditionally independent given the class label. This means that, once the class is known, the value of one feature provides no additional information about the value of any other feature. This assumption simplifies probability calculations by allowing the joint probability of the features given the class to be written as the product of the individual conditional probabilities, which greatly reduces both the number of parameters to estimate and the computational cost.
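The size of the simplification can be made concrete by counting free parameters per class for binary features. The sketch below is an illustration under that binary-feature assumption, and the function names are hypothetical.

```python
def joint_params(n_features):
    """Free parameters per class for a full joint table over n binary features."""
    return 2 ** n_features - 1

def naive_params(n_features):
    """Free parameters per class under conditional independence:
    one P(x_i = 1 | class) per feature."""
    return n_features

for n in (2, 10, 20):
    print(n, joint_params(n), naive_params(n))
```

For 10 binary features the full joint needs 1023 parameters per class, while the naive factorization needs only 10.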
Choosing a very small value for 'K', such as K=1, in KNN tends to make the model sensitive to noise and can result in high variance. This is because the model effectively memorizes the training data and can easily misclassify a sample if it happens to lie near a noisy or mislabeled data point. Conversely, choosing a very large value for 'K' results in a smoother decision boundary, which reduces variance but increases bias, as the model may overlook subtle patterns in the data. Thus, large 'K' values can lead to underfitting: the model averages over too wide a neighborhood and ignores important local structure.
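The effect of K can be seen on a small toy example. The sketch below is a plain majority-vote KNN with Euclidean distance. The training points, including the single mislabeled point, are invented for illustration.

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical data: a small "A" cluster, a larger "B" cluster,
# and one mislabeled "B" point sitting inside the "A" cluster.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.8, 1.2), "A"), ((1.1, 1.3), "A"),
    ((1.05, 1.05), "B"),  # noisy point
    ((4.0, 4.0), "B"), ((4.2, 3.8), "B"), ((3.8, 4.2), "B"),
    ((4.1, 4.3), "B"), ((3.9, 3.7), "B"),
]
query = (1.1, 1.1)  # lies inside the "A" cluster

for k in (1, 5, 10):
    print(k, knn_predict(train, query, k))
```

Here K=1 follows the noisy neighbor and predicts "B", K=5 recovers "A", and K=10 (the whole training set) falls back to the overall majority class "B", which illustrates overfitting at one extreme and underfitting at the other.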
The logistic function, which maps any real-valued number into the (0, 1) interval, is given for two input features by the equation: p = 1 / (1 + e^(-(b0 + b1*x1 + b2*x2))). It is essential in Logistic Regression because it converts a linear combination of the input features into a probability that can be interpreted as the likelihood of class membership. This mapping is critical for binary classification problems, since it lets the algorithm estimate the probability that a sample belongs to a specific class and then assign a label by comparing that probability to a threshold.
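A direct translation of the two-feature formula into Python might look as follows. The coefficient values are hypothetical and are not fitted to any dataset.

```python
import math

def predict_proba(x1, x2, b0, b1, b2):
    """Logistic function applied to the linear score b0 + b1*x1 + b2*x2."""
    z = b0 + b1 * x1 + b2 * x2
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, for illustration only.
b0, b1, b2 = -1.0, 0.5, 0.25

p_boundary = predict_proba(2.0, 0.0, b0, b1, b2)   # linear score z = 0
p_high     = predict_proba(6.0, 4.0, b0, b1, b2)   # z = 3
p_low      = predict_proba(-2.0, -4.0, b0, b1, b2) # z = -3
print(p_boundary, p_high, p_low)
```

A linear score of zero maps to a probability of exactly 0.5, positive scores map above 0.5 and negative scores below it, and the outputs for z = 3 and z = -3 sum to 1 by the symmetry of the function.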
For a customer behavior dataset, a data point with a high PC1 value typically indicates high values on the variables that load heavily on PC1, such as online shopping frequency, average order value, and customer ratings. Such a point represents a customer who is likely to be engaged and to spend more. On the other hand, a high PC2 value indicates a strong relationship with 'Subscription to promotions' and potentially 'Customer ratings', pointing to customers who are particularly responsive to promotions, possibly those who look for discounts and offers. This differentiation can help retailers tailor marketing strategies to the different purchasing profiles and promotional responsiveness of customer segments.
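How a customer ends up with a high PC1 or PC2 value can be sketched as a dot product between standardized feature values and a component's loadings. The loadings and the two customers below are hand-picked for illustration. They are not the output of a fitted PCA, so they are only roughly normalized and not exactly orthogonal.

```python
FEATURES = ["shopping_frequency", "avg_order_value", "customer_rating", "promo_subscription"]

# Hypothetical loadings (assumed, not estimated from data).
PC1 = [0.6, 0.6, 0.5, 0.1]
PC2 = [-0.1, -0.1, 0.3, 0.9]

def component_score(standardized_values, loadings):
    """Score on one principal component: dot product of z-scores and loadings."""
    return sum(v * w for v, w in zip(standardized_values, loadings))

# Hypothetical customers, expressed as z-scores in the order of FEATURES.
engaged_customer = [1.5, 1.2, 0.8, -0.5]
promo_customer   = [-0.3, -0.4, 0.5, 1.8]

for name, customer in [("engaged", engaged_customer), ("promo", promo_customer)]:
    print(name, component_score(customer, PC1), component_score(customer, PC2))
```

With these assumed values the engaged customer scores high on PC1 and low on PC2, while the promotion-driven customer shows the opposite pattern.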
Threshold values directly influence the trade-off between sensitivity (true positive rate) and specificity (1 - false positive rate) in a Random Forest model. Each threshold, such as th1 (0.6), th2 (0.5), or th3 (0.4), corresponds to one point on the ROC curve, and moving the threshold changes which predicted probabilities are classified as churn. A higher threshold such as 0.6 makes the model more conservative: specificity increases, so there are fewer false positives, but some true positives may be missed. Conversely, a lower threshold like 0.4 increases sensitivity, potentially capturing more true positives at the risk of a higher false positive rate. The choice of threshold should thus reflect the relative costs of the different types of classification errors in the specific context of employee churn prediction.
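The trade-off at th1, th2, and th3 can be tabulated directly from predicted probabilities. In the sketch below the labels and scores are invented for illustration and do not come from the Random Forest model in the question.

```python
def rates_at(threshold, labels, scores):
    """Return (TPR, FPR) when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

# Hypothetical data: 1 = employee left, 0 = employee stayed.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.55, 0.45, 0.65, 0.5, 0.3, 0.1]

for th in (0.6, 0.5, 0.4):
    tpr, fpr = rates_at(th, labels, scores)
    print(f"threshold={th}: TPR={tpr}, FPR={fpr}, specificity={1 - fpr}")
```

On this toy data, lowering the threshold from 0.6 to 0.4 raises the true positive rate from 0.5 to 1.0 while the false positive rate rises from 0.25 to 0.5.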
The decision boundary of KNN is typically nonlinear and piecewise defined, because each prediction depends only on the labels of the nearest training samples and the boundary therefore follows the local arrangement of the data. In contrast, the decision boundary of Logistic Regression is linear: the model applies the logistic function to a linear combination of the features, so the set of points with predicted probability 0.5 forms a hyperplane. These boundaries illustrate KNN’s flexibility on complex datasets that require fine local distinctions, whereas Logistic Regression assumes a global linear relationship and suits simpler, approximately linearly separable data.
The K-means++ algorithm improves customer segmentation in e-commerce by providing a more effective initialization of the cluster centers, which often results in better clusterings than standard K-means with purely random initialization. It addresses the problem of poorly placed initial centroids, which can cause standard K-means to converge to a suboptimal local minimum. K-means++ mitigates this by choosing the first centroid at random and each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen, so the initial centroids tend to be far apart and to cover the data distribution well. This can lead to improved segmentation and better-targeted marketing strategies.
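The seeding step of K-means++ can be sketched with the standard library alone. The function name kmeans_pp_init, the seed parameter, and the toy customer points are assumptions for illustration. The sketch covers only initialization, not the subsequent K-means iterations, and assumes the data contains at least k distinct points.

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids using K-means++ style seeding."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]  # first centroid: uniform at random
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
            for p in points
        ]
        # Sample the next centroid with probability proportional to d2.
        # Already chosen points have weight 0 and cannot be picked again.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

# Hypothetical customers as (orders per month, average order value).
points = [
    (1.0, 20.0), (1.2, 22.0), (0.8, 19.0),
    (5.0, 60.0), (5.3, 58.0), (4.8, 61.0),
    (9.0, 15.0), (9.4, 14.0), (8.7, 16.0),
]
centroids = kmeans_pp_init(points, k=3, seed=0)
print(centroids)
```

Because selection is random, the specific centroids depend on the seed, but points far from the existing centroids are always favored over points close to them.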
Naive Bayes can still be effective in many real-world scenarios even when the independence assumption is unrealistic. Classification only requires that the correct class receive the highest posterior score, so the predicted label is often right even when the estimated probabilities themselves are poorly calibrated. Its effectiveness also stems from the small number of parameters it estimates, which makes it resistant to overfitting, particularly in high-dimensional spaces with many features, and from its ability to handle both continuous and categorical data efficiently.