Machine Learning Exam Paper GR20D5129
In information theory, entropy is given by \( H(S) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i) \), where \( P(x_i) \) is the probability of occurrence of class \( x_i \) in a dataset \( S \). Information gain is a key metric in the construction of decision trees. It measures the reduction in entropy (impurity) after a dataset is split on an attribute \( A \): \( IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v) \), where \( S_v \) is the subset of \( S \) for which \( A \) takes the value \( v \). At each node, the attribute with the highest information gain is chosen, that is, the one whose branches give the largest reduction in weighted entropy and therefore best separate the data into distinct classes. Because this choice is made greedily node by node, it yields a compact and effective tree rather than a guaranteed optimal one; good splits near the root are nonetheless important for building a tree that generalizes well to unseen data.
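The entropy and information gain definitions can be checked on a small example. The following is a minimal sketch in plain Python; the function names and the toy attributes `windy` and `humid` are illustrative assumptions, not part of the exam paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Entropy of `labels` minus the weighted entropy after splitting on the attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Hypothetical toy data, invented for illustration.
play = ["yes", "yes", "no", "no"]
windy = ["false", "false", "true", "true"]  # separates the classes perfectly
humid = ["high", "low", "high", "low"]      # carries no information about `play`

print(information_gain(play, windy))  # 1.0
print(information_gain(play, humid))  # 0.0
```

A decision tree learner would split on `windy` here, since it removes all of the one bit of uncertainty in `play`.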
Machine Learning handles sparse data with techniques that either accommodate or reduce sparsity, where sparsity means that most entries of the dataset are zero or missing, as in text data or collaborative filtering. One common approach is regularization, such as L1 regularization (Lasso), which encourages sparsity in the model coefficients themselves. Another is Matrix Factorization, which is well suited to recommendation systems: a sparse matrix is approximated by the product of two lower-dimensional matrices that capture the latent patterns. Feature selection methods help by reducing dimensionality and retaining only the most informative features. Additionally, techniques such as Sparse Coding and Compressive Sensing explicitly represent data as a sparse combination of basis elements, which makes them effective in high-dimensional, sparse settings. These strategies help ensure that sparsity does not compromise the performance, efficiency, or scalability of Machine Learning models.
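One way to see why an L1 penalty yields sparse coefficients is the soft-thresholding operator, which is the proximal operator of the L1 norm and appears in common Lasso solvers. The sketch below is illustrative only; the coefficient values and the threshold `lam` are assumptions.

```python
def soft_threshold(w, lam):
    """Shrink w toward zero by lam; values within [-lam, lam] become exactly zero."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

# Hypothetical coefficients, invented for illustration.
weights = [2.5, -0.3, 0.1, -1.2]
shrunk = [soft_threshold(w, 0.5) for w in weights]
print(shrunk)
```

The two small coefficients are set exactly to zero, while the large ones are only reduced in magnitude. An L2 penalty, by contrast, scales coefficients down without zeroing them.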
Mixture Models and Latent Factor Models are both typically formulated probabilistically, but they have distinct purposes and methodologies. Mixture Models, such as Gaussian Mixture Models, assume that each data point is generated by one of several component distributions, each representing a different cluster or group within the data. They are useful for capturing population heterogeneity and are often used for clustering, without assuming any underlying structure beyond the mixture itself. Latent Factor Models, such as those used in collaborative filtering, instead assume that the observed data is driven by a small number of unobserved (latent) factors. These models aim to uncover the factors responsible for observed correlations and are commonly used in recommendation systems to model interactions between entities such as users and items. In short, Mixture Models assign each observation to a discrete group based on the data distribution, whereas Latent Factor Models describe each observation through continuous hidden factors that shape the observable data.
Linear Discriminant Analysis (LDA) is used in Machine Learning primarily for dimensionality reduction and classification. It projects data from a higher-dimensional space to a lower-dimensional space while preserving separability among classes. LDA chooses the projection that maximizes the ratio of between-class variance to within-class variance, so that the classes remain as distinct as possible when mapped to the smaller subspace. Unlike PCA, which maximizes variance without regard to class labels, LDA explicitly uses the class labels, making it better suited to classification tasks where class separability is essential. LDA is therefore effective when the objective is to find the feature space that best discriminates between known classes.
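For two classes, the direction that maximizes the between-class to within-class variance ratio has the closed form \( w \propto S_W^{-1}(m_b - m_a) \), where \( S_W \) is the within-class scatter matrix and \( m_a, m_b \) are the class means. The following is a minimal 2D sketch in plain Python; the data points and helper names are assumptions made for illustration, and \( S_W \) is assumed to be invertible.

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """2x2 scatter matrix: sum of (x - m)(x - m)^T over the points."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class_a, class_b):
    """Direction w proportional to S_W^{-1} (m_b - m_a) for two classes in 2D."""
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]  # assumes S_W is invertible
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mb[0] - ma[0], mb[1] - ma[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

# Hypothetical labeled points, invented for illustration.
class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0)]
class_b = [(6.0, 5.0), (7.0, 8.0), (8.0, 7.0)]

w = fisher_direction(class_a, class_b)
proj_a = [p[0] * w[0] + p[1] * w[1] for p in class_a]
proj_b = [p[0] * w[0] + p[1] * w[1] for p in class_b]
print(w, max(proj_a) < min(proj_b))
```

Projecting onto `w` reduces the data from two dimensions to one while keeping the two classes on separate sides of a threshold.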
Active Learning is beneficial in Machine Learning because it improves learning efficiency by selectively querying the most informative data points for labeling, thereby reducing the overall labeling cost and reaching good model performance with fewer labeled instances. This is particularly advantageous where labeling is expensive, time-consuming, or requires expert input, such as medical diagnosis or fine-tuning language models, where large labeled datasets are scarce. Active Learning avoids the inefficiency of random sampling by focusing on data points that are likely to improve the decision boundary or fill gaps in the model’s current understanding. This selective querying ensures that the model obtains the most value per label, making it an effective strategy when resources are constrained.
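A common query strategy is uncertainty sampling: ask for the label of the point the current model is least sure about. The sketch below assumes a binary classifier that has already produced a predicted probability for each unlabeled point; the probabilities and the function name are hypothetical.

```python
def most_uncertain(pool_probs):
    """Index of the unlabeled point whose predicted P(class = 1) is closest to 0.5."""
    return min(range(len(pool_probs)), key=lambda i: abs(pool_probs[i] - 0.5))

# Hypothetical model outputs for four unlabeled points.
pool_probs = [0.97, 0.52, 0.10, 0.80]
query_index = most_uncertain(pool_probs)
print(query_index)  # 1
```

The point at index 1 lies nearest the decision boundary, so labeling it is expected to be more informative than labeling a point the model already classifies confidently.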
The R-Square, or R² Score, is a statistical measure of the proportion of the variance in a dependent variable that is explained by the independent variable or variables in a regression model. It indicates how well the observed outcomes are replicated by the model, based on the share of total variation that the model explains. The R² Score is important because it gives a quantifiable assessment of goodness-of-fit, with values closer to 1 indicating a better fit. However, a high R² does not necessarily mean the model is optimal: on the training data, R² never decreases when more predictors are added to an ordinary least-squares model, even if they do not improve predictions on new data, which can mask overfitting. R² therefore needs to be interpreted in the context of model complexity and the specific characteristics of the data.
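R² can be computed directly as \( R^2 = 1 - SS_{res} / SS_{tot} \), where \( SS_{res} \) is the sum of squared residuals and \( SS_{tot} \) is the total sum of squares around the mean. The sketch below also includes adjusted R², a standard correction for the number of predictors that is added here for illustration and is not mentioned in the original answer; the data values are hypothetical.

```python
def r2_score(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2: penalizes R^2 for the number of predictors used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical observations and predictions from a one-feature model.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

r2 = r2_score(y_true, y_pred)
print(r2, adjusted_r2(r2, n_samples=4, n_features=1))
```

Here R² is about 0.975 and the adjusted value about 0.9625; the gap widens as predictors are added without a matching drop in residual error.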
K-Means clustering is significant in Machine Learning because it provides a simple yet efficient way to group data into distinct clusters, facilitating data analysis and summarization. The algorithm partitions the dataset into \( k \) clusters, where each data point belongs to the cluster with the nearest mean (centroid), which serves as the prototype of that cluster. Its key advantages are efficiency on large datasets and a straightforward implementation. However, K-Means has notable limitations. The number of clusters \( k \) must be chosen in advance. The algorithm is sensitive to the initial placement of centroids, so different runs can give different results. It implicitly assumes that clusters are roughly spherical and of similar size, so it struggles when clusters differ in size, shape, or density. It is also not robust to outliers and noise, which highlights the importance of careful preprocessing and parameter selection.
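The assign-then-update loop of K-Means (Lloyd's algorithm) can be sketched on one-dimensional data. This is a minimal illustration with fixed starting centroids and a fixed iteration count rather than a convergence check; the data and names are assumptions.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1D data, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical data with two well-separated groups.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 2.0])
print(centroids)  # [1.5, 10.0]
print(clusters)   # [[1.0, 1.5, 2.0], [9.0, 10.0, 11.0]]
```

Even though both starting centroids lie inside the left-hand group, the updates pull one of them over to the right-hand group within two iterations. On less clearly separated data, a poor start can instead settle in a worse local optimum, which is why multiple restarts are commonly used.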
Hyperparameter Optimization is the process of finding the best combination of hyperparameters for a Machine Learning model. Hyperparameters are external configuration values that are not learned from the training data but are set before the learning process begins. This optimization is crucial because hyperparameters can significantly affect the model's predictive power, convergence, and computational efficiency. Poorly chosen values can lead to underfitting, overfitting, or inefficient learning. Techniques such as grid search, random search, and Bayesian optimization are used to explore the hyperparameter space systematically for good settings. Tuning these values helps the model extract meaningful patterns from the data, improving generalization and predictive performance on unseen data.
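Grid search, the simplest of these techniques, evaluates every combination of candidate values and keeps the best. In the sketch below, `validation_score` is a hypothetical stand-in for "train a model with these hyperparameters and return its validation score"; the hyperparameter names and candidate values are likewise assumptions.

```python
import itertools

def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid; return the best (params, score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring function: peaks at learning_rate=0.1, depth=3.
def validation_score(learning_rate, depth):
    return -(learning_rate - 0.1) ** 2 - (depth - 3) ** 2

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [1, 3, 5]}
best_params, best_score = grid_search(grid, validation_score)
print(best_params)  # {'learning_rate': 0.1, 'depth': 3}
```

The cost grows multiplicatively with each added hyperparameter (here 3 × 3 = 9 evaluations), which is the main reason random search and Bayesian optimization are preferred for larger search spaces.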
Overfitting occurs in Machine Learning when a model learns not only the underlying pattern in the training data but also its noise and outliers, so that it performs well on the training data but poorly on unseen data. It indicates that the model has become too complex and too specific to the training dataset. One common solution is regularization, which adds a penalty on large coefficients to the training objective (L1 or L2 regularization in linear models). Regularization keeps model complexity in check and helps the model generalize to new data by discouraging it from fitting the noise.
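The effect of an L2 penalty can be seen in the simplest case, a one-feature linear model without intercept, where minimizing \( \sum_i (y_i - w x_i)^2 + \lambda w^2 \) gives \( w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda) \). The sketch below uses hypothetical data and penalty strengths chosen for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form slope of y ~ w*x minimizing sum((y - w*x)^2) + lam * w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

# Hypothetical noisy samples of roughly y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

slopes = [ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 10.0)]
print(slopes)
```

The slope shrinks from about 2.04 with no penalty to 1.9 and then about 1.19 as the penalty grows. A stronger penalty gives a simpler, less data-sensitive model, at the risk of underfitting if it is set too high.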
Reinforcement Learning (RL) differs from Unsupervised Learning in its learning approach. In RL, an agent learns to make decisions by taking actions in an environment so as to maximize cumulative reward, without explicit supervision. Learning is driven by feedback on its actions in the form of rewards or penalties. In contrast, Unsupervised Learning involves finding hidden patterns or intrinsic structure in input data that has no labeled responses. The data is not associated with any output labels, and the algorithms attempt to learn the underlying structure without any explicit signal of success. Thus, RL focuses on sequential decision-making, with improvement guided by rewards, while Unsupervised Learning focuses on organizing data and understanding its inherent patterns.

