NN Models for Spam Detection and Surfing

The document discusses three examples involving machine learning models: 1) A nearest neighbor model for surfing predictions using a weather dataset. 2) A bag-of-words model for email spam detection trained on five sample emails. 3) A linear regression model to predict oxygen consumption based on astronaut age and heart rate.

Uploaded by Frank
Copyright: © All Rights Reserved

Additional exercises

1. The table below lists a dataset that was used to create a nearest neighbour model
that predicts whether it will be a good day to go surfing.

Assuming that the model uses Euclidean distance to find the nearest neighbour,
what prediction will the model return for each of the following query instances?

2. Email spam filtering models often use a bag-of-words representation
for emails. In a bag-of-words representation, each descriptive feature of
a document (in our case, an email) records how many times a particular
word occurs in the document. One descriptive feature is included for each
word in a predefined dictionary. The dictionary is typically defined as the
complete set of words that occur in the training dataset. The table below
lists the bag-of-words representation of the following five emails, together
with a target feature, SPAM, that records whether each email is spam or
genuine:

• “money, money, money”
• “free money for free gambling fun”
• “gambling for fun”
• “machine learning for fun, fun, fun”
• “free machine learning”

• What target level would a nearest neighbor model using Euclidean distance
return for the following email: “machine learning for free”?
• What target level would a k-NN model with k=3 and using Euclidean distance
return for the same query?
• What target level would a weighted k-NN model with k=5, using a weighting
scheme of the reciprocal of the squared Euclidean distance between the neighbor
and the query, return for the query?
• What target level would a k-NN model with k=3 and using Manhattan distance
return for the same query?
• There are a lot of zero entries in the spam bag-of-words dataset. This is indicative
of sparse data and is typical for text analytics. Cosine similarity is often a good
choice when dealing with sparse non-binary data. What target level would a 3-NN
model using cosine similarity return for the query?
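The distance calculations behind these questions can be sketched in Python. This is a minimal sketch, not the worksheet's own solution: it rebuilds the bag-of-words vectors from the five emails listed above, assuming the dictionary is every distinct word in those emails and that punctuation is ignored. The helper names (`tokenize`, `bag_of_words`, and so on) are illustrative. The SPAM labels come from the table, which is not reproduced here, so the code only reports distances and leaves the vote to the reader.

```python
import math
import re
from collections import Counter

EMAILS = [
    "money, money, money",
    "free money for free gambling fun",
    "gambling for fun",
    "machine learning for fun, fun, fun",
    "free machine learning",
]
QUERY = "machine learning for free"


def tokenize(text):
    """Lower-case the text and keep only alphabetic words (assumed tokenization)."""
    return re.findall(r"[a-z]+", text.lower())


# Dictionary: every distinct word in the training emails, in first-seen order.
VOCAB = list(dict.fromkeys(w for e in EMAILS for w in tokenize(e)))


def bag_of_words(text):
    """Count how often each dictionary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in VOCAB]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


TRAIN = [bag_of_words(e) for e in EMAILS]
q = bag_of_words(QUERY)

# One row per training email: vector, Euclidean, Manhattan, cosine similarity.
for i, vec in enumerate(TRAIN, start=1):
    print(i, vec,
          round(euclidean(q, vec), 3),
          manhattan(q, vec),
          round(cosine_similarity(q, vec), 3))
```

Sorting the rows by each column gives the neighbor ranking for that measure (ascending for the two distances, descending for cosine similarity); the target level then follows from the SPAM values of the k top-ranked emails.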

3. You have been hired by the European Space Agency to build a model that predicts
the amount of oxygen that an astronaut consumes when performing five minutes of
intense physical work. The descriptive features for the model will be the age of the
astronaut and their average heart rate throughout the work. The regression model is

oxygen consumption = w[0] + w[1] × age + w[2] × heart rate

The table below shows a historical dataset that has been collected for this task.

• Assuming that the current weights in a multivariate linear regression model
are w[0] = 59.50, w[1] = 0.15, and w[2] = 0.60, make a prediction for each
training instance using this model.
• Calculate the sum of squared errors for the set of predictions generated in the
previous question.
• Assuming a learning rate of 0.000002, calculate the weights at the next
iteration of the gradient descent algorithm.
• Calculate the sum of squared errors for a set of predictions generated using
the new set of weights calculated in the previous question.
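The four steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the model is a weighted sum with w[0] as the intercept, and the batch update is w[j] ← w[j] + α × Σ (target − prediction) × x[j], which corresponds to minimizing half the sum of squared errors (a course that differentiates the plain sum would carry an extra factor of 2). The historical table is not reproduced here, so the `rows` and `targets` values are made-up placeholders, not ESA data; substitute the real table to work the exercise.

```python
def predict(weights, features):
    """Return w[0] + w[1]*x1 + w[2]*x2 + ... for one instance."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def sum_squared_errors(weights, rows, targets):
    """Sum of (target - prediction)^2 over all instances."""
    return sum((t - predict(weights, x)) ** 2 for x, t in zip(rows, targets))


def gradient_descent_step(weights, rows, targets, learning_rate):
    """One batch update, assuming the error function is half the SSE."""
    errors = [t - predict(weights, x) for x, t in zip(rows, targets)]
    new_weights = []
    for j, w in enumerate(weights):
        if j == 0:
            # The intercept pairs with a constant feature of 1.
            delta = sum(errors)
        else:
            delta = sum(e * x[j - 1] for e, x in zip(errors, rows))
        new_weights.append(w + learning_rate * delta)
    return new_weights


weights = [59.50, 0.15, 0.60]   # from the exercise
learning_rate = 0.000002        # from the exercise

# PLACEHOLDER data (age, heart rate) -> oxygen consumption; NOT the real table.
rows = [[10, 100], [20, 110]]
targets = [120.0, 130.0]

print([predict(weights, x) for x in rows])
print(sum_squared_errors(weights, rows, targets))
new_weights = gradient_descent_step(weights, rows, targets, learning_rate)
print(new_weights)
print(sum_squared_errors(new_weights, rows, targets))
```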

Common questions

Gradient descent suits multivariate linear regression because it scales to many weights and improves them through iterative updates that move towards the minimum of the error surface. Its main drawback is the need to tune hyperparameters such as the learning rate, which governs both convergence speed and stability: a small learning rate converges slowly, while a large one can overshoot the minimum. Local minima are a concern only on non-convex error surfaces; the sum of squared errors of a linear regression model is convex, so every local minimum is also a global minimum.

The learning rate sets the step size of each gradient descent iteration. A well-chosen learning rate balances convergence speed against stability. If it is too small, convergence is very slow and wastes computation. If it is too large, the updates overshoot the minimum, so the weights oscillate around it or diverge. The learning rate therefore has to be tuned for the optimization to be efficient.
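The effect of the step size is easy to see on a toy problem. The sketch below is an illustration only, not part of the exercise: it minimizes the one-dimensional function f(w) = w², whose gradient is 2w, with three arbitrarily chosen learning rates. The function name `descend` and all numeric values are assumptions made for the example.

```python
def descend(learning_rate, steps=20, start=10.0):
    """Run gradient descent on f(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w
        w -= learning_rate * gradient
    return w


# Too small, reasonable, and too large (each step multiplies w by 1 - 2*lr).
for lr in (0.001, 0.1, 1.1):
    print(lr, descend(lr))
```

With the smallest rate, w barely moves from its starting value in 20 steps; with the middle rate it approaches the minimum at 0; with the largest rate each step overshoots by more than it corrects, so the magnitude of w grows.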

A k-NN model using Manhattan distance may outperform one using Euclidean distance on high-dimensional or sparse data such as text. Manhattan distance sums the absolute differences between coordinates, whereas Euclidean distance squares each difference before summing, so a single large difference in one feature dominates the Euclidean distance more than the Manhattan distance. Manhattan distance can therefore be the more robust choice when a few features take erratic or outlying values. Both measures remain sensitive to the scale of the features, so features on different scales should be normalized whichever measure is used.

The choice of distance metric can change the result of a k-NN model in spam detection. Euclidean distance squares the difference in each word count before summing, so large differences in a few words dominate, while Manhattan distance sums the absolute differences, so every word contributes in proportion to its difference. The two can therefore rank the neighbors differently. Cosine similarity instead compares the direction of the word-count vectors and ignores their length, which suits the sparse, high-dimensional data typical of email; it can select yet another set of nearest neighbors and so change the predicted class.

Cosine similarity is preferred for sparse data because it depends on the angle between two vectors rather than on their magnitudes, which makes it invariant to scaling either vector. This is useful for high-dimensional, sparse text data, where absolute word counts vary widely with document length but the relative mix of words remains informative. Features that are zero in both vectors add nothing to the dot product or to the vector lengths, so shared absences, which are common in text representations, do not affect the result. This insensitivity to document length makes cosine similarity well suited to text classification.
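Scale invariance can be checked with a small example. In the sketch below, the two vectors are hypothetical word counts invented for illustration: the second is the first multiplied by three, as if the same email text had been pasted three times. Euclidean distance treats them as far apart, while cosine similarity treats them as identical in direction.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


short_doc = [1, 1, 0, 2]                  # hypothetical word counts
tripled_doc = [3 * c for c in short_doc]  # same word mix, three times as long

print(euclidean(short_doc, tripled_doc))          # clearly greater than zero
print(cosine_similarity(short_doc, tripled_doc))  # 1, up to rounding error
```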

In a bag-of-words representation, sparsity shows up as a large number of zero entries: most words in the dictionary do not appear in any given email. The larger the dictionary, the higher the dimensionality and the more zeros each vector contains. Cosine similarity handles such data well because it measures the cosine of the angle between two vectors (emails) and so depends on their direction rather than their magnitude. It is therefore unaffected by differences in vector length and still identifies emails with similar word patterns.

A bag-of-words model supports spam filtering by turning each email into a numeric vector in which each element is the number of times a dictionary word occurs in the email, so that algorithms can classify emails by their word-occurrence patterns. Its main limitation is that it discards word order and context, so emails that use the same words with different meanings get the same representation. Large vocabularies also produce high-dimensional vectors, which makes models less efficient. Finally, word counts cannot capture tone such as sarcasm or irony, which can reduce classification accuracy unless other features are added.

Choosing k in a k-NN model for email classification is a trade-off between bias and variance. A small k makes the model sensitive to noise in individual training emails (overfitting). A large k smooths out noise but can underfit, hiding local patterns. The choice should reflect the size and diversity of the dataset: larger, more varied datasets can support higher values of k and give more stable generalizations, while smaller datasets usually need lower values so that the neighborhood stays local to the query.

Euclidean distance has two weaknesses in text classification: it suffers from the curse of dimensionality, and it is sensitive to scale because it combines all dimensions (word counts) without regard to how informative each one is. Both can distort the similarity between high-dimensional, sparse text vectors. Manhattan distance is one alternative; it sums absolute differences across dimensions, so large differences in individual words carry less weight than under Euclidean distance. Cosine similarity is another; it normalizes by vector length, which makes it invariant to magnitude and effective for sparse data because it compares only direction.

A weighted k-NN model may be preferred over a standard k-NN model because it lets each neighbor influence the prediction according to its distance from the query. In text classification, neighbors closer to the query are likely to be more relevant to it. With weights that decrease with distance (for example, the reciprocal of the squared distance), nearer neighbors count for more, which can improve accuracy by reducing the influence of distant, less relevant neighbors.
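A distance-weighted vote can be sketched in a few lines. This is one possible implementation, assuming each neighbor is given as a (distance, label) pair and that an exact match (distance 0) simply decides the prediction; the function name `weighted_vote` and the labels "A" and "B" are hypothetical.

```python
from collections import defaultdict


def weighted_vote(neighbours, k):
    """Return the label with the largest total weight among the k nearest
    neighbours, where each weight is 1 / distance**2."""
    nearest = sorted(neighbours, key=lambda n: n[0])[:k]
    scores = defaultdict(float)
    for distance, label in nearest:
        if distance == 0:
            return label  # assumed rule: an exact match decides the vote
        scores[label] += 1.0 / distance ** 2
    return max(scores, key=scores.get)


# Hypothetical neighbours: one close "A" against three more distant "B"s.
neighbours = [(1.0, "A"), (2.0, "B"), (2.0, "B"), (3.0, "B")]
print(weighted_vote(neighbours, k=4))
```

In this made-up case an unweighted majority vote over the four neighbors would return "B", while the weighted vote returns "A", because the single neighbor at distance 1 has weight 1 and the three others together have weight 0.25 + 0.25 + 1/9.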