NN Models for Spam Detection and Surfing

The document discusses three examples involving machine learning models: 1) A nearest neighbor model for surfing predictions using a weather dataset. 2) A bag-of-words model for email spam detection trained on five sample emails. 3) A linear regression model to predict oxygen consumption based on astronaut age and heart rate.

Uploaded by

Frank

Additional exercises

1. The table below lists a dataset that was used to create a nearest neighbour model
that predicts whether it will be a good day to go surfing.

Assuming that the model uses Euclidean distance to find the nearest neighbour,
what prediction will the model return for each of the following query instances?

2. Email spam filtering models often use a bag-of-words representation
for emails. In a bag-of-words representation, the descriptive features that
describe a document (in our case, an email) each represent how many
times a particular word occurs in the document. One descriptive feature
is included for each word in a predefined dictionary. The dictionary is typically
defined as the complete set of words that occur in the training dataset.
The table below lists the bag-of-words representation for the following five
emails and a target feature, SPAM, indicating whether they are spam emails or
genuine emails:

• “money, money, money”
• “free money for free gambling fun”
• “gambling for fun”
• “machine learning for fun, fun, fun”
• “free machine learning”
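The bag-of-words construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the exercise's own table: the helper names (`tokenize`, `build_dictionary`, `bag_of_words`) and the column order (first appearance in the training emails) are assumptions, and the SPAM target column is not reproduced because it is not listed in the text.

```python
import re
from collections import Counter

# The five training emails from the exercise.
EMAILS = [
    "money, money, money",
    "free money for free gambling fun",
    "gambling for fun",
    "machine learning for fun, fun, fun",
    "free machine learning",
]

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_dictionary(documents):
    """Collect every distinct word, in order of first appearance (an assumed ordering)."""
    vocabulary = []
    for doc in documents:
        for word in tokenize(doc):
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def bag_of_words(text, vocabulary):
    """Count how often each dictionary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

VOCABULARY = build_dictionary(EMAILS)
VECTORS = [bag_of_words(email, VOCABULARY) for email in EMAILS]

print(VOCABULARY)
for email, vector in zip(EMAILS, VECTORS):
    print(vector, email)
```

Most entries in the printed vectors are zero, which is the sparsity that the later parts of the exercise refer to.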

• What target level would a nearest neighbor model using Euclidean distance
return for the following email: “machine learning for free”?
• What target level would a k-NN model with k=3 and using Euclidean distance
return for the same query?
• What target level would a weighted k-NN model with k=5 and using a weighting
scheme of the reciprocal of the squared Euclidean distance between the neighbor
and the query, return for the query?
• What target level would a k-NN model with k=3 and using Manhattan distance
return for the same query?
• There are a lot of zero entries in the spam bag-of-words dataset. This is indicative
of sparse data and is typical for text analytics. Cosine similarity is often a good
choice when dealing with sparse non-binary data. What target level would a
3-NN model using cosine similarity return for the query?
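To work through these questions it helps to compute the three measures between the query and each training email. The sketch below assumes the word order money, free, for, gambling, fun, machine, learning for every vector; the function names and the `nearest` helper are illustrative. It reports only which emails are nearest, since the SPAM column of the table is not reproduced in this text.

```python
import math

# Assumed word order: money, free, for, gambling, fun, machine, learning
EMAIL_VECTORS = [
    [3, 0, 0, 0, 0, 0, 0],  # 1: "money, money, money"
    [1, 2, 1, 1, 1, 0, 0],  # 2: "free money for free gambling fun"
    [0, 0, 1, 1, 1, 0, 0],  # 3: "gambling for fun"
    [0, 0, 1, 0, 3, 1, 1],  # 4: "machine learning for fun, fun, fun"
    [0, 1, 0, 0, 0, 1, 1],  # 5: "free machine learning"
]
QUERY = [0, 1, 1, 0, 0, 1, 1]  # "machine learning for free"

def euclidean(a, b):
    """Square root of the summed squared per-word differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute per-word differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between the two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(k, score, reverse=False):
    """Return the 1-based numbers of the k emails closest to QUERY.

    Set reverse=True for similarity measures, where larger means closer.
    """
    ranked = sorted(range(len(EMAIL_VECTORS)),
                    key=lambda i: score(EMAIL_VECTORS[i], QUERY),
                    reverse=reverse)
    return [i + 1 for i in ranked[:k]]

print("Euclidean 3-NN:", nearest(3, euclidean))
print("Manhattan 3-NN:", nearest(3, manhattan))
print("Cosine 3-NN:   ", nearest(3, cosine_similarity, reverse=True))
```

The three measures agree on the single nearest email but select different sets of three neighbors, so the predicted target level has to be read off the SPAM column separately for each measure.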

3. You have been hired by the European Space Agency to build a model that predicts
the amount of oxygen that an astronaut consumes when performing five minutes of
intense physical work. The descriptive features for the model will be the age of the
astronaut and their average heart rate throughout the work. The regression model is

oxygen consumption = w[0] + w[1] × age + w[2] × heart rate

The table below shows a historical dataset that has been collected for this task.

• Assuming that the current weights in a multivariate linear regression model
are w[0] = 59.50, w[1] = 0.15, and w[2] = 0.60, make a prediction for each
training instance using this model.
• Calculate the sum of squared errors for the set of predictions generated in the
previous question.
• Assuming a learning rate of 0.000002, calculate the weights at the next
iteration of the gradient descent algorithm.
• Calculate the sum of squared errors for a set of predictions generated using
the new set of weights calculated in the previous question.
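The four steps above can be sketched as follows. The weights and learning rate come from the exercise, but the three data rows are hypothetical placeholders, because the historical dataset is not reproduced in this text. The update rule assumes batch gradient descent on half the sum of squared errors, so no factor of 2 appears in the step; if your course defines the error without the one half, double the step.

```python
def predict(weights, features):
    """Linear model: w[0] + w[1] * x1 + w[2] * x2."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def sum_squared_errors(weights, rows, targets):
    """Plain sum of squared prediction errors over all rows."""
    return sum((t - predict(weights, x)) ** 2 for x, t in zip(rows, targets))

def gradient_descent_step(weights, rows, targets, learning_rate):
    """One batch update: w[j] <- w[j] + rate * sum_i (t_i - pred_i) * x_i[j]."""
    errors = [t - predict(weights, x) for x, t in zip(rows, targets)]
    new_weights = []
    for j, w in enumerate(weights):
        # The intercept w[0] pairs with a constant feature value of 1.
        delta = sum(e * (1.0 if j == 0 else x[j - 1])
                    for e, x in zip(errors, rows))
        new_weights.append(w + learning_rate * delta)
    return new_weights

# Hypothetical (age, heart rate) rows and oxygen targets, NOT the exercise's table.
ROWS = [(40, 140), (45, 150), (35, 160)]
TARGETS = [40.0, 45.0, 50.0]

WEIGHTS = [59.50, 0.15, 0.60]   # current weights given in the exercise
LEARNING_RATE = 0.000002        # learning rate given in the exercise

sse_before = sum_squared_errors(WEIGHTS, ROWS, TARGETS)
NEW_WEIGHTS = gradient_descent_step(WEIGHTS, ROWS, TARGETS, LEARNING_RATE)
sse_after = sum_squared_errors(NEW_WEIGHTS, ROWS, TARGETS)

print("SSE before:", sse_before)
print("new weights:", NEW_WEIGHTS)
print("SSE after:", sse_after)
```

Substituting the real table for `ROWS` and `TARGETS` gives the numbers the exercise asks for; on the placeholder rows the error falls after one step, which is the behavior to expect when the learning rate is small enough.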

Common questions

What are the potential benefits and drawbacks of using gradient descent in optimizing a multivariate linear regression model for predicting astronaut oxygen consumption?

Gradient descent scales well to models with many weights, because each iteration needs only the error gradient and moves the weights a small step toward lower error. Its main drawback is that the learning rate must be tuned: too small a value makes convergence slow, while too large a value can overshoot the minimum and cause the weights to oscillate or diverge. Local minima are a concern on non-convex error surfaces, but the sum of squared errors of a linear regression model is convex in the weights, so for this model gradient descent with a suitable learning rate heads toward the single global minimum.

How does the learning rate influence the effectiveness of gradient descent in optimizing a linear regression model's weights?

The learning rate sets the size of the step that gradient descent takes at each iteration. If it is too small, convergence is very slow and many iterations are needed. If it is too large, each update can overshoot the minimum, so the weights oscillate around it or diverge. A well-chosen learning rate balances speed and stability, which is why it usually has to be tuned for the dataset at hand.
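The effect of the step size is easy to see on a one-dimensional toy problem. The sketch below is an illustration only: the function f(w) = (w - 3)^2, the starting point, and the three rates are arbitrary choices, not values from the exercise.

```python
def descend(learning_rate, start=0.0, steps=20):
    """Run gradient descent on f(w) = (w - 3)**2 and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)   # derivative of (w - 3)**2
        w -= learning_rate * gradient
    return w

# The minimum is at w = 3.
for rate in (0.01, 0.4, 1.1):
    print(f"rate={rate}: w after 20 steps = {descend(rate):.4f}")
```

With the smallest rate the weight is still far from 3 after 20 steps, with the middle rate it is essentially at 3, and with the largest rate each step overshoots by more than the last, so the weight moves away from the minimum.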

In what scenarios would a k-NN model using Manhattan distance outperform one using Euclidean distance in text classification tasks?

Manhattan distance can be the better choice for high-dimensional or sparse data such as bag-of-words vectors. Euclidean distance squares each per-feature difference, so one large difference in a single word count dominates the result. Manhattan distance sums absolute differences, so every feature contributes in proportion to its difference. Manhattan distance may therefore rank neighbors more sensibly when a few features have unusually large values. Neither measure corrects for features on different scales, so normalization is still needed in that case.

How does the choice of distance metric affect the results of a k-NN model in determining if an email is spam?

The distance metric determines which training emails count as nearest, so it can change the prediction. Euclidean distance is dominated by the largest differences in word counts, Manhattan distance adds up the absolute differences feature by feature, and cosine similarity compares only the direction of the word-count vectors and ignores their length. Because the three measures can rank the training emails differently, the k nearest neighbors, and hence the majority target level, may differ from one metric to the next.

Why is cosine similarity a preferred method for identifying similarities in sparse data, particularly in text classification tasks?

Cosine similarity depends on the angle between two vectors rather than on their magnitudes, so multiplying a vector by a positive constant does not change it. For text, this means that a long email and a short email that use the same words in the same proportions are treated as identical. A word that is absent from either email adds nothing to the dot product, so shared zero entries, which are common in sparse representations, do not make two emails look similar. This insensitivity to document length makes cosine similarity well suited to text classification.
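A small check makes the scale invariance concrete. The sketch below takes the vector for “gambling for fun” and a doubled copy of it (as if the email text were repeated twice) and compares both with the query “machine learning for free”. The word order money, free, for, gambling, fun, machine, learning and the variable names are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Assumed word order: money, free, for, gambling, fun, machine, learning
query = [0, 1, 1, 0, 0, 1, 1]             # "machine learning for free"
email = [0, 0, 1, 1, 1, 0, 0]             # "gambling for fun"
email_doubled = [2 * x for x in email]    # the same text repeated twice

print("cosine:   ", cosine_similarity(query, email),
      cosine_similarity(query, email_doubled))
print("euclidean:", euclidean(query, email),
      euclidean(query, email_doubled))
```

The cosine similarity is the same for both versions of the email, while the Euclidean distance grows when the email is doubled even though its mix of words has not changed.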

How is the concept of sparsity in data represented within a bag-of-words model, and why is cosine similarity considered effective in handling it?

In a bag-of-words representation the dictionary is usually much larger than the number of distinct words in any single email, so most entries in each vector are zero. That is what makes the data sparse. Cosine similarity is the cosine of the angle between two email vectors. It depends on their direction rather than their length, so it still identifies emails with similar word patterns when the vectors differ greatly in length.

How does a bag-of-words model contribute to the process of spam filtering, and what limitations does it present?

A bag-of-words model turns each email into a numerical vector of word counts, which lets a classifier separate spam from genuine email on the basis of word-occurrence patterns. It has three main limitations. It discards word order and context, so emails with different meanings can receive the same representation. It produces very high-dimensional vectors when the vocabulary is large. And it cannot capture pragmatic cues such as sarcasm or irony without supplementary methods.

What considerations must be taken into account when choosing k in a k-NN model for classifying emails?

Choosing k is a trade-off between bias and variance. With a small k the model is sensitive to noise and to mislabeled emails and tends to overfit. With a large k the prediction is more stable, but local patterns are smoothed away and the model tends to underfit; in the extreme it always predicts the majority target level. Larger and more varied datasets can usually support larger values of k, while small datasets need smaller ones.

What challenges might arise when using Euclidean distance in text classification tasks like spam detection, and how do alternative distance measures address these challenges?

Euclidean distance suffers from the curse of dimensionality: in a high-dimensional, sparse space the distances between emails become less informative. It is also sensitive to vector magnitude, so a long email can be far from a short one on the same topic, and it weights every word equally regardless of how informative the word is. Manhattan distance sums absolute per-feature differences and is less dominated by a single large difference. Cosine similarity normalizes by vector length, so it is invariant to magnitude and compares only direction, which suits sparse text data.

Why might a weighted k-NN model be preferred over a standard k-NN model in classifying text data?

A standard k-NN model gives each of the k neighbors an equal vote. A weighted k-NN model instead weights each vote by a function that decreases with distance, such as the reciprocal of the squared distance, so that neighbors close to the query count for more than distant, probably less relevant ones. This can improve accuracy, and it makes the prediction less sensitive to the choice of k.
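The difference between the two voting schemes can be sketched as below. The (distance, label) pairs and the labels "A" and "B" are hypothetical toy values, not taken from the spam dataset, and returning the label of an exact match when a distance is zero is one possible convention rather than a fixed rule.

```python
from collections import Counter, defaultdict

def majority_vote(neighbors, k):
    """Standard k-NN: each of the k nearest neighbors has one vote."""
    nearest = sorted(neighbors, key=lambda pair: pair[0])[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def weighted_vote(neighbors, k):
    """Weighted k-NN: each vote counts 1 / distance**2."""
    nearest = sorted(neighbors, key=lambda pair: pair[0])[:k]
    scores = defaultdict(float)
    for distance, label in nearest:
        if distance == 0:
            return label  # assumed convention: an exact match decides the vote
        scores[label] += 1.0 / distance ** 2
    return max(scores, key=scores.get)

# Hypothetical (distance to query, target level) pairs.
NEIGHBORS = [(1.0, "A"), (2.0, "B"), (2.0, "B"), (3.0, "B")]

print("majority vote:", majority_vote(NEIGHBORS, k=4))
print("weighted vote:", weighted_vote(NEIGHBORS, k=4))
```

In this toy case the single close neighbor has weight 1, while the three more distant ones together have weight of about 0.61, so the weighted vote returns "A" although the plain majority is "B".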

