Capstone Project: Fraud Detection with Machine Learning
What This Is
You’re stepping into the role of a cybersecurity analyst working in a financial institution.
Your job? To help build a machine learning model that can detect fraudulent transactions
— before they cost the company money, reputation, and customer trust.
You’ll follow a complete machine learning pipeline — just like real-world analysts and data
scientists do — from the raw data to a working model. In this stage, we focus on the first half:
understanding and cleaning the data.
This isn’t just coding. This is threat detection, anomaly hunting, and AI modeling all rolled
into one.
Real-World Scenario
A bank’s fraud monitoring system flagged a sharp rise in fraudulent transactions. You're on the
internal threat analysis team. You've been handed 200 recent transactions and asked to build a
model that can automatically flag suspicious ones.
You’ll prepare this model using real-world practices:
Explore messy data
Clean it
Visualize what matters
Train a simple model
Check how well it works
Prepare to deploy it for live prediction
What You’ll Do in This Part (Part 1: Data Preparation)
1. Generate synthetic transaction data
You’ll simulate a real-world dataset — some of it clean, some suspicious.
2. Explore the data
Look at its structure, understand what it tells you, and look for red flags.
3. Visualize key features
Use graphs to uncover insights: outliers, suspicious spikes, fraud patterns.
4. Ask critical questions
Can we trust this data? What needs fixing? Where might fraud hide?
This part is the foundation of everything that follows. If the data is wrong or dirty, your AI will
fail — no matter how fancy the algorithm.
Step 0: Import Libraries and Create the Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Set random seed for consistency
np.random.seed(42)
# Generate synthetic transaction dataset
n = 200
data = pd.DataFrame({
    'transaction_id': np.arange(1, n + 1),
    'amount': np.random.normal(100, 20, size=n).round(2),
    'country': np.random.choice(['US', 'MY', 'N/A', 'UK', 'India'], size=n),
    'device': np.random.choice(['mobile', 'desktop', 'tablet', 'MOBILE',
                                'unknown'], size=n),
    'is_fraud': np.random.choice([0, 1], size=n, p=[0.9, 0.1])
})
# Inject some anomalies
data.loc[5, 'amount'] = 9999.99    # suspiciously high
data.loc[15, 'amount'] = -500      # invalid
data.loc[25, 'country'] = None     # missing value
data.loc[35, 'device'] = 'toaster' # nonsense device
# Preview the data
data.head(10)
Step 1: Initial Exploration
# Structure and types
data.info()
Question 1: What kind of data types do you see? Are they appropriate?
Question 2: Do you notice any missing or inconsistent data?
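One way to approach Questions 1 and 2 is to count missing values and list the distinct labels in each text column. The sketch below uses a small hypothetical five-row sample (not the lab dataset) so the result is easy to read; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical five-row sample that mimics the lab's dirty columns
sample = pd.DataFrame({
    'country': ['US', 'N/A', None, 'UK', 'MY'],
    'device': ['mobile', 'MOBILE', 'toaster', 'desktop', 'unknown'],
})

# Count true missing values (None/NaN) per column
missing_counts = sample.isna().sum()

# List distinct labels per column, keeping missing values visible
country_labels = sample['country'].value_counts(dropna=False)
device_labels = sample['device'].value_counts(dropna=False)

print(missing_counts)
print(country_labels)
print(device_labels)
```

Note that the text 'N/A' is not counted as missing by isna(): it is an ordinary string, which is why it needs its own cleaning step later.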
Step 2: Summary Statistics
data.describe()
Question 3: Are all the values in the amount column realistic?
Question 4: What does the standard deviation tell you here?
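As a hint for Question 4, the following sketch shows how a single extreme value can inflate the standard deviation while the median barely moves. The numbers are made up for illustration and are not taken from the lab dataset.

```python
import statistics

# Hypothetical amounts clustered around 100
amounts = [95.0, 100.0, 105.0, 98.0, 102.0]
# Same amounts plus one injected extreme value
with_outlier = amounts + [9999.99]

std_clean = statistics.stdev(amounts)
std_dirty = statistics.stdev(with_outlier)
median_clean = statistics.median(amounts)
median_dirty = statistics.median(with_outlier)

print("std without outlier:", round(std_clean, 2))
print("std with outlier:", round(std_dirty, 2))
print("median without / with outlier:", median_clean, median_dirty)
```

A standard deviation far larger than the typical spread you see in the histogram is a sign that a few extreme rows are dominating the statistic.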
Step 3: Visualize the Distribution of Amounts
plt.hist(data['amount'], bins=30)
plt.title('Transaction Amount Distribution')
plt.xlabel('Amount')
plt.ylabel('Frequency')
plt.grid(True)
plt.tight_layout()
plt.show()
Question 5: Do you notice any outliers or spikes in the data?
Question 6: Why could an outlier matter in fraud detection?
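Outliers can also be flagged numerically instead of by eye. The sketch below applies the common 1.5 x IQR rule to a small hypothetical array; the rule and the sample values are assumptions for illustration, not something the lab prescribes.

```python
import numpy as np

# Hypothetical amounts, including one very high and one negative value
amounts = np.array([95.0, 100.0, 105.0, 98.0, 102.0, 110.0, 90.0, 9999.99, -500.0])

# Interquartile range: spread of the middle 50% of the values
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = amounts[(amounts < lower) | (amounts > upper)]

print("Bounds:", lower, upper)
print("Flagged:", outliers)
```

A flagged value is not automatically fraud. It is a row that deserves a closer look before you decide to keep, fix, or drop it.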
Step 4: Fraud vs Non-Fraud Count
data['is_fraud'].value_counts().plot(kind='bar')
plt.title('Fraud vs Non-Fraud Transactions')
plt.xlabel('is_fraud (0 = clean, 1 = fraud)')
plt.ylabel('Number of Transactions')
plt.tight_layout()
plt.show()
Question 7: Is the dataset balanced? What does that mean for training a model?
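To put a number on Question 7, you can compute the fraud rate and the accuracy of a "model" that always predicts the majority class. The labels below are a hypothetical 20-row sample with the same 90/10 mix the lab aims for.

```python
from collections import Counter

# Hypothetical labels: 18 clean transactions and 2 frauds
labels = [0] * 18 + [1] * 2

counts = Counter(labels)
fraud_rate = counts[1] / len(labels)

# Accuracy of always predicting the most common class
baseline_accuracy = max(counts.values()) / len(labels)

print("Fraud rate:", fraud_rate)
print("Majority-class baseline accuracy:", baseline_accuracy)
```

On imbalanced data this baseline is the number to beat, and it also shows why accuracy alone says little about how many frauds are caught.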
Coming up…
What the Next Part Does
1. Removes Bad Data
o Cleans out transactions with invalid amounts (like negative or extreme
values).
o This step mimics what a fraud analyst might do when filtering out system
noise or corrupted logs.
2. Standardizes Inputs
o Turns inconsistent or silly values like "toaster" and "N/A" into
standardized, usable categories.
o In real life, this keeps the model from getting confused by dirty labels.
3. Engineers a New Feature
o Adds a new column, high_value, which marks whether a transaction is over
200.
o This is often where cyber + data merge: we bring in logic that reflects
analyst intuition. Higher-value transactions are more likely to be targeted
by fraudsters.
Capstone Project: Fraud Detection (Part 2 – Clean and Engineer Data)
# Step 0: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Set seed for reproducibility
np.random.seed(42)
# Step 1: Create Dataset
n = 200
data = pd.DataFrame({
    'transaction_id': np.arange(1, n + 1),
    'amount': np.random.normal(100, 20, size=n).round(2),
    'country': np.random.choice(['US', 'MY', 'N/A', 'UK', 'India'], size=n),
    'device': np.random.choice(['mobile', 'desktop', 'tablet', 'MOBILE',
                                'unknown'], size=n),
    'is_fraud': np.random.choice([0, 1], size=n, p=[0.9, 0.1])
})
# Inject anomalies
data.loc[5, 'amount'] = 9999.99    # suspiciously high
data.loc[15, 'amount'] = -500      # invalid
data.loc[25, 'country'] = None     # missing
data.loc[35, 'device'] = 'toaster' # nonsense device
# Step 2: Clean and Feature Engineer
clean_data = data.copy()
# Remove invalid amounts (non-positive or 5000 and above)
clean_data = clean_data[(clean_data['amount'] > 0) &
                        (clean_data['amount'] < 5000)]
# Standardize country
clean_data['country'] = clean_data['country'].fillna('Unknown')
clean_data['country'] = clean_data['country'].replace('N/A', 'Unknown')
# Normalize device field
clean_data['device'] = clean_data['device'].str.lower()
clean_data['device'] = clean_data['device'].replace({'toaster': 'unknown'})
# Feature engineering: High-value flag
clean_data['high_value'] = (clean_data['amount'] > 200).astype(int)
# Preview cleaned data
print(clean_data.head())
Lab Output Interpretation: Cleaned Transaction Data
After completing Part 2 of the capstone project — cleaning and engineering features —
you should now see a clean preview of your dataset that looks like this:
   transaction_id  amount  country   device  is_fraud  high_value
0               1  109.93       MY   mobile         0           0
1               2   97.23       MY   mobile         0           0
2               3  112.95       US   tablet         0           0
3               4  130.46       US  unknown         0           0
4               5   95.32       US   mobile         1           0
Each row represents a simulated transaction. This is a cleaned, structured version of
messy log data that a real fraud analyst might work with.
You now have clean, structured data — ready for training an AI model.
You’ve practiced feature engineering — turning raw data into something more
meaningful.
You’ve handled messy real-world problems: invalid values, junk device types, missing
countries.
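Before moving on, it can help to verify that the cleaning really did what you expect. The sketch below uses a hypothetical helper, find_problems (not part of the lab script), that reports leftover issues in a frame with amount, country, and device columns.

```python
import pandas as pd

def find_problems(df):
    """Return a list of problems still present in a supposedly clean frame."""
    problems = []
    if (df['amount'] <= 0).any():
        problems.append('non-positive amount')
    if df['country'].isna().any():
        problems.append('missing country')
    # Device labels should already be lowercase after cleaning
    if (df['device'] != df['device'].str.lower()).any():
        problems.append('device not lowercase')
    return problems

# Hypothetical examples: one clean frame and one dirty frame
good = pd.DataFrame({'amount': [109.93, 97.23],
                     'country': ['MY', 'Unknown'],
                     'device': ['mobile', 'unknown']})
bad = pd.DataFrame({'amount': [-500.0, 100.0],
                    'country': [None, 'US'],
                    'device': ['MOBILE', 'tablet']})

good_result = find_problems(good)
bad_result = find_problems(bad)
print(good_result)
print(bad_result)
```

An empty list means the checks passed. In the lab you could call the same helper on clean_data right after the cleaning step.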
Part 3: Train-Test Split + Model Training
In this section, you're acting like a machine learning engineer inside a cybersecurity
team. You’ve prepped the data — now it’s time to teach the AI to detect fraud using that
data.
You’ll train a model on known data (clean vs fraud), then check if it can make accurate
predictions on unseen data. This is the heart of supervised machine learning — the
model learns from past behavior to predict future risks.
This is where you split the data into:
Training data — for the model to learn patterns.
Testing data — to check if the model actually learned anything useful.
Then, you train a simple supervised ML model (like Logistic Regression) to detect
fraud based on the input features.
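The split described above can be sketched on toy data to show what stratify does: it keeps the fraud ratio the same in the training and testing sets. The 100-row dataset here is hypothetical and exists only to make the counts easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 10% fraud, one dummy feature
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the 10% fraud share in both parts of the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Train size:", len(y_train), "frauds:", int(y_train.sum()))
print("Test size:", len(y_test), "frauds:", int(y_test.sum()))
```

Without stratification, a small test set could end up with no fraud cases at all, which would make recall for the fraud class impossible to measure.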
Capstone Part 3: Train a Fraud Detection Model (With Cleaned Data)
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)
# -------------------------------
# Step 1: Simulate synthetic data
# -------------------------------
np.random.seed(42)
n = 200
data = pd.DataFrame({
    'transaction_id': np.arange(1, n + 1),
    'amount': np.random.normal(100, 20, size=n).round(2),
    'country': np.random.choice(['US', 'MY', 'N/A', 'UK', 'India'], size=n),
    'device': np.random.choice(['mobile', 'desktop', 'tablet', 'MOBILE',
                                'unknown'], size=n),
    # imbalanced: only 10% fraud
    'is_fraud': np.random.choice([0, 1], size=n, p=[0.9, 0.1])
})
# Inject dirty data for realism
data.loc[5, 'amount'] = 9999.99
data.loc[15, 'amount'] = -500
data.loc[25, 'country'] = None
data.loc[35, 'device'] = 'toaster'
# -------------------------------
# Step 2: Clean and engineer features
# -------------------------------
clean_data = data.copy()
# Remove negative amounts (note: the 9999.99 outlier is kept in this version)
clean_data = clean_data[clean_data['amount'] > 0]
# Lowercase and normalize device names
clean_data['device'] = clean_data['device'].str.lower()
clean_data['device'] = clean_data['device'].replace(['toaster', 'unknown'],
                                                    'other')
# Fix missing or invalid country
clean_data['country'] = clean_data['country'].fillna('unknown')
clean_data['country'] = clean_data['country'].replace('N/A', 'unknown')
# Add a derived feature: is this a high value transaction?
clean_data['high_value'] = (clean_data['amount'] > 200).astype(int)
# -------------------------------
# Step 3: Encode categorical variables
# -------------------------------
# One-hot encode country and device
encoded_data = pd.get_dummies(clean_data, columns=['country', 'device'],
                              drop_first=True)
# -------------------------------
# Step 4: Prepare features and target
# -------------------------------
X = encoded_data.drop(['transaction_id', 'is_fraud'], axis=1)
y = encoded_data['is_fraud']
# -------------------------------
# Step 5: Train-test split
# -------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# -------------------------------
# Step 6: Train Logistic Regression
# -------------------------------
# Use class_weight='balanced' to help with imbalanced classes
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)
# -------------------------------
# Step 7: Predict and Evaluate
# -------------------------------
y_pred = model.predict(X_test)
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred,
                                                          zero_division=0))
# -------------------------------
# Step 8: Visualize actual vs predicted frauds
# -------------------------------
plt.bar(['Actual Frauds', 'Predicted Frauds'], [sum(y_test), sum(y_pred)])
plt.title('Fraud Detection: Actual vs Predicted')
plt.ylabel('Number of Fraudulent Transactions')
plt.grid(True)
plt.tight_layout()
plt.show()
What the Script Does
This script simulates a real-world fraud detection workflow:
1. Creates Fake Transaction Data
Each row is a transaction with info like:
o Amount
o Device used
o Country
o Whether it was fraud or not (is_fraud)
Dirty data is added on purpose (negative amounts, weird devices, missing
countries).
2. Cleans the Data
Negative amounts are removed (the 9999.99 outlier stays in this version of the script).
Devices like toaster are cleaned up and grouped under other.
Missing countries are filled with "unknown".
3. Feature Engineering
A new column high_value is added to mark large transactions — these might be
more suspicious.
Categorical values like country and device are converted into machine-friendly
format using one-hot encoding.
4. Trains a Model
Logistic Regression is used — a classic ML model for binary classification.
Since fraud cases are rare, we use class_weight='balanced' to try to give equal
importance to fraud and non-fraud.
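At this point, you might notice the accuracy is around 0.5. That is no better than a coin
flip, which is horrible, right? With about 90% of transactions clean, even a model that
always predicts "not fraud" would score about 0.9.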
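To see what an accuracy of 0.5 can hide, it helps to break the predictions into the four confusion-matrix cells by hand. The labels and predictions below are hypothetical, chosen so the accuracy comes out at exactly 0.5; they are not the output of the script above.

```python
# Hypothetical test set: 16 clean transactions followed by 4 frauds
y_true = [0] * 16 + [1] * 4
# Hypothetical predictions: 8 false alarms, then 2 of the 4 frauds caught
y_pred = [1] * 8 + [0] * 8 + [1] * 2 + [0] * 2

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # frauds caught
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # frauds missed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # clean, left alone

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("TP, FN, FP, TN:", tp, fn, fp, tn)
print("Accuracy:", accuracy, "Precision:", precision, "Recall:", recall)
```

In fraud work, recall (how many frauds were caught) and precision (how many alarms were real) usually tell you more than the single accuracy number.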
Let’s make some changes
Import Libraries and Simulate Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)
# Reproducibility
np.random.seed(42)
# Create synthetic transaction data
n = 200
data = pd.DataFrame({
    'transaction_id': np.arange(1, n + 1),
    'amount': np.random.normal(100, 20, size=n).round(2),
    'country': np.random.choice(['US', 'MY', 'N/A', 'UK', 'India'], size=n),
    'device': np.random.choice(['mobile', 'desktop', 'tablet', 'MOBILE',
                                'unknown'], size=n),
    'is_fraud': np.random.choice([0, 1], size=n, p=[0.9, 0.1])
})
# Add anomalies
data.loc[5, 'amount'] = 9999.99
data.loc[15, 'amount'] = -500
data.loc[25, 'country'] = None
data.loc[35, 'device'] = 'toaster'
data.head()
Clean Data and Feature Engineer
df = data.copy()
# Remove negative amounts
df = df[df['amount'] > 0]
# Normalize device
df['device'] = df['device'].str.lower()
df['device'] = df['device'].replace(['toaster', 'unknown'], 'other')
# Fix country values
df['country'] = df['country'].fillna('unknown')
df['country'] = df['country'].replace('N/A', 'unknown')
# Feature: high_value
df['high_value'] = (df['amount'] > 200).astype(int)
# Simulate transaction hour for another feature
df['hour'] = np.random.randint(0, 24, size=len(df))
# Feature: night transaction (hours 0-5 and 23)
df['night_txn'] = df['hour'].apply(lambda x: 1 if x < 6 or x > 22 else 0)
# Feature: amount group (binning)
# The top bin is open-ended (np.inf) so the 9999.99 outlier lands in 'high'
# instead of falling outside the bins and becoming NaN
df['amount_group'] = pd.cut(df['amount'], bins=[0, 75, 150, np.inf],
                            labels=['low', 'medium', 'high'])
# Encode categorical: device, country, amount_group
df = pd.get_dummies(df, columns=['device', 'country', 'amount_group'],
                    drop_first=True)
df.head()
Train/Test Split
# Prepare input features and label
X = df.drop(['transaction_id', 'is_fraud'], axis=1)
y = df['is_fraud']
# Stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
Train Random Forest Model
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
Evaluate the Model
y_pred = model.predict(X_test)
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred,
zero_division=0))
Visualize Prediction vs Actual
plt.bar(['Actual Frauds', 'Predicted Frauds'], [sum(y_test), sum(y_pred)])
plt.title('Fraud Detection: Actual vs Predicted')
plt.ylabel('Count')
plt.grid(True)
plt.tight_layout()
plt.show()
Better? What changed?
Firstly…
Improved Feature Engineering
We added more meaningful features:
hour: simulates when the transaction happened.
night_txn: flags transactions outside normal hours.
amount_group: bins amount into "low", "medium", "high".
These extra signals help the model learn better patterns of fraud. This is a real-world
strategy — fraud often happens at odd hours or in unusual amount brackets.
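The two derived features can be sketched on a tiny hypothetical frame to make the rules concrete. The open-ended top bin (np.inf) is an assumption of this sketch, chosen so that very large amounts still land in "high".

```python
import numpy as np
import pandas as pd

# Hypothetical transactions: amount and hour of day (0-23)
demo = pd.DataFrame({
    'amount': [50.0, 120.0, 300.0, 9999.99],
    'hour': [3, 14, 23, 22],
})

# Night flag: before 06:00 or after 22:59, same rule as the lab
demo['night_txn'] = ((demo['hour'] < 6) | (demo['hour'] > 22)).astype(int)

# Bin the amount into three labelled ranges: (0, 75], (75, 150], (150, inf)
demo['amount_group'] = pd.cut(
    demo['amount'],
    bins=[0, 75, 150, np.inf],
    labels=['low', 'medium', 'high'],
)

print(demo)
```

Note that hour 22 is not flagged as night under this rule, only hours 0 to 5 and 23.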
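Secondly…
We Switched to a Different Algorithm
Previously: LogisticRegression, a linear model that can only draw a straight-line
boundary between fraud and non-fraud.
Now: RandomForestClassifier with class_weight='balanced', an ensemble of decision
trees that can pick up non-linear patterns while still giving more weight to the
minority class (fraud).
A switch like this often helps on real data, but check recall for the fraud class in the
classification report and not just accuracy: a model that rarely predicts fraud can score
high accuracy while still producing false negatives (i.e., missing a fraud).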
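To make class_weight='balanced' less of a black box, the sketch below computes the weights by hand using the formula given in the scikit-learn documentation, n_samples / (n_classes * np.bincount(y)). The label counts are hypothetical.

```python
import numpy as np

# Hypothetical labels: 180 clean transactions and 20 frauds
y = np.array([0] * 180 + [1] * 20)

n_samples = len(y)
n_classes = len(np.unique(y))

# 'balanced' weights: rarer classes get proportionally larger weights
weights = n_samples / (n_classes * np.bincount(y))

print("Weight for class 0 (clean):", round(weights[0], 4))
print("Weight for class 1 (fraud):", round(weights[1], 4))
```

With these counts, each fraud row counts nine times as much as a clean row during training, which pushes the model to stop ignoring the rare class.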
Thirdly
Proper One-Hot Encoding
The earlier script already one-hot encoded country and device.
Now pd.get_dummies() also covers amount_group, so every categorical column
reaches the model as numeric 0/1 inputs.
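A small hypothetical example shows what pd.get_dummies() with drop_first=True produces: one column per category except the first (alphabetically), which becomes the implied baseline.

```python
import pandas as pd

# Hypothetical device column with three categories
demo = pd.DataFrame({'device': ['desktop', 'mobile', 'tablet', 'mobile']})

# drop_first=True drops 'desktop'; a row with all zeros means 'desktop'
encoded = pd.get_dummies(demo, columns=['device'], drop_first=True)

print(encoded)
```

This is why the trained model has columns such as device_mobile and device_tablet but no device_desktop.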
Also
Zero Warnings
We added zero_division=0 in classification_report() to suppress the annoying metric
warnings when there are no predicted frauds — which is common in imbalanced
datasets.
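The situation behind that warning can be reproduced in a few lines. In this hypothetical example the model predicts no fraud at all, so precision for the fraud class would be 0 divided by 0; zero_division=0 tells scikit-learn to report 0 instead of warning.

```python
from sklearn.metrics import precision_score

# Hypothetical case: one real fraud, but the model never predicts fraud
y_true = [0, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 0]

# Precision = TP / (TP + FP) = 0 / 0 here, so we choose what to report
precision_for_fraud = precision_score(y_true, y_pred, zero_division=0)

print("Precision for fraud:", precision_for_fraud)
```

Silencing the warning does not fix the model: a precision of 0 for the fraud class still means no fraud was flagged.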
Finally,
Clean Visual Output
The final chart comparing "Actual Frauds" vs "Predicted Frauds" is simple but effective
for classroom learning and shows the model’s performance visually.
Cool eh… ha!
Shall we try this model? Let's try.
# Step 1: Get the exact feature columns used by the model
feature_cols = model.feature_names_in_
# Step 2: Create a sample input that includes ALL features
sample_dict = {
    'amount': 950,
    'high_value': 1,
    'hour': 2,
    'night_txn': 1,
    'country_MY': 0,
    'country_UK': 1,
    'country_unknown': 0,
    'device_mobile': 1,
    'device_other': 0,
    'device_tablet': 0,
    'amount_group_medium': 0,
    'amount_group_high': 1
}
# Step 3: Build a DataFrame with all expected columns
# Fill any missing columns with 0
for col in feature_cols:
    if col not in sample_dict:
        sample_dict[col] = 0
# Create DataFrame with correct column order
sample_df = pd.DataFrame([{col: sample_dict[col] for col in feature_cols}])
# Step 4: Predict
prediction = model.predict(sample_df)
prediction_proba = model.predict_proba(sample_df)
print("Prediction (0 = Not Fraud, 1 = Fraud):", prediction[0])
print("Fraud Probability:", round(prediction_proba[0][1] * 100, 2), "%")
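The column-alignment step can also be written with DataFrame.reindex, which orders the columns and fills missing ones in a single call. This is an alternative sketch, not the lab's method, and the short feature list below is hypothetical, standing in for model.feature_names_in_.

```python
import pandas as pd

# Hypothetical stand-in for model.feature_names_in_
feature_cols = ['amount', 'night_txn', 'country_UK', 'country_US', 'device_mobile']

# Sample input that leaves out one expected column (country_US)
sample = {'amount': 950, 'night_txn': 1, 'country_UK': 1, 'device_mobile': 1}

# reindex puts columns in the expected order and fills gaps with 0
sample_df = pd.DataFrame([sample]).reindex(columns=feature_cols, fill_value=0)

print(sample_df)
```

Either approach works. What matters is that the input has exactly the columns the model was trained on, in the same order.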
Review
A. Data Understanding & Cleaning
1. Why did we inject anomalies like 9999.99, -500, and "toaster" in the dataset?
What do these represent in a real-world cybersecurity context?
2. What kind of risks arise if you skip the data cleaning step and use the raw
transaction data for training?
3. Explain why we used fillna('unknown') and replaced "N/A" and "toaster" in the
dataset.
B. Feature Engineering
4. What’s the purpose of creating the high_value feature? How does this help the
model detect fraud better?
5. Why did we introduce the hour and night_txn features? What real-world behavior
do these simulate?
6. What is amount_group, and how does binning improve model performance or
interpretability in fraud detection?
7. Can you suggest one more feature that might help detect fraud more accurately,
based on user or transaction behavior?
C. Model Training & Evaluation
8. What is the difference between Logistic Regression and Random Forest? Why
did we switch to Random Forest?
9. Why did we use class_weight='balanced' in both models? What does it fix?
10. You saw warnings about precision being undefined. What caused those
warnings, and how did we fix it in the final version?
11. Our first model had ~50% accuracy. What caused such poor performance even
though accuracy sounds OK?
D. Predictions & Real-World Use
12. How did we ensure that our prediction input (the sample dictionary) had all the
correct features in the correct order? Why is that important?
13. What kind of real-world consequences could occur if our fraud model has high
false negatives? How about high false positives?
14. In production systems, why is it better to return a fraud probability (like 92%)
rather than a hard “yes/no” label?
15. Let’s say the model predicted a transaction as fraud with 95% confidence. What
steps should a real bank take next? Should it auto-block? Escalate to human
review? Log it? Discuss.