Machine Learning for Fraud Detection

The document outlines a capstone project focused on building a machine learning model for fraud detection in financial transactions. It details the process of data preparation, including data cleaning, feature engineering, and model training using synthetic transaction data. The project culminates in training a machine learning model, specifically Logistic Regression or Random Forest, to accurately predict fraudulent transactions based on the prepared dataset.

Capstone Project: Fraud Detection with Machine Learning

What This Is

You’re stepping into the role of a cybersecurity analyst working in a financial institution.
Your job? To help build a machine learning model that can detect fraudulent transactions
— before they cost the company money, reputation, and customer trust.

You’ll follow a complete machine learning pipeline — just like real-world analysts and data
scientists do — from the raw data to a working model. In this stage, we focus on the first half:
understanding and cleaning the data.

This isn’t just coding. This is threat detection, anomaly hunting, and AI modeling all rolled
into one.

Real-World Scenario

A bank’s fraud monitoring system flagged a sharp rise in fraudulent transactions. You're on the
internal threat analysis team. You've been handed 200 recent transactions and asked to build a
model that can automatically flag suspicious ones.

You’ll prepare this model using real-world practices:

• Explore messy data
• Clean it
• Visualize what matters
• Train a simple model
• Check how well it works
• Prepare to deploy it for live prediction

What You’ll Do in This Part (Part 1: Data Preparation)

1. Generate synthetic transaction data
   You’ll simulate a real-world dataset: some of it clean, some suspicious.
2. Explore the data
   Look at its structure, understand what it tells you, and look for red flags.
3. Visualize key features
   Use graphs to uncover insights: outliers, suspicious spikes, fraud patterns.
4. Ask critical questions
   Can we trust this data? What needs fixing? Where might fraud hide?

This part is the foundation of everything that follows. If the data is wrong or dirty, your AI will
fail, no matter how fancy the algorithm.

Step 0: Import Libraries and Create the Dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for consistency
np.random.seed(42)

# Generate synthetic transaction dataset
n = 200
data = pd.DataFrame({
    'transaction_id': np.arange(1, n + 1),
    'amount': np.random.normal(100, 20, size=n).round(2),
    'country': np.random.choice(['US', 'MY', 'N/A', 'UK', 'India'], size=n),
    'device': np.random.choice(['mobile', 'desktop', 'tablet', 'MOBILE', 'unknown'], size=n),
    'is_fraud': np.random.choice([0, 1], size=n, p=[0.9, 0.1])
})

# Inject some anomalies
data.loc[5, 'amount'] = 9999.99    # suspiciously high
data.loc[15, 'amount'] = -500      # invalid
data.loc[25, 'country'] = None     # missing value
data.loc[35, 'device'] = 'toaster' # nonsense device

# Preview the data
data.head(10)

Step 1: Initial Exploration


# Structure and types
data.info()

Question 1: What kind of data types do you see? Are they appropriate?
Question 2: Do you notice any missing or inconsistent data?
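
One way to approach Questions 1 and 2 is to count the missing values and list the distinct labels in each text column. The sketch below uses a tiny hypothetical frame (the name sample and its four rows are made up for illustration, not taken from the lab data) so the problems are easy to see.

```python
import pandas as pd

# Hypothetical mini-dataset with the same kinds of problems as the lab data.
sample = pd.DataFrame({
    "country": ["US", "N/A", None, "UK"],
    "device": ["mobile", "MOBILE", "toaster", "desktop"],
})

# Count true missing values (None/NaN) per column.
# Note: the string "N/A" is NOT counted as missing here; it is just text.
missing_per_column = sample.isna().sum()

# List every distinct label, including missing ones, to spot inconsistencies.
country_counts = sample["country"].value_counts(dropna=False)
device_labels = sorted(sample["device"].unique())

print(missing_per_column)
print(country_counts)
print(device_labels)
```

Notice that "mobile" and "MOBILE" show up as two different labels, and that "N/A" hides from isna(). Both are the kind of inconsistency the cleaning step has to handle.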

Step 2: Summary Statistics

data.describe()

Question 3: Are all the values in the amount column realistic?


Question 4: What does the standard deviation tell you here?
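
To get a feel for Question 4, it can help to see how a single extreme value inflates the standard deviation. This is a minimal sketch with made-up amounts (the lists typical and with_outlier are hypothetical, not the lab dataset), using only Python's standard library.

```python
import statistics

# Hypothetical amounts: five ordinary values plus one injected outlier.
typical = [95.0, 100.0, 105.0, 98.0, 102.0]
with_outlier = typical + [9999.99]

# Sample standard deviation before and after adding the outlier.
std_typical = statistics.stdev(typical)
std_with_outlier = statistics.stdev(with_outlier)

# The median barely moves, which is why it is often called a robust statistic.
median_with_outlier = statistics.median(with_outlier)

print("std without outlier:", round(std_typical, 2))
print("std with outlier:   ", round(std_with_outlier, 2))
print("median with outlier:", median_with_outlier)
```

If describe() reports a standard deviation far larger than the spread you expect, that is a hint to look for extreme values before trusting the mean.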

Step 3: Visualize the Distribution of Amounts


plt.hist(data['amount'], bins=30)
plt.title('Transaction Amount Distribution')
plt.xlabel('Amount')
plt.ylabel('Frequency')
plt.grid(True)
plt.tight_layout()
plt.show()

Question 5: Do you notice any outliers or spikes in the data?


Question 6: Why could an outlier matter in fraud detection?
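
A histogram shows outliers visually. If you also want to flag them in code, one common rule of thumb is the 1.5 × IQR fence. The sketch below assumes that rule and uses hypothetical amounts; the lab itself does not prescribe a specific outlier test.

```python
import statistics

# Hypothetical amounts, including one extreme and one negative value.
amounts = [90.0, 95.0, 100.0, 105.0, 110.0, 9999.99, -500.0]

# Quartiles: statistics.quantiles with n=4 returns [Q1, Q2, Q3].
q1, _, q3 = statistics.quantiles(amounts, n=4)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [a for a in amounts if a < lower_fence or a > upper_fence]

print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

In fraud work a flagged value is not automatically bad data. It may be exactly the transaction you are looking for, so decide whether to drop it, cap it, or keep it as a signal.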

Step 4: Fraud vs Non-Fraud Count


data['is_fraud'].value_counts().plot(kind='bar')
plt.title('Fraud vs Non-Fraud Transactions')
plt.xlabel('is_fraud (0 = clean, 1 = fraud)')
plt.ylabel('Number of Transactions')
plt.tight_layout()
plt.show()

Question 7: Is the dataset balanced? What does that mean for training a model?
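
To make Question 7 concrete, the sketch below shows why imbalance matters: a "model" that always predicts clean already scores about 90% accuracy on a 90/10 dataset while catching no fraud at all. The label list is hypothetical and simply mirrors the 90/10 split the lab aims for.

```python
from collections import Counter

# Hypothetical labels: 180 clean (0) and 20 fraud (1) transactions.
labels = [0] * 180 + [1] * 20

counts = Counter(labels)
fraud_rate = counts[1] / len(labels)

# A baseline that always predicts the majority class (0 = clean).
always_clean = [0] * len(labels)
baseline_accuracy = sum(p == t for p, t in zip(always_clean, labels)) / len(labels)
frauds_caught = sum(p == 1 and t == 1 for p, t in zip(always_clean, labels))

print("fraud rate:", fraud_rate)
print("baseline accuracy:", baseline_accuracy)
print("frauds caught:", frauds_caught)
```

Keep this baseline in mind later: any accuracy number only means something when compared against it.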

So far…

Recap of What Just Happened


1. Removed Bad Data
   o Cleaned out transactions with invalid amounts (like negative or extreme values).
   o This step mimics what a fraud analyst might do when filtering out system noise or corrupted logs.
2. Standardized Inputs
   o Turned inconsistent or silly values like "toaster" and "N/A" into standardized, usable categories.
   o In real life, this prevents AI from getting confused by dirty labels.
3. Engineered a New Feature
   o We added a new column, high_value, which marks whether a transaction is over 200.
   o This is often where cyber + data merge: we bring in logic that reflects analyst intuition. Higher-value transactions are more likely to be targeted by fraudsters.

Capstone Project: Fraud Detection (Part 2 – Clean and Engineer Data)


# Step 0: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set seed for reproducibility
np.random.seed(42)

# Step 1: Create Dataset
n = 200
data = pd.DataFrame({
    'transaction_id': np.arange(1, n + 1),
    'amount': np.random.normal(100, 20, size=n).round(2),
    'country': np.random.choice(['US', 'MY', 'N/A', 'UK', 'India'], size=n),
    'device': np.random.choice(['mobile', 'desktop', 'tablet', 'MOBILE', 'unknown'], size=n),
    'is_fraud': np.random.choice([0, 1], size=n, p=[0.9, 0.1])
})

# Inject anomalies
data.loc[5, 'amount'] = 9999.99    # suspiciously high
data.loc[15, 'amount'] = -500      # invalid
data.loc[25, 'country'] = None     # missing
data.loc[35, 'device'] = 'toaster' # nonsense device

# Step 2: Clean and Feature Engineer
clean_data = data.copy()

# Remove invalid amounts (.copy() avoids pandas' SettingWithCopyWarning below)
clean_data = clean_data[(clean_data['amount'] > 0) & (clean_data['amount'] < 5000)].copy()

# Standardize country
clean_data['country'] = clean_data['country'].fillna('Unknown')
clean_data['country'] = clean_data['country'].replace('N/A', 'Unknown')

# Normalize device field
clean_data['device'] = clean_data['device'].str.lower()
clean_data['device'] = clean_data['device'].replace({'toaster': 'unknown'})

# Feature engineering: High-value flag
clean_data['high_value'] = (clean_data['amount'] > 200).astype(int)

# Preview cleaned data
print(clean_data.head())

Lab Output Interpretation: Cleaned Transaction Data


After completing Part 2 of the capstone project — cleaning and engineering features —
you should now see a clean preview of your dataset that looks like this:

   transaction_id  amount country   device  is_fraud  high_value
0               1  109.93      MY   mobile         0           0
1               2   97.23      MY   mobile         0           0
2               3  112.95      US   tablet         0           0
3               4  130.46      US  unknown         0           0
4               5   95.32      US   mobile         1           0

Each row represents a simulated transaction. This is a cleaned, structured version of
messy log data that a real fraud analyst might work with.

• You now have clean, structured data, ready for training an AI model.
• You’ve practiced feature engineering: turning raw data into something more meaningful.
• You’ve handled messy real-world problems: invalid values, junk device types, missing countries.

Part 3: Train-Test Split + Model Training

In this section, you're acting like a machine learning engineer inside a cybersecurity
team. You’ve prepped the data — now it’s time to teach the AI to detect fraud using that
data.
You’ll train a model on known data (clean vs fraud), then check if it can make accurate
predictions on unseen data. This is the heart of supervised machine learning — the
model learns from past behavior to predict future risks.
This is where you split the data into:
• Training data: for the model to learn patterns.
• Testing data: to check if the model actually learned anything useful.
Then, you train a simple supervised ML model (like Logistic Regression) to detect
fraud based on the input features.
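
With so few fraud cases, a purely random split could leave the test set with almost no fraud in it. The lab avoids this by passing stratify=y to train_test_split. The sketch below illustrates the idea behind stratification with a hypothetical helper, stratified_split; it is a simplified illustration, not scikit-learn's actual implementation.

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class separately
    so both sets keep roughly the same fraud share."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 180 clean, 20 fraud.
labels = [0] * 180 + [1] * 20
train_idx, test_idx = stratified_split(labels)

train_fraud = sum(labels[i] for i in train_idx)
test_fraud = sum(labels[i] for i in test_idx)
print("train:", len(train_idx), "rows,", train_fraud, "fraud")
print("test: ", len(test_idx), "rows,", test_fraud, "fraud")
```

Both sets end up with a 10% fraud share, so the test score reflects the same class mix the model trained on.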

Capstone Part 3: Train a Fraud Detection Model (With Cleaned Data)


# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# -------------------------------
# Step 1: Simulate synthetic data
# -------------------------------
np.random.seed(42)
n = 200

data = pd.DataFrame({
    'transaction_id': np.arange(1, n + 1),
    'amount': np.random.normal(100, 20, size=n).round(2),
    'country': np.random.choice(['US', 'MY', 'N/A', 'UK', 'India'], size=n),
    'device': np.random.choice(['mobile', 'desktop', 'tablet', 'MOBILE', 'unknown'], size=n),
    # imbalanced: only 10% fraud
    'is_fraud': np.random.choice([0, 1], size=n, p=[0.9, 0.1])
})

# Inject dirty data for realism
data.loc[5, 'amount'] = 9999.99
data.loc[15, 'amount'] = -500
data.loc[25, 'country'] = None
data.loc[35, 'device'] = 'toaster'

# -------------------------------
# Step 2: Clean and engineer features
# -------------------------------
clean_data = data.copy()

# Remove negative amounts (.copy() avoids pandas' SettingWithCopyWarning below)
clean_data = clean_data[clean_data['amount'] > 0].copy()

# Lowercase and normalize device names
clean_data['device'] = clean_data['device'].str.lower()
clean_data['device'] = clean_data['device'].replace(['toaster', 'unknown'], 'other')

# Fix missing or invalid country
clean_data['country'] = clean_data['country'].fillna('unknown')
clean_data['country'] = clean_data['country'].replace('N/A', 'unknown')

# Add a derived feature: is this a high value transaction?
clean_data['high_value'] = (clean_data['amount'] > 200).astype(int)

# -------------------------------
# Step 3: Encode categorical variables
# -------------------------------
# One-hot encode country and device
encoded_data = pd.get_dummies(clean_data, columns=['country', 'device'], drop_first=True)

# -------------------------------
# Step 4: Prepare features and target
# -------------------------------
X = encoded_data.drop(['transaction_id', 'is_fraud'], axis=1)
y = encoded_data['is_fraud']

# -------------------------------
# Step 5: Train-test split
# -------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# -------------------------------
# Step 6: Train Logistic Regression
# -------------------------------
# Use class_weight='balanced' to help with imbalanced classes
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)

# -------------------------------
# Step 7: Predict and Evaluate
# -------------------------------
y_pred = model.predict(X_test)

print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, zero_division=0))

# -------------------------------
# Step 8: Visualize actual vs predicted frauds
# -------------------------------
plt.bar(['Actual Frauds', 'Predicted Frauds'], [sum(y_test), sum(y_pred)])
plt.title('Fraud Detection: Actual vs Predicted')
plt.ylabel('Number of Fraudulent Transactions')
plt.grid(True)
plt.tight_layout()
plt.show()

What the Script Does


This script simulates a real-world fraud detection workflow:
1. Creates Fake Transaction Data
   • Each row is a transaction with info like:
     o Amount
     o Device used
     o Country
     o Whether it was fraud or not (is_fraud)
   • Dirty data is added on purpose (a negative amount, an extreme amount, a weird device, a missing country).
2. Cleans the Data
   • Negative amounts are removed. Unlike Part 2, this script keeps the extreme 9999.99 value.
   • Devices like toaster and unknown are grouped under other.
   • Missing countries and "N/A" values are replaced with "unknown".
3. Feature Engineering
   • A new column high_value is added to mark large transactions, which might be more suspicious.
   • Categorical values like country and device are converted into a machine-friendly format using one-hot encoding.
4. Trains a Model
   • Logistic Regression is used, a classic ML model for binary classification.
   • Since fraud cases are rare, we use class_weight='balanced' to try to give equal importance to fraud and non-fraud.
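
To see what class_weight='balanced' actually does, here is a small sketch of the weighting formula that scikit-learn documents for this option: n_samples / (n_classes * count_of_class). The helper name balanced_class_weights and the label list are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical training labels: 180 clean, 20 fraud.
labels = [0] * 180 + [1] * 20
weights = balanced_class_weights(labels)

print("weight for clean (0):", round(weights[0], 3))
print("weight for fraud (1):", weights[1])
```

With a 90/10 split, each fraud row counts nine times as much as a clean row during training, so misclassifying a fraud costs the model far more than misclassifying a clean transaction.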

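At this point, you might notice the accuracy is around 0.5. That is no better than a coin flip, which is
horrible, right? Part of the reason is that the synthetic is_fraud labels are drawn at random,
independently of amount, country, and device, so there is very little real pattern for the model to learn.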

Let’s make some changes

Import Libraries and Simulate Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Reproducibility
np.random.seed(42)

# Create synthetic transaction data
n = 200
data = pd.DataFrame({
    'transaction_id': np.arange(1, n + 1),
    'amount': np.random.normal(100, 20, size=n).round(2),
    'country': np.random.choice(['US', 'MY', 'N/A', 'UK', 'India'], size=n),
    'device': np.random.choice(['mobile', 'desktop', 'tablet', 'MOBILE', 'unknown'], size=n),
    'is_fraud': np.random.choice([0, 1], size=n, p=[0.9, 0.1])
})

# Add anomalies
data.loc[5, 'amount'] = 9999.99
data.loc[15, 'amount'] = -500
data.loc[25, 'country'] = None
data.loc[35, 'device'] = 'toaster'

data.head()

Clean Data and Feature Engineer

df = data.copy()

# Remove negative amounts (.copy() avoids pandas' SettingWithCopyWarning below)
df = df[df['amount'] > 0].copy()

# Normalize device
df['device'] = df['device'].str.lower()
df['device'] = df['device'].replace(['toaster', 'unknown'], 'other')

# Fix country values
df['country'] = df['country'].fillna('unknown')
df['country'] = df['country'].replace('N/A', 'unknown')

# Feature: high_value
df['high_value'] = (df['amount'] > 200).astype(int)

# Simulate transaction hour for another feature
df['hour'] = np.random.randint(0, 24, size=len(df))

# Feature: night transaction (hour 23 or hours 0 to 5)
df['night_txn'] = df['hour'].apply(lambda x: 1 if x < 6 or x > 22 else 0)

# Feature: amount group (binning)
# np.inf as the top edge keeps the 9999.99 outlier in the 'high' bin;
# with an upper edge of 9999 it would fall outside every bin and become NaN.
df['amount_group'] = pd.cut(df['amount'], bins=[0, 75, 150, np.inf],
                            labels=['low', 'medium', 'high'])

# Encode categorical: device, country, amount_group
df = pd.get_dummies(df, columns=['device', 'country', 'amount_group'], drop_first=True)

df.head()

Train/Test Split

# Prepare input features and label
X = df.drop(['transaction_id', 'is_fraud'], axis=1)
y = df['is_fraud']

# Stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Train Random Forest Model

model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

Evaluate the Model

y_pred = model.predict(X_test)

print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, zero_division=0))

Visualize Prediction vs Actual

plt.bar(['Actual Frauds', 'Predicted Frauds'], [sum(y_test), sum(y_pred)])
plt.title('Fraud Detection: Actual vs Predicted')
plt.ylabel('Count')
plt.grid(True)
plt.tight_layout()
plt.show()

Better? What changed?


Firstly…
Improved Feature Engineering
We added more meaningful features:
• hour: simulates when the transaction happened.
• night_txn: flags transactions outside normal hours.
• amount_group: bins amount into "low", "medium", "high".
These extra signals give the model more patterns to learn from. This mirrors a real-world
strategy: fraud often happens at odd hours or in unusual amount brackets.
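
Before trusting a new feature, it is worth checking whether it separates fraud from non-fraud at all. A minimal sketch, assuming you just compare the fraud rate in each group: the helper fraud_rate_by_flag and the ten example rows are hypothetical. In the lab's synthetic data, hour and is_fraud are generated independently, so you should expect the two rates there to differ only by random noise.

```python
def fraud_rate_by_flag(flags, labels):
    """Return the share of fraud among rows where the flag is 0 and where it is 1."""
    totals = {0: 0, 1: 0}
    frauds = {0: 0, 1: 0}
    for flag, label in zip(flags, labels):
        totals[flag] += 1
        frauds[flag] += label
    # Skip groups with no rows to avoid dividing by zero.
    return {flag: frauds[flag] / totals[flag] for flag in totals if totals[flag]}

# Hypothetical rows: 4 night transactions, 6 daytime transactions.
night_txn = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
is_fraud = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]

rates = fraud_rate_by_flag(night_txn, is_fraud)
print("fraud rate at night:", rates[1])
print("fraud rate by day:  ", round(rates[0], 3))
```

A clear gap between the two rates suggests the feature carries signal. Similar rates suggest the model has nothing to gain from it.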

Secondly…

We Switched to a Different Algorithm

• Previously: LogisticRegression, a linear model that separates the classes with a single linear decision boundary.
• Now: RandomForestClassifier, an ensemble of decision trees that can pick up non-linear patterns and interactions between features.
Both models use class_weight='balanced', which gives more weight to the minority class (fraud).
Whether the switch improves recall and reduces false negatives (i.e., missing a fraud) depends on the
data, so check the classification report rather than accuracy alone.

Thirdly…

Wider One-Hot Encoding

The earlier version encoded only country and device.
Now we use pd.get_dummies() on device, country, and the new amount_group column, so every
categorical feature reaches the model as numeric columns.
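
One practical catch with one-hot encoding is that new data must end up with exactly the same columns as the training data. A minimal sketch of one way to do that, assuming pandas' reindex with fill_value=0; the frames train and new_rows are hypothetical examples, not the lab dataset.

```python
import pandas as pd

# Hypothetical training data: four device categories.
train = pd.DataFrame({"device": ["desktop", "mobile", "tablet", "other"]})
train_encoded = pd.get_dummies(train, columns=["device"], drop_first=True)
feature_cols = list(train_encoded.columns)

# New data seen at prediction time contains only one category.
new_rows = pd.DataFrame({"device": ["mobile"]})
new_encoded = pd.get_dummies(new_rows, columns=["device"])

# Align to the training columns: missing columns are added and filled with 0,
# and the column order matches what the model was trained on.
aligned = new_encoded.reindex(columns=feature_cols, fill_value=0)

print(feature_cols)
print(aligned)
```

Without the reindex step, the new frame would have a single column and the model would reject it or, worse, read the wrong feature.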

Also

Zero Warnings
We pass zero_division=0 to classification_report() to suppress the metric warnings raised when
there are no predicted frauds, which is common in imbalanced datasets.
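
To see where that warning comes from, recall that precision is TP / (TP + FP). If the model predicts no fraud at all, the denominator is zero and precision is undefined. The sketch below computes the metrics by hand with hypothetical helpers (safe_divide, fraud_metrics) and made-up labels; it mimics the effect of zero_division=0 but is not scikit-learn's code.

```python
def safe_divide(numerator, denominator, zero_division=0.0):
    """Return numerator / denominator, or zero_division when undefined."""
    return numerator / denominator if denominator else zero_division

def fraud_metrics(y_true, y_pred):
    """Precision, recall and F1 for the fraud class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = safe_divide(tp, tp + fp)
    recall = safe_divide(tp, tp + fn)
    f1 = safe_divide(2 * precision * recall, precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical test set: 9 clean, 1 fraud; the model predicts "clean" for all.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

metrics = fraud_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("accuracy:", accuracy)
print(metrics)
```

Accuracy looks fine at 0.9, yet precision, recall, and F1 for fraud are all 0. That gap is why the classification report matters more than the accuracy score here.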

Finally,
Clean Visual Output
The final chart comparing "Actual Frauds" vs "Predicted Frauds" is simple but effective
for classroom learning and shows the model’s performance visually.
Cool eh… ha!

Shall we try this model? Let’s try.

# Step 1: Get the exact feature columns used by the model
feature_cols = model.feature_names_in_

# Step 2: Create a sample input that includes ALL features
sample_dict = {
    'amount': 950,
    'high_value': 1,
    'hour': 2,
    'night_txn': 1,
    'country_MY': 0,
    'country_UK': 1,
    'country_unknown': 0,
    'device_mobile': 1,
    'device_other': 0,
    'device_tablet': 0,
    'amount_group_medium': 0,
    'amount_group_high': 1
}

# Step 3: Build a DataFrame with all expected columns
# Fill any missing columns with 0
for col in feature_cols:
    if col not in sample_dict:
        sample_dict[col] = 0

# Create DataFrame with correct column order
sample_df = pd.DataFrame([{col: sample_dict[col] for col in feature_cols}])

# Step 4: Predict
prediction = model.predict(sample_df)
prediction_proba = model.predict_proba(sample_df)

print("Prediction (0 = Not Fraud, 1 = Fraud):", prediction[0])
print("Fraud Probability:", round(prediction_proba[0][1] * 100, 2), "%")
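
A probability is more useful than a hard label because you can map it to different actions. The sketch below shows one possible triage rule. The function name triage and the thresholds 0.5 and 0.9 are purely hypothetical; a real bank would tune them against the cost of missed fraud versus blocked customers.

```python
def triage(fraud_probability, review_at=0.5, block_at=0.9):
    """Map a fraud probability (0.0 to 1.0) to an action.

    The thresholds are illustrative assumptions, not recommended values.
    """
    if not 0.0 <= fraud_probability <= 1.0:
        raise ValueError("fraud_probability must be between 0 and 1")
    if fraud_probability >= block_at:
        return "block"
    if fraud_probability >= review_at:
        return "review"
    return "allow"

# Example probabilities, such as prediction_proba[0][1] from a trained model.
for p in (0.10, 0.60, 0.95):
    print(p, "->", triage(p))
```

Lowering review_at sends more cases to human analysts and misses less fraud. Raising it saves analyst time but lets more fraud through.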

Review

A. Data Understanding & Cleaning


1. Why did we inject anomalies like 9999.99, -500, and "toaster" in the dataset?
What do these represent in a real-world cybersecurity context?
2. What kind of risks arise if you skip the data cleaning step and use the raw
transaction data for training?
3. Explain why we used fillna('unknown') and replaced "N/A" and "toaster" in the
dataset.


B. Feature Engineering

4. What’s the purpose of creating the high_value feature? How does this help the
model detect fraud better?
5. Why did we introduce the hour and night_txn features? What real-world behavior
do these simulate?
6. What is amount_group, and how does binning improve model performance or
interpretability in fraud detection?
7. Can you suggest one more feature that might help detect fraud more accurately,
based on user or transaction behavior?

C. Model Training & Evaluation


8. What is the difference between Logistic Regression and Random Forest? Why
did we switch to Random Forest?
9. Why did we use class_weight='balanced' in both models? What does it fix?
10. You saw warnings about precision being undefined. What caused those
warnings, and how did we fix it in the final version?
11. Our first model had ~50% accuracy. What caused such poor performance even
though accuracy sounds OK?

D. Predictions & Real-World Use


12. How did we ensure that our prediction input (the sample dictionary) had all the
correct features in the correct order? Why is that important?
13. What kind of real-world consequences could occur if our fraud model has high
false negatives? How about high false positives?
14. In production systems, why is it better to return a fraud probability (like 92%)
rather than a hard “yes/no” label?
15. Let’s say the model predicted a transaction as fraud with 95% confidence. What
steps should a real bank take next? Should it auto-block? Escalate to human
review? Log it? Discuss.
