Breast Cancer Decision Tree Analysis

The document outlines a process for analyzing the breast cancer dataset using Python libraries such as pandas, seaborn, and scikit-learn. It includes steps for data loading, checking for missing values, visualizing correlations, scaling features, splitting data into training and testing sets, training a Decision Tree Classifier, and evaluating its accuracy. The model achieved an accuracy of approximately 96.1% and includes a visualization of the decision tree.

Uploaded by

gheffley.0001
Copyright
© All Rights Reserved

# Install necessary packages (if running in Colab)

!pip install seaborn

# Import required libraries


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load built-in dataset


data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Check for missing values


print("Missing values:\n", df.isnull().sum())

# Correlation matrix
plt.figure(figsize=(15, 11))
sns.heatmap(df.corr(), annot=False, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Train-test split
x_train, x_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.27, random_state=42)

# Decision tree model


tree = DecisionTreeClassifier(random_state=42, criterion='entropy', max_depth=4)
tree.fit(x_train, y_train)
y_pred = tree.predict(x_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy * 100)

# Visualize decision tree


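plt.figure(figsize=(12, 8))
# Names are converted to plain lists, which every recent scikit-learn release accepts
plot_tree(tree, filled=True, class_names=list(data.target_names), feature_names=list(data.feature_names), rounded=True, fontsize=8)
plt.title('Decision Tree Visualization')
plt.show()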
Requirement already satisfied: seaborn in /usr/local/lib/python3.11/dist-packages (0.13.2)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in /usr/local/lib/python3.11/dist-packages (from seaborn) (2.0.2)
Requirement already satisfied: pandas>=1.2 in /usr/local/lib/python3.11/dist-packages (from seaborn) (2.2.2)
Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in /usr/local/lib/python3.11/dist-packages (from seaborn) (3.10.0)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn)
Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2
Requirement already satisfied: pillow>=8 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (11.1.0
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seabor
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas>=1.2->seaborn) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas>=1.2->seaborn) (2025.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3
Missing values:
mean radius 0
mean texture 0
mean perimeter 0
mean area 0
mean smoothness 0
mean compactness 0
mean concavity 0
mean concave points 0
mean symmetry 0
mean fractal dimension 0
radius error 0
texture error 0
perimeter error 0
area error 0
smoothness error 0
compactness error 0
concavity error 0
concave points error 0
symmetry error 0
fractal dimension error 0
worst radius 0
worst texture 0
worst perimeter 0
worst area 0
worst smoothness 0
worst compactness 0
worst concavity 0
worst concave points 0
worst symmetry 0
worst fractal dimension 0
dtype: int64
Accuracy: 96.1038961038961

Common questions

A decision tree classifier chooses the split at each node by evaluating candidate feature thresholds and selecting the one with the largest information gain, that is, the largest reduction in impurity from the parent node to its children. In this document, the split criterion is 'entropy', which measures the impurity of a node: it is 0 when every sample in the node belongs to one class and reaches 1 bit when two classes are split 50/50. The procedure is greedy: each split maximizes the entropy reduction at that node, which does not guarantee a globally optimal tree.
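The entropy and information-gain calculation described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not scikit-learn's internal implementation; the helper names `entropy` and `information_gain` and the toy label lists are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Toy node with four samples of each class (hypothetical data).
parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A split that separates the classes completely removes all uncertainty.
perfect = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])

# A split that leaves both children as mixed as the parent gains nothing.
useless = information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print(perfect, useless)
```

A tree learner would compute this gain for every candidate threshold and keep the best one at each node.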

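Feature scaling, such as standardization, matters most for models that depend on distances or gradient magnitudes (for example k-nearest neighbors, support vector machines, or logistic regression), because it prevents features with larger ranges from dominating the result. Decision trees split on one feature at a time using thresholds, so they are largely insensitive to this kind of rescaling, and standardization is unlikely to change this model's accuracy. In the document, scaling is implemented with `StandardScaler` from the `sklearn.preprocessing` module, which transforms each feature to have a mean of 0 and a standard deviation of 1. The scaler is fitted on the full dataset before the data is split into training and testing sets, which lets test-set statistics influence preprocessing; fitting it on the training set only is the safer practice.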
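As a hedged variation on the scaling step, the sketch below splits first and then fits the scaler on the training portion only, so no test-set statistics are used during preprocessing. It reuses the document's `test_size=0.27` and `random_state=42`; the variable names `x_train_scaled` and `x_test_scaled` are assumptions for the example, and this ordering is not what the original script does.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Split before scaling so the test set stays unseen during preprocessing.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.27, random_state=42
)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learn mean/std from training data
x_test_scaled = scaler.transform(x_test)        # reuse the training mean/std

print(x_train_scaled.shape, x_test_scaled.shape)
```

The training features end up with mean 0 and standard deviation 1; the test features will be close to that but not exact, which is expected.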

The 'entropy' criterion selects splits by information gain, while scikit-learn's default 'gini' criterion uses Gini impurity. Both measure node purity and in practice usually produce very similar trees; entropy is marginally more expensive to compute because of the logarithm. Whether 'entropy' gives better results on a dataset such as breast cancer, where accurate classification matters, is an empirical question that the document does not test, so it should be checked by training with both criteria on the same split.
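One way to check the criterion choice empirically is to train the same tree with each criterion on an identical split. The sketch below assumes the document's split and depth settings and skips scaling for brevity; the `scores` dictionary is a name introduced for the example, and no particular outcome is implied.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.27, random_state=42
)

# Train one tree per criterion and record its test accuracy.
scores = {}
for criterion in ("gini", "entropy"):
    model = DecisionTreeClassifier(criterion=criterion, max_depth=4, random_state=42)
    model.fit(x_train, y_train)
    scores[criterion] = model.score(x_test, y_test)

print(scores)
```

A single split is a noisy comparison; repeating it over several random states or with cross-validation would give a more trustworthy answer.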

Utilizing the Sklearn library's built-in breast cancer dataset offers the advantages of standardized, clean data with no missing values, facilitating quick prototyping and model testing. However, reliance on such datasets can pose challenges such as limited diversity, potential bias not reflective of real-world scenarios, and a lack of control over data quality and representation. Thus, while effective for benchmarking and learning, results from such datasets should be cautiously interpreted and supplemented by additional datasets for practical applications.

A train-test split is performed to assess the model's ability to generalize to new, unseen data by dividing the data into a training set for building the model and a test set for evaluation. With a test size of 27%, 154 of the dataset's 569 samples are held out and 415 remain for training. The document does not justify this specific percentage; values between 20% and 30% are common, and the choice trades a more stable performance estimate against the amount of data available for training. With only 154 test samples, each misclassification moves the accuracy by about 0.65 percentage points, and because the split is not stratified, the class proportions in the two sets may differ slightly.
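The split sizes implied by a 27% test fraction can be worked out by hand. The sketch below assumes the dataset has 569 samples and that the test-set size is rounded up, which is how `train_test_split` handles a fractional `test_size` to the best of my knowledge; all variable names are illustrative.

```python
import math

n_samples = 569   # assumption: size of scikit-learn's breast cancer dataset
test_size = 0.27  # fraction used in the document

n_test = math.ceil(test_size * n_samples)  # test size is rounded up
n_train = n_samples - n_test

# How much one misclassified test sample changes the accuracy, in percentage points.
step = 100 / n_test

# Number of correct predictions implied by the reported accuracy of 96.1038961038961%.
correct = round(96.1038961038961 / 100 * n_test)

print(n_train, n_test, round(step, 2), correct)
```

Under these assumptions the reported accuracy corresponds to a whole number of correct predictions, which is a quick sanity check that the test-set size is right.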

Restricting the maximum depth of the decision tree to 4 limits the tree's complexity, which can prevent overfitting by not allowing it to learn overly specific patterns that do not generalize well to unseen data. However, such a limit can cause underfitting if the tree is too shallow to capture the true structure of the data. The document does not explain how the depth of 4 was chosen; it appears to balance bias and variance well enough to reach a test accuracy of approximately 96%, but comparing several depths would be needed to confirm that it is a good setting.
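The effect of the depth limit can be explored by training trees at several depths and comparing training and test accuracy. This is a minimal sketch under the document's split settings; the list of depths and the `results` dictionary are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.27, random_state=42
)

# None lets the tree grow until its leaves are pure.
depths = [1, 2, 4, 8, None]
results = {}
for depth in depths:
    model = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=42)
    model.fit(x_train, y_train)
    results[depth] = (model.score(x_train, y_train), model.score(x_test, y_test))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A growing gap between training and test accuracy as depth increases is the usual sign of overfitting; low accuracy on both is the sign of underfitting.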

A correlation matrix provides insights into the strength and direction of linear relationships between features in the dataset. By examining the correlation matrix that the document renders as a heatmap, key relationships and potential multicollinearity among features can be identified, which can inform data preprocessing and feature selection. This is useful because it highlights features that may be redundant for model training, although the document only plots the matrix and does not act on it.
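Beyond the heatmap, strongly correlated feature pairs can be listed directly from the correlation matrix. The sketch below is one possible approach; the 0.95 cutoff is an arbitrary illustrative threshold, and the names `upper`, `pairs`, and `high` are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

corr = df.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature_a, feature_b) -> |correlation| and sort from strongest to weakest.
pairs = upper.stack().sort_values(ascending=False)

high = pairs[pairs > 0.95]  # 0.95 is an arbitrary illustrative threshold
print(high)
```

Pairs that appear in `high` are candidates for dropping one feature, although a decision tree is generally tolerant of correlated inputs.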

The document explicitly checks for missing values, and the output shows a count of zero for each of the 30 features, so the data can be used without imputation or removal of samples. Handling missing values matters because they can bias or skew training, and scikit-learn estimators have traditionally not accepted them as input; here their absence means no such step is needed.
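For datasets that do contain gaps, a common approach is to count the missing entries per column and then impute them. The sketch below uses a small made-up array rather than the breast cancer data, and median imputation is just one reasonable choice among several.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each column.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 40.0],
])

# Count missing values per column, analogous to the check in the document.
missing_per_column = np.isnan(X).sum(axis=0)

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)

print(missing_per_column)
print(X_filled)
```

As with scaling, the imputer should be fitted on the training data only when it is used in a real evaluation.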

The model's accuracy, reported as approximately 96.1%, is the proportion of correctly predicted instances in the test set; the printed value of 96.1039% corresponds to 148 correct predictions out of 154 test samples. This high accuracy suggests that the decision tree model is effective at distinguishing between malignant and benign cases in the dataset. However, accuracy alone hides how the errors are distributed: the classes may be imbalanced, and missing a malignant case is more costly than a false alarm. Precision, recall, and F1-score give a more complete picture. The script imports `confusion_matrix` and `classification_report` but never calls them.
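The per-class metrics mentioned above can be obtained with `confusion_matrix` and `classification_report`. This sketch retrains a tree with the document's settings on unscaled features, so its numbers are not guaranteed to match the reported accuracy exactly; the names `model`, `cm`, and `report` are assumptions for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.27, random_state=42
)

model = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=42)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign).
cm = confusion_matrix(y_test, y_pred)

# Precision, recall, and F1-score for each class, labeled with the class names.
report = classification_report(y_test, y_pred, target_names=data.target_names)

print(cm)
print(report)
```

In this dataset's encoding, the top-right cell of the matrix counts malignant cases predicted as benign, which is usually the error to watch most closely.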

The `plot_tree` function in Scikit-learn produces a graphical representation of the decision tree, displaying its structure, including nodes, branches, and split conditions. This visualization enhances understanding by providing a clear, intuitive view of how decisions are made within the model—identifying which features are used for splits, the value thresholds, and the classification at leaf nodes. Such visualization aids in interpreting model reasoning, justifying outcomes, and communicating model logic to non-technical stakeholders.
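When a figure is inconvenient, for example in logs or plain-text reports, the same tree structure can be printed as indented rules with `export_text`. The sketch below fits a shallow tree on the full dataset purely for display; `max_depth=2` is chosen here to keep the output short and is not the document's setting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# A shallow tree keeps the printed rules short (the document uses max_depth=4).
model = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=42)
model.fit(data.data, data.target)

# feature_names is passed as a plain list of strings.
rules = export_text(model, feature_names=list(data.feature_names))
print(rules)
```

Each indented line shows a feature threshold, and each leaf line shows the predicted class, mirroring what the plotted tree displays graphically.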
