Lung Cancer Data Analysis Project

Uploaded by

Ekansh Girdhar

INDEX

1. Introduction
2. Features of the Project
3. Code (Syntax)
4. Output
5. Bibliography
6. Conclusion
CERTIFICATE

This is to certify that the project entitled
PROJECT ON DATA ANALYSIS OF THE LUNG
CANCER is a bona fide work done by EKANSH
GIRDHAR of class XII-A and ANUJ SALIL of class
XII-B, session 2024-25, and has been carried out
under my supervision and guidance.

……………………………..
(Signature of Teacher)

……………………………
(Signature of External Examiner)

…………………………….
(Signature of Principal)
ACKNOWLEDGEMENTS

I would like to express my sincere gratitude
towards my project guide and Informatics
Practices teacher, Ms. Pooja Thakur for
guiding me throughout the making of this
project. Her valuable advice helped me
immensely in the successful completion of
this project. I would also like to thank our
school principal, Ms. Meenu Kanwar, for her
coordination in extending every possible
support.
INTRODUCTION

Lung cancer is one of the most common and
serious types of cancer, causing a large number
of deaths worldwide. Understanding the disease,
its causes, and how it progresses is essential for
improving diagnosis, treatment, and survival
rates. However, the complexity of lung cancer
makes it a challenge for doctors and researchers.

This project uses a detailed dataset of 23,658
lung cancer patients to study the disease. The
data includes basic information like age, gender,
and ethnicity, as well as lifestyle details such as
smoking history and the number of years a
person smoked. It also contains important
medical details like the size and location of the
tumor, the stage of cancer, and the type of
treatment received. Blood test results and other
health indicators, such as hemoglobin, glucose,
and creatinine levels, provide additional insights.

By analyzing this information, the project aims to
find patterns and connections that can help
doctors make better decisions about diagnosis
and treatment. These insights could lead to
better care for patients and contribute to the
ongoing fight against lung cancer.

The dataset used for this project is obtained from
[Link] and is given ahead.
I have done the analysis using Python Pandas on
a Windows machine, but the project can be run on
any machine supporting Python and Pandas.
Aside from pandas, the matplotlib Python module has
also been used for visualization of this dataset.
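
As a rough illustration of the loading step, the sketch below reads a small made-up CSV from memory with pandas. The column names and values are invented for the example and are not taken from the project's dataset; `nrows` limits how many data rows are read.

```python
import io
import pandas as pd

# Made-up sample data standing in for the real CSV file (assumption).
csv_text = """Patient_ID,Age,Gender,Smoking_History
P1,61,Male,Current Smoker
P2,54,Female,Never Smoked
P3,70,Male,Former Smoker
P4,48,Female,Former Smoker
"""

# nrows limits the read to the first N data rows of the file.
sample = pd.read_csv(io.StringIO(csv_text), nrows=3)

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # data type of each column
```

Reading only part of a file this way keeps the printed output short, at the cost of leaving the remaining rows out of every later statistic.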

FEATURES OF THE PROJECT

The whole project is divided into the major
parts of reading, analysis and visualization.

It includes:
[Link] the data of csv file
[Link] the Dataframe Statistics
[Link] Analysis (Detailed)
[Link] Analysis (Comprehensive)
[Link] graphs using the data from csv file
(data visualization)
CODE
import pandas as pd
import matplotlib.pyplot as plt

# Only the first 250 rows of the CSV file are loaded (nrows=250).
df = pd.read_csv("C:\\Users\\anujs\\Downloads\\[Link]", nrows=250)

while True:
    print("Main Menu")
    print("1. Fetch data")
    print("2. Dataframe Statistics")
    print("3. Detailed and comprehensive Data Analysis")
    print("4. Data Visualization")
    print("5. Exit")
    ch = int(input("Enter your choice:"))
    if ch == 1:
        print(df)
    elif ch == 2:
        while True:
            print("Dataframe Statistics Menu")
            print("1. Display the Transpose")
            print("2. Display all column names")
            print("3. Display the indexes")
            print("4. Display the shape")
            print("5. Display the dimension")
            print("6. Display the data types of all columns")
            print("7. Display the size")
            print("8. Modify Dataset")
            print("9. Exit")
            ch2 = int(input("Enter choice"))
            if ch2 == 1:
                print(df.T)
            elif ch2 == 2:
                print(df.columns)
            elif ch2 == 3:
                print(df.index)
            elif ch2 == 4:
                print(df.shape)
            elif ch2 == 5:
                print(df.ndim)
            elif ch2 == 6:
                print(df.dtypes)
            elif ch2 == 7:
                print(df.size)
            elif ch2 == 8:
                while True:
                    print("Modify the Dataset")
                    print("1. Add a column")
                    print("2. Add a row")
                    print("3. Remove a Row")
                    print("4. Remove a Column")
                    print("5. Exit")
                    choice = int(input("Enter your choice: "))
                    if choice == 1:
                        # The single value entered is repeated in every row.
                        i = input("Enter the column name")
                        b = input("Enter the values needed to be added")
                        df[i] = b
                        print(df[i])
                    elif choice == 2:
                        # The single value entered fills every column of the row.
                        i = input("Enter the row name")
                        b = input("Enter the values needed to be added")
                        df.loc[i] = b
                        print(df.loc[i])
                    elif choice == 3:
                        a = input("Enter the row needed to be deleted")
                        # Row labels read from the CSV are integers, so numeric
                        # input is converted before it is used as a label.
                        if a.isdigit():
                            a = int(a)
                        df.drop(a, inplace=True)
                        print(a, " row deleted")
                    elif choice == 4:
                        a = input("Enter the column needed to be deleted")
                        df.drop(a, axis=1, inplace=True)
                        print(a, " column deleted")
                    elif choice == 5:
                        break
            elif ch2 == 9:
                break
    elif ch == 3:
        while True:
            print('Data Extraction and Analysis:')
            print('1. Display all Records')
            print('2. Extract patients data')
            print('3. Data Analysis')
            print('4. Exit')
            ch3 = int(input("Enter choice"))
            if ch3 == 1:
                print('Display Data:')
                print(df)
            elif ch3 == 2:
                while True:
                    print('Extract Data:')
                    print('1. Top 5 Records')
                    print('2. Bottom 5 Records')
                    print('3. Specific number of records from the top')
                    print('4. Specific number of records from the bottom')
                    print('5. Sort Based on Category')
                    print('6. Exit')
                    e = int(input("Enter choice"))
                    if e == 1:
                        print(df.head(5))
                    elif e == 2:
                        print(df.tail(5))
                    elif e == 3:
                        i_top = int(input("Enter no of records to be extracted from the top"))
                        print(df.head(i_top))
                    elif e == 4:
                        i_bottom = int(input("Enter no of records to be extracted from the bottom"))
                        print(df.tail(i_bottom))
                    elif e == 5:
                        # Despite the menu label, these options filter rows by a
                        # category value rather than sorting them.
                        while True:
                            print('Sort based on category:')
                            print('1. Gender')
                            print('2. Ethnicity')
                            print('3. Smoking History of patient')
                            print('4. Smoking History in family')
                            print('5. Tumour Location')
                            print('6. Treatment Used')
                            print('7. Exit')
                            oi = int(input("Enter choice"))
                            if oi == 1:
                                gender = int(input('Gender: 1-Male | 2-Female'))
                                if gender == 1:
                                    print(df[df.Gender == "Male"])
                                elif gender == 2:
                                    print(df[df.Gender == "Female"])
                            elif oi == 2:
                                while True:
                                    print('Ethnicity:')
                                    print('1-Hispanic')
                                    print('2-Caucasian')
                                    print('3-African American')
                                    print('4-Asian')
                                    print('5-Other')
                                    print('6. Exit')
                                    ethnicity = int(input("Enter choice"))
                                    if ethnicity == 1:
                                        print(df[df.Ethnicity == "Hispanic"])
                                    elif ethnicity == 2:
                                        print(df[df.Ethnicity == "Caucasian"])
                                    elif ethnicity == 3:
                                        print(df[df.Ethnicity == "African American"])
                                    elif ethnicity == 4:
                                        print(df[df.Ethnicity == "Asian"])
                                    elif ethnicity == 5:
                                        print(df[df.Ethnicity == "Other"])
                                    elif ethnicity == 6:
                                        break
                            elif oi == 3:
                                while True:
                                    print('Smoking History of patient:')
                                    print('1-Current Smoker')
                                    print('2-Never Smoked')
                                    print('3-Former Smoker')
                                    print('4. Exit')
                                    smoke = int(input("Enter choice"))
                                    if smoke == 1:
                                        print(df[df.Smoking_History == "Current Smoker"])
                                    elif smoke == 2:
                                        print(df[df.Smoking_History == "Never Smoked"])
                                    elif smoke == 3:
                                        print(df[df.Smoking_History == "Former Smoker"])
                                    elif smoke == 4:
                                        break
                            elif oi == 4:
                                history = int(input('Smoking history in family: 1-Yes | 2-No'))
                                if history == 1:
                                    print(df[df.Family_History == "Yes"])
                                elif history == 2:
                                    print(df[df.Family_History == "No"])
                            elif oi == 5:
                                while True:
                                    print('Tumour location:')
                                    print('1-Upper Lobe')
                                    print('2-Middle Lobe')
                                    print('3-Lower Lobe')
                                    print('4. Exit')
                                    loc = int(input("Enter choice"))
                                    if loc == 1:
                                        print(df[df.Tumor_Location == "Upper Lobe"])
                                    elif loc == 2:
                                        print(df[df.Tumor_Location == "Middle Lobe"])
                                    elif loc == 3:
                                        print(df[df.Tumor_Location == "Lower Lobe"])
                                    elif loc == 4:
                                        break
                            elif oi == 6:
                                # Assumption: the treatment column is named "Treatment".
                                while True:
                                    print('Treatment Used:')
                                    print('1-Surgery')
                                    print('2-Radiation Therapy')
                                    print('3-Chemotherapy')
                                    print('4-Targeted Therapy')
                                    print('5. Exit')
                                    method = int(input("Enter choice"))
                                    if method == 1:
                                        print(df[df.Treatment == "Surgery"])
                                    elif method == 2:
                                        print(df[df.Treatment == "Radiation Therapy"])
                                    elif method == 3:
                                        print(df[df.Treatment == "Chemotherapy"])
                                    elif method == 4:
                                        print(df[df.Treatment == "Targeted Therapy"])
                                    elif method == 5:
                                        break
                            elif oi == 7:
                                break
                    elif e == 6:
                        break

            elif ch3 == 3:
                while True:
                    print('Data Analysis:')
                    print('1. Count of all ethnic groups')
                    print('2. Average age of patients')
                    print('3. Oldest patient')
                    print('4. Youngest patient')
                    print('5. Maximum no. of smoking packs consumed by a patient yearly')
                    print('6. Minimum no. of smoking packs consumed by a patient yearly')
                    print('7. No. of patients having a history of smoking in both personal and family')
                    print('8. Exit')
                    data = int(input('Enter the Choice'))
                    if data == 1:
                        m = df.groupby("Ethnicity")
                        print(m.size())
                    elif data == 2:
                        print(df.Age.mean())
                    elif data == 3:
                        print(df.Age.max())
                    elif data == 4:
                        print(df.Age.min())
                    # Note: a pack-year is a cumulative measure
                    # (packs per day x years smoked), not packs per year.
                    elif data == 5:
                        print(df.Smoking_Pack_Years.max())
                    elif data == 6:
                        print(df.Smoking_Pack_Years.min())
                    elif data == 7:
                        # Count current or former smokers who also have a family
                        # history, without relying on the row labels being 0..n-1.
                        smoked = df.Smoking_History.isin(["Current Smoker", "Former Smoker"])
                        family = df.Family_History == "Yes"
                        print((smoked & family).sum())
                    elif data == 8:
                        break
            elif ch3 == 4:
                break

    elif ch == 4:
        while True:
            print("Data Visualization Menu")
            print("1. graph between male or female patients")
            print("2. pie chart different ethnicity")
            print('3. graph b/w Smoking history')
            print('4. family history')
            print('5. histogram for age category')
            print("6. Exit")
            ch4 = int(input("Enter choice"))
            if ch4 == 1:
                gender = df['Gender'].value_counts()
                gender.plot(kind='bar')
                plt.title('Gender Distribution')
                plt.xlabel('Gender')
                plt.ylabel('Number of Patients')
                plt.show()
            elif ch4 == 2:
                m = df.groupby("Ethnicity").size()
                plt.pie(m.values, labels=m.index)
                plt.title("Distribution by Ethnicity")
                plt.show()
            elif ch4 == 3:
                smoking_history = df['Smoking_History'].value_counts()
                smoking_history.plot(kind='bar')
                plt.title('Distribution of Smoking History')
                plt.xlabel('Smoking History')
                plt.ylabel('Number of Patients')
                plt.xticks(rotation=45)
                plt.show()
            elif ch4 == 4:
                family_history = df['Family_History'].value_counts()
                family_history.plot(kind='bar', color='lightgreen', edgecolor='black')
                plt.title('Distribution of Family History')
                plt.xlabel('Family History')
                plt.ylabel('Number of Patients')
                plt.xticks(rotation=45)
                plt.show()
            elif ch4 == 5:
                plt.hist(df['Age'].dropna(), rwidth=0.8)
                plt.title('Distribution by Age')
                plt.xlabel('Age')
                plt.ylabel('Number of Patients')
                plt.show()
            elif ch4 == 6:
                # Leave the visualization menu (option 6 in the list above).
                break
    elif ch == 5:
        break
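
Every menu in the listing converts the typed text with int(input(...)), which raises ValueError on non-numeric input and ends the program. One possible way to guard against that is sketched below. The helper name `read_choice` and its parameters are hypothetical and not part of the project; the `reader` argument exists only so the example can run without a keyboard.

```python
def read_choice(prompt, valid, reader=input):
    """Keep asking until the reply is an integer contained in `valid`."""
    while True:
        reply = reader(prompt)
        try:
            choice = int(reply)
        except ValueError:
            print("Please enter a number.")
            continue
        if choice in valid:
            return choice
        print("Choice out of range.")


# Simulated replies stand in for keyboard input in this example.
replies = iter(["abc", "9", "2"])
picked = read_choice("Enter your choice: ", range(1, 6),
                     reader=lambda _: next(replies))
print(picked)
```

In a menu loop, a call such as read_choice("Enter your choice:", range(1, 6)) could take the place of the bare int(input(...)) line.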
OUTPUT
1. Fetch Data

2. Dataframe Statistics
a. Display the Transpose

b. Display all column names

c. Display the indexes

d. Display the shape


e. Display the dimension

f. Display the data types of all columns

g. Display the size

h. Modify Dataset
(i) Add a column
(ii) Add a row

(iii) Remove a row

(iv) Remove a column
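
The four modify options listed above map onto a handful of pandas operations. The following is a minimal sketch on a tiny made-up table, not the project's dataset; the column names and values are assumptions chosen for illustration.

```python
import pandas as pd

# Tiny made-up table; names and values are assumptions for the example.
demo = pd.DataFrame({"Age": [61, 54, 70],
                     "Gender": ["Male", "Female", "Male"]})

# Add a column: a single value is broadcast to every row.
demo["Reviewed"] = "No"

# Add a row: assigning to a new label with .loc appends it.
demo.loc[3] = [48, "Female", "Yes"]

# Remove a row by its index label, then a column by its name.
demo = demo.drop(0)
demo = demo.drop("Reviewed", axis=1)

print(demo)
```

Assigning the result of drop back to the variable has the same effect here as passing inplace=True.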

3. Detailed and comprehensive Data Analysis


a. Display all records

b. Extract patient data


(i) Top 5 Records

(ii) Bottom 5 Records

(iii) Specific number of records from the top

(iv) Specific number of records from the bottom

(v) Sort Based on Category


a. Gender

b. Ethnicity

c. Smoking History of patient


d. Smoking History in family

e. Tumour Location

f. Treatment Used
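
Each category option above comes down to comparing one column with one value and keeping the matching rows. A rough sketch of that idea follows, using a made-up table; the helper name `filter_by` is hypothetical and the data is invented for the example. A true sort is shown next to it for contrast.

```python
import pandas as pd

# Made-up rows; not taken from the project's dataset.
patients = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Tumor_Location": ["Upper Lobe", "Lower Lobe", "Middle Lobe", "Upper Lobe"],
})


def filter_by(frame, column, value):
    """Return only the rows whose `column` equals `value`."""
    return frame[frame[column] == value]


upper = filter_by(patients, "Tumor_Location", "Upper Lobe")
print(upper)

# Sorting, by contrast, reorders the rows without dropping any of them.
ordered = patients.sort_values("Gender")
print(ordered)
```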

4. Comprehensive Data Analysis


(i) Count of all ethnic groups

(ii) Average age of patients

(iii) Oldest patient

(iv) Youngest patient

(v) Maximum no. of smoking packs consumed by a patient yearly

(vi) Minimum no. of smoking packs consumed by a patient yearly

(vii) No. of patients having a history of smoking in both personal and family
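
The summary figures listed above can be computed with a few pandas calls. The sketch below works on a made-up four-row table, so the numbers it prints say nothing about the real dataset; only the column names follow the ones used in the project code.

```python
import pandas as pd

# Made-up rows for illustration only.
patients = pd.DataFrame({
    "Age": [61, 54, 70, 48],
    "Ethnicity": ["Asian", "Hispanic", "Asian", "Other"],
    "Smoking_History": ["Current Smoker", "Never Smoked",
                        "Former Smoker", "Former Smoker"],
    "Family_History": ["Yes", "Yes", "No", "Yes"],
})

counts = patients.groupby("Ethnicity").size()   # patients per ethnic group
mean_age = patients["Age"].mean()               # average age
oldest = patients["Age"].max()
youngest = patients["Age"].min()

# Personal smoking history AND family history, combined as boolean masks.
smoked = patients["Smoking_History"].isin(["Current Smoker", "Former Smoker"])
family = patients["Family_History"] == "Yes"
both = int((smoked & family).sum())

print(counts)
print(mean_age, oldest, youngest, both)
```

Combining boolean masks with & avoids looping over row numbers, so the count stays correct even after rows have been added or removed.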

5. Data Visualization
a. Graph between Male or Female patients

b. Pie chart Different Ethnicity

c. Graph b/w Smoking History

d. Family History
e. Histogram for Age Category
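
As a hedged illustration of how charts like these are produced, the sketch below draws a bar chart and a histogram from a made-up table. It uses the non-interactive "Agg" backend and writes the image to an in-memory buffer instead of opening a window with plt.show(); both choices, like the data itself, are assumptions made so the example runs anywhere.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only.
patients = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "Age": [61, 54, 70, 48],
})

gender_counts = patients["Gender"].value_counts()

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Bar chart: one bar per category.
gender_counts.plot(kind="bar", ax=ax_bar, title="Gender Distribution")
ax_bar.set_ylabel("Number of Patients")

# Histogram: ages grouped into bins.
ax_hist.hist(patients["Age"], rwidth=0.8)
ax_hist.set_title("Distribution by Age")
ax_hist.set_xlabel("Age")

fig.tight_layout()
buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # render the figure into memory
```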
BIBLIOGRAPHY

The following books/websites helped me in
the completion of this project.

• Informatics Practices NCERT


• Informatics Practices by Preeti Arora
• [Link]
• [Link]
• [Link]
CONCLUSION

Through this project, I have gained
extensive knowledge of the Pandas and
PyPlot modules and their workings. I have
also understood the importance of data
analysis through its practical application.
