
Data Exploration & Visualization Manual

Uploaded by

naveenmg2108
Copyright
© All Rights Reserved

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

AD3301 – DATA EXPLORATION AND VISUALIZATION MANUAL

III Semester [Link]. Degree Course
[Regulation – 2021]

As per the Prescribed Syllabus by
ANNA UNIVERSITY, CHENNAI - 600025
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

MANUAL
SUBJECT CODE : AD3301
SUBJECT NAME : DATA EXPLORATION AND VISUALIZATION
REGULATION : 2021

ACADEMIC YEAR : 2025 - 2026

YEAR / SEMESTER : II / III

BATCH : 2024 - 2028

Prepared by Verified by Approved by


Name : Name : Name :

Date : Date : Date :

INSTITUTION VISION
Emerge as a Premier Institute, producing globally competent engineers.

INSTITUTION MISSION
IM1 : Achieve Academic diligence through effective and innovative teaching-
learning processes, using ICT Tools.
IM2 : Make students employable through rigorous career guidance and training
programs.
IM3 : Strengthen Industry Institute Interaction through MOUs and Collaborations.
IM4 : Promote Research & Development by inculcating creative thinking through
innovative projects incubation.

DEPARTMENT VISION
To revolutionize the quality of AI technology by creating a state-of-the-art environment where
innovation, sustainability, and social impact converge.

DEPARTMENT MISSION
DM1 : To develop and implement AI solutions that prioritize human values, ethics and social
responsibilities.
DM2 : To provide cutting-edge AI research, driving innovation and advancing the state-of-
the-art in AI technology.
DM3 : To provide industrial standards by means of collaborations for artificial intelligence
and data science.
DM4 : To provide an excellent infrastructure that keeps up with modern trends and
technologies for professional entrepreneurship.

PROGRAM OUTCOMES (POs)
PO1: Engineering Knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals and an engineering specialization to the solution of complex engineering
problems.

PO2: Problem Analysis: Identify, formulate, review research literature and analyze complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences and engineering sciences.

PO3: Design/Development of Solutions: Design solutions for complex engineering
problems and design system components or processes that meet the specified needs
with appropriate consideration for the public health and safety and the cultural, societal
and environmental considerations.

PO4: Conduct Investigations of Complex Problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data
and synthesis of the information to provide valid conclusions.

PO5: Modern Tool Usage: Create, select and apply appropriate techniques, resources and
modern engineering and IT tools including prediction and modeling to complex
engineering activities with an understanding of the limitations.

PO6: The Engineer and Society: Apply reasoning informed by the contextual knowledge
to assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.

PO7: Environment and Sustainability: Understand the impact of the professional
engineering solutions in societal and environmental contexts and demonstrate the
knowledge of and need for sustainable development.

PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.

PO9: Individual and Team Work: Function effectively as an individual and as a member
or leader in diverse teams and in multidisciplinary settings.

PO10: Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend
and write effective reports and design documentation, make effective presentations and
give and receive clear instructions.

PO11: Project Management and Finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a
member and leader in a team, to manage projects and in multidisciplinary
environments.

PO12: Life-Long Learning: Recognize the need for and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological
change.

PROGRAM SPECIFIC OBJECTIVES (PSOs)
PSO1 : Create, select and apply the knowledge of AI and Data Science to solve societal
problems.
PSO2 : Develop data analytics and data visualization skills, as well as skills pertaining to
knowledge acquisition, knowledge representation and knowledge engineering,
and hence be capable of coordinating complex projects.

PROGRAM EDUCATIONAL OBJECTIVES (PEOs)
PEO1 : Apply the knowledge of basic sciences, mathematics, Artificial Intelligence, data
science and statistics to build systems that require analysis of huge volumes
of data.
PEO2 : Product Development: Design a model using Artificial Intelligence to solve
critical real-world problems.
PEO3 : Higher Studies: To enable the students to think logically, pursue life-long
learning and collaborate with an ethical attitude in a multidisciplinary team.

GENERAL INSTRUCTIONS TO THE STUDENTS

1. No food or drink is to be brought into the lab, including gum and candy.
2. No cell phones or other electronic devices (iPods, MP3 players, etc.) are allowed at the lab stations.
3. Students must proceed immediately to their assigned position.
4. Students should maintain silence during the lab session.
5. Students must inspect their position for possible damage and immediately report any damage
found to the faculty.
6. Students must follow the faculty's instructions explicitly concerning all uses of the lab
equipment.
7. Leave your shoes in the shoe rack before entering the lab.
8. Shut down the computer properly before leaving the lab.
9. Do not bring college bags into the lab.
10. Arrange your chairs properly before leaving the lab.
11. Students must wear a lab coat during the lab session.

AD3301 - DATA EXPLORATION AND VISUALIZATION

L T P C
3 0 2 4

OBJECTIVES:
• To outline an overview of exploratory data analysis.
• To implement data visualization using Matplotlib.
• To perform univariate data exploration and analysis.
• To apply bivariate data exploration and analysis.
• To use Data exploration and visualization techniques for multivariate and time series
data.

COURSE OUTCOMES:
At the end of this course, the students will be able to:
CO205.1 Analyze the difference between Exploratory Data Analysis (EDA) and
classical/Bayesian analysis to effectively interpret and manipulate data.
CO205.2 Create customized and advanced visualizations using Matplotlib and Seaborn to
effectively communicate data insights.
CO205.3 Evaluate the distribution and variability of single-variable data using numerical
summaries and scaling techniques.
CO205.4 Analyze the relationships between two variables using contingency tables and
scatterplots to interpret bivariate data.
CO205.5 Apply the fundamentals of time series analysis (TSA) to clean, index, and visualize
time-based data.

LIST OF EXPERIMENTS:
1. Install the data Analysis and Visualization tool: R/ Python /Tableau Public/ Power BI.
2. Perform exploratory data analysis (EDA) on datasets such as an email data set. Export all your
emails as a dataset, import them into a pandas data frame, visualize them and get different
insights from the data.
3. Working with Numpy arrays, Pandas data frames, Basic plots using Matplotlib.
4. Explore various variable and row filters in R for cleaning data. Apply various plot features in R
on sample data sets and visualize.
5. Perform Time Series Analysis and apply the various visualization techniques.
6. Perform Data Analysis and representation on a Map using various Map data sets with Mouse
Rollover effect, user interaction, etc.
7. Build cartographic visualization for multiple datasets involving various countries of the world;
states and districts in India etc.
8. Perform EDA on Wine Quality Data Set.
9. Use a case study on a data set and apply the various EDA and visualization techniques and
present an analysis report.
CO – PO & PSO MAPPING
CO       PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2
CO205.1  3    2    1    1    3    -    -    -    2    2     -     2     2     3
CO205.2  3    3    3    2    3    -    -    -    2    2     -     2     3     3
CO205.3  3    3    2    2    3    -    -    -    2    2     -     2     3     3
CO205.4  3    3    3    2    3    -    -    -    2    2     -     2     3     3
CO205.5  3    3    3    3    3    -    -    -    2    2     -     3     3     3
Avg      3    2.8  2.4  2    3    -    -    -    2    2     -     2.2   2.8   3

INDEX WITH MAPPING OF CO

S. No.  Name of the Experiment / Exercise  CO Mapped  Page No  Date  Marks  Faculty Signature

1.  INSTALLING OF THE DATA ANALYSIS AND VISUALIZATION TOOL.  CO1
2.  EXPLORATORY DATA ANALYSIS (EDA) ON DATASETS.  CO1
3.  WORKING WITH NUMPY ARRAYS, PANDAS DATA FRAMES, BASIC PLOTS USING MATPLOTLIB.  CO2
4.  EXPLORING VARIOUS VARIABLE AND ROW FILTERS IN R FOR CLEANING DATA.  CO2
5.  DATA ANALYSIS AND REPRESENTATION ON A MAP.  CO2
6.  BUILDING CARTOGRAPHIC VISUALIZATION.  CO2
7.  PERFORMING EDA ON WINE QUALITY DATA SET.  CO2
8.  TIME SERIES ANALYSIS USING VARIOUS VISUALIZATION TECHNIQUES.  CO5
9.  VISUALIZING VARIOUS EDA TECHNIQUES AS CASE STUDY FOR IRIS DATASET.  CO3, CO4
10. CBS: ADVANCED DATA ANALYSIS AND VISUALIZATION ON A DATASET.  CO2

Ex. No. 1  INSTALLING OF THE DATA ANALYSIS AND VISUALIZATION TOOL
Date

AIM:
To install data analysis and visualization tools such as R/ Python /Tableau Public/ Power BI.

PROCEDURE:
R:
R is a programming language and software environment specifically designed for statistical
computing and graphics.

Windows:
• Download R from the official website: [Link]
• Run the installer and follow the installation instructions.

macOS:
• Download R for macOS from the same official website.
• Open the .pkg file and follow the installation instructions

Linux:
• You can install R using your distribution's package manager.

For Ubuntu/Debian:
sudo apt-get install r-base

Python:
Python is a versatile programming language widely used for data analysis and visualization.

Windows:
• Download Python from: [Link]
• Run the installer and check the "Add Python to PATH" option during installation.
• Install data analysis libraries (e.g., NumPy, pandas, matplotlib) using pip:
pip install numpy pandas matplotlib seaborn

macOS:
• Recent macOS releases no longer ship with Python 2.x pre-installed (it was removed in
macOS 12.3). For data analysis, install the latest Python version from the official site or
use Homebrew:
brew install python
• Use pip or a virtual environment (venv) or package manager like conda to install libraries.

Linux:
• Most Linux distributions come with Python pre-installed.
• Install or update Python using your package manager (example for Ubuntu):
sudo apt-get install python3 python3-pip
• Use pip or conda to manage data analysis libraries.
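
Once the libraries are installed, a short script can confirm that Python can see them. The following is a minimal sketch, assuming Python 3.8 or later; the helper name installed_versions is made up for illustration, and the package list simply mirrors the pip command shown earlier.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names taken from the pip install command in this exercise.
    wanted = ["numpy", "pandas", "matplotlib", "seaborn"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be added with pip or conda as described above.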

Tableau Public:
Tableau Public is a free version of Tableau for creating and sharing interactive visualizations.
• Go to: [Link]
• Download and install Tableau Public by following the on-screen instructions.

Power BI:
Power BI is a business analytics tool by Microsoft used for creating dashboards and reports.
• Visit: [Link]
• Download Power BI Desktop and follow the installation instructions.

Anaconda
Anaconda is a free, open-source distribution of Python and R for data science and machine
learning. It comes with popular libraries like NumPy, pandas, matplotlib, scikit-learn, and Jupyter
Notebook.

Windows/macOS:
• Go to the official Anaconda website:
[Link]
• Click “Download” based on your operating system (Windows or macOS) and the
appropriate Python version (usually Python 3.x).
• Run the Installer:
o For Windows: Double-click the .exe file.
o For macOS: Open the .pkg file.
• Accept the license agreement.
• Choose installation for "Just Me" (recommended).
• Select the destination folder.
• Optionally check “Add Anaconda to my PATH environment variable” (the installer marks this
option as not recommended), or leave it unchecked and use Anaconda Prompt to run Python.
• Verify Installation: Open Anaconda Navigator or launch Anaconda Prompt and type:
conda list

Linux:
• Download the .sh installer for Linux from the same website.
• Open Terminal and run:
bash [Link]-Linux-x86_64.sh

• Follow the prompts, agree to the license, and confirm installation path.
• After installation:
source ~/.bashrc
conda list

Using Google Colab


Google Colab (Colaboratory) is a free, cloud-based Jupyter Notebook environment. It allows you
to write and execute Python code in your browser without installing anything.
Steps to Use Google Colab:
• Visit: [Link]
• Sign in with a Google Account.
• Click on “New Notebook” to start a new file.
• Open existing notebooks from Google Drive, GitHub, or upload a .ipynb file.
• Write Python code in the code cells, and press Shift + Enter to run.

To install packages in a Colab notebook: !pip install package_name


Example: !pip install seaborn
Save your work: Click File → Save a copy in Drive to store the notebook.

RESULT:
Thus, the data analysis and visualization tools have been installed successfully.

VIVA QUESTIONS

1. What is R, and where is it primarily used?


Asked by: TCS, Cognizant
R is a programming language and environment specifically built for statistical computing and
graphics. It is used extensively in academia, data science, and analytics for statistical modeling
and visualization.

2. How do you install R on a Windows system?


Asked by: Wipro
Go to [Link] download the appropriate version for Windows, run
the installer, and follow the setup instructions.

3. What is Anaconda, and why is it used in data science?


Asked by: Infosys, Capgemini
Anaconda is a distribution of Python and R for data science and machine learning. It includes tools
like Jupyter Notebook and libraries such as pandas, NumPy, and scikit-learn, which simplify
package and environment management.

4. How do you install Python libraries like pandas and matplotlib?


Asked by: HCL Technologies
Use the pip command:
pip install pandas matplotlib
This will download and install the libraries from the Python Package Index.

5. What is the difference between Google Colab and Jupyter Notebook?


Asked by: Accenture
Google Colab is a cloud-based Jupyter Notebook environment that allows free access to compute
resources (including GPUs) and does not require any installation, whereas Jupyter Notebook runs
locally.

6. How do you install Tableau Public?


Asked by: Cognizant
Visit [Link] download the installer, and follow the on-screen instructions to
install Tableau Public on your system.

7. What is Power BI, and how do you install it?


Asked by: Wipro, TCS
Power BI is a business intelligence tool from Microsoft used for data visualization and dashboard
creation. Download Power BI Desktop from [Link] and follow the
installation steps.

8. How can you verify if Anaconda is installed properly?
Asked by: Capgemini
Open the Anaconda Prompt or terminal and type conda list. If it displays a list of packages, then
Anaconda is installed successfully.

9. Can you use both Python and R in Anaconda? How?


Asked by: Infosys
Yes, Anaconda supports both languages. You can create environments for each using conda create
-n myenv python=3.9 or conda create -n r-env r-base and activate them using conda activate
myenv.

10. How do you install Python on macOS and manage libraries?


Asked by: TCS, Accenture
Install Python using Homebrew with brew install python. Then manage libraries using pip (pip
install package_name) or create virtual environments with venv or use Anaconda.

Ex. No. 2  EXPLORATORY DATA ANALYSIS (EDA) ON DATASETS
Date

AIM:
To perform exploratory data analysis (EDA) on datasets such as an email data set.

ALGORITHM:
Step 1: Import the necessary Python libraries.
Step 2: Load the email dataset into a Pandas DataFrame.
Step 3: Display the basic structure and content of the dataset.
Step 4: Compute descriptive statistics.
Step 5: Check for missing or null values.
Step 6: Drop unnecessary columns.
Step 7: Visualize the label counts using a bar chart.
Step 8: Count the label categories.
Step 9: Visualize the distribution of categories using a pie chart.

PROGRAM & OUTPUT:


# Step 1: Import necessary libraries
!pip install plotly
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

# Step 2: Load the dataset
# Link:
# [Link]
df = pd.read_csv('spam_ham_dataset.csv')
Dataset: spam_ham_dataset
Link:[Link]
dataset?select=spam_ham_dataset.csv

# Step 3: Basic data info
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 5171 non-null int64
1 label 5171 non-null object
2 text 5171 non-null object
3 label_num 5171 non-null int64
dtypes: int64(2), object(2)
memory usage: 161.7+ KB
None

print(df.head())

Unnamed: 0 label text \


0 605 ham Subject: enron methanol ; meter # : 988291\r\n...
1 2349 ham Subject: hpl nom for january 9 , 2001\r\n( see...
2 3624 ham Subject: neon retreat\r\nho ho ho , we ' re ar...
3 4685 spam Subject: photoshop , windows , office . cheap ...
4 2030 ham Subject: re : indian springs\r\nthis deal is t...

label_num
0 0
1 0
2 0
3 1
4 0

# Step 4: Descriptive statistics
print(df.describe())
Unnamed: 0 label_num
count 5171.000000 5171.000000
mean 2585.000000 0.289886
std 1492.883452 0.453753
min 0.000000 0.000000
25% 1292.500000 0.000000
50% 2585.000000 0.000000
75% 3877.500000 1.000000
max 5170.000000 1.000000

# Step 5: Check for missing values
print(df.isnull().sum())

Unnamed: 0 0
label 0
text 0
label_num 0
dtype: int64

# Step 6: Drop unnecessary columns
df.drop(columns=['Unnamed: 0'], inplace=True)

# Step 7: Generate bar chart
sns.countplot(x='label', data=df, palette='pastel')
plt.title('Count of Ham and Spam Emails')
plt.xlabel('Label')
plt.ylabel('Count')
plt.show()

# Step 8: Count the label categories
category_ct = df['label'].value_counts()

# Step 9: Create and customize the pie chart
fig = px.pie(values=category_ct.values, names=category_ct.index,
             color_discrete_sequence=[Link], title='Pie Graph: Spam or Ham')

fig.update_traces(hoverinfo='label+percent', textinfo='label+value+percent',
                  textfont_size=15, marker=dict(line=dict(color='green', width=2)))
fig.show()

RESULT:
Thus, exploratory data analysis (EDA) on an email data set has been performed
successfully.

VIVA QUESTIONS

1. What is Exploratory Data Analysis (EDA)?


Asked by: TCS, Wipro
EDA is the process of analyzing datasets to summarize their main characteristics using statistical
and visual techniques before applying modeling or algorithms.

2. What type of dataset is used in this EDA example?


Asked by: Cognizant
The dataset used is a labeled email dataset called spam_ham_dataset.csv, which contains both
spam and ham (non-spam) emails.

3. Why is df.info() used in the analysis?


Asked by: Infosys
df.info() provides an overview of the dataset including column names, non-null counts, and data
types.

4. What does df.describe() show?


Asked by: Accenture
It shows descriptive statistics such as count, mean, standard deviation, min, and max values for
numerical columns.

5. Why do we drop the column ‘Unnamed: 0’?


Asked by: Capgemini
It is an unnecessary index column that does not provide useful information for analysis, so it is
dropped to clean the dataset.

6. What is the purpose of using sns.countplot() in this context?


Asked by: IBM
sns.countplot() is used to visualize the number of spam and ham emails in the dataset using a bar
chart.

7. Why is Plotly used in this program?


Asked by: Zoho
Plotly is used to create interactive visualizations, such as a pie chart showing the distribution of
spam and ham emails.

8. What is the meaning of label and label_num in this dataset?


Asked by: CTS (Cognizant Technology Solutions)
label is a categorical column with values like 'spam' or 'ham'; label_num is the corresponding
numeric form, where 1 = spam and 0 = ham.

9. What does df.isnull().sum() do?
Asked by: TCS
It checks for missing or null values in the dataset and returns the total count for each column.
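
The email dataset in this exercise has no missing values, so the check returns all zeros. The following sketch uses a small made-up frame (not the email dataset) with two deliberately missing cells to show what a non-zero result looks like and how incomplete rows can be inspected or dropped.

```python
import pandas as pd

# Hypothetical toy frame with two deliberately missing cells.
toy = pd.DataFrame({
    "label": ["ham", "spam", None, "ham"],
    "text": ["hello", None, "offer", "meeting"],
})

missing_per_column = toy.isnull().sum()             # nulls in each column
rows_with_missing = toy[toy.isnull().any(axis=1)]   # rows with at least one null
cleaned = toy.dropna()                              # keep only complete rows

print(missing_per_column)
print(rows_with_missing)
print(len(cleaned), "complete rows remain")
```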

10. How does a pie chart help in EDA?


Asked by: Infosys
A pie chart helps in understanding the proportion of different classes (spam vs. ham) visually,
making class imbalance easy to detect.
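
The same proportions can be read off numerically. The sketch below rebuilds only the label column from the summary statistics printed in this exercise (a mean label_num of 0.289886 over 5171 rows corresponds to 1499 spam and 3672 ham emails); it does not load the actual CSV file.

```python
import pandas as pd

# Counts reconstructed from the describe() output in this exercise:
# 0.289886 * 5171 rows is 1499 spam, leaving 3672 ham.
labels = pd.Series(["ham"] * 3672 + ["spam"] * 1499, name="label")

counts = labels.value_counts()
proportions = labels.value_counts(normalize=True)
imbalance_ratio = counts.max() / counts.min()   # majority class : minority class

print(counts)
print(proportions.round(4))
print("Imbalance ratio:", round(imbalance_ratio, 2))
```

A ratio well above 1 is a signal that accuracy alone may be a misleading metric for a later classifier.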

Ex. No. 3  WORKING WITH NUMPY ARRAYS, PANDAS DATA FRAMES, BASIC PLOTS USING MATPLOTLIB
Date

AIM:
To perform basic operations using NumPy arrays, manipulate data using Pandas DataFrames, and
create basic plots using Matplotlib.

ALGORITHM:
A. NumPy Arrays:
Step 1: Import the NumPy library.
Step 2: Create 1D and 2D arrays using [Link], [Link], and [Link].
Step 3: Perform arithmetic operations, statistical analysis, reshaping, and broadcasting.
Step 4: Apply functions like mean, median, std, linspace, dot, and matrix multiplication.

B. Pandas DataFrames:
Step 1: Import the Pandas library.
Step 2: Create a DataFrame from a dictionary and a CSV file.
Step 3: Perform data selection, slicing, filtering, sorting, grouping, and aggregation.
Step 4: Add new columns, drop columns, handle missing values, and apply lambda functions.

C. Basic Plots using Matplotlib:


Step 1: Import matplotlib.pyplot as plt.
Step 2: Create different types of plots: line, bar, histogram, scatter, and pie chart.
Step 3: Customize plots with titles, labels, legends, colors, and grid.
Step 4: Display multiple subplots using plt.subplots.

PROGRAM & OUTPUT:


A. NumPy Arrays:
# Import numpy library
import numpy as np

# Create arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2], [3, 4]])
print("Array 1:", arr1)
print("Array 2:", arr2)

Array 1: [1 2 3 4 5]
Array 2: [[1 2]
[3 4]]

# Basic operations
print("Mean:", np.mean(arr1))
print("Sum:", np.sum(arr1))
print("Standard Deviation:", np.std(arr1))
print("Square Root:", np.sqrt(arr1))
print("Exponential:", np.exp(arr1))

Mean: 3.0
Sum: 15
Standard Deviation: 1.4142135623730951
Square Root: [1. 1.41421356 1.73205081 2. 2.23606798]
Exponential: [ 2.71828183 7.3890561 20.08553692 54.59815003 148.4131591 ]

# Indexing and slicing


print("First Element:", arr1[0])
print("Sub-array:", arr1[1:4])

First Element: 1
Sub-array: [2 3 4]

# Array concatenation
combined = np.concatenate([arr1, np.array([6, 7, 8, 9])])
print("Combined Array:", combined)

Combined Array: [1 2 3 4 5 6 7 8 9]

# Reshape a 1D range into a 2 x 3 array
reshaped = np.reshape(np.arange(1, 7), (2, 3))
print("Reshaped Array:\n", reshaped)

Reshaped Array:
[[1 2 3]
[4 5 6]]
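
The algorithm for this part also lists broadcasting, the median, the dot product, and matrix multiplication, which the listing above does not exercise. A minimal sketch with small made-up arrays:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
row = np.array([10, 20])

# Broadcasting: the 1D row is added to every row of the 2D array.
broadcast_sum = a + row

# Median of a 1D array.
median = np.median(np.array([1, 2, 3, 4, 5]))

# Dot product of two vectors: 1*4 + 2*5 + 3*6.
dot = np.dot(np.array([1, 2, 3]), np.array([4, 5, 6]))

# Matrix multiplication of a 2 x 2 matrix with itself.
matmul = a @ a

print("Broadcast sum:\n", broadcast_sum)
print("Median:", median)
print("Dot product:", dot)
print("Matrix product:\n", matmul)
```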

B. Pandas DataFrames
# Import pandas library
import pandas as pd
# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'Marks': [88, 92, 79, 85, 90],
    'City': ['Chennai', 'Mumbai', 'Chennai', 'Delhi', 'Mumbai']
}
df = pd.DataFrame(data)
print("\nDataFrame:\n", df)

DataFrame:
Name Age Marks City
0 Alice 25 88 Chennai
1 Bob 30 92 Mumbai
2 Charlie 35 79 Chennai
3 David 40 85 Delhi
4 Eva 45 90 Mumbai

# Accessing a specific column


print(df['Name'])
0 Alice
1 Bob
2 Charlie
3 David
4 Eva
Name: Name, dtype: object

# Filter and aggregation


print("\nMarks > 85:\n", df[df['Marks'] > 85])
print("\nGroup by City:\n", df.groupby('City')['Marks'].mean())

Marks > 85:


Name Age Marks City
0 Alice 25 88 Chennai
1 Bob 30 92 Mumbai
4 Eva 45 90 Mumbai

Group by City:
City
Chennai 83.5
Delhi 85.0
Mumbai 91.0
Name: Marks, dtype: float64

# Add and drop columns


df['Passed'] = df['Marks'] > 80
df.drop(columns=['Passed'], inplace=True)
print(df)

Name Age Marks City


0 Alice 25 88 Chennai
1 Bob 30 92 Mumbai
2 Charlie 35 79 Chennai
3 David 40 85 Delhi
4 Eva 45 90 Mumbai

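# Handle missing data
df['Marks'] = df['Marks'].astype(float)  # cast first so the column can hold NaN
df.loc[2, 'Marks'] = None
print("DataFrame - Before Handling missing data: \n", df)
# Assign the result back instead of using inplace=True on a column selection
df['Marks'] = df['Marks'].fillna(df['Marks'].mean())
print("\nDataFrame - After Handling missing data: \n", df)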

DataFrame - Before Handling missing data:


Name Age Marks City
0 Alice 25 88.0 Chennai
1 Bob 30 92.0 Mumbai
2 Charlie 35 NaN Chennai
3 David 40 85.0 Delhi
4 Eva 45 90.0 Mumbai

DataFrame - After Handling missing data:


Name Age Marks City
0 Alice 25 88.00 Chennai
1 Bob 30 92.00 Mumbai
2 Charlie 35 88.75 Chennai
3 David 40 85.00 Delhi
4 Eva 45 90.00 Mumbai

# Save and reload


df.to_csv('[Link]', index=False)
df_new = pd.read_csv('[Link]')
print("\nCSV Loaded:\n", df_new)

CSV Loaded:
Name Age Marks City
0 Alice 25 88.00 Chennai
1 Bob 30 92.00 Mumbai
2 Charlie 35 88.75 Chennai
3 David 40 85.00 Delhi
4 Eva 45 90.00 Mumbai
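
The algorithm for this part also mentions sorting and lambda functions, which the listing above does not show. The sketch below rebuilds the original sample DataFrame; the grade cut-off of 90 and the column name Grade are illustrative assumptions, not part of the exercise.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 45],
    "Marks": [88, 92, 79, 85, 90],
    "City": ["Chennai", "Mumbai", "Chennai", "Delhi", "Mumbai"],
})

# Sorting: highest marks first.
ranked = df.sort_values("Marks", ascending=False)

# Lambda function applied element-wise (assumed cut-off: 90 and above is an 'A').
df["Grade"] = df["Marks"].apply(lambda m: "A" if m >= 90 else "B")

# Several aggregations per group in one call.
city_summary = df.groupby("City")["Marks"].agg(["mean", "max"])

print(ranked[["Name", "Marks"]])
print(df[["Name", "Grade"]])
print(city_summary)
```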

C. Basic Plots using Matplotlib


# Import matplotlib library
import matplotlib.pyplot as plt

# Line Plot
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.figure(figsize=(6, 4))
plt.plot(x, y, label='Sine', color='blue')
plt.title("Line Plot")
plt.xlabel("X")
plt.ylabel("sin(X)")
plt.legend()
plt.grid(True)
plt.show()

# Bar Chart
plt.bar(df['Name'], df['Marks'], color='green')
plt.title("Bar Chart - Marks")
plt.xlabel("Name")
plt.ylabel("Marks")
plt.show()

# Histogram
plt.hist(df['Marks'], bins=5, color='yellow')
plt.title("Histogram - Marks Distribution")
plt.xlabel("Marks")
plt.ylabel("Frequency")
plt.show()

# Scatter Plot
plt.scatter(df['Age'], df['Marks'], color='red')
plt.title("Scatter - Age vs Marks")
plt.xlabel("Age")
plt.ylabel("Marks")
plt.show()

# Pie Chart
city_counts = df['City'].value_counts()
plt.pie(city_counts, labels=city_counts.index, autopct='%1.1f%%', startangle=90)
plt.title("Pie Chart - City Distribution")
plt.axis('equal')
plt.show()

# Subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 4))
axs[0].plot(x, np.cos(x), label='Cosine', color='purple')
axs[1].plot(x, np.tan(x), label='Tangent', color='brown')
axs[0].set_title("Cosine Plot")
axs[1].set_title("Tangent Plot")
for ax in axs:
    ax.grid(True)
    ax.legend()
plt.tight_layout()
plt.show()
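
On the interval 0 to 10 the tangent function has vertical asymptotes near 1.57, 4.71 and 7.85, so an unclipped tangent curve is dominated by a few very large spikes. The sketch below is one possible way to keep the plot readable: values beyond an assumed threshold of ±10 are replaced with NaN so Matplotlib leaves a gap there. The threshold and the output file name are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
y = np.tan(x)

# Assumed threshold: hide samples beyond +/-10 near the asymptotes.
limit = 10
y_masked = np.where(np.abs(y) > limit, np.nan, y)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, y_masked, color="brown", label="Tangent")
ax.set_ylim(-limit, limit)
ax.set_title("Tangent Plot (clipped)")
ax.set_xlabel("X")
ax.set_ylabel("tan(X)")
ax.grid(True)
ax.legend()
fig.savefig("tangent_clipped.png")  # hypothetical output file name
```

In a notebook, the matplotlib.use("Agg") line can be dropped and fig.savefig replaced with plt.show().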

RESULT:
Thus, the operations using NumPy arrays, Pandas DataFrames, and Matplotlib plots were
successfully implemented.
VIVA QUESTIONS

1. What is the significance of converting categorical labels into numerical form?


Asked by: Deloitte
Numerical labels allow algorithms and statistical operations to process the data efficiently, as
most models require numeric input.

2. Why is label_num important in this dataset?


Asked by: Mindtree
It simplifies spam classification by converting textual labels ('spam', 'ham') into numerical values
(1, 0) for analysis and model input.
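
As a minimal sketch of this conversion, the mapping can be written with a plain dictionary. The five labels below are made-up toy values, not rows from the email dataset.

```python
import pandas as pd

# Toy labels for illustration only.
labels = pd.Series(["ham", "spam", "ham", "ham", "spam"], name="label")

# Map each textual label to its numeric code: ham -> 0, spam -> 1.
label_num = labels.map({"ham": 0, "spam": 1})

print(label_num.tolist())
print("Share of spam:", label_num.mean())
```

Because spam is coded as 1, the mean of the numeric column is directly the fraction of spam messages.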

3. What does sns.countplot() do in the context of this dataset?


Asked by: Hexaware
It creates a bar chart showing the frequency of spam and ham messages, helping identify data
imbalance.

4. Why is it necessary to install Plotly separately?


Asked by: CTS (Cognizant)
Plotly is not included by default in many Python environments, so it needs manual installation
using pip to enable advanced interactive plotting.

5. What type of insights can you get from a pie chart of email labels?
Asked by: Bosch
It shows the percentage distribution of spam vs. ham emails, helping visualize class proportions
clearly.

6. What does df.head() help us understand?


Asked by: L&T Infotech
It displays the first few rows of the dataset, giving a quick glimpse of data format, structure, and
values.

7. Why is data visualization crucial before applying ML models?


Asked by: Tech Mahindra
It helps detect patterns, correlations, or anomalies that might affect model performance or require
preprocessing.

8. What is class imbalance, and how can it affect results?


Asked by: Accenture
Class imbalance occurs when one class significantly outnumbers others, potentially leading to
biased model predictions.

9. Why do we use drop(columns=[...]) in Pandas?
Asked by: Wipro
To remove irrelevant or redundant columns that do not contribute meaningful insights to the
analysis.

10. Can we apply machine learning directly after EDA?


Asked by: Capgemini
No, further steps like feature engineering, normalization, and data splitting are usually needed
before modeling.
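
A rough sketch of two of these steps, min-max normalization and a train/hold-out split, using only pandas. The numbers reuse the Age and Marks columns from the sample DataFrame in this exercise; the 80/20 split and the random seed are arbitrary assumptions.

```python
import pandas as pd

features = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Marks": [88, 92, 79, 85, 90],
})

# Min-max normalization: rescale every column to the range [0, 1].
scaled = (features - features.min()) / (features.max() - features.min())

# Data splitting: 80% of the rows for training, the rest held out (assumed ratio and seed).
train = scaled.sample(frac=0.8, random_state=42)
holdout = scaled.drop(train.index)

print(scaled)
print(len(train), "training rows,", len(holdout), "hold-out row(s)")
```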

Ex. No. 4  EXPLORING VARIOUS VARIABLE AND ROW FILTERS IN R FOR CLEANING DATA
Date

AIM:
To explore and apply variable and row-level filtering techniques in R for effective data cleaning
and basic visualization.

PROCEDURE TO INSTALL AND WORK WITH R AND R STUDIO


• Install R and RStudio:
o Download R from the official CRAN website: [Link]
o Download RStudio (an IDE for R): [Link]
o Install R first, then RStudio.
o After installation, open RStudio from your Start menu (Windows).

• Using the RStudio Interface


o RStudio has four main panels:
▪ Top-Left: Script Editor
▪ Bottom-Left: Console (where R code runs)
▪ Top-Right: Environment / History
▪ Bottom-Right: Plots / Files / Packages / Help

• Create a New Script


o Go to File → New File → R Script (or press Ctrl + Shift + N).
o This opens a new editor window (top-left panel).
• Save Your Script
o Click File → Save As and name it like filename.R.

• Run R Code
o You have three options:
▪ To run the whole script from a terminal: Use the command Rscript filename.R.
▪ To run line-by-line in the Console: Place your cursor on a line in the script
editor and press Ctrl + Enter.
▪ To run everything in the Console: Highlight all code in the script editor and
press Ctrl + Enter, or click "Run" in the top right of the editor.

• Install Required Packages (Optional)
o If your code uses external libraries like ggplot2, install them with:
install.packages("ggplot2")
o Then load it:
library(ggplot2)

• View Output
o The output will appear in the Console.
o Plots will appear in the Plots tab (bottom-right).

ALGORITHM:
1. Data Preparation
a. Load required R packages.
b. Create a sample dataset using data.frame().
c. Display the raw data for reference.

2. Variable-Based Filtering
a. Filter by a specific condition (e.g., Age > 30).
b. Filter using multiple conditions (e.g., Age > 30 & Gender == "Male").

3. Row-Based Filtering
a. Remove duplicate rows using unique().
b. Remove missing values using na.omit().

4. Data Visualization
a. Load the ggplot2 package.
b. Create the following visualizations using the cleaned dataset:
   - Scatterplot: Age vs. Score, colored by Gender.
   - Histogram: Distribution of Age.
   - Bar chart: Gender distribution.
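
For readers coming from the Python exercises, the same filtering steps have close pandas counterparts. The sketch below is only a cross-reference, not part of the R program: it hard-codes the ten rows shown in the OUTPUT section of this exercise and applies a boolean mask, drop_duplicates() and dropna() in place of the R indexing, unique() and na.omit().

```python
import numpy as np
import pandas as pd

# The ten sample rows printed in this exercise's OUTPUT section.
data = pd.DataFrame({
    "ID": range(1, 11),
    "Age": [48, 32, 31, 20, 59, 60, 54, 31, 42, 43],
    "Gender": ["Male", "Female", "Male", "Male", "Male",
               "Male", "Female", "Male", "Male", "Male"],
    "Score": [99, 72, 26, 7, 42, 9, 83, 36, 78, 81],
})

# Variable-based filtering with boolean masks.
older = data[data["Age"] > 30]
older_male = data[(data["Age"] > 30) & (data["Gender"] == "Male")]

# Row-based filtering: duplicates on selected columns, then missing values.
deduplicated = data[["ID", "Age", "Gender"]].drop_duplicates()
data["Score"] = data["Score"].astype(float)   # float column so it can hold NaN
data.loc[4, "Score"] = np.nan                 # fifth row; pandas is 0-indexed, R is 1-indexed
complete = data.dropna()

print(len(older), len(older_male), len(deduplicated), len(complete))
```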

PROGRAM:
# Load necessary library
library(ggplot2)

# Step 1: Create a Sample Dataset
set.seed(123)
data <- data.frame(
  ID = 1:10,
  Age = sample(18:60, 10, replace = TRUE),
  Gender = sample(c("Male", "Female"), 10, replace = TRUE),
  Score = sample(1:100, 10)
)

# Print the dataset


print("Original Data:")
print(data)

# Step 2: Variable Filtering
# 2.1 Filter: Age > 30
filtered_data1 <- data[data$Age > 30, ]
print("Filtered Data (Age > 30):")
print(filtered_data1)

# 2.2 Filter: Age > 30 and Gender == "Male"


filtered_data2 <- data[data$Age > 30 & data$Gender == "Male", ]
print("Filtered Data (Age > 30 & Gender == Male):")
print(filtered_data2)

# Step 3: Row Filtering


# 3.1 Remove duplicate rows based on selected columns
cleaned_data1 <- unique(data[, c("ID", "Age", "Gender")])
print("Cleaned Data (Duplicates Removed):")
print(cleaned_data1)

# 3.2 Remove rows with missing values (NA)
data$Score[5] <- NA
cleaned_data2 <- na.omit(data)
print("Cleaned Data (Missing Values Removed):")
print(cleaned_data2)

# Step 4: Data Visualization using ggplot2


# 4.1 Scatterplot of Age vs Score, colored by Gender
ggplot(data = cleaned_data2, aes(x = Age, y = Score, color = Gender)) +
geom_point(size = 3) +
labs(title = "Scatterplot of Age vs. Score", x = "Age", y = "Score") +
theme_minimal()

# 4.2 Histogram of Age


ggplot(data = cleaned_data2, aes(x = Age)) +
geom_histogram(binwidth = 5, fill = "blue", alpha = 0.6) +
labs(title = "Histogram of Age", x = "Age", y = "Frequency") +
theme_minimal()

# 4.3 Bar Chart of Gender Distribution


ggplot(data = cleaned_data2, aes(x = Gender)) +
geom_bar(fill = "green", alpha = 0.7) +
labs(title = "Gender Distribution", x = "Gender", y = "Count") +
theme_minimal()

OUTPUT:
[1] "Original Data:"
ID Age Gender Score
1 1 48 Male 99
2 2 32 Female 72
3 3 31 Male 26
4 4 20 Male 7
5 5 59 Male 42
6 6 60 Male 9
7 7 54 Female 83
8 8 31 Male 36
9 9 42 Male 78
10 10 43 Male 81

[1] "Filtered Data (Age > 30):"


ID Age Gender Score
1 1 48 Male 99
2 2 32 Female 72
3 3 31 Male 26
5 5 59 Male 42
6 6 60 Male 9
7 7 54 Female 83
8 8 31 Male 36
9 9 42 Male 78
10 10 43 Male 81
[1] "Filtered Data (Age > 30 & Gender == Male):"
ID Age Gender Score
1 1 48 Male 99
3 3 31 Male 26
5 5 59 Male 42
6 6 60 Male 9
8 8 31 Male 36
9 9 42 Male 78
10 10 43 Male 81

[1] "Cleaned Data (Duplicates Removed):"


ID Age Gender
1 1 48 Male
2 2 32 Female
3 3 31 Male
4 4 20 Male
5 5 59 Male
6 6 60 Male
7 7 54 Female
8 8 31 Male
9 9 42 Male
10 10 43 Male
[1] "Cleaned Data (Missing Values Removed):"
ID Age Gender Score
1 1 48 Male 99
2 2 32 Female 72
3 3 31 Male 26
4 4 20 Male 7
6 6 60 Male 9
7 7 54 Female 83
8 8 31 Male 36
9 9 42 Male 78
10 10 43 Male 81

RESULT:
Thus, various variable and row filters in R were successfully explored and applied for data
cleaning, and the cleaned dataset was visualized using multiple plot types.

VIVA QUESTIONS

1. What function is used to remove missing values from a dataset in R?


Asked by: TCS
The na.omit() function removes rows with missing values in R.

2. How do you remove duplicate rows from a dataframe in R?


Asked by: Cognizant
You can remove duplicates using the unique() function in R.

3. What is the purpose of using subset() in R?


Asked by: Infosys
The subset() function filters data based on specified conditions like Age > 30.

4. Which function is used to create a dataframe manually in R?


Asked by: Wipro
The data.frame() function is used to create a dataset manually in R.

5. What is the role of the ggplot2 library in R?


Asked by: Capgemini
ggplot2 is used for creating advanced and customizable data visualizations in R.

6. How can we display the structure of a dataframe in R?


Asked by: L&T Infotech
The str() function is used to view the structure of a dataframe.

7. What R function allows you to install a package from CRAN?


Asked by: HCL Technologies
install.packages("package_name") installs a package from CRAN.

8. How do you filter records where Age > 30 and Gender is Male?
Asked by: Deloitte
Using subset(df, Age > 30 & Gender == "Male").

9. Where do plots appear in RStudio after running a visualization command?


Asked by: Hexaware
Plots appear in the "Plots" tab in the bottom-right panel of RStudio.

10. Why is data cleaning necessary before visualization?
Asked by: Mindtree
Data cleaning ensures accuracy and clarity in analysis and visualization by removing
inconsistencies, duplicates, and missing values.
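
For readers who continue with the Python exercises that follow, the same cleaning steps can be sketched in pandas. This is only an illustrative equivalent of the R program, not part of the prescribed exercise: the rows are copied from the "Original Data" output, the Score of ID 5 is set to missing to mirror data$Score[5] <- NA, and the variable names (older, older_men, deduplicated, complete) are assumptions made for this sketch.

```python
import pandas as pd

# Rows copied from the "Original Data" output; the Score of ID 5 is set to
# missing to mirror the R statement data$Score[5] <- NA.
df = pd.DataFrame({
    "ID": range(1, 11),
    "Age": [48, 32, 31, 20, 59, 60, 54, 31, 42, 43],
    "Gender": ["Male", "Female", "Male", "Male", "Male",
               "Male", "Female", "Male", "Male", "Male"],
    "Score": [99, 72, 26, 7, None, 9, 83, 36, 78, 81],
})

older = df[df["Age"] > 30]                                   # one condition
older_men = df[(df["Age"] > 30) & (df["Gender"] == "Male")]  # two conditions
deduplicated = df.drop_duplicates()                          # counterpart of unique()
complete = df.dropna()                                       # counterpart of na.omit()

print(len(older), len(older_men), len(deduplicated), len(complete))
```

The row counts agree with the R output above: nine rows have Age > 30, seven of those are Male, no row is a duplicate, and one row is dropped for its missing Score.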

Ex. No. 5
DATA ANALYSIS AND REPRESENTATION ON A MAP
Date

AIM:
To visualize data on a world map using interactive features like mouse rollover and zoom using
Python’s Plotly library.

ALGORITHM:
Step 1: Open Google Colab and import necessary libraries.
Step 2: Prepare sample data (e.g., country names and some values).
Step 3: Use plotly.express to create a choropleth map.
Step 4: Add features like tooltips and color scales.
Step 5: Display the interactive map.

PROGRAM & OUTPUT:


# Step 1: Install plotly
!pip install plotly

# Step 2: Import libraries


import plotly.express as px
import pandas as pd

# Step 3: Sample data for countries


data = {
'Country': ['India', 'United States', 'China', 'Brazil', 'Australia'],
'Value': [80, 90, 75, 60, 70]
}
print(data)

{'Country': ['India', 'United States', 'China', 'Brazil', 'Australia'], 'Value': [80, 90, 75, 60, 70]}

# Step 4: Create DataFrame


df = pd.DataFrame(data)
df

         Country  Value
0          India     80
1  United States     90
2          China     75
3         Brazil     60
4      Australia     70

# Step 5: Create choropleth map


fig = px.choropleth(
df,
locations='Country', # Name of the country
locationmode='country names', # Country names as keys
color='Value', # Data to show
hover_name='Country', # Show country name on hover
color_continuous_scale='YlOrRd', # Color scale
title='Interactive World Map with Data Values'
)

# Step 6: Show the map


fig.show()

RESULT:
Thus, an interactive data representation on a world map using Plotly with user interaction features
is executed successfully.

VIVA QUESTIONS

1. What is a choropleth map?


Asked by: TCS
A choropleth map is a type of map where areas are shaded or colored in proportion to a data
variable, such as population or GDP.

2. Why do we use plotly.express for map visualizations?


Asked by: Wipro
plotly.express provides simple syntax to create interactive plots and maps with features like
zoom, hover tooltips, and color scales.

3. What does locationmode='country names' mean in Plotly?


Asked by: Capgemini
It tells Plotly to match the location using the actual names of countries rather than ISO codes or
abbreviations.

4. Can we use custom datasets for mapping in Plotly?


Asked by: Infosys
Yes, custom datasets with country names or ISO codes can be used to generate interactive maps.

5. What type of data is best suited for a choropleth map?


Asked by: Cognizant
Quantitative data that can be aggregated by geographical regions like countries or states is best
suited.

6. How is interactivity implemented in Plotly maps?


Asked by: HCL
Plotly automatically supports interactivity like hover effects, zooming, and panning in web-based
plots.

7. Why is a color scale important in a choropleth map?


Asked by: Hexaware
The color scale visually differentiates values across regions, helping users understand the
distribution easily.

8. What happens if a country name in the dataset is not recognized?


Asked by: L&T Infotech
That country will not be displayed on the map, and no error will be raised unless explicitly
handled.
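
Because unmatched names disappear silently, it can help to check them before plotting. The sketch below is one possible check, under stated assumptions: the ISO3 dictionary is a hypothetical hand-written lookup covering only the five countries of this exercise, the column name iso_alpha is chosen for illustration, and one name is misspelled on purpose.

```python
import pandas as pd

# Hypothetical hand-written lookup; a real project would use a complete
# ISO 3166-1 alpha-3 table instead of these five entries.
ISO3 = {
    "India": "IND",
    "United States": "USA",
    "China": "CHN",
    "Brazil": "BRA",
    "Australia": "AUS",
}

df = pd.DataFrame({
    # The last name is misspelled on purpose to show the check at work.
    "Country": ["India", "United States", "China", "Brazil", "Austrlia"],
    "Value": [80, 90, 75, 60, 70],
})

df["iso_alpha"] = df["Country"].map(ISO3)   # unmatched names become NaN
unmatched = df.loc[df["iso_alpha"].isna(), "Country"].tolist()
print("Names that would be missing from the map:", unmatched)

plot_ready = df.dropna(subset=["iso_alpha"])  # rows that can be drawn
```

The plot_ready frame could then be passed to px.choropleth with locations='iso_alpha' and locationmode='ISO-3', which avoids depending on exact country spellings.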

9. What is the purpose of hover_name in px.choropleth?
Asked by: Mindtree
It sets the label that appears when the user hovers over a region on the map.

10. How can you change the map's color theme in Plotly?
Asked by: Accenture
You can change it by modifying the color_continuous_scale parameter using predefined or
custom color scales.

Ex. No. 6
BUILDING CARTOGRAPHIC VISUALIZATION
Date

AIM:
To build the cartographic visualization using multiple datasets from various countries and states
on a world map.

ALGORITHM:
Step 1: Install and import required libraries.
Step 2: Prepare data for cities from different countries and states.
Step 3: Convert city coordinates to GeoDataFrame.
Step 4: Load the world map from an online shapefile.
Step 5: Plot the world map and overlay the city points.
Step 6: Customize with labels, colors, and title.
Step 7: Display the cartographic visualization.

PROGRAM & OUTPUT:


# Step 1: Install geopandas
!pip install geopandas matplotlib shapely descartes

# Step 2: Import required libraries


import pandas as pd
import geopandas as gpd
import shapely.geometry
import matplotlib.pyplot as plt

# Step 3: Sample cities from India, USA, Australia


df = pd.DataFrame({
'city': [ 'Delhi', 'Mumbai', 'New York', 'Los Angeles', 'Sydney', 'Melbourne'],
'country': [ 'India', 'India', 'USA', 'USA', 'Australia', 'Australia'],
'latitude': [ 28.6139, 19.0760, 40.7128, 34.0522, -33.8688, -37.8136],
'longitude': [ 77.2090, 72.8777, -74.0060, -118.2437, 151.2093, 144.9631]})
df

          city    country  latitude  longitude
0        Delhi      India   28.6139    77.2090
1       Mumbai      India   19.0760    72.8777
2     New York        USA   40.7128   -74.0060
3  Los Angeles        USA   34.0522  -118.2437
4       Sydney  Australia  -33.8688   151.2093
5    Melbourne  Australia  -37.8136   144.9631

# Step 4: Convert DataFrame to GeoDataFrame


gdf = gpd.GeoDataFrame(
    df.drop(['latitude', 'longitude'], axis=1),
    crs="EPSG:4326",
    # Point takes (x, y), that is (longitude, latitude)
    geometry=[shapely.geometry.Point(xy) for xy in zip(df.longitude, df.latitude)]
)

# Step 5: Load world map


world =
gpd.read_file("[Link]
[Link]")

# Step 6: Plot the world map


fig, ax = plt.subplots(figsize=(15, 10))
base = world.plot(ax=ax, color='lightgray', edgecolor='black')

# Step 7: Plot the cities


gdf.plot(ax=base, marker='o', color='red', markersize=100)

# Step 8: Add labels to each city


for x, y, label in zip(gdf.geometry.x, gdf.geometry.y, gdf['city']):
    plt.text(x + 1, y, label, fontsize=9, fontweight='bold')

# Step 9: Add title


plt.title("Cartographic Visualization of Major Cities across 3 Countries", fontsize=16)
plt.axis('off')
plt.show()

RESULT:
Thus, the cartographic visualization for cities from multiple datasets involving various countries
and states was successfully built using Python.

VIVA QUESTIONS

1. What is cartographic visualization?


Asked by: Cognizant
Cartographic visualization is the process of representing spatial data on a map to analyze and
interpret geographical patterns.

2. What is GeoPandas used for?


Asked by: TCS
GeoPandas is used to work with geospatial data in Python, enabling spatial joins, plotting, and
conversion between tabular and map formats.

3. Why do we use GeoDataFrame instead of a normal DataFrame?


Asked by: Infosys
A GeoDataFrame allows the inclusion of geometry (like points, lines, or polygons), enabling
mapping and spatial analysis which a regular DataFrame cannot do.

4. What does crs="EPSG:4326" mean in GeoPandas?


Asked by: Capgemini
It sets the coordinate reference system to WGS 84, which is a standard GPS-based coordinate
system for global locations.

5. How do we create point geometries from coordinates?


Asked by: Wipro
By using shapely.geometry.Point() on longitude and latitude pairs, we convert coordinates into
mappable point objects.
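
A common slip is to pass the pair as (latitude, longitude), which places points in the wrong location because x must be the longitude. The sketch below shows one way to guard against that with pandas alone, before any geometry is built. The helper name to_xy is hypothetical, and the range check relies only on the standard limits of EPSG:4326 (latitude within -90..90, longitude within -180..180).

```python
import pandas as pd

cities = pd.DataFrame({
    "city": ["Delhi", "Sydney"],
    "latitude": [28.6139, -33.8688],
    "longitude": [77.2090, 151.2093],
})

def to_xy(frame):
    """Return (x, y) = (longitude, latitude) tuples after a range check."""
    lat_ok = frame["latitude"].between(-90, 90)
    lon_ok = frame["longitude"].between(-180, 180)
    if not (lat_ok & lon_ok).all():
        raise ValueError("coordinate outside the valid EPSG:4326 range")
    # x is the longitude and y is the latitude
    return list(zip(frame["longitude"], frame["latitude"]))

xy = to_xy(cities)
print(xy)
```

Each tuple in xy can be handed to shapely.geometry.Point() as in the program above. The check catches swapped columns only when a longitude exceeds 90 in absolute value, so it is a safety net rather than a guarantee.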

6. What does the read_file() function in GeoPandas do?


Asked by: Accenture
It loads shapefiles or geojson files containing map boundaries, such as country outlines, into a
GeoDataFrame.

7. Why do we use plt.axis('off') in map visualizations?


Asked by: Tech Mahindra
It removes the x and y axes to make the map visually cleaner and more focused on spatial features.

8. What is the difference between a shapefile and a GeoJSON file?


Asked by: HCL
Both store geospatial data, but shapefiles are binary and used in GIS systems, while GeoJSON is
a human-readable JSON format suitable for web applications.

9. How can cities be overlaid on a world map?
Asked by: IBM
Cities are represented as point geometries in a GeoDataFrame and plotted over a world base map
using GeoPandas plotting functions.

10. Can we visualize boundaries of states within a country using GeoPandas?


Asked by: LTI (L&T Infotech)
Yes, by using appropriate shapefiles containing state-level data, we can visualize and customize
boundaries within a country.

Ex. No. 7
PERFORMING EDA ON WINE QUALITY DATA SET
Date

AIM:
To perform Exploratory Data Analysis (EDA) on the Wine Quality dataset using Python libraries.

ALGORITHM:
Step 1: Load libraries like pandas, numpy, matplotlib.pyplot, and seaborn.
Step 2: Read the [Link] dataset using pandas.read_csv().
Step 3: Show the first few rows using head().
Step 4: Use info() and describe() for structure and summary.
Step 5: Check for missing values using isnull().sum().
Step 6: Use value_counts() to see how many wines fall into each quality category.
Step 7: Use df.hist() to visualize the distribution of all numerical features.
Step 8: Plot boxplots using sns.boxplot() to detect outliers in features.
Step 9: Use sns.heatmap() to show correlation between features.
Step 10: Create a boxplot of alcohol against quality to study how alcohol affects wine quality.

PROGRAM & OUTPUT:


# Step 1: Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Step 2: Load the dataset


#Link:
#[Link]
df = pd.read_csv("[Link]")

# Step 3: Display data information


print("First 5 rows:")
print(df.head())

First 5 rows:
fixed acidity volatile acidity citric acid residual sugar chlorides
\
0 7.4 0.70 0.00 1.9 0.076
1 7.8 0.88 0.00 2.6 0.098
2 7.8 0.76 0.04 2.3 0.092
3 11.2 0.28 0.56 1.9 0.075
4 7.4 0.70 0.00 1.9 0.076

free sulfur dioxide total sulfur dioxide density pH sulphates \


0 11.0 34.0 0.9978 3.51 0.56
1 25.0 67.0 0.9968 3.20 0.68
2 15.0 54.0 0.9970 3.26 0.65
3 17.0 60.0 0.9980 3.16 0.58
4 11.0 34.0 0.9978 3.51 0.56

alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5

print("\nDataset Info:")
print(df.info())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 1599 non-null float64
1 volatile acidity 1599 non-null float64
2 citric acid 1599 non-null float64
3 residual sugar 1599 non-null float64
4 chlorides 1599 non-null float64
5 free sulfur dioxide 1599 non-null float64
6 total sulfur dioxide 1599 non-null float64
7 density 1599 non-null float64
8 pH 1599 non-null float64
9 sulphates 1599 non-null float64
10 alcohol 1599 non-null float64
11 quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
None

print("\nSummary Statistics:")
print(df.describe())

Summary Statistics:
fixed acidity volatile acidity citric acid residual sugar \
count 1599.000000 1599.000000 1599.000000 1599.000000

mean 8.319637 0.527821 0.270976 2.538806
std 1.741096 0.179060 0.194801 1.409928
min 4.600000 0.120000 0.000000 0.900000
25% 7.100000 0.390000 0.090000 1.900000
50% 7.900000 0.520000 0.260000 2.200000
75% 9.200000 0.640000 0.420000 2.600000
max 15.900000 1.580000 1.000000 15.500000

chlorides free sulfur dioxide total sulfur dioxide density \


count 1599.000000 1599.000000 1599.000000 1599.000000
mean 0.087467 15.874922 46.467792 0.996747
std 0.047065 10.460157 32.895324 0.001887
min 0.012000 1.000000 6.000000 0.990070
25% 0.070000 7.000000 22.000000 0.995600
50% 0.079000 14.000000 38.000000 0.996750
75% 0.090000 21.000000 62.000000 0.997835
max 0.611000 72.000000 289.000000 1.003690

pH sulphates alcohol quality


count 1599.000000 1599.000000 1599.000000 1599.000000
mean 3.311113 0.658149 10.422983 5.636023
std 0.154386 0.169507 1.065668 0.807569
min 2.740000 0.330000 8.400000 3.000000
25% 3.210000 0.550000 9.500000 5.000000
50% 3.310000 0.620000 10.200000 6.000000
75% 3.400000 0.730000 11.100000 6.000000
max 4.010000 2.000000 14.900000 8.000000

# Step 4: Check shape and null values


print("Shape of dataset:", df.shape)

Shape of dataset: (1599, 12)

print("\nColumn Names:", df.columns)

Column Names: Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

print("\nData Types:\n", df.dtypes)

Data Types:
fixed acidity float64
volatile acidity float64
citric acid float64
residual sugar float64
chlorides float64

free sulfur dioxide float64
total sulfur dioxide float64
density float64
pH float64
sulphates float64
alcohol float64
quality int64
dtype: object

print("\nMissing Values:\n", df.isnull().sum())


Missing Values:
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
# Step 5: Quality value counts
print("\nWine Quality Distribution:\n", df['quality'].value_counts())

Wine Quality Distribution:


quality
5 681
6 638
7 199
4 53
8 18
3 10
Name: count, dtype: int64

# Step 6: Plot histogram for all features


df.hist(bins=15, figsize=(15, 10), color='skyblue')
plt.suptitle("Feature Distributions", fontsize=16)
plt.tight_layout()
plt.show()

# Step 7: Boxplot of features
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, orient="h")
plt.title("Boxplots of Features")
plt.show()

# Step 8: Correlation heatmap


plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm",
            linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()

# Step 9: Quality-wise alcohol distribution


plt.figure(figsize=(8, 5))
sns.boxplot(x='quality', y='alcohol', data=df)
plt.title("Alcohol Content vs Wine Quality")
plt.show()

RESULT:
Thus, the EDA was successfully performed on the Wine Quality dataset and various plots and
summary statistics were generated.

VIVA QUESTIONS

1. What is the purpose of performing EDA on a dataset?


Asked by: Infosys
EDA helps understand the underlying patterns, detect anomalies, and summarize the main
characteristics of the data before applying machine learning.

2. Why do we use describe() in pandas during EDA?


Asked by: TCS
The describe() function gives statistical summaries like mean, standard deviation, and quartiles
for each numerical column.

3. What kind of data is found in the wine quality dataset?


Asked by: Wipro
The dataset includes physicochemical features of red wine like acidity, pH, alcohol, and a quality
score between 0 and 10.

4. How do histograms help in EDA?


Asked by: Cognizant
Histograms show the distribution of numerical features, helping detect skewness or unusual data
concentrations.

5. Why is a heatmap used in correlation analysis?


Asked by: Capgemini
A heatmap visually displays the correlation between variables, allowing easy identification of
highly related features.

6. What does the boxplot of alcohol vs quality show?


Asked by: IBM
It shows how alcohol content varies across different wine quality levels and helps detect if higher
alcohol indicates better quality.

7. What is the meaning of df.isnull().sum() in the program?


Asked by: Tech Mahindra
It checks each column for missing values and returns the total count of nulls in each column.

8. What can be inferred from the value counts of wine quality?


Asked by: HCL
Most wines are rated 5 or 6, indicating a class imbalance which should be considered in
modeling.
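
The imbalance can be quantified directly from the value counts. The sketch below reuses the six counts printed in the "Wine Quality Distribution" output above; the variable names are chosen for illustration only.

```python
import pandas as pd

# Counts copied from the "Wine Quality Distribution" output above.
counts = pd.Series({5: 681, 6: 638, 7: 199, 4: 53, 8: 18, 3: 10}, name="count")

shares = (counts / counts.sum()).round(3)            # fraction of wines per score
middle_share = counts.loc[[5, 6]].sum() / counts.sum()

print(shares)
print(f"Share of wines rated 5 or 6: {middle_share:.1%}")
```

On the full DataFrame, df['quality'].value_counts(normalize=True) gives the same fractions without the manual division.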

9. Why are seaborn and matplotlib used together?
Asked by: Accenture
Seaborn offers higher-level visualization tools built on matplotlib, making plots more attractive
and customizable.

10. Which features in this dataset are most correlated with wine quality?
Asked by: L&T Infotech
Based on the heatmap, alcohol has the largest positive correlation with quality and volatile
acidity the largest negative one; both are moderate rather than strong.
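
Reading a ranking off a heatmap is easier when the correlations with the target are sorted by absolute value. The sketch below shows the idea on synthetic data: the column names are borrowed from the dataset, but the values are randomly generated under assumed coefficients, so the printed numbers say nothing about real wines.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in data: names come from the dataset, values do not.
alcohol = rng.normal(10.4, 1.0, n)
volatile_acidity = rng.normal(0.53, 0.18, n)
ph = rng.normal(3.3, 0.15, n)
# Assumed relationship, chosen only so the example has a known structure.
quality = 0.5 * alcohol - 2.0 * volatile_acidity + rng.normal(0, 0.5, n)

df = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    "pH": ph,
    "quality": quality,
})

corr_with_quality = df.corr()["quality"].drop("quality")
order = corr_with_quality.abs().sort_values(ascending=False).index
ranked = corr_with_quality.reindex(order)   # strongest first, sign preserved
print(ranked.round(2))
```

Sorting by absolute value keeps strongly negative features near the top, which a plain descending sort would push to the bottom.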

Ex. No. 8
TIME SERIES ANALYSIS USING VARIOUS VISUALIZATION TECHNIQUES
Date

AIM:
To perform a simple time series analysis on a dataset using line plots and autocorrelation to observe
trends and seasonality.

ALGORITHM:
Step 1: Import required libraries.
Step 2: Load the dataset and parse the "Month" column as a date.
Step 3: Set the "Month" column as index for time-based plotting.
Step 4: Visualize the number of air passengers over time using a line plot.
Step 5: Use autocorrelation to check seasonality or repeating patterns.

PROGRAM & OUTPUT:


# Step 1: Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

# Step 2: Load the dataset and parse dates


# Link:
#[Link]
df = pd.read_csv("[Link]", parse_dates=['Month'], index_col='Month')
df.head()

            #Passengers
Month
1949-01-01          112
1949-02-01          118
1949-03-01          132
1949-04-01          129
1949-05-01          121

# Step 3: Plotting function


def plot_df(df, x, y, title="", xlabel='Date', ylabel='Passengers', dpi=100):
    plt.figure(figsize=(16, 5), dpi=dpi)
    plt.plot(x, y, color='tab:blue')
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.grid()
    plt.show()

# Step 4: Line plot to show trend over time
plot_df(df, x=df.index, y=df['#Passengers'], title='Monthly Air Passengers (1949 - 1960)')

# Step 5: Autocorrelation plot to see repeating patterns


autocorrelation_plot(df['#Passengers'])
plt.title("Autocorrelation of Air Passenger Data")
plt.grid()
plt.show()

RESULT:
Thus, the Air Passengers dataset was successfully visualized using a line plot, and the
autocorrelation plot revealed seasonality in air travel patterns.

VIVA QUESTIONS

1. What is a time series?


Asked by: TCS
A time series is a sequence of data points collected or recorded at regular time intervals, such as
monthly or yearly.

2. Why do we parse the 'Month' column as a date in time series analysis?


Asked by: Infosys
Parsing as date enables proper time-based indexing and visualization, allowing pandas to
recognize and handle it as a datetime object.

3. What does an autocorrelation plot show?


Asked by: Cognizant
It shows how the current value of the series is related to its past values, helping to detect seasonality
and patterns.

4. Why do we set 'Month' as the index of the DataFrame?


Asked by: Capgemini
Setting 'Month' as the index allows easy plotting and resampling operations based on time.
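
A short sketch of what the datetime index makes possible, using only the five rows shown in the output above (the month strings are written without the day for brevity; the variable names are illustrative):

```python
import pandas as pd

# First five rows of the output shown above.
df = pd.DataFrame({
    "Month": ["1949-01", "1949-02", "1949-03", "1949-04", "1949-05"],
    "#Passengers": [112, 118, 132, 129, 121],
})
df["Month"] = pd.to_datetime(df["Month"])   # strings become datetime64 values
df = df.set_index("Month")

first_quarter = df.loc["1949-01":"1949-03"]                    # slice by date label
yearly_total = df.groupby(df.index.year)["#Passengers"].sum()  # aggregate by year

print(first_quarter)
print(yearly_total)
```

With a plain string column, neither the date-range slice nor the grouping by calendar year would work without extra parsing.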

5. What trend is observed in the air passenger data?


Asked by: Wipro
There is an increasing trend in the number of passengers over the years from 1949 to 1960.

6. How is seasonality identified in time series data?


Asked by: IBM
Seasonality can be identified through autocorrelation plots and visual inspection of repeated cycles
in the line plot.

7. Why do we use line plots in time series?


Asked by: HCL
Line plots help visualize trends, fluctuations, and seasonality over time.

8. Which function is used to create the autocorrelation plot in this program?


Asked by: Accenture
The function used is autocorrelation_plot() from the pandas.plotting module.

9. What is the purpose of dpi in the plot function?


Asked by: L&T Infotech
DPI (dots per inch) controls the resolution of the plot, making it clearer when displayed or printed.

10. What does a strong autocorrelation at lag 12 indicate in this dataset?
Asked by: Tech Mahindra
It indicates a yearly seasonal pattern, meaning passenger numbers tend to repeat annually.
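
The lag-12 effect can be checked numerically with Series.autocorr(). The sketch below uses a synthetic monthly series, not the air passenger data: a sine wave with a 12-month period plus random noise, with the amplitude, noise level, and start date all chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
months = pd.date_range("2000-01-01", periods=144, freq="MS")
t = np.arange(144)
# Synthetic series: a 12-month cycle plus noise (no trend component).
values = 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 144)
series = pd.Series(values, index=months)

r12 = series.autocorr(lag=12)   # compare each month with the same month a year earlier
r6 = series.autocorr(lag=6)     # compare with the opposite phase of the cycle
print(f"lag 12: {r12:.2f}, lag 6: {r6:.2f}")
```

The lag-12 value is strongly positive and the lag-6 value strongly negative, which is the signature of a yearly cycle. In a series with a rising trend, such as the passenger counts, the trend lifts all short-lag correlations, so the seasonal peak appears as a bump at lag 12 rather than as a sign change.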

Ex. No. 9
VISUALIZING VARIOUS EDA TECHNIQUES AS CASE STUDY FOR IRIS DATASET
Date

AIM:
To apply various Exploratory Data Analysis (EDA) and Visualization techniques on the Iris
dataset to understand the patterns and relationships between features.

ALGORITHM:
Step 1: Import required libraries.
Step 2: Load the Iris dataset from a CSV file using pandas.
Step 3: Display basic dataset info, statistics, and check for missing values.
Step 4: Visualize distributions using histograms and box plots.
Step 5: Use pairplot and heatmap to study relationships and correlations.
Step 6: Create a violin plot to compare feature distributions across species.

PROGRAM & OUTPUT:


# Step 1: Import required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Step 2: Load the dataset


# Link:
#[Link]
df = pd.read_csv("[Link]")

# Step 3: Basic dataset information


print("First 5 rows:")
print(df.head())

First 5 rows:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa

print("\nDataset Info:")
print(df.info())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
None

print("\nSummary Statistics:")
print(df.describe())

Summary Statistics:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
count 150.000000 150.000000 150.000000 150.000000 150.000000
mean 75.500000 5.843333 3.054000 3.758667 1.198667
std 43.445368 0.828066 0.433594 1.764420 0.763161
min 1.000000 4.300000 2.000000 1.000000 0.100000
25% 38.250000 5.100000 2.800000 1.600000 0.300000
50% 75.500000 5.800000 3.000000 4.350000 1.300000
75% 112.750000 6.400000 3.300000 5.100000 1.800000
max 150.000000 7.900000 4.400000 6.900000 2.500000

# Step 4: Histograms
df.hist(figsize=(10, 8), color='skyblue')
plt.suptitle("Histograms of Iris Features")
plt.tight_layout()
plt.show()

# Step 5: Box plots
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, orient='h')
plt.title("Boxplot of All Features")
plt.show()

# Step 6: Pair plot to show relationships


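# Color by the categorical Species column; Id is dropped because it is only a row label
sns.pairplot(df.drop(columns='Id'), hue='Species', diag_kind='kde')
plt.suptitle("Pair Plot of Iris Features", y=1.02)
plt.show()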

# Step 7: Heatmap for correlation (numeric columns only)


plt.figure(figsize=(8, 6))
numeric_df = df.select_dtypes(include='number') # filter only numeric columns
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm")
plt.title("Feature Correlation Heatmap")
plt.show()

# Step 8: Violin plot for feature vs species
plt.figure(figsize=(10, 6))
# Species goes on the x-axis so that each violin shows one species
sns.violinplot(x='Species', y='PetalLengthCm', data=df)
plt.title("Petal Length by Species (Violin Plot)")
plt.show()

RESULT:
Thus, the Iris dataset was successfully analyzed using various EDA techniques.

VIVA QUESTIONS

1. Why is the 'Id' column often dropped during EDA?


Asked by: TCS
The 'Id' column is not a feature; it holds no predictive or analytical value and is usually removed
before analysis.

2. What kind of insights can we derive from a boxplot?


Asked by: Wipro
A boxplot shows the spread, median, and outliers of data, helping to detect skewness or anomalies.
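
The points a boxplot draws as outliers are, by default in matplotlib and seaborn, those lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The arithmetic can be sketched with the SepalWidthCm quartiles and extremes from the summary statistics above; the variable names are illustrative.

```python
# Quartiles and extremes of SepalWidthCm, copied from the summary statistics.
q1, q3 = 2.8, 3.3
col_min, col_max = 2.0, 4.4

iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr     # values below this are drawn as outliers
upper_fence = q3 + 1.5 * iqr     # values above this are drawn as outliers

print(f"IQR = {iqr:.2f}, fences = [{lower_fence:.2f}, {upper_fence:.2f}]")
print("values below the lower fence exist:", col_min < lower_fence)
print("values above the upper fence exist:", col_max > upper_fence)
```

Since the minimum (2.0) and maximum (4.4) both fall outside the fences of about 2.05 and 4.05, the SepalWidthCm box should show outlier points on both sides.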

3. What does diag_kind='kde' mean in a pairplot?


Asked by: Capgemini
It means the diagonal plots will be kernel density estimates instead of histograms, giving a
smoother distribution curve.

4. How can feature correlation help in dimensionality reduction?


Asked by: HCL
Highly correlated features may be redundant and can be removed or combined to reduce
dimensionality.
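
One simple way to act on this is to flag every column that correlates very strongly with a column listed before it. The sketch below is a minimal illustration on synthetic data, where column "b" is deliberately built as a near copy of "a"; the 0.9 threshold is an assumed cut-off, not a fixed rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic measurements: "b" is nearly a multiple of "a", "c" is independent.
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
threshold = 0.9   # illustrative cut-off
redundant = [col for col in upper.columns if (upper[col] > threshold).any()]
print("Columns that duplicate an earlier column:", redundant)
```

Dropping the flagged columns with df.drop(columns=redundant) keeps one representative of each correlated group. Which member to keep is a judgment call that this sketch settles by column order alone.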

5. Why is Species not included in a correlation heatmap?


Asked by: Zoho
Because Species is a categorical column, and correlation requires numeric values only.

6. How does a violin plot differ from a boxplot?


Asked by: Infosys
A violin plot includes both the boxplot and a rotated KDE plot, showing the full distribution more
clearly.

7. What can we infer if SepalLength and PetalLength have a high correlation?


Asked by: CTS
It suggests that as SepalLength increases, PetalLength also tends to increase, showing a linear
relationship.

8. What are the benefits of using seaborn for visualization?


Asked by: Accenture
Seaborn offers high-level functions, built-in themes, and simpler syntax for complex plots like
pairplots and violin plots.

9. How can we confirm class imbalance in the Iris dataset?
Asked by: Cognizant
Use df['Species'].value_counts() to check the number of records per class and verify if they are
balanced.

10. What does a symmetrical violin plot indicate?


Asked by: Mindtree
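A violin is always mirrored about its central axis, so that symmetry carries no information. If
the shape is also symmetric above and below the median, the distribution has little skew; this
alone does not show that it is normal.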

Ex. No. 10
CBS: ADVANCED DATA ANALYSIS AND VISUALIZATION ON A DATASET
Date

AIM:
To perform advanced data analysis and visualization techniques on the Titanic dataset using
Python libraries.

ALGORITHM:
Step 1: Import libraries: Import pandas, numpy, matplotlib, and seaborn.
Step 2: Load dataset: Read the Titanic dataset using pd.read_csv().
Step 3: Explore data: Display head, info, and describe to understand the dataset.
Step 4: Handle missing data: Fill missing values in Age, Embarked, and Cabin columns.
Step 5: Create new features: Add FamilySize and categorize age groups.
Step 6: Visualize data: Plot countplots, boxplots, histograms, pairplots, and heatmap.
Step 7: Display plots: Use plt.show() to display all the generated visualizations.

PROGRAM & OUTPUT:


# Step 1: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
plt.rcParams.update({'figure.figsize': (10, 6)})

# Step 2: Load dataset
#Link:
#[Link]
df = pd.read_csv("[Link]")

# Step 3: Display Basic Info


print("First 5 rows:\n", df.head())
First 5 rows:
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex Age SibSp
\
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S

print("\nDataset Info:")
df.info()
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

print("\nMissing values:\n", df.isnull().sum())


Missing values:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

# Step 4: Data Cleaning


df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Cabin'] = df['Cabin'].fillna("Unknown")

# Step 5: Feature Engineering


df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 12, 18, 35, 60, 100],
                        labels=["Child", "Teen", "YoungAdult", "Adult", "Senior"])

# Step 6: EDA Visualizations


# 1. Survival Count
sns.countplot(x='Survived', data=df)
plt.title("Survival Count (0 = No, 1 = Yes)")
plt.show()

# 2. Survival by Sex
sns.countplot(x='Sex', hue='Survived', data=df)
plt.title("Survival by Gender")
plt.show()

# 3. Age Distribution by Survival
sns.histplot(data=df, x='Age', hue='Survived', kde=True, bins=30)
plt.title("Age Distribution by Survival")
plt.show()

# 4. Boxplot of Fare by Survival


sns.boxplot(x='Survived', y='Fare', data=df)
plt.title("Fare Paid by Survival Status")
plt.show()

# 5. Family Size vs Survival
sns.countplot(x='FamilySize', hue='Survived', data=df)
plt.title("Family Size and Survival")
plt.show()

# 6. Heatmap of Correlation
plt.figure(figsize=(10, 8))
sns.heatmap(df[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'FamilySize']].corr(),
            annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()

# 7. Pair Plot of Numerical Features


sns.pairplot(df[['Survived', 'Age', 'Fare', 'FamilySize']], hue='Survived')
plt.suptitle("Pairplot of Features vs Survival", y=1.02)
plt.show()

RESULT:
Thus, advanced data analysis and visualization on the Titanic dataset is successfully performed.

VIVA QUESTIONS

1. Why is the median used to fill missing Age values?


Asked by: TCS
The median is less affected by extreme values and skewed data, making it more robust for filling
missing Age values.
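
The difference between the two statistics is easy to see on a small series. The ages below are illustrative values chosen for this sketch (four typical ages, one extreme age, one missing entry), not rows taken from the dataset.

```python
import pandas as pd

# Illustrative ages: four typical values, one extreme value, one missing entry.
age = pd.Series([22, 38, 26, 35, 80, None], name="Age")

print("mean:", age.mean())       # pulled upward by the extreme value
print("median:", age.median())   # middle of the sorted observed values

filled = age.fillna(age.median())   # same call pattern as in Step 4
print(filled.tolist())
```

Here the single extreme age moves the mean to about 40 while the median stays at 35, so the median is the less distorted fill value.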

2. What is the purpose of creating a FamilySize feature?


Asked by: Infosys
It captures the total number of family members onboard, which may affect survival chances and
help the model detect group survival patterns.

3. How does the pd.cut() function work when creating AgeGroup?


Asked by: Zoho
pd.cut() bins continuous data into categorical intervals, like age ranges, to convert numerical data
into grouped labels like Child, Teen, etc.
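
A minimal sketch of the binning, using the same bins and labels as Step 5 on a handful of illustrative ages (the ages themselves are made up for this example):

```python
import pandas as pd

ages = pd.Series([5, 12, 15, 22, 38, 70])   # illustrative ages
groups = pd.cut(
    ages,
    bins=[0, 12, 18, 35, 60, 100],
    labels=["Child", "Teen", "YoungAdult", "Adult", "Senior"],
)
print(groups.tolist())
```

By default each interval is open on the left and closed on the right, so an age of exactly 12 falls in "Child" and an age of exactly 0 would fall outside every bin and become NaN.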

4. What does the correlation heatmap reveal in the Titanic dataset?


Asked by: Wipro
It shows the strength and direction of relationships between numeric features, e.g., Fare and
Survived may show a positive correlation.

5. Why are countplots preferred for categorical features like Sex and Survived?
Asked by: Capgemini
Countplots visually show the frequency distribution of categories and how they relate to a target
variable.

6. What insight does the boxplot of Fare by Survived give?


Asked by: HCL
It shows the fare distribution for survivors vs. non-survivors, indicating that passengers who paid
higher fares had better survival chances.

7. Why do we use kde=True in a histogram for Age?


Asked by: CTS
It overlays a Kernel Density Estimate curve, offering a smooth view of the data distribution beyond
the bars of the histogram.

8. What does a pairplot help visualize in EDA?


Asked by: IBM
Pairplots help visualize pairwise relationships between numerical variables, along with class-based
(e.g., Survived) separations.

9. Why is Cabin filled with “Unknown” instead of a statistical value?
Asked by: Mindtree
Because Cabin contains mostly missing values and is a categorical feature, "Unknown" is used as
a placeholder instead of an arbitrary guess.

10. What is the use of plt.rcParams.update({'figure.figsize': (10, 6)})?


Asked by: Accenture
It sets the default figure size for all plots, ensuring better visibility and consistent plot dimensions
throughout the analysis.
