Data Exploration & Visualization Manual
MANUAL
SUBJECT CODE : AD3301
SUBJECT NAME : DATA EXPLORATION AND VISUALIZATION
REGULATION : 2021
INSTITUTION VISION
Emerge as a Premier Institute, producing globally competent engineers.
INSTITUTION MISSION
IM1 : Achieve Academic diligence through effective and innovative teaching-
learning processes, using ICT Tools.
IM2 : Make students employable through rigorous career guidance and training
programs.
IM3 : Strengthen Industry Institute Interaction through MOUs and Collaborations.
IM4 : Promote Research & Development by inculcating creative thinking through
innovative projects incubation.
DEPARTMENT VISION
To revolutionize the quality of AI technology by creating a state-of-the-art environment where
innovation, sustainability, and social impact converge.
DEPARTMENT MISSION
DM1 : To develop and implement AI solutions that prioritize human values, ethics and social
responsibilities.
DM2 : To provide cutting-edge AI research, driving innovation and advancing the state-of-
the-art in AI technology.
DM3 : To provide industrial standards by means of collaborations for artificial intelligence
and data science.
DM4 : To provide an excellent infrastructure that keeps up with modern trends and
technologies for professional entrepreneurship.
PROGRAM OUTCOMES (POs)
PO1: Engineering Knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals and an engineering specialization to the solution of complex engineering
problems.
PO2: Problem Analysis: Identify, formulate, review research literature and analyze complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences and engineering sciences.
PO5: Modern Tool Usage: Create, select and apply appropriate techniques, resources and
modern engineering and IT tools including prediction and modeling to complex
engineering activities with an understanding of the limitations.
PO6: The Engineer and Society: Apply reasoning informed by the contextual knowledge
to assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.
PO9: Individual and Team Work: Function effectively as an individual and as a member
or leader in diverse teams and in multidisciplinary settings.
PO11: Project Management and Finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a
member and leader in a team, to manage projects and in multidisciplinary
environments.
PO12: Life-Long Learning: Recognize the need for and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological
change.
PROGRAM SPECIFIC OBJECTIVES (PSOs)
PSO1 : Create, select and apply the knowledge of AI and Data Science to solve societal problems.
PEO3 : Higher Studies: To enable the students to think logically and pursue life-long learning and collaborate with an ethical attitude in a multidisciplinary team.
GENERAL INSTRUCTIONS TO THE STUDENTS
1. No food or drink is to be brought into the lab, including gum and candy.
2. No cell phones or other electronic devices at the lab stations (iPods, MP3 players, etc.).
5. Students must inspect their position for possible damage and report immediately to the faculty
any damage that may be found.
6. Students must follow the faculty’s instructions explicitly concerning all uses of the lab
equipment.
7. Leave your shoes in the shoe rack before entering the lab.
8. Shut down the computer properly before leaving the lab.
10. Arrange your chairs properly before leaving the lab.
AD3301 - DATA EXPLORATION AND VISUALIZATION
L T P C
3 0 2 4
OBJECTIVES:
• To outline an overview of exploratory data analysis.
• To implement data visualization using Matplotlib.
• To perform univariate data exploration and analysis.
• To apply bivariate data exploration and analysis.
• To use Data exploration and visualization techniques for multivariate and time series
data.
COURSE OUTCOMES:
At the end of this course, the students will be able to:
CO205.1 Analyze the difference between Exploratory Data Analysis (EDA) and
classical/Bayesian analysis to effectively interpret and manipulate data.
CO205.2 Create customized and advanced visualizations using Matplotlib and Seaborn to
effectively communicate data insights.
CO205.3 Evaluate the distribution and variability of single-variable data using numerical
summaries and scaling techniques.
CO205.4 Analyze the relationships between two variables using contingency tables and
scatterplots to interpret bivariate data.
CO205.5 Apply the fundamentals of time series analysis (TSA) to clean, index, and visualize
time-based data.
LIST OF EXPERIMENTS:
1. Install the data Analysis and Visualization tool: R/ Python /Tableau Public/ Power BI.
2. Perform exploratory data analysis (EDA) on datasets such as an email dataset. Export all your
emails as a dataset, import them into a pandas DataFrame, visualize them and derive
insights from the data.
3. Working with Numpy arrays, Pandas data frames, Basic plots using Matplotlib.
4. Explore various variable and row filters in R for cleaning data. Apply various plot features in R
on sample data sets and visualize.
5. Perform Time Series Analysis and apply the various visualization techniques.
6. Perform Data Analysis and representation on a Map using various Map data sets with Mouse
Rollover effect, user interaction, etc.
7. Build cartographic visualization for multiple datasets involving various countries of the world;
states and districts in India etc.
8. Perform EDA on Wine Quality Data Set.
9. Use a case study on a data set and apply the various EDA and visualization techniques and
present an analysis report.
CO – PO & PSO MAPPING
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2
CO205.1 3 2 1 1 3 - - - 2 2 - 2 2 3
CO205.2 3 3 3 2 3 - - - 2 2 - 2 3 3
CO205.3 3 3 2 2 3 - - - 2 2 - 2 3 3
CO205.4 3 3 3 2 3 - - - 2 2 - 2 3 3
CO205.5 3 3 3 3 3 - - - 2 2 - 3 3 3
INDEX WITH MAPPING OF CO
2. EXPLORATORY DATA ANALYSIS (EDA) ON DATASETS. CO1
4. EXPLORING VARIOUS VARIABLE AND ROW FILTERS IN R FOR CLEANING DATA. CO2
6. BUILDING CARTOGRAPHIC VISUALIZATION. CO2
Ex. No. 1 INSTALLATION OF THE DATA ANALYSIS AND VISUALIZATION TOOL
Date
AIM:
To install data analysis and visualization tools such as R/ Python /Tableau Public/ Power BI.
PROCEDURE:
R:
R is a programming language and software environment specifically designed for statistical
computing and graphics.
Windows:
• Download R from the official website: [Link]
• Run the installer and follow the installation instructions.
macOS:
• Download R for macOS from the same official website.
• Open the .pkg file and follow the installation instructions
Linux:
• You can install R using your distribution's package manager.
For Ubuntu/Debian:
sudo apt-get install r-base
Python:
Python is a versatile programming language widely used for data analysis and visualization.
Windows:
• Download Python from: [Link]
• Run the installer and check the "Add Python to PATH" option during installation.
• Install data analysis libraries (e.g., NumPy, pandas, matplotlib) using pip:
pip install numpy pandas matplotlib seaborn
macOS:
• Older macOS releases shipped with Python 2.x, but current releases no longer include it
(Python 2.7 was removed in macOS 12.3). Install the latest Python version from the official
site or use Homebrew:
brew install python
• Use pip or a virtual environment (venv) or package manager like conda to install libraries.
Linux:
• Most Linux distributions come with Python pre-installed.
• Install or update Python using your package manager (example for Ubuntu):
sudo apt-get install python3 python3-pip
• Use pip or conda to manage data analysis libraries.
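After installation, it helps to confirm that the interpreter can actually find the libraries. The following is a minimal sketch using only the Python standard library; the helper name check_libraries is hypothetical, and the package list simply mirrors the pip command shown for Windows.

```python
import importlib.util
from importlib import metadata

# Packages named in the pip install command of this experiment
PACKAGES = ["numpy", "pandas", "matplotlib", "seaborn"]


def check_libraries(packages):
    """Return {package: version string, or None if it is not installed}."""
    status = {}
    for name in packages:
        if importlib.util.find_spec(name) is None:
            status[name] = None          # package cannot be imported
            continue
        try:
            status[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            status[name] = "unknown"     # importable, but no version metadata
    return status


result = check_libraries(PACKAGES)
for name, version in result.items():
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can then be added with pip or conda as described above.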
Tableau Public:
Tableau Public is a free version of Tableau for creating and sharing interactive visualizations.
• Go to: [Link]
• Download and install Tableau Public by following the on-screen instructions.
Power BI:
Power BI is a business analytics tool by Microsoft used for creating dashboards and reports.
• Visit: [Link]
• Download Power BI Desktop and follow the installation instructions.
Anaconda
Anaconda is a free, open-source distribution of Python and R for data science and machine
learning. It comes with popular libraries like NumPy, pandas, matplotlib, scikit-learn, and Jupyter
Notebook.
Windows/macOS:
• Go to the official Anaconda website:
[Link]
• Click “Download” based on your operating system (Windows or macOS) and the
appropriate Python version (usually Python 3.x).
• Run the Installer:
o For Windows: Double-click the .exe file.
o For macOS: Open the .pkg file.
• Accept the license agreement.
• Choose installation for "Just Me" (recommended).
• Select the destination folder.
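• Optionally check “Add Anaconda to my PATH environment variable” (Windows installer only).
The installer marks this option as not recommended; leaving it unchecked and running Python
from Anaconda Prompt avoids conflicts with other Python installations.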
• Verify Installation: Open Anaconda Navigator or launch Anaconda Prompt and type:
conda list
Linux:
• Download the .sh installer for Linux from the same website.
• Open Terminal and run:
bash Anaconda3-<version>-Linux-x86_64.sh
• Follow the prompts, agree to the license, and confirm installation path.
• After installation:
source ~/.bashrc
conda list
RESULT:
Thus, the data analysis and visualization tools have been installed successfully.
VIVA QUESTIONS
8. How can you verify if Anaconda is installed properly?
Asked by: Capgemini
Open the Anaconda Prompt or terminal and type conda list. If it displays a list of packages, then
Anaconda is installed successfully.
Ex. No. 2 EXPLORATORY DATA ANALYSIS (EDA) ON DATASETS
Date
AIM:
To perform exploratory data analysis (EDA) on datasets such as an email dataset.
ALGORITHM:
Step 1: Import the necessary Python libraries.
Step 2: Load the email dataset into a Pandas DataFrame.
Step 3: Display the basic structure and content of the dataset.
Step 4: Check for and handle any missing or null values.
Step 5: Perform descriptive statistics.
Step 6: Visualize the distribution of categories using a pie chart.
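The steps above can be sketched on a small stand-in table. This is an illustration under stated assumptions, not the manual's own program: the six rows are hypothetical, and only the column names (label, text, label_num) follow the dataset summary printed in this experiment.

```python
import pandas as pd

# Hypothetical stand-in for the email dataset (same column names as the real file)
df = pd.DataFrame({
    "label": ["ham", "ham", "spam", "ham", "spam", "ham"],
    "text": ["Subject: meeting", "Subject: invoice", "Subject: win cash",
             "Subject: lunch", None, "Subject: report"],
    "label_num": [0, 0, 1, 0, 1, 0],
})

# Step 3: structure and first rows
print(df.shape)
print(df.head())

# Step 4: count missing values per column, then drop incomplete rows
missing = df.isnull().sum()
print(missing)
clean = df.dropna().reset_index(drop=True)

# Step 5: descriptive statistics (include="all" also covers text columns)
print(clean.describe(include="all"))

# Step 6: class counts and percentages, i.e. the numbers a pie chart displays
counts = clean["label"].value_counts()
percent = (counts / counts.sum() * 100).round(1)
print(percent)
```

The percentages printed in the last step are what the pie chart encodes as slice sizes, so checking them first makes the chart easier to interpret.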
# Step 3: Basic data info
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 5171 non-null int64
1 label 5171 non-null object
2 text 5171 non-null object
3 label_num 5171 non-null int64
dtypes: int64(2), object(2)
memory usage: 161.7+ KB
None
print(df.head())
label_num
0 0
1 0
2 0
3 1
4 0
Unnamed: 0 0
label 0
text 0
label_num 0
dtype: int64
fig.update_traces(hoverinfo='label+percent', textinfo='label+value+percent',
                  textfont_size=15, marker=dict(line=dict(color='green', width=2)))
fig.show()
RESULT:
Thus, exploratory data analysis (EDA) on an email dataset has been performed
successfully.
VIVA QUESTIONS
9. What does df.isnull().sum() do?
Asked by: TCS
It checks for missing or null values in the dataset and returns the total count for each column.
Ex. No. 3 WORKING WITH NUMPY ARRAYS, PANDAS DATA FRAMES, BASIC PLOTS USING MATPLOTLIB
Date
AIM:
To perform basic operations using NumPy arrays, manipulate data using Pandas DataFrames, and
create basic plots using Matplotlib.
ALGORITHM:
A. NumPy Arrays:
Step 1: Import the NumPy library.
Step 2: Create 1D and 2D arrays using np.array and related array-creation functions.
Step 3: Perform arithmetic operations, statistical analysis, reshaping, and broadcasting.
Step 4: Apply functions like mean, median, std, linspace, dot, and matrix multiplication.
B. Pandas DataFrames:
Step 1: Import the Pandas library.
Step 2: Create a DataFrame from a dictionary and a CSV file.
Step 3: Perform data selection, slicing, filtering, sorting, grouping, and aggregation.
Step 4: Add new columns, drop columns, handle missing values, and apply lambda functions.
# Import NumPy library
import numpy as np
# Create arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2], [3, 4]])
print("Array 1:", arr1)
print("Array 2:", arr2)
Array 1: [1 2 3 4 5]
Array 2: [[1 2]
[3 4]]
# Basic operations
print("Mean:", np.mean(arr1))
print("Sum:", np.sum(arr1))
print("Standard Deviation:", np.std(arr1))
print("Square Root:", np.sqrt(arr1))
print("Exponential:", np.exp(arr1))
Mean: 3.0
Sum: 15
Standard Deviation: 1.4142135623730951
Square Root: [1. 1.41421356 1.73205081 2. 2.23606798]
Exponential: [  2.71828183   7.3890561   20.08553692  54.59815003 148.4131591 ]
First Element: 1
Sub-array: [2 3 4]
# Array concatenation
combined = np.concatenate([arr1, np.array([6, 7, 8, 9])])
print("Combined Array:", combined)
Combined Array: [1 2 3 4 5 6 7 8 9]
Reshaped Array:
[[1 2 3]
[4 5 6]]
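The "First Element", "Sub-array" and "Reshaped Array" outputs above appear without the statements that produced them. The sketch below shows one way to obtain the same results and also illustrates the broadcasting and dot-product steps from the algorithm; the use of np.arange(1, 7) for the reshaped array and the values used for broadcasting are assumptions.

```python
import numpy as np

arr1 = np.array([1, 2, 3, 4, 5])

# Indexing and slicing
first = arr1[0]          # first element
sub = arr1[1:4]          # elements at positions 1, 2 and 3
print("First Element:", first)
print("Sub-array:", sub)

# Reshaping: six values arranged as 2 rows x 3 columns
reshaped = np.arange(1, 7).reshape(2, 3)
print("Reshaped Array:\n", reshaped)

# Broadcasting: the 1-D row is added to every row of the 2-D array
shifted = reshaped + np.array([10, 20, 30])
print("Broadcast sum:\n", shifted)

# Dot product and matrix multiplication
a = np.array([[1, 2], [3, 4]])
print("Dot product:", np.dot(arr1, arr1))   # 1 + 4 + 9 + 16 + 25
print("Matrix product:\n", a @ a)
```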
B. Pandas DataFrames
# Import pandas library
import pandas as pd
# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'Marks': [88, 92, 79, 85, 90],
    'City': ['Chennai', 'Mumbai', 'Chennai', 'Delhi', 'Mumbai']
}
df = pd.DataFrame(data)
print("\nDataFrame:\n", df)
DataFrame:
Name Age Marks City
0 Alice 25 88 Chennai
1 Bob 30 92 Mumbai
2 Charlie 35 79 Chennai
3 David 40 85 Delhi
4 Eva 45 90 Mumbai
Group by City:
City
Chennai 83.5
Delhi 85.0
Mumbai 91.0
Name: Marks, dtype: float64
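The "Group by City" output above is shown without its code. A minimal sketch of the selection, sorting, grouping and lambda steps from the algorithm follows, using the same five students; the pass mark of 80 and the Result column are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 45],
    "Marks": [88, 92, 79, 85, 90],
    "City": ["Chennai", "Mumbai", "Chennai", "Delhi", "Mumbai"],
})

# Selection and filtering: students older than 30
older = df[df["Age"] > 30]

# Sorting: highest marks first
ranked = df.sort_values("Marks", ascending=False)

# Grouping and aggregation: mean marks per city
city_mean = df.groupby("City")["Marks"].mean()
print("Group by City:\n", city_mean)

# New column via a lambda (the pass mark of 80 is an assumption)
df["Result"] = df["Marks"].apply(lambda m: "Pass" if m >= 80 else "Fail")
print(df)
```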
CSV Loaded:
Name Age Marks City
0 Alice 25 88.00 Chennai
1 Bob 30 92.00 Mumbai
2 Charlie 35 88.75 Chennai
3 David 40 85.00 Delhi
4 Eva 45 90.00 Mumbai
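In the "CSV Loaded" output, Charlie's mark is 88.75, which equals the mean of the other four marks (88, 92, 85 and 90). This suggests that a missing value was filled with the column mean before the file was saved. The sketch below reproduces that sequence under this assumption; an in-memory buffer stands in for the CSV file, whose name is not given.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 45],
    "Marks": [88, 92, np.nan, 85, 90],      # Charlie's mark treated as missing
    "City": ["Chennai", "Mumbai", "Chennai", "Delhi", "Mumbai"],
})

# Fill the gap with the mean of the remaining marks: (88 + 92 + 85 + 90) / 4
df["Marks"] = df["Marks"].fillna(df["Marks"].mean())

# Round trip through CSV text; the buffer stands in for a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)
print("CSV Loaded:\n", loaded)
```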
# Line Plot
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.figure(figsize=(6, 4))
plt.plot(x, y, label='Sine', color='blue')
plt.title("Line Plot")
plt.xlabel("X")
plt.ylabel("sin(X)")
plt.legend()
plt.grid(True)
plt.show()
# Bar Chart
plt.bar(df['Name'], df['Marks'], color='green')
plt.title("Bar Chart - Marks")
plt.xlabel("Name")
plt.ylabel("Marks")
plt.show()
# Histogram
plt.hist(df['Marks'], bins=5, color='yellow')
plt.title("Histogram - Marks Distribution")
plt.xlabel("Marks")
plt.ylabel("Frequency")
plt.show()
# Scatter Plot
plt.scatter(df['Age'], df['Marks'], color='red')
plt.title("Scatter - Age vs Marks")
plt.xlabel("Age")
plt.ylabel("Marks")
plt.show()
# Pie Chart
city_counts = df['City'].value_counts()
plt.pie(city_counts, labels=city_counts.index, autopct='%1.1f%%', startangle=90)
plt.title("Pie Chart - City Distribution")
plt.axis('equal')
plt.show()
# Subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 4))
axs[0].plot(x, np.cos(x), label='Cosine', color='purple')
axs[1].plot(x, np.tan(x), label='Tangent', color='brown')
axs[0].set_title("Cosine Plot")
axs[1].set_title("Tangent Plot")
for ax in axs:
    ax.grid(True)
    ax.legend()
plt.tight_layout()
plt.show()
RESULT:
Thus, the operations using NumPy arrays, Pandas DataFrames, and Matplotlib plots were
successfully implemented.
VIVA QUESTIONS
5. What type of insights can you get from a pie chart of email labels?
Asked by: Bosch
It shows the percentage distribution of spam vs. ham emails, helping visualize class proportions
clearly.
9. Why do we use drop(columns=[...]) in Pandas?
Asked by: Wipro
To remove irrelevant or redundant columns that do not contribute meaningful insights to the
analysis.
Ex. No. 4 EXPLORING VARIOUS VARIABLE AND ROW FILTERS IN R FOR CLEANING DATA
Date
AIM:
To explore and apply variable and row-level filtering techniques in R for effective data cleaning
and basic visualization.
• Run R Code
o You have three options:
▪ To run all from terminal: Use the command Rscript filename.R.
▪ To run line-by-line from Console: Place your cursor on a line and press
Ctrl + Enter.
▪ To run all from Console: Highlight all code and press Ctrl + Enter, or
click "Run" in the top right of the editor.
7. View Output
• The output will appear in the Console.
• Plots will appear in the Plots tab (bottom-right).
ALGORITHM:
1. Data Preparation
a. Load required R packages.
b. Create a sample dataset using data.frame().
c. Display the raw data for reference.
2. Variable-Based Filtering
a. Filter by a specific condition (e.g., Age > 30).
b. Filter using multiple conditions (e.g., Age > 30 & Gender == "Male").
3. Row-Based Filtering
a. Remove duplicate rows using unique().
b. Remove missing values using na.omit().
4. Data Visualization
a. Load the ggplot2 package.
b. Create the following visualizations using the cleaned dataset:
   • Scatterplot: Age vs. Score, colored by Gender.
   • Histogram: Distribution of Age.
   • Bar chart: Gender distribution.
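For comparison with the pandas experiments in this manual, the same filtering steps can be sketched in Python. This is only an analogue of the R workflow, not part of the R program; the table values are copied from the "Original Data" output of this experiment.

```python
import pandas as pd

# Same values as the "Original Data" output of this experiment
data = pd.DataFrame({
    "ID": range(1, 11),
    "Age": [48, 32, 31, 20, 59, 60, 54, 31, 42, 43],
    "Gender": ["Male", "Female", "Male", "Male", "Male",
               "Male", "Female", "Male", "Male", "Male"],
    "Score": [99, 72, 26, 7, 42, 9, 83, 36, 78, 81],
})

# Variable-based filtering
over_30 = data[data["Age"] > 30]
over_30_male = data[(data["Age"] > 30) & (data["Gender"] == "Male")]

# Row-based filtering: pandas counterparts of unique() and na.omit() in R
cleaned = data.drop_duplicates().dropna()

print(len(over_30), len(over_30_male), len(cleaned))
```

Because this sample has no duplicate or missing rows, the cleaned table keeps all ten records; the two filters keep nine and seven rows respectively.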
PROGRAM:
# Load necessary library
library(ggplot2)
# Step 2: Variable Filtering
# 2.1 Filter: Age > 30
filtered_data1 <- data[data$Age > 30, ]
print("Filtered Data (Age > 30):")
print(filtered_data1)
OUTPUT:
[1] "Original Data:"
ID Age Gender Score
1 1 48 Male 99
2 2 32 Female 72
3 3 31 Male 26
4 4 20 Male 7
5 5 59 Male 42
6 6 60 Male 9
7 7 54 Female 83
8 8 31 Male 36
9 9 42 Male 78
10 10 43 Male 81
RESULT:
Thus, various variable and row filters in R were successfully explored and applied for data
cleaning, and the cleaned dataset was visualized using multiple plot types.
VIVA QUESTIONS
8. How do you filter records where Age > 30 and Gender is Male?
Asked by: Deloitte
Using subset(df, Age > 30 & Gender == "Male").
10. Why is data cleaning necessary before visualization?
Asked by: Mindtree
Data cleaning ensures accuracy and clarity in analysis and visualization by removing
inconsistencies, duplicates, and missing values.
Ex. No. 5 DATA ANALYSIS AND REPRESENTATION ON A MAP
Date
AIM:
To visualize data on a world map using interactive features like mouse rollover and zoom using
Python’s Plotly library.
ALGORITHM:
Step 1: Open Google Colab and import necessary libraries.
Step 2: Prepare sample data (e.g., country names and some values).
Step 3: Use px.choropleth (Plotly Express) to create a choropleth map.
Step 4: Add features like tooltips and color scales.
Step 5: Display the interactive map.
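The steps above can be sketched as follows. The five country values match the table printed in this experiment; the function name build_map, the color scale and the title are assumptions, and Plotly is imported inside the function so that the data preparation runs even where Plotly is not installed.

```python
import pandas as pd

# Step 2: sample data matching the table printed in this experiment
df = pd.DataFrame({
    "Country": ["India", "United States", "China", "Brazil", "Australia"],
    "Value": [80, 90, 75, 60, 70],
})


def build_map(frame):
    """Return a Plotly choropleth figure coloured by the Value column."""
    import plotly.express as px  # imported here so the data step needs no Plotly

    return px.choropleth(
        frame,
        locations="Country",
        locationmode="country names",      # match regions by country name
        color="Value",
        hover_name="Country",              # label shown on mouse rollover
        color_continuous_scale="Viridis",  # assumed color scale
        title="Sample Values by Country",  # hypothetical title
    )


print(df)
```

In a notebook, fig = build_map(df) followed by fig.show() displays the map; zooming and hover tooltips are provided by Plotly without extra code.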
         Country  Value
0          India     80
1  United States     90
2          China     75
3         Brazil     60
4      Australia     70
RESULT:
Thus, an interactive data representation on a world map using Plotly with user interaction features
is executed successfully.
VIVA QUESTIONS
9. What is the purpose of hover_name in px.choropleth?
Asked by: Mindtree
It sets the label that appears when the user hovers over a region on the map.
10. How can you change the map's color theme in Plotly?
Asked by: Accenture
You can change it by modifying the color_continuous_scale parameter using predefined or
custom color scales.
Ex. No. 6 BUILDING CARTOGRAPHIC VISUALIZATION
Date
AIM:
To build the cartographic visualization using multiple datasets from various countries and states
on a world map.
ALGORITHM:
Step 1: Install and import required libraries.
Step 2: Prepare data for cities from different countries and states.
Step 3: Convert city coordinates to GeoDataFrame.
Step 4: Load the world map from an online shapefile.
Step 5: Plot the world map and overlay the city points.
Step 6: Customize with labels, colors, and title.
Step 7: Display the cartographic visualization.
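A minimal sketch of these steps is given below. The city sample and its approximate coordinates are hypothetical, the function name plot_cities is an assumption, and world_path is a placeholder for the location of the world shapefile, which this manual does not name. GeoPandas and Matplotlib are imported inside the function so the data preparation can be checked on its own.

```python
import pandas as pd

# Step 2: hypothetical city sample; coordinates are approximate
cities = pd.DataFrame(
    {
        "country": ["India", "India", "United States", "Japan"],
        "latitude": [13.08, 28.61, 40.71, 35.68],
        "longitude": [80.27, 77.21, -74.01, 139.65],
    },
    index=pd.Index(["Chennai", "New Delhi", "New York", "Tokyo"], name="city"),
)


def plot_cities(frame, world_path):
    """Overlay city points on a world base map read from world_path."""
    import geopandas as gpd
    import matplotlib.pyplot as plt

    # Step 3: build point geometries from longitude (x) and latitude (y)
    gdf = gpd.GeoDataFrame(
        frame,
        geometry=gpd.points_from_xy(frame["longitude"], frame["latitude"]),
        crs="EPSG:4326",
    )
    # Step 4: world_path is a placeholder for the shapefile location
    world = gpd.read_file(world_path)

    # Steps 5-6: base map first, then points and labels on the same axes
    ax = world.plot(figsize=(12, 6), color="lightgrey", edgecolor="white")
    gdf.plot(ax=ax, color="red", markersize=30)
    for name, row in gdf.iterrows():
        ax.annotate(name, (row["longitude"], row["latitude"]), fontsize=8)
    ax.set_title("Cities on a World Map")
    plt.show()  # Step 7
    return gdf


print(cities)
```

Note the argument order in points_from_xy: longitude is the x coordinate and latitude is the y coordinate, which is a common source of misplaced points.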
country latitude longitude
city
RESULT:
Thus, the cartographic visualization for cities from multiple datasets involving various countries
and states was successfully built using Python.
VIVA QUESTIONS
9. How can cities be overlaid on a world map?
Asked by: IBM
Cities are represented as point geometries in a GeoDataFrame and plotted over a world base map
using GeoPandas plotting functions.
Ex. No. 7 PERFORMING EDA ON WINE QUALITY DATA SET
Date
AIM:
To perform Exploratory Data Analysis (EDA) on the Wine Quality dataset using Python libraries.
ALGORITHM:
Step 1: Load libraries like pandas, numpy, matplotlib.pyplot, and seaborn.
Step 2: Read the Wine Quality CSV file using pandas.read_csv().
Step 3: Show the first few rows using head().
Step 4: Use info() and describe() for structure and summary.
Step 5: Check for missing values using isnull().sum().
Step 6: Use value_counts() to see how many wines fall into each quality category.
Step 7: Use df.hist() to visualize the distribution of all numerical features.
Step 8: Plot boxplots using sns.boxplot() to detect outliers in features.
Step 9: Use sns.heatmap() to show correlation between features.
Step 10: Create a boxplot of alcohol against quality to study how alcohol affects wine quality.
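The numeric parts of Steps 5, 6 and 9 can be sketched on a small table. The rows below are the five rows that this experiment prints from head(), restricted to the columns visible there; five rows are far too few to draw conclusions, so the sketch only illustrates the calls.

```python
import pandas as pd

# The five rows printed by head() in this experiment (visible columns only)
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "volatile acidity": [0.70, 0.88, 0.76, 0.28, 0.70],
    "citric acid": [0.00, 0.00, 0.04, 0.56, 0.00],
    "residual sugar": [1.9, 2.6, 2.3, 1.9, 1.9],
    "chlorides": [0.076, 0.098, 0.092, 0.075, 0.076],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
    "quality": [5, 5, 5, 6, 5],
})

# Step 5: missing values per column
missing = df.isnull().sum()

# Step 6: number of wines per quality score
quality_counts = df["quality"].value_counts().sort_index()

# Step 9 (numeric part): correlation of each feature with quality
quality_corr = (
    df.corr(numeric_only=True)["quality"].drop("quality").sort_values()
)
print(missing, quality_counts, quality_corr, sep="\n\n")
```

Sorting the quality column of the correlation matrix gives the same ranking that the heatmap shows by color, in a form that is easier to quote in a report.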
First 5 rows:
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076
1            7.8              0.88         0.00             2.6      0.098
2            7.8              0.76         0.04             2.3      0.092
3           11.2              0.28         0.56             1.9      0.075
4            7.4              0.70         0.00             1.9      0.076
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
print("\nDataset Info:")
print(df.info())
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 1599 non-null float64
1 volatile acidity 1599 non-null float64
2 citric acid 1599 non-null float64
3 residual sugar 1599 non-null float64
4 chlorides 1599 non-null float64
5 free sulfur dioxide 1599 non-null float64
6 total sulfur dioxide 1599 non-null float64
7 density 1599 non-null float64
8 pH 1599 non-null float64
9 sulphates 1599 non-null float64
10 alcohol 1599 non-null float64
11 quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
None
print("\nSummary Statistics:")
print(df.describe())
Summary Statistics:
fixed acidity volatile acidity citric acid residual sugar \
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806
std 1.741096 0.179060 0.194801 1.409928
min 4.600000 0.120000 0.000000 0.900000
25% 7.100000 0.390000 0.090000 1.900000
50% 7.900000 0.520000 0.260000 2.200000
75% 9.200000 0.640000 0.420000 2.600000
max 15.900000 1.580000 1.000000 15.500000
Data Types:
fixed acidity float64
volatile acidity float64
citric acid float64
residual sugar float64
chlorides float64
free sulfur dioxide float64
total sulfur dioxide float64
density float64
pH float64
sulphates float64
alcohol float64
quality int64
dtype: object
# Step 7: Boxplot of features
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, orient="h")
plt.title("Boxplots of Features")
plt.show()
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm",
            linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()
RESULT:
Thus, the EDA was successfully performed on the Wine Quality dataset and various plots and
summary statistics were generated.
VIVA QUESTIONS
9. Why are seaborn and matplotlib used together?
Asked by: Accenture
Seaborn offers higher-level visualization tools built on matplotlib, making plots more attractive
and customizable.
10. Which features in this dataset are most correlated with wine quality?
Asked by: L&T Infotech
Typically, alcohol (positive) and volatile acidity (negative) show the strongest correlations with
wine quality in the heatmap, although both are moderate rather than strong.
Ex. No. 8 TIME SERIES ANALYSIS USING VARIOUS VISUALIZATION TECHNIQUES
Date
AIM:
To perform a simple time series analysis on a dataset using line plots and autocorrelation to observe
trends and seasonality.
ALGORITHM:
Step 1: Import required libraries.
Step 2: Load the dataset and parse the "Month" column as a date.
Step 3: Set the "Month" column as index for time-based plotting.
Step 4: Visualize the number of air passengers over time using a line plot.
Step 5: Use autocorrelation to check seasonality or repeating patterns.
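Steps 2, 3 and 5 can be sketched numerically as shown below. The series is synthetic (a linear trend plus a 12-month cycle), not the Air Passengers data, and only the column names follow this experiment. Series.autocorr computes the correlation between the series and a copy of itself shifted by the given lag.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (not the real data): linear trend + 12-month cycle
months = pd.date_range("1949-01-01", periods=144, freq="MS")
t = np.arange(144)
raw = pd.DataFrame({
    "Month": months.strftime("%Y-%m"),
    "#Passengers": 100 + 2 * t + 20 * np.sin(2 * np.pi * t / 12),
})

# Steps 2-3: parse the Month column as dates and use it as the index
raw["Month"] = pd.to_datetime(raw["Month"], format="%Y-%m")
df = raw.set_index("Month")

# Step 5: autocorrelation at selected lags (lag 12 = one year)
series = df["#Passengers"]
acf_6 = series.autocorr(lag=6)
acf_12 = series.autocorr(lag=12)
print(f"lag 6: {acf_6:.3f}, lag 12: {acf_12:.3f}")
```

Because of the trend, the autocorrelation is high at every lag; the seasonal signal is the extra rise at lag 12 compared with neighbouring lags. For a plot over all lags, pandas also provides pandas.plotting.autocorrelation_plot.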
#Passengers
Month
1949-01-01 112
1949-02-01 118
1949-03-01 132
1949-04-01 129
1949-05-01 121
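# Helper that draws a line plot (default argument values are assumed)
def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16, 5), dpi=dpi)
    plt.plot(x, y, color='tab:blue')
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.grid()  # assumed call: grid lines for readability
    plt.show()
# Step 4: Line plot to show trend over time
plot_df(df, x=df.index, y=df['#Passengers'], title='Monthly Air Passengers (1949 - 1960)')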
RESULT:
Thus, the Air Passengers dataset was successfully visualized using a line plot, and the
autocorrelation plot revealed seasonality in air travel patterns.
VIVA QUESTIONS
10. What does a strong autocorrelation at lag 12 indicate in this dataset?
Asked by: Tech Mahindra
It indicates a yearly seasonal pattern, meaning passenger numbers tend to repeat annually.
Ex. No. 9 VISUALIZING VARIOUS EDA TECHNIQUES AS CASE STUDY FOR IRIS DATASET
Date
AIM:
To apply various Exploratory Data Analysis (EDA) and Visualization techniques on the Iris
dataset to understand the patterns and relationships between features.
ALGORITHM:
Step 1: Import required libraries.
Step 2: Load the Iris dataset into a pandas DataFrame.
Step 3: Display basic dataset info, statistics, and check for missing values.
Step 4: Visualize distributions using histograms and box plots.
Step 5: Use pairplot and heatmap to study relationships and correlations.
Step 6: Create a violin plot to compare feature distributions across species.
First 5 rows:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
print("\nDataset Info:")
print(df.info())
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
None
print("\nSummary Statistics:")
print(df.describe())
Summary Statistics:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
count 150.000000 150.000000 150.000000 150.000000 150.000000
mean 75.500000 5.843333 3.054000 3.758667 1.198667
std 43.445368 0.828066 0.433594 1.764420 0.763161
min 1.000000 4.300000 2.000000 1.000000 0.100000
25% 38.250000 5.100000 2.800000 1.600000 0.300000
50% 75.500000 5.800000 3.000000 4.350000 1.300000
75% 112.750000 6.400000 3.300000 5.100000 1.800000
max 150.000000 7.900000 4.400000 6.900000 2.500000
# Step 4: Histograms
df.hist(figsize=(10, 8), color='skyblue')
plt.suptitle("Histograms of Iris Features")
plt.tight_layout()
plt.show()
# Step 5: Box plots
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, orient='h')
plt.title("Boxplot of All Features")
plt.show()
plt.suptitle("Pair Plot of Iris Features", y=1.02)
plt.show()
# Step 8: Violin plot for feature vs species
plt.figure(figsize=(10, 6))
sns.violinplot(x='Species', y='PetalLengthCm', data=df)
plt.title("Petal Length by Species (Violin Plot)")
plt.show()
RESULT:
Thus, the Iris dataset was successfully analyzed using various EDA techniques.
VIVA QUESTIONS
9. How can we confirm class imbalance in the Iris dataset?
Asked by: Cognizant
Use df['Species'].value_counts() to check the number of records per class and verify if they are
balanced.
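A minimal sketch of this check follows. The Series below is a stand-in for the Species column, built on the fact that the standard Iris data has 50 rows per class; with the real DataFrame the same calls would be applied to df['Species'].

```python
import pandas as pd

# Stand-in for the Species column: the standard Iris data has 50 rows per class
species = pd.Series(
    ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50,
    name="Species",
)

counts = species.value_counts()                 # rows per class
shares = species.value_counts(normalize=True)   # fraction of rows per class

# A dataset is balanced when every class has the same number of rows
is_balanced = counts.nunique() == 1
print(counts)
print("Balanced:", is_balanced)
```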
Ex. No. 10 CBS: ADVANCED DATA ANALYSIS AND VISUALIZATION ON A DATASET
Date
AIM:
To perform advanced data analysis and visualization techniques on the Titanic dataset using
Python libraries.
ALGORITHM:
Step 1: Import libraries: Import pandas, numpy, matplotlib, and seaborn.
Step 2: Load dataset: Read the Titanic dataset using pd.read_csv().
Step 3: Explore data: Display head, info, and describe to understand the dataset.
Step 4: Handle missing data: Fill missing values in Age, Embarked, and Cabin columns.
Step 5: Create new features: Add FamilySize and categorize age groups.
Step 6: Visualize data: Plot countplots, boxplots, histograms, pairplots, and heatmap.
Step 7: Display plots: Use plt.show() to display all the generated visualizations.
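Steps 4 and 5 can be sketched on a few hypothetical rows that use the Titanic column names. The choice of median for Age and most frequent value for Embarked, the FamilySize definition (SibSp + Parch + 1) and the age bands are assumptions; the "Unknown" placeholder for Cabin follows the viva answer in this experiment.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using Titanic column names
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 8.0],
    "Embarked": ["S", "C", "S", None, "Q"],
    "Cabin": [None, "C85", None, "C123", None],
    "SibSp": [1, 1, 0, 1, 3],
    "Parch": [0, 0, 0, 0, 1],
})

# Step 4: median for Age, most frequent port for Embarked, placeholder for Cabin
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
df["Cabin"] = df["Cabin"].fillna("Unknown")

# Step 5: family size counts the passenger too (assumed definition)
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Age bands are illustrative assumptions, not the manual's own bins
df["AgeGroup"] = pd.cut(
    df["Age"], bins=[0, 12, 18, 60, 100],
    labels=["Child", "Teen", "Adult", "Senior"],
)
print(df)
```

Assigning the result of fillna back to the column, rather than calling it with inplace=True on a selected column, avoids chained-assignment warnings in recent pandas versions.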
# Load dataset
#Link:
#[Link]
df = pd.read_csv("[Link]")
Name Sex Age SibSp
\
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
print("\nDataset Info:")
df.info()
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
# 2. Survival by Sex
sns.countplot(x='Sex', hue='Survived', data=df)
plt.title("Survival by Gender")
plt.show()
# 3. Age Distribution by Survival
sns.histplot(data=df, x='Age', hue='Survived', kde=True, bins=30)
plt.title("Age Distribution by Survival")
plt.show()
# 5. Family Size vs Survival
sns.countplot(x='FamilySize', hue='Survived', data=df)
plt.title("Family Size and Survival")
plt.show()
# 6. Heatmap of Correlation
plt.figure(figsize=(10, 8))
sns.heatmap(df[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'FamilySize']].corr(),
            annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
RESULT:
Thus, advanced data analysis and visualization on the Titanic dataset is successfully performed.
VIVA QUESTIONS
5. Why are countplots preferred for categorical features like Sex and Survived?
Asked by: Capgemini
Countplots visually show the frequency distribution of categories and how they relate to a target
variable.
9. Why is Cabin filled with “Unknown” instead of a statistical value?
Asked by: Mindtree
Because Cabin contains mostly missing values and is a categorical feature, "Unknown" is used as
a placeholder instead of an arbitrary guess.