Data Exploration & Visualization Syllabus

The document outlines the syllabus for the Data Exploration and Visualization course for B.Sc. Computer Science students, covering topics such as NumPy arrays, data manipulation with Pandas, grouping and aggregation, data visualization with Matplotlib and Seaborn, and interactive visualizations with Plotly. It includes a week-by-week breakdown of units, chapters, and reference materials, along with suggestive practice questions using various datasets. The course aims to equip students with essential skills in data analysis and visualization techniques.

Uploaded by

whyytrishh

B.Sc. Prog.

Computer Science Sem IV

DSE 02a: Data Exploration and Visualization

(Admission 2022 onwards)

Week 1 to 2
Topics/Units: Unit 1 (10): Creating and Manipulating NumPy arrays: creating arrays, indexing and slicing, mathematical operations with NumPy arrays
Chapter: Ch 4: 4.1 (up to pg 103); usage of the rand(), randn() and randint() functions of NumPy
Ref: [1]

Week 3 to 5
Topics/Units: Unit 2 (15): Data Manipulation with Pandas: Series and DataFrame objects; importing and exporting data from various file formats into pandas DataFrame; Data selection and filtering: indexing, slicing, conditional filtering using boolean indexing; Data Cleaning: handling missing data in Pandas and outlier detection; Data Manipulation: sorting, reshaping, merging.
Chapter: Ch 5: 5.1, 5.2 (up to pg 149), 5.3; Ch 6: 6.1 (pg 169-172, 175); Ch 7: 7.1, 7.2 (up to pg 202, 205-206); Ch 8: 8.1 (pg 221-223), 8.2 (pg 227-231), 8.3 (pg 243-245)
Ref: [1]

Week 6 to 9
Topics/Units: Unit 3 (5): Grouping and Aggregation with Pandas: Grouping data using Pandas, applying aggregation functions such as sum, mean, count, [Link] grouped data, using pivot tables and cross-tabulation for data summarization
Chapter: Ch 10: 10.1 (up to pg 293), 10.2, 10.3 (up to pg 303), 10.4
Ref: [1]

Week 10 to 13
Topics/Units: Unit 4 (10): Data Visualization with Matplotlib and Seaborn: Introduction to Matplotlib and Seaborn to plot data using figures and subplots; Plots: line plots, scatter plots, and bar plots; Visualizing distributions using histograms and box plots; Customizing plot aesthetics and adding annotations
Chapter: Ch 9: 9.1 (pg 253-264, 267), 9.2
Ref: [1]

Week 14 to 15
Topics/Units: Unit 5 (5): Interactive Visualizations with Plotly: Introduction to the Plotly library for interactive visualization; Creating interactive line plots, scatter plots, and bar plots; Adding interactivity with hover effects, zooming, and panning
Chapter: Chapter 8 (up to topic: Use of bar charts in Plotly)
Ref: [4]

References

1. McKinney W. Python for Data Analysis: Data Wrangling with Pandas, NumPy and IPython. 2nd edition. O'Reilly Media, 2018.

2. Molin S. Hands-On Data Analysis with Pandas. Packt Publishing, 2019.

3. VanderPlas J. Python Data Science Handbook: Essential Tools for Working with Data. 2nd edition. O'Reilly Media, Inc.

4. Rahman K. Python Data Visualization Essentials Guide: Become a Data Visualization expert by building strong proficiency in Pandas, Matplotlib, Seaborn, Plotly, Numpy, and Bokeh. BPB, 2021.

Additional References:

1. Chen D. Y. Pandas for Everyone: Python Data Analysis. Pearson, 2018.

Online references/material:
1. [Link]
2. [Link]

Suggested Practice Questions:

Use a dataset of your choice from an Open Data Portal (https:// [Link]/, UCI repository), or load one from the
scikit-learn or seaborn library, for the following exercises to practice the concepts learnt.

1. Write a program using the NumPy library to perform the following tasks:

A. Generate a 5x2 integer array with values ranging from 50 to 100, where each element has a
difference of 5. Reshape the resulting array to a size of 10x1.
B. Create a 1D random array with values ranging from 1 to 100. Calculate various statistical
measures such as minimum, maximum, mean, median, standard deviation, number of unique
values, count of unique values, and the most frequent value in the array.
C. Create a 5x5 identity matrix where all the diagonal elements are set to the value 5.
D. Consider a dataset containing the heights (in centimeters) and weights (in kilograms) of 20
individuals. Your task is to perform various operations using the NumPy library to analyze the
data.
a. Create a NumPy array called "heights" with the following height values: [165, 170,
175, 168, 172, 180, 160, 169, 176, 171, 174, 182, 158, 167, 173, 179, 163, 166, 177,
181]. Create a NumPy array called "weights" with the following weight values: [60, 65,
70, 75, 80, 85, 55, 58, 63, 68, 72, 77, 50, 62, 67, 74, 52, 57, 69, 73].
b. Create a new NumPy array called "combined" by stacking the heights and weights
arrays such that the shape of the resulting array is 20 x 2.
c. Calculate and print the mean height and weight of the individuals in the dataset.
d. Find and print the index of the shortest and tallest individuals in the dataset.
e. Sort the array by the heights of the individuals.
f. Swap the positions of the two columns in the array.
g. Retrieve records of individuals having weight below 70kg.

2. Write a program using the Pandas library to perform the following operations on the penguins dataset from
the Seaborn library:
A. Load the penguins dataset into a Pandas dataframe.
B. Determine the number of observations/records and the number of attributes in the dataframe.
C. Display the names of the attributes, row indexes, and data types of each attribute in the dataframe.
D. Display the first 5 and last 5 records of the dataframe.
E. Retrieve the values of the second column for the third and fourth records.
F. Display a summary of the data distribution for all attributes in the dataframe.
G. Compute the pairwise correlation between all attributes in the dataframe.

3. Consider the Titanic dataset, which contains information about passengers on board the Titanic, including
their age, gender, passenger class, survival status, and other attributes. Write a program using the Pandas
library to perform the following operations on the Titanic dataset:
A. Load the Titanic dataset into a Pandas DataFrame.
B. Check for any duplicate records and missing values in the dataset and handle them appropriately.
C. Calculate and display the total number of passengers who survived and those who did not.
D. Filter the DataFrame to select only the records of passengers who were under the age of 18.
E. Calculate the average age of the passengers in each passenger class.
F. Create a new column in the DataFrame called "Family Size" that represents the total number of family
members (including the passenger) on board.
G. Calculate the correlation between age and fare attributes of the dataset.
H. Create a contingency table that shows the count of passengers based on their survival status (survived
or not) and passenger class (first, second, or third class).

4. Use the iris dataset from the scikit-learn library to generate various visual representations of the data using
the Matplotlib and/or Seaborn libraries, with proper legends and labels. Perform the following tasks:

A. Create a scatter plot to visualize the relationship between petal length and petal width for different
instances of iris flowers.
B. Generate histograms to display the data distribution of each of the four attributes in the iris dataset.
C. Construct a pie chart to illustrate the frequency count of each flower type in the iris dataset.
D. Create a pair plot that showcases the relationship between every pair of attributes in the iris dataset
(using the Seaborn library only).

5. Recreate the visualizations from question 4 (parts A and C) using the Plotly library.

Contributors:

Common questions

Creating new columns in a dataset, like "Family Size" in the Titanic dataset, is a form of feature engineering: new attributes are derived from existing data to capture information that no single original column holds. "Family Size" counts the passenger together with the family members travelling with them, a quantity that may correlate with survival odds. Engineered features of this kind can improve model building by exposing patterns or relationships that the raw columns do not show directly.
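As a minimal sketch of the idea above, the snippet below derives a "Family Size" column with Pandas. The column names `SibSp` and `Parch` (siblings/spouses and parents/children aboard) are an assumption based on a commonly distributed Titanic CSV, and the four rows are made up for illustration rather than taken from the real dataset.

```python
import pandas as pd

# Hypothetical rows; the column names SibSp/Parch are an assumption
# about how the Titanic file at hand labels relatives aboard.
titanic = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "SibSp": [1, 0, 3, 0],     # siblings/spouses aboard
    "Parch": [0, 0, 2, 1],     # parents/children aboard
    "Survived": [0, 1, 0, 1],
})

# Family size = relatives aboard + the passenger themselves.
titanic["Family Size"] = titanic["SibSp"] + titanic["Parch"] + 1

# Use the new feature: survival rate for each family size.
survival_by_family = titanic.groupby("Family Size")["Survived"].mean()

print(titanic[["Name", "Family Size"]])
print(survival_by_family)
```

With a real file, only the construction of `titanic` would change (for example, loading it with `pd.read_csv`); the derived column is computed the same way.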

Scatter plots in the iris dataset visualize relationships between two attributes, such as petal length and petal width, highlighting correlations and distribution patterns. Pair plots extend this by comprehensively visualizing relationships between each pair of attributes across all samples in the dataset. This can reveal trends, clusters, and the presence of any anomalies, helping to understand how different attributes relate and vary across different species within the iris dataset.
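The following sketch illustrates both plot types with Matplotlib and Pandas only. Two things are assumptions made to keep it self-contained: the measurements are a small made-up table (not the real iris data), and `pandas.plotting.scatter_matrix` is used as a stand-in for Seaborn's pair plot, since it draws the same grid of pairwise scatter plots.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up measurements for illustration; not the real iris dataset.
flowers = pd.DataFrame({
    "petal_length": [1.4, 1.5, 1.3, 4.5, 4.7, 4.4, 5.9, 6.0, 5.6],
    "petal_width":  [0.2, 0.2, 0.3, 1.5, 1.4, 1.3, 2.1, 2.5, 2.2],
    "sepal_length": [5.1, 4.9, 4.7, 6.4, 7.0, 6.1, 7.1, 6.3, 6.5],
    "species": ["a"] * 3 + ["b"] * 3 + ["c"] * 3,
})

# Scatter plot: one colour per species, with axis labels and a legend.
fig, ax = plt.subplots()
for name, group in flowers.groupby("species"):
    ax.scatter(group["petal_length"], group["petal_width"], label=name)
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
ax.legend(title="species")

# Pairwise grid: every numeric attribute plotted against every other,
# with histograms on the diagonal.
axes = scatter_matrix(flowers.drop(columns="species"),
                      figsize=(6, 6), diagonal="hist")
```

In an interactive session, `plt.show()` would display both figures; the `Agg` backend line can then be dropped.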

NumPy arrays allow efficient data manipulation through operations such as indexing, slicing, and element-wise mathematical operations. NumPy's mathematical functions, like sum(), mean(), and max(), can operate over entire arrays or along a specific axis. Additionally, NumPy includes functions for generating random numbers, like rand() for a uniform distribution over [0, 1), randn() for the standard normal distribution, and randint() for random integers. Arrays can be reshaped using the reshape method, enabling conversion between different dimensions as required for data processing.
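A short sketch of these operations is given below. The variable names and the particular value range are illustrative choices, and the seed is fixed only so that repeated runs give the same random numbers.

```python
import numpy as np

np.random.seed(0)  # fixed seed so reruns produce the same random values

# Creation and reshaping: 10 values (50, 55, ..., 95) arranged as 5 rows x 2 columns.
arr = np.arange(50, 100, 5)
grid = arr.reshape(5, 2)

# Indexing and slicing.
first_row = grid[0]        # first row
second_col = grid[:, 1]    # every row, second column

# Element-wise arithmetic and axis-wise aggregation.
doubled = grid * 2
col_sums = grid.sum(axis=0)    # one sum per column
row_means = grid.mean(axis=1)  # one mean per row

# Random number generation.
uniform = np.random.rand(3)                   # uniform on [0, 1)
normal = np.random.randn(3)                   # standard normal
integers = np.random.randint(1, 101, size=5)  # integers from 1 to 100 inclusive

print(grid)
print(col_sums, row_means)
```

Note that the upper bound of `randint` is exclusive, which is why 101 is passed to allow the value 100.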

Data cleaning in Pandas involves handling missing values, which can be achieved using functions like fillna() to replace them with a specified value or method (e.g., forward fill) and dropna() to remove incomplete records. Outlier detection involves identifying values that deviate significantly from the dataset they belong to. Pandas can use conditions combined with statistical methods like z-scores or IQR to detect outliers. These tools allow for effective preprocessing, preparing data for analysis and visualization.
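The sketch below walks through these steps on a small made-up series of ages. The 1.5 × IQR fences are a common convention rather than something the syllabus prescribes, and `ffill()` is used for forward filling.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries and one suspicious value (95).
ages = pd.Series([22, 25, np.nan, 24, 23, 95, np.nan, 26], name="age")

# Three ways to handle the missing entries.
dropped = ages.dropna()              # discard incomplete records
filled = ages.fillna(ages.median())  # replace with the median of known values
forward = ages.ffill()               # carry the previous value forward

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = filled[(filled < lower) | (filled > upper)]

print(filled.tolist())
print(outliers)
```

Which fill strategy is appropriate depends on the data: forward fill suits ordered records such as time series, while a median fill is a reasonable default for unordered numeric columns.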

Creating interactive plots using Plotly involves first importing the necessary modules and then defining the data and layout for the plot, such as a scatter or bar plot. A data series is added to a figure with add_trace, and the figure is rendered with show. Interactive features like hover effects provide additional data insights by displaying information when a user hovers over an element, while zooming and panning allow users to focus on and explore details. These features enhance user engagement, providing dynamic visual feedback and improving data exploration and understanding.

Boolean indexing in Pandas selects a subset of data by applying a boolean condition and returning only the rows for which the condition is True. Conditional filtering refines this by combining conditions on one or more columns to filter data more precisely. This enhances data transformation by allowing flexible and intuitive slice-and-dice operations, which are crucial for exploring datasets and focusing analysis on the relevant data points.
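As an illustration, the sketch below filters a small DataFrame built from the first five height and weight values listed in practice question 1; the variable names are illustrative.

```python
import pandas as pd

people = pd.DataFrame({
    "height": [165, 170, 175, 168, 172],  # cm
    "weight": [60, 65, 70, 75, 80],       # kg
})

# A boolean mask is a Series of True/False values, one per row.
mask = people["weight"] < 70
light = people[mask]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
tall_and_light = people[(people["height"] > 166) & (people["weight"] < 70)]

# .loc filters rows and selects columns in one step.
picked = people.loc[people["height"] >= 170, ["height"]]

print(light)
print(tall_and_light)
print(picked)
```

The parentheses around each condition matter because `&` binds more tightly than the comparison operators in Python.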

Statistical measures like mean, median, and standard deviation are fundamental to understanding a dataset's characteristics. The mean provides the average, useful for summarizing data with a single value representing the central tendency. The median indicates the middle value and suits skewed distributions because it is far less affected by outliers than the mean. Standard deviation measures variability, giving insight into the spread or dispersion of the data. These measures are essential for comparing datasets, identifying trends, and conducting inferential statistics to draw conclusions.
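The effect of an outlier on these measures can be seen in a small sketch with made-up numbers. One detail worth knowing is that `np.std` divides by n by default (population standard deviation), while `ddof=1` gives the sample standard deviation.

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12])
with_outlier = np.append(values, 100)  # same data plus one extreme value

mean_before, mean_after = values.mean(), with_outlier.mean()
median_before, median_after = np.median(values), np.median(with_outlier)

# Spread: population (ddof=0, the default) vs sample (ddof=1) standard deviation.
std_population = values.std()
std_sample = values.std(ddof=1)

print(mean_before, mean_after)      # the mean shifts strongly
print(median_before, median_after)  # the median stays put
print(std_population, std_sample)
```

Here the single extreme value more than doubles the mean but leaves the median unchanged, which is the reason the median is preferred for skewed data.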

Pivot tables in Pandas summarize data by transforming it into a 2D table. They aggregate data based on some criteria using aggregation functions such as sum, mean, or count. Cross-tabulation, enabled by the crosstab function, allows comparison of categorical data, similar to pivot tables but focused on counting occurrences. These methods are typically used to condense large datasets, provide quick insights, and reveal patterns or trends across different categories or variables.
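Both tools can be sketched on a tiny made-up passenger table. The column names `pclass`, `survived` and `fare` are illustrative assumptions, not taken from a specific file.

```python
import pandas as pd

# Made-up rows for illustration.
passengers = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "survived": [1, 0, 1, 0, 0, 1],
    "fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Pivot table: aggregate a numeric column (mean fare) per category.
mean_fare = passengers.pivot_table(values="fare", index="pclass", aggfunc="mean")

# Cross-tabulation: count occurrences of each (survived, pclass) combination.
counts = pd.crosstab(passengers["survived"], passengers["pclass"])

print(mean_fare)
print(counts)
```

The cross-tabulation here is the contingency table asked for in practice question 3 H, just on toy data.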

Matplotlib is a comprehensive library for creating static, interactive, and animated visualizations in Python. It is highly customizable but might require complex code for sophisticated plots. Seaborn, built on top of Matplotlib, provides a high-level interface for drawing attractive statistical graphs. It simplifies the creation of complex plots like heatmaps, violin plots, and pair plots. Matplotlib's strength lies in its flexibility, whereas Seaborn excels in providing aesthetically pleasing statistical graphics. They complement each other: a user can draw a plot with Seaborn's concise interface and then use Matplotlib's fine-grained control to adjust it, producing visualizations that are both detailed and attractive.

Handling duplicate records and missing values is crucial to maintain data accuracy and integrity. Duplicates can lead to skewed analytical results by overrepresenting some data points. Missing values can affect model performance if not addressed. Effective strategies include using drop_duplicates() for duplicate removal and fillna() or dropna() for handling missing data, depending on the context. Proper assessment and cleaning ensure that datasets accurately reflect the underlying phenomena and that insights or predictions based on such data are reliable.
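A minimal sketch of the assess-then-clean sequence follows, using made-up records. Filling the missing age with the median is one possible choice among several, not a recommendation from the syllabus.

```python
import numpy as np
import pandas as pd

# Made-up records: one exact duplicate row and one missing age.
records = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cy", "Dee"],
    "age": [30.0, 41.0, 41.0, np.nan, 25.0],
})

# Assess first: how many duplicate rows and missing values are there?
n_dupes = records.duplicated().sum()
deduped = records.drop_duplicates()
missing_per_column = deduped.isna().sum()

# Then clean: fill the missing age with the median of the remaining ages.
cleaned = deduped.fillna({"age": deduped["age"].median()})

print(n_dupes)
print(missing_per_column)
print(cleaned)
```

Removing duplicates before computing the median matters here: otherwise the repeated row would be counted twice and pull the fill value upward.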


Common questions

How can creating new columns in a dataset, such as "Family Size" in the Titanic dataset, assist in data analysis and model building?

Describe how the use of scatter plots and pair plots in the iris dataset helps in understanding attribute relationships and distribution patterns.

How can you utilize NumPy arrays for data manipulation and what are the basic operations that can be performed using NumPy, such as mathematical operations, random number generation, and reshaping?

Discuss the steps and functions in Pandas used for cleaning data by handling missing values and detecting outliers.

Explain the process of creating interactive plots using Plotly and the benefits of incorporating interactivity such as hover effects and zooming.

How do functions like boolean indexing and conditional filtering in Pandas enhance data selection and transformation?

In what scenarios would statistical measures like mean, median, and standard deviation provide meaningful insights into a dataset's characteristics?

How do pivot tables and cross-tabulation in Pandas assist in data summarization and what are their typical applications?

What are the key differences between Matplotlib and Seaborn in terms of data visualization capabilities, and how can they be complemented for effective visual representation?

Why is it important to assess and handle duplicate records and missing values in datasets like the Titanic dataset, and what strategies are effective for such tasks?
