Data Exploration & Visualization Syllabus

The document outlines the syllabus for the Data Exploration and Visualization course for B.Sc. Computer Science students, covering topics such as NumPy arrays, data manipulation with Pandas, grouping and aggregation, data visualization with Matplotlib and Seaborn, and interactive visualizations with Plotly. It includes a week-by-week breakdown of units, chapters, and reference materials, along with suggestive practice questions using various datasets. The course aims to equip students with essential skills in data analysis and visualization techniques.

Uploaded by

whyytrishh

B.Sc. Prog.

Computer Science Sem IV

DSE 02a: Data Exploration and Visualization

(Admission 2022 onwards)

TOPICS/UNITS, Chapter, Ref

Week 1 to 2
Unit 1 (10): Creating and Manipulating NumPy arrays: creating arrays, indexing and slicing, mathematical operations with NumPy arrays
Chapter: Ch 4: 4.1 (up to pg 103); usage of the rand(), randn() and randint() functions of NumPy
Ref: [1]

Week 3 to 5
Unit 2 (15): Data Manipulation with Pandas: Series and DataFrame objects; importing and exporting data from various file formats into pandas DataFrame; Data selection and filtering: indexing, slicing, conditional filtering using boolean indexing; Data Cleaning: handling missing data in Pandas and outlier detection; Data Manipulation: sorting, reshaping, merging
Chapter: Ch 5: 5.1, 5.2 (up to pg 149), 5.3; Ch 6: 6.1 (pg 169-172, 175); Ch 7: 7.1, 7.2 (up to pg 202, 205-206); Ch 8: 8.1 (pg 221-223), 8.2 (pg 227-231), 8.3 (pg 243-245)
Ref: [1]

Week 6 to 9
Unit 3 (5): Grouping and Aggregation with Pandas: grouping data using Pandas, applying aggregation functions such as sum, mean, count, etc. to grouped data, using pivot tables and cross-tabulation for data summarization
Chapter: Ch 10: 10.1 (up to pg 293), 10.2, 10.3 (up to pg 303), 10.4
Ref: [1]

Week 10 to 13
Unit 4 (10): Data Visualization with Matplotlib and Seaborn: introduction to Matplotlib and Seaborn to plot data using figures and subplots; Plots: line plots, scatter plots, and bar plots; visualizing distributions using histograms and box plots; customizing plot aesthetics and adding annotations
Chapter: Ch 9: 9.1 (pg 253-264, 267), 9.2
Ref: [1]

Week 14 to 15
Unit 5 (5): Interactive Visualizations with Plotly: introduction to the Plotly library for interactive visualization; creating interactive line plots, scatter plots, and bar plots; adding interactivity with hover effects, zooming, and panning
Chapter: Chapter 8 (up to the topic "Use of bar charts in Plotly")
Ref: [4]

References

1. McKinney W. Python for Data Analysis: Data Wrangling with Pandas, NumPy and IPython. 2nd edition. O'Reilly Media, 2018.

2. Molin S. Hands-On Data Analysis with Pandas. Packt Publishing, 2019.

3. VanderPlas J. Python Data Science Handbook: Essential Tools for Working with Data. 2nd edition. O'Reilly Media, Inc.

4. Rahman K. Python Data Visualization Essentials Guide: Become a Data Visualization Expert by Building Strong Proficiency in Pandas, Matplotlib, Seaborn, Plotly, NumPy, and Bokeh. BPB, 2021.

Additional References:

1. Chen D. Y. Pandas for Everyone: Python Data Analysis. Pearson, 2018.

Online references/material:
1. [Link]
2. [Link]

Suggestive Practice Questions:

For the following exercises, use a data set of your choice from an open data portal (https:// [Link]/, UCI repository), or load one from the scikit-learn or seaborn library, to practice the concepts learnt.

1. Write a program using the NumPy library to perform the following tasks:

A. Generate a 5x2 integer array with values ranging from 50 to 100, where consecutive elements differ by 5. Reshape the resulting array to a size of 10x1.
B. Create a 1D random array with values ranging from 1 to 100. Calculate statistical measures such as the minimum, maximum, mean, median, standard deviation, the number of unique values, the count of each unique value, and the most frequent value in the array.
C. Create a 5x5 identity matrix where all the diagonal elements are set to the value 5.
D. Consider a dataset containing the heights (in centimeters) and weights (in kilograms) of 20
individuals. Your task is to perform various operations using the NumPy library to analyze the
data.
a. Create a NumPy array called "heights" with the following height values: [165, 170,
175, 168, 172, 180, 160, 169, 176, 171, 174, 182, 158, 167, 173, 179, 163, 166, 177,
181]. Create a NumPy array called "weights" with the following weight values: [60, 65,
70, 75, 80, 85, 55, 58, 63, 68, 72, 77, 50, 62, 67, 74, 52, 57, 69, 73].
b. Create a new NumPy array called "combined" by stacking the heights and weights
arrays such that the shape of the resulting array is 20 x 2.
c. Calculate and print the mean height and weight of the individuals in the dataset.
d. Find and print the index of the shortest and tallest individuals in the dataset.
e. Sort the array based on the height of the individuals.
f. Swap the positions of the two columns in the array.
g. Retrieve records of individuals having weight below 70kg.

2. Write a program using the Pandas library to perform the following operations on the penguins dataset from
the Seaborn library:
A. Load the penguins dataset into a Pandas dataframe.
B. Determine the number of observations/records and the number of attributes in the dataframe.
C. Display the names of the attributes, row indexes, and data types of each attribute in the dataframe.
D. Display the first 5 and last 5 records of the dataframe.
E. Retrieve the values of the second column for the third and fourth records.
F. Display a summary of the data distribution for all attributes in the dataframe.
G. Compute the pairwise correlation between all attributes in the dataframe.

3. Consider the Titanic dataset, which contains information about passengers on board the Titanic, including
their age, gender, passenger class, survival status, and other attributes. Write a program using the Pandas
library to perform the following operations on the Titanic dataset:
A. Load the Titanic dataset into a Pandas DataFrame.
B. Check for any duplicate records and missing values in the dataset and handle them appropriately.
C. Calculate and display the total number of passengers who survived and those who did not.
D. Filter the DataFrame to select only the records of passengers who were under the age of 18.
E. Calculate the average age of the passengers in each passenger class.
F. Create a new column in the DataFrame called "Family Size" that represents the total number of family
members (including the passenger) on board.
G. Calculate the correlation between age and fare attributes of the dataset.
H. Create a contingency table that shows the count of passengers based on their survival status (survived or not) and passenger class (first, second, or third class).

4. Use the iris dataset from the scikit-learn library to generate various visual representations of the data using the Matplotlib and/or Seaborn libraries, with proper legends and labels. Perform the following tasks:

A. Create a scatter plot to visualize the relationship between petal length and petal width for different
instances of iris flowers.
B. Generate histograms to display the data distribution of each of the four attributes in the iris dataset.
C. Construct a pie chart to illustrate the frequency count of each flower type in the iris dataset.
D. Create a pair plot that showcases the relationship between every pair of attributes in the iris dataset
(only seaborn library).

5. Create the visualizations of question 4 (parts A and C) using the Plotly library.

Contributors:

Common questions

Creating new columns in a dataset, like "Family Size" in the Titanic dataset, is a form of feature engineering: new attributes are derived from existing data to capture additional information. "Family Size" counts the family members travelling with a passenger, including the passenger, and may correlate with survival odds. Engineered features like this can improve model building by exposing patterns or relationships that the raw columns do not show directly.
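A minimal sketch of deriving such a column is given here. The column names `sibsp` (siblings/spouses aboard) and `parch` (parents/children aboard) are assumptions based on common copies of the Titanic data, and the three rows are made-up values used only for illustration.

```python
import pandas as pd

# Hypothetical rows; the column names sibsp and parch are assumptions.
df = pd.DataFrame({
    "sibsp": [1, 0, 3],      # siblings or spouses aboard
    "parch": [0, 0, 2],      # parents or children aboard
    "survived": [1, 0, 0],
})

# Family size = relatives aboard + the passenger themselves.
df["Family Size"] = df["sibsp"] + df["parch"] + 1

# Survival rate per family size, to check whether the new feature is informative.
survival_by_size = df.groupby("Family Size")["survived"].mean()
print(df)
print(survival_by_size)
```

If the dataset at hand names these columns differently (for example `SibSp` and `Parch`), only the two column references need to change.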

Scatter plots of the iris dataset visualize the relationship between two attributes, such as petal length and petal width, highlighting correlations and distribution patterns. Pair plots extend this by drawing a scatter plot for every pair of attributes across all samples in the dataset. This can reveal trends, clusters, and anomalies, and helps show how the attributes relate to each other and vary across the species in the iris dataset.
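The scatter plot described above might be sketched with Matplotlib as follows. The six rows are illustrative iris-like values rather than the real dataset, the column names are assumptions, and the `Agg` backend is selected only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative iris-like values; not the real dataset.
df = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

fig, ax = plt.subplots()
# One scatter call per species so that each group gets its own colour and legend entry.
for species, group in df.groupby("species"):
    ax.scatter(group["petal_length"], group["petal_width"], label=species)

ax.set_xlabel("Petal length (cm)")
ax.set_ylabel("Petal width (cm)")
ax.set_title("Petal length vs. petal width")
ax.legend(title="Species")
```

For the pair plot, Seaborn's `pairplot(df, hue="species")` draws the same kind of scatter plot for every pair of numeric columns in a single call.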

NumPy arrays allow efficient data manipulation through operations such as indexing, slicing, and element-wise arithmetic. NumPy's aggregation functions, like sum(), mean(), and max(), can operate over an entire array or along a specific axis. NumPy also includes functions for generating random numbers, such as rand() for a uniform distribution, randn() for a standard normal distribution, and randint() for random integers. Arrays can be reshaped with the reshape method, converting between dimensions as required for data processing.
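These operations can be illustrated with a short sketch. The array contents and variable names are arbitrary choices, and the seed is fixed only to make the random draws repeatable.

```python
import numpy as np

# Build a 2x3 array and apply element-wise arithmetic.
a = np.arange(6).reshape(2, 3)      # [[0, 1, 2], [3, 4, 5]]
doubled = a * 2                     # every element multiplied by 2

# Indexing and slicing.
second_row = a[1]                   # [3, 4, 5]
last_two_cols = a[:, 1:]            # columns 1 and 2 of every row

# Aggregations over the whole array or along one axis.
total = a.sum()                     # sum of all elements
col_means = a.mean(axis=0)          # mean of each column
row_max = a.max(axis=1)             # maximum of each row

# Random number generation.
np.random.seed(0)
uniform = np.random.rand(3)                    # uniform on [0, 1)
normal = np.random.randn(3)                    # standard normal
integers = np.random.randint(1, 101, size=5)   # integers from 1 to 100

# Reshaping: same six values, now as a 6x1 column.
column = a.reshape(6, 1)
```

Note that the upper bound of `randint` is exclusive, which is why 101 is passed to obtain values up to 100.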

Data cleaning in Pandas involves handling missing values, for example with fillna() to replace them with a specified value or by a method such as forward fill, and with dropna() to remove incomplete records. Outlier detection means identifying values that deviate significantly from the rest of the data. In Pandas this is usually done by combining boolean conditions with statistical measures such as z-scores or the interquartile range (IQR). Together these steps prepare the data for analysis and visualization.
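A small sketch of both steps follows, using a made-up DataFrame with hypothetical `age` and `fare` columns. The 1.5 x IQR fence is one common convention for flagging outliers, not the only possible rule.

```python
import numpy as np
import pandas as pd

# Made-up data: two missing ages and one extreme fare.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 29.0, np.nan, 41.0],
    "fare": [7.80, 8.05, 7.90, 512.33, 8.10, 7.75],
})

# Three ways to handle the missing ages.
filled_mean = df["age"].fillna(df["age"].mean())   # replace with the column mean
filled_ffill = df["age"].ffill()                   # carry the previous value forward
complete_rows = df.dropna()                        # drop rows with any missing value

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1 = df["fare"].quantile(0.25)
q3 = df["fare"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["fare"] < q1 - 1.5 * iqr) | (df["fare"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)
```

Whether to fill or drop depends on how much data is missing and on whether the missingness itself carries information.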

Creating interactive plots with Plotly involves importing the necessary modules and then defining the data and layout of the plot, such as a scatter or bar plot. A method like add_trace adds a data series to a figure, and show renders the figure. Interactive features like hover effects display additional information when the user hovers over an element, while zooming and panning let users focus on a region and explore the data in detail. These features give dynamic visual feedback and make data exploration easier.

Boolean indexing in Pandas selects a subset of the data by applying a boolean condition and returning only the rows where the condition is True. Conditional filtering refines this by combining conditions on one or more columns to filter the data more precisely. This makes it easy to slice a dataset flexibly and to focus the analysis on the relevant data points.
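A brief sketch on a made-up DataFrame shows the pattern. The column names are hypothetical, and conditions are combined with `&`, `|`, and `~` rather than Python's `and`, `or`, and `not`.

```python
import pandas as pd

# Made-up passenger records.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [15, 32, 8, 47],
    "pclass": [3, 1, 2, 2],
})

# A comparison on a column yields a boolean Series, one value per row.
is_minor = df["age"] < 18

# Indexing with that Series keeps only the rows where it is True.
minors = df[is_minor]

# Combine conditions with & (and), | (or), ~ (not); each needs its own parentheses.
minors_in_third = df[(df["age"] < 18) & (df["pclass"] == 3)]

# .loc filters rows and selects a column in one step.
adult_names = df.loc[~is_minor, "name"]
```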

Statistical measures like the mean, median, and standard deviation are fundamental to understanding a dataset's characteristics. The mean is the average, a single value that summarizes the central tendency. The median is the middle value and suits skewed distributions because it is far less affected by outliers than the mean. The standard deviation measures variability, indicating how widely the data are spread. These measures are essential for comparing datasets, identifying trends, and conducting inferential statistics.
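The effect of an outlier on these measures can be seen in a short NumPy sketch. The values are made up, with 95 playing the role of the outlier.

```python
import numpy as np

# Made-up sample in which 95 is an outlier.
values = np.array([10, 12, 11, 13, 12, 95])

mean = values.mean()             # pulled upward by the outlier
median = np.median(values)       # middle of the sorted values, barely affected
std = values.std(ddof=1)         # sample standard deviation (ddof=1)

# The same mean without the outlier, for comparison.
mean_without_outlier = values[:-1].mean()

# Unique values, how often each occurs, and the most frequent value.
unique_values, counts = np.unique(values, return_counts=True)
most_frequent = unique_values[counts.argmax()]

print(mean, median, mean_without_outlier, most_frequent)
```

Here the mean is 25.5 while the median is 12.0, which is much closer to the mean of 11.6 obtained without the outlier.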

Pivot tables in Pandas summarize data by rearranging it into a two-dimensional table. They aggregate values for each combination of row and column keys using aggregation functions such as sum, mean, or count. Cross-tabulation, provided by the crosstab function, is similar to a pivot table but by default counts how often each combination of categories occurs. These methods are typically used to condense large datasets, provide quick insights, and reveal patterns or trends across categories or variables.
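The difference between the two can be sketched on a small made-up table. The column names follow the Titanic example but are assumptions here.

```python
import pandas as pd

# Made-up passenger records.
df = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "sex": ["female", "male", "female", "male", "male", "female"],
    "survived": [1, 0, 1, 0, 0, 1],
    "fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Pivot table: mean fare for each (pclass, sex) combination.
mean_fare = df.pivot_table(values="fare", index="pclass",
                           columns="sex", aggfunc="mean")

# Cross-tabulation: number of passengers per (pclass, survived) combination.
counts = pd.crosstab(df["pclass"], df["survived"])

print(mean_fare)
print(counts)
```

Combinations that never occur show up as NaN in the pivot table and as 0 in the cross-tabulation.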

Matplotlib is a comprehensive library for creating static, interactive, and animated visualizations in Python. It is highly customizable but can require verbose code for sophisticated plots. Seaborn, built on top of Matplotlib, provides a high-level interface for drawing attractive statistical graphics. It simplifies the creation of complex plots like heatmaps, violin plots, and pair plots. Matplotlib's strength lies in its flexibility, whereas Seaborn excels at producing aesthetically pleasing statistical graphics with little code. The two complement each other: a plot can be drawn quickly with Seaborn and then fine-tuned with Matplotlib's lower-level controls.

Handling duplicate records and missing values is crucial for data accuracy and integrity. Duplicates can skew analytical results by overrepresenting some data points. Missing values can degrade model performance if not addressed. Effective strategies include drop_duplicates() for removing duplicates and fillna() or dropna() for handling missing data, depending on the context. Proper assessment and cleaning ensure that a dataset accurately reflects the underlying phenomena and that insights or predictions based on it are reliable.
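One possible inspection-then-cleaning sequence is sketched here on a made-up DataFrame. Filling with the median is just one choice among several and is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Made-up records: passenger B appears twice and passenger C has no age.
df = pd.DataFrame({
    "passenger": ["A", "B", "B", "C", "D"],
    "age": [22.0, 38.0, 38.0, np.nan, 35.0],
})

# Inspect first: count fully duplicated rows and missing values per column.
n_duplicates = df.duplicated().sum()
missing_per_column = df.isna().sum()

# Remove the duplicates, keeping the first occurrence of each row.
deduped = df.drop_duplicates().reset_index(drop=True)

# Fill the remaining gap with the median age of the de-duplicated data.
deduped["age"] = deduped["age"].fillna(deduped["age"].median())
print(deduped)
```

Removing duplicates before computing the fill value keeps repeated rows from biasing the median.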