Python EDA Workshop with Olympic Data

Cheat Sheet PDA

Uploaded by

Muhammad Faizan

Python for Exploratory Data Analysis (Workshop)

Proposal:
Exploratory Data Analysis (EDA) is about getting an overall understanding of data.
EDA includes exploring data to find its main characteristics, identifying patterns and
building visualizations. EDA provides meaningful insights into data that can be used in a
variety of applications, e.g. machine learning. Python is well suited to EDA because it
has a rich set of easy-to-use libraries such as Pandas, Seaborn, NumPy and Matplotlib.
In this workshop we will cover the basics of EDA using a real-world data set, including,
but not limited to, correlating, converting, completing, correcting, creating and
charting the data. In addition, we will learn how to install and use Jupyter Notebooks
(an open-source web application that allows you to create and share documents that
contain live code, equations, visualizations and narrative text).

Setting up Requirements:
The first step is to understand and install all requirements. This also includes acquiring the
data (on which EDA is going to be done) from a given GitHub link.
The following steps will be completed on every attendee's machine.
● Make sure Python is installed and working (Python 2)
● A brief introduction to Python virtual environments
○ A virtual environment is a self-contained directory tree that contains a
Python installation for a particular version of Python, plus a number of
additional packages.
● Create a virtual environment
● A brief introduction to Jupyter notebooks
○ [Link]
s_jupyter.html
● Install Jupyter notebook
○ [Link]
● Get data and requirement file from
[Link]
● Install all requirements using pip from given requirement file
● Check all requirements are satisfied
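The last step above, checking that all requirements are satisfied, could be sketched as follows. This is a minimal sketch and not the workshop's own script: the package list is assumed from the libraries named in this proposal (the actual requirement file may differ), check_requirements is a hypothetical helper, and the code assumes Python 3.4 or later, where importlib.util.find_spec is available.

```python
import importlib.util

# Package names assumed from the libraries this proposal mentions;
# the actual requirement file may list different or additional packages.
REQUIRED = ["numpy", "pandas", "seaborn", "matplotlib"]


def check_requirements(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_requirements(REQUIRED).items():
        print(f"{name}: {'OK' if found else 'MISSING'}")
```

Run inside the activated virtual environment, any line reporting MISSING points to a package that pip did not install.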
A brief introduction of installed libraries:
We will be using the installed libraries to perform different operations on the data. Let's
explore these libraries a bit.
● NumPy
○ NumPy is a library for the Python programming language, adding
support for large, multi-dimensional arrays and matrices, along with a
large collection of high-level mathematical functions to operate on
these arrays.
○ [Link]
● Pandas
○ pandas is a Python package providing fast, flexible, and expressive
data structures designed to make working with “relational” or “labeled”
data both easy and intuitive. It aims to be the fundamental high-level
building block for doing practical, real world data analysis in Python.
Additionally, it has the broader goal of becoming the most powerful and
flexible open source data analysis / manipulation tool available in any
language. It is already well on its way toward this goal.
○ [Link]
● Seaborn
○ Seaborn is a Python data visualization library based on matplotlib. It
provides a high-level interface for drawing attractive and informative
statistical graphics.
○ [Link]
● Matplotlib
○ Matplotlib is a Python 2D plotting library which produces publication
quality figures in a variety of hardcopy formats and interactive
environments across platforms.
○ [Link]
Introduction of data:
We will be using Olympic Games data here. This data set holds 120 years of Olympic
history, including athletes' biographical details and information about the Games they
participated in.

The file athlete_events.csv contains 271116 rows and 15 columns. Each row
corresponds to an individual athlete competing in an individual Olympic event
(athlete-events). The columns are the following:

1. ID - Unique number for each athlete;
2. Name - Athlete's name;
3. Sex - M or F;
4. Age - Integer;
5. Height - In centimeters;
6. Weight - In kilograms;
7. Team - Team name;
8. NOC - National Olympic Committee 3-letter code;
9. Games - Year and season;
10. Year - Integer;
11. Season - Summer or Winter;
12. City - Host city;
13. Sport - Sport;
14. Event - Event;
15. Medal - Gold, Silver, Bronze, or NA.

The file noc_regions.csv contains 230 rows and 3 columns. Each row contains a
NOC, its related region and any notes. The columns are the following:
1. NOC - National Olympic Committee 3-letter code;
2. Region - Name of the country;
3. Notes - String containing any useful information about the region and NOC.

Importing Data into Data Frames:
To start working with the data, we first need to import it from the CSV files into pandas
DataFrames. This is done with pandas' read_csv function. We will also learn how this
function handles different delimiters.
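As a minimal sketch of the import step, the example below reads the same small table twice, once comma-separated and once semicolon-separated. The rows are invented for illustration and are not real records; in-memory text stands in for the CSV files so the snippet runs on its own.

```python
import io

import pandas as pd

# Toy rows invented for illustration; they are not real records.
comma_text = "ID,Name,Sex,Age\n1,Athlete A,M,24\n2,Athlete B,F,\n"
semicolon_text = comma_text.replace(",", ";")

# A comma is the default delimiter of read_csv.
df_comma = pd.read_csv(io.StringIO(comma_text))

# Any other delimiter is passed through the sep argument.
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")

print(df_comma)
print(df_comma.equals(df_semicolon))
```

For the workshop files, a call such as pd.read_csv("athlete_events.csv") would be the equivalent, assuming the file sits in the working directory. The empty Age field in the second row is read as a missing value (NaN).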
Collecting basic information about data:
We need to get a sense of what our data looks like. We will explore some more pandas
functions here, such as:
● See the data in tabular form using head.
● Descriptive statistics using pandas' describe.
● Overall summary of the DataFrame.
● Check whether any columns contain null values using pandas' isnull.
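The inspection steps above might look like the following on a small DataFrame. The rows are invented for illustration, and using DataFrame.info for the "overall summary" is an assumption, since the proposal does not name the function.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; they are not real records.
df = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B", "Athlete C", "Athlete D"],
    "Sex": ["M", "F", "F", "M"],
    "Age": [24.0, np.nan, 31.0, 19.0],
    "Medal": ["Gold", np.nan, np.nan, "Bronze"],
})

print(df.head(2))     # first two rows in tabular form
print(df.describe())  # count, mean, std, min, quartiles and max of numeric columns
df.info()             # dtypes and non-null counts (assumed choice for the summary step)

# isnull marks missing cells; summing the marks gives a per-column count.
null_counts = df.isnull().sum()
print(null_counts)
```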
Querying Data:
We will run different queries on the data to extract further knowledge from it. We will discuss
the following important concepts and techniques.

Understanding Boolean Indexing:
Boolean indexing is used to run general queries on a pandas DataFrame. This is
an important concept to grasp. We will perform different operations on the data to understand
it, e.g.
● Count/find the records with no medal recorded.
● Count/find the youngest and oldest athletes who won a gold medal.
● Count/find the number of gold medals won by women of a specific country in a
particular year.
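The three queries above can be sketched with Boolean masks as follows. The rows and the NOC codes "AAA" and "BBB" are invented for illustration; only the column names follow athlete_events.csv.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; "AAA" and "BBB" are hypothetical NOC codes.
df = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B", "Athlete C", "Athlete D", "Athlete E"],
    "Sex": ["F", "M", "F", "F", "M"],
    "Age": [22.0, 35.0, 17.0, 28.0, 30.0],
    "NOC": ["AAA", "BBB", "AAA", "AAA", "BBB"],
    "Year": [2000, 2000, 2004, 2000, 2004],
    "Medal": ["Gold", np.nan, "Gold", "Gold", "Silver"],
})

# Records with no medal recorded.
no_medal = df[df["Medal"].isnull()]

# Youngest and oldest gold medalists.
gold = df[df["Medal"] == "Gold"]
youngest = gold[gold["Age"] == gold["Age"].min()]
oldest = gold[gold["Age"] == gold["Age"].max()]

# Gold medals won by women of one country in one year: combine the
# conditions with & and wrap each one in parentheses.
mask = (
    (df["Medal"] == "Gold")
    & (df["Sex"] == "F")
    & (df["NOC"] == "AAA")
    & (df["Year"] == 2000)
)
women_gold_count = int(mask.sum())

print(len(no_medal), youngest["Name"].tolist(), oldest["Name"].tolist(), women_gold_count)
```

Each comparison produces a Series of True/False values, and indexing the DataFrame with that Series keeps only the rows marked True. Summing a mask counts the matching rows because True counts as 1.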

Explore some built-in functions:
We will explore some important pandas functions and indexers by using them, e.g.
● notnull
● loc
● groupby
● value_counts
● pivot_table
● reindex
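A short sketch touching each of the listed names once is given below. The rows and NOC codes are invented for illustration, and the particular questions asked of the data (mean age per NOC, medal counts per NOC and year) are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; the NOC codes are hypothetical.
df = pd.DataFrame({
    "NOC": ["AAA", "AAA", "BBB", "BBB", "AAA", "CCC"],
    "Year": [2000, 2004, 2000, 2004, 2004, 2000],
    "Age": [22.0, 26.0, 30.0, 34.0, 20.0, 25.0],
    "Medal": ["Gold", "Silver", np.nan, "Gold", "Gold", np.nan],
})

# notnull: keep rows where a medal is recorded.
medalists = df[df["Medal"].notnull()]

# loc: select rows by a Boolean mask and columns by label.
gold_ages = df.loc[df["Medal"] == "Gold", ["NOC", "Age"]]

# groupby: mean age per NOC.
mean_age = df.groupby("NOC")["Age"].mean()

# value_counts: how often each medal type occurs (missing values are skipped).
medal_counts = df["Medal"].value_counts()

# pivot_table: medal counts with NOC as rows and Year as columns.
table = medalists.pivot_table(
    index="NOC", columns="Year", values="Medal", aggfunc="count", fill_value=0
)

# reindex: impose an explicit row order; labels absent from the table get fill_value.
table = table.reindex(["BBB", "AAA", "CCC"], fill_value=0)

print(table)
```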

Cleaning and Completing Data:
At this point we know our data well, including the fact that it has some missing values. We
will perform different operations on it, e.g.
● Exclude all records for which we have no information about medals.
● Fill missing age values with the average age of the other athletes.
● Fill missing height values for women and men with the average height of female and
male athletes respectively.
● Fill missing weight values for women and men with the average weight of female and
male athletes participating in the same sport.
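One way to carry out the four cleaning steps above is sketched here. The rows are invented for illustration, and computing the averages after the no-medal records are dropped is an assumption about the intended order.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; they are not real records.
df = pd.DataFrame({
    "Sex": ["F", "F", "F", "M", "M", "M", "M"],
    "Sport": ["Judo", "Judo", "Rowing", "Judo", "Judo", "Rowing", "Rowing"],
    "Age": [20.0, np.nan, 30.0, 25.0, 35.0, np.nan, 40.0],
    "Height": [160.0, np.nan, 180.0, 170.0, np.nan, 190.0, 190.0],
    "Weight": [50.0, np.nan, 70.0, 80.0, np.nan, 90.0, 100.0],
    "Medal": ["Gold", "Silver", np.nan, "Gold", "Bronze", "Gold", "Silver"],
})

# Exclude records with no medal information.
medals = df.dropna(subset=["Medal"]).copy()

# Fill missing ages with the overall mean age (mean skips missing values).
medals["Age"] = medals["Age"].fillna(medals["Age"].mean())

# Fill missing heights with the mean height of the same sex.
medals["Height"] = medals["Height"].fillna(
    medals.groupby("Sex")["Height"].transform("mean")
)

# Fill missing weights with the mean weight of the same sex and sport.
medals["Weight"] = medals["Weight"].fillna(
    medals.groupby(["Sex", "Sport"])["Weight"].transform("mean")
)

print(medals)
```

transform("mean") returns a Series aligned with the original rows, so fillna can take each row's replacement from its own group. A group in which every value is missing has no mean and would stay unfilled.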
Data Visualization:
Visualizing the data in different types of graphs will give us greater insight into it.
We will explore different options for visualizing our data and look for patterns within it.
From now on we will build on our knowledge of the pandas library and pick up new
concepts from seaborn and matplotlib.

Countplot examples:
1. Gold medals in gymnastics by age
2. Medals won by China over the years
3. Gold medals won by China in the Summer Olympics, by sport
Pointplot examples:
1. Height of male athletes over the years
2. Height of female athletes over the years

Barplot examples:
1. Top 5 countries with the most medals
2. Number of athletes in each Olympic Games

Boxplot examples:
1. Age distribution of male and female athletes in the Olympic Games
2. Variation in the age of female athletes over time

Scatterplot example:
1. Height versus weight of athletes

Heatmap example:
1. Average age of medal winners in the Olympic Games
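The heatmap example could be built by first pivoting the data into a grid and then passing that grid to seaborn. The rows are invented for illustration, and the choice of Sport for the rows and Year for the columns is an assumption, since the proposal does not say which axes the heatmap uses.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a Jupyter notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Toy rows invented for illustration; they are not real records.
df = pd.DataFrame({
    "Sport": ["Judo", "Judo", "Judo", "Rowing", "Rowing", "Rowing"],
    "Year": [2000, 2000, 2004, 2000, 2004, 2004],
    "Age": [20.0, 24.0, 30.0, 26.0, 28.0, 32.0],
    "Medal": ["Gold", "Silver", "Gold", "Bronze", np.nan, "Gold"],
})

# Keep medal winners only, then average their age per sport and year.
winners = df[df["Medal"].notnull()]
table = winners.pivot_table(index="Sport", columns="Year", values="Age", aggfunc="mean")

fig, ax = plt.subplots(figsize=(5, 3))
sns.heatmap(table, annot=True, fmt=".1f", ax=ax)
ax.set_title("Average age of medal winners (toy data)")
```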

In addition, we will discuss and analyse trends and patterns while visualizing
the data.
These are only some examples. We may draw additional graphs as we
learn more about the data.

References:
● Data is taken from
[Link]
● This work is inspired by my fellow learners at Kaggle:
○ [Link]
c-games
○ [Link]
○ And by other Kaggle work and the great documentation of the Python libraries.

Common questions

How does Python contribute to the exploratory data analysis process, particularly when using libraries like Pandas, Seaborn, NumPy, and Matplotlib?
Python supports the exploratory data analysis (EDA) process through its libraries. Pandas provides flexible and expressive data structures for working with labeled data, which makes it central to real-world data manipulation. NumPy supports large multi-dimensional arrays and matrices, which are essential for mathematical computations. Seaborn builds on Matplotlib to offer a higher-level interface for statistical graphics. Matplotlib itself produces publication-quality figures in a variety of formats. Together, these tools allow analysts to clean, transform, visualize, and derive insights from data sets.

What are some common exploratory data analysis techniques used to extract meaningful patterns from Olympic data?
Common techniques include data cleaning (handling missing data), aggregation and summarization with pandas functions such as groupby and pivot_table, and visualization with Seaborn and Matplotlib using plots such as countplots, pointplots, and heatmaps. These techniques help analysts identify patterns such as medal distribution by country, athletes' age distributions over time, and correlations among attributes.

Discuss the importance of data visualizations like countplots and pointplots in extracting insights from Olympic data using Seaborn.
Countplots display the number of occurrences of each category, such as gold medals by age or by year, which reveals trends and anomalies. Pointplots show how a point estimate (by default the mean) changes across the levels of a category, such as athletes' average height over the years, which makes temporal patterns easier to see. Both make patterns in the data easier to interpret.

What initial steps should one take to prepare their environment and data for exploratory data analysis using Python?
The initial steps are to make sure Python is installed, create a virtual environment to manage dependencies, and set up Jupyter Notebooks for interactive development. After that, download the data from the source repository on GitHub and install all required packages with pip.

What insights about athlete demographics can be derived from boxplot analyses in the Olympic data?
Boxplots show the distribution and variability of demographic features such as age. For instance, a boxplot could reveal the age distributions and outliers among male and female athletes, indicating the central tendency and spread of ages for each group. Such plots can also highlight shifts over time, allowing changes in athlete demographics to be examined across different Olympic Games.

How can pandas' pivot_table function be utilized to analyze Olympic data effectively?
The pivot_table function restructures data to emphasize relationships between columns. For instance, it can summarize medal counts by athlete or country, creating a multi-dimensional summary table that aggregates data points by the specified criteria. It supports various aggregations (e.g., mean, sum) for comparing performance across categories.

How does the structure of the athlete_events.csv file facilitate detailed analysis of Olympic athletes' performance?
With its 271,116 rows and 15 columns, athlete_events.csv offers a comprehensive foundation for analysis. Each row records one athlete's participation in one event, including demographics, physical attributes, team affiliation, and medal outcome. This structure supports detailed queries and visualizations of historical trends, demographic variations, and performance outcomes across sports and events.

How can Boolean indexing be used to perform specific data queries in the Olympic data set?
Boolean indexing filters data with logical conditions. For example, analysts can identify athletes without medals, find the youngest or oldest gold medalists, or count the gold medals won by women from a specific country in a given year. The technique isolates exactly the records that satisfy the given criteria.

What role do virtual environments play in setting up a Python-based exploratory data analysis workflow?
A virtual environment is an isolated space holding a particular Python installation and its associated packages, which avoids conflicts between the dependencies of different projects. This gives a controlled, consistent environment across machines, so the dependencies needed for data processing and visualization do not interfere with other projects.

What are some challenges and solutions in handling missing data during exploratory data analysis of Olympic data using Python?
The Olympic data set has missing age, weight, and height values, which must be addressed before analysis. The approach used here is to exclude records without medal information, fill missing ages with the average age of athletes, and fill missing height and weight values with sex-specific and sport-specific averages.
