PySpark EDA Cheat Sheet

This document provides a cheat sheet on exploratory data analysis (EDA) techniques that can be performed with PySpark. It lists over 40 techniques organized into categories like data loading, inspection, cleaning, transformation, SQL queries, statistical analysis, machine learning integration, and more. The techniques are concisely explained and include relevant code snippets using PySpark APIs and functions.

Uploaded by

Jaya Shankar


[ Exploratory Data Analysis (EDA) with PySpark ] {CheatSheet}

1. Data Loading

● Read CSV File: df = spark.read.csv('[Link]', header=True, inferSchema=True)
● Read Parquet File: df = spark.read.parquet('[Link]')
● Read from JDBC (Databases): df = spark.read.format("jdbc").options(url="jdbc_url", dbtable="table_name").load()

2. Basic Data Inspection

● Display Top Rows: df.show()
● Print Schema: df.printSchema()
● Summary Statistics: df.describe().show()
● Count Rows: df.count()
● Display Columns: df.columns

3. Data Cleaning

● Drop Missing Values: df.na.drop()
● Fill Missing Values: df.na.fill(value)
● Drop Column: df.drop('column_name')
● Rename Column: df.withColumnRenamed('old_name', 'new_name')

4. Data Transformation

● Select Columns: df.select('column1', 'column2')
● Add New or Transform Column: df.withColumn('new_column', expression)
● Filter Rows: df.filter(df['column'] > value)
● Group By and Aggregate: df.groupBy('column').agg({'column': 'sum'})
● Sort Rows: df.sort(df['column'].desc())

5. SQL Queries on DataFrames

● Create Temporary View: df.createOrReplaceTempView('view_name')

By: Waleed Mousa
● SQL Query: spark.sql('SELECT * FROM view_name WHERE condition')

6. Statistical Analysis

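● Correlation Matrix: from pyspark.ml.stat import Correlation; Correlation.corr(df, 'column') (here 'column' must be a vector column, e.g. one built with VectorAssembler)
● Covariance: df.stat.cov('column1', 'column2')
● Frequency Items: df.stat.freqItems(['column1', 'column2'])
● Sample By: df.stat.sampleBy('column', fractions={'class1': 0.1, 'class2': 0.2})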
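
The fractions passed to sampleBy have to be chosen by hand. The sketch below shows one possible way to derive them from per-class row counts, assuming the goal is roughly the same number of rows per class. The names balanced_fractions, stratified_sample and target_per_class are hypothetical helpers for illustration, not part of PySpark.

```python
def balanced_fractions(class_counts, target_per_class):
    """Map each class label to the fraction that yields about
    target_per_class rows, capped at 1.0 for classes that are already small."""
    return {
        label: min(1.0, target_per_class / count)
        for label, count in class_counts.items()
        if count > 0  # skip empty classes to avoid dividing by zero
    }


def stratified_sample(df, column, target_per_class, seed=42):
    """Sample a Spark DataFrame so each class in `column` ends up roughly equal in size."""
    # One aggregation to count rows per class; collect() is reasonable
    # only when the number of distinct classes is small.
    counts = {row[column]: row["count"] for row in df.groupBy(column).count().collect()}
    fractions = balanced_fractions(counts, target_per_class)
    return df.sampleBy(column, fractions=fractions, seed=seed)
```

sampleBy samples each row independently, so the resulting class sizes are only approximately equal to the target.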

7. Handling Missing and Duplicated Data

● Fill Missing Values in Column: df.fillna({'column': value})
● Drop Duplicates: df.dropDuplicates()
● Replace Value: df.na.replace(['old_value'], ['new_value'], 'column')

8. Data Conversion and Export

● Convert to Pandas DataFrame: pandas_df = df.toPandas()
● Write DataFrame to CSV: df.write.csv('path_to_save.csv')
● Write DataFrame to Parquet: df.write.parquet('path_to_save.parquet')

9. Column Operations

● Change Column Type: df.withColumn('column', df['column'].cast('new_type'))
● Split Column into Multiple Columns: from pyspark.sql.functions import split; df.withColumn('new_col1', split(df['column'], 'delimiter')[0])
● Concatenate Columns: from pyspark.sql.functions import concat_ws; df.withColumn('new_column', concat_ws(' ', df['col1'], df['col2']))

10. Date and Time Operations

● Current Date: df.withColumn('current_date', current_date()) (use current_timestamp() for date and time)
● Date Formatting: df.withColumn('formatted_date', date_format('dateColumn', 'yyyyMMdd'))
● Date Arithmetic: df.withColumn('date_plus_days', date_add(df['date'], 5))

11. Advanced Data Processing

● Window Functions: from pyspark.sql.window import Window; from pyspark.sql.functions import rank; df.withColumn('rank', rank().over(Window.partitionBy('column').orderBy('other_column')))
● Pivot Table: df.groupBy('column').pivot('pivot_column').sum('sum_column')
● UDF (User Defined Functions): from pyspark.sql.functions import udf; my_udf = udf(my_python_function); df.withColumn('new_col', my_udf(df['col']))

12. Performance Optimization

● Caching DataFrame: df.cache()
● Repartitioning: df.repartition(10)
● Broadcast Join Hint: from pyspark.sql.functions import broadcast; df1.join(broadcast(df2), 'key', 'inner')

13. Exploratory Data Analysis Specifics

● Column Value Counts: df.groupBy('column').count().show()
● Distinct Values in a Column: df.select('column').distinct().show()
● Aggregations (sum, max, min, avg): df.groupBy().sum('column').show()

14. Working with Complex Data Types

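● Exploding Arrays: from pyspark.sql.functions import explode; df.withColumn('exploded', explode(df['array_column']))
● Working with Structs: df.select(df['struct_column']['field'])
● Handling Maps: from pyspark.sql.functions import map_keys; df.select(map_keys(df['map_column']))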

15. Joins

● Inner Join: df1.join(df2, df1['id'] == df2['id'])
● Left Outer Join: df1.join(df2, df1['id'] == df2['id'], 'left_outer')
● Right Outer Join: df1.join(df2, df1['id'] == df2['id'], 'right_outer')

16. Saving and Loading Models

● Saving ML Model: model.save('model_path')
● Loading ML Model: from pyspark.ml.classification import LogisticRegressionModel; LogisticRegressionModel.load('model_path')

17. Handling JSON and Complex Files

● Read JSON: df = spark.read.json('path_to_file.json')
● Explode JSON Object: df.select('json_column.*')

18. Custom Aggregations

● Custom Aggregate Function: from pyspark.sql import functions as F; df.groupBy('group_column').agg(F.sum('sum_column'))

19. Working with Null Values

● Counting Nulls in Each Column: df.select([F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])
● Drop Rows with Null Values: df.na.drop()
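
The per-column null count above returns a single wide row, which is hard to read for tables with many columns. A minimal sketch of turning it into a sorted report follows. The names count_nulls and null_fractions are hypothetical helpers, and the pyspark import is kept inside the function so the pure-Python part can be used without a Spark installation.

```python
def count_nulls(df):
    """Return {column_name: null_count} for a Spark DataFrame."""
    from pyspark.sql import functions as F  # imported lazily; needs PySpark at call time

    # when() yields a non-null value only for null cells, so count() counts the nulls.
    exprs = [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
    return df.select(exprs).first().asDict()


def null_fractions(null_counts, total_rows):
    """Turn raw null counts into (column, count, fraction) tuples, worst column first."""
    if total_rows == 0:
        return []
    report = [(name, n, n / total_rows) for name, n in null_counts.items()]
    return sorted(report, key=lambda item: item[1], reverse=True)
```

A typical call would be null_fractions(count_nulls(df), df.count()). Note that isNull() does not treat NaN in floating-point columns as missing.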

20. Data Import/Export Tips

● Read Text Files: df = spark.read.text('path_to_file.txt')
● Write Data to JDBC: df.write.format("jdbc").options(url="jdbc_url", dbtable="table_name").save()

21. Advanced SQL Operations

● Register DataFrame as Table: df.createOrReplaceTempView('temp_table')
● Perform SQL Queries: spark.sql('SELECT * FROM temp_table WHERE condition')

22. Dealing with Large Datasets

● Sampling Data: sampled_df = df.sample(withReplacement=False, fraction=0.1)
● Approximate Count Distinct: from pyspark.sql.functions import approx_count_distinct; df.select(approx_count_distinct('column')).show()

23. Data Quality Checks

● Checking Data Integrity: [Link]()

● Asserting Conditions: df.filter(df['column'] > 0).count()

24. Advanced File Handling

● Specify Schema While Reading: from pyspark.sql.types import StructType; schema = StructType([...]); df = spark.read.csv('[Link]', schema=schema)
● Writing in Overwrite Mode: df.write.mode('overwrite').csv('path_to_file.csv')
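
As an alternative to building a StructType by hand, the CSV reader also accepts a DDL-formatted schema string such as "col0 INT, col1 DOUBLE". The sketch below assembles such a string from a plain dictionary. The column names, the ddl_schema helper and the read_events function are all hypothetical and only serve as illustration.

```python
def ddl_schema(columns):
    """Build a DDL schema string like 'a INT, b STRING' from {name: type}."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns.items())


# Hypothetical column layout, used only for this example.
EVENT_COLUMNS = {
    "event_id": "BIGINT",
    "user_name": "STRING",
    "amount": "DOUBLE",
    "event_time": "TIMESTAMP",
}


def read_events(spark, path):
    """Read a CSV file with an explicit schema instead of inferSchema=True."""
    # Supplying the schema avoids the extra pass over the data that inference needs.
    return spark.read.csv(path, header=True, schema=ddl_schema(EVENT_COLUMNS))
```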

25. Debugging and Error Handling

● Collecting Data Locally for Debugging: local_data = df.take(5)
● Handling Exceptions in UDFs:
def safe_udf(my_udf):
    def wrapper(*args, **kwargs):
        try:
            return my_udf(*args, **kwargs)
        except Exception:
            return None
    return wrapper
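
The wrapper idea above can be written as a decorator and applied to a plain Python function before it is registered as a UDF. This is a minimal sketch: parse_ratio, add_ratio_column and the column names 'raw' and 'ratio' are hypothetical, and the pyspark imports are kept inside the function that needs them.

```python
import functools


def safe_udf(func):
    """Wrap a plain Python function so that any exception yields None."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # A failing row becomes null instead of failing the whole Spark job.
            return None
    return wrapper


@safe_udf
def parse_ratio(text):
    """Hypothetical parser: turn a string like '3/4' into 0.75."""
    numerator, denominator = text.split("/")
    return int(numerator) / int(denominator)


def add_ratio_column(df, source_col="raw", target_col="ratio"):
    """Apply the wrapped function as a Spark UDF (requires PySpark at call time)."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    ratio_udf = udf(parse_ratio, DoubleType())
    return df.withColumn(target_col, ratio_udf(df[source_col]))
```

Swallowing every exception hides bad input, so it is usually worth counting the resulting nulls afterwards to see how many rows failed.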

26. Machine Learning Integration

● Creating Feature Vector: from pyspark.ml.feature import VectorAssembler; assembler = VectorAssembler(inputCols=['col1', 'col2'], outputCol='features'); feature_df = assembler.transform(df)

27. Advanced Joins and Set Operations

● Cross Join: df1.crossJoin(df2)
● Set Operations (Union, Intersect, Minus): df1.union(df2); df1.intersect(df2); df1.subtract(df2)

28. Dealing with Network Data

● Reading Data from HTTP Source:
spark.read.format("csv").option("url", "[Link]

29. Integration with Visualization Libraries

● Convert to Pandas for Visualization: pandas_df = df.toPandas(); pandas_df.plot(kind='bar')

30. Spark Streaming for Real-Time EDA

● Reading from a Stream: df = spark.readStream.format('source').load()
● Writing to a Stream: df.writeStream.format('console').start()

31. Advanced Window Functions

● Cumulative Sum: from pyspark.sql import functions as F; from pyspark.sql.window import Window; df.withColumn('cum_sum', F.sum('column').over(Window.partitionBy('group_column').orderBy('order_column')))
● Row Number: df.withColumn('row_num', F.row_number().over(Window.orderBy('column')))

32. Handling Complex Analytics

● Rollup: df.rollup('column1', 'column2').agg(F.sum('column3'))
● Cube for Multi-Dimensional Aggregation: df.cube('column1', 'column2').agg(F.sum('column3'))

33. Dealing with Geospatial Data

● Using GeoSpark for Geospatial Data: from geospark.register import GeoSparkRegistrator; GeoSparkRegistrator.registerAll(spark)

34. Advanced File Formats

● Reading ORC Files: df = spark.read.orc('[Link]')
● Writing Data to ORC: df.write.orc('path_to_file.orc')

35. Dealing with Sparse Data

● Using Sparse Vectors: from pyspark.ml.linalg import SparseVector; sparse_vec = SparseVector(size, {index: value})

36. Handling Binary Data

● Reading Binary Files: df = spark.read.format('binaryFile').load('path_to_binary_file')

37. Efficient Data Transformation

● Using mapPartitions for Transformation: rdd = df.rdd.mapPartitions(lambda partition: [transform(row) for row in partition])
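
The lambda above builds a full list for every partition. A possible variant, sketched below, uses a generator so rows are processed lazily and any expensive setup runs once per partition rather than once per row. The function names and the assumed (id, text) row layout are hypothetical.

```python
import re


def clean_partition(rows):
    """Process one partition: `rows` is an iterator over its records."""
    # Setup that runs once per partition (here: compiling a regex).
    whitespace = re.compile(r"\s+")
    for row in rows:
        # Assumed layout: row[0] is an id, row[1] is a text field.
        yield (row[0], whitespace.sub(" ", row[1]).strip())


def clean_text_column(df):
    """Apply clean_partition to every partition of a Spark DataFrame's RDD."""
    return df.rdd.mapPartitions(clean_partition)
```

Because the partition function is plain Python, it can be checked locally by passing it an ordinary iterator before running it on a cluster.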

38. Advanced Machine Learning Operations

● Using ML Pipelines: from pyspark.ml import Pipeline; pipeline = Pipeline(stages=[stage1, stage2]); model = pipeline.fit(df)
● Model Evaluation: from pyspark.ml.evaluation import BinaryClassificationEvaluator; evaluator = BinaryClassificationEvaluator(); evaluator.evaluate(predictions)

39. Optimization Techniques

● Broadcast Joins for Efficiency: from pyspark.sql.functions import broadcast; df1.join(broadcast(df2), 'key')
● Using Accumulators for Global Aggregates: accumulator = spark.sparkContext.accumulator(0); rdd.foreach(lambda x: accumulator.add(x))

40. Advanced Data Import/Export

● Reading Data from Multiple Sources: df = spark.read.format('format').option('option', 'value').load(['path1', 'path2'])
● Writing Data to Multiple Formats: df.write.format('format').save('path', mode='overwrite')

41. Utilizing External Data Sources

● Connecting to External Data Sources (e.g., Kafka, S3): df = spark.read.format('kafka').option('kafka.bootstrap.servers', 'host1:port1').load()

42. Efficient Use of SQL Functions

● Using Built-in SQL Functions: from pyspark.sql.functions import col, lit; df.withColumn('new_column', col('existing_column') + lit(1))

43. Exploring Data with GraphFrames

● Using GraphFrames for Graph Analysis: from graphframes import GraphFrame; g = GraphFrame(vertices_df, edges_df)

44. Working with Nested Data

● Exploding Nested Arrays: df.selectExpr('id', 'explode(nestedArray) as element')
● Handling Nested Structs: df.select('struct_column.*')

45. Advanced Statistical Analysis

● Hypothesis Testing: from pyspark.ml.stat import ChiSquareTest; r = ChiSquareTest.test(df, 'features', 'label')
● Statistical Functions (e.g., mean, stddev): from pyspark.sql.functions import mean, stddev; df.select(mean('column'), stddev('column'))

46. Customizing Spark Session

● Configuring SparkSession: from pyspark.sql import SparkSession; spark = SparkSession.builder.appName('app').config('[Link]n', 'value').getOrCreate()
