Data Cleaning Techniques in PySpark

This document discusses data cleaning with Apache Spark. It introduces data cleaning as preparing raw data for use in data processing pipelines. Common data cleaning tasks include reformatting text, performing calculations, and removing garbage or incomplete data. Spark is advantageous for data cleaning due to its scalability and powerful framework for data handling. An example demonstrates cleaning raw data by reformatting names and ages and converting city names to states. Spark schemas can define the format of a DataFrame to filter garbage data and improve read performance. Spark uses immutability and lazy evaluation to efficiently handle transformations on data frames. The Parquet format is also discussed as it supports schemas, columnar data storage, and predicate pushdown for efficient data processing in Spark.

Introduction to Data Cleaning with Apache Spark
What is Data Cleaning?
Why Perform Data Cleaning with Spark?
Data Cleaning Example
Spark Schemas
Example Spark Schema
Practice: Data Cleaning
Immutability and Lazy Processing
Variable Review
Immutability
Immutability Example
Lazy Processing
Practice: Immutability and Lazy Processing
Understanding Parquet
Difficulties with CSV Files
Spark and CSV Files
The Parquet Format
Working with Parquet
Parquet and SQL
Practice: Understanding Parquet

Intro to data cleaning with Apache Spark
CLEANING DATA WITH PYSPARK

Mike Metzger
Data Engineering Consultant
What is Data Cleaning?
Data Cleaning: Preparing raw data for use in data processing pipelines.

Possible tasks in data cleaning:

Reformatting or replacing text

Performing calculations

Removing garbage or incomplete data

Why perform data cleaning with Spark?
Problems with typical data systems:

Performance

Organizing data flow

Advantages of Spark:

Scalable

Powerful framework for data handling

Data cleaning example
Raw data:

name          age (years)   city
Smith, John   37            Dallas
Wilson, A.    59            Chicago
null          215

Cleaned data:

last name   first name   age (months)   state
Smith       John         444            TX
Wilson      A.           708            IL
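The per-row logic behind this example can be sketched in plain Python. This is only an illustration of the three cleaning tasks (reformatting text, performing a calculation, removing garbage rows), not Spark code; the `clean_row` helper, the `CITY_TO_STATE` lookup, and the 120-year age cap are assumptions introduced for the sketch.

```python
# Hypothetical lookup table; a real pipeline would need a complete mapping.
CITY_TO_STATE = {"Dallas": "TX", "Chicago": "IL"}


def clean_row(row):
    """Return a cleaned record, or None if the raw row is garbage or incomplete."""
    name, age_years, city = row
    # Remove incomplete rows and rows whose name is not "Last, First".
    if name is None or age_years is None or "," not in name:
        return None
    # Remove rows with an unknown city or an implausible age (assumed cap: 120).
    if city not in CITY_TO_STATE or not 0 <= age_years <= 120:
        return None
    # Reformat text: "Smith, John" -> last name "Smith", first name "John".
    last_name, first_name = [part.strip() for part in name.split(",", 1)]
    return {
        "last name": last_name,
        "first name": first_name,
        # Perform a calculation: years -> months.
        "age (months)": age_years * 12,
        "state": CITY_TO_STATE[city],
    }


raw_rows = [
    ("Smith, John", 37, "Dallas"),
    ("Wilson, A.", 59, "Chicago"),
    (None, 215, None),
]

cleaned_rows = [r for r in map(clean_row, raw_rows) if r is not None]
for record in cleaned_rows:
    print(record)
```

The third raw row is dropped because its name is missing, so only two cleaned records remain.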

Spark Schemas
Define the format of a DataFrame

May contain various data types:

Strings, dates, integers, arrays

Can filter garbage data during import

Improves read performance

Example Spark Schema
Import schema

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

peopleSchema = StructType([
    # Define the name field
    StructField('name', StringType(), True),
    # Add the age field
    StructField('age', IntegerType(), True),
    # Add the city field
    StructField('city', StringType(), True)
])

Read CSV file containing data

# `spark` is an existing SparkSession
people_df = spark.read.format('csv').load(path='[Link]', schema=peopleSchema)

Let's practice!
Immutability and Lazy Processing

Variable review
Python variables:

Mutable

Flexibility

Potential for issues with concurrency

Likely adds complexity

Immutability
Immutable variables are:

A component of functional programming

Defined once

Unable to be directly modified

Re-created if reassigned

Able to be shared efficiently
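These properties can be illustrated in plain Python with a frozen dataclass. This is a minimal sketch, not how Spark implements DataFrames; the `Voter` class and its values are hypothetical.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Voter:
    # Hypothetical record type used only for this illustration.
    name: str
    year: int


original = Voter(name="A. Wilson", year=19)

# Unable to be directly modified: assignment to a field raises an error.
try:
    original.year = 2019
    modified_in_place = True
except FrozenInstanceError:
    modified_in_place = False

# Re-created if reassigned: a "change" builds a new object and leaves
# the original untouched.
updated = replace(original, year=original.year + 2000)

print(modified_in_place)            # False
print(original.year, updated.year)  # 19 2019
print(updated is original)          # False
```

Because `original` can never change, it can be handed to several consumers without copying or locking, which is the sense in which immutable values are shared efficiently.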

Immutability Example
Define a new data frame:

voter_df = spark.read.csv('[Link]')

Making changes:

voter_df = voter_df.withColumn('fullyear',
                               voter_df.year + 2000)

voter_df = voter_df.drop(voter_df.year)

Lazy Processing
Isn't this slow?

Transformations

Actions

Allows efficient planning

voter_df = voter_df.withColumn('fullyear',
                               voter_df.year + 2000)
voter_df = voter_df.drop(voter_df.year)

voter_df.count()
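The split between transformations and actions can be mimicked with a toy class in plain Python. This is an analogy under stated assumptions, not Spark's implementation: `LazyTable`, `with_column`, `collect`, and `full_year` are hypothetical names, and the `calls` list exists only to show when work actually happens.

```python
class LazyTable:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def with_column(self, name, func):
        # Transformation: return a new object with one more planned step.
        step = lambda row: {**row, name: func(row)}
        return LazyTable(self._rows, self._steps + (step,))

    def drop(self, name):
        # Transformation: only recorded, no rows are touched yet.
        step = lambda row: {k: v for k, v in row.items() if k != name}
        return LazyTable(self._rows, self._steps + (step,))

    def collect(self):
        # Action: run every planned step over every row.
        result = []
        for row in self._rows:
            for step in self._steps:
                row = step(row)
            result.append(row)
        return result

    def count(self):
        # Action: triggers the planned work, then counts the rows.
        return len(self.collect())


calls = []


def full_year(row):
    calls.append(row["year"])  # record that work was actually done
    return row["year"] + 2000


voters = LazyTable([{"name": "a", "year": 17}, {"name": "b", "year": 18}])
planned = voters.with_column("fullyear", full_year).drop("year")

work_before_action = len(calls)  # nothing has run yet
row_count = planned.count()      # the action triggers both steps
work_after_action = len(calls)

print(work_before_action, row_count, work_after_action)  # 0 2 2
```

Deferring the work until an action gives the engine the whole list of steps at once, which is what makes planning possible before anything is executed.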

Let's practice!
Understanding Parquet

Difficulties with CSV files
No defined schema

Nested data requires special handling

Encoding format limited

Spark and CSV files
Slow to parse

Files cannot be filtered (no "predicate pushdown")

Any intermediate use requires redefining schema

The Parquet Format
A columnar data format

Supported in Spark and other data processing frameworks

Supports predicate pushdown

Automatically stores schema information

Working with Parquet
Reading Parquet files

df = spark.read.format('parquet').load('[Link]')

df = spark.read.parquet('[Link]')

Writing Parquet files

df.write.format('parquet').save('[Link]')

df.write.parquet('[Link]')

Parquet and SQL
Parquet files as backing stores for Spark SQL operations

flight_df = spark.read.parquet('[Link]')

flight_df.createOrReplaceTempView('flights')

short_flights_df = spark.sql('SELECT * FROM flights WHERE flightduration < 100')

Let's Practice!