0% found this document useful (1 vote)
180 views20 pages

Data Cleaning Techniques in PySpark

This document discusses data cleaning with Apache Spark. It introduces data cleaning as preparing raw data for use in data processing pipelines. Common data cleaning tasks include reformatting text, performing calculations, and removing garbage or incomplete data. Spark is advantageous for data cleaning due to its scalability and powerful framework for data handling. An example demonstrates cleaning raw data by reformatting names and ages and converting city names to states. Spark schemas can define the format of a DataFrame to filter garbage data and improve read performance. Spark uses immutability and lazy evaluation to efficiently handle transformations on data frames. The Parquet format is also discussed as it supports schemas, columnar data storage, and predicate pushdown for efficient data processing in Spark.

Uploaded by

Fgpeqw
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (1 vote)
180 views20 pages

Data Cleaning Techniques in PySpark

This document discusses data cleaning with Apache Spark. It introduces data cleaning as preparing raw data for use in data processing pipelines. Common data cleaning tasks include reformatting text, performing calculations, and removing garbage or incomplete data. Spark is advantageous for data cleaning due to its scalability and powerful framework for data handling. An example demonstrates cleaning raw data by reformatting names and ages and converting city names to states. Spark schemas can define the format of a DataFrame to filter garbage data and improve read performance. Spark uses immutability and lazy evaluation to efficiently handle transformations on data frames. The Parquet format is also discussed as it supports schemas, columnar data storage, and predicate pushdown for efficient data processing in Spark.

Uploaded by

Fgpeqw
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
  • Introduction to Data Cleaning with Apache Spark
  • What is Data Cleaning?
  • Why Perform Data Cleaning with Spark?
  • Data Cleaning Example
  • Spark Schemas
  • Example Spark Schema
  • Practice: Data Cleaning
  • Immutability and Lazy Processing
  • Variable Review
  • Immutability
  • Immutability Example
  • Lazy Processing
  • Practice: Immutability and Lazy Processing
  • Understanding Parquet
  • Difficulties with CSV Files
  • Spark and CSV Files
  • The Parquet Format
  • Working with Parquet
  • Parquet and SQL
  • Practice: Understanding Parquet

Intro to data

cleaning with
Apache Spark
C L E A N I N G D ATA W I T H P Y S PA R K

Mike Metzger
Data Engineering Consultant
What is Data Cleaning?
Data Cleaning: Preparing raw data for use in data processing pipelines.

Possible tasks in data cleaning:

Reformatting or replacing text

Performing calculations

Removing garbage or incomplete data

CLEANING DATA WITH PYSPARK


Why perform data cleaning with Spark?
Problems with typical data systems:

Performance

Organizing data ow

Advantages of Spark:

Scalable

Powerful framework for data handling

CLEANING DATA WITH PYSPARK


Data cleaning example
Raw data: Cleaned data:

name age (years) city last name rst name age (months) state

Smith, John 37 Dallas Smith John 444 TX

Wilson, A. 59 Chicago Wilson A. 708 IL

null 215

CLEANING DATA WITH PYSPARK


Spark Schemas
De ne the format of a DataFrame

May contain various data types:


Strings, dates, integers, arrays

Can lter garbage data during import

Improves read performance

CLEANING DATA WITH PYSPARK


Example Spark Schema
Import schema

import [Link]
peopleSchema = StructType([
# Define the name field
StructField('name', StringType(), True),
# Add the age field
StructField('age', IntegerType(), True),
# Add the city field
StructField('city', StringType(), True)
])

Read CSV le containing data

people_df = [Link]('csv').load(name='[Link]', schema=peopleSchema)

CLEANING DATA WITH PYSPARK


Let's practice!
C L E A N I N G D ATA W I T H P Y S PA R K
Immutability and
Lazy Processing
C L E A N I N G D ATA W I T H P Y S PA R K

Mike Metzger
Data Engineering Consultant
Variable review
Python variables:

Mutable

Flexibility

Potential for issues with concurrency

Likely adds complexity

CLEANING DATA WITH PYSPARK


Immutability
Immutable variables are:

A component of functional programming

De ned once

Unable to be directly modi ed

Re-created if reassigned

Able to be shared ef ciently

CLEANING DATA WITH PYSPARK


Immutability Example
De ne a new data frame:

voter_df = [Link]('[Link]')

Making changes:

voter_df = voter_df.withColumn('fullyear',
voter_df.year + 2000)

voter_df = voter_df.drop(voter_df.year)

CLEANING DATA WITH PYSPARK


Lazy Processing
Isn't this slow?

Transformations

Actions

Allows ef cient planning

voter_df = voter_df.withColumn('fullyear',
voter_df.year + 2000)
voter_df = voter_df.drop(voter_df.year)

voter_df.count()

CLEANING DATA WITH PYSPARK


Let's practice!
C L E A N I N G D ATA W I T H P Y S PA R K
Understanding
Parquet
C L E A N I N G D ATA W I T H P Y S PA R K

Mike Metzger
Data Engineering Consultant
Dif culties with CSV les
No de ned schema

Nested data requires special handling

Encoding format limited

CLEANING DATA WITH PYSPARK


Spark and CSV les
Slow to parse

Files cannot be ltered (no "predicate pushdown")

Any intermediate use requires rede ning schema

CLEANING DATA WITH PYSPARK


The Parquet Format
A columnar data format

Supported in Spark and other data processing frameworks

Supports predicate pushdown

Automatically stores schema information

CLEANING DATA WITH PYSPARK


Working with Parquet
Reading Parquet les

df = [Link]('parquet').load('[Link]')

df = [Link]('[Link]')

Writing Parquet les

[Link]('parquet').save('[Link]')

[Link]('[Link]')

CLEANING DATA WITH PYSPARK


Parquet and SQL
Parquet as backing stores for SparkSQL operations

flight_df = [Link]('[Link]')

flight_df.createOrReplaceTempView('flights')

short_flights_df = [Link]('SELECT * FROM flights WHERE flightduration < 100')

CLEANING DATA WITH PYSPARK


Let's Practice!
C L E A N I N G D ATA W I T H P Y S PA R K

Intro to data
cleaning with
Apache Spark
CLEAN IN G DATA W ITH  P YS PARK
Mike Metzger
Data Engineering Consultant
CLEANING DATA WITH PYSPARK
What is Data Cleaning?
Data Cleaning: Preparing raw data for use in data processing pipelines.
Pos
CLEANING DATA WITH PYSPARK
Why perform data cleaning with Spark?
Problems with typical data systems:
Performance
Organizing d
CLEANING DATA WITH PYSPARK
Data cleaning example
Raw data:
name
age (years)
city
Smith, John
37
Dallas
Wilson, A.
59
Chicago
CLEANING DATA WITH PYSPARK
Spark Schemas
De×ne the format of a DataFrame
May contain various data types:
Strings, dates, inte
CLEANING DATA WITH PYSPARK
Example Spark Schema
Import schema
import pyspark.sql.types 
peopleSchema = StructType([ 
  # Defi
Let's practice!
CLEAN IN G DATA W ITH  P YS PARK
Immutability and
Lazy Processing
CLEAN IN G DATA W ITH  P YS PARK
Mike Metzger
Data Engineering Consultant
CLEANING DATA WITH PYSPARK
Variable review
Python variables:
Mutable
Flexibility
Potential for issues with concurrency
Likely
CLEANING DATA WITH PYSPARK
Immutability
Immutable variables are:
A component of functional programming
De×ned once
Unable to

You might also like