Caching DataFrames in PySpark

The document discusses caching in Spark SQL. It explains that caching keeps data in memory to improve performance for repeated operations. It describes how to cache and uncache DataFrames using df.cache() and df.unpersist(). It also covers caching tables using the Spark catalog and provides tips for effective caching.

Uploaded by

Tarun Singh

Caching

INTRODUCTION TO SPARK SQL IN PYTHON

Mark Plutowski
Data Scientist
What is caching?
Keeping data in memory

Spark tends to evict data from memory aggressively

Eviction Policy
Least Recently Used (LRU)

Eviction happens independently on each worker

Depends on memory available to each worker
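
As a rough illustration of least-recently-used eviction, here is a minimal pure-Python sketch. It is not Spark's implementation: the LRUCache class, its capacity counted in entries rather than bytes, and the partition names are all hypothetical stand-ins for the cache on a single worker.

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache standing in for one worker's memory (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity          # max number of entries, not bytes
        self.entries = OrderedDict()      # oldest access first, newest last

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)     # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = LRUCache(capacity=2)
cache.put("partition-0", "rows A")
cache.put("partition-1", "rows B")
cache.get("partition-0")                  # touch partition-0
cache.put("partition-2", "rows C")        # evicts partition-1
print(list(cache.entries))                # ['partition-0', 'partition-2']
```

Because each worker would hold its own instance of such a cache, the same partition can survive on one worker and be evicted on another.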

Caching a dataframe
TO CACHE A DATAFRAME:

df.cache()

TO UNCACHE IT:

df.unpersist()

Determining whether a dataframe is cached
df.is_cached

False

df.cache()
df.is_cached

True

Uncaching a dataframe
df.unpersist()
df.is_cached

False

Storage level
df.unpersist()
df.cache()
df.storageLevel

StorageLevel(True, True, False, True, 1)

In the storage level above the following hold:

1. useDisk = True

2. useMemory = True

3. useOffHeap = False

4. deserialized = True

5. replication = 1
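
The five positional values can be hard to read at a glance. The sketch below uses a hypothetical Level named tuple (a stand-in, not PySpark's StorageLevel class) with the field order listed above, and renders the flags in the word form that the query plans later in this document print, e.g. StorageLevel(disk, memory, deserialized, 1 replicas).

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the field order listed above.
Level = namedtuple(
    "Level", ["useDisk", "useMemory", "useOffHeap", "deserialized", "replication"]
)

def describe(level):
    """Render the flags as words, in the style of the plan output."""
    parts = []
    if level.useDisk:
        parts.append("disk")
    if level.useMemory:
        parts.append("memory")
    if level.useOffHeap:
        parts.append("offheap")
    if level.deserialized:
        parts.append("deserialized")
    parts.append("%d replicas" % level.replication)
    return "StorageLevel(%s)" % ", ".join(parts)

print(describe(Level(True, True, False, True, 1)))
```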

Persisting a dataframe
The following are equivalent in Spark 2.1+:

df.persist()

df.persist(storageLevel=pyspark.StorageLevel.MEMORY_AND_DISK)

df.cache() is the same as df.persist()

Caching a table
df.createOrReplaceTempView('df')
spark.catalog.isCached(tableName='df')

False

spark.catalog.cacheTable('df')
spark.catalog.isCached(tableName='df')

True

Uncaching a table
spark.catalog.uncacheTable('df')
spark.catalog.isCached(tableName='df')

False

spark.catalog.clearCache()

Tips
Caching is lazy

Only cache if more than one operation is to be performed

Unpersist when you no longer need the object

Cache selectively

The Spark UI

Use the Spark UI to inspect execution
A Spark task is a unit of execution that runs on a single CPU.

A Spark stage is a group of tasks that perform the same computation in parallel, each task typically running on a different subset of the data.

A Spark job is a computation triggered by an action, sliced into one or more stages.

Finding the Spark UI
1. http://[DRIVER_HOST]:4040

2. http://[DRIVER_HOST]:4041

3. http://[DRIVER_HOST]:4042

4. http://[DRIVER_HOST]:4043
...
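
By default the Spark UI is served by the driver on port 4040, and if that port is busy Spark tries the next one up. A small sketch for listing the addresses worth trying; candidate_ui_urls and the localhost driver host are hypothetical names used only for illustration.

```python
def candidate_ui_urls(driver_host, first_port=4040, attempts=4):
    """Return the URLs to try, in order, when looking for the Spark UI."""
    return [
        "http://%s:%d" % (driver_host, port)
        for port in range(first_port, first_port + attempts)
    ]

for url in candidate_ui_urls("localhost"):
    print(url)
```

In a live session, spark.sparkContext.uiWebUrl reports the address actually in use, which avoids guessing.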

Spark catalog operations
spark.catalog.cacheTable('table1')

spark.catalog.uncacheTable('table1')

Spark Catalog
spark.catalog.listTables()

[Table(name='text', database=None, description=None, tableType='TEMPORARY', isTemporary=

Spark UI Storage Tab
Shows where data partitions exist

in memory,

or on disk,

across the cluster,

at a snapshot in time.

Spark UI SQL tab
query3agg = """
SELECT w1, w2, w3, COUNT(*) as count FROM (
SELECT
word AS w1,
LEAD(word,1) OVER(PARTITION BY part ORDER BY id ) AS w2,
LEAD(word,2) OVER(PARTITION BY part ORDER BY id ) AS w3
FROM df
)
GROUP BY w1, w2, w3
ORDER BY count DESC
"""

spark.sql(query3agg).show()
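
To make the window logic concrete, here is a plain-Python sketch of what the query computes, run on a tiny hypothetical word list rather than the course data. LEAD(word, 1) and LEAD(word, 2) look one and two rows ahead within the same part and yield NULL (None here) past the end of the partition. The lead and trigram_counts helpers are names invented for this sketch.

```python
from collections import Counter
from itertools import groupby

# Hypothetical rows (id, part, word); not data from the course.
rows = [
    (1, 1, "the"), (2, 1, "quick"), (3, 1, "fox"),
    (4, 1, "the"), (5, 1, "quick"), (6, 1, "fox"),
    (7, 2, "the"), (8, 2, "end"),
]

def lead(seq, i, offset):
    """Mimic SQL LEAD: the value `offset` rows ahead, or None past the end."""
    j = i + offset
    return seq[j] if j < len(seq) else None

def trigram_counts(rows):
    counts = Counter()
    # PARTITION BY part ORDER BY id
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))
    for _, group in groupby(ordered, key=lambda r: r[1]):
        words = [r[2] for r in group]
        for i, w1 in enumerate(words):
            counts[(w1, lead(words, i, 1), lead(words, i, 2))] += 1
    return counts

counts = trigram_counts(rows)
# GROUP BY w1, w2, w3 ORDER BY count DESC: most frequent 3-tuple first
print(counts.most_common(1))  # [(('the', 'quick', 'fox'), 2)]
```

Every input row yields exactly one output tuple, which is why tuples ending in None appear near the end of each partition.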

Logging

Logging primer
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')
logging.info("Hello %s", "world")
logging.debug("Hello, take %d", 2)

2019-03-14 [Link],359 - INFO - Hello world

Logging with DEBUG level
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG,
                    format='%(asctime)s - %(levelname)s - %(message)s')
logging.info("Hello %s", "world")
logging.debug("Hello, take %d", 2)

2018-03-14 [Link],000 - INFO - Hello world

2018-03-14 [Link],001 - DEBUG - Hello, take 2

Debugging lazy evaluation
lazy evaluation

distributed execution

A simple timer
t = timer()
t.elapsed()

1. elapsed: 0.0 sec

t.elapsed() # Do something that takes 2 seconds

2. elapsed: 2.0 sec

t.reset() # Do something else that takes time: reset

t.elapsed()

3. elapsed: 0.0 sec

class timer
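import time

class timer:
    def __init__(self):
        # Start the clock when the timer is created, not when the class is defined.
        self.start_time = time.time()
        self.step = 0

    def elapsed(self, reset=True):
        # Print the step number and the seconds since the last reset.
        self.step += 1
        print("%d. elapsed: %.1f sec"
              % (self.step, time.time() - self.start_time))
        if reset:
            self.reset()

    def reset(self):
        self.start_time = time.time()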

Stealth CPU wastage
import logging
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

# < create dataframe df here >

t = timer()
logging.info("No action here.")
t.elapsed()
logging.debug("df has %d rows.", df.count())
t.elapsed()

2018-12-23 [Link],472 - INFO - No action here.

1. elapsed: 0.0 sec
2. elapsed: 2.0 sec
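
The two seconds come from Python rather than from the logging module: arguments are evaluated before logging.debug is entered, so the expensive call runs even though the DEBUG message is then discarded. A minimal sketch without Spark, where the hypothetical expensive_count function stands in for df.count():

```python
import logging
import sys

calls = []

def expensive_count():
    """Hypothetical stand-in for an action such as df.count()."""
    calls.append(1)
    return 1107014

logger = logging.getLogger("stealth_demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))

# DEBUG is below the INFO threshold, so nothing is logged,
# but expensive_count() has already run by the time debug() is entered.
logger.debug("df has %d rows.", expensive_count())
print("expensive_count ran %d time(s)" % len(calls))
```

The %-style arguments only defer string formatting; they do not defer evaluation of the argument expressions themselves.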

Disable actions
ENABLED = False

t = timer()
logging.info("No action here.")
t.elapsed()
if ENABLED:
    logging.info("df has %d rows.", df.count())
t.elapsed()

2019-03-14 [Link],789 - Pyspark - INFO - No action here.

1. elapsed: 0.0 sec
2. elapsed: 0.0 sec

Enabling actions
Rerunning the previous example with ENABLED = True triggers the action:

2019-03-14 [Link],789 - INFO - No action here.

1. elapsed: 0.0 sec
2019-03-14 [Link],789 - INFO - df has 1107014 rows.
2. elapsed: 2.0 sec
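
Instead of a hand-maintained ENABLED flag, one option is to ask the logger whether the message would be emitted at all. Logger.isEnabledFor is part of the standard logging module; expensive_count is a hypothetical stand-in for df.count(), and the logger name is arbitrary.

```python
import logging

calls = []

def expensive_count():
    """Hypothetical stand-in for an action such as df.count()."""
    calls.append(1)
    return 1107014

logger = logging.getLogger("guard_demo")
logger.setLevel(logging.INFO)

# Skip the action entirely unless DEBUG output is enabled.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("df has %d rows.", expensive_count())
skipped = len(calls)   # still 0: the action never ran

# Lowering the threshold to DEBUG lets the guarded action run.
logger.setLevel(logging.DEBUG)
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("df has %d rows.", expensive_count())
ran = len(calls)       # now 1

print(skipped, ran)
```

This ties the cost of the action to the logging configuration, so switching the level is enough to turn the action on or off.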

Query Plans

Explain
EXPLAIN SELECT * FROM table1

Load dataframe and register
df = spark.read.load('/temp/[Link]')

df.createOrReplaceTempView('df')

Running an EXPLAIN query
spark.sql('EXPLAIN SELECT * FROM df').first()

Row(plan='== Physical Plan ==\n

*FileScan parquet [word#1928,id#1929L,title#1930,part#1931]
Batched: true,
Format: Parquet,
Location: InMemoryFileIndex[file:/temp/[Link]],
PartitionFilters: [],
PushedFilters: [],
ReadSchema: struct<word:string,id:bigint,title:string,part:int>')

Interpreting an EXPLAIN query
== Physical Plan ==

FileScan parquet [word#1928,id#1929L,title#1930,part#1931]

Batched: true,

Format: Parquet,

Location: InMemoryFileIndex[file:/temp/[Link]],

PartitionFilters: [],

PushedFilters: [],

ReadSchema: struct<word:string,id:bigint,title:string,part:int>

df.explain()
df.explain()

== Physical Plan ==
FileScan parquet [word#963,id#964L,title#965,part#966]
Batched: true, Format: Parquet,
Location: InMemoryFileIndex[file:/temp/[Link]],
PartitionFilters: [], PushedFilters: [],
ReadSchema: struct<word:string,id:bigint,title:string,part:int>

spark.sql("SELECT * FROM df").explain()

== Physical Plan ==
FileScan parquet [word#712,id#713L,title#714,part#715]
Batched: true, Format: Parquet,
Location: InMemoryFileIndex[file:/temp/[Link]],
PartitionFilters: [], PushedFilters: [],
ReadSchema: struct<word:string,id:bigint,title:string,part:int>

df.explain(), on cached dataframe
df.cache()
df.explain()

== Physical Plan ==
InMemoryTableScan [word#0, id#1L, title#2, part#3]
+- InMemoryRelation [word#0, id#1L, title#2, part#3], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
   +- FileScan parquet [word#0,id#1L,title#2,part#3]
      Batched: true, Format: Parquet,
      Location: InMemoryFileIndex[file:/temp/[Link]],
      PartitionFilters: [], PushedFilters: [],
      ReadSchema: struct<word:string,id:bigint,title:string,part:int>

spark.sql("SELECT * FROM df").explain()

== Physical Plan ==
InMemoryTableScan [word#0, id#1L, title#2, part#3]
+- InMemoryRelation [word#0, id#1L, title#2, part#3], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
   +- FileScan parquet [word#0,id#1L,title#2,part#3]
      Batched: true, Format: Parquet,
      Location: InMemoryFileIndex[file:/temp/[Link]],
      PartitionFilters: [], PushedFilters: [],
      ReadSchema: struct<word:string,id:bigint,title:string,part:int>
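
Checking for the cache by eye gets tedious in longer plans. One possible approach, sketched here on plain strings, is to look for the InMemoryTableScan operator that appears in the cached plans above. plan_uses_cache is a hypothetical helper, and the plan text is abbreviated from the output shown earlier.

```python
def plan_uses_cache(plan_text):
    """True if the physical plan reads from an in-memory (cached) relation."""
    return "InMemoryTableScan" in plan_text

uncached_plan = """== Physical Plan ==
FileScan parquet [word#963,id#964L,title#965,part#966]"""

cached_plan = """== Physical Plan ==
InMemoryTableScan [word#0, id#1L, title#2, part#3]
+- InMemoryRelation [word#0, id#1L, title#2, part#3]
   +- FileScan parquet [word#0,id#1L,title#2,part#3]"""

print(plan_uses_cache(uncached_plan), plan_uses_cache(cached_plan))
```

In a live session, the plan text could come from the plan field of the row returned by the EXPLAIN query shown earlier.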

Words sorted by frequency query
SELECT word, COUNT(*) AS count
FROM df
GROUP BY word
ORDER BY count DESC

Equivalent dot notation approach:

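from pyspark.sql.functions import desc

# Parentheses let the method chain span several lines.
(df.groupBy('word')
   .count()
   .sort(desc('count'))
   .explain())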

Same query using dataframe dot notation
== Physical Plan ==
*Sort [count#1040L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#1040L DESC NULLS LAST, 200)
   +- *HashAggregate(keys=[word#963], functions=[count(1)])
      +- Exchange hashpartitioning(word#963, 200)
         +- *HashAggregate(keys=[word#963], functions=[partial_count(1)])
            +- InMemoryTableScan [word#963]
               +- InMemoryRelation [word#963, id#964L, title#965, part#966], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
                  +- *FileScan parquet [word#963,id#964L,title#965,part#966]
                     Batched: true, Format: Parquet,
                     Location: InMemoryFileIndex[file:/temp/[Link]],
                     PartitionFilters: [], PushedFilters: [],
                     ReadSchema: struct<word:string,id:bigint,title:string,part:int>

Reading from bottom up
FileScan parquet

InMemoryRelation

InMemoryTableScan

HashAggregate(keys=[word#963], ...)

HashAggregate(keys=[word#963], functions=[count(1)])

Sort [count#1040L DESC NULLS LAST]
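
The bottom-up reading order can be reproduced mechanically. This sketch works on plan text only and makes no claim about Spark internals: it keeps the name at the start of each line, strips the tree markers, and reverses the result. operators_bottom_up is a hypothetical helper, and the plan is abbreviated from the one above to one operator per line, so continuation lines such as Batched: or Location: would need to be filtered out first.

```python
import re

def operators_bottom_up(plan_text):
    """List operator names from the leaf (data source) up to the root."""
    names = []
    for line in plan_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("=="):
            continue                      # skip the "== Physical Plan ==" header
        # Drop the "+- " tree marker and the leading "*" marker, if present.
        stripped = re.sub(r"^(\+- )?\*?", "", stripped)
        match = re.match(r"[A-Za-z]+", stripped)
        if match:
            names.append(match.group(0))
    return names[::-1]

plan = """== Physical Plan ==
*Sort [count#1040L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#1040L DESC NULLS LAST, 200)
   +- *HashAggregate(keys=[word#963], functions=[count(1)])
      +- Exchange hashpartitioning(word#963, 200)
         +- *HashAggregate(keys=[word#963], functions=[partial_count(1)])
            +- InMemoryTableScan [word#963]
               +- InMemoryRelation [word#963, id#964L, title#965, part#966]
                  +- *FileScan parquet [word#963,id#964L,title#965,part#966]"""

for name in operators_bottom_up(plan):
    print(name)
```

The listing also surfaces the two Exchange steps between the aggregates and the sort, which the slide's list leaves out.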

Query plan
== Physical Plan ==
*Sort [count#1160L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#1160L DESC NULLS LAST, 200)
   +- *HashAggregate(keys=[word#963], functions=[count(1)])
      +- Exchange hashpartitioning(word#963, 200)
         +- *HashAggregate(keys=[word#963], functions=[partial_count(1)])
            +- *FileScan parquet [word#963] Batched: true, Format: Parquet,
               Location: InMemoryFileIndex[file:/temp/[Link]], PartitionFilters: [],
               PushedFilters: [], ReadSchema: struct<word:string>

The previous plan had the following lines, which are missing from the plan above:

...
+- InMemoryTableScan [word#963]
   +- InMemoryRelation [word#963, id#964L, title#965, part#966], true, 10000,
      StorageLevel(disk, memory, deserialized, 1 replicas)
...
