0% found this document useful (0 votes)

5 views9 pages

10 Spark Challenges

The document outlines ten key challenges faced when using Spark, focusing on configuration, memory allocation, data skew, and pipeline optimization. It emphasizes the importance of properly sizing resources, monitoring performance, and optimizing jobs to prevent inefficiencies and high costs. Additionally, it highlights the complexities of managing Spark jobs in both on-premises and cloud environments, including the need for effective troubleshooting and resource management.

Uploaded by

Kaushal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

5 views9 pages

10 Spark Challenges

Uploaded by

Kaushal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

10 SPARK CHALLENGES

Many Spark challenges relate to configuration, including the number

of executors to assign, memory usage (at the driver level, and per
executor), and what kind of hardware/machine instances to use. You
make configuration choices per job, and also for the overall cluster in
which jobs run, and these are interdependent – so things get
complicated, fast.

Some challenges occur at the job level; these challenges are shared
right across the data team. They include:

==> Challenge 1. How many executors should each job use?

One of the key advantages of Spark is parallelization – you run your

job’s code against different data partitions in parallel workstreams, as
in the diagram below. The number of workstreams that run at once is
the number of executors, times the number of cores per executor. So
how many executors should your job use, and how many cores per
executor – that is, how many workstreams do you want running at
once?

A Spark job using three cores to parallelize output. Up to three tasks

run simultaneously, and seven tasks are completed in a fixed period of
time.
You want high usage of cores, high usage of memory per core, and
data partitioning appropriate to the job. (Usually, partitioning on the
field or fields you’re querying on.) Beginner guide for Hadoop suggests
two-three cores per executor, but not more than five; the expert guide
to Spark tuning on AWS suggests that you use three executors per
node, with five cores per executor, as your starting point for all jobs.

You are likely to have your own sensible starting point for your on-
premises or cloud platform, the servers or instances available, and
experience your team has had with similar workloads. Once your job
runs successfully a few times, you can either leave it alone, or
optimize it. We recommend that you optimize it, because optimization:

• Helps you save resources and money (not over-allocating)

• Helps prevent crashes, because you right-size the resources (not
under-allocating)
• Helps you fix crashes fast, because allocations are roughly correct,
and because you understand the job better

==>Challenge2. How much memory should I allocate for each job?

Memory allocation is per executor, and the most you can allocate is
the total available in the node. If you’re in the cloud, this is governed
by your instance type; on-premises, by your physical server or virtual
machine. Some memory is needed for your cluster manager and
system resources (16GB may be a typical amount), and the rest is
available for jobs.

If you have three executors in a 128GB cluster, and 16GB is taken up

by the cluster, that leaves 37GB per executor. However, a few GB will
be required for executor overhead; the remainder is your per-executor
memory. You will want to partition your data so it can be processed
efficiently in the available memory.

This is just a starting point, however. You may need to be using a

different instance type, or a different number of executors, to make
the most efficient use of your node’s resources against the job you’re
running. As with the number of executors, optimizing your job will help
you know whether you are over- or under-allocating memory, reduce
the likelihood of crashes, and get you ready for troubleshooting when
the need arises.

==>Challenge 3. How do I find and eliminate data skew?

Data skew and small files are complementary problems. Data skew
tends to describe large files – where one key value, or a few, have a
large share of the total data associated with them. This can force
Spark, as it’s processing the data, to move data around in the cluster,
which can slow down your task, cause low utilization of CPU capacity,
and cause out-of-memory errors which abort your job.

Small files are partly the other end of data skew – a share of partitions
will tend to be small. And Spark, since it is a parallel processing
system, may generate many small files from parallel processes. Also,
some processes you use, such as file compression, may cause a large
number of small files to appear, causing inefficiencies. You may need
to reduce parallelism (undercutting one of the advantages of Spark),
repartition (an expensive operation you should minimize), or start
adjusting your parameters, your data, or both.

Both data skew and small files incur a meta-problem that’s common
across Spark – when a job slows down or crashes, how do you know
what the problem was? We will mention this again, but it can be
particularly difficult to know this for data-related problems, as an
otherwise well-constructed job can have seemingly random slowdowns
or halts, caused by hard-to-predict and hard-to-detect inconsistencies
across different data sets.

==>Challenge 4. How do I make my pipelines work better?

Spark pipelines are made up of dataframes, connected by
transformers (which calculate new data from existing data), and
Estimators. Pipelines are widely used for all sorts of processing,
including extract, transform, and load (ETL) jobs and machine
learning. Spark makes it easy to combine jobs into pipelines, but it
does not make it easy to monitor and manage jobs at the pipeline
level. So it’s easy for monitoring, managing, and optimizing pipelines
to appear as an exponentially more difficult version of optimizing
individual Spark jobs.

Existing Transformers create new Dataframes, with an Estimator

producing the final model.
Many pipeline components are “tried and trusted” individually, and
are thereby less likely to cause problems than new components you
create yourself. However, interactions between pipeline steps can
cause novel problems.

Just as job issues roll up to the cluster level, they also roll up to the
pipeline level. Pipelines are increasingly the unit of work for DataOps,
but it takes truly deep knowledge of your jobs and your cluster(s) for
you to work effectively at the pipeline level.

Challenge 5. How do I know if a specific job is optimized?

Neither Spark nor, for that matter, SQL are designed for ease of
optimization. Spark comes with a monitoring and management
interface, Spark UI, which can help. But Spark UI can be challenging to
use, especially for the types of comparisons – over time, across jobs,
and across a large, busy cluster – that you need to really optimize a
job. And there is no “SQL UI” that specifically tells you how to optimize
your SQL queries.

There are some general rules. For instance, a “bad” – inefficient – join
can take hours. But it’s very hard to find where your app is spending
its time, let alone whether a specific SQL command is taking a long
time, and whether it can indeed be optimized.

When data sizes grow large enough, and processing gets complex
enough, you have to help it along if you want your resource usage,
costs, and runtimes to stay on the acceptable side.

Other challenges come up at the cluster level, or even at the stack

level, as you decide what jobs to run on what clusters. These problems
tend to be the remit of operations people and data engineers. They
include:

==> Challenge 6. How do I size my nodes, and match them to the

right servers/instance types?

A Spark node – a physical server or a cloud instance – will have an

allocation of CPUs and physical memory. (The whole point of Spark is
to run things in actual memory, so this is crucial.) You have to fit your
executors and memory allocations into nodes that are carefully
matched to existing resources, on-premises or in the cloud. (You can
allocate more or fewer Spark cores than there are available CPUs, but
matching them makes things more predictable, uses resources better,
and may make troubleshooting easier.)

On-premises, poor matching between nodes, physical servers,

executors and memory results in inefficiencies, but these may not be
very visible; as long as the total physical resource is sufficient for the
jobs running, there’s no obvious problem. However, issues like this can
cause datacenters to be very poorly utilized, meaning there’s big
overspending going on – it’s just not noticed. (Ironically, the
impending prospect of cloud migration may cause an organization to
freeze on-prem spending, shining a spotlight on costs and efficiency.)

In the cloud, “pay as you go” pricing shines a different type of

spotlight on efficient use of resources – inefficiency shows up in each
month’s bill. You need to match nodes, cloud instances, and job CPU
and memory allocations very closely indeed, or incur what might
amount to massive overspending.

You still have big problems here. In the cloud, with costs both visible
and variable, cost allocation is a big issue. It’s hard to know who’s
spending what, let alone what the business results that go with each
unit of spending are. But tuning workloads against server resources
and/or instances is the first step in gaining control of your spending,
across all your data estates.

==> Challenge 7. How do I see what’s going on across the Spark

stack and apps?

“Spark is notoriously difficult to tune and maintain,”

Clusters need to be “expertly managed” to perform well, or all the
good characteristics of Spark can come crashing down in a heap of
frustration and high costs. (In people’s time and in business losses, as
well as direct, hard dollar costs.)

Key Spark advantages include accessibility to a wide range of users

and the ability to run in memory. But the most popular tool for Spark
monitoring and management, Spark UI, doesn’t really help much at
the cluster level. You can’t, for instance, easily tell which jobs consume
the most resources over time. So it’s hard to know where to focus your
optimization efforts. And Spark UI doesn’t support more advanced
functionality – such as comparing the current job run to previous runs,
issuing warnings, or making recommendations, for example.

Logs on cloud clusters are lost when a cluster is terminated, so

problems that occur in short-running clusters can be that much harder
to debug. More generally, managing log files is itself a big data
management and data accessibility issue, making debugging and
governance harder. This occurs in both on-premises and cloud
environments. And, when workloads are moved to the cloud, you no
longer have a fixed-cost data estate, nor the “tribal knowledge”
accrued from years of running a gradually changing set of workloads
on-premises. Instead, you have new technologies and pay-as-you-go
billing. So cluster-level management, hard as it is, becomes critical.

==> Challenge 8. Is my data partitioned correctly for my SQL queries?

Operators can get quite upset, and rightly so, over “bad” or “rogue”
queries that can cost way more, in resources or cost, than they need
to. One colleague describes a team he worked on that went through
more than $100,000 of cloud costs in a weekend of crash-testing a
new application – a discovery made after the fact. (But before the job
was put into production, where it would have really run up some bills.)

SQL is not designed to tell you how much a query is likely to cost, and
more elegant-looking SQL queries (ie, fewer statements) may well be
more expensive. The same is true of all kinds of code you have
running.

So you have to do some or all of three things:

•Learn something about SQL, and about coding languages you use,
especially how they work at runtime
•Understand how to optimize your code and partition your data for
good price/performance
•Experiment with your app to understand where the resource use/cost
“hot spots” are, and reduce them where possible

==> Challenge 9. When do I take advantage of auto-scaling?

The ability to auto-scale – to assign resources to a job just while it’s

running, or to increase resources smoothly to meet processing peaks –
is one of the most enticing features of the cloud. It’s also one of the
most dangerous; there is no practical limit to how much you can
spend. You need some form of guardrails, and some form of alerting,
to remove the risk of truly gigantic bills.

The need for auto-scaling might, for instance, determine whether you
move a given workload to the cloud, or leave it running, unchanged, in
your on-premises data center. But to help an application benefit from
auto-scaling, you have to profile it, then cause resources to be
allocated and de-allocated to match the peaks and valleys. And you
have some calculations to make, because cloud providers charge you
more for spot resources – those you grab and let go of, as needed –
than for persistent resources that you keep running for a long time.
Spot resources may cost two or three times as much as dedicated
ones.
The first step, as you might have guessed, is to optimize your
application, as in the previous sections. Auto-scaling is a
price/performance optimization, and a potentially resource-intensive
one. You should do other optimizations first.

Then profile your optimized application. You need to calculate ongoing

and peak memory and processor usage, figure out how long you need
each, and the resource needs and cost for each state. And then decide
whether it’s worth auto-scaling the job, whenever it runs, and how to
do that. You may also need to find quiet times on a cluster to run some
jobs, so the job’s peaks don’t overwhelm the cluster’s resources.

To help, Databricks has two types of clusters, and the second type
works well with auto-scaling. Most jobs start out in an interactive
cluster, which is like an on-premises cluster; multiple people use a set
a shared resources. It is, by definition, very difficult to avoid seriously
underusing the capacity of an interactive cluster.

So you are meant to move each of your repeated, resource-intensive,

and well-understood jobs off to its own, dedicated, job-specific cluster.
A job-specific cluster spins up, runs its job, and spins down. This is a
form of auto-scaling already, and you can also scale the cluster’s
resources to match job peaks, if appropriate. But note that you want
your application profiled and optimized before moving it to a job-
specific cluster.

==> Challenge10. How do I get insights into jobs that have problems?

Just as it’s hard to fix an individual Spark job, there’s no easy way to
know where to look for problems across a Spark cluster. And once you
do find a problem, there’s very little guidance on how to fix it. Is the
problem with the job itself, or the environment it’s running in? For
instance, over-allocating memory or CPUs for some Spark jobs can
starve others. In the cloud, the noisy neighbors problem can slow
down a Spark job run to the extent that it causes business problems
on one outing – but leaves the same job to finish in good time on the
next run.
The better you handle the other challenges listed in this post, the
fewer problems you’ll have, but it’s still very hard to know how to
most productively spend Spark operations time. For instance, a slow
Spark job on one run may be worth fixing in its own right, and may be
warning you of crashes on future runs. But it’s very hard just to see
what the trend is for a Spark job in performance, let alone to get some
idea of what the job is accomplishing vs. its resource use and average
time to complete. So Spark troubleshooting ends up being reactive,
with all too many furry, blind little heads popping up for operators to
play Whack-a-Mole with.

IMPACTS OF THESE CHALLENGES

If you meet the above challenges effectively, you’ll use your resources
efficiently and cost-effectively.

What we tend to see most are the following problems – at a job level,
within a cluster, or across all clusters:

•Under-allocation. It can be tricky to allocate your resources

efficiently on your cluster, partition your datasets effectively, and
determine the right level of resources for each job. If you under-
allocate (either for a job’s driver or the executors), a job is likely to run
too slowly, or to crash. As a result, many developers and operators
resort to…
•Over-allocation. If you assign too many resources to your job,
you’re wasting resources (on-premises) or money (cloud). We hear
about jobs that need, for example, 2GB of memory, but are allocated
much more – in one case, 85GB.

Troubleshooting Spark Challenges
No ratings yet
Troubleshooting Spark Challenges
7 pages
Optimizing Apache Spark Performance
No ratings yet
Optimizing Apache Spark Performance
7 pages
Big Data Analytics with Apache Spark
No ratings yet
Big Data Analytics with Apache Spark
58 pages
Fast Data Analytics with PySpark Guide
No ratings yet
Fast Data Analytics with PySpark Guide
75 pages
Spark Job Troubleshooting Solutions
No ratings yet
Spark Job Troubleshooting Solutions
7 pages
Optimize Apache Spark on AWS: 5 Tips
No ratings yet
Optimize Apache Spark on AWS: 5 Tips
9 pages
Big Data Processing with Apache Spark
No ratings yet
Big Data Processing with Apache Spark
38 pages
Using Spark on NERSC's Cori System
No ratings yet
Using Spark on NERSC's Cori System
14 pages
Big Data Basics: Hadoop & Spark Overview
No ratings yet
Big Data Basics: Hadoop & Spark Overview
7 pages
PySpark 1
No ratings yet
PySpark 1
63 pages
Essential Spark Tips for Beginners
No ratings yet
Essential Spark Tips for Beginners
8 pages
Understanding Apache Spark Basics
No ratings yet
Understanding Apache Spark Basics
29 pages
Introduction to Apache Spark Basics
No ratings yet
Introduction to Apache Spark Basics
30 pages
PySpark Fundamentals and Architecture Guide
No ratings yet
PySpark Fundamentals and Architecture Guide
19 pages
Data Enginee
No ratings yet
Data Enginee
22 pages
Dimensionnement Spark - Les 5 Erreurs À Éviter
No ratings yet
Dimensionnement Spark - Les 5 Erreurs À Éviter
75 pages
Chapter 3
No ratings yet
Chapter 3
55 pages
Apache Spark & Azure Databricks Guide
No ratings yet
Apache Spark & Azure Databricks Guide
46 pages
Introduction To Spark
No ratings yet
Introduction To Spark
84 pages
Beginning Database Design
No ratings yet
Beginning Database Design
2 pages
Installing and Using Apache Spark
No ratings yet
Installing and Using Apache Spark
11 pages
Scalable Data Architecture with Hadoop
No ratings yet
Scalable Data Architecture with Hadoop
31 pages
Features and Architecture of Apache Spark
No ratings yet
Features and Architecture of Apache Spark
24 pages
Introduction to Apache Spark Programming
No ratings yet
Introduction to Apache Spark Programming
39 pages
Understanding Apache Spark for Big Data
No ratings yet
Understanding Apache Spark for Big Data
11 pages
Unit 5.1
No ratings yet
Unit 5.1
81 pages
Module 7
No ratings yet
Module 7
30 pages
Overview of Apache Spark Architecture
No ratings yet
Overview of Apache Spark Architecture
7 pages
Overview of Apache Spark Framework
No ratings yet
Overview of Apache Spark Framework
57 pages
Apache Spark: Fast Stream Processing
No ratings yet
Apache Spark: Fast Stream Processing
74 pages
Introduction to Apache Spark for Big Data
No ratings yet
Introduction to Apache Spark for Big Data
42 pages
Unit 4 Advanced Big Data Analytics
No ratings yet
Unit 4 Advanced Big Data Analytics
17 pages
Understanding Spark Application Concepts
No ratings yet
Understanding Spark Application Concepts
18 pages
Optimize PySpark Performance Techniques
No ratings yet
Optimize PySpark Performance Techniques
23 pages
Spark Concepts and Interview Questions Guide
No ratings yet
Spark Concepts and Interview Questions Guide
21 pages
Data Engineer's Guide To Spark
No ratings yet
Data Engineer's Guide To Spark
22 pages
Introduction To Apache Spark
No ratings yet
Introduction To Apache Spark
10 pages
Big Data Analytics Course Overview
No ratings yet
Big Data Analytics Course Overview
52 pages
Overview of Apache Spark Framework
No ratings yet
Overview of Apache Spark Framework
14 pages
Obsidian
No ratings yet
Obsidian
46 pages
Introduction to Apache Spark Overview
No ratings yet
Introduction to Apache Spark Overview
48 pages
Lecture4 IntroMapReduce PDF
No ratings yet
Lecture4 IntroMapReduce PDF
75 pages
PN s6 Real Time Processingi Bca Bda Vi Ch2
No ratings yet
PN s6 Real Time Processingi Bca Bda Vi Ch2
26 pages
Pyspark Basics
No ratings yet
Pyspark Basics
106 pages
Spark vs. Hadoop: Key Comparisons
No ratings yet
Spark vs. Hadoop: Key Comparisons
49 pages
Introduction to Apache Spark Basics
No ratings yet
Introduction to Apache Spark Basics
25 pages
Unit - IV Big Data Tools and Platforms
No ratings yet
Unit - IV Big Data Tools and Platforms
21 pages
Apache Spark Resource Management Guide
0% (1)
Apache Spark Resource Management Guide
40 pages
Data Analysis with Apache Spark Overview
No ratings yet
Data Analysis with Apache Spark Overview
39 pages
Spark 01
No ratings yet
Spark 01
8 pages
Unit 3
No ratings yet
Unit 3
40 pages
Efficient Data Processing With Apache Spark
No ratings yet
Efficient Data Processing With Apache Spark
29 pages
Understanding Apache Spark: Features & Benefits
No ratings yet
Understanding Apache Spark: Features & Benefits
19 pages
Spark and Unit 4 Report Overview
No ratings yet
Spark and Unit 4 Report Overview
11 pages
Midterm Review: Big Data Concepts
No ratings yet
Midterm Review: Big Data Concepts
34 pages
Big Data Processing with Hadoop & Spark
No ratings yet
Big Data Processing with Hadoop & Spark
5 pages
Sparklyr Online Course Overview
No ratings yet
Sparklyr Online Course Overview
80 pages
CSCE 1030: Intro to Computer Science
No ratings yet
CSCE 1030: Intro to Computer Science
7 pages
Evolution of Object-Oriented Programming
No ratings yet
Evolution of Object-Oriented Programming
4 pages
Sigenergy Modbus Protocol V2.5 Guide
No ratings yet
Sigenergy Modbus Protocol V2.5 Guide
40 pages
Control Processor and Safety Module Guide
50% (2)
Control Processor and Safety Module Guide
23 pages
Programming Languages Classification Worksheet
No ratings yet
Programming Languages Classification Worksheet
3 pages
JavaScript Objects
No ratings yet
JavaScript Objects
11 pages
Android DatePicker Tutorial Guide
No ratings yet
Android DatePicker Tutorial Guide
9 pages
Python Lists: Creation and Manipulation
No ratings yet
Python Lists: Creation and Manipulation
8 pages
RDP Penetration Testing Guide
No ratings yet
RDP Penetration Testing Guide
28 pages
Essential Networking Devices Guide
No ratings yet
Essential Networking Devices Guide
24 pages
Gaming Keyboard Key Features: Gaming Keyboard With Full Size Layout, Rainbow Wave Illumination and 12 Multimedia Keys
No ratings yet
Gaming Keyboard Key Features: Gaming Keyboard With Full Size Layout, Rainbow Wave Illumination and 12 Multimedia Keys
5 pages
Types of Computer Components Explained
No ratings yet
Types of Computer Components Explained
6 pages
C Programming MCQ Questions and Answers
No ratings yet
C Programming MCQ Questions and Answers
8 pages
Genesis Intermediate Python
No ratings yet
Genesis Intermediate Python
35 pages
Core Java Internship Report Summary
No ratings yet
Core Java Internship Report Summary
35 pages
iCUE SDK Lighting Service Access Update
No ratings yet
iCUE SDK Lighting Service Access Update
27 pages
Heart Patient Monitoring Chatbot System
No ratings yet
Heart Patient Monitoring Chatbot System
49 pages
Fix HTTPS Revocation Warnings in Web Apps
No ratings yet
Fix HTTPS Revocation Warnings in Web Apps
43 pages
WebPower Firmware Upgrade Guide
100% (1)
WebPower Firmware Upgrade Guide
8 pages
Windows Server 2022 Server Manager Guide
No ratings yet
Windows Server 2022 Server Manager Guide
20 pages
Managing Resources with cgroups py
No ratings yet
Managing Resources with cgroups py
4 pages
Overview of Data Storage Devices
No ratings yet
Overview of Data Storage Devices
10 pages
Algorithmic Problem Solving in Python
No ratings yet
Algorithmic Problem Solving in Python
205 pages
ABAP Programming Techniques and Tutorials
No ratings yet
ABAP Programming Techniques and Tutorials
6 pages
CLO-002 Exam Practice Questions
No ratings yet
CLO-002 Exam Practice Questions
4 pages
IBM Storage Defender Level 1 Quiz Results
100% (1)
IBM Storage Defender Level 1 Quiz Results
8 pages
File Sharing Web App Project Report
No ratings yet
File Sharing Web App Project Report
30 pages
IP PA System T-7800A Overview
No ratings yet
IP PA System T-7800A Overview
1 page
3027x ID String Duplication Guide
100% (1)
3027x ID String Duplication Guide
4 pages
SQL Reserved Words and Standards List
No ratings yet
SQL Reserved Words and Standards List
16 pages

10 Spark Challenges

Uploaded by

10 Spark Challenges

Uploaded by

10 SPARK CHALLENGES

Many Spark challenges relate to configuration, including the number

==> Challenge 1. How many executors should each job use?

One of the key advantages of Spark is parallelization – you run your

A Spark job using three cores to parallelize output. Up to three tasks

• Helps you save resources and money (not over-allocating)

==>Challenge2. How much memory should I allocate for each job?

If you have three executors in a 128GB cluster, and 16GB is taken up

This is just a starting point, however. You may need to be using a

==>Challenge 3. How do I find and eliminate data skew?

==>Challenge 4. How do I make my pipelines work better?

Existing Transformers create new Dataframes, with an Estimator

Challenge 5. How do I know if a specific job is optimized?

Other challenges come up at the cluster level, or even at the stack

==> Challenge 6. How do I size my nodes, and match them to the

A Spark node – a physical server or a cloud instance – will have an

On-premises, poor matching between nodes, physical servers,

In the cloud, “pay as you go” pricing shines a different type of

==> Challenge 7. How do I see what’s going on across the Spark

“Spark is notoriously difficult to tune and maintain,”

Key Spark advantages include accessibility to a wide range of users

Logs on cloud clusters are lost when a cluster is terminated, so

==> Challenge 8. Is my data partitioned correctly for my SQL queries?

So you have to do some or all of three things:

==> Challenge 9. When do I take advantage of auto-scaling?

The ability to auto-scale – to assign resources to a job just while it’s

Then profile your optimized application. You need to calculate ongoing

So you are meant to move each of your repeated, resource-intensive,

IMPACTS OF THESE CHALLENGES

•Under-allocation. It can be tricky to allocate your resources

You might also like