
Introduction to Apache Spark

Dr. Manish Kumar, Professor


Department of Information Technology
Indian Institute of Information Technology-Allahabad,
Prayagraj
Agenda
● Evolution of distributed systems
● Need for a new generation of distributed systems
● Hardware/software evolution in the last decade
● Apache Spark
● Why Spark?
● Who is using Spark?
Evolution of distributed systems
● First Generation

● Second Generation

● Third Generation
First distributed systems
● Proprietary
● Custom Hardware and software
● Centralized data
● Hardware based fault recovery

Ex: Teradata, Netezza, etc.


Second generation
● Open source
● Commodity hardware
● Distributed data
● Software based fault recovery

Ex: Hadoop, HPCC
Why do we need a new generation?
● A lot has changed since 2000
● Both hardware and software have gone through changes
● Big data has become a necessity
● Let’s look at what changed over the decade
State of hardware in 2000
● Disk was cheap, so it was the primary source of data
● Network was costly, so data locality mattered
● RAM was very costly
● Single-core machines were dominant
State of hardware now
● RAM is the king
● RAM is the primary source of data, with disk used as a fallback
● Networks are much faster
● Multi-core machines are common
Software in 2000
● Object orientation was the king
● Software was optimized for a single core
● No open frameworks for creating
○ Distributed storage
○ Distributed processing
● SQL was the only dominant option for data extraction/analysis
Software now
● Functional programming is on the rise
● Software needs to exploit multiple cores on a single node
● There are good frameworks for creating distributed systems
○ HDFS for storage
○ Apache Mesos / YARN for distributed processing
● NoSQL is a real alternative now
Big data processing needs in 2000
● Very few companies had big data problems
● Batch processing systems ruled the world
● Volume was a bigger concern than velocity
● Mostly used for
○ Search
○ Log analysis
Big data processing needs now
● Most companies use big data
● Velocity is as much a concern as volume
● Real-time needs are as important as batch processing
● Use cases are no longer limited to search
Shortcomings of the second generation
● Batch processing is the primary objective
● Not designed to adapt to different use cases
● Tight coupling between API and runtime
● Does not exploit new hardware capabilities
● Too complex
Third generation distributed systems
● Handle both batch and real-time processing
● Exploit RAM as much as disk
● Multi-core aware
● Do not reinvent the wheel
● They use
○ HDFS for storage
○ Apache Mesos / YARN for distribution
● Play well with Hadoop
Motivation
Most current cluster programming models are based
on acyclic data flow from stable storage to stable
storage

[Figure: acyclic data flow, Input → Map → Reduce → Output]
Benefits of data flow: runtime can decide
where to run tasks and can automatically
recover from failures

Motivation
Acyclic data flow is inefficient for applications that
repeatedly reuse a working set of data:

○ Iterative algorithms (machine learning, graphs)
○ Interactive data mining tools (R, Excel, Python)

With current frameworks, apps reload data from stable
storage on each query
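
The cost of reloading a working set on every query can be illustrated with a small sketch. This is plain Python, not Spark code: the `Storage` class and the two helper functions are hypothetical names introduced here, and a list of numbers stands in for data kept on stable storage.

```python
class Storage:
    """Hypothetical stand-in for stable storage (disk, HDFS); counts reads."""

    def __init__(self):
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(range(1, 1001))


def run_queries_reloading(storage, queries):
    # Acyclic data flow: every query starts again from stable storage.
    return [query(storage.load()) for query in queries]


def run_queries_cached(storage, queries):
    # The working set is loaded once and kept in memory for all queries.
    working_set = storage.load()
    return [query(working_set) for query in queries]


queries = [sum, max, len]

reloading_storage = Storage()
reload_results = run_queries_reloading(reloading_storage, queries)

cached_storage = Storage()
cached_results = run_queries_cached(cached_storage, queries)

print("reads when reloading:", reloading_storage.reads)
print("reads when cached:", cached_storage.reads)
```

Both strategies return the same answers; only the number of reads from storage differs, and that gap grows with the number of queries or iterations.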
Apache Spark
Apache Spark is an open-source cluster computing framework.
Its primary purpose is to process large volumes of data quickly,
including data generated in real time. Spark was developed as a
successor to the Hadoop MapReduce model and can run on Hadoop
clusters. It is optimized to run in memory, whereas approaches
like Hadoop's MapReduce write intermediate data to disk between
steps, so Spark can process data much more quickly for many
workloads. Spark is primarily written in Scala and supports
Java, Scala, Python, and R.
Contd…
Apache Spark supports data analysis, machine learning,
graphs, streaming data, etc. It can read from and write to a
range of data sources and allows development in multiple
languages: Scala, Java, Python, R, SQL.

[Figure: Spark stack. Spark SQL, MLlib, GraphX, and Spark Streaming (with the DataFrames and ML Pipelines APIs) sit on top of Spark Core.]

Data Sources

Hadoop HDFS, HBase, Hive, Amazon S3, Streaming, JSON, MySQL, and HPC-style (GlusterFS,
Lustre)
History of Apache Spark
● Mesos, a distributed systems framework, started as a class
project at UC Berkeley in 2009.
● Spark was created to test how Mesos works. Mesos pools
datacenter resources (CPU, RAM, storage) across machines,
acting as a "datacenter-level operating system" to enable
efficient, shared utilization.
● Focused on
○ Iterative programs (ML)
○ Interactive querying
○ Unifying real-time and batch processing
● Open sourced in 2010
Apache Spark EcoSystem

Apache Spark, Apache Spark Ecosystem


[Link]
Spark SQL
The Spark SQL component is a distributed framework for structured
and semi-structured data processing.

Spark Streaming
Spark Streaming allows scalable, high-throughput, fault-tolerant
stream processing of live data streams.

MLlib/ML
MLlib is Spark's scalable machine learning library, offering both
high-quality algorithms and high speed.

GraphX
GraphX is Spark's API for graphs and graph-parallel execution. It
serves as an engine for network graph analytics.
Features of Apache Spark
➢ Fast - It provides high performance for both batch and streaming data, using a
state-of-the-art DAG (Directed Acyclic Graph) scheduler, a query optimizer,
and a physical execution engine.

➢ Easy to Use - It lets you write applications in Java, Scala, Python, R,
and SQL.

➢ Generality - It provides a collection of libraries including SQL and
DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.

➢ Lightweight - It is a lightweight unified analytics engine used for
large-scale data processing.

➢ Runs Everywhere - It can run on Hadoop, Apache Mesos, Kubernetes,
standalone, or in the cloud.
Uses of Spark
Data integration: The data generated by different systems is often
not consistent enough to combine for analysis. To obtain consistent
data, we can use extract, transform, and load (ETL) processes.
Spark is used to reduce the cost and time required for ETL.
Stream processing: Spark can handle data generated in real time.
It can operate on streams of data and, for example, reject
potentially fraudulent operations.
Machine learning: Spark can store data in memory and run repeated
queries quickly, which makes it well suited to machine learning
algorithms.
Interactive analytics: Spark can handle the data interactively.
MapReduce (Hadoop)
MapReduce Bottlenecks and Improvements
• Bottlenecks
  • MapReduce is a very I/O-heavy operation
  • Map phase needs to read from disk and then write back out
  • Reduce phase needs to read from disk and then write back out
• How can we improve it?
  • RAM is becoming very cheap and abundant
  • Use RAM for in-memory data sharing
MapReduce with Spark

● Apache Spark supports Map and Reduce operations using its in-memory
processing model.

● Spark can replace traditional Hadoop MapReduce and performs many
computations faster.

● It uses RDDs and transformations instead of disk-based MapReduce jobs.

● MapReduce-style processing in Spark is simpler to implement using
high-level APIs.

● Spark improves performance for iterative and real-time data processing
tasks.
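
The map and reduce steps mentioned above can be sketched with a word count. This is plain Python that keeps everything in memory; the sample lines are made up for illustration. In PySpark the same pipeline is usually expressed on an RDD with `flatMap`, `map`, and `reduceByKey`, which is not shown here.

```python
from collections import defaultdict
from functools import reduce

# Made-up input lines, used only for illustration.
lines = ["spark is fast", "spark runs in memory", "hadoop writes to disk"]

# Map step: emit a (word, 1) pair for every word.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle step: group the emitted values by key.
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# Reduce step: combine the values of each key into one result.
word_counts = {}
for word, counts in grouped.items():
    word_counts[word] = reduce(lambda a, b: a + b, counts)

print(word_counts)
```

In Hadoop MapReduce the intermediate pairs would be written to disk between the map and reduce phases; here, as in Spark's model, they stay in memory.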
PageRank Example

● PageRank is an algorithm used by search engines to rank web pages based
on their importance.

● It determines importance by analyzing the number and quality of links
pointing to a page.

● Pages with more incoming links from important pages get higher rank.

Objective of PageRank

● To assign a numerical rank to each web page.

● Higher PageRank means the page is more relevant or authoritative.

● Helps search engines order search results.
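
A minimal sketch of the iterative computation follows, in plain Python rather than Spark. The damping factor of 0.85, the iteration count, and the four-page link graph are assumptions introduced here and do not come from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively rank pages; links maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}  # start with equal ranks

    for _ in range(iterations):
        # Every page receives a small base rank.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue  # pages without links pass on nothing in this sketch
            # A page splits its damped rank evenly among its outgoing links.
            share = damping * ranks[page] / len(outgoing)
            for target in outgoing:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical link graph: C is linked from A, B, and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)

for page, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(rank, 4))
```

Each iteration reuses the same link graph, which is the kind of repeatedly accessed working set that benefits from being kept in memory.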


MapReduce vs. Spark (Performance) (Cont.)
• Daytona Gray 100 TB sorting results
• [Link] [Link] (accessed on 11th March 2026)

                 MapReduce Record   Spark Record       Spark Record 1 PB
Data Size        102.5 TB           100 TB             1000 TB
# Nodes          2100               206                190
# Cores          50400 physical     6592 virtualized   6080 virtualized
Elapsed Time     72 mins            23 mins            234 mins
Sort rate        1.42 TB/min        4.27 TB/min        4.27 TB/min
Sort rate/node   0.67 GB/min        20.7 GB/min        22.5 GB/min
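
The two sort-rate rows can be recomputed from the data size, node count, and elapsed time. The sketch below assumes 1 TB = 1000 GB and uses the rounded values from the table, so results can deviate slightly from the listed rates (most visibly for the 100 TB Spark run, presumably because its elapsed time is rounded to whole minutes).

```python
# Values copied from the table above.
records = {
    "MapReduce record": {"data_tb": 102.5, "nodes": 2100, "minutes": 72},
    "Spark record": {"data_tb": 100.0, "nodes": 206, "minutes": 23},
    "Spark record 1 PB": {"data_tb": 1000.0, "nodes": 190, "minutes": 234},
}


def sort_rates(data_tb, nodes, minutes):
    """Return (TB per minute, GB per minute per node)."""
    rate_tb_per_min = data_tb / minutes
    # Assumption: 1 TB = 1000 GB.
    rate_gb_per_min_per_node = rate_tb_per_min * 1000 / nodes
    return rate_tb_per_min, rate_gb_per_min_per_node


rates = {name: sort_rates(**values) for name, values in records.items()}

for name, (total, per_node) in rates.items():
    print(f"{name}: {total:.2f} TB/min, {per_node:.2f} GB/min per node")
```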
SparkContext

• A Spark program first creates a SparkContext object
• Spark Shell automatically creates a SparkContext as the sc variable
• Tells Spark how and where to access a cluster
• Use SparkContext to create RDDs
• Documentation
Overview - Spark 4.1.1 Documentation (accessed on 11-03-2026)
Resilient Distributed Dataset (RDD)

● The main abstraction Spark provides is a resilient distributed dataset
(RDD), which is a collection of elements partitioned across the nodes of
the cluster

● The RDD is the fundamental data structure of Spark. RDDs are immutable
distributed collections of objects of any type. As the name suggests, an
RDD is a resilient (fault-tolerant) collection of records that resides on
multiple nodes.
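
Partitioning and fault tolerance can be sketched in plain Python. This is a toy model and not Spark's implementation: the helper names are hypothetical, each tuple stands in for a partition held by one node, and recovery is modeled as rerunning the recorded operation on the matching source partition.

```python
def split_into_partitions(elements, num_partitions):
    """Split a sequence into contiguous, roughly equal partitions."""
    size, extra = divmod(len(elements), num_partitions)
    partitions, start = [], 0
    for index in range(num_partitions):
        end = start + size + (1 if index < extra else 0)
        partitions.append(tuple(elements[start:end]))  # tuples model immutability
        start = end
    return partitions


def apply_to_partition(func, partition):
    """Run a recorded operation on one partition only."""
    return tuple(func(x) for x in partition)


def square(x):
    return x * x


# Source dataset spread over three hypothetical nodes.
source = split_into_partitions(list(range(10)), 3)

# Derived dataset, produced by applying square to every partition.
derived = [apply_to_partition(square, part) for part in source]

# Simulate losing the node that held partition 1.
derived[1] = None

# Recovery: recompute only the lost partition from its source partition.
derived[1] = apply_to_partition(square, source[1])

print(source)
print(derived)
```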

Figure: Apache Spark architecture
Spark Architecture

Apache Spark, Cluster Mode Overview


[Link]
(accessed on 11-03-2026)
A DAG in Apache Spark is a set of vertices and edges, where the vertices
represent the RDDs and the edges represent the operations to be applied
to the RDDs.
⦿ SPARK DRIVER:
  ⦿ Separate process that executes the user application
  ⦿ Creates the SparkContext to schedule job execution and
    negotiate with the cluster manager
⦿ EXECUTORS:
  ⦿ Run tasks scheduled by the driver
  ⦿ Store computation results in memory, on disk, or off-heap
  ⦿ Interact with storage systems
⦿ CLUSTER MANAGER:
  ⦿ The SparkContext works with the cluster manager to
    manage various jobs
  ⦿ The driver program and SparkContext take care of job
    execution within the cluster
⦿ Apache Spark architecture is based on two
main abstractions:
  ⦿ Resilient Distributed Dataset (RDD)
  ⦿ Directed Acyclic Graph (DAG)
❑ RDDs support two types of operations:
❑ Transformations: operations applied to an RDD to
create a new RDD.
❑ Actions: operations applied on an RDD that instruct
Apache Spark to run the computation and pass the
result back to the driver.
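
The difference between the two kinds of operations can be sketched with a toy class in plain Python. `ToyRDD` and `CountingDoubler` are hypothetical names and this is not the Spark API: transformations only record a step and return a new object, while an action runs the recorded steps and returns a result.

```python
class ToyRDD:
    """Toy stand-in for an RDD; records transformations, runs them on an action."""

    def __init__(self, data, steps=()):
        self._data = tuple(data)
        self._steps = tuple(steps)  # recorded operations, like edges in a DAG

    def map(self, func):
        # Transformation: returns a new ToyRDD, nothing is computed yet.
        return ToyRDD(self._data, self._steps + (("map", func),))

    def filter(self, predicate):
        # Transformation: returns a new ToyRDD, nothing is computed yet.
        return ToyRDD(self._data, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: run every recorded step and hand the result to the caller.
        result = list(self._data)
        for kind, func in self._steps:
            if kind == "map":
                result = [func(x) for x in result]
            else:
                result = [x for x in result if func(x)]
        return result


class CountingDoubler:
    """Doubles a value and counts how often it was called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return x * 2


doubler = CountingDoubler()
numbers = ToyRDD(range(1, 6))
pipeline = numbers.map(doubler).filter(lambda x: x % 4 == 0)

calls_before_action = doubler.calls  # transformations alone did no work
result = pipeline.collect()          # the action triggers the computation

print(calls_before_action, doubler.calls, result)
```

The original `numbers` object is unchanged by the pipeline, which mirrors the immutability of RDDs.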
Spark Installation
Download the Apache Spark tar file
[Link]

Extract the downloaded tar file.

sudo tar -xzvf /home/username/[Link]

Open the bashrc file.
nano ~/.bashrc
Add the following Spark path settings at the end of the file.

export SPARK_HOME=/home/user_name/spark-2.4.1-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH

Reload the environment variables.

source ~/.bashrc
Running Spark Jobs
• Shell
  • Shell for running Scala code
    $ spark-shell
  • Shell for running Python code
    $ pyspark
  • Shell for running R code
    $ sparkR
• Submitting (Java, Scala, Python, R)
    $ spark-submit --class {MAIN_CLASS} [OPTIONS] {PATH_TO_FILE} {ARG0} {ARG1} … {ARGN}
⦿ Spark makes it easy to write and run complicated data
processing
⦿ It enables computation of tasks at a very large scale
⦿ Although Spark has limitations, it is still trending in the big
data world
⦿ Because of these limitations, other technologies are
gaining ground on Spark
⦿ For example, Flink offers more complete real-time processing
than Spark
⦿ In this way, other technologies address some of the
drawbacks of Spark
