Spark Introduction

The document provides an introduction to Apache Spark, highlighting its evolution from earlier distributed systems and the need for a new generation to handle big data efficiently. It discusses Spark's architecture, features, and advantages over traditional MapReduce, emphasizing its in-memory processing capabilities and support for various programming languages. Additionally, it outlines the installation process and running Spark jobs, while acknowledging the competition from other technologies like Flink in the big data landscape.

Introduction to Apache Spark

Dr. Manish Kumar, Professor

Department of Information Technology
Indian Institute of Information Technology-Allahabad,
Prayagraj
Agenda
● Evolution of distributed systems
● Need for a new generation of distributed systems
● Hardware/software evolution in the last decade
● Apache Spark
● Why Spark?
● Who is using Spark?
Evolution of distributed systems
● First Generation

● Second Generation

● Third Generation
First distributed systems
● Proprietary
● Custom Hardware and software
● Centralized data
● Hardware based fault recovery

Ex: Teradata, Netezza, etc.

Second generation
● Open source
● Commodity hardware
● Distributed data
● Software based fault recovery

Ex: Hadoop, HPCC
Why do we need a new generation?
● A lot has changed since 2000
● Both hardware and software have gone through changes
● Big data has become a necessity
● Let's look at what changed over the decade
State of hardware in 2000
● Disk was cheap, so it was the primary source of data
● Network was costly, so data locality mattered
● RAM was very costly
● Single-core machines were dominant
State of hardware now
● RAM is king
● RAM is the primary source of data, and disk is used as a fallback
● Networks are much faster
● Multi-core machines are common
Software in 2000
● Object orientation was king
● Software was optimized for a single core
● No open frameworks for creating
○ Distributed storage
○ Distributed processing
● SQL was the only dominant option for data extraction/analysis
Software now
● Functional programming is on the rise
● Software needs to exploit multiple cores on a single node
● There are good frameworks for creating distributed systems
○ HDFS for storage
○ Apache Mesos / YARN for distributed processing
● NoSQL is a real alternative now
Big data processing needs in 2000
● Very few companies had big data problems
● Batch processing systems ruled the world
● Volume was a bigger concern than velocity
● Mostly used for
○ Search
○ Log analysis
Big data processing needs now
● Most companies use big data
● Velocity is as much of a concern as volume
● Real-time needs are as important as batch processing
● Use cases are no longer limited to search
Shortcomings of second generation
● Batch processing is the primary objective
● Not designed to adapt to different use cases
● Tight coupling between API and runtime
● Does not exploit new hardware capabilities
● Overly complex
Third generation distributed systems
● Handle both batch and real-time processing
● Exploit RAM as much as disk
● Multi-core aware
● Do not reinvent the wheel
● They use
○ HDFS for storage
○ Apache Mesos / YARN for distribution
● Play well with Hadoop
Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

[Figure: acyclic data flow in which Input feeds parallel Map tasks, whose results feed Reduce tasks that write Output]
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.
Motivation
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:

○ Iterative algorithms (machine learning, graphs)
○ Interactive data mining tools (R, Excel, Python)

With current frameworks, apps reload data from stable storage on each query.
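The cost of reloading a working set on every query can be illustrated with a minimal Python sketch. This is not Spark code: `StableStorage`, `run_queries_reloading`, and `run_queries_cached` are hypothetical names introduced here, and the read counter only stands in for expensive disk or network I/O.

```python
class StableStorage:
    """Hypothetical stand-in for disk-based storage that counts full reads."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0  # number of times the full dataset was loaded

    def load(self):
        self.reads += 1
        return list(self._records)


def run_queries_reloading(storage, queries):
    """Acyclic data flow style: every query reloads the data from storage."""
    return [query(storage.load()) for query in queries]


def run_queries_cached(storage, queries):
    """In-memory style: load the working set once and reuse it for each query."""
    working_set = storage.load()
    return [query(working_set) for query in queries]


queries = [sum, max, min]

reloading = StableStorage(range(5))
run_queries_reloading(reloading, queries)
print("reads when reloading:", reloading.reads)

cached = StableStorage(range(5))
run_queries_cached(cached, queries)
print("reads when cached:", cached.reads)
```

With three queries, the reloading version reads the dataset three times while the cached version reads it once; the gap grows with the number of iterations or interactive queries.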
Apache Spark
Apache Spark is an open-source cluster computing framework for large-scale data processing, including data generated in real time. Spark was designed as a faster alternative to Hadoop MapReduce: it is optimized to run in memory, whereas approaches like Hadoop's MapReduce write data to and read it from hard drives between steps. As a result, Spark can process data much faster than disk-based alternatives. Spark is written primarily in Scala and supports Java, Scala, Python, and R.
Contd…
Apache Spark supports data analysis, machine learning, graph processing, streaming data, and more. It can read from and write to a range of data sources and allows development in multiple languages: Scala, Java, Python, R, and SQL.

[Figure: Spark stack. APIs: DataFrames, ML Pipelines. Libraries: Spark SQL, Spark Streaming, MLlib, GraphX. Engine: Spark Core. Data sources: Hadoop HDFS, HBase, Hive, Amazon S3, Streaming, JSON, MySQL, and HPC-style file systems (GlusterFS, Lustre)]
History of Apache Spark
● Mesos, a distributed systems framework, started as a class project at UC Berkeley in 2009.
● Spark was created to test how Mesos works. Mesos pools datacenter resources (CPU, RAM, storage) across machines, acting as a "datacenter-level operating system" to enable efficient, shared utilization.
● Focused on
○ Iterative programs (ML)
○ Interactive querying
○ Unifying real-time and batch processing
● Open sourced in 2010
Apache Spark EcoSystem

Apache Spark, Apache Spark Ecosystem

[Link]
Spark SQL
The Spark SQL component is a distributed framework for processing structured and semi-structured data.

Spark Streaming
Spark Streaming allows scalable, high-throughput, fault-tolerant stream processing of live data streams.

MLlib/ML
MLlib in Spark is a scalable machine learning library that delivers both high-quality algorithms and high speed.

GraphX
GraphX in Spark is an API for graphs and graph-parallel execution. It serves as a network graph analytics engine.
Features of Apache Spark
➢ Fast - It provides high performance for both batch and streaming data, using a state-of-the-art DAG (Directed Acyclic Graph) scheduler, a query optimizer, and a physical execution engine.
➢ Easy to Use - Applications can be written in Java, Scala, Python, R, and SQL.
➢ Generality - It provides a collection of libraries, including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
➢ Lightweight - It is a light, unified analytics engine for large-scale data processing.
➢ Runs Everywhere - It can run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.
Uses of Spark
Data integration: The data generated by different systems is often not consistent enough to combine for analysis. To obtain consistent data, we can use processes such as extract, transform, and load (ETL). Spark is used to reduce the cost and time required for the ETL process.
Stream processing: Spark can handle data generated in real time. It can operate on streams of data and, for example, reject potentially fraudulent operations.
Machine learning: Spark can store data in memory and run repeated queries quickly, which makes it well suited to machine learning algorithms.
Interactive analytics: Spark can query data interactively.
MapReduce (Hadoop)
MapReduce Bottlenecks and Improvements
• Bottlenecks
○ MapReduce is very I/O heavy
○ The map phase needs to read from disk and then write back out
○ The reduce phase needs to read from disk and then write back out
• How can we improve it?
○ RAM is becoming very cheap and abundant
○ Use RAM for in-memory data sharing
MapReduce with Spark

● Apache Spark supports map and reduce operations using its in-memory processing model.
● Spark can replace traditional Hadoop MapReduce and performs the same computations faster.
● It uses RDDs and transformations instead of disk-based MapReduce jobs.
● MapReduce-style processing in Spark is simpler to implement using high-level APIs.
● Spark improves performance for iterative and real-time data processing tasks.
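As a minimal sketch of the MapReduce-style flow described above, the following plain Python word count keeps every intermediate result in memory. It only mimics the map, shuffle, and reduce steps on a single machine; it does not use Spark, and the function names are introduced here for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Chain the three phases without writing intermediate data to disk."""
    return reduce_phase(shuffle(map_phase(lines)))


print(word_count(["spark is fast", "spark runs in memory"]))
```

In Hadoop MapReduce, the output of each phase would be written to disk before the next phase reads it; here the intermediate lists and dictionaries simply stay in RAM, which is the idea Spark applies at cluster scale.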
PageRank Example

● PageRank is an algorithm used by search engines to rank web pages based on their importance.
● It determines importance by analyzing the number and quality of links pointing to a page.
● Pages with more incoming links from important pages get a higher rank.

Objective of PageRank

● To assign a numerical rank to each web page.
● A higher PageRank means the page is more relevant or authoritative.
● Helps search engines order search results.
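The idea can be sketched as a small iterative computation in plain Python. This is an illustrative version, not the slides' own implementation: the damping factor of 0.85, the fixed iteration count, and the handling of pages without outgoing links are assumptions, and the three-page graph is made up.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Iteratively compute PageRank for a dict mapping page -> list of linked pages."""
    pages = set(links) | {target for targets in links.values() for target in targets}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}  # start with a uniform rank

    for _ in range(iterations):
        # Every page receives a base share, independent of incoming links.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
example_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, rank in sorted(pagerank(example_links).items()):
    print(page, round(rank, 3))
```

Each iteration reuses the ranks from the previous one, which is why PageRank is a typical example of an iterative workload that benefits from keeping its working set in memory.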

MapReduce vs. Spark (Performance) (Cont.)
• Daytona Gray 100 TB sorting results
• [Link] (accessed on 11th March 2026)

                 MapReduce Record   Spark Record       Spark Record 1PB
Data Size        102.5 TB           100 TB             1000 TB
# Nodes          2100               206                190
# Cores          50400 physical     6592 virtualized   6080 virtualized
Elapsed Time     72 mins            23 mins            234 mins
Sort rate        1.42 TB/min        4.27 TB/min        4.27 TB/min
Sort rate/node   0.67 GB/min        20.7 GB/min        22.5 GB/min
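The sort rates in the table follow from the data size, elapsed time, and node count. The sketch below recomputes them in Python, assuming 1 TB = 1000 GB. Because the elapsed times in the table are rounded to whole minutes, the recomputed values can differ slightly from the published ones (most visibly for the 100 TB Spark run).

```python
# Figures copied from the table above.
records = {
    "MapReduce record": {"data_tb": 102.5, "nodes": 2100, "minutes": 72},
    "Spark record (100 TB)": {"data_tb": 100.0, "nodes": 206, "minutes": 23},
    "Spark record (1 PB)": {"data_tb": 1000.0, "nodes": 190, "minutes": 234},
}


def sort_rate_tb_per_min(record):
    """Overall sort rate: data sorted divided by elapsed time."""
    return record["data_tb"] / record["minutes"]


def sort_rate_per_node_gb_per_min(record):
    """Per-node sort rate, assuming 1 TB = 1000 GB."""
    return sort_rate_tb_per_min(record) * 1000.0 / record["nodes"]


for name, record in records.items():
    print(
        f"{name}: {sort_rate_tb_per_min(record):.2f} TB/min, "
        f"{sort_rate_per_node_gb_per_min(record):.2f} GB/min per node"
    )
```

The per-node figure is the more telling comparison: the Spark runs sort roughly thirty times more data per node per minute than the MapReduce record while using about a tenth of the machines.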
SparkContext

• A Spark program first creates a SparkContext object
• The Spark shell automatically creates a SparkContext as the sc variable
• It tells Spark how and where to access a cluster
• Use the SparkContext to create RDDs
• Documentation
Overview - Spark 4.1.1 Documentation (accessed on 11-03-2026)
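A minimal PySpark sketch of these steps is shown below. It assumes the `pyspark` package and a Java runtime are installed and uses a local master (`local[*]`) instead of a real cluster; the function name and the application name are made up for illustration. The import sits inside the function so that the definition can be loaded even where PySpark is not available.

```python
def count_even_numbers(master="local[*]"):
    """Create a SparkContext, build an RDD, and count its even elements."""
    # Imported here so the module loads even if PySpark is not installed.
    from pyspark import SparkConf, SparkContext

    # The configuration tells Spark how and where to access a cluster.
    conf = SparkConf().setAppName("intro-example").setMaster(master)
    sc = SparkContext.getOrCreate(conf)
    try:
        numbers = sc.parallelize(range(10))           # create an RDD from a local collection
        evens = numbers.filter(lambda x: x % 2 == 0)  # transformation: nothing runs yet
        return evens.count()                          # action: triggers the computation
    finally:
        sc.stop()  # release the resources held by the context
```

In the Spark shell or `pyspark`, the context already exists as `sc`, so only the three lines inside the `try` block would be needed.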
Resilient Distributed Dataset (RDD)

● The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster.
● The RDD is the fundamental data structure of Spark. RDDs are immutable, distributed collections of objects of any type. As the name suggests, an RDD is a resilient (fault-tolerant) collection of records that resides on multiple nodes.
Figure: Apache Spark architecture
Spark Architecture

Apache Spark, Cluster Mode Overview

[Link]
(accessed on 11-03-2026)
A DAG in Apache Spark is a set of vertices and edges, where the vertices represent RDDs and the edges represent the operations to be applied to those RDDs.
⦿ SPARK DRIVER:
○ Separate process that executes the user application
○ Creates a SparkContext to schedule job execution and negotiate with the cluster manager
⦿ EXECUTORS:
○ Run tasks scheduled by the driver
○ Store computation results in memory, on disk, or off-heap
○ Interact with storage systems
⦿ CLUSTER MANAGER:
○ The SparkContext works with the cluster manager to manage various jobs
○ The driver program and SparkContext take care of job execution within the cluster
⦿ The Apache Spark architecture is based on two main abstractions:
○ Resilient Distributed Dataset (RDD)
○ Directed Acyclic Graph (DAG)
❑ RDDs support two types of operations:
○ Transformations: operations applied to an RDD to create a new RDD.
○ Actions: operations applied to an RDD that instruct Apache Spark to run the computation and pass the result back to the driver.
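The difference between transformations and actions can be sketched with a toy class in plain Python. `ToyRDD` is a hypothetical, single-machine stand-in and not Spark's RDD implementation: transformations only record a step in the lineage and return a new immutable object, while actions replay the recorded steps and return a result.

```python
class ToyRDD:
    """Toy, single-machine illustration of lazy transformations and eager actions."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # immutable input data
        self._lineage = tuple(lineage)  # recorded (kind, function) steps

    # Transformations: record the step and return a NEW ToyRDD; nothing is computed.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    # Actions: replay the lineage over the source and return a result.
    def collect(self):
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    def count(self):
        return len(self.collect())


numbers = ToyRDD(range(10))
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(even_squares.collect())
print(numbers.count())
```

Because each object keeps its source and lineage, a lost result can be recomputed by replaying the recorded steps, which mirrors in miniature how RDD lineage supports fault tolerance.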
Spark Installation
Download the Apache Spark tar file
[Link]

Extract the downloaded tar file.

sudo tar -xzvf /home/username/[Link]

Open the .bashrc file.

nano ~/.bashrc

Add the following Spark path settings at the end of the file.

export SPARK_HOME=/home/user_name/spark-2.4.1-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH

Reload the environment variables.

source ~/.bashrc
Running Spark Jobs
• Shell
○ Shell for running Scala code
$ spark-shell
○ Shell for running Python code
$ pyspark
○ Shell for running R code
$ sparkR
• Submitting (Java, Scala, Python, R)
$ spark-submit --class {MAIN_CLASS} [OPTIONS] {PATH_TO_FILE} {ARG0} {ARG1} … {ARGN}
(--class applies to Java and Scala applications)
⦿ Spark makes it easy to write and run complicated data processing jobs
⦿ It enables computation at a very large scale
⦿ Although Spark has limitations, it is still widely used in the big data world
⦿ Because of these limitations, other technologies are competing with Spark
⦿ For example, Flink offers more complete real-time processing than Spark
⦿ In this way, other technologies address some of Spark's drawbacks
