
Spark Programming for Data Science

The document provides an overview of Apache Spark, highlighting its structure, features, and components such as Spark SQL, Spark ML, and Spark Streaming. It explains the concept of Resilient Distributed Datasets (RDDs), their transformations, and actions, as well as how to process JSON files using Spark. Additionally, it includes examples of Spark programming and key-value pair operations.

Uploaded by

Sathvik Cisco


INFO H516 Cloud Computing for Data Science

Week 3: Spark
Spark Structure
[Diagram: Driver, Master, and worker nodes]

9/15/2025 2
Features
• Highly accessible
  • Scala, Python, Java, R, SQL
• Flexible
  • Can run on Hadoop clusters
  • Compatible with HDFS

Spark Stack

Spark Core
• Basic functionalities
• Task scheduling
• Memory management
• Fault tolerance
• RDDs: Resilient Distributed Datasets
• APIs for RDD operations

Other Components
• Spark SQL
  • Integrates the Spark programming framework with SQL
  • Enables access to distributed data storage
• Spark ML
  • Learning from large datasets
  • Includes basic statistics, classifiers, regression models, clustering, and recommendation models
• GraphX
  • Processing large-scale networks
• Spark Streaming
  • Processing large streams of data
  • Distributed, real-time analysis
Cluster Manager
• Spark can run on different cluster managers
  • Standalone (native) scheduler
  • Mesos
  • YARN

Driver program
• Starting point for every Spark process
• Launches and manages the various parallel operations
• Contains the main function of the program
• Connects to Spark through a SparkContext object
• Distributes tasks by creating and managing executors on the worker nodes

Spark Programming - Flow
• Input data
• Data analysis (Task)
• Output data

Spark Programming - Flow

Reading into an RDD

External data (input) → RDD → Output data
Spark RDD: Resilient Distributed Datasets
• Special data structure for distributing datasets
  • Immutable
  • Partitioned collection
  • Easily parallelized and scalable
• Can be read from:
  • Local files
  • Distributed files
  • Databases
• Performance improvement over reading directly from hard disk
  • Direct disk reads are what other platforms (e.g. Hadoop) use
  • In-memory processing
  • Efficient input/output operations
• Fault tolerant

Spark Programming - Flow

Source: [Link]
Spark features
• Immutable RDD
  • The same RDD never changes
  • Computational output is saved in a new RDD
• Lazy evaluation
  • Transformations are recorded and scheduled, not executed immediately
  • They are executed when an action is called
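
The two properties above can be illustrated without a cluster. The following is a minimal plain-Python sketch, not Spark's implementation: `ToyRDD`, `noisy_double`, and the other names are hypothetical and exist only for this illustration. It shows that a transformation returns a new object and runs nothing until an action (`collect`) is called.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's implementation)."""

    def __init__(self, data, steps=()):
        self._data = tuple(data)    # source data is never modified
        self._steps = tuple(steps)  # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: record the step and return a NEW object.
        return ToyRDD(self._data, self._steps + (("map", fn),))

    def filter(self, fn):
        # Transformation: also only recorded.
        return ToyRDD(self._data, self._steps + (("filter", fn),))

    def collect(self):
        # Action: the recorded steps are executed only now, in order.
        items = list(self._data)
        for kind, fn in self._steps:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items


calls = []

def noisy_double(x):
    calls.append(x)  # lets us observe when the function actually runs
    return 2 * x

base = ToyRDD(range(5))
doubled = base.map(noisy_double)       # nothing is computed here
calls_before_action = len(calls)       # still 0: evaluation is lazy
result = doubled.filter(lambda x: x > 2).collect()  # the action triggers the work
```

After the action, `base` still yields its original values, which mirrors the idea that output is saved in a new dataset rather than written back into the old one.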

Let’s get started!!

Reading into an RDD
• From Python variables
val = sc.parallelize(data)  # data: any Python collection, such as a list

• Files
  • Local file:
txtfl = sc.textFile(textfile)

  • HDFS
• Database

RDD Transformations
• Converting one RDD to another
• Input: RDD; output: a different RDD
• Functions
  • filter
  • flatMap
  • union, intersection, distinct, subtract
  • map
  • reduceByKey
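
To make the semantics of these functions concrete, here is a small sketch of plain-Python analogues that work on ordinary lists. The `local_*` helpers are hypothetical names introduced for this illustration; in Spark the corresponding operations are methods on an RDD and run across partitions.

```python
from itertools import chain

def local_map(fn, items):
    # One output element per input element.
    return [fn(x) for x in items]

def local_filter(fn, items):
    # Keep only the elements for which fn returns a true value.
    return [x for x in items if fn(x)]

def local_flat_map(fn, items):
    # Like map, but each result is an iterable that gets flattened.
    return list(chain.from_iterable(fn(x) for x in items))

def local_reduce_by_key(fn, pairs):
    # Merge the values of each key pairwise with fn.
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return sorted(merged.items())

lines = ["to be or", "not to be"]
words = local_flat_map(lambda line: line.split(), lines)
pairs = local_map(lambda w: (w, 1), words)
counts = local_reduce_by_key(lambda a, b: a + b, pairs)
long_words = local_filter(lambda w: len(w) > 2, words)
```

Note the difference between `map` and `flatMap` here: mapping `split` over the two lines would give two lists, while the flattened version gives six individual words.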

Example
dataRDD = sc.parallelize(range(100))
[Link]()
even_odd_RDD = dataRDD.map(lambda x: (x, x % 2))
even_odd_RDD.collect()
[Link](lambda x:x[0]+1).collect()
[Link]()
[Link](5)
[Link]()
[Link](lambda x:x).collect()

RDD Actions
• Converting an RDD to:
  • A standard Python variable
  • Output written to an external file/database
• Input: RDD; output: external
• Functions:
  • take(n)
  • first()
  • count()
  • reduce()
  • countByValue()

Spark RDD: Action Operations
• collect():
data=[Link]()

• first():
data=[Link]()

• take(n):
data=[Link](10)

• count():
size=[Link]()
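
As a rough guide to what each action returns, the sketch below shows plain-Python analogues on an ordinary list. The variable names and the sample data are assumptions made for this illustration; each comment names the RDD action that the line mimics.

```python
from collections import Counter
from functools import reduce

data = [3, 1, 3, 2, 3, 1]

first_item = data[0]                      # analogue of first()
first_two = data[:2]                      # analogue of take(2)
size = len(data)                          # analogue of count()
total = reduce(lambda a, b: a + b, data)  # analogue of reduce(lambda a, b: a + b)
value_counts = dict(Counter(data))        # analogue of countByValue()
```

In each case the result is an ordinary Python value (a number, a list, or a dict), which is what distinguishes an action from a transformation.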

Functions
• Lambda functions
• For more complex processes
  • Standard Python function

def m(x):
    return (x, 1)

lambda x: (x, 1)

rdd_data.map(lambda x: (x, 1))
OR
rdd_data.map(m)
#input
file1 = 'to be or not to be'

#output
final_output = {}

def map(id, txt):
    # Emit a (word, 1) pair for every word in the text
    mapop = []
    words = txt.split()
    for word in words:
        tmp = (word, 1)
        mapop.append(tmp)
    return mapop

def reduce(key, val):
    # Accumulate the count for each key in final_output
    if key not in final_output:
        final_output[key] = 0
    final_output[key] += 1

if __name__ == '__main__':
    ###MAPPING
    ##for each file
    map_output = map(0, file1)
    print(map_output)

    ##REDUCING
    for k, v in map_output:
        reduce(k, v)
    print(final_output)

# Compact lambda forms of the same steps (the reduce form collects a list of
# values per key, so final_output[k] must start as []):
# map = lambda x, y: [(k, 1) for k in y.split()]
# reduce = lambda x, y: final_output[x] + [y]
Example: Word Count
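from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('WordCount')
sc = SparkContext(conf=conf)

# Read the input file: one RDD element per line
t1 = sc.textFile('[Link]')
print(t1.collect())

# Split each line into words
f1 = t1.flatMap(lambda x: x.split())
print(f1.collect())

# Pair every word with a count of 1
m1 = f1.map(lambda x: (x, 1))
print(m1.collect())

# Sum the counts for each word
r1 = m1.reduceByKey(lambda x, y: x + y)
print(r1.collect())

# Write the result to the 'output' directory
r1.saveAsTextFile('output')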
Key-Value pairs
• Creating key-value pairs
  • Map function returns key-value pairs
  • Input: name (RDD containing a list of names)
  • Output → key: initials; value: full name

def getInitials(x):
    # Assumes each name is exactly "first last"
    fn, ln = x.split()
    initial = fn[0] + ln[0]
    return initial

name.map(lambda x: (getInitials(x), x))

Transformations on Key,Val pairs
Function          Description
reduceByKey       Combines the values that share a key
mapValues         Operates on the values only; keys are unchanged
groupByKey        Groups the values by key
keys()            Produces the keys
values()          Produces the values
sortByKey()       Sorts the RDD by key
filter()          Selects the part of the RDD that satisfies a condition

Involving two RDDs
Function          Description
subtractByKey()   Removes items whose key is present in the second RDD
join              Inner join between two RDDs (common keys only)
leftOuterJoin     Join that keeps the common keys and the keys present only in the first RDD
rightOuterJoin    Join that keeps the common keys and the keys present only in the second RDD
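
Which keys survive each two-RDD operation is easy to mix up, so here is a minimal plain-Python sketch on lists of (key, value) pairs. The helper functions and the sample data are hypothetical and only mimic the result shapes of the corresponding pair-RDD methods; missing sides of an outer join are filled with `None`.

```python
def by_key(pairs):
    # Sort by key only, so the output order is predictable.
    return sorted(pairs, key=lambda kv: kv[0])

def inner_join(left, right):
    # Keep only keys present on both sides.
    return by_key([(k, (lv, rv)) for k, lv in left for rk, rv in right if k == rk])

def left_outer_join(left, right):
    # Common keys, plus keys found only in the first (left) dataset.
    right_keys = {k for k, _ in right}
    unmatched = [(k, (lv, None)) for k, lv in left if k not in right_keys]
    return by_key(inner_join(left, right) + unmatched)

def right_outer_join(left, right):
    # Common keys, plus keys found only in the second (right) dataset.
    left_keys = {k for k, _ in left}
    unmatched = [(k, (None, rv)) for k, rv in right if k not in left_keys]
    return by_key(inner_join(left, right) + unmatched)

def subtract_by_key(left, right):
    # Drop every left pair whose key also appears on the right.
    right_keys = {k for k, _ in right}
    return by_key([(k, v) for k, v in left if k not in right_keys])

first = [("a", 1), ("b", 2)]
second = [("b", 20), ("c", 30)]

joined = inner_join(first, second)
left_joined = left_outer_join(first, second)
right_joined = right_outer_join(first, second)
remaining = subtract_by_key(first, second)
```

With this data, key "a" appears only in the left outer join, key "c" only in the right outer join, and key "b" in all three joins.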
Actions on Key,Val pairs

Function          Description
countByKey()      Counts the number of items for each key
collectAsMap()    Converts a key-value RDD into a Python dict
lookup()          Returns a list of the values for a given key

JSON
• Lightweight, text-based format for representing and storing data
• Independent of language
• Variable “width”
• Plain text
• Human readable
• Hierarchical structure

Representation
• Unordered set of “field”: “value” pairs
• Begins and ends with ‘{’ and ‘}’ respectively
• “Field” is followed by “value”, separated by ‘:’
• Field-value pairs are separated by ‘,’

Flat record:
student = {
    'name': 'abc',
    'id': 1234,
    'dateofbirth': 'mmddyyyy',
    'dept/school': 'xyz',
    'Enrolment': 2017
}

Record with a nested value:
student = {
    'name': 'abc',
    'id': 1234,
    'dateofbirth': 'mmddyyyy',
    'dept/school': {'deptid': 111, 'name': 'computer_science'},
    'Enrolment': 2017
}

JSON in Python
• Standard library
• Pyson
• simplejson

Standard json library
• import json
• json.loads(string): parses a JSON string into a Python object
• json.dumps(jsonobject): serializes a Python object into a JSON string
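
A short round trip with the standard `json` module may help here. This is a sketch only: the record contents are sample values modeled on the student example, and the variable names are chosen for illustration.

```python
import json

# A JSON document as text; JSON itself requires double-quoted strings.
text = '{"name": "abc", "id": 1234, "dept/school": {"deptid": 111, "name": "computer_science"}}'

student = json.loads(text)                  # JSON text -> Python dict
dept_name = student["dept/school"]["name"]  # nested objects become nested dicts

student["Enrolment"] = 2017                 # modify the Python object
encoded = json.dumps(student, sort_keys=True)  # Python dict -> JSON text
```

Parsing `encoded` again with `json.loads` gives back a dict equal to `student`, so the two functions are inverses for data made of dicts, lists, strings, numbers, booleans, and `None`.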

Processing JSON files using Spark

Assuming a file has one JSON string on each line:

rdd1 = sc.textFile('filepath')
rdd1.map(lambda x: json.loads(x))
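
Real files often contain blank or malformed lines, which would make a plain `json.loads` call fail. The sketch below shows one possible way to guard against that; `parse_line` and the sample lines are hypothetical, and the code runs on an ordinary list so it can be tried without a cluster.

```python
import json

def parse_line(line):
    """Return [record] for a valid JSON line, [] for a blank or malformed one."""
    line = line.strip()
    if not line:
        return []
    try:
        return [json.loads(line)]
    except json.JSONDecodeError:
        return []

# Sample input standing in for the lines of a file (assumed data).
lines = ['{"name": "abc", "id": 1234}', '', 'not json', '{"name": "def", "id": 5678}']

# Flatten the per-line results, which drops the empty lists.
records = [rec for line in lines for rec in parse_line(line)]
names = [rec["name"] for rec in records]
```

Because `parse_line` returns a list with zero or one element, the same function could be passed to an RDD's `flatMap` so that bad lines are skipped rather than stopping the job. Whether silently skipping bad lines is acceptable depends on the data, so treat this as a starting point.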
