INFO H516 Cloud Computing
for Data Science
Week 3: Spark
Spark Structure
Master
worker worker worker
Driver
worker worker worker
9/15/2025 2
Features
• Highly accessible
• Scala, Python, Java, R, SQL
• Flexible
• Can run on Hadoop clusters
• Compatible with HDFS
9/15/2025 3
Spark Stack
9/15/2025 4
Spark Core
• Basic functionalities
• Task scheduling
• Memory management
• Fault tolerance
• RDDs: Resilient Distributed Datasets
• APIs for RDD operations
9/15/2025 5
Other Components
• Spark SQL
• Integrates spark programming framework with SQL
• Enables access to distributed data storage
• Spark ML
• Learning from large datasets
• Includes – basic stats, classifiers, regression models, clustering, recommendation
models
• GraphX
• Processing large scale networks
• Spark Streaming
• Processing of large stream of data
• Distributed and realtime analysis
9/15/2025 6
Cluster Manager
• Run over different cluster managers
• Standalone (native) scheduler
• Mesos
• YARN
9/15/2025 7
Driver program
• Starting point for every spark
process
• Launches and manages various
parallel operations
• main function of the program
• Driver program connects to Spark
using SparkContext object
• Driver program distributes the task
by creating, managing worker
nodes (Executors)
9/15/2025 8
Spark Programming - Flow
• Input data
• Data analysis (Task)
• Output data
9/15/2025 9
Spark Programming - Flow
Reading into an RDD
External Data
RDD Output Data
(input)
9/15/2025 10
Spark Programming - Flow
Reading into an RDD
External Data
RDD Output Data
(input)
9/15/2025 11
Spark RDD: Resilient Distributed Datasets
• Special data structure for distributing datasets
• Immutable
• Partitioned collection
• Easily parallelized and scalable
• Can be read from:
• Local files
• Distributed files
• Databases
• Performance improvement compared to directly reading from hard disk
• Used in other platforms (e.g. Hadoop)
• In memory processing
• Efficient Input/Output operation
• Fault tolerant
9/15/2025 12
Spark Programming - Flow
Source: [Link]
9/15/2025 13
Spark features
• Immutable RDD
• The same RDD never changes
• Computational output is saved in a new RDD
• Lazy evaluation
• Transformation is recorded and scheduled
• Executed when action is called
9/15/2025 14
Let’s get started!!
9/15/2025 15
Reading into an RDD
• From python variables
val = [Link]()
• Files
• Local file:
txtfl = [Link](textfile)
• HDFS
• Database
9/15/2025 16
RDD Transformations
• Converting one RDD to another
• Input: RDD output: a different RDD
• Functions
• filter
• flatMap
• union, intersection, distinct, subtract
• map
• reduce
• reduceByKey
9/15/2025 17
Example
dataRDD = [Link](range(100))
[Link]()
even_odd_RDD = [Link](lambda x:(x,x%2))
event_odd_RDD.collect()
[Link](lambda x:x[0]+1).collect()
[Link]()
[Link](5)
[Link]()
[Link](lambda x:x).collect()
9/15/2025 18
RDD Actions
• Converting an RDD to:
• standard python variable
• Write to an external file/database
• Input: RDD Output: external
• Functions:
• take(n)
• first()
• count()
• reduce()
• countByValue()
9/15/2025 19
Spark RDD: Action Operations
• Collect():
data=[Link]()
• First():
data=[Link]()
• Take(n):
data=[Link](10)
• Count():
size=[Link]()
9/15/2025 20
Functions
• Lambda functions
• For more complex processes
• Standard python function
def m(x):
return (x,1)
lambda x: (x,1)
rdd_data.map(lambda x:(x,1))
OR
rdd_data.map(m)
9/15/2025 21
Example
dataRDD = [Link](range(100))
[Link]()
even_odd_RDD = [Link](lambda x:(x,x%2))
event_odd_RDD.collect()
[Link](lambda x:x[0]+1).collect()
[Link]()
[Link](5)
[Link]()
[Link](lambda x:x).collect()
9/15/2025 22
#input
file1 = 'to be or not to be‘
#output if __name__ == ‘__main__’:
final_output = {} ###MAPPING
##for each file
map_output=map(0,file1)
def map(id,txt):
mapop = [] print map_output
words = [Link]()
map = lambda
for wordx,y: [(k,1) for k in [Link]()]
in words: ##REDUCING
tmp = (word,1) for k,v in map_output:
[Link](tmp) if k not in final_output:
return mapop final_output[k] = []
reduce(k,v)
def reduce(key,val):
if key not in final_output: print final_output
final_output[key] = 0
final_output[key] += 1
reduce = lambda x,y:final_output[x]+[y]
9/15/2025 23
Example: Word Count
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('WordCount')
sc = SparkContext(conf=conf)
t1=[Link]('[Link]')
print [Link]()
f1=[Link](lambda x:[Link]())
print [Link]()
m1=[Link](lambda x: (x,1))
print [Link]()
r1=[Link](lambda x,y:x+y)
print [Link]()
[Link]('output')
9/15/2025 24
Key-Value pairs
• Creating key-value pairs
• Map function returns key-val pairs
• Input: name (RDD containing a list of names )
• Output → Key: initials; value: full name
[Link](lambda x: (getInitials(x),x))
def getInitials(x):
fn,ln = [Link]()
initial = fn[0]+ln[0]
return initial
9/15/2025 25
Transformations on Key,Val pairs
Function Description
reduceByKey Operates by combining the keys
mapValues Operates on the values
groupByKey Groups based on keys
keys() Produces the keys
values() Produces the values
sortByKey() Sorts the RDD based on keys
filter() Selects a part of the RDD based on a condition
Involving two RDDs
subtractByKey() Removes items that are present only in the second RDD
join Inner join between two RDDs
(for common keys)
rightOuterJoin Performs a join with common keys and the ones present in the
first RDD
leftOuterJoin
9/15/2025
Performs a join with common keys and the ones present in26the
second RDD
Actions on Key,Val pairs
Function Description
countByKey() Count the number of items for each key
collectAsMap() Converts key,val RDDs into a python dict
lookup() Returns a list of values for a given key
9/15/2025 27
JSON
• Lightweight text data representational storage format
• Independent of language
• Variable “width”
• Plain text
• Human readable
• Hierarchical structure
9/15/2025 28
Representation
• Unordered pair of “field”, “values”
• Begins and ends with ‘{‘ and ‘}’ respectively
• “Value” is followed by “Field” , separated by ‘:’
• Field, value pairs are separated by ‘,’
student = { student = {
‘name’: ‘abc’, ‘name’: ‘abc’,
‘id’: 1234, ‘id’: 1234,
‘dateofbirth’: mmddyyyy, ‘dateofbirth’: mmddyyyy,
‘dept/school’: xyz, ‘dept/school’: {‘deptid’: 111, ‘name’: ‘computer_science’},
‘Enrolment’ : 2017 ‘Enrolment’ : 2017
} }
9/15/2025 29
JSON in Python
• Standard library
• Pyson
• simplejson
9/15/2025 30
Standard json library
• import json
• [Link](string)
• [Link](jsonobject)
9/15/2025 31
Processing JSON files using spark
Assuming a file has json string in each line:
rdd1 = [Link](‘filepath’)
[Link](lambda x:[Link](x))
9/15/2025 32