
Spark Programming for Data Science

The document provides an overview of Apache Spark, highlighting its structure, features, and components such as Spark SQL, Spark ML, and Spark Streaming. It explains the concept of Resilient Distributed Datasets (RDDs), their transformations, and actions, as well as how to process JSON files using Spark. Additionally, it includes examples of Spark programming and key-value pair operations.

Uploaded by

Sathvik Cisco

INFO H516 Cloud Computing for Data Science


Week 3: Spark
Spark Structure
[Diagram: a Master and a Driver, each shown with three worker nodes]

9/15/2025 2
Features
• Highly accessible
  • Scala, Python, Java, R, SQL
• Flexible
  • Can run on Hadoop clusters
  • Compatible with HDFS

Spark Stack

Spark Core
• Basic functionalities
  • Task scheduling
  • Memory management
  • Fault tolerance
• RDDs: Resilient Distributed Datasets
  • APIs for RDD operations

Other Components
• Spark SQL
  • Integrates the Spark programming framework with SQL
  • Enables access to distributed data storage
• Spark ML
  • Learning from large datasets
  • Includes basic statistics, classifiers, regression models, clustering, and recommendation models
• GraphX
  • Processing large-scale networks
• Spark Streaming
  • Processing large streams of data
  • Distributed, real-time analysis
Cluster Manager
• Spark can run on different cluster managers
  • Standalone (native) scheduler
  • Mesos
  • YARN

Driver program
• Starting point for every Spark process
• Launches and manages the various parallel operations
• Contains the main function of the program
• Connects to Spark through a SparkContext object
• Distributes tasks to the executors on the worker nodes

Spark Programming - Flow
• Input data
• Data analysis (Task)
• Output data

Spark Programming - Flow

[Diagram: External Data (input) → RDD → Output Data; step shown: Reading into an RDD]

Spark RDD: Resilient Distributed Datasets
• Special data structure for distributing datasets
  • Immutable
  • Partitioned collection
  • Easily parallelized and scalable
• Can be read from:
  • Local files
  • Distributed files
  • Databases
• Performance improvement over reading directly from hard disk
  • Reading from disk is the approach used in other platforms (e.g. Hadoop)
  • In-memory processing
  • Efficient input/output operations
• Fault tolerant

Spark Programming - Flow

Source: [Link]
Spark features
• Immutable RDD
  • The same RDD never changes
  • Computational output is saved in a new RDD
• Lazy evaluation
  • Transformations are recorded and scheduled
  • They are executed only when an action is called
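
The two properties above can be imitated in plain Python. The sketch below is only an illustration and not how Spark is implemented: `ToyRDD` is a hypothetical class that records each transformation and runs nothing until the `collect` action is called. Every transformation returns a new object and leaves the original untouched.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: fixed data plus a list of recorded steps."""

    def __init__(self, data, steps=()):
        self._data = tuple(data)      # never modified after creation
        self._steps = tuple(steps)    # recorded transformations, not yet executed

    def map(self, fn):
        # Record the step and return a NEW ToyRDD; self is left unchanged.
        return ToyRDD(self._data, self._steps + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._data, self._steps + (("filter", fn),))

    def collect(self):
        # Action: only now are the recorded steps actually executed.
        items = list(self._data)
        for kind, fn in self._steps:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items


calls = []                            # tracks when the mapped function really runs

def double(x):
    calls.append(x)
    return 2 * x

base = ToyRDD(range(5))
doubled = base.map(double)            # recorded only, nothing executed yet
calls_before_action = len(calls)
result = doubled.filter(lambda x: x > 2).collect()
print(calls_before_action, result)
```

Calling `base.collect()` afterwards still returns the original five numbers, which is the immutability property in miniature.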

Let’s get started!!

Reading into an RDD
• From Python variables
    val = [Link]()

• Files
  • Local file:
      txtfl = [Link](textfile)

  • HDFS
• Database

RDD Transformations
• Converting one RDD to another
• Input: an RDD; output: a different RDD
• Functions
  • filter
  • flatMap
  • union, intersection, distinct, subtract
  • map
  • reduce (strictly an action: it returns a value, not an RDD)
  • reduceByKey
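
What each transformation does to the elements can be previewed on ordinary Python lists. These are plain-Python analogues chosen for illustration, not PySpark calls, and the variable names are made up. Results are sorted where needed only to make the output predictable.

```python
from itertools import chain

lines = ["to be", "or not", "to be"]
nums_a = [1, 2, 2, 3]
nums_b = [3, 4]

# map: exactly one output element per input element
lengths = [len(s) for s in lines]

# filter: keep only the elements that satisfy a condition
starts_with_to = [s for s in lines if s.startswith("to")]

# flatMap: each element may yield several outputs, flattened into one list
words = list(chain.from_iterable(s.split() for s in lines))

# distinct: drop repeated elements
distinct_a = sorted(set(nums_a))

# union: concatenate both collections (duplicates are kept)
union_ab = nums_a + nums_b

# intersection: elements present in both collections
intersection_ab = sorted(set(nums_a) & set(nums_b))

# subtract: elements of the first collection that are absent from the second
subtract_ab = [x for x in nums_a if x not in nums_b]

print(words)
```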

Example
dataRDD = [Link](range(100))
[Link]()
even_odd_RDD = [Link](lambda x:(x,x%2))
even_odd_RDD.collect()
[Link](lambda x:x[0]+1).collect()
[Link]()
[Link](5)
[Link]()
[Link](lambda x:x).collect()

RDD Actions
• Converting an RDD to:
  • A standard Python variable
  • Output written to an external file/database
• Input: an RDD; output: an external (non-RDD) value
• Functions:
  • take(n)
  • first()
  • count()
  • reduce()
  • countByValue()

Spark RDD: Action Operations
• collect():
data=[Link]()

• first():
data=[Link]()

• take(n):
data=[Link](10)

• count():
size=[Link]()
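
As a rough guide to what these actions return, the sketch below computes the same kinds of results on a plain Python list. It is an analogy under the assumption that the list stands in for the RDD's contents; none of these lines are PySpark calls.

```python
from collections import Counter
from functools import reduce

data = [3, 1, 3, 2, 3]                        # made-up contents

everything = list(data)                       # like collect(): all elements
first_item = data[0]                          # like first(): the first element
first_two = data[:2]                          # like take(2): the first n elements
size = len(data)                              # like count(): number of elements
total = reduce(lambda x, y: x + y, data)      # like reduce(): combine pairwise
value_counts = dict(Counter(data))            # like countByValue(): value -> count

print(first_item, first_two, size, total, value_counts)
```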

Functions
• Lambda functions
• For more complex processes
• Standard python function

def m(x):
    return (x,1)

lambda x: (x,1)

rdd_data.map(lambda x:(x,1))
OR
rdd_data.map(m)
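
That the two forms are interchangeable can be checked without Spark. This small sketch uses Python's built-in `map` on a made-up list in place of `rdd_data`.

```python
def m(x):
    return (x, 1)

words = ["to", "be", "or"]            # stand-in for the RDD's elements

with_lambda = list(map(lambda x: (x, 1), words))
with_function = list(map(m, words))

print(with_lambda == with_function)
```

A named function is usually the better choice once the logic needs more than one expression, since a lambda cannot contain statements.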
#input
file1 = 'to be or not to be'

#output
final_output = {}

def map(id,txt):
    mapop = []
    words = [Link]()
    for word in words:
        tmp = (word,1)
        [Link](tmp)
    return mapop

def reduce(key,val):
    if key not in final_output:
        final_output[key] = 0
    final_output[key] += 1

if __name__ == '__main__':
    ###MAPPING
    ##for each file
    map_output=map(0,file1)
    print(map_output)

    ##REDUCING
    for k,v in map_output:
        reduce(k,v)
    print(final_output)

# Lambda forms shown alongside the functions:
#   map = lambda x,y: [(k,1) for k in [Link]()]
#   reduce = lambda x,y: final_output[x]+[y]
# (with the lambda reduce, final_output[k] starts as [] instead of 0)
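
A runnable version of the same map and reduce steps is sketched below. It assumes the two elided calls are a whitespace split of the text and an append to the output list. The functions are renamed `map_words` and `reduce_counts` (hypothetical names) so they do not shadow Python's built-ins.

```python
def map_words(doc_id, txt):
    """Map step: emit a (word, 1) pair for every word in the text."""
    pairs = []
    for word in txt.split():          # assumption: the elided call splits on whitespace
        pairs.append((word, 1))       # assumption: the elided call appends to the list
    return pairs


def reduce_counts(pairs):
    """Reduce step: add up the values that share the same key."""
    totals = {}
    for key, val in pairs:
        totals[key] = totals.get(key, 0) + val
    return totals


file1 = 'to be or not to be'
map_output = map_words(0, file1)
final_output = reduce_counts(map_output)
print(final_output)   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```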

Example: Word Count
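from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('WordCount')
sc = SparkContext(conf=conf)

# read the input file into an RDD
t1=[Link]('[Link]')
print([Link]())

# split each element into words
f1=[Link](lambda x:[Link]())
print([Link]())

# pair each word with a count of 1
m1=[Link](lambda x: (x,1))
print([Link]())

# add up the counts
r1=[Link](lambda x,y:x+y)
print([Link]())

# write the result to 'output'
[Link]('output')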
Key-Value pairs
• Creating key-value pairs
  • Map function returns key-value pairs
  • Input: name (an RDD containing a list of names)
  • Output → key: initials; value: full name

def getInitials(x):
    fn,ln = [Link]()
    initial = fn[0]+ln[0]
    return initial

[Link](lambda x: (getInitials(x),x))
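
The pairing step can be tried on a plain list first. This sketch assumes the elided call inside the function is a whitespace split into first and last name. The sample names and the snake_case function name are made up for illustration.

```python
def get_initials(full_name):
    # assumption: the name splits into exactly two parts, first and last
    fn, ln = full_name.split()
    return fn[0] + ln[0]


names = ["Ada Lovelace", "Alan Turing", "Grace Hopper"]   # made-up input
pairs = [(get_initials(n), n) for n in names]
print(pairs)
```

Because of the two-variable unpacking, a name with a middle name or a single word raises a `ValueError`, so real data would need a more forgiving split.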

Transformations on Key,Val pairs
Function          Description
reduceByKey       Combines the values that share the same key
mapValues         Operates on the values only; keys are unchanged
groupByKey        Groups the values by key
keys()            Produces the keys
values()          Produces the values
sortByKey()       Sorts the RDD based on keys
filter()          Selects a part of the RDD based on a condition

Involving two RDDs
subtractByKey()   Removes items whose key is present in the second RDD
join              Inner join between two RDDs (for common keys)
leftOuterJoin     Performs a join with common keys and the ones present only in the first RDD
rightOuterJoin    Performs a join with common keys and the ones present only in the second RDD
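
The semantics in the table can be checked on small lists of pairs. The helpers below are hypothetical plain-Python imitations written for this sketch, not PySpark functions. Results are sorted by key only to make the output predictable, and a missing side of an outer join is shown as `None`.

```python
from collections import defaultdict

sales = [("a", 1), ("b", 2), ("a", 3)]     # first collection (made up)
prices = [("a", 10), ("c", 30)]            # second collection (made up)


def reduce_by_key(pairs, fn):
    out = {}
    for k, v in pairs:
        out[k] = fn(out[k], v) if k in out else v
    return sorted(out.items())


def group_by_key(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)


def join(left, right, how="inner"):
    lg, rg = group_by_key(left), group_by_key(right)
    if how == "inner":
        keys = lg.keys() & rg.keys()       # common keys only
    elif how == "left":
        keys = lg.keys()                   # every key of the first collection
    else:
        keys = rg.keys()                   # every key of the second collection
    return [(k, (lv, rv))
            for k in sorted(keys)
            for lv in lg.get(k, [None])
            for rv in rg.get(k, [None])]


summed = reduce_by_key(sales, lambda x, y: x + y)
doubled = [(k, v * 2) for k, v in sales]                        # like mapValues
right_keys = {k for k, _ in prices}
subtracted = [(k, v) for k, v in sales if k not in right_keys]  # like subtractByKey
inner = join(sales, prices)
left_outer = join(sales, prices, "left")
right_outer = join(sales, prices, "right")
print(summed, inner, left_outer, right_outer, sep="\n")
```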
Actions on Key,Val pairs

Function          Description
countByKey()      Counts the number of items for each key
collectAsMap()    Converts a key,val RDD into a Python dict
lookup()          Returns a list of values for a given key
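
The results of these three actions look roughly like the following when computed on a plain list of pairs. This is an analogy with made-up data, not PySpark code.

```python
from collections import Counter

pairs = [("a", 1), ("b", 2), ("a", 3)]

count_by_key = dict(Counter(k for k, _ in pairs))   # like countByKey()
as_map = dict(pairs)                                # like collectAsMap()
lookup_a = [v for k, v in pairs if k == "a"]        # like lookup("a")

print(count_by_key, as_map, lookup_a)
```

A dict holds only one value per key, so converting pairs with repeated keys keeps just the last value seen for each key. `lookup` is the safer choice when every value for a key is needed.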

JSON
• Lightweight, text-based format for representing and storing data
• Independent of language
• Variable “width”
• Plain text
• Human readable
• Hierarchical structure

Representation
• Unordered set of "field": "value" pairs
• Begins and ends with '{' and '}' respectively
• "Field" is followed by "value", separated by ':'
• Field, value pairs are separated by ','

student = {
    'name': 'abc',
    'id': 1234,
    'dateofbirth': 'mmddyyyy',
    'dept/school': 'xyz',
    'Enrolment': 2017
}

student = {
    'name': 'abc',
    'id': 1234,
    'dateofbirth': 'mmddyyyy',
    'dept/school': {'deptid': 111, 'name': 'computer_science'},
    'Enrolment': 2017
}

JSON in Python
• Standard library
• Pyson
• simplejson

Standard json library
• import json
• [Link](string)
• [Link](jsonobject)
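
A short round trip with the standard library is sketched below, assuming the two elided calls are `json.loads` (string to Python object) and `json.dumps` (Python object to string). The sample record is made up to mirror the nested student example.

```python
import json

text = ('{"name": "abc", "id": 1234, '
        '"dept/school": {"deptid": 111, "name": "computer_science"}}')

student = json.loads(text)                       # JSON string -> Python dict
dept_name = student["dept/school"]["name"]       # nested fields become nested dicts
round_trip = json.dumps(student, sort_keys=True) # Python dict -> JSON string

print(dept_name)
print(round_trip)
```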

Processing JSON files using Spark

Assuming the file has one JSON string per line:

rdd1 = [Link]('filepath')
[Link](lambda x:[Link](x))
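
The per-line parsing idea can be tried without a cluster. In this sketch a plain list of strings stands in for the RDD of lines, and the parser is assumed to be `json.loads`. The records are made up.

```python
import json

lines = [
    '{"name": "abc", "id": 1}',
    '{"name": "def", "id": 2}',
]

# one JSON string in, one Python dict out, for every line
records = list(map(lambda x: json.loads(x), lines))
names = [r["name"] for r in records]

print(names)
```

This only works when each line is a complete JSON document on its own. A single pretty-printed JSON object spread over many lines would fail to parse line by line.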
