DS5460: Big Data Scaling
Week 7: Midterm Review
Instructor: Dana Zhang, Ph.D.
Midterm
• During class time Wed 2/19
• Paper-based exam
• Duration: 75 minutes (in class)
• Content covered: material prior to Wed 2/19
• Allowed on the exam: 1 page double-sided printed cheatsheet
• Not allowed on the exam: any electronic devices
• For additional exam logistics, see Brightspace announcement
Midterm
• Total point value: 100
• Format
• 25-30 multiple choice questions
• 4-6 written response questions
• Questions will be a mixture of concepts, commands, PySpark/MR code, calculations, etc
Review Outline
• Cloud Computing (DFS, MapReduce)
• HDFS Architecture
• Apache Spark (RDD, Spark Applications)
• DataFrame and SQL
• Commands
What is Cloud Computing?
Cloud computing:
• Internet-based computing in which large groups of remote servers
are networked so as to allow sharing of data-processing tasks,
centralized data storage, and online access to computer services or
resources.
• Any computer related task that is done entirely on the Internet.
• Allows users to deal with the software without having the hardware.
• Everything is done remotely, nothing is saved locally.
Cloud Services
Three tiers of cloud services:
• Infrastructure as a Service (IaaS)
• Basic/raw, service users maintain most components
• Ex: Google Compute Engine – provides virtual
machines where systems like PySpark/Hadoop has to
be manually installed
• Platform as a Service (PaaS)
• Users are given hardware and some pre-configured
software automatically
• Ex: Dataproc – fully managed PySpark/Hadoop
• Software as a Service (SaaS)
• All software and hardware are transparent
• User only knows their own access point
• Ex: IBM Waston ML – fully trained models, ready to use
Cloud Challenges
• Equipment Failures
• With so many machines, steady rate of failures is expected and constant
maintenance is required
• Scalability
• Cloud needs to be able to add more servers
• (Horizontal vs vertical scaling)
• Asynchronous processing
• Clocks of different servers cannot all be synchronized to each other
• Concurrency
• Many machines may try to access the same data
Distributed File System
• Master node (Name node in Hadoop’s HDFS)
• Stores metadata about where files are stored
• Might be replicated
• Chunk servers (Data nodes)
• File is split into contiguous chunks
• Typically each chunk is 16-64MB
• Each chunk replicated (usually 2x or 3x)
• Try to keep replicas in different racks
• Data Coherency
- Write-once-read-many access model
- Client can only append to existing files
• Client library for file access (e.g. hdfs commands)
• Talks to master to find chunk servers
• Connects directly to chunk servers to access data
Distributed File System
[Link]
Goals of HDFS
• Very Large Distributed File System
• 10K nodes, 100 million files, 10PB
• Assumes Commodity Hardware
• Files are replicated to handle hardware failure
• Detect failures and recover from them
• Optimized for Batch Processing
• Data locations exposed so that computations can move to where
data resides
• Provides very high aggregate bandwidth
HDFS Architecture
The NameNode executes file system
namespace operations like opening,
closing, and renaming files and
directories. It also determines the
mapping of blocks to DataNodes.
The DataNodes are responsible for
serving read and write requests from
the file system’s clients.
The DataNodes also perform block
creation, deletion, and replication
upon instruction from the NameNode.
MapReduce: Map + Shuffle + Reduce
[Link]
Problems with Hadoop MapReduce
• Difficulty in Programming
• Many tasks are not easily described as map-reduce
• Not well suited for complex applications
• Performance Bottlenecks
• Disk IO
• Data (including intermediate data from Map) is persisted in HDFS,
requiring multiple read/write operations
• After Map, data must be sorted and shuffled before sending to
Reduce
• HDFS replicates data (3x by default)
MapReduce: Word Count
[Link]
MapReduce: Word Count
[Link]
MapReduce: Word Count
Hadoop Streaming API command
MapReduce Limitations?
• Hadoop MapReduce heavily relies on reading and writing data (files)
from/to HDFS
• Problem for data science: Many operations are carried out in the same
dataset
• No support for interactivity
• Problem for data science: can’t REPL means not hard to validate
intermediate results
• Complex jobs are not supported
• Problem for data science: not specialized for machine learning tasks!
Spark
• Not limited to the map-reduce model
• Additions to MapReduce model:
- Fast data sharing
- Avoids saving intermediate results to disk
- Caches data for repetitive queries (e.g. for machine learning)
- Richer functions than just map and reduce
• Compatible with Hadoop
Spark vs. Hadoop MapReduce
• Performance: Spark normally faster but with caveats
- Spark can process data in-memory;
- Spark often needs lots of memory to perform well; if there are other resource-
demanding services or can’t fit in memory, Spark degrades
• Ease of use: Spark is easier to program (higher-level APIs)
• Data processing: Spark more general
• Flexibility in input/output files
• Interactive shell
In Memory Processing
MapReduce Spark
Resilient Distributed Datasets
(RDD)
Resilient Distributed Datasets (RDDs)
• The main abstraction Spark provides is a resilient distributed
dataset (RDD), which is a collection of elements partitioned across
the nodes of the cluster that can be operated on in parallel.
• RDDs are datasets created from HDFS, S3, JSON, text, or other RDDs
• Read-only, partitioned collection of records.
• RDDs automatically recover from node failures. They track the history
of the partition and can rerun through DAG lineage
Spark Architecture
Driver & executors
• Driver program runs your Spark application
• Driver delegates tasks to executors
• In local mode, executors are located in the same machine as driver
• In cluster mode, executors may be located on other machines
(worker nodes)
• Actions are processed in executors
• Outcome is passed to driver
[Link]
Transformations
• Narrow Transformation: executed locally, with no need to shuffle
partitions
• Wide Transformation: processing depends on data in different RDD
partitions, in different worker nodes; requires data transferred
through the network
MapReduce in Spark
• Let’s combine the flatMap, map and reduceByKey transformations to compute
the per-word counts as an RDD of (string, int) pairs.
>>> wordCounts = [Link](lambda line:
[Link]()).map(lambda word: (word, 1)).reduceByKey(lambda a, b:
a+b)
• To collect the word counts in our shell, we can use the collect action:
>>> [Link]()
[('Apache', 1), ('Spark', 2), ('cat', 4), ('fish', 2), ('cow',
2), ('chicken', 2), ('dog', 4), ('horse', 1)]
• RDDs can be cached
>>> [Link]()
Passing Functions to Spark
• Spark’s API relies on passing functions in the driver program to run on
the cluster. There are three ways to do this:
• Lambda expressions, for simple functions that can be written as an expression.
Lambdas do not support multi-statement functions or statements that do not
return a value.
• Local defs inside the function calling into Spark, for longer code
• Top-level functions in a module
Broadcasting
• Join and Lookup are use cases of broadcasting
• The smaller of two datasets is broadcasted to all nodes and cached in memory
• Global distribution
Transformations, Actions, Laziness
• DataFrames are lazy.
• Transformations contribute to the query
plan, but they don't execute anything.
• Actions cause the execution of the query.
• Dataframes are Immutable in nature. By
immutable I mean that it is an object
whose state cannot be modified after it is
created. But we can transform its values
by applying a certain transformation, like
in RDDs.
DataFrames and Spark SQL
• DataFrames are fundamentally tied to Spark SQL.
• Spark SQL provides a SQL-like interface.
• What you can do in Spark SQL, you can do in DataFrames
• … and vice versa.
DataFrame, SQL and RDD
• SQL-like query
• Dataframe -> implement map/reduce
• RDD -> create dataframe
DataFrame commands
• Create DataFrames from files, RDDs, etc
• Show the data
• Create/rename new columns
• Filter Data
• Integrate with Pandas
• GroupBy and Aggregate functions
• Missing data
• Dates and Timestamps
Linux
• chmod
• cat
• head -n
• sort (with optional k1 args)
• | vs >
Good Luck on the Exam!