Midterm Review: Big Data Concepts

The document outlines the midterm exam details for the DS5460 Big Data Scaling course, including logistics, allowed materials, and content coverage. It reviews key topics such as Cloud Computing, HDFS Architecture, Apache Spark, and MapReduce, along with their challenges and functionalities. Additionally, it provides a brief overview of DataFrames, SQL, and Linux commands relevant to the course material.

Uploaded by

keqingliush

DS5460: Big Data Scaling

Week 7: Midterm Review

Instructor: Dana Zhang, Ph.D.

Midterm
• During class time Wed 2/19
• Paper-based exam
• Duration: 75 minutes (in class)
• Content covered: material prior to Wed 2/19
• Allowed on the exam: 1 page double-sided printed cheatsheet
• Not allowed on the exam: any electronic devices
• For additional exam logistics, see Brightspace announcement
Midterm
• Total point value: 100
• Format
• 25-30 multiple choice questions
• 4-6 written response questions
• Questions will be a mixture of concepts, commands, PySpark/MR code, calculations, etc.
Review Outline
• Cloud Computing (DFS, MapReduce)
• HDFS Architecture
• Apache Spark (RDD, Spark Applications)
• DataFrame and SQL

• Commands
What is Cloud Computing?
Cloud computing:
• Internet-based computing in which large groups of remote servers
are networked so as to allow sharing of data-processing tasks,
centralized data storage, and online access to computer services or
resources.
• Any computer-related task that is done entirely on the Internet.
• Allows users to work with the software without owning the hardware.
• Everything is done remotely; nothing is saved locally.
Cloud Services
Three tiers of cloud services:
• Infrastructure as a Service (IaaS)
• Basic/raw, service users maintain most components
• Ex: Google Compute Engine – provides virtual
machines where systems like PySpark/Hadoop have to
be manually installed
• Platform as a Service (PaaS)
• Users are given hardware and some pre-configured
software automatically
• Ex: Dataproc – fully managed PySpark/Hadoop
• Software as a Service (SaaS)
• All software and hardware are transparent
• User only knows their own access point
• Ex: IBM Watson ML – fully trained models, ready to use
Cloud Challenges
• Equipment Failures
• With so many machines, steady rate of failures is expected and constant
maintenance is required
• Scalability
• Cloud needs to be able to add more servers
• (Horizontal vs vertical scaling)
• Asynchronous processing
• Clocks of different servers cannot all be synchronized to each other
• Concurrency
• Many machines may try to access the same data
Distributed File System
• Master node (Name node in Hadoop’s HDFS)
• Stores metadata about where files are stored
• Might be replicated
• Chunk servers (Data nodes)
• File is split into contiguous chunks
• Typically each chunk is 16-64MB
• Each chunk replicated (usually 2x or 3x)
• Try to keep replicas in different racks
• Data Coherency
- Write-once-read-many access model
- Client can only append to existing files
• Client library for file access (e.g. hdfs commands)
• Talks to master to find chunk servers
• Connects directly to chunk servers to access data
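The chunking and replication bullets above lend themselves to a quick calculation. The following is a minimal sketch in plain Python, not HDFS code: the function names, the rack and server names, and the round-robin placement policy are all assumptions made for illustration, and the real placement algorithm is more involved.

```python
def split_into_chunks(file_size_mb, chunk_size_mb=64):
    """Return the sizes (in MB) of the contiguous chunks a file is split into."""
    n_full, remainder = divmod(file_size_mb, chunk_size_mb)
    sizes = [chunk_size_mb] * n_full
    if remainder:
        sizes.append(remainder)  # the last chunk may be smaller
    return sizes


def place_replicas(chunk_id, racks, replication=3):
    """Choose one chunk server per replica, each on a different rack.

    `racks` maps a rack name to its list of chunk servers. The round-robin
    policy is a hypothetical stand-in for a real placement policy.
    """
    rack_names = sorted(racks)
    if replication > len(rack_names):
        raise ValueError("this sketch needs at least one rack per replica")
    placement = []
    for i in range(replication):
        rack = rack_names[(chunk_id + i) % len(rack_names)]
        servers = racks[rack]
        placement.append((rack, servers[chunk_id % len(servers)]))
    return placement


# Hypothetical cluster layout used only for this example.
racks = {
    "rack-a": ["a1", "a2"],
    "rack-b": ["b1", "b2"],
    "rack-c": ["c1", "c2"],
}

chunks = split_into_chunks(200, 64)   # [64, 64, 64, 8]
raw_storage_mb = sum(chunks) * 3      # 600 MB of raw storage at 3x replication
first_chunk = place_replicas(0, racks)
```

A 200 MB file with 64 MB chunks therefore occupies four chunks, and with 3x replication it consumes 600 MB of raw disk across the cluster.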
Goals of HDFS
• Very Large Distributed File System
• 10K nodes, 100 million files, 10PB
• Assumes Commodity Hardware
• Files are replicated to handle hardware failure
• Detect failures and recover from them
• Optimized for Batch Processing
• Data locations exposed so that computations can move to where
data resides
• Provides very high aggregate bandwidth
HDFS Architecture
The NameNode executes file system
namespace operations like opening,
closing, and renaming files and
directories. It also determines the
mapping of blocks to DataNodes.

The DataNodes are responsible for
serving read and write requests from
the file system’s clients.

The DataNodes also perform block
creation, deletion, and replication
upon instruction from the NameNode.
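The division of labor described above can be sketched with two toy classes. This is an illustrative model only: the class names, method names, and file paths are assumptions, and nothing here is the real HDFS API. The point is that the NameNode holds metadata while block contents live on, and are read from, the DataNodes.

```python
class DataNode:
    """Stores block contents and serves read/write requests."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> data

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds metadata only: file -> blocks and block -> DataNode mappings."""

    def __init__(self):
        self.file_blocks = {}      # path -> ordered list of block ids
        self.block_locations = {}  # block id -> list of DataNode names

    def register(self, path, block_id, datanode_names):
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_locations[block_id] = list(datanode_names)

    def locate(self, path):
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]

    def rename(self, old_path, new_path):
        # A namespace operation: only metadata changes, no block data moves.
        self.file_blocks[new_path] = self.file_blocks.pop(old_path)


def read_file(namenode, datanodes, path):
    """Client read path: ask the NameNode for locations, then read from DataNodes."""
    parts = []
    for block_id, locations in namenode.locate(path):
        parts.append(datanodes[locations[0]].read_block(block_id))
    return "".join(parts)


# Hypothetical three-node cluster holding one two-block file.
datanodes = {name: DataNode(name) for name in ("dn1", "dn2", "dn3")}
nn = NameNode()
layout = [("hello ", ["dn1", "dn2"]), ("world", ["dn2", "dn3"])]
for block_id, (data, holders) in enumerate(layout):
    for holder in holders:
        datanodes[holder].write_block(block_id, data)
    nn.register("/logs/a.txt", block_id, holders)

contents = read_file(nn, datanodes, "/logs/a.txt")
```

Renaming the file afterwards changes only the NameNode's dictionaries, which is why namespace operations are cheap compared with moving data.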
MapReduce: Map + Shuffle + Reduce

Problems with Hadoop MapReduce
• Difficulty in Programming
• Many tasks are not easily described as map-reduce
• Not well suited for complex applications

• Performance Bottlenecks
• Disk IO
• Data is persisted to disk (intermediate Map output on local disk, job
output in HDFS), requiring multiple read/write operations
• After Map, data must be sorted and shuffled before sending to
Reduce
• HDFS replicates data (3x by default)
MapReduce: Word Count

Hadoop Streaming API command
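Hadoop Streaming lets the mapper and reducer be ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. The sketch below imitates that contract inside a single Python process; the function names are assumptions, and `sorted` merely stands in for the framework's shuffle and sort. In a real streaming job the mapper and reducer would be separate scripts reading `sys.stdin`.

```python
from itertools import groupby


def mapper(lines):
    """Map: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Reduce: records arrive sorted by key, so equal words are adjacent."""
    pairs = (record.split("\t") for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)


def run_word_count(lines):
    """Simulate map -> shuffle/sort -> reduce on one machine."""
    shuffled = sorted(mapper(lines))  # stand-in for the shuffle and sort phase
    return dict(reducer(shuffled))


counts = run_word_count(["dog cat dog", "cat fish"])
# counts == {"cat": 2, "dog": 2, "fish": 1}
```

The reducer relies on sorted input: without the sort step, `groupby` would see the same word in several separate runs and emit partial counts.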

MapReduce Limitations?
• Hadoop MapReduce heavily relies on reading and writing data (files)
from/to HDFS
• Problem for data science: many operations are carried out on the same
dataset

• No support for interactivity

• Problem for data science: no REPL, so it is hard to validate
intermediate results

• Complex jobs are not supported

• Problem for data science: not specialized for machine learning tasks!
Spark
• Not limited to the map-reduce model
• Additions to MapReduce model:
- Fast data sharing
- Avoids saving intermediate results to disk
- Caches data for repetitive queries (e.g. for machine learning)
- Richer functions than just map and reduce
• Compatible with Hadoop
Spark vs. Hadoop MapReduce
• Performance: Spark normally faster but with caveats
- Spark can process data in-memory;
- Spark often needs lots of memory to perform well; if other resource-demanding
services are running or the data can’t fit in memory, Spark’s performance degrades
• Ease of use: Spark is easier to program (higher-level APIs)
• Data processing: Spark more general
• Flexibility in input/output files
• Interactive shell
In Memory Processing
[Figure: MapReduce vs. Spark]

Resilient Distributed Datasets (RDDs)
• The main abstraction Spark provides is a resilient distributed
dataset (RDD), which is a collection of elements partitioned across
the nodes of the cluster that can be operated on in parallel.
• RDDs are datasets created from HDFS, S3, JSON, text, or other RDDs
• Read-only, partitioned collection of records.
• RDDs automatically recover from node failures. They track the lineage
of each partition (the DAG of transformations that produced it) and can
recompute a lost partition by rerunning that lineage
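Lineage-based recovery can be illustrated with a toy class. This is a sketch under simplifying assumptions (only `map` is supported, and everything runs in one process); `ToyRDD` and its methods are hypothetical names, not Spark's API. Each derived dataset stores its parent and the function applied, so any partition can be rebuilt from the source on demand.

```python
class ToyRDD:
    """A read-only, partitioned dataset that remembers how it was derived."""

    def __init__(self, partitions=None, parent=None, func=None):
        self._partitions = partitions  # source data (list of lists) or None
        self.parent = parent           # the dataset this one was derived from
        self.func = func               # the function applied to the parent

    def map(self, func):
        # No data is copied or changed; only the lineage grows.
        return ToyRDD(parent=self, func=func)

    def compute_partition(self, index):
        """Rebuild one partition by replaying the lineage from the source."""
        if self.parent is None:
            return list(self._partitions[index])
        return [self.func(x) for x in self.parent.compute_partition(index)]

    def lineage(self):
        steps, node = [], self
        while node is not None:
            steps.append("source" if node.parent is None else node.func.__name__)
            node = node.parent
        return list(reversed(steps))


def double(x):
    return x * 2


def add_one(x):
    return x + 1


base = ToyRDD(partitions=[[1, 2], [3, 4]])
derived = base.map(double).map(add_one)

steps = derived.lineage()                  # ["source", "double", "add_one"]
recovered = derived.compute_partition(1)   # [7, 9]
```

Recomputing partition 1 touches only partition 1 of the source, which is why losing one node does not force the whole dataset to be rebuilt.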
Spark Architecture
Driver & executors
• Driver program runs your Spark application
• Driver delegates tasks to executors
• In local mode, executors are located on the same machine as the driver
• In cluster mode, executors may be located on other machines
(worker nodes)

• Actions are processed in executors

• Outcome is passed to driver

Transformations
• Narrow Transformation: executed locally, with no need to shuffle
partitions

• Wide Transformation: processing depends on data in different RDD
partitions, on different worker nodes; requires data to be transferred
through the network
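The difference between the two kinds of transformation shows up clearly when partitions are modeled as plain lists. The sketch below is an assumption-laden illustration, not Spark code: `partition_for` is a deterministic stand-in for a hash partitioner, and the function names are made up for this example.

```python
from collections import defaultdict
from functools import reduce


def narrow_map(partitions, func):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[func(x) for x in part] for part in partitions]


def partition_for(key, num_partitions):
    # Deterministic stand-in for a hash partitioner (an assumption of this sketch).
    return sum(key.encode("utf-8")) % num_partitions


def reduce_by_key(partitions, func, num_partitions=2):
    """Wide: records with the same key may sit in any input partition,
    so they are first shuffled to a common partition, then reduced."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            buckets[partition_for(key, num_partitions)][key].append(value)
    return [
        sorted((key, reduce(func, values)) for key, values in bucket.items())
        for bucket in buckets
    ]


words = [["cat", "cow"], ["cow", "cat", "cat"]]
pairs = narrow_map(words, lambda w: (w, 1))        # no data leaves its partition
counts = reduce_by_key(pairs, lambda a, b: a + b)  # requires a shuffle
# counts == [[("cat", 3)], [("cow", 2)]]
```

In `narrow_map` the list comprehension never looks outside the current partition, whereas `reduce_by_key` has to read every input partition before any output partition is complete.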
MapReduce in Spark
• Let’s combine the flatMap, map and reduceByKey transformations to compute
the per-word counts as an RDD of (string, int) pairs.
>>> # textFile: an RDD of text lines (name assumed)
>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
• To collect the word counts in our shell, we can use the collect action:
>>> wordCounts.collect()
[('Apache', 1), ('Spark', 2), ('cat', 4), ('fish', 2), ('cow',
2), ('chicken', 2), ('dog', 4), ('horse', 1)]

• RDDs can be cached

>>> wordCounts.cache()
Passing Functions to Spark
• Spark’s API relies on passing functions in the driver program to run on
the cluster. There are three ways to do this:
• Lambda expressions, for simple functions that can be written as an expression.
Lambdas do not support multi-statement functions or statements that do not
return a value.
• Local defs inside the function calling into Spark, for longer code
• Top-level functions in a module
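The three styles listed above can be compared side by side. To stay runnable without a cluster, this sketch uses Python's built-in `map` and `filter` as stand-ins for RDD methods of the same name; the word list and function names are assumptions for illustration. The same function objects could be passed to the corresponding RDD methods in PySpark.

```python
def to_pair(word):
    """Top-level function: defined at module level and reusable by name."""
    return (word, 1)


def count_long_words(words, min_length):
    def is_long(word):
        # Local def: may contain several statements and close over min_length.
        stripped = word.strip(".,!?")
        return len(stripped) >= min_length

    return len(list(filter(is_long, words)))


words = ["Apache", "Spark", "cat", "dog."]

upper = list(map(lambda w: w.upper(), words))  # lambda: a single expression
pairs = list(map(to_pair, words))              # top-level function
n_long = count_long_words(words, 5)            # local def inside the caller
```

The lambda is the shortest form but is limited to one expression; the local `def` is the usual choice once the logic needs intermediate variables.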
Broadcasting

• Join and Lookup are use cases of broadcasting

• The smaller of two datasets is broadcast to all nodes and cached in memory
• Global distribution
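A broadcast join can be sketched as a local dictionary lookup performed independently on every partition of the large dataset. The following is a plain-Python illustration with made-up data and a hypothetical function name; it mimics the idea rather than Spark's broadcast API, and it implements inner-join semantics (unmatched keys are dropped).

```python
def broadcast_join(large_partitions, small_table):
    """Join each partition of the large dataset against a full copy of the small one."""
    lookup = dict(small_table)  # the copy every node would hold in memory
    joined = []
    for part in large_partitions:
        # Each partition is joined locally; no large-dataset records are shuffled.
        joined.append(
            [(key, value, lookup[key]) for key, value in part if key in lookup]
        )
    return joined


# Hypothetical data: a partitioned "orders" dataset and a small lookup table.
orders = [[("us", 10), ("fr", 7)], [("us", 3), ("jp", 5)]]
countries = [("us", "United States"), ("fr", "France")]

result = broadcast_join(orders, countries)
# result == [[("us", 10, "United States"), ("fr", 7, "France")],
#            [("us", 3, "United States")]]
```

Because only the small table is copied, the approach pays off when one side of the join comfortably fits in each node's memory.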
Transformations, Actions, Laziness
• DataFrames are lazy.
• Transformations contribute to the query
plan, but they don't execute anything.
• Actions cause the execution of the query.
• DataFrames are immutable: the state of a
DataFrame cannot be modified after it is
created. Applying a transformation
produces a new DataFrame, as with RDDs.
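Laziness and immutability can both be seen in a small toy pipeline. This is a sketch, not Spark's implementation: `LazyPipeline` is a hypothetical class whose "query plan" is just a tuple of recorded steps, and `collect` is the only action.

```python
class LazyPipeline:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, plan=()):
        self._source = source
        self._plan = plan  # tuple of ("map" | "filter", function) steps

    def map(self, func):
        # Returns a new object; the original pipeline is left unchanged.
        return LazyPipeline(self._source, self._plan + (("map", func),))

    def filter(self, pred):
        return LazyPipeline(self._source, self._plan + (("filter", pred),))

    def collect(self):
        """Action: execute the recorded plan and return the results."""
        rows = iter(self._source)
        for kind, func in self._plan:
            rows = map(func, rows) if kind == "map" else filter(func, rows)
        return list(rows)


calls = []


def square(x):
    calls.append(x)  # record every time the function actually runs
    return x * x


base = LazyPipeline([1, 2, 3, 4])
plan = base.map(square).filter(lambda x: x > 4)

calls_before_action = list(calls)  # [] because nothing has executed yet
result = plan.collect()            # [9, 16]
```

Building `plan` never calls `square`; the work happens only inside `collect`, and `base` still yields its original rows afterwards.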
DataFrames and Spark SQL
• DataFrames are fundamentally tied to Spark SQL.
• Spark SQL provides a SQL-like interface.
• What you can do in Spark SQL, you can do in DataFrames
• … and vice versa.
DataFrame, SQL and RDD
• SQL-like query
• Dataframe -> implement map/reduce
• RDD -> create dataframe
DataFrame commands
• Create DataFrames from files, RDDs, etc

• Show the data

• Create/rename new columns
• Filter Data
• Integrate with Pandas
• GroupBy and Aggregate functions
• Missing data
• Dates and Timestamps
Linux
• chmod
• cat
• head -n
• sort (with optional -k key args, e.g. -k1)
• | vs >
Good Luck on the Exam!
