Agenda
Big Data
Overview
Spark
Overview
Spark
Internals
Spark
Libraries
BIG DATA OVERVIEW
Big Data -- Digital Data growth…
V-V-V
Legacy Architecture Pain Points
• Report arrival latency quite high - Hours to perform joins,
aggregate data
• Existing frameworks cannot do both
• Either, stream processing of 100s of MB/s with low latency
• Or, batch processing of TBs of data with high latency
• Expressibility of business logic in Hadoop MR is challenging
SPARK OVERVIEW
Why
Spark?
Why Spark
Separate, fast, Map-Reduce-like engine
In-memory data storage for very fast iterative queries
Better Fault Tolerance
Combine SQL, Streaming and complex analytics
Runs on Hadoop, Mesos, standalone, or in the cloud
Data sources -> HDFS, Cassandra, HBase and S3
In Memory - Spark vs Hadoop
Improve efficiency over MapReduce
100x in memory , 2-10x in disk
Up to 40x faster than Hadoop
Spark In & Out
RDBMS
Streaming
SQL
GraphX
BlinkDB
Hadoop Input Format
Apps
Distributions:
- CDH
- HDP
- MapR
- DSE
Tachyon
MLlib
Ref: http://training.databricks.com/intro.pdf
Spark Streaming + SQL
Streaming
SQL
Benchmarking & Best Facts
SPARK INSIDE – AROUND RDD
Resilient Distributed Data (RDD)
Immutable + Distributed+ Catchable+ Lazy evaluated
 Distributed collections of objects
 Can be cached in memory across cluster nodes
 Manipulated through various parallel operations
RDD Types
RDD
RDD Operation
Memory and Persistent
Dependencies Types
Spark Cluster Overview
o Application
o Driver program
o Cluster manage
o Worker node
o Job
o Stage
o Executor
o Task
Job Flow
Task Scheduler , DAG
• Pipelines functions within a stage
• Cache-aware data reuse & locality
• Partitioning-aware to avoid
shuffles
rdd1.map(splitlines).filter("ERROR")
rdd2.map(splitlines).groupBy(key)
rdd2.join(rdd1, key).take(10)
Fault Recovery & Checkpoints
• Efficient fault recovery using Lineage
• log one operation to apply to many elements (lineage)
• Recomputed lost partitions on failure
• Checkpoint RDDs to prevent long lineage chains during fault
recovery
QUICK DEMO
SPARK STACAK DETAILS
Spark SQL
• Seamlessly mix SQL queries with Spark programs
• Load and query data from a variety of sources
• Standard Connectivity through (J)ODBC
• Hive Compatibility
Data Frames
• A distributed collection of data organized into named columns
• Like a table in a relational database
Spark SQL
Resilient Distributed Datasets
Spark
JDBC Console
User Programs
(Java, Scala, Python)
Catalyst Optimizer
DataFrame API
Figur e 1: I nter faces to Spar k SQL , and inter action with Spar k.
3.1 DataFr ame API
The main abstraction in Spark SQL’s API is a DataFrame, a dis-
tributed collection of rows with a homogeneous schema. A DataFrame
is equivalent to a table in a relational database, and can also be
manipulated in similar ways to the “ native” distributed collections
as well
maps an
to creat
Spark S
the quer
ports us
Using
data fro
tional d
3.3 D
Users c
domain
Python
operato
aggrega
jects in
expressi
of fema
empl oy
. j oi
SparkR
• New R language for Spark and SparkSQL
• Exposes existing Spark functionality in
an R-friendly syntax view the DataFrame API
Spark Streaming
File systems
Databases
Dashboards
Flume
HDFS
Kinesis
Kafka
Twitter
High-level API
joins, windows, …
often 5x less code
Fault-tolerant
Exactly-once semantics,
even for stateful ops
Integration
Integrate with MLlib, SQL,
DataFrames, GraphX
Chop up the live stream into batches of X seconds. DStream is represented by
a continuous series of RDDs
MLib
• Scalable Machine learning library
• Iterative computing -> High Quality algorithm 100x faster than
hadoop
MLib Algorithms
ML Pipeline
• Feature Extraction
• Normalization
• Dimensionality reduction
• Model training
GraphX
• Spark’s API For Graph and Graph-parallel computation
• Graph abstraction: a directed multigraph with properties attached
to each vertex and edge
• Seamlessly works with both graph and collections
GraphX Framework & Algorithms
Algorithms
Spark Packages
Users & Distributors…
Thanks to Apache Spark by….
Started using it in our projects…
Contribute to their open source community…
Socialize Spark ..
Backup Slides
SPARK CLUSTER
Cluster Support
• Standalone – a simple cluster manager included with Spark that makes it easy to set
up a cluster
• Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and
service applications
• Hadoop YARN – the resource manager in Hadoop 2
Spark On Mesos
Spark on YARN
Data Science Process
Data Science in Practice
• Data Collection
• Munging
• Analysis
• Visualization
• Decision
Real Time Feedback
SQL Optimization (Catalyst)
Project Tungsten
• Memory Management and Binary Processing: leveraging application semantics to
manage memory explicitly and eliminate the overhead of JVM object model and
garbage collection
• Cache-aware computation: algorithms and data structures to exploit memory
hierarchy
• Code generation: using code generation to exploit modern compilers and CPUs
BDAS - Berkeley Data Analytics
Stackhttps://amplab.cs.berkeley.edu/software/
BDAS, the Berkeley Data Analytics Stack, is an open source software stack that
integrates software components being built by the AMPLab to make sense of Big Data.
Optimization
• groupBy is costlier – use mapr() or reduceByKey()
• RDD storage level MEMOR_ONLY is better
Optimization Code Example
RDDs vs Distributed Shared Mem
DAG Visualization
Spark + Akka+Spray
Spark R Architecture
PySpark
GraphX representation
Links References
• Spark
• Spark Submit 2015
• Spark External Projects
• Spark Central
Project Tungsten Roadmap
TACHYON
• Tachyon is a memory-centric distributed storage system enabling
reliable data sharing at memory-speed across cluster frameworks,
such as Spark and MapReduce. It achieves high performance by
leveraging lineage information and using memory aggressively.
Tachyon caches working set files in memory, thereby avoiding going
to disk to load datasets that are frequently read. This enables
different jobs/queries and frameworks to access cached files at
memory speed.
Blink DB
Batches…
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as RDDs and processes
them using RDD operations
• Finally, the processed results of the RDD operations are
returned in batches
Micro Batch
Dstream (Discretized Streams)
DStream is represented by a continuous series of RDDs
Window Operation & Checkpoint
Streaming
• Scalable high-throughput
streaming process of live data
• Integrate with many sources
• Fault-tolerant- Stateful
exactly-once semantics out of
box
• Combine streaming with
batch and interactive queries
Spark streaming
data streams
Receiv
ers
batches
as RDDs
results as
RDDs
Streaming Fault Tolerance
Spark Streaming UI
Micro Batch (Near Real Time)
Micro Batch
Spark with Storm
Spark + Cassandra
Big Data Landscape
100 opensourceBig Dataarchitecturepapers