Uploaded by Venkata Naga Ravi

Processing Large Data with Apache Spark -- HasGeek

This document provides an overview of Apache Spark, highlighting its advantages over traditional big data processing frameworks like Hadoop, including faster processing speeds and improved fault tolerance. It discusses Spark's architecture, key components such as resilient distributed datasets (RDDs) and Spark SQL, as well as integrations with various data sources and libraries for machine learning and graph processing. The document also touches on data science processes and optimizations for effective big data analytics.

Agenda
• Big Data Overview
• Spark Overview
• Spark Internals
• Spark Libraries

BIG DATA OVERVIEW

Big Data -- Digital Data growth…

V-V-V (Volume, Velocity, Variety)

Legacy Architecture Pain Points
• Report arrival latency is quite high: joins and aggregations take hours
• Existing frameworks cannot do both:
  - Either stream processing of 100s of MB/s with low latency
  - Or batch processing of TBs of data with high latency
• Expressing business logic in Hadoop MR is challenging

SPARK OVERVIEW
Why Spark?

Why Spark
• Separate, fast, MapReduce-like engine
• In-memory data storage for very fast iterative queries
• Better fault tolerance
• Combines SQL, streaming and complex analytics
• Runs on Hadoop, Mesos, standalone, or in the cloud
• Data sources: HDFS, Cassandra, HBase and S3

In Memory - Spark vs Hadoop
Improved efficiency over MapReduce
• Up to 100x faster in memory, 2-10x faster on disk
• Up to 40x faster than Hadoop

Spark In & Out
[Diagram: Spark and its surrounding ecosystem. Labels: RDBMS, Streaming, SQL, GraphX, BlinkDB, Hadoop Input Format, Apps, Tachyon, MLlib]
Distributions:
- CDH
- HDP
- MapR
- DSE
Ref: http://training.databricks.com/intro.pdf

Spark Streaming + SQL

Benchmarking & Best Facts

SPARK INSIDE – AROUND RDD

Resilient Distributed Dataset (RDD)
Immutable + Distributed + Cacheable + Lazily evaluated
• Distributed collections of objects
• Can be cached in memory across cluster nodes
• Manipulated through various parallel operations
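
The properties listed above can be illustrated without a cluster. The following is a minimal single-process sketch in plain Python, not Spark's API: the class `ToyRDD` and its methods are hypothetical names chosen for illustration, and distribution across nodes is left out entirely. It only shows that transformations build new objects without running anything, that an action triggers the work, and that a cached result is reused.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark): lazy and cacheable."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument function returning an iterator
        self._cache = None             # filled by the first action after cache()
        self._cache_enabled = False

    @classmethod
    def from_list(cls, items):
        data = tuple(items)            # private copy: the source data never changes
        return cls(lambda: iter(data))

    def map(self, fn):
        # Transformation: returns a NEW ToyRDD and runs nothing yet.
        return ToyRDD(lambda: (fn(x) for x in self._iterate()))

    def filter(self, pred):
        # Transformation: also lazy, also returns a new object.
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self):
        # Ask for the computed elements to be kept after the first action.
        self._cache_enabled = True
        return self

    def _iterate(self):
        if self._cache is not None:
            return iter(self._cache)
        if self._cache_enabled:
            self._cache = list(self._compute())
            return iter(self._cache)
        return self._compute()

    def collect(self):
        # Action: this is the point where the computation actually happens.
        return list(self._iterate())


calls = []

def traced_double(x):
    calls.append(x)                    # record each invocation to observe laziness
    return x * 2

base = ToyRDD.from_list([1, 2, 3, 4])
doubled = base.map(traced_double).cache()
calls_before_action = len(calls)       # still 0: nothing has run yet
large = doubled.filter(lambda x: x > 4).collect()
again = doubled.collect()              # served from the cache, no recomputation
```

In this sketch `traced_double` runs only when `collect()` is called, and the second `collect()` on `doubled` does not call it again.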

RDD Types

RDD Operation

Memory and Persistence

Dependency Types

Spark Cluster Overview
• Application
• Driver program
• Cluster manager
• Worker node
• Job
• Stage
• Executor
• Task

Job Flow

Task Scheduler, DAG
• Pipelines functions within a stage
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles
rdd1.map(splitlines).filter("ERROR")
rdd2.map(splitlines).groupBy(key)
rdd2.join(rdd1, key).take(10)
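
"Pipelines functions within a stage" means that element-wise steps such as the map and filter above are applied to each record in one pass, instead of materializing the full output of the map before the filter starts. A rough sketch of that idea using plain Python generators follows; it is not Spark code, and the function names and the sample log lines are made up for illustration.

```python
def traced_map(fn, records, log):
    """Lazily apply fn to each record, logging when each record is touched."""
    for record in records:
        log.append("map")
        yield fn(record)


def traced_filter(pred, records, log):
    """Lazily keep records matching pred, logging when each record is touched."""
    for record in records:
        log.append("filter")
        if pred(record):
            yield record


def run_pipelined(lines):
    """Chain map and filter so each record flows through both steps in turn."""
    log = []
    split = traced_map(str.split, lines, log)
    errors = traced_filter(lambda fields: fields[0] == "ERROR", split, log)
    return list(errors), log


# Hypothetical input: three log lines, two of them errors.
result, order = run_pipelined(["INFO ok", "ERROR disk", "ERROR net"])
```

The recorded order alternates map, filter, map, filter, which shows that no intermediate collection is built between the two steps.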

Fault Recovery & Checkpoints
• Efficient fault recovery using lineage
• Log one operation that applies to many elements (lineage)
• Recompute lost partitions on failure
• Checkpoint RDDs to prevent long lineage chains during fault recovery
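
Lineage-based recovery can be sketched in a few lines. The toy model below is plain Python, not Spark's implementation; `LineageDataset` and its methods are hypothetical names. It records the operations applied to a stable source, and when one computed partition is lost it replays those operations for that partition only.

```python
class LineageDataset:
    """Toy model of lineage: a stable source plus a recorded list of operations."""

    def __init__(self, source_partitions, lineage=()):
        self.source = source_partitions    # stable input, one list per partition
        self.lineage = tuple(lineage)      # ordered (kind, function) steps
        self.materialized = {}             # partition index -> computed rows

    def map(self, fn):
        # One logged operation describes the work for every element.
        return LineageDataset(self.source, self.lineage + (("map", fn),))

    def filter(self, pred):
        return LineageDataset(self.source, self.lineage + (("filter", pred),))

    def compute_partition(self, idx):
        # Replay the lineage against the source data of a single partition.
        rows = list(self.source[idx])
        for kind, fn in self.lineage:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        self.materialized[idx] = rows
        return rows

    def materialize(self):
        return [self.compute_partition(i) for i in range(len(self.source))]

    def lose_partition(self, idx):
        # Simulate a node failure that drops one computed partition.
        self.materialized.pop(idx, None)

    def get_partition(self, idx):
        if idx not in self.materialized:
            return self.compute_partition(idx)   # recompute only what was lost
        return self.materialized[idx]


ds = LineageDataset([[1, 2], [3, 4], [5, 6]]).map(lambda x: x * 10).filter(lambda x: x > 20)
full = ds.materialize()
ds.lose_partition(1)
lost = 1 not in ds.materialized
recovered = ds.get_partition(1)
```

The longer the recorded chain, the more work a replay costs, which is the motivation for checkpointing long lineages.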

QUICK DEMO

SPARK STACK DETAILS

Spark SQL
• Seamlessly mix SQL queries with Spark programs
• Load and query data from a variety of sources
• Standard connectivity through JDBC/ODBC
• Hive Compatibility

Data Frames
• A distributed collection of data organized into named columns
• Like a table in a relational database
[Figure 1: Interfaces to Spark SQL, and interaction with Spark. Labels: JDBC, Console, User Programs (Java, Scala, Python), DataFrame API, Catalyst Optimizer, Spark SQL, Spark, Resilient Distributed Datasets]
3.1 DataFrame API
The main abstraction in Spark SQL's API is a DataFrame, a distributed collection of rows with a homogeneous schema. A DataFrame is equivalent to a table in a relational database, and can also be manipulated in similar ways to the "native" distributed collections …

SparkR
• New R language API for Spark and Spark SQL
• Exposes existing Spark functionality in an R-friendly syntax via the DataFrame API

Spark Streaming
• Sources: Flume, HDFS, Kinesis, Kafka, Twitter
• Sinks: File systems, Databases, Dashboards
• High-level API: joins, windows, …; often 5x less code
• Fault-tolerant: exactly-once semantics, even for stateful ops
• Integration: integrates with MLlib, SQL, DataFrames, GraphX
• Chop up the live stream into batches of X seconds. A DStream is represented by a continuous series of RDDs

MLlib
• Scalable machine learning library
• Iterative computing -> high-quality algorithms, up to 100x faster than Hadoop

MLlib Algorithms

ML Pipeline
• Feature Extraction
• Normalization
• Dimensionality reduction
• Model training

GraphX
• Spark’s API for graphs and graph-parallel computation
• Graph abstraction: a directed multigraph with properties attached to each vertex and edge
• Works seamlessly with both graphs and collections
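
To make the graph abstraction above concrete, here is a small plain-Python sketch of a directed multigraph with properties on vertices and edges. It is not the GraphX API; `PropertyGraph` and its methods are hypothetical names. Edges are kept in a list so that several edges between the same pair of vertices can coexist, and the same data can be read either as a graph (degrees) or as an ordinary collection of triplets.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Toy directed multigraph (hypothetical, not GraphX)."""

    vertices: dict = field(default_factory=dict)   # vertex id -> property
    edges: list = field(default_factory=list)      # (src, dst, property) triples

    def add_vertex(self, vid, prop):
        self.vertices[vid] = prop

    def add_edge(self, src, dst, prop):
        # A list rather than a set or dict, so parallel edges are preserved.
        self.edges.append((src, dst, prop))

    def out_degrees(self):
        # Graph view: count outgoing edges per vertex.
        degrees = {vid: 0 for vid in self.vertices}
        for src, _dst, _prop in self.edges:
            degrees[src] = degrees.get(src, 0) + 1
        return degrees

    def triplets(self):
        # Collection view: each edge joined with both endpoint properties.
        return [(self.vertices[s], p, self.vertices[d]) for s, d, p in self.edges]


g = PropertyGraph()
g.add_vertex(1, "alice")
g.add_vertex(2, "bob")
g.add_edge(1, 2, "follows")
g.add_edge(1, 2, "emails")     # second edge between the same pair of vertices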

GraphX Framework & Algorithms

Spark Packages

Users & Distributors…

Thanks to Apache Spark by….
Started using it in our projects…
Contribute to their open source community…
Socialize Spark ..

Backup Slides

SPARK CLUSTER

Cluster Support
• Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster
• Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications
• Hadoop YARN – the resource manager in Hadoop 2

Spark On Mesos

Spark on YARN

Data Science Process
Data Science in Practice
• Data Collection
• Munging
• Analysis
• Visualization
• Decision

Real Time Feedback

SQL Optimization (Catalyst)

Project Tungsten
• Memory Management and Binary Processing: leveraging application semantics to manage memory explicitly and eliminate the overhead of JVM object model and garbage collection
• Cache-aware computation: algorithms and data structures to exploit memory hierarchy
• Code generation: using code generation to exploit modern compilers and CPUs

BDAS - Berkeley Data Analytics Stack
https://amplab.cs.berkeley.edu/software/
BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.

Optimization
• groupBy is costlier because every value is shuffled; use mapr() or reduceByKey()
• RDD storage level MEMORY_ONLY is better when the data fits in memory
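
The cost difference between grouping and reducing by key comes from how many records cross the shuffle. The sketch below imitates both strategies in plain Python over lists that stand in for partitions; it is not Spark code, and the function names and sample data are illustrative assumptions. Each function returns the per-key totals together with the number of records that would be shuffled.

```python
from collections import defaultdict


def group_then_aggregate(partitions, aggregate):
    """groupBy-style: every (key, value) pair crosses the shuffle unchanged."""
    shuffled = [pair for part in partitions for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    totals = {key: aggregate(values) for key, values in groups.items()}
    return totals, len(shuffled)


def reduce_by_key(partitions, combine):
    """reduceByKey-style: combine inside each partition first, then shuffle."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = combine(local[key], value) if key in local else value
        shuffled.extend(local.items())        # at most one record per key per partition
    totals = {}
    for key, value in shuffled:
        totals[key] = combine(totals[key], value) if key in totals else value
    return totals, len(shuffled)


# Hypothetical word-count style input spread over two partitions.
parts = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]
grouped_totals, grouped_shuffled = group_then_aggregate(parts, sum)
reduced_totals, reduced_shuffled = reduce_by_key(parts, lambda x, y: x + y)
```

Both approaches produce the same totals; the reduce-style version shuffles fewer records because values are pre-combined per partition.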

Optimization Code Example

RDDs vs Distributed Shared Memory

DAG Visualization

Spark + Akka + Spray

SparkR Architecture

PySpark

GraphX representation

Links References
• Spark
• Spark Summit 2015
• Spark External Projects
• Spark Central

Project Tungsten Roadmap

TACHYON
• Tachyon is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. It achieves high performance by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, thereby avoiding going to disk to load datasets that are frequently read. This enables different jobs/queries and frameworks to access cached files at memory speed.

BlinkDB

Batches…
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as RDDs and processes them using RDD operations
• Finally, the processed results of the RDD operations are returned in batches
[Figure: Micro Batch]

DStream (Discretized Stream)
DStream is represented by a continuous series of RDDs
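
The micro-batch idea can be sketched as follows: timestamped events are cut into consecutive batches of a fixed length, and each batch is then processed as an ordinary collection. This is plain Python rather than the Spark Streaming API; `micro_batches`, the event tuples and the 2-second interval are assumptions made for illustration.

```python
def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Intervals with no events still produce an empty batch, so the result is a
    continuous series, one entry per interval.
    """
    by_interval = {}
    for timestamp, value in events:
        index = int(timestamp // batch_seconds)
        by_interval.setdefault(index, []).append(value)
    if not by_interval:
        return []
    first, last = min(by_interval), max(by_interval)
    return [by_interval.get(i, []) for i in range(first, last + 1)]


# Hypothetical stream: (seconds since start, payload).
stream = [(0.5, "a"), (1.2, "b"), (2.1, "a"), (6.0, "c")]
batches = micro_batches(stream, batch_seconds=2)

# Each batch is processed with the same ordinary operation, batch by batch.
counts_per_batch = [len(batch) for batch in batches]
```

Here the series of lists plays the role of the series of RDDs, and the per-batch results come back as a series as well.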

Window Operation & Checkpoint

Streaming
• Scalable, high-throughput stream processing of live data
• Integrates with many sources
• Fault-tolerant: stateful exactly-once semantics out of the box
• Combines streaming with batch and interactive queries

Spark Streaming
[Diagram: data streams -> Receivers -> batches as RDDs -> results as RDDs]

Streaming Fault Tolerance

Spark Streaming UI

Micro Batch (Near Real Time)

Spark with Storm

Spark + Cassandra

Big Data Landscape

100 open source Big Data architecture papers
