Spark for Big Data
Agenda: Chapter 1
01 What is Apache Spark?
02 Apache Spark’s Philosophy
03 Running Spark
1. What is Apache Spark?
• Apache Spark is a unified computing engine and a set of libraries designed for large-scale distributed data processing.
• Spark supports multiple widely used programming languages (Python, Java, Scala, and R).
• It includes libraries with composable APIs for machine learning (MLlib), SQL for interactive queries (Spark SQL), stream processing for interacting with real-time data (Structured Streaming), and graph processing (GraphX).
2. Apache Spark’s Philosophy
01 Unified
02 Computing Engine
03 Libraries
Unified
• Spark operations can be applied across many types of workloads and expressed in any of the supported programming languages: Scala, Java, Python, SQL, and R.
• Spark offers unified libraries with well-documented APIs. Its core modules (Spark SQL, Structured Streaming, MLlib, and GraphX) let all of these workloads run under one engine.
• You can write a single Spark application that does it all: there is no need for distinct engines for disparate workloads and no need to learn separate APIs. With Spark, you get a unified processing engine for your workloads.
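To make the "single application" point concrete, here is a minimal sketch, assuming PySpark and Java are installed locally. It aggregates the same small dataset once through the DataFrame API and once through Spark SQL within one session. The sample rows, the `orders` view name, and the helper names `python_totals` and `run_demo` are all hypothetical, chosen only for illustration.

```python
# Hypothetical sample data: (customer, category, amount).
ORDERS = [
    ("alice", "books", 12.0),
    ("bob", "games", 30.0),
    ("alice", "games", 8.0),
]
COLUMNS = ["customer", "category", "amount"]

# SQL text run against a temporary view named "orders" (an assumed name).
TOTALS_SQL = (
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY customer"
)


def python_totals(rows):
    """Plain-Python total per customer, used only as a cross-check."""
    totals = {}
    for customer, _category, amount in rows:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


def run_demo():
    """Run both styles of query in one local Spark session (needs PySpark)."""
    # Imported here so the rest of the sketch loads without PySpark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("unified-demo")
        .getOrCreate()
    )
    try:
        orders = spark.createDataFrame(ORDERS, COLUMNS)

        # DataFrame API: total amount per category.
        by_category = orders.groupBy("category").agg(
            F.sum("amount").alias("total")
        )

        # Spark SQL: total amount per customer, over the same data.
        orders.createOrReplaceTempView("orders")
        by_customer = spark.sql(TOTALS_SQL)

        category_totals = {r["category"]: r["total"] for r in by_category.collect()}
        customer_totals = {r["customer"]: r["total"] for r in by_customer.collect()}
        return category_totals, customer_totals
    finally:
        spark.stop()
```

Calling `run_demo()` on a machine with Spark available returns both sets of totals. The customer totals can be compared against `python_totals(ORDERS)` as a sanity check.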
Computing Engine
• Spark focuses on its fast, parallel computation engine rather than on storage.
• That means you can use Spark to read data stored in a variety of storage systems, including:
  ✓ Cloud storage systems: Azure Storage, Amazon S3
  ✓ Distributed file systems: Apache Hadoop
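Because Spark separates computation from storage, the same read call can point at different systems just by changing the URI. The sketch below shows that idea under several assumptions. It uses the `s3a://`, `abfss://` (Azure Data Lake Storage Gen2), and `hdfs://` schemes from the Hadoop connectors, and it assumes the data is stored as Parquet. The helper names and all bucket, account, and host names are hypothetical placeholders. Connector libraries and credentials must be configured separately.

```python
def s3_uri(bucket, key):
    """Amazon S3 location via the Hadoop S3A connector."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def azure_uri(container, account, path):
    """Azure Data Lake Storage Gen2 location via the ABFS connector."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def hdfs_uri(namenode, port, path):
    """HDFS location; the NameNode host and port depend on the cluster."""
    return f"hdfs://{namenode}:{port}/{path.lstrip('/')}"


def read_parquet(spark, uri):
    """Read Parquet data from any of the URIs above.

    `spark` is an existing SparkSession. Assumes the matching connector
    and credentials are already set up for that session.
    """
    return spark.read.parquet(uri)


# Placeholder locations: only the scheme and layout matter here.
EXAMPLE_URIS = [
    s3_uri("my-bucket", "data/events"),
    azure_uri("raw", "myaccount", "data/events"),
    hdfs_uri("namenode.example.com", 8020, "data/events"),
]
```

The processing code that follows `read_parquet` stays the same regardless of which storage system the URI names.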
Libraries
• Spark’s final component is its libraries, which build on its design as a unified engine to provide a unified API for common data analysis tasks.
• Spark includes libraries for SQL and structured data (Spark SQL), machine learning (MLlib), stream processing (Spark Streaming and Structured Streaming), and graph analytics (GraphX).
3. Running Spark
You can use Spark from Python, Java, Scala, R, or SQL. Spark itself is written in
Scala and runs on the Java Virtual Machine (JVM), so all you need to run Spark is
an installation of Java.
If you want to use the Python API, you will also need a Python interpreter
(version 2.7 or later for older Spark releases; recent releases require Python 3).
If you want to use R, you will need a version of R on your machine.
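The prerequisites above can be checked with a short script. This is a minimal sketch using only the Python standard library. It looks for `java` and `Rscript` on the `PATH`, reads the `JAVA_HOME` environment variable, and reports the running Python version. The function names and the report layout are hypothetical, and the script does not verify that any version is one a given Spark release supports.

```python
import os
import shutil
import subprocess
import sys


def java_version_banner(java_path):
    """Return the text printed by `java -version` (it writes to stderr)."""
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True
    )
    return (result.stderr or result.stdout).strip()


def check_prerequisites():
    """Collect what is installed; None means the tool was not found."""
    return {
        "java": shutil.which("java"),            # needed for every Spark API
        "java_home": os.environ.get("JAVA_HOME"),
        "python": ".".join(map(str, sys.version_info[:3])),  # for the Python API
        "r": shutil.which("Rscript"),            # only needed for the R API
    }


report = check_prerequisites()
for name, value in report.items():
    print(f"{name}: {value if value else 'not found'}")

if report["java"]:
    print(java_version_banner(report["java"]))
```

A missing `java` entry is the one result that blocks every Spark API. The Python and R entries matter only for those language bindings.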
Thank you