Introduction to Apache Spark Basics

Apache Spark is a unified computing engine designed for large-scale distributed data processing, supporting multiple programming languages such as Python, Java, Scala, and R. It offers libraries for machine learning, SQL, stream processing, and graph processing, allowing users to perform various workloads with a single application. Spark emphasizes fast, parallel computation and can read data from various storage systems, requiring only Java for installation.

Uploaded by

20912023100044
Copyright
© All Rights Reserved

Spark for Big Data
Agenda: Chapter 1

01 What is Apache Spark?

02 Apache Spark’s Philosophy

03 Running Spark
1. What is Apache Spark?

▪ Apache Spark is a unified computing engine and a set of libraries designed for large-scale distributed data processing.

▪ Spark supports multiple widely used programming languages (Python, Java, Scala, and R).

▪ It includes libraries with composable APIs for machine learning (MLlib), SQL for interactive queries (Spark SQL), stream processing (Structured Streaming) for interacting with real-time data, and graph processing (GraphX).
2. Apache Spark’s Philosophy

01 Unified

02 Computing Engine

03 Libraries
Unified

❖ Spark operations can be applied across many types of workloads and expressed in any of the supported programming languages: Scala, Java, Python, SQL, and R.

❖ Spark offers unified libraries with well-documented APIs that include the following modules as core components: Spark SQL, Spark Structured Streaming, Spark MLlib, and GraphX, combining all the workloads running under one engine.

❖ You can write a single Spark application that can do it all: no need for distinct engines for disparate workloads, no need to learn separate APIs. With Spark, you get a unified processing engine for your workloads.
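As a rough illustration of the "single application" idea, the sketch below runs the same aggregation once through Spark SQL and once through the DataFrame API on one session. It assumes `spark` is a `pyspark.sql.SparkSession` created elsewhere; the function name `order_totals`, the `orders` view, and the column names are hypothetical and not taken from the slides.

```python
def order_totals(spark, rows):
    """Compute per-customer totals twice: via SQL and via the DataFrame API.

    Assumptions: `spark` is a pyspark.sql.SparkSession and `rows` is a list
    of (customer, amount) tuples. All names and data are illustrative only.
    """
    orders = spark.createDataFrame(rows, ["customer", "amount"])

    # Register the DataFrame so SQL queries can refer to it as a table.
    orders.createOrReplaceTempView("orders")
    via_sql = spark.sql(
        "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
    )

    # The same aggregation expressed with DataFrame methods on the same engine.
    via_api = orders.groupBy("customer").sum("amount")

    return via_sql, via_api
```

With PySpark installed, a local session could be created with `SparkSession.builder.master("local[*]").getOrCreate()` and passed in; calling `.show()` on either result would then trigger the computation.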
Computing Engine

❖ Spark focuses on its fast, parallel computation engine rather than on storage.

❖ That means you can use Spark to read data stored in a variety of storage systems, including:

Cloud storage systems:
✓ Azure Storage
✓ Amazon S3

Distributed file systems:
✓ Apache Hadoop
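Because the engine is decoupled from storage, the read call typically stays the same and only the path changes. The sketch below illustrates this under several assumptions: `spark` is a `pyspark.sql.SparkSession`, the data is stored as Parquet, and every bucket, container, account, and host name is a placeholder. Reading from S3 or Azure also generally requires the matching Hadoop connector and credentials to be configured, which is not shown here.

```python
# Hypothetical locations: all bucket, container, account, and host names
# below are placeholders, not real endpoints.
EXAMPLE_PATHS = {
    "amazon_s3": "s3a://example-bucket/events/",
    "azure_storage": (
        "abfss://example-container@exampleaccount.dfs.core.windows.net/events/"
    ),
    "hadoop_hdfs": "hdfs://namenode.example.com:8020/events/",
}


def read_events(spark, source):
    """Read Parquet data from one of the example locations.

    Assumption: `spark` is a pyspark.sql.SparkSession. Only the path differs
    between storage systems; the read call itself is identical.
    """
    return spark.read.parquet(EXAMPLE_PATHS[source])
```

For example, `read_events(spark, "hadoop_hdfs")` and `read_events(spark, "amazon_s3")` would return DataFrames that can be processed with exactly the same code.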
Libraries

❖ Spark’s final component is its libraries, which build on its design as a unified engine to provide a unified API for common data analysis tasks.

❖ Spark includes libraries for SQL and structured data (Spark SQL), machine learning (MLlib), stream processing (Spark Streaming), and graph analytics (GraphX).
3. Running Spark

You can use Spark from Python, Java, Scala, R, or SQL. Spark itself is written in Scala and runs on the Java Virtual Machine (JVM), so all you need to run Spark is an installation of Java.

If you want to use the Python API, you will also need a Python interpreter. The original guidance of version 2.7 or later applies to older Spark releases; recent releases require Python 3. If you want to use R, you will need a version of R on your machine.
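A quick way to check these prerequisites on a machine is sketched below, using only the Python standard library. The function name `check_spark_prerequisites` is hypothetical, and looking for `java` on the `PATH` is only a heuristic: an installation that is reachable solely through `JAVA_HOME` would not be detected this way.

```python
import shutil
import sys


def check_spark_prerequisites():
    """Report whether a `java` executable and a Python 3 interpreter are available."""
    return {
        # Heuristic: True if an executable named `java` is found on the PATH.
        "java_on_path": shutil.which("java") is not None,
        # Version of the interpreter running this script, e.g. "3.11".
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        "python_3": sys.version_info.major >= 3,
    }


if __name__ == "__main__":
    for name, value in check_spark_prerequisites().items():
        print(f"{name}: {value}")
```

If `java_on_path` is reported as `False`, installing a Java runtime or adding it to the `PATH` is the first thing to try before launching Spark.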
Thank you
