0% found this document useful (0 votes)
9 views33 pages

Spark Graphs and Probabilistic Structures

Lecture with beautiful chart

Uploaded by

Ziheng Zhang
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views33 pages

Spark Graphs and Probabilistic Structures

Lecture with beautiful chart

Uploaded by

Ziheng Zhang
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

MF810 Advanced Programming

Data Structure and Algorithms

Lecture 10
Graph and Probabilistic Data Structure

Jun Fan
junfan@[Link]

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 1


Review

Apache Spark
 Spark Parquet
 Spark Dataframe
 Spark SQL
 e-commence analysis example
 Spark MLlib
 Spark DAG
Case Study – 2010 Stock Flash Crash
 Circuit Breaker

Lecture 9 – 3/21/2024 MF810 Advanced Programming – J Fan 2


Circuit Breakers – by countries

Country Benchmark Threshold Duration Volatility


USA S&P 500 -7% 15m 15.9%
-13% 15m 1% daily
-20% Close
India Nifty 50 +/-10% 15m 16.7%
+/-15% 15m 1% daily
+/-20% Close
Korea KOSPI -8% 20m 16.2%
-15% 20m 1% daily
-20% Close
China (Jan CSI300 +/-5% 15m 25.0%
2016) +/-7% Close 1.6% daily

Lecture 9 – 3/21/2024 MF810 Advanced Programming – J Fan 3


Fragmented Time Management

 Set a goal
 Prioritize
 Collect small learning pieces
 Don’t force yourself, relax and pick the pieces you need at
the moment
 Make all you learned into a system

Lecture 9 – 3/21/2024 MF810 Advanced Programming – J Fan 4


Agenda

 Graph
 Introduction
 DFS/BFS
 Dijkstra’s
 Spinning Trees
 Notebook example
 Hash function (review)
 Probabilistic Data Structure
 Bloom Filters
 Count-min Sketch
 HyperLogLog

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 5


Graphs: Introduction

Konigsberg Bridge Problem

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 6


Graphs: Introduction

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 7


Graphs: Application in Data Analysis

Typical Applications in Data Analysis


• Supply Chain
• Social Networks

• Fraud Detection

• Knowledge Graph

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 8


Graph Data Structure

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 9


Graphs: Traversal - Breadth First
Search

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 10


Graphs: Traversal - Depth First
Search

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 11


Graphs: Shortest Path - Dijkstra’s
Algorithm

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 12


Graphs: Shortest Path - Dijkstra’s
Algorithm

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 13


Graphs: Spanning Trees

Spanning Tree
A subset of graph G that has all vertices covered with minimum possible number of
edges. A spanning tree has no cycles and cannot be disconnected.

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 14


Graphs: Minimum Spanning Trees

Minimum Spanning Tree Problem

Given an undirected graph and edge weights

, find a spanning tree of minimum weight ∈

Approaches:
• Graph traversal (e.g. DFS) tracking edges that form a tree
• Iterate edges and add to output any edge that doesn’t create a cycle

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 15


Graphs: Prim’s Algorithm

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 16


Graphs: Kruskal’s Algorithm

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 17


Graph Data Processing: GraphX

GraphX is the Spark API for graphs and graph-parallel computation.

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 18


Graph Data Processing: GraphFrames
Apache Spark GraphFrames is a graph processing library where vertices and
edges are represented as DataFrames, allowing the storage of arbitrary data
with each vertex and edge. APIs exist for Scala, Java and Python to provide and
extend functionality on top of GraphX

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 19


Relational Database vs Graph Database
• Data is stored as graphs in a Graph database, unlike rows and columns in a
traditional database
• Graphs are Natural language friendly, represents relationships in simple
English
• The query performance in a Graph database is relatively faster than in a
traditional database
• Graph visuals easily unlock network/relationship insights
• Graphs are flexible to add new data attributes, which isn’t possible in a
traditional database (which has a rigid schema)

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 20


Agenda

 Graph
 Introduction
 DFS/BFS
 Dijkstra’s
 Spinning Trees
 Notebook example
 Hash function (review)
 Probabilistic Data Structure
 Bloom Filters
 Count-min Sketch
 HyperLogLog

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 21


Data Structure – Hash Table

Lecture 2 – 1/25/2024 MF810 Advanced Programming – J Fan 22


Data Structure – Hash Table

What makes a good hash function?


A good hash function satisfies the assumption of simple uniform hashing,
where each key is equally likely to hash to any of the m slots, independent of
where any other key has hashed.

Specific properties of a “good” hash function:


• O(1) time: Fast
• Uniform distribution among keys: well distributed
• Deterministic: not random
• Keys are derived independent of patterns in the input data: no state
• Low number of collisions

Lecture 2 – 1/25/2024 MF810 Advanced Programming – J Fan 23


Data Structure – Types of Hash Functions

Lecture 2 – 1/25/2024 MF810 Advanced Programming – J Fan 24


Agenda

 Graph
 Introduction
 DFS/BFS
 Dijkstra’s
 Spinning Trees
 Notebook example
 Hash function (review)
 Probabilistic Data Structure
 Bloom Filters
 Count-min Sketch
 HyperLogLog

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 25


Probabilistic Data Structures: Introduction

Probabilistic data structures are algorithms or data structures that provide


approximated answers with a trade-off between accuracy and efficiency,
especially in terms of memory and computational overhead.

They are extremely useful for big data and streaming application because
they allow to dramatically decrease the amount of memory needed.

Advantages:
• Memory efficiency
• Easily parallelizable
• Constant query time
Disadvantages:
• Complexity
• Probability of error
• Limited Functionality

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 26


Is this tinyurl alias available?

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 27


Probabilistic Structures: Bloom Filter

A space efficient, probabilistic data structure used to determine whether a member is


present in a set. They can tell you whether an element is definitely not in the set, but
only possibly present in the set.

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 28


Counting in very large dataset

• How many times have we seen the word “metauniverse” in Meta earning
transcript?
• What’s the number of trades for Meta v.s. the number of trades for Nvidia in
NASDAQ exchange?

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 29


Probabilistic Structures: Count-Min Sketch

Given a stream of length and a parameter (such that >> ), in a single pass
over the stream, we want to find any elements that appear at least times.

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 30


Count the unique BU IDs

Boston University’s Alumni network spans the globe with more than 430,000
alumni. Jun conducted a survey and accidentally spam the alumni email inbox
by sending multiple survey requests. As the result, some alumni responded
the survey multiple times.

Question: How to count the number of unique BU IDs from the survey?

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 31


Coin flipping

How many times of flipping is expected to produce a sequence of TTT?

Flip 1 Flip 2 Flip 3


H H H
H H T
H T H
H T T
T H H
T H T
T T H
T T T

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 32


Probabilistic Algorithm: HyperLogLog

[Link]

Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 33

You might also like