MF810 Advanced Programming
Data Structure and Algorithms
Lecture 10
Graph and Probabilistic Data Structure
Jun Fan
junfan@[Link]
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 1
Review
Apache Spark
Spark Parquet
Spark Dataframe
Spark SQL
e-commence analysis example
Spark MLlib
Spark DAG
Case Study – 2010 Stock Flash Crash
Circuit Breaker
Lecture 9 – 3/21/2024 MF810 Advanced Programming – J Fan 2
Circuit Breakers – by countries
Country Benchmark Threshold Duration Volatility
USA S&P 500 -7% 15m 15.9%
-13% 15m 1% daily
-20% Close
India Nifty 50 +/-10% 15m 16.7%
+/-15% 15m 1% daily
+/-20% Close
Korea KOSPI -8% 20m 16.2%
-15% 20m 1% daily
-20% Close
China (Jan CSI300 +/-5% 15m 25.0%
2016) +/-7% Close 1.6% daily
Lecture 9 – 3/21/2024 MF810 Advanced Programming – J Fan 3
Fragmented Time Management
Set a goal
Prioritize
Collect small learning pieces
Don’t force yourself, relax and pick the pieces you need at
the moment
Make all you learned into a system
Lecture 9 – 3/21/2024 MF810 Advanced Programming – J Fan 4
Agenda
Graph
Introduction
DFS/BFS
Dijkstra’s
Spinning Trees
Notebook example
Hash function (review)
Probabilistic Data Structure
Bloom Filters
Count-min Sketch
HyperLogLog
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 5
Graphs: Introduction
Konigsberg Bridge Problem
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 6
Graphs: Introduction
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 7
Graphs: Application in Data Analysis
Typical Applications in Data Analysis
• Supply Chain
• Social Networks
• Fraud Detection
• Knowledge Graph
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 8
Graph Data Structure
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 9
Graphs: Traversal - Breadth First
Search
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 10
Graphs: Traversal - Depth First
Search
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 11
Graphs: Shortest Path - Dijkstra’s
Algorithm
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 12
Graphs: Shortest Path - Dijkstra’s
Algorithm
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 13
Graphs: Spanning Trees
Spanning Tree
A subset of graph G that has all vertices covered with minimum possible number of
edges. A spanning tree has no cycles and cannot be disconnected.
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 14
Graphs: Minimum Spanning Trees
Minimum Spanning Tree Problem
Given an undirected graph and edge weights
, find a spanning tree of minimum weight ∈
Approaches:
• Graph traversal (e.g. DFS) tracking edges that form a tree
• Iterate edges and add to output any edge that doesn’t create a cycle
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 15
Graphs: Prim’s Algorithm
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 16
Graphs: Kruskal’s Algorithm
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 17
Graph Data Processing: GraphX
GraphX is the Spark API for graphs and graph-parallel computation.
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 18
Graph Data Processing: GraphFrames
Apache Spark GraphFrames is a graph processing library where vertices and
edges are represented as DataFrames, allowing the storage of arbitrary data
with each vertex and edge. APIs exist for Scala, Java and Python to provide and
extend functionality on top of GraphX
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 19
Relational Database vs Graph Database
• Data is stored as graphs in a Graph database, unlike rows and columns in a
traditional database
• Graphs are Natural language friendly, represents relationships in simple
English
• The query performance in a Graph database is relatively faster than in a
traditional database
• Graph visuals easily unlock network/relationship insights
• Graphs are flexible to add new data attributes, which isn’t possible in a
traditional database (which has a rigid schema)
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 20
Agenda
Graph
Introduction
DFS/BFS
Dijkstra’s
Spinning Trees
Notebook example
Hash function (review)
Probabilistic Data Structure
Bloom Filters
Count-min Sketch
HyperLogLog
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 21
Data Structure – Hash Table
Lecture 2 – 1/25/2024 MF810 Advanced Programming – J Fan 22
Data Structure – Hash Table
What makes a good hash function?
A good hash function satisfies the assumption of simple uniform hashing,
where each key is equally likely to hash to any of the m slots, independent of
where any other key has hashed.
Specific properties of a “good” hash function:
• O(1) time: Fast
• Uniform distribution among keys: well distributed
• Deterministic: not random
• Keys are derived independent of patterns in the input data: no state
• Low number of collisions
Lecture 2 – 1/25/2024 MF810 Advanced Programming – J Fan 23
Data Structure – Types of Hash Functions
Lecture 2 – 1/25/2024 MF810 Advanced Programming – J Fan 24
Agenda
Graph
Introduction
DFS/BFS
Dijkstra’s
Spinning Trees
Notebook example
Hash function (review)
Probabilistic Data Structure
Bloom Filters
Count-min Sketch
HyperLogLog
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 25
Probabilistic Data Structures: Introduction
Probabilistic data structures are algorithms or data structures that provide
approximated answers with a trade-off between accuracy and efficiency,
especially in terms of memory and computational overhead.
They are extremely useful for big data and streaming application because
they allow to dramatically decrease the amount of memory needed.
Advantages:
• Memory efficiency
• Easily parallelizable
• Constant query time
Disadvantages:
• Complexity
• Probability of error
• Limited Functionality
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 26
Is this tinyurl alias available?
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 27
Probabilistic Structures: Bloom Filter
A space efficient, probabilistic data structure used to determine whether a member is
present in a set. They can tell you whether an element is definitely not in the set, but
only possibly present in the set.
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 28
Counting in very large dataset
• How many times have we seen the word “metauniverse” in Meta earning
transcript?
• What’s the number of trades for Meta v.s. the number of trades for Nvidia in
NASDAQ exchange?
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 29
Probabilistic Structures: Count-Min Sketch
Given a stream of length and a parameter (such that >> ), in a single pass
over the stream, we want to find any elements that appear at least times.
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 30
Count the unique BU IDs
Boston University’s Alumni network spans the globe with more than 430,000
alumni. Jun conducted a survey and accidentally spam the alumni email inbox
by sending multiple survey requests. As the result, some alumni responded
the survey multiple times.
Question: How to count the number of unique BU IDs from the survey?
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 31
Coin flipping
How many times of flipping is expected to produce a sequence of TTT?
Flip 1 Flip 2 Flip 3
H H H
H H T
H T H
H T T
T H H
T H T
T T H
T T T
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 32
Probabilistic Algorithm: HyperLogLog
[Link]
Lecture 10 – 3/28/2024 MF810 Advanced Programming – J Fan 33