2/23/2024
Stream Processing
Class Rules
• You can do anything except make noise (chatting, singing…).
• Feel free to interrupt me if you have questions.
• University policy requires that attendance be taken.
• Important: you need at least 80% attendance to be able to
sit for the final exam.
Course Assessment
Tentative, depending on the situation:
Final exam: 50%
Assignment: 20%, individual
Project: 30%, 2-3 members per group; a report and a presentation are
required.
Important: cheating and plagiarism will receive no marks.
A few suggestions…
• Your final grade is based on points – not on an
accumulation of grades.
• You start the class with zero points and earn your
way to your final grade
• If you have an issue or problem, communicate –
send me an email
• If you know you’re not going to meet the deadline
for a quiz or assignment – email me BEFORE the
deadline
BIG DATA, ANALYTICS
Data Deluge
How do I find the relevant data?
Copyright © SAS Institute Inc. All rights reserved.
Big Data Explained
"Big data is what happened when the cost of storing information
became less than the cost of making the decision to throw it away."
- George Dyson
Science Historian and TED Speaker
Big Data: What Is It?
The SAS definition of big data:
The point at which the volume, velocity, and variety of data exceed an
organization’s storage or computation capacity for accurate and timely
decision making
Here are some factors associated
with big data:
• data volume
• data velocity
• data variety
• data variability
• data complexity
Data Volume
Data volumes are increasing due to use
of the following:
• social media (Facebook, Twitter,
Instagram)
• machines talking to machines
• improvements in the manufacturing
process (quality control)
• automated tracking devices
• streaming data feeds
Data Velocity
• business processes that are
more automated
• mergers and acquisitions
• more use of social media
• more use of self-service
applications
• integration of business
applications
Data Variety
• structured data
• unstructured data
• business applications
• unstructured text documents
(articles, blogs, and so on)
• emails
• digital images
• video and audio clips
• streaming data
• stock ticker data
• RFID tag data
• sensor data
Data Variability
• The flow of data changes over time (seasonality, peak
response, social media trends, and so on).
• Data values change over time. How much history do you
keep?
• Data values are different across data sources.
• Data is stored in different formats.
• Data standards change across time.
• What was “valid” five years ago might not be “valid” today.
Data Complexity
• Data comes from a variety of
systems in a variety of
formats. This can make it
difficult to merge, cleanse,
and transform data in a
uniform manner.
Evolution to Big Data
• Traditional to Big Data Infrastructure
• Database servers and traditional data processing tools
• Distributed data systems across horizontally coupled,
independent resources to achieve the scalability needed for
the efficient processing of extensive data sets
• Onsite and cloud computing solutions
Evolution to Big Data
Data Streaming
Internet of Things: application areas
• Smart Cities and Homes
• Connected Customer
• Communications
• Surveillance
• Connected Car / Transportation
• Building Management
• Agriculture
• Energy
• Manufacturing
• Finance / Insurance
• Retail
• Health Care
Most IoT Data Remains Unused
• Data from sensors in manufacturing can provide information to detect
conditions requiring attention.
• Sensors are pervasive: from wearables to rocket engines.
• Sensor data remains largely untapped (not being used for prediction and
optimization).
• Imagine a structure that would allow sensor data to be processed as it gets
produced.
• Therein lies an opportunity.
Traditional Analytics at Rest
[Diagram labels: Data, ETL, Data Storage, Deploy, Alerts, Reports, Decisioning]
Streaming Analytics
Stream – Understand – Act
[Diagram labels: Data, ETL, Data Storage, Deploy, Alerts, Reports, Decisioning, Enrich, Streaming Data Store, Streaming Model Execution]
Streaming Data
• The world is becoming more instrumented and connected.
• Digital data from hardware (e.g., sensors) and software is
flooding in as large, continuous streams.
• Examples: financial markets, surveillance systems,
manufacturing, smart cities, …
• We need to collect, process, and analyze big streams to extract
valuable information, discover new insights in real time, and
detect emerging patterns and outliers.
Real-time Data Analytics
What Is Streaming Data?
• Streaming data is data that keeps flowing with no discrete beginning or
end.
• E.g., data from environmental sensors, body sensors, surveillance cameras,
log files, transactions, …
• A streaming data source emits data records continuously rather than in
batches.
• Most streaming data sources send data in small sizes (often kilobytes)
continuously as the data is generated.
• Usually, the data needs to be processed on the fly.
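To make "processed on the fly" concrete, here is a minimal Python sketch. The source `sensor_stream`, its `temperature` field, and the value range are hypothetical and only simulate a stream; the point is that the consumer updates its result record by record and never stores the stream.

```python
import itertools
import random
from typing import Iterator


def sensor_stream(seed: int = 0) -> Iterator[dict]:
    """Hypothetical source: emits one small record at a time, with no defined end."""
    rng = random.Random(seed)
    for seq in itertools.count():
        yield {"seq": seq, "temperature": round(20.0 + rng.uniform(-1.0, 1.0), 2)}


def running_mean(records: Iterator[dict]) -> Iterator[float]:
    """Update the mean as each record arrives, without keeping past records."""
    count, mean = 0, 0.0
    for record in records:
        count += 1
        mean += (record["temperature"] - mean) / count
        yield mean


# The stream never ends, so the consumer decides how much of it to look at.
first_five = list(itertools.islice(running_mean(sensor_stream()), 5))
print(first_five)
```

Because both functions are generators, only one record is in memory at a time, however long the stream runs.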
Characteristics of Data Streams
Unbounded data
• Conceptually infinite, ever-growing set of data items/events
• Practically continuous stream of data, which needs to be
processed/analyzed
Push model
• The source controls data production and procession
• Publish/subscribe model
Concept of time
• Often need to reason about when data is produced and when processed
data should be output
• Processing time, ingestion time, event time
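The push model and the three notions of time can be illustrated with a small in-memory sketch. The `Broker` and `Event` classes and the fake clock are hypothetical and stand in for a real messaging system; they only show where each timestamp would be taken.

```python
import itertools
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class Event:
    payload: str
    event_time: float             # when the event happened at the source
    ingestion_time: float = 0.0   # when the broker received it
    processing_time: float = 0.0  # when a subscriber handled it


class Broker:
    """Hypothetical in-memory publish/subscribe broker."""

    def __init__(self, clock: Callable[[], float]):
        self._clock = clock
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Event], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        # Push model: the source decides when publish() is called;
        # subscribers do not poll for data.
        event.ingestion_time = self._clock()
        for handler in self._subscribers[topic]:
            event.processing_time = self._clock()
            handler(event)


# A fake clock (100, 101, 102, ...) keeps the example deterministic.
ticks = itertools.count(start=100)
broker = Broker(clock=lambda: float(next(ticks)))

received = []
broker.subscribe("sensors", received.append)
broker.publish("sensors", Event(payload="door opened", event_time=97.0))

event = received[0]
print(event.event_time, event.ingestion_time, event.processing_time)
```

The gap between `event_time` and `processing_time` is the delay a streaming system has to reason about when deciding when results should be output.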
Data Value Continuum
Data exists on a time continuum.
The “things” we do with data are strongly correlated to its age.
The value of data shifts from the individual item to the aggregate along this timeline.
Data Value Chain
Data Streaming
• Traditionally, data is moved in batches.
• Batch processing processes large volumes of batched data with long
latency.
• For many data streams, batch processing cannot be used because the
data is either prohibitively large to store and process in batch or would
be stale by the time it is processed.
• Data streaming (or data stream processing, DSP) is the processing of
streaming data on the fly. (visualizing, summarizing, analytics, …)
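The contrast between batch processing and data streaming can be sketched as follows. The function names and the sample readings are illustrative assumptions: the batch version needs the whole data set stored before it can answer, while the streaming version yields an up-to-date result after every record.

```python
from typing import Iterable, Iterator, List


def batch_total(stored_values: List[float]) -> float:
    """Batch: the full data set must be collected before the result exists."""
    return sum(stored_values)


def streaming_totals(values: Iterable[float]) -> Iterator[float]:
    """Streaming: keep only a running state and emit a result per record."""
    total = 0.0
    for value in values:
        total += value
        yield total


readings = [3.0, 1.0, 4.0]
batch_result = batch_total(readings)
partial_results = list(streaming_totals(iter(readings)))
print(batch_result, partial_results)
```

Both approaches agree on the final value; the streaming version simply makes intermediate results available without waiting for, or storing, the full input.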
Benefits of Data Streaming
• Good for time series analysis.
• Well suited to IoT data stream processing.
• Can be used for real-time aggregation, correlation, filtering, or
sampling.
• Enables the analysis of data in real time to gain insights into a wide range
of activities.
• Can be paired with planned actions based on the results of real-time
analytics.
• Can feed results back to improve the effectiveness of future monitoring,
analytics, and actions.
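As one possible illustration of real-time filtering and aggregation, the sketch below averages values over fixed, non-overlapping time windows (often called tumbling windows, a term not used in these slides). The event layout `(event_time, value)`, the window width, and the threshold are assumptions, and events are assumed to arrive in time order.

```python
from typing import Iterable, Iterator, Tuple


def tumbling_window_mean(
    events: Iterable[Tuple[float, float]], width: float, min_value: float
) -> Iterator[Tuple[float, float]]:
    """Yield (window_start, mean) over values >= min_value, one window at a time."""
    current_start = None
    total, count = 0.0, 0
    for event_time, value in events:
        window_start = (event_time // width) * width
        if current_start is None:
            current_start = window_start
        if window_start != current_start:
            # The previous window is complete: emit it and reset the state.
            if count:
                yield current_start, total / count
            current_start, total, count = window_start, 0.0, 0
        if value >= min_value:  # filtering step
            total += value      # aggregation step
            count += 1
    if current_start is not None and count:
        yield current_start, total / count


events = [(0.5, 10.0), (3.0, 20.0), (9.9, 1.0), (10.2, 30.0), (15.0, 50.0), (21.0, 5.0)]
window_means = list(tumbling_window_mean(events, width=10.0, min_value=5.0))
print(window_means)
```

Only the state of the current window is kept, so memory use does not grow with the length of the stream.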
Patterns That Drive Most Streaming Use Cases
Static vs Streaming
• In static data computation, questions are asked of static data.
• In streaming data computation, data are continuously evaluated by
static questions.
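The inversion described above (fixed questions, moving data) can be sketched with a tiny engine of standing queries. The class `ContinuousQueryEngine`, the query names, and the record fields are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List


class ContinuousQueryEngine:
    """Hypothetical engine: questions are registered once, data flows past them."""

    def __init__(self) -> None:
        self._queries: Dict[str, Callable[[dict], bool]] = {}

    def register(self, name: str, predicate: Callable[[dict], bool]) -> None:
        self._queries[name] = predicate

    def process(self, record: dict) -> List[str]:
        """Evaluate every standing query against one incoming record."""
        return [name for name, predicate in self._queries.items() if predicate(record)]


engine = ContinuousQueryEngine()
engine.register("overheating", lambda r: r["temperature"] > 80)
engine.register("low_battery", lambda r: r["battery"] < 0.2)

stream = [
    {"temperature": 75, "battery": 0.9},
    {"temperature": 85, "battery": 0.5},
    {"temperature": 90, "battery": 0.1},
]
alerts = [engine.process(record) for record in stream]
print(alerts)
```

In a static setting the list `stream` would sit in storage and new questions would be asked of it; here the questions stay fixed and each record is checked once as it passes.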
Batch vs. Real-time Processing
Challenges of Streaming
• Streaming data management
• May have only one chance to examine the data
• Arbitrary and interactive exploration
• Real-time analytics
• Recency matters: alerts on recent changes
• Availability
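One common response to having only one chance to examine the data is to keep a small random sample in a single pass. The sketch below uses reservoir sampling (Algorithm R), which is brought in here as an illustration and is not named in these slides; the stream and the sample size are assumptions.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Keep a uniform random sample of k items while reading the stream once."""
    reservoir: List[T] = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a stored item with probability k / (index + 1).
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# A generator can be consumed only once, like a real stream.
one_pass_stream = (reading for reading in range(1000))
sample = reservoir_sample(one_pass_stream, k=5, rng=random.Random(42))
print(sample)
```

The memory footprint is fixed at `k` items regardless of how long the stream is, which is why this style of algorithm fits data that cannot be stored and revisited.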
Challenges of DSP
• Streaming architecture and pipeline
• Streaming data ingestion and handling (adaptors, data formats, schema,
cleaning, flow control, …)
• Stream processing algorithm design, testing, validation,
deployment, and life-cycle management
• Scalability in volume and velocity
• Elastic processing and management of load variations
• Fault tolerance and processing guarantees
• Self-adaptation at run time to pattern shifts
• Automatic feedback and learning
• Security and privacy