
IoT Data Analytics: Tools and Techniques

Uploaded by

resmimr
Copyright
© All Rights Reserved

MODULE -4

IOT
• Data and Analytics for IoT, An Introduction to Data Analytics for IoT,
Machine Learning, Big Data Analytics Tools and Technology, Edge
Streaming Analytics, Network Analytics, Securing IoT, A Brief History
of OT Security, Common Challenges in OT Security, Differences
between IT and OT Security Practices and Systems, Formal Risk
Analysis Structures: OCTAVE and FAIR.
An Introduction to Data Analytics for IoT
• In the world of IoT, the creation of massive amounts of data from
sensors is common and one of the biggest challenges, not only from a
transport perspective but also from a data management standpoint.

• Modern jet engines are fitted with thousands of sensors that
generate a whopping 10GB of data per second.

• Analyzing this amount of data in the most efficient manner possible
falls under the umbrella of data analytics.
• Not all data is the same; it can be categorized and thus analyzed in
different ways.

• Depending on how data is categorized, various data analytics
tools and processing methods can be applied.

• Two important categorizations from an IoT
perspective are whether the data is structured or unstructured
and whether it is in motion or at rest.
Structured Versus Unstructured Data
• Structured data and unstructured data are important classifications as they
typically require different toolsets from a data analytics perspective.

• Structured data means that the data follows a model or schema that
defines how the data is represented or organized, meaning it fits well with a
traditional relational database management system (RDBMS).

• In many cases you will find structured data in a simple tabular form, for
example, a spreadsheet where data occupies a specific cell and can be
explicitly defined and referenced.

• Structured data can be found in most computing systems
and includes everything from banking transactions and invoices
to computer log files and router configurations.

• IoT sensor data often uses structured values, such as
temperature, pressure, humidity, and so on, which are all sent
in a known format.

• Structured data is easily formatted, stored, queried, and
processed.

• Because of the highly organized format of structured data, a wide
array of data analytics tools is readily available for processing this type
of data, from custom scripts to commercial software like Microsoft
Excel and Tableau.
Unstructured Data
• Unstructured data lacks a logical schema for understanding and decoding
the data through traditional programming means.

• Examples of this data type include text, speech, images, and video.

• As a general rule, any data that does not fit neatly into a predefined data
model is classified as unstructured data.
• According to some estimates, around 80% of a business’s data is
unstructured.

• Because of this fact, data analytics methods that can be applied to
unstructured data, such as cognitive computing and machine learning, are
deservedly garnering a lot of attention.

• With machine learning applications, such as natural language processing
(NLP), you can decode speech.

• With image/facial recognition applications, you can extract
critical information from still images and video.
• Smart objects in IoT networks generate both structured and
unstructured data.

• Structured data is more easily managed and processed due to
its well-defined organization.

• On the other hand, unstructured data can be harder to deal
with and typically requires very different analytics tools for processing
the data.
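
To make the "fits well with an RDBMS" point concrete, here is a minimal sketch that loads structured sensor readings into an in-memory SQLite table and queries them with SQL. The table layout, the sensor names, and the helper `average_temperature` are assumptions made for illustration only.

```python
import sqlite3

# Hypothetical sensor readings in a known, fixed format:
# (sensor_id, temperature in degrees Celsius, relative humidity in %).
READINGS = [
    ("engine-1", 88.5, 40.0),
    ("engine-1", 91.0, 41.5),
    ("engine-2", 75.0, 38.0),
]


def average_temperature(readings, sensor_id):
    """Load readings into an in-memory relational table and query it with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE readings (sensor_id TEXT, temperature REAL, humidity REAL)"
        )
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", readings)
        row = conn.execute(
            "SELECT AVG(temperature) FROM readings WHERE sensor_id = ?",
            (sensor_id,),
        ).fetchone()
        return row[0]  # None when the sensor has no rows
    finally:
        conn.close()


print(average_temperature(READINGS, "engine-1"))
```

Because every reading has the same fields, a single schema and a one-line query are enough; text, speech, or video would not fit this table without first being transformed.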
Data in Motion Versus Data at Rest
• Data in IoT networks is either in transit (“data in motion”) or being
held or stored (“data at rest”).

• Examples of data in motion include traditional client/server
exchanges, such as web browsing, file transfers, and email.

• Data saved to a hard drive, storage array, or USB drive is data at rest.
• From an IoT perspective, the data from smart objects is considered data in
motion as it passes through the network en route to its final destination.
• This data is often processed at the edge, using fog computing.

• When data is processed at the edge, it may be filtered and deleted
or forwarded on for further processing and possible storage at a fog node or in
the data center.

• Data does not come to rest at the edge.

• When data arrives at the data center, it is possible to process it in
real time, just like at the edge, while it is still in motion.
• Data at rest in IoT networks can typically be found in IoT brokers or in
some sort of storage array at the data center.

• Hadoop helps not only with data processing but also with data
storage.
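
The "filtered and deleted or forwarded" behaviour of data in motion can be sketched with a generator that looks at each reading once and keeps nothing. The function name `filter_at_edge`, the thresholds, and the sample stream are hypothetical.

```python
def filter_at_edge(readings, low, high):
    """Yield only readings outside the [low, high] band; drop the rest.

    `readings` is any iterable of (sensor_id, value) pairs, so values are
    handled one at a time as they arrive and are never stored here.
    """
    for sensor_id, value in readings:
        if value < low or value > high:
            yield sensor_id, value  # forward for further processing or storage


# Hypothetical stream of temperature readings from two smart objects.
stream = iter([("s1", 71.0), ("s2", 104.5), ("s1", 69.8), ("s2", 12.0)])
forwarded = list(filter_at_edge(stream, low=20.0, high=100.0))
print(forwarded)
```

In-band readings are discarded at the edge, and only the two out-of-band readings would travel on to a fog node or the data center.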
IoT Data Analytics Overview

•  The true importance of IoT data from smart objects is realized only
when the analysis of the data leads to actionable business intelligence
and insights.

• Data analysis is typically broken down by the types of results that are
produced
Types of Data Analysis Results
Four types of data analysis results
Descriptive:
• Descriptive data analysis tells you what is happening, either now or in
the past.
• For example, a thermometer in a truck engine reports temperature
values every second.
• From a descriptive analysis perspective, you can pull this data at
any moment to gain insight into the current operating condition of the truck
engine.

• If the temperature value is too high, then there may be a cooling problem
or the engine may be experiencing too much load.
Diagnostic:
• When you are interested in the “why,” diagnostic data analysis
can provide the answer.
• Continuing with the example of the temperature sensor in the
truck engine, you might wonder why the truck engine failed.

• Diagnostic analysis might show that the temperature of
the engine was too high, and the engine overheated.

• Applying diagnostic analysis across the data generated by a wide
range of smart objects can provide a clear picture of why a
problem or an event occurred.
Predictive:
• Predictive analysis aims to foretell problems or issues before they occur.

• For example, with historical values of temperatures for the truck
engine, predictive analysis could provide an estimate of the remaining life of
certain components in the engine.

• These components could then be proactively replaced before failure
occurs.

• Or perhaps if temperature values of the truck engine start to rise
slowly over time, this could indicate the need for an oil change or some other
sort of engine cooling maintenance.
Prescriptive:
• Prescriptive analysis goes a step beyond predictive and recommends
solutions for upcoming problems.

• A prescriptive analysis of the temperature data from a truck engine
might calculate various alternatives to cost-effectively maintain the truck.

• These calculations could range from the cost necessary for more frequent
oil changes and cooling maintenance to installing new cooling equipment on
the engine or upgrading to a lease on a model with a more powerful engine.

• Prescriptive analysis looks at a variety of factors and makes the
appropriate recommendation.
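
As a rough illustration of the descriptive and predictive cases for the truck engine, the sketch below summarizes recent temperatures and extrapolates a straight-line trend toward a limit. The sample values, the 104 degree limit, and the use of a least-squares line are assumptions; a real predictive model would be considerably richer.

```python
from statistics import mean


def describe(temps):
    """Descriptive: what is happening now and on average."""
    return {"current": temps[-1], "mean": mean(temps), "max": max(temps)}


def trend_per_sample(temps):
    """Least-squares slope of temperature against sample index.

    Requires at least two samples.
    """
    n = len(temps)
    x_mean = (n - 1) / 2
    y_mean = mean(temps)
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(temps))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def samples_until(temps, limit):
    """Predictive: samples left before `limit` is reached if the trend holds."""
    slope = trend_per_sample(temps)
    if slope <= 0:
        return None  # not rising, so no crossing is predicted
    return max(0.0, (limit - temps[-1]) / slope)


# Hypothetical engine temperatures (degrees Celsius), one per sample.
temps = [90.0, 91.0, 92.0, 93.0, 94.0]
print(describe(temps))
print(samples_until(temps, limit=104.0))
```

A prescriptive layer would sit on top of this, comparing the cost of each maintenance option against the predicted time to failure.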
IoT Data Analytics Challenges

• Problems with using an RDBMS in IoT

• 1. Scaling problems (performance issues that are costly to resolve,
requiring more hardware and architecture changes)

• 2. Volatility of data (changes in schema), which increases the size of tables


• Machine learning is, in fact, part of a larger set of technologies
commonly grouped under the term artificial intelligence (AI).

• AI includes any technology that allows a computing system to
mimic human intelligence using any technique, from very
advanced logic to basic “if-then-else” decision loops.

• Any computer that uses rules to make decisions belongs to this group.
• ML is concerned with any process where the computer needs to
receive a set of data that is processed to help perform a task with more
efficiency.

• ML is a vast field but can be simply divided into two main categories:
supervised and unsupervised learning.
Supervised Learning
• In supervised learning, the machine is trained with input for
which there is a known correct answer.

• For example, suppose that you are training a system to recognize
when there is a human in a mine tunnel.

• A sensor equipped with a basic camera can capture shapes and return them
to a computing system that is responsible for determining whether the shape
is a human or something else (such as a vehicle, a pile of ore, a rock, a piece of
wood, and so on).
• With supervised learning techniques, hundreds or thousands of images are
fed into the machine, and each image is labelled (human or nonhuman in
this case).

• This is called the training set.

• An algorithm is used to determine common parameters and common
differences between the images.
• The comparison is usually done at the scale of the entire image, or pixel by
pixel.

• Images are resized to have the same characteristics (resolution, color
depth, position of the central figure, and so on), and each point is analyzed.
• Each new image is compared to the set of known “good images,” and a
deviation is calculated to determine how different the new
image is from the average human image and, therefore, the
probability that what is shown is a human figure. This process is
called classification.

• After training, the machine should be able to recognize human shapes.
Before real field deployments, the machine is usually tested with
unlabeled pictures (this is called the validation or the test set,
depending on the ML system used) to verify that the recognition
level is at acceptable thresholds. If the machine does not reach the
level of success expected, more training is needed.
• Suppose we have a dataset of different types of shapes, which includes
squares, rectangles, triangles, and polygons. The first step is
to train the model for each shape.

• If the given shape has four sides, and all the sides are equal, then it will be
labelled as a square.
• If the given shape has three sides, then it will be labelled as a triangle.
• If the given shape has six equal sides, then it will be labelled as a hexagon.
• After training, we test our model using the test set, and the task of the
model is to identify the shape.

• The machine is already trained on all types of shapes, and when it finds a
new shape, it classifies the shape on the basis of its number of sides.
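
The shape example can be sketched with a tiny labelled training set and a 1-nearest-neighbour rule. Nearest neighbour is chosen here only because it is short; the slides do not name an algorithm, and the feature encoding (number of sides, whether all sides are equal) is an assumption.

```python
# Labelled training set: (number_of_sides, all_sides_equal) -> label.
# The feature encoding is an assumption made for this sketch.
TRAINING_SET = [
    ((3, True), "triangle"),
    ((3, False), "triangle"),
    ((4, True), "square"),
    ((4, False), "rectangle"),
    ((6, True), "hexagon"),
]


def distance(a, b):
    """Squared distance between two feature tuples (booleans count as 0/1)."""
    return sum((int(x) - int(y)) ** 2 for x, y in zip(a, b))


def classify(features, training_set=TRAINING_SET):
    """Return the label of the closest labelled example (1-nearest neighbour)."""
    _, label = min(training_set, key=lambda item: distance(item[0], features))
    return label


print(classify((4, True)))
print(classify((6, True)))
```

The labels attached to the training examples are what make this supervised: the model only ever returns a class it has been shown.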
Steps Involved in Supervised Learning:
• First, determine the type of training dataset.
• Collect the labelled training data.
• Split the dataset into a training dataset, a test dataset, and a validation
dataset.
• Determine the input features of the training dataset, which should carry
enough information for the model to accurately predict the output.
• Determine a suitable algorithm for the model, such as a support vector
machine or a decision tree.
• Execute the algorithm on the training dataset. Sometimes we need
validation sets as the control parameters, which are subsets of the training
dataset.
• Evaluate the accuracy of the model by providing the test set. If the model
predicts the correct outputs, the model is accurate.
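
The splitting step above can be sketched as follows. The 70/15/15 ratio, the fixed seed, and the function name `split_dataset` are assumptions for illustration; the slides do not prescribe a ratio.

```python
import random


def split_dataset(examples, train=0.7, validation=0.15, seed=0):
    """Shuffle labelled examples and cut them into three disjoint subsets.

    The 70/15/15 ratio and the fixed seed are assumptions for this sketch;
    whatever is left after the training and validation cuts is the test set.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Hypothetical labelled data: (feature, label) pairs.
data = [(i, i % 2) for i in range(20)]
train_set, validation_set, test_set = split_dataset(data)
print(len(train_set), len(validation_set), len(test_set))
```

Shuffling before cutting keeps each subset representative, and keeping the test set untouched until the end is what makes the final accuracy figure meaningful.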
Advantages of Supervised Learning:

• With the help of supervised learning, the model can predict the output
on the basis of prior experience.
• In supervised learning, we can have an exact idea about the classes of
objects.
• Supervised learning models help us solve various real-world problems
such as fraud detection and spam filtering.
Disadvantages of Supervised Learning:

• Supervised learning models are not suitable for handling complex
tasks.
• Supervised learning cannot predict the correct output if the test data
is different from the training dataset.
• Training requires a lot of computation time.
• In supervised learning, we need enough knowledge about the classes
of objects.
• Misclassification can occur.
Unsupervised Learning

• Unsupervised learning is a type of machine
learning in which models are trained using an
unlabeled dataset and are allowed to act on
that data without any supervision.
• Unsupervised learning cannot be directly applied to a regression or
classification problem because, unlike supervised learning, we have
the input data but no corresponding output data. The goal of
unsupervised learning is to find the underlying structure of the dataset,
group that data according to similarities, and represent that dataset
in a compressed format.
Why use Unsupervised Learning?
• Unsupervised learning is similar to how a human learns to think from their
own experiences, which makes it closer to real AI.

• Unsupervised learning works on unlabeled and uncategorized data, which
makes it more important.

• In the real world, we do not always have input data with the corresponding
output, so to solve such cases, we need unsupervised learning.
• Here, we take unlabeled input data, which means it is not
categorized and the corresponding outputs are not given.

• This unlabeled input data is fed to the machine learning model in order
to train it. First, the model interprets the raw data to find hidden patterns
in the data, and then it applies a suitable algorithm, such as k-means
clustering or hierarchical clustering.

• Once the suitable algorithm is applied, it divides the data objects
into groups according to the similarities and differences between the objects.
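
Grouping unlabeled data by similarity can be sketched with a one-dimensional k-means loop. The starting centroids, the fixed iteration count, and the sample readings are assumptions; production code would normally use a library implementation with better initialization.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Cluster 1-D values around the given starting centroids.

    Each pass assigns every value to its nearest centroid, then moves each
    centroid to the mean of the values assigned to it.
    """
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        centroids = [
            sum(c) / len(c) if c else centroids[i]  # keep empty clusters in place
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled temperature readings from two kinds of equipment.
readings = [20.1, 19.8, 20.5, 80.2, 79.9, 81.0]
centers, groups = kmeans_1d(readings, centroids=[0.0, 100.0])
print([round(c, 1) for c in centers])
print(groups)
```

No labels are supplied anywhere; the two groups emerge only from how close the readings are to each other.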
• An artificial neural network consists of a
pool of simple processing units which
communicate by sending signals to each
other over a large number of weighted
connections.
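
A single processing unit of such a network can be sketched in a few lines: it multiplies each incoming signal by the weight of its connection, sums the results, and passes the total through an activation function. The sigmoid activation, the weights, and the name `unit_output` are assumptions chosen for illustration.

```python
import math


def unit_output(signals, weights, bias=0.0):
    """One processing unit: weighted sum of incoming signals, then a sigmoid."""
    total = sum(s * w for s, w in zip(signals, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))


# Hypothetical signals from two upstream units and their connection weights.
out = unit_output(signals=[1.0, 0.0], weights=[0.5, -0.3])
print(round(out, 4))
```

Training a network amounts to adjusting the weights so that the outputs of many such units, wired together, match the desired result.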
Big Data Analytics

• Big data analytics describes the process of uncovering
trends, patterns, and correlations in large amounts of
raw data to help make data-informed decisions. These
processes use familiar statistical analysis techniques,
like clustering and regression, and apply them to
more extensive datasets with the help of newer tools.
• 1 gigabyte (binary) = 1,073,741,824 bytes (2^30)

• 1 terabyte (binary) = 1,099,511,627,776 bytes (2^40)


• An MPP, or massively parallel processing, database is a database
optimized for parallel processing, so that many operations can
be performed by many processing units at a time.

• MPP is the coordinated processing of a program by multiple
processors working on different parts of the program. Each
processor has its own operating system (OS) and memory.
How MPP databases work
• MPP databases use multicore processors, multiple processors and
servers, and storage appliances equipped for parallel processing.

• That combination enables reading many pieces of data across
many processing units at the same time for enhanced speed. This
method is necessary because the frequencies of processors are
hitting the limits of the technologies used and are slow to increase.
• In splitting up processing among multiple nodes, one node acts as the
leader node.

• This node communicates with all other compute nodes and instructs
them.

• The compute nodes listen to the leader node and run queries. They
also divide large tasks into smaller, more manageable tasks (chunks)
and work on these tasks independently and simultaneously (i.e., in
parallel) to speed up processing and deliver query results faster.

• Adding more processors to the database along with a high-bandwidth
connection between the nodes further accelerates processing, which
can provide huge performance and processing benefits for a large
database.
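
The leader and compute node pattern can be imitated on one machine to show the flow of work. In this sketch, worker threads stand in for compute nodes, which is a simplification: real MPP nodes are separate processors with their own OS and memory. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, n_chunks):
    """Leader-node step: cut the work into roughly equal chunks."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def compute_node_sum(chunk):
    """Compute-node step: run the query (here a SUM) on one chunk only."""
    return sum(chunk)


def parallel_sum(rows, n_nodes=4):
    """Fan the chunks out to workers, then combine the partial results."""
    chunks = split_into_chunks(rows, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(compute_node_sum, chunks))
    return sum(partials)


# Hypothetical column of sensor values held by the database.
values = list(range(1, 101))
print(parallel_sum(values))
```

Each worker sees only its own chunk, and the final answer is assembled from the partial results, which is why adding nodes can shorten query time for large tables.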
Edge Streaming Analytics
What is Edge Computing?

• Edge computing is a form of computing that is done on site or
near a particular data source, minimizing the need for data to
be processed in a remote data center.
How does edge computing work?

• Compared to traditional forms of compute, edge computing offers
businesses and other organizations a faster, more efficient way to
process data using enterprise-grade applications. In the past, edge
points generated massive amounts of data that often went unused.
Now that IT architecture can be decentralized with mobile computing
and the Internet of Things (IoT), companies can gain near real-time
insights with less latency and lower cloud server bandwidth demands,
all while adding an additional layer of security for sensitive data.
What is Edge Streaming Analytics?

• Edge streaming analytics is ingesting a continuous data stream
as it’s being created on a device to quickly filter and analyze it
in real time. Organizations often use this kind of distributed
computation system to get immediate decisions on data that is
too substantial to transfer quickly to the cloud.
• Most organizations run a complex and interconnected system of
devices, all creating data that can build into a massive glut of
information if it’s not continuously processed. By running data
through an analytics algorithm as it’s created at the edge of a
corporate network, you can gain faster insights to find new ways to
improve efficiency, engage customers, and develop new business.
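
Running an analytics algorithm on data as it is created can be sketched with a rolling window over a stream. The window size, the limit, and the name `rolling_alerts` are assumptions; the point is that only a few recent values are ever held in memory.

```python
from collections import deque


def rolling_alerts(stream, window=3, limit=100.0):
    """Yield (value, rolling_mean) whenever the rolling mean exceeds `limit`.

    Only the last `window` values are kept in memory, so the stream can be
    unbounded. The window size and limit are assumptions for this sketch.
    """
    recent = deque(maxlen=window)
    for value in stream:
        recent.append(value)
        if len(recent) == window:
            rolling_mean = sum(recent) / window
            if rolling_mean > limit:
                yield value, rolling_mean


# Hypothetical vibration readings produced on a device.
readings = [90.0, 95.0, 99.0, 105.0, 110.0, 80.0]
alerts = list(rolling_alerts(readings))
print(alerts)
```

Only the alert leaves the device, which is how edge streaming analytics keeps decisions fast while sending far less data to the cloud.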
Why are edge streaming analytics needed?
• As the on-demand economy continues to urge companies to deliver more quickly than
ever, businesses need to deliver better services at the point of consumption and avoid any
lags caused by using remote data centers or clouds.

• And as the number of connected devices deployed by organizations increases, the volume
of data that needs to be processed is growing too, which can quickly overwhelm central
data management systems.

• Edge analytics help enterprises improve real-time business analytics and facilitate faster
What are the benefits of edge streaming analytics?
• Improved uptime
• Because the data is processed on-site rather than being transmitted to a far-
off central location, and because enterprise IT is able to look at hardware
performance data constantly, it can help organizations develop the foresight
to predict and head off failures and avoid unplanned downtime.

• Speed
• Sensors can automatically shut down a machine or take corrective action
when a repair is needed. Edge streaming analytics can also speed information
to the team to fix it, rather than sending the alert to a central processing
location first. And in scientific or engineering enterprises, the rapid-fire
generation of real information can accelerate innovation and human progress.
• Scalability
• Because the computational workload is handled at each device, the
overall burden is shared across the ecosystem so it can be processed
much more efficiently.
• Cost
• By distributing the data processing across edge computing infrastructure,
an organization can reduce data transmission and storage costs. In
addition, by learning about the health and performance of devices in
real-time, repairs and maintenance costs can be tailored to need rather
than a broader schedule, which leads to lower operational expenses.
• Security
• Because data is processed at the device, it doesn’t need to be
transmitted across the network, which exposes it to risk. Raw data never
leaves the device that created it.
• Safety
• When even the tiniest error or delay could spell catastrophe, such as in
autonomous driving, local oversight, turn by turn, is critical.
Edge analytics vs. edge computing
• Edge computing is based on the idea that data collection and data
processing can be performed near the location where the data is
either being created or consumed. Edge analytics uses these
same devices and the data that they have already produced. An
analytics model performs a deeper analysis of the data than what
was initially performed. These analytics capabilities enable the
creation of actionable insights, often directly on the device.
Network Analytics
• Another form of analytics that is extremely important in managing IoT
systems is network-based analytics.

• Network analytics is concerned with discovering patterns in the
communication flows from a network traffic perspective.

• Network analytics has the power to analyze details of
communications patterns made by protocols and correlate
this across the network.

• It allows you to understand what should be considered normal behavior in a
network and to quickly identify anomalies that suggest network problems due
to suboptimal paths, intrusive malware, or excessive congestion.
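
Learning what is normal and flagging deviations can be sketched with a simple statistical baseline. The three-standard-deviation threshold, the byte counts, and the name `find_anomalies` are assumptions; real network analytics tools use far richer flow features.

```python
from statistics import mean, stdev


def find_anomalies(baseline, observed, threshold=3.0):
    """Flag observed values more than `threshold` standard deviations
    away from the mean of the baseline (the learned "normal" behaviour)."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return [(i, v) for i, v in enumerate(observed) if v != mu]
    return [
        (i, v) for i, v in enumerate(observed)
        if abs(v - mu) / sigma > threshold
    ]


# Hypothetical bytes-per-minute counts for one sensor-to-gateway flow.
baseline = [1000, 1020, 980, 1010, 990, 1005, 995]
observed = [1002, 998, 5400, 1011]
print(find_anomalies(baseline, observed))
```

A sudden jump in traffic from a sensor that normally sends a steady trickle is the kind of anomaly that could point to malware or congestion and merits a closer look.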
Securing IoT
• Information technology (IT) environments have faced active attacks
and information security threats for many decades, and the incidents
and lessons learned are well-known and documented.

• Operational technology (OT) environments were traditionally kept
in silos and had only limited connection to other networks.

• Thus, the history of cyber attacks on OT systems is much shorter and
has far fewer incidents documented.
• Security in the OT world also addresses a wider scope than in the IT
world. For example, in OT, the word security is almost synonymous
with safety.
• A Brief History of OT Security
• Common Challenges in OT Security
• How IT and OT Security Practices and Systems Vary
• Formal Risk Analysis Structures: OCTAVE and FAIR
• The Phased Application of Security in an Operational Environment
A Brief History of OT Security
• Cybersecurity incidents in industrial environments can result
in physical consequences that can cause threats to human lives
as well as damage to equipment, infrastructure, and the
environment.

• While there are certainly traditional IT-related security
threats in industrial environments, it is the physical
manifestations and impacts of the OT security incidents that
capture media attention and elicit broad-based public concern.
• Historically, attackers were skilled individuals with deep knowledge of
technology and the systems they were attacking.

• However, as technology has advanced, tools have been created to
make attacks much easier to carry out.

• To further complicate matters, these tools have become more broadly
available and more easily obtainable.
• Compounding this problem, many of the legacy protocols used in
IoT environments are many decades old, and there was no
thought of security when they were first developed.

• This means that attackers with limited or no technical
capabilities now have the potential to launch cyber attacks,
greatly increasing the frequency of attacks and the overall threat
to end operators.
Common Challenges in OT Security
• This section discusses problems in operational technology (OT) security.
• Some common industrial protocols and their respective security concerns:
• Modbus
• DNP3
• ICCP
• IEC
Erosion of Network Architecture
• Two of the major challenges in securing industrial environments have
been initial design and ongoing maintenance.

• The initial design challenges arose from the concept that
networks were safe due to physical separation from the
enterprise with minimal or no connectivity to the outside
world, and the assumption that attackers lacked sufficient
knowledge to carry out security attacks.
• The ongoing challenge, and the biggest threat to network security, is
that standards and best practices are misunderstood or
the network is poorly maintained.

• It is more common that, over time, what may have been a
solid design to begin with is eroded through ad hoc updates
and individual changes to hardware and machinery without
consideration for the broader network impact.
Insecure Operational Protocols
• Many industrial control protocols were designed without
inherent strong security requirements.

• Furthermore, their operation was often within an assumed
secure network.

• In addition to any inherent weaknesses or vulnerabilities,
their operational environment may not have been designed
with secured access control in mind.
• Industrial protocols, such as supervisory control and data
acquisition (SCADA), particularly the older variants, suffer
from common security issues.

• Three examples of this are a lack of authentication between
communication endpoints, no means of securing and
protecting data at rest or in motion, and insufficient
granularity of control to properly specify recipients or avoid
default broadcast approaches.
• The structure and operation of most of these protocols is
often publicly available.

• While they may have been originated by a private firm, for
the sake of interoperability, they are typically published for
others to implement.

• Thus, it becomes a relatively simple matter to compromise
the protocols themselves and introduce malicious actors that
may use them to compromise control systems.
• Some common industrial protocols and their respective security concerns
• Modbus
• Modbus is a serial communication protocol developed by Modicon.
• Modbus is commonly found in many industries, such as
utilities and manufacturing environments, and has multiple
variants (for example, serial, TCP/IP).

• It was created by the first programmable logic controller
(PLC) vendor, Modicon, and has been in use since the 1970s.

• It is one of the most widely used protocols in industrial
deployments, and its development is governed by the
Modbus Organization.
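
To show why a lack of authentication matters for a protocol like this, the sketch below builds a Modbus TCP "read holding registers" request by hand. The frame layout (MBAP header followed by function code 0x03, start address, and register count) follows the published Modbus TCP framing; the helper name and the example values are hypothetical.

```python
import struct

READ_HOLDING_REGISTERS = 0x03  # standard Modbus function code


def build_read_request(transaction_id, unit_id, start_address, quantity):
    """Build a Modbus TCP "read holding registers" request.

    Layout: MBAP header (transaction id, protocol id = 0, length, unit id)
    followed by the PDU (function code, start address, register count).
    """
    pdu = struct.pack(">BHH", READ_HOLDING_REGISTERS, start_address, quantity)
    length = 1 + len(pdu)  # unit id byte plus the PDU
    header = struct.pack(">HHHB", transaction_id, 0, length, unit_id)
    return header + pdu


# Hypothetical values: read 2 registers starting at address 0 from unit 1.
frame = build_read_request(transaction_id=1, unit_id=1, start_address=0, quantity=2)
print(frame.hex())
```

Every byte in the request is either addressing or payload. No field carries credentials or a signature, so any host that can reach the device on the network can issue the same request.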
DNP3 (Distributed Network Protocol)

• DNP3 has placed great emphasis on the reliable delivery of
messages.

• In the case of DNP3, participants allow for unsolicited
responses, which could trigger an undesired response.

• The missing security element here is the ability to establish
trust in the system’s state and thus the ability to trust the
veracity of the information being presented.
• DNP3 stands for Distributed Network Protocol, version 3.

• The basic architecture of a DNP3 setup is Master/Outstation,
although that could include one-to-one
and one-to-many configurations.
• Main DNP3 Capabilities

• DNP3 can request and respond with multiple data types in
single messages.
• It can send responses without a request (unsolicited messages).
• It allows multiple masters and peer-to-peer operations.
• It supports time synchronization and a standard time
format.
• It includes only changed data in response messages.
ICCP (Inter-Control Center Communications Protocol)
• The Inter-Control Center Communications Protocol (ICCP) was developed to
enable data exchange over wide area networks between utility control
centers, Independent System Operators (ISOs), Regional Transmission
Operators (RTOs), and other generators.
• ICCP allows the exchange of real-time and historical data, including status,
measured values, scheduling data, and operator commands, among
others.
• With ever-increasing interconnectivity across international borders and
grid operation zones, inter-utility real-time data exchange has become
critical to the operation of interconnected systems.
• ICCP is a client/server-based protocol operating at the application layer
in the OSI model, supporting any interfaces, transport, and network
services that fit the OSI model.
• One control center (client) sends a request to another control center
(server) for the data to be exchanged, while both control centers may
act as both clients and servers.
International Electrotechnical Commission (IEC) Protocols

• Three message types were initially defined: MMS
(Manufacturing Message Specification), GOOSE (Generic
Object Oriented Substation Event), and SV (Sampled Values).

• Both GOOSE and SV operate via a publisher/subscriber
model, with no reliability mechanism to ensure that data has been received.

• Authentication is embedded in MMS, but it is based on clear-text
passwords, and authentication is not available in GOOSE or SV.
OT Network Characteristics Impacting Security
• While IT and OT networks are beginning to converge, they still maintain
many divergent characteristics in terms of how they operate and the traffic
they handle.

• These differences influence how they are treated in the context of a
security strategy.
• IT networks
• In IT networks, traffic flows tend to travel relatively far.
• They frequently traverse the network through layers of switches and
eventually make their way to a set of local or remote servers, which they
may connect to directly.
• OT networks
• By comparison, in an OT environment (Levels 0–3), there are typically
two types of operational traffic.

• The first is local traffic that may be contained within a specific
package or area to provide local monitoring and closed-loop control.

• This is the traffic that is used for real-time (or near-real-time)
processes and does not need to leave the process control levels.

• The second type of traffic is used for monitoring and control of areas
or zones or the overall system.
