Understanding Big Data Analytics Concepts
What is analytics?
• Analytics is the systematic discovery, interpretation,
and use of meaningful patterns in data.
• It also entails organizing and processing data as well as
extracting patterns in data towards effective problem
solving and decision making.
[Figure: hierarchy running from explicit to tacit with increasing depth of meaning: DATA (creating concepts), INFORMATION (creating context), KNOWLEDGE (creating patterns), WISDOM (creating principles), TRUTH.]
Data analytics vs. Big data
analytics
• Data analytics is the broad process of extracting meaningful insights from data, while big data analytics focuses specifically on analyzing very large, complex datasets.
• Big data analytics employs advanced techniques like machine learning, deep learning and data mining to process these datasets effectively.
• Both data analytics and big data analytics aim to provide
valuable insights for decision-making
Key Similarities and Differences
Feature | Data Analytics | Big Data Analytics
Data size | Can handle various data sizes | Primarily deals with very large datasets
Data | Can handle structured, semi-structured, and unstructured data | Often deals with diverse, complex data that is too large or complex for traditional methods
Tools | Can use standard software like SQL, Excel, or specialized analytical tools | May utilize platforms like Hadoop, Spark, and cloud-based solutions
Introducing Big Data analytics | Meaning of big data analytics, Data analytics vs. Big data analytics, Types of big data analytics, Classification of analytics, Challenges to big data analytics, How Big Data Analytics Works, application of big data analytics, future trends
Big Data Technologies | Hadoop system architecture, HDFS (Hadoop Distributed File System), MapReduce computational model, Apache Spark in-memory data analytics, NoSQL database management system
Large-Scale predictive modeling | Introduction to Supervised learning, machine learning vs. deep learning, probabilistic modeling, artificial neural networks, deep learning, model parameters and hyperparameters optimization, Regularization
Large-Scale descriptive modeling | Introduction to Unsupervised learning, evaluation techniques, K-means & K-medoids clustering, hierarchical clustering, density-based clustering
Evaluation
• Assignments & Presentation 20%
• Project 30%
• Final exam 40%
• Knowledge sharing 10%
(Class attendance & participation)
Presentation assignment
• Instruction: For the given topic, review at least five journal articles & prepare presentation slides covering the following:
• (i) Introduce what it means, i.e. overview and definition of
the concept;
• (ii) explain why we need it, pros & cons, significance;
• (iii) discuss how it works, architecture, & approaches
followed;
• (iv) concluding remarks (show strength & weakness of the
concept with the way forward);
• (v) references.
Presentation assignment
No Name Topic Date
Big data is a collection of data sets so large and complex in volume, velocity, and variety that traditional data management systems cannot store, process, and analyze them.
Characteristics of Big Data (5
Vs)
• Volume – The sheer amount of data
• Large amounts of data (terabytes, petabytes)
• Velocity – The speed at which data is generated and
processed.
• High-speed data generation (real-time streaming)
• Variety – The different types and formats of data
• Different data types (structured, unstructured, semi-
structured)
• Veracity – Data quality and reliability of the data
• Value – Extracting meaningful insights
• The potential insights and business benefits that can
be derived from the data.
Big Data
• Big data is a collection of data sets so large and complex that it
becomes difficult to process using on-hand database
management tools.
56 V’s of Big data
Two types of big data
• Big data is divided into data at rest and data in motion.
• Data at rest:
• This refers to data that has been collected from various sources
and is then analyzed after the event occurs.
• The point where the data is analyzed and the point where
action is taken on it occur at two separate times.
• Data in motion:
• The collection process for data in motion is similar to that of
data at rest; however, the difference lies in the analytics.
• In this case, the analytics occur in real-time as the event
happens.
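The distinction can be sketched in a few lines, assuming a hypothetical list of sensor readings: the same average is computed once over stored data (at rest) and incrementally as each reading arrives (in motion). The names `readings`, `batch_mean`, and `running_means` are illustrative only.

```python
from typing import Iterable, Iterator, List


def batch_mean(stored: List[float]) -> float:
    """Data at rest: analyze the full collection after the events occurred."""
    return sum(stored) / len(stored)


def running_means(stream: Iterable[float]) -> Iterator[float]:
    """Data in motion: update the result as each event arrives."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count  # an up-to-date answer is available immediately


readings = [20.0, 22.0, 21.0, 25.0]  # hypothetical sensor readings
at_rest = batch_mean(readings)
in_motion = list(running_means(iter(readings)))
```

Both paths reach the same final value; the streaming version simply makes an intermediate answer available after every event.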
Stages of Big data
• Data generation: concerns how data are generated, that is, the large, diverse, and complex datasets produced by different data sources.
• However, there are technical challenges in collecting, processing, and analyzing these datasets.
• Each component of this value chain presents various challenges that require deep research, mostly because of the heterogeneous and complex character of the data involved.
Big Data Challenges
Classification of big data
challenges
• Challenges of big data can be classified into:
data management and data analytics.
• Data management involves processes and
supporting technologies to acquire and store data
and to prepare and retrieve it for analysis.
• Data analytics refers to techniques used to discover
and acquire intelligence from big data.
• These challenges need to be handled efficiently and effectively using big data analytics.
Big Data Analytics
• NoSQL databases are a new way of thinking about data that is non-
relational, schema-less, and can be distributed and fault tolerant.
• Data came in all shapes and sizes (structured, semi-structured, and unstructured), and defining the schema in advance became nearly impossible. NoSQL databases allow developers to store huge amounts of unstructured data, giving them a lot of flexibility.
• NoSQL refers to non-relational databases that store data in a non-tabular format, rather than in the rule-based tables that relational databases use.
• NoSQL databases store data in a more natural and flexible way.
NoSQL, as opposed to SQL, is a database management approach,
whereas SQL is just a query language, similar to the query languages
of NoSQL databases.
• Four major types of NoSQL databases have emerged: document databases, key-value databases, wide-column stores, and graph databases.
NoSQL
• Due to the exponential growth of digitization, businesses now collect as much
unstructured data as possible. To be able to analyze and derive
actionable real-time insights from such big data, businesses need modern
solutions that go beyond simple storage.
• Businesses need a platform that can easily scale, transform, and visualize data; create
dashboards, reports, and charts; and work with AI & BI tools to accelerate their
business productivity.
• Due to their flexible and distributed nature, NoSQL databases (for example,
MongoDB) shine in these tasks.
Document-oriented databases
• A document-oriented database stores data in documents such that each
document contains pairs of fields and values. The values can typically be a
variety of types, including things like strings, numbers, booleans, arrays, or
even other objects. A document database offers a flexible data model, much
suited for semi-structured and typically unstructured data sets.
• Examples of document databases are MongoDB and Couchbase.
• A typical document will look like:
{
  "_id": "12345",
  "name": "foo bar",
  "email": "foo@[Link]",
  "address": {
    "street": "123 foo street",
    "city": "some city",
    "state": "some state",
    "zip": "123456"
  },
  "hobbies": ["music", "guitar", "reading"]
}
Key-value databases
• A key-value store is a simpler type of database where each item
contains keys and values. Each key is unique and associated with a
single value. They are used for caching and session management and
provide high performance in reads and writes because they tend to
store things in memory.
• Examples are Amazon DynamoDB and Redis. A simple view of data
stored in a key-value database is given below:
Key: user:12345
Value: {"name": "foo bar", "email": "foo@[Link]", "designation": "software developer"}
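The lookup pattern can be sketched with a toy in-memory store. This is an illustration only, not the client API of Redis or DynamoDB; the class `KeyValueStore` and its methods are hypothetical.

```python
import json
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store (illustrative, not a real database client)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: dict) -> None:
        # Values are opaque to the store; here they are serialized as JSON text.
        self._data[key] = json.dumps(value)

    def get(self, key: str) -> Optional[dict]:
        raw = self._data.get(key)
        return None if raw is None else json.loads(raw)


store = KeyValueStore()
store.put("user:12345", {"name": "foo bar", "designation": "software developer"})
user = store.get("user:12345")
missing = store.get("user:99999")  # unknown keys simply return None
```

Every read and write goes through a single unique key, which is why such stores suit caching and session management.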
Wide-column stores
• Wide-column stores store data in tables, rows, and dynamic columns.
The data is stored in tables. However, unlike traditional SQL
databases, wide-column stores are flexible, where different rows can
have different sets of columns. These databases can employ column
compression techniques to reduce the storage space and enhance
performance. The wide rows and columns enable efficient retrieval of
sparse and wide data.
• Some examples of wide-column stores are Apache Cassandra and
HBase. A typical example of how data is stored in a wide-column is as
follows:
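As a hedged illustration of the layout described above, a wide-column table can be pictured as a mapping from row key to that row's own set of columns. The table name, row keys, and values below are hypothetical, and real stores such as Cassandra or HBase add column families, timestamps, and on-disk formats that this sketch omits.

```python
from typing import Dict

# Hypothetical "users" table: row key -> {column name -> value}.
# Rows need not share the same columns, unlike a fixed relational schema.
users: Dict[str, Dict[str, str]] = {
    "user:1": {"name": "foo bar", "city": "some city"},
    "user:2": {"name": "baz qux", "email": "baz@example.com", "phone": "555-0100"},
}


def read_column(table: Dict[str, Dict[str, str]], column: str) -> Dict[str, str]:
    """Return the value of one column for every row that actually has it."""
    return {key: row[column] for key, row in table.items() if column in row}


cities = read_column(users, "city")
names = read_column(users, "name")
```

Rows that lack a column store nothing for it, which is what makes sparse, wide data cheap to hold.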
Graph databases
• A graph database stores data in the form of nodes and edges. Nodes typically
store information about people, places, and things (like nouns), while edges
store information about the relationships between the nodes.
• Examples of graph databases are Neo4j and Amazon Neptune. MongoDB, a document database, also offers graph traversal through its $graphLookup aggregation stage.
Below is an example of how data is stored:
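A minimal sketch of the node-and-edge model, using hypothetical people and places; the relationship names `FRIEND_OF` and `LIVES_IN` are assumptions made for illustration, and a real graph database would index and query this structure very differently.

```python
from typing import Dict, List, Tuple

# Hypothetical nodes (things) and edges (relationships between them).
nodes: Dict[str, Dict[str, str]] = {
    "alice": {"type": "person"},
    "bob": {"type": "person"},
    "addis": {"type": "place"},
}
edges: List[Tuple[str, str, str]] = [
    ("alice", "FRIEND_OF", "bob"),
    ("alice", "LIVES_IN", "addis"),
    ("bob", "LIVES_IN", "addis"),
]


def neighbors(source: str, relation: str) -> List[str]:
    """Follow edges of one relationship type from a source node."""
    return [dst for src, rel, dst in edges if src == source and rel == relation]


friends_of_alice = neighbors("alice", "FRIEND_OF")
residents = [src for src, rel, dst in edges if rel == "LIVES_IN" and dst == "addis"]
```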
RDBMS vs. NoSQL databases
• There are a variety of differences between relational database management systems and non-relational databases.
• Data modeling
• NoSQL: Data models vary based on the type of NoSQL database used — for example, key-value, document, graph, and wide-column — making the model
suitable for semi-structured and unstructured data.
• RDBMS: RDBMS uses a tabular data structure, with data represented as a set of rows and columns, making the model suitable for structured data.
• Schema
• NoSQL: It provides a flexible schema where each set of documents/row-column/key-value pairs can contain different types of data. It’s easier to change
schema, if required, due to the flexibility.
• RDBMS: This is a fixed schema where every row should contain the same predefined column types. It is difficult to change the schema once data is stored.
• Query language
• NoSQL: It varies based on the type of NoSQL database used. For example, MongoDB has MQL, and Neo4J uses Cypher.
• RDBMS: This uses structured query language (SQL).
• Scalability
• NoSQL: NoSQL is designed for vertical and horizontal scaling.
• RDBMS: RDBMS is designed for vertical scaling. However, it can extend limited capabilities for horizontal scaling.
• Data relationships
• NoSQL: Relationships can be nested, explicit, or implicit.
• RDBMS: Relationships are defined through foreign keys and accessed using joins.
• Transaction type
• NoSQL: Transactions are either ACID- or BASE-compliant.
• RDBMS: Transactions are ACID-compliant.
• Performance
• NoSQL: NoSQL is suitable for real-time processing, big data analytics, and distributed environments.
• RDBMS: RDBMS is suitable for read-heavy and transaction workloads.
• Data consistency
• NoSQL: This offers eventual consistency, in most cases.
• RDBMS: This offers high data consistency.
• Distributed computing
• NoSQL: One of the main reasons to introduce NoSQL was for distributed computing, and NoSQL databases support distributed data storage, vertical and
horizontal scaling through sharding, replication, and clustering.
• RDBMS: RDBMS supports distributed computing through clustering and replication. However, it’s less scalable and flexible as it’s not traditionally designed
to support distributed architecture.
• Fault tolerance
• NoSQL: NoSQL has built-in fault tolerance and high availability due to data replication.
• RDBMS: RDBMS uses replication, backup, and recovery mechanisms. However, as fault tolerance is not built in to the same degree, additional measures like disaster recovery mechanisms may need to be implemented during application development.
• Data partitioning
• NoSQL: It’s done through sharding and replication.
• RDBMS: It supports table-based partitioning and partition pruning.
• Data to object mapping
• NoSQL: NoSQL stores the data in a variety of ways — for example, as JSON documents, wide-column stores, or key-value pairs. It provides abstraction
through the ODM (object-data mapping) frameworks to work with NoSQL data in an object-oriented manner.
• RDBMS: RDBMS relies more on data-to-object mapping so that there is seamless integration between the database columns and the object-oriented
application code.
Relational database vs NoSQL
database
• Consider an example of storing information about a user and their hobbies. We need to store a user's first name, last name, cell phone number, city, and hobbies.
• In an RDBMS, two tables are created: a Users table and a Hobbies table.
• In order to retrieve all of the information about a user and their hobbies, information
from the Users table and Hobbies table will need to be joined together.
• The data model for a NoSQL database will depend on the type of NoSQL database
selected. Let's store the same data about a user and their hobbies in a document
database like MongoDB.
• In order to retrieve all of the information about a user and their hobbies, a single
document can be retrieved from the database. No joins are required, resulting in faster
queries.
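The two models can be compared in a short sketch. It assumes hypothetical column names and sample values, uses Python's built-in `sqlite3` module for the relational side, and stands in a plain dictionary for a MongoDB document rather than calling a real MongoDB driver.

```python
import sqlite3

# Relational model: two tables linked by a foreign key, read back with a join.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, "
             "last_name TEXT, cell TEXT, city TEXT)")
conn.execute("CREATE TABLE hobbies (user_id INTEGER REFERENCES users(id), hobby TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Foo', 'Bar', '555-0100', 'Some City')")
conn.executemany("INSERT INTO hobbies VALUES (?, ?)", [(1, "music"), (1, "reading")])
rows = conn.execute(
    "SELECT u.first_name, h.hobby FROM users u "
    "JOIN hobbies h ON h.user_id = u.id ORDER BY h.hobby"
).fetchall()
conn.close()

# Document model: the same facts nested in one document, read without a join.
user_doc = {
    "_id": 1, "first_name": "Foo", "last_name": "Bar",
    "cell": "555-0100", "city": "Some City",
    "hobbies": ["music", "reading"],
}
doc_hobbies = user_doc["hobbies"]
```

The join returns one row per hobby, while the document returns the whole user in a single read.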
Modeling
Bayesian Learning
CONDITIONAL PROBABILITY
• The issue is, How likely is it that an event will happen?
• Sample Space S
• Events A and C are subsets of S
• Prior knowledge and observed data can be combined
$$\arg\max_{C \in \{\text{yes},\,\text{no}\}} P(C)\, P(\text{Outlook}=\text{sunny} \mid C)\, P(\text{Temp}=\text{cool} \mid C)\, P(\text{Hum}=\text{high} \mid C)\, P(\text{Wind}=\text{strong} \mid C)$$
• Compare P(yes/Ai) and P(no/Ai), and select the one with max prob
• P(yes)*P(sunny/yes)*P(cool/yes)*P(high/yes)*P(strong/yes)= 0.0053
• P(no)*P(sunny/no)*P(cool/no)*P(high/no)*P(strong/no)= 0.0206
Answer: Play tennis = no
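The two products above can be checked with a short sketch. The priors and conditional probabilities are not listed in the text, so they are assumed here to come from the standard 14-example PlayTennis table (9 "yes", 5 "no"); treat them as an assumption rather than the slide's own data.

```python
from fractions import Fraction as F
from math import prod

# Assumed counts from the standard 14-example PlayTennis table
# (9 "yes", 5 "no"); the slide's own table is not reproduced in the text.
prior = {"yes": F(9, 14), "no": F(5, 14)}
likelihood = {
    "yes": {"sunny": F(2, 9), "cool": F(3, 9), "high": F(3, 9), "strong": F(3, 9)},
    "no": {"sunny": F(3, 5), "cool": F(1, 5), "high": F(4, 5), "strong": F(3, 5)},
}
evidence = ["sunny", "cool", "high", "strong"]


def score(label: str) -> F:
    """P(C) times the product of P(attribute value | C)."""
    return prior[label] * prod(likelihood[label][e] for e in evidence)


scores = {label: score(label) for label in prior}
prediction = max(scores, key=scores.get)  # class with the larger product
```

Under these assumed counts the scores round to 0.0053 for "yes" and 0.0206 for "no", so the predicted class is "no".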
Naive Bayesian Classifier
• Advantages
• Easy to implement
• Good results obtained in most of the cases
• Robust to isolated noise points
• Handle missing values by ignoring the instance during probability
estimate calculations
• Robust to irrelevant attributes
• Disadvantages
• Class conditional independence assumption may not hold for some
attributes, therefore loss of accuracy
• Practically dependencies exist among variables
• E.g. hospitals: patients: profile: age, family history, etc. symptoms:
fever, cough etc. Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve Bayesian
classifier
• How to deal with these dependencies? Bayesian Belief Networks
Assignment
• Show with example how Bayesian Belief Networks
(BBNs) work
• Your report should,
• (i) introduce BBNs,
• (ii) show algorithm,
• (iii) work out using example scenario,
• (iv) conclusion,
• (v) reference
Neural Network
The Power of Brain vs. Machine
• While the human brain is superior in creativity,
emotional intelligence, and complex problem solving,
computers are superior in processing speed,
logical reasoning, and accuracy in computation
• The Brain
– Creativity
– Association
– Complexity
– Noise Tolerance
• The Machine
– Calculation
– Precision
– Logic
Features of the Brain
• Ten billion (10^10) neurons
• Neuron switching time >10^-3 secs
• Face Recognition ~0.1secs
• On average, each neuron has several thousand
connections
• Hundreds of operations per second
• High degree of parallel computation
• Compensated for problems by massive
parallelism
• Distributed representations
• Die off frequently (never replaced)
Neural Network classifier
• It is represented as a layered set of interconnected processors. These processor nodes are analogous to the neurons of the brain.
• Each node has a weighted connection to several other nodes in adjacent layers.
• Individual nodes take the input received from connected nodes and use the weights together to compute output values.
[Figure: network diagram showing the input layer, hidden layers, and output layer.]
Architecture of Neural network
• Neural networks are used to look for patterns in data, learn
these patterns, & then classify new patterns & make forecasts
• A network with the input and output layer only is called
single-layered neural network. Whereas, a multilayer neural
network is a generalized one with one or more hidden layer.
• A network containing two hidden layers is called a three-layer neural
network, and so on.
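[Figure: a single-layered NN with inputs x1, x2, x3 weighted by w1, w2, w3, beside a multilayer NN with input, hidden, and output nodes.]
$$o = \sigma\left(\sum_{i=1}^{n} w_i x_i\right), \qquad \sigma(y) = \frac{1}{1 + e^{-y}}$$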
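The output of a single-layer unit with a sigmoid activation can be sketched as follows. The weights and inputs are hypothetical values chosen only so the weighted sum is easy to follow.

```python
import math
from typing import Sequence


def sigmoid(y: float) -> float:
    """Logistic activation: 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))


def neuron_output(weights: Sequence[float], inputs: Sequence[float]) -> float:
    """o = sigmoid(sum_i w_i * x_i) for a single-layer unit."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(weighted_sum)


# Hypothetical weights and inputs chosen only for illustration.
# Weighted sum: 0.5*1.0 + (-0.25)*2.0 + 0.1*0.0 = 0.0
o = neuron_output([0.5, -0.25, 0.1], [1.0, 2.0, 0.0])
```

A weighted sum of zero maps to an output of 0.5, the midpoint of the sigmoid.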
A Multilayer Neural Network
• Input Layer: corresponds to the input (non-class) attributes, with normalized attribute values.
• There are as many nodes as input attributes, X = {x1, x2, …, xm}, where m is the number of attributes.
• Hidden Layer
– Neither its input nor its output can be observed from outside.
– The number of nodes in the hidden layer & the number of hidden layers depend on the implementation.
– Hidden layers are what make NNs "deep" & enable them to learn complex data representations.
– Hidden layers enable the network to extract the relevant information from the input data that is necessary for making predictions or decisions.
• Output Layer – corresponds to the class attribute. There are as
many nodes as classes (values of the class attribute).
– O_k, where k = 1, 2, …, n and n is the number of classes
Steps followed in NN
• The neuron is the basic information processing unit of a NN. It
consists of:
1. A set of links, describing the neuron inputs, with connection weights W1, W2, …, Wm
2. An adder function (linear combiner) for computing the weighted sum of the inputs:
$$y = \sum_{j=1}^{m} w_j x_j$$
Training the neural network
• The purpose is to learn to generalize using a set of sample
patterns where the desired output is known.
• Back Propagation (short for, backward propagation of
errors) is the most commonly used method for training
multilayer feed forward NN.
• Back propagation learns by iteratively processing a set of training
data (samples).
• For each sample, weights are modified to minimize the error
between the desired output and the actual output.
• After propagating an input through the network, the error
is calculated and the error is propagated back through the
network while the weights are adjusted in order to make
the error smaller.
Training Algorithm
• The learning algorithm is as follows
• Initialize the weights and threshold to small random
numbers.
• Present a vector x to the neuron inputs and calculate the output using the adder function:
$$y = \sum_{j=1}^{m} w_j x_j$$
• Apply the activation function (in this case the step function) such that
$$y = \begin{cases} 0 & \text{if } y \le 0 \\ 1 & \text{if } y > 0 \end{cases}$$
• Update the weights according to the error:
$$W_j = W_j + \eta \, (y_T - y) \, x_j$$
ANN Training Example
Given the following two inputs x1 and x2, find the equation that helps to draw the decision boundary.
Bias | 1st input (x1) | 2nd input (x2) | Target output
-1 | 0 | 0 | 0
-1 | 1 | 0 | 0
-1 | 0 | 1 | 1
-1 | 1 | 1 | 1
• Let us say we have the following initializations:
W1(0) = 0.92, W2(0) = 0.62, W0(0) = 0.22, η = 0.1
• Training – epoch 1 (X marks an output that differs from the target):
y1 = 0.92*0 + 0.62*0 – 0.22 = -0.22 y = 0
y2 = 0.92*1 + 0.62*0 – 0.22 = 0.70 y = 1 X
• Training – epoch 3:
y1 = 0.72*0 + 0.62*0 – 0.42 = -0.42 y = 0
y2 = 0.72*1 + 0.62*0 – 0.42 = 0.30 y = 1 X
• Finally:
y1 = 0.52*0 + 0.72*0 – 0.52 = -0.52 y = 0
y2 = 0.52*1 + 0.72*0 – 0.52 = 0.0 y = 0
y3 = 0.52*0 + 0.72*1 – 0.52 = 0.2 y = 1
y4 = 0.52*1 + 0.72*1 – 0.52 = 0.72 y = 1
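The training run can be replayed with a short sketch. It assumes weights are updated immediately after each misclassified sample, that the step function outputs 1 only when the weighted sum is strictly positive, and that training stops after the first epoch with no errors; `Decimal` is used so that a sum of exactly zero is not disturbed by floating-point rounding.

```python
from decimal import Decimal as D

# Training rows: (bias input, x1, x2, target), as in the example table.
samples = [(-1, 0, 0, 0), (-1, 1, 0, 0), (-1, 0, 1, 1), (-1, 1, 1, 1)]
# Weights ordered as (W0 for the bias input, W1, W2); Decimal keeps the
# arithmetic exact so a sum of exactly 0 is not disturbed by rounding.
weights = [D("0.22"), D("0.92"), D("0.62")]
eta = D("0.1")


def step(y: D) -> int:
    """Step activation: 1 if y > 0, otherwise 0."""
    return 1 if y > 0 else 0


epochs = 0
while True:
    epochs += 1
    errors = 0
    for *inputs, target in samples:
        y = step(sum(w * x for w, x in zip(weights, inputs)))
        if y != target:
            errors += 1
            # W_j <- W_j + eta * (target - y) * x_j
            weights = [w + eta * (target - y) * x for w, x in zip(weights, inputs)]
    if errors == 0:  # a full pass with no mistakes: training has converged
        break

w0, w1, w2 = weights
```

Under these assumptions the loop ends with W1 = 0.52, W2 = 0.72, and W0 = 0.52, which gives the decision boundary 0.52·x1 + 0.72·x2 − 0.52 = 0.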
[Figure: two plots of the training points in the (x1, x2) plane, with + and o marking the two classes.]
Pros and Cons of Neural Network
• Useful for learning complex data like handwriting, speech and image recognition
Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle large number of features
Cons
– Slow training time
– Hard to interpret & understand the learned function (weights)
– Hard to implement: trial & error for choosing number of nodes
• Neural networks need a long time for training.
• Neural networks have a high tolerance to noisy and incomplete data.
Machine Learning vs. Deep Learning
• AI is a broad field; machine learning is a subset (and an application)
of AI & Deep learning is a subset of machine learning
• Machine learning is more explicitly used as a means to extract knowledge from data through simpler methods such as decision trees, linear regression, neural networks
• Deep learning uses the more advanced methods found in artificial neural networks.
Deep Learning vs. Machine Learning
Aspect | ML | DL
Problem | Helps to solve less-complex tasks | Helps to solve the most complex tasks
Data volume | Small datasets: ML achieves meaningful results with thousands of data points | Big data: effectiveness of DL models depends on millions of data points (terabytes and petabytes)
Computing power | Good computing power (runs on CPU) | Requires more computational power, like GPU, TPU, DPU, QPU
Self-supervised learning
• Self-supervised learning (SSL) is a machine learning
technique where a model learns representations or
features directly from the input data without explicit
supervision or labelled targets.
• Supervised learning trains models on labelled data (input-output pairs), and unsupervised learning deals with unlabeled data.
• Unlike both, SSL utilizes the inherent structure or characteristics within the data to generate supervisory signals.
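One common way to generate such supervisory signals is a masking pretext task, sketched below. The `[MASK]` token and the function `masked_pairs` are assumptions made for this illustration; no model is trained here, the sketch only shows how labels can be derived from unlabeled text.

```python
from typing import List, Tuple

MASK = "[MASK]"  # placeholder token; the name is an assumption for this sketch


def masked_pairs(sentence: str) -> List[Tuple[str, str]]:
    """Turn one unlabeled sentence into (masked input, target word) pairs."""
    words = sentence.split()
    pairs = []
    for i, word in enumerate(words):
        masked = words[:i] + [MASK] + words[i + 1:]
        pairs.append((" ".join(masked), word))  # the hidden word is the label
    return pairs


pairs = masked_pairs("big data needs analytics")
```

Each pair is an input with one word hidden and that word as the target, so the data supplies its own labels.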
Benefits of Self-Supervised Learning
• Self-supervised learning (SSL) introduces a paradigm shift in ML, offering a range of
advantages that redefine how models learn from data without explicit supervision.
• 1. Addressing Data Scarcity and Labeled Data Challenges
• Mitigating the Need for Extensive Labeled Data: SSL reduces dependence on
large, annotated datasets, making it feasible to train models even when labelled
data is scarce or costly.
• Leveraging Unlabeled Data: SSL efficiently utilizes vast pools of unlabeled data,
tapping into their latent information to generate valuable supervisory signals for
training.
• 2. Improving Model Generalization and Performance
• Learning Richer Representations: SSL facilitates the extraction of high-quality,
nuanced representations directly from raw data, enhancing a model’s ability to
generalize across diverse tasks and datasets.
• Enhanced Transfer Learning: Models trained using SSL often exhibit superior
transfer learning capabilities, as the learned representations are more adaptable
and applicable to new, unseen domains or tasks.
• 3. Reducing Human Intervention and Labor-Intensive Labeling Processes
• Cost and Time Efficiency: By minimizing the need for manual labelling efforts, SSL
streamlines the training process, reducing time and monetary investments
associated with data annotation.
• Automation and Scalability: SSL’s reliance on self-generated tasks enables
automated learning processes, facilitating scalability across domains without
Deep learning
• Deep learning is a method that teaches computers to process data in
a way that is inspired by the human brain.
• A neural network is the underlying technology in deep learning. It
consists of interconnected nodes or neurons in a layered structure.
• Deep learning models can recognize complex patterns in pictures,
text, sounds, and other data to produce accurate insights and
predictions.
Deep Neural Networks (DNN)
• A deep neural network has multiple hidden layers between the input & output layers.
• In its simplest form, a deep neural network is a feed-forward network with many hidden layers.
Learning rate | Controls the step size during gradient descent. The learning rate is fine-tuned to optimize the convergence rate and reduce the likelihood of divergence. For transformer-based models, smaller learning rates were prioritized to avoid overfitting due to the high complexity of pretrained embeddings. | 0.0001 to 0.001 (grid search)
Batch size | The number of training samples used in one iteration, allowing the model to process multiple input samples simultaneously, thereby speeding up the training process while maintaining a good level of generalization. Batch size must balance training speed & memory needs: a bigger batch size is used where resources permit, for faster convergence, while a smaller batch size is explored in complicated models to stay within computing constraints. | 16, 32, 64, 128 (grid search)
Optimizer | Fine-tunes a neural network's parameters during training. The Adam optimizer is used in most cases. | Adam, SGA
Dropout | Temporarily removes nodes (input or hidden) in a NN, along with their connections, creating a new architecture from the original network. It reduces overfitting while ensuring generalization capability. The dropout rate is carefully adjusted to achieve robust sequence labelling. | 0.3 to 0.5
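The grid search mentioned for the learning rate and batch size can be sketched as an enumeration of all value combinations. The specific grid points inside each stated range are assumptions, and the training and validation step for each configuration is left out.

```python
from itertools import product
from typing import Dict, List

# Candidate values follow the ranges listed above; the exact grid points
# within each range are assumptions made for this sketch.
grid: Dict[str, List] = {
    "learning_rate": [0.0001, 0.001],
    "batch_size": [16, 32, 64, 128],
    "dropout": [0.3, 0.5],
}


def configurations(space: Dict[str, List]) -> List[Dict]:
    """Enumerate every combination of hyperparameter values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*space.values())]


configs = configurations(grid)
# A real search would train and validate one model per entry in `configs`.
```

The number of configurations is the product of the list lengths, which is why grid search becomes expensive as more hyperparameters are added.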
• There are various methods to decide the values inside the kernel. The choice depends on the effect you want to achieve, such as detecting edges, blurring, or sharpening.
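How the kernel values shape the result can be sketched with a plain sliding-window product sum (cross-correlation, as commonly used in CNN layers). The image and both kernels are hypothetical: one is a Prewitt-style vertical-edge kernel and the other an unnormalized box-blur kernel.

```python
from typing import List

Matrix = List[List[int]]


def convolve2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image (stride 1, no padding) and sum products."""
    k = len(kernel)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 4x5 image: dark on the left, bright on the right.
image = [[0, 0, 0, 9, 9] for _ in range(4)]
# Vertical-edge kernel (Prewitt-style) and a box-blur kernel (sum, unnormalized).
edge_kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
blur_kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

edge_map = convolve2d(image, edge_kernel)
blurred = convolve2d(image, blur_kernel)
```

The edge kernel responds with zero over the flat dark region and a large value where brightness changes, while the blur kernel produces a gradual ramp across the same boundary.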
Effects of Kernel
How Pooling Layers Work
• Imagine you have a large image and want to make it smaller but keep all the
important features like edges and colors.
• The pooling layer operates independently on every depth slice of the input. It resizes it spatially, using the max or average of the values in a window that slides over the input data.
• In this example, given a 2x2 kernel, the pooling operation reduces the feature map from (6 × 6) to (3 × 3).
Convolved feature (6 x 6):
9 8 6 6 8 9
8 2 1 1 2 8
6 1 0 0 1 6
6 1 0 0 1 6
8 2 1 1 2 8
9 8 6 6 8 9
Max values (output):
9 6 9
6 0 6
9 6 9
Average values (output, rounded):
7 4 7
4 0 4
7 4 7
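The pooling step can be reproduced with a small sketch, assuming non-overlapping 2x2 windows (stride equal to the window size). The function name `pool2d` is illustrative.

```python
from typing import Callable, List, Sequence


def pool2d(fmap: List[List[float]], size: int,
           reduce: Callable[[Sequence[float]], float]) -> List[List[float]]:
    """Apply `reduce` to each non-overlapping size x size window (stride = size)."""
    out = []
    for r in range(0, len(fmap) - size + 1, size):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, size):
            window = [fmap[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out


feature_map = [
    [9, 8, 6, 6, 8, 9],
    [8, 2, 1, 1, 2, 8],
    [6, 1, 0, 0, 1, 6],
    [6, 1, 0, 0, 1, 6],
    [8, 2, 1, 1, 2, 8],
    [9, 8, 6, 6, 8, 9],
]
max_pooled = pool2d(feature_map, 2, max)
avg_pooled = pool2d(feature_map, 2, lambda w: sum(w) / len(w))
```

With this stride the 6 × 6 map shrinks to 3 × 3, and the exact window averages are 6.75, 3.5, and 0 before any rounding.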
Architecture of a CNN: Fully
connected layer
• The fully connected layer is
responsible for classifying images
based on the features extracted in
the previous layers.
• Without dense layers, CNNs would not be able to perform tasks such as image classification, smile detection, human activity recognition or making predictions based on visual inputs.
• Dense layers allow each neuron to interact with all neurons in the previous layer. In
contrast, sparse layers only allow each neuron to interact with a subset of the neurons
in the previous layer
• Not all layers in a CNN are fully connected. Because fully connected layers
have many parameters, applying this approach throughout the entire network
would create unnecessary density, increase the risk of overfitting and make
the network very expensive to train in terms of memory and computation.
• Limiting the number of fully connected layers balances computational efficiency and
generalization ability with the capability to learn complex patterns.
Architecture of a CNN: Fully
connected layer
• While convolutional layers are good at detecting features in input data,
– dense layers are essential for integrating these features into a final classification decision (the prediction).
• Fully connected layers (dense layers) are designed to operate on 1-dimensional data; hence,
• flattening is a necessary step to transition from the multidimensional tensors produced by convolutional layers to the format required for dense layers.
Flattening layers
• After convolutional and pooling layers have extracted relevant
features from the input image we have to turn this high-dimensional
feature map into a format suitable for feeding into fully connected layers.
• Here is where flattening layers come into action
• Flattening layer takes the entire feature map and reorganizes it into a single, long
vector.
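Flattening can be sketched in a few lines, assuming a hypothetical pooled output of two channels of size 2 × 2 and a channel-then-row ordering; frameworks may order the elements differently.

```python
from typing import List


def flatten(feature_maps: List[List[List[float]]]) -> List[float]:
    """Reorder a (channels x height x width) stack into one long vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


# Hypothetical pooled output: 2 channels, each 2 x 2.
pooled = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
vector = flatten(pooled)
```

The vector length equals channels × height × width, which fixes the number of inputs the first dense layer must accept.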
Examples of CNN Models
• Example applications of CNN include
• image classification (e.g., AlexNet, VGG, ResNet, MobileNet)
• object detection (e.g., Fast R-CNN, Mask R-CNN, YOLO, SSD).
• AlexNet. The first CNN to win the ImageNet Challenge (in 2012) for image classification, AlexNet consists of five convolution layers and three fully connected layers. AlexNet requires 61 million weights and 724 million MACs (multiply-accumulate operations) to classify an image with a size of 227×227.
• VGG-16. To achieve higher accuracy, VGG-16 is trained to a deeper structure
of 16 layers consisting of 13 convolution layers and three fully connected
layers. This requires 138 million weights and 15.5G MACs to classify the image
with a size of 224×224.
• GoogLeNet. To improve accuracy while reducing the computation of DNN inference, GoogLeNet introduces an inception module composed of different-sized filters. As a result, GoogLeNet achieves better accuracy than VGG-16 while only requiring seven million weights and 1.43G MACs to process an image of the same size.
• ResNet. As a state-of-the-art effort, ResNet uses the “shortcut” structure to reach human-level accuracy with a top-5 error rate below 5%. In addition, the “shortcut” module can mitigate the vanishing gradient problem during the training of the model, making it possible to train a DNN model with a deeper
CNN Application in Healthcare
• Transfer learning is a popular approach in deep learning, as it enables the training of deep neural networks with less data.
– An already developed ML model is reused in another task.
Advantages of Transfer Learning
• Training a model takes a large amount of computer resources, data and time. Using a pretrained
model as a starting point helps cut down on all three, as developers don't have to start from
scratch, training a large model on what would be an even bigger data set.
• Reduces data needs. By using pretrained models that were already trained on their own large data sets,
transfer learning enables developers to create new models even when they don't have access to massive
amounts of labeled data.
• Speeds up the training process. Transfer learning speeds up the training process of a new model, as it
starts with pre-learned features, leading to less time required to learn a new task.
• Reduces computational cost. Transfer learning reduces the costs of building models by enabling them to
reuse previously trained parameters. This process is more efficient than training a model from scratch.
• ML algorithms are typically designed to address isolated tasks. Through transfer learning, methods
are developed to transfer knowledge from one or more of these source tasks to improve learning
in a related target task.
• Developers can choose to reuse in-house ML models, or they can download them from other developers
who have published them on online repositories or hubs. Knowledge from an already trained ML model
must be similar to the new task to be transferable. For example, the knowledge gained from recognizing
an image of a dog in a supervised ML system could be transferred to a new system to recognize images of
cats. The new system filters out images it already recognizes as a dog.
• Provides performance improvements. In cases where the target task is closely related to the
source task, performance can improve due to the knowledge it gains from training on the first task.
• Prevents overfitting. Overfitting occurs when a model fits too closely to its training data, making
the model unable to make accurate generalizations. By starting with a well-trained model, transfer
learning helps prevent overfitting, especially when target data sets are small.
• Provides versatility. Retrained models consist of knowledge gained from one or more previous
data sets. This can potentially lead to better performance on different tasks. Transfer learning can
also be applied to different ML tasks, such as image recognition and natural language processing
(NLP).
Types of transfer learning
• Transfer learning can be accomplished in several ways.
• One way is to find a related learned task -- labeled as Task B -- that has
plenty of transferable labeled data. The new model is then trained on Task
B. After this training, the model has a starting point for solving its initial
task, Task A.
• Another way to accomplish transfer learning is to use a pretrained model.
This process is easier, as it involves the use of an already trained model.
The pretrained model should have been trained using a large data set to
solve a similar task as task A. Models can be imported from other
developers who have published them online.
• A third approach, called feature extraction or representation learning, uses
deep learning to identify the most important features for Task A, which
then serves as a representation of the task. Features are normally created
manually, but deep learning automatically extracts features. The learned
representation can be used for other tasks as well.
Classification of transfer learning
One way of classifying transfer learning
• Transductive transfer.
• Source and target tasks are the same but use different data sets.
• Inductive transfer.
• Source and target tasks are different, regardless of the data set. Source and target data are typically
labeled.
• Unsupervised transfer.
• Source and target tasks are different, but the process uses unlabeled source and target data.
Unsupervised learning is useful in settings where manually labeling data is impractical.
Transfer learning can also be classified into near and far transfers.
• Near transfers are when the source and target tasks are closely related,
• while far transfers are when source and target tasks are vaguely related.
• If the tasks are closely related, this means they share similar data structures, features or
domains.
Another way to classify transfer learning is based on how well the knowledge
from a pretrained model facilitates performance on a new task. These are
classified as positive, negative and neutral transfers:
• Positive transfers occur when the knowledge gained from the source task
actively improves the performance on the target task.
• Negative transfers see a decrease in the performance of the new task.
• Neutral transfers occur when the knowledge gained from the source tasks has
little to no impact on the performance of the target task.
Key use cases for transfer learning
• Deep learning. Transfer learning is commonly used for deep learning neural
networks to help solve problems with limited data. Deep learning models
typically require large amounts of training data, which can be difficult and
expensive to acquire.
• NLP. Using transfer learning to train NLP models can improve performance by
transferring knowledge across tasks related to machine translation, sentiment
analysis and text classification.
• Computer vision. Pretrained models are useful for training computer vision tasks
like image segmentation, facial recognition and object detection, if the source
and target tasks are related.
• Image recognition. Transfer learning can improve the performance of models trained on limited
labeled data, which is useful in situations with limited data, such as medical imaging.
• Speech recognition. Models previously trained on large speech data sets are
useful for creating more versatile models. For example, a pretrained model could
be adapted to recognize specific languages, accents or dialects.
• Object detection. Pretrained models that were trained to identify specific objects
in images or videos could hasten the training of a new model. For example, a
pretrained model used to detect mammals could be retrained on a data set used to
identify different types of animals.
Future of transfer learning
• The future of transfer learning includes the following trends, which
might further shape ML and the development of ML models:
• The increased use of multimodal transfer learning. Models are
designed to learn from multiple types of data simultaneously. These can
include text, image and audio data sets, for example, which leads to
more versatile ML and artificial intelligence (AI) systems.
• Federated transfer learning. This combines transfer and federated
learning.
• Federated transfer learning enables models to transfer knowledge between
decentralized data sources but does so in a way that keeps local data private. This
enables multiple organizations to collaborate to improve their models across
decentralized data sources while also maintaining data privacy.
• Lifelong transfer learning. This creates a model that can continuously
learn and adapt to new tasks and data over time.
• Zero-shot and few-shot transfer learning. Both methods are designed
to enable ML models to perform well with minimal or no training data.
• Zero-shot revolves around the concept of predicting labels for unseen data
classes, and few-shot learning involves learning from only a small amount of data
per class. Using this practice, models can rapidly learn to make effective
generalizations with little data. This practice has the potential to reduce the
reliance organizations have on collecting large data sets for training.
Descriptive modeling using
Clustering algorithms
Clustering
• Clustering is a data mining (machine learning)
technique that finds similarities between data
according to the characteristics found in the data &
groups similar data objects into one cluster
• Given a set of data points, each
having a set of attributes, and a
similarity measure among them,
group the points into some number
of clusters, so that
• Data points in the same cluster are
similar to one another.
• Data points in separate clusters are
dissimilar to one another.
Clustering: Document Clustering
• Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
• Approach:
Identify content-bearing terms in each document.
Form a similarity measure based on the frequencies of different terms and use it to cluster
documents.
• Application:
Information Retrieval can utilize the clusters to relate a new document or search term to clustered
documents.
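The approach above can be sketched with scikit-learn. This is only an illustration under stated assumptions: the four-document corpus is made up, TF-IDF weighting stands in for "frequencies of different terms", and cosine similarity is one possible similarity measure among several.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (made up for illustration): two finance and two sports documents.
docs = [
    "stock market shares trading",
    "market trading stock prices",
    "football match goal team",
    "team goal football league",
]

# Step 1: weight the content-bearing terms of each document (TF-IDF).
tfidf = TfidfVectorizer().fit_transform(docs)

# Step 2: a similarity measure based on the term weights (cosine similarity).
similarity = cosine_similarity(tfidf)

# Step 3: cluster the documents. TF-IDF rows have unit length by default, so
# Euclidean k-means on them behaves much like clustering by cosine similarity.
doc_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)

print(similarity.round(2))
print(doc_labels)
```

A new document or search term could then be related to the clusters by transforming it with the same vectorizer and comparing it with the clustered documents.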
Hard vs. soft clustering
• Hard clustering: Each
document belongs to
exactly one cluster
• More common and easier
to do
$$\mathrm{dis}(X, Y) = \sqrt[q]{\sum_{i=1}^{n} \left(|x_i - y_i|\right)^{q}}$$
where X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) are two n-
dimensional data objects; n is size of vector attributes of the
data object; q= 1,2,3,…
Similarity & Dissimilarity Between Objects
$$\mathrm{dis}(X, Y) = \sqrt{\sum_{i=1}^{n} \left(|x_i - y_i|\right)^{2}}$$
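As a small sketch of the general distance with parameter q, the function below computes the q-th root of the summed q-th powers of the coordinate differences. The function name `minkowski_distance` is chosen here for illustration; q = 1 gives the Manhattan distance and q = 2 the Euclidean distance.

```python
def minkowski_distance(x, y, q):
    """q-th root of the sum of |x_i - y_i|^q over all n dimensions."""
    if len(x) != len(y):
        raise ValueError("both objects must have the same number of attributes")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)


p1, p2 = (2, 10), (8, 4)
manhattan = minkowski_distance(p1, p2, 1)  # q = 1: |2 - 8| + |10 - 4|
euclidean = minkowski_distance(p1, p2, 2)  # q = 2: square root of 6^2 + 6^2
print(manhattan, euclidean)
```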
The need for representative
• Key problem: as you build clusters, how do you represent the location
of each cluster, to tell which pair of clusters is closest?
• Represent each cluster by its centroid: the average of its points, which is the
location closest to all of them.
The K-Means Clustering Algorithm
Given k (number of clusters), the k-means algorithm is
implemented as follows:
• Select K data points randomly as the initial centroids
• Repeat until the centroids don't change
• Compute the similarity between each instance and
each centroid
• Assign each instance to the cluster with the
nearest centroid
• Recompute the centroid of each of the K clusters of the
current partition (the centroid is the center, i.e.,
mean point, of the cluster)
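The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a production implementation: it assumes Euclidean distance, and the function name `k_means`, the `max_iter` cap, the `seed` parameter and the rule for empty clusters are choices made here rather than part of the slide.

```python
import numpy as np


def k_means(points, k, max_iter=100, seed=0):
    """Plain k-means with Euclidean distance; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Select k distinct data points at random as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Distance from every point to every centroid, shape (n, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each point to its nearest centroid.
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, labels = k_means(data, k=2)
print(centroids)
print(labels)
```

Because the initial centroids are random, different seeds can give different cluster numbering, and on harder data different final clusters.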
Example Problem
• Cluster the following eight points (with (x, y)
representing locations) into three clusters :
A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8)
A5(7, 5) A6(6, 4) A7(1, 2) A8(4, 9).
• Assume that initial cluster centers are:
A1(2, 10), A3(8, 4) and A7(1, 2).
• The distance function between two points Aj=(x1, y1)
and Ci=(x2, y2) is defined as:
dis(Aj, Ci) = |x2 – x1| + |y2 – y1| .
• Use k-means algorithm to find optimal centroids to group
the given data into three clusters.
Iteration 1
First, we list all points in the first column of the table below. The initial
cluster centers (centroids) are (2, 10), (8, 4) and (1, 2), chosen
randomly.
Next, we calculate the distance from each point to each of the
three centroids, using the distance function:
dis(point i, mean j) = |x2 – x1| + |y2 – y1|
Each point is then assigned to the cluster with the nearest centroid (last column).
Data Points   Cluster 1 with       Cluster 2 with      Cluster 3 with      Cluster
              centroid (2, 10)     centroid (8, 4)     centroid (1, 2)
A1 (2, 10)    0                    12                  9                   1
A2 (2, 5)     5                    7                   4                   3
A3 (8, 4)     12                   0                   9                   2
A4 (5, 8)     5                    7                   10                  1
A5 (7, 5)     10                   2                   9                   2
A6 (6, 4)     10                   2                   7                   2
A7 (1, 2)     9                    9                   0                   3
A8 (4, 9)     3                    9                   10                  1
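Second epoch
• Using the new centroids (the mean of each cluster's members, e.g.
((2 + 5 + 4)/3, (10 + 8 + 9)/3) = (3.67, 9) for cluster 1), compute the cluster members again.
Data Points   Cluster 1 with       Cluster 2 with       Cluster 3 with        Cluster
              centroid (3.67, 9)   centroid (7, 4.33)   centroid (1.5, 3.5)
A1 (2, 10)    2.67                 10.67                7                     1
A2 (2, 5)     5.67                 5.67                 2                     3
A3 (8, 4)     9.33                 1.33                 7                     2
A4 (5, 8)     2.33                 5.67                 8                     1
A5 (7, 5)     7.33                 0.67                 7                     2
A6 (6, 4)     7.33                 1.33                 5                     2
A7 (1, 2)     9.67                 8.33                 2                     3
A8 (4, 9)     0.33                 7.67                 8                     1
• After the 2nd epoch the results would be: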
cluster 1: {A1,A4,A8} with new centroid=(3.67,9);
cluster 2: {A3,A5,A6} with new centroid = (7,4.33);
cluster 3: {A2,A7} with new centroid=(1.5,3.5)
Final results
• Finally, in the 2nd epoch there is no change in the members of the
clusters or in the centroids, so the algorithm stops.
• The result of clustering is shown in the following figure
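The worked example can be replayed in a few lines of plain Python. The sketch below uses the eight points, the three initial centroids and the distance function |x2 – x1| + |y2 – y1| from the example; the helper names `manhattan`, `assign` and `recompute` are made up for illustration.

```python
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}


def manhattan(p, c):
    """Distance function used in the example: |x2 - x1| + |y2 - y1|."""
    return abs(p[0] - c[0]) + abs(p[1] - c[1])


def assign(centroids):
    """Map each point name to the 1-based number of its nearest centroid."""
    return {
        name: min(range(len(centroids)), key=lambda j: manhattan(p, centroids[j])) + 1
        for name, p in points.items()
    }


def recompute(assignment, k):
    """New centroid of each cluster: the mean of its member points."""
    new = []
    for cluster in range(1, k + 1):
        members = [points[n] for n, c in assignment.items() if c == cluster]
        new.append((sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members)))
    return new


centroids = [(2, 10), (8, 4), (1, 2)]  # initial centroids from the example
epoch = 0
while True:
    epoch += 1
    assignment = assign(centroids)
    new_centroids = recompute(assignment, len(centroids))
    print(epoch, assignment, [(round(x, 2), round(y, 2)) for x, y in new_centroids])
    if new_centroids == centroids:  # nothing moved, so stop
        break
    centroids = new_centroids
```

The loop stops in the epoch where recomputing the centroids leaves them unchanged.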
Density based clustering
• Density-based clustering attempts to detect areas
• where data points are concentrated and where they are separated
by areas that are sparse
• A well-known density-based clustering algorithm is DBSCAN (Density-Based Spatial
Clustering of Applications with Noise):
• it groups points based on the distance between nearest points
• Two parameters required for DBSCAN algorithm
• eps: defines the neighborhood around a data point. i.e. two
data points are considered neighbors if
Dis(x, y) <= ‘eps’
• If the eps value is chosen too small then a large part of the data will be considered as an outlier.
• If it is chosen very large then the clusters will merge and the majority of the data points will be in the
same clusters.
• MinPts: Minimum number of neighbors (data points) within
eps radius. The larger the dataset, the larger value of MinPts
must be chosen.
• As a general rule, the minimum MinPts can be derived from the number of dimensions D in the
dataset as, MinPts >= D+1.
DBSCAN algorithm
• In this algorithm, there are
three types of data points to
be identified.
• Core Point: A point is a core point if it has at least
MinPts data points (counting itself) within eps.
• Border Point: A data point which has fewer than
MinPts within eps but it is in the neighborhood of
a core point.
• Noise or outlier: A point which is not a core point
or border point.
• Let’s repeat the above process for every point in the dataset and find
out the neighborhood of each.
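A short scikit-learn sketch can make the three point types concrete. The ten 2-D points below are made up for illustration, and the values eps = 0.5 and MinPts = 3 are arbitrary choices for this toy data. In scikit-learn, MinPts is called `min_samples` and counts the point itself, and noise points receive the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (made up): two dense groups, one point on the edge of the first
# group, and one isolated point.
X = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2],  # dense group 1
    [1.6, 1.0],                                        # edge of group 1
    [8.0, 8.0], [8.1, 7.9], [7.9, 8.2], [8.2, 8.1],  # dense group 2
    [4.5, 15.0],                                       # isolated point
])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = db.labels_  # cluster number per point, -1 for noise

# Core points are reported directly; border points are clustered but not core.
is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True
kinds = [
    "core" if core else ("noise" if label == -1 else "border")
    for core, label in zip(is_core, labels)
]

for point, label, kind in zip(X, labels, kinds):
    print(point, label, kind)
```

The edge point has only one neighbor within eps besides itself, so it is not a core point, but that neighbor is a core point, which makes it a border point.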
DBSCAN Algorithm in action
DBSCAN Algorithm result
DBSCAN Algorithm final result
Project (Demo: June 2)
• Requirement:
–Select a problem that requires use of images or unstructured text.
–Prepare a dataset to conduct experiment and construct the
intended model.
–Use DL algorithms and pretrained models to construct or update a
model using Python
•Concept presentation
THANK YOU
(PHDS2023@[Link])
Python
Python is a high-level, general-purpose programming
language. Its design philosophy emphasizes code
readability with the use of significant indentation
What software do we need to use
Python for different tasks?
• Anaconda
• Jupyter Notebook
• Different packages (Libraries), such as
• scikit-learn (for ML algorithms),
• OpenCV (for digital image processing, DIP),
• numpy (high dimensional data manipulation),
• pandas (for data processing),
• HDF5 (store and manipulate data)
• matplotlib (data visualization), etc.
Installing Anaconda on
Windows
• Anaconda is a package manager, an
environment manager, and Python
distribution that contains a collection of
many open source packages.
This is advantageous because a project may
need many different packages (scikit-learn,
NumPy, SciPy and pandas, to name a few), which
come preinstalled with Anaconda.
Installing Anaconda on Windows
• If you need additional packages after installing
Anaconda, you can use Anaconda's package
manager, conda, or pip to install those packages
(pip install PACKAGE).
pip install scikit-learn
pip install pandas or pip3 install pandas
From Jupyter notebook: !pip install pandas
This is highly advantageous as you don't have to
manage dependencies between multiple packages
yourself. Conda even makes it easy to switch
between Python 2 and 3.
• In fact, an installation of Anaconda is also the
recommended way to install Jupyter Notebooks.
Python modules for machine
learning, data mining and data
analytics
• Scikit-learn is probably the most useful library for machine
learning in Python. The sklearn library contains a lot of
efficient tools for machine learning and statistical modeling
including classification, regression, clustering and
dimensionality reduction.
• scikit-learn is a Python module for machine learning built on
top of SciPy
• scikit-learn requires:
• Python (>= 3.6)
• NumPy (>= 1.13.3)
• SciPy (>= 0.19.1)
• joblib (>= 0.11)
• threadpoolctl (>= 2.0.0)
Use sklearn for Classification
#Identify duplicates
print(df.duplicated().sum())
#Remove duplicates
df_no_duplicates = df.drop_duplicates()
Class label encoding
• Convert each unique value category into a numeric value
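One common way to do this is scikit-learn's `LabelEncoder`, sketched below. The class names are made up for illustration; the encoder assigns integers in sorted order of the unique values.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels, used only for illustration.
class_labels = ["healthy", "rust", "healthy", "blight", "rust"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(class_labels)  # one integer per unique category

print(list(encoder.classes_))  # categories in the order of their integer codes
print(list(encoded))

# The mapping is reversible, which helps when reporting predictions.
decoded = list(encoder.inverse_transform(encoded))
```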
# Split the dataset into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split( X, y,
test_size=0.3, random_state=42, stratify=y )
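To see what `stratify=y` does in the call above, the sketch below runs the same split on a made-up imbalanced data set of 80 samples of class 0 and 20 of class 1. The data is hypothetical; only the `train_test_split` arguments are taken from the slide.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples with 2 features, 80% class 0 and 20% class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
# With stratify=y, both parts keep the 80/20 class ratio of the full data set.
print(np.bincount(y_train), np.bincount(y_test))
```

Without `stratify`, a random split of a small or imbalanced data set can leave the test set with too few samples of the minority class.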
Undersampling
• To apply the undersampling technique, we will use the
RandomUnderSampler algorithm, which randomly removes instances.
• It is available in the imbalanced-learn library.
• To apply under sampling
• create a RandomUnderSampler object with the random_state set to 42.
• This ensures that the random selection is reproducible when running the code multiple
times. We can also set the sampling strategy as ‘majority’, where only the majority class
will have instances removed.
• Finally, we will apply the RandomUnderSampling technique to the input data. The
fit_resample function fits the RandomUnderSampler object to the data and returns the
balanced data
# Import the necessary libraries
from imblearn.under_sampling import RandomUnderSampler
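To make the mechanism concrete, the sketch below reproduces the idea of undersampling the majority class in plain NumPy. It is not imbalanced-learn's implementation: the function name `undersample_majority` and the data are made up. With the library itself, the steps described above correspond to `RandomUnderSampler(sampling_strategy='majority', random_state=42)` followed by `fit_resample(X, y)`.

```python
import numpy as np


def undersample_majority(X, y, seed=42):
    """Randomly drop majority-class rows until that class matches the smallest class."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)  # fixed seed makes the selection reproducible
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[counts.argmax()]
    n_keep = counts.min()
    # Keep a random subset of the majority class and every row of the other classes.
    majority_idx = np.flatnonzero(y == majority)
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([np.flatnonzero(y != majority), kept_majority]))
    return X[keep], y[keep]


# Made-up imbalanced data: 90 samples of class 0 and 10 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

X_res, y_res = undersample_majority(X, y)
print(np.bincount(y), np.bincount(y_res))
```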
# To test using two data samples given in array form [age, gender]; 1 represents M and 0 represents F
import tensorflow as tf
import numpy as np
import cv2
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing import image
#file path
imageName = 'C:/Users/Milli/Desktop/python_code//[Link]'
#load image
img = image.load_img(imageName,target_size = (224,224))
plt.imshow(img)
#img = cv2.imread(imageName)
#img = cv2.resize(img,(224,224))
#img = cv2.cvtColor(img,cv2.COLOR_BGR2RGB)
#convert to array
resized_img = image.img_to_array(img)
predictions = model.predict(final_img)
#print(predictions)
# Random flips and small rotations, applied to the training images
data_augmentation = tf.keras.Sequential(
    [
        tf.keras.layers.RandomFlip("horizontal"),
        tf.keras.layers.RandomRotation(0.1),
    ]
)
Plant disease detection
import numpy as np
import matplotlib.pyplot as plt
for images, labels in train_set.take(1):
    plt.figure(figsize=(12, 12))
    first_image = images[0]
    for i in range(12):
        # subplot(3, 4, i + 1) divides the figure into 3 rows and 4 columns and draws at position i + 1
        ax = plt.subplot(3, 4, i + 1)
        augmented_image = data_augmentation(
            tf.expand_dims(first_image, 0)
        )
        plt.imshow(augmented_image[0].numpy().astype("int32"))
        plt.axis("off")
Plant disease detection
base_model = [Link](
weights='imagenet',
input_shape=(150, 150, 3),
include_top=False) #remove fully connected layer of CNN
base_model.trainable = False
inputs = tf.keras.Input(shape=(150, 150, 3))
x = data_augmentation(inputs)
x = [Link].preprocess_input(x)
x = base_model(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)
Plant disease detection
# Dense(1) has no activation, so the model outputs logits and the loss needs from_logits=True
model.compile(optimizer='adam',  # test 'sgd' in place of 'adam'
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=[tf.keras.metrics.BinaryAccuracy()])
model.fit(train_set, epochs=5, validation_data=val_dataset)
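# Unfreeze the base model for fine-tuning, then recompile so the change takes effect
base_model.trainable = True
#model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
#              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
#              metrics=[tf.keras.metrics.BinaryAccuracy()])
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.summary()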
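epochs = 5
# Train after recompiling and keep the returned History object for plotting
history = model.fit(train_set, epochs=epochs, validation_data=val_dataset)
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs_range = range(epochs)
plt.figure(figsize=(8, 8))
# Left panel: accuracy per epoch
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')
# Right panel: loss per epoch
plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()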
Project
• This is a project that helps you to exercise Python for data analysis.
- Present in class the result obtained in your data analysis
- prepare a report (DOC & PDF) and upload it along with PPT, python code, data set, and
reviewed articles
•Requirement:
–Choose text or image dataset for the experiment using Python
–Use (i) 2 ML and 1 DL algorithms, (ii) DL algorithms, or (iii) pretrained models for the
experiment using Python
–compare the performance of the selected algorithms
•Project Report
• Write a report with the following sections:
• Abstract -- ½ page
• Introduce problem and objective of the project -- 2 pages
• Description of algorithms used for the experiment -- 3 pages
• Discussion of experimental result --- 3 pages
• Concluding remarks, with major recommendation --- 1 page
• Reference (use IEEE referencing style)
Clustering
• Machine learning algorithms can be broadly classified into two categories:
supervised (classification) and unsupervised learning (clustering).
• The difference between them is the presence of a target
variable. In clustering, there is no target variable (class). The dataset only has
input or independent variables which describe the data.
• K-Means clustering is the most popular clustering algorithm. It is used when
we have unlabelled data which is data without defined categories or groups.
• The algorithm follows an easy or simple way to classify a given data set through a certain
number of clusters. K-Means algorithm works iteratively to assign each data point to one
of K groups based on the features that are provided. Data points are clustered based on
feature similarity.
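As a compact, self-contained illustration of this workflow, the sketch below scales made-up data with `MinMaxScaler`, runs `KMeans` for several values of K, and compares them with the silhouette score. The synthetic three-group data, the `n_init` and `random_state` settings, and the use of the highest silhouette score to pick K are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Made-up unlabelled data: 20 points scattered around each of three centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.2, size=(20, 2)) for c in centers])

# Scale every feature to [0, 1] so no feature dominates the distances.
X_norm = MinMaxScaler().fit_transform(X)

scores = {}
for k in range(2, 9):
    # Fit k-means for the current value of k and get one cluster label per point.
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    cluster_labels = model.fit_predict(X_norm)
    # Silhouette score: higher means tighter, better separated clusters.
    scores[k] = silhouette_score(X_norm, cluster_labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```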
Clustering
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import numpy as np
from sklearn.metrics import silhouette_score
[Link]()
[Link]
X_norm = [Link](X)
X_norm
kmeans.cluster_centers_
#centers = [Link](p.cluster_centers_)
#print(centers)
K = range(2, 9)
#fits = []
score = []
for k in K:
    # train the model for current value of k on training data
    model = KMeans(n_clusters=k)
    model.fit_predict(X_norm)
    print('running', k)