Understanding Bayesian Learning Methods
Bayesian learning methods are relevant to the study of machine learning for two
different reasons.
• First, Bayesian learning algorithms that calculate explicit probabilities for
hypotheses, such as the naive Bayes classifier, are among the most practical
approaches to certain types of learning problems.
• The second reason is that they provide a useful perspective for understanding
many learning algorithms that do not explicitly manipulate probabilities.
Features of Bayesian Learning Methods
• Each observed training example can incrementally decrease or increase the estimated
probability that a hypothesis is correct. This provides a more flexible approach to
learning than algorithms that completely eliminate a hypothesis if it is found to be
inconsistent with any single example.
• Prior knowledge can be combined with observed data to determine the final
probability of a hypothesis. In Bayesian learning, prior knowledge is provided by
asserting (1) a prior probability for each candidate hypothesis, and (2) a probability
distribution over observed data for each possible hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic predictions.
• New instances can be classified by combining the predictions of multiple hypotheses,
weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they can
provide a standard of optimal decision making against which other practical methods
can be measured.
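The idea of classifying a new instance by a probability-weighted combination of hypotheses can be sketched as follows. This is a minimal illustration, not a definitive implementation: the three hypotheses, their posterior probabilities, and their predictions are hypothetical values chosen only to show the mechanics.

```python
def weighted_vote(posteriors, predictions):
    """Sum the posterior probability of every hypothesis behind each label
    and return the label with the largest total, plus all totals."""
    votes = {}
    for h, p_h in posteriors.items():
        label = predictions[h]
        votes[label] = votes.get(label, 0.0) + p_h
    return max(votes, key=votes.get), votes

# Hypothetical posteriors P(h|D) and each hypothesis's label for one new instance.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}

label, votes = weighted_vote(posteriors, predictions)
print(label, votes)
```

Note that the single most probable hypothesis (h1) predicts "+", while the weighted combination of all hypotheses favours "-".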
Practical difficulty in applying Bayesian methods
• One practical difficulty in applying Bayesian methods is that they typically require
initial knowledge of many probabilities. When these probabilities are not known
in advance, they are often estimated based on background knowledge, previously
available data, and assumptions about the form of the underlying distributions.
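Bayes theorem gives the posterior probability P(h|D) of a hypothesis h given the observed data D:
$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$
P(h|D) increases with P(h) and with P(D|h) according to Bayes theorem.
P(h|D) decreases as P(D) increases, because the more probable it is that D will be
observed independent of h, the less evidence D provides in support of h.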
Maximum a Posteriori (MAP) Hypothesis
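The maximum a posteriori (MAP) hypothesis is the most probable hypothesis h in H given the observed data D:
$$h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h)\,P(h)$$
The term P(D) is dropped in the last step because it is a constant independent of h.
P(D|h) is often called the likelihood of the data D given h, and any hypothesis that
maximizes P(D|h) is called a maximum likelihood (ML) hypothesis:
$$h_{ML} = \arg\max_{h \in H} P(D|h)$$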
Example
Consider a medical diagnosis problem in which there are two alternative hypotheses
• The patient has a particular form of cancer (denoted by cancer)
• The patient does not (denoted by ¬ cancer)
The available data is from a particular laboratory with two possible outcomes: +
(positive) and - (negative)
• Suppose a new patient is observed for whom the lab test returns a positive
(+) result.
• Should we diagnose the patient as having cancer or not?
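The question can be answered by comparing P(+|h)P(h) for the two hypotheses. The sketch below does this in Python. The three probabilities are assumptions, since the notes above do not state them; they are the values commonly used with this textbook example (prior of 0.008, and a test that is correct for 98% of cancer cases and 97% of non-cancer cases).

```python
# Assumed values (not given in the notes above).
p_cancer = 0.008                 # prior P(cancer)
p_pos_given_cancer = 0.98        # P(+ | cancer)
p_neg_given_not_cancer = 0.97    # P(- | not cancer)

p_not_cancer = 1 - p_cancer
p_pos_given_not_cancer = 1 - p_neg_given_not_cancer

# Unnormalized posteriors P(+|h) * P(h) for each hypothesis.
score_cancer = p_pos_given_cancer * p_cancer
score_not_cancer = p_pos_given_not_cancer * p_not_cancer

# P(+) is the sum over both hypotheses; dividing by it normalizes the posterior.
p_pos = score_cancer + score_not_cancer
posterior_cancer = score_cancer / p_pos

h_map = "cancer" if score_cancer > score_not_cancer else "not cancer"
print(h_map, round(posterior_cancer, 3))
```

Under these assumed numbers the MAP hypothesis is "not cancer": the posterior probability of cancer rises well above its prior after a positive test, but it remains the less probable hypothesis.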
BAYES THEOREM AND CONCEPT LEARNING
What is the relationship between Bayes theorem and the problem of concept
learning?
Bayes theorem provides a principled way to calculate the posterior probability
of each hypothesis given the training data, so it can be used as the basis for a
straightforward learning algorithm that calculates the probability of each possible
hypothesis and then outputs the most probable one.
Brute-Force Bayes Concept Learning
Let us choose P(h) and P(D|h) to be consistent with the following assumptions:
• The training data D is noise free (i.e., di = c(xi))
• The target concept c is contained in the hypothesis space H
• We have no a priori reason to believe that any hypothesis is more probable than any other.
What values should we specify for P(h)?
• Given no prior knowledge that one hypothesis is more likely than another, it
is reasonable to assign the same prior probability to every hypothesis h in H.
• Assume the target concept is contained in H and require that these
prior probabilities sum to 1. Together, these constraints imply
$$P(h) = \frac{1}{|H|} \quad \text{for all } h \in H$$
What choice shall we make for P(D|h)?
• P(D|h) is the probability of observing the target values D = (d1 . . .dm) for the
fixed set of instances (x1 . . . xm), given a world in which hypothesis h holds
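• Since we assume noise-free training data, the probability of observing
classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore,
$$P(D|h) = \begin{cases} 1 & \text{if } d_i = h(x_i) \text{ for all } d_i \in D \\ 0 & \text{otherwise} \end{cases}$$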
Given these choices for P(h) and for P(D|h) we now have a fully-defined problem
for the above BRUTE-FORCE MAP LEARNING algorithm.
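A minimal sketch of brute-force MAP learning under these choices is shown below. The hypothesis space (threshold rules of the form "x >= t") and the two training examples are hypothetical, chosen only so that the posterior can be checked by hand. The code assumes, as the text does, that at least one hypothesis in H is consistent with the data.

```python
def brute_force_map(hypotheses, data):
    """Compute P(h|D) for every hypothesis and return the most probable one.

    Uses a uniform prior P(h) = 1/|H| and the noise-free likelihood
    P(D|h) = 1 if h agrees with every example, else 0.
    """
    prior = 1.0 / len(hypotheses)
    unnormalized = {}
    for name, h in hypotheses.items():
        consistent = all(h(x) == d for x, d in data)
        likelihood = 1.0 if consistent else 0.0
        unnormalized[name] = likelihood * prior
    p_data = sum(unnormalized.values())   # P(D), assumed to be non-zero
    posterior = {name: v / p_data for name, v in unnormalized.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior

# Hypothetical hypothesis space: h_t(x) is True exactly when x >= t.
hypotheses = {f"x>={t}": (lambda x, t=t: x >= t) for t in range(6)}
data = [(1, False), (4, True)]            # noise-free pairs (x_i, d_i)

best, posterior = brute_force_map(hypotheses, data)
print(best, posterior)
```

Every consistent hypothesis receives the same posterior, one over the number of consistent hypotheses, and every inconsistent hypothesis receives 0, so any consistent hypothesis is a MAP hypothesis.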
Example:
• Because FIND-S outputs a consistent hypothesis, it will output a MAP hypothesis under the probability
distributions P(h) and P(D|h) defined above.
• Are there other probability distributions for P(h) and P(D|h) under which FIND-S outputs MAP
hypotheses? Yes.
• Because FIND-S outputs a maximally specific hypothesis from the version space, its output
hypothesis will be a MAP hypothesis relative to any prior probability distribution that favours
more specific hypotheses.
• The Bayesian framework is a way to characterize the behaviour of learning algorithms.
• By identifying probability distributions P(h) and P(D|h) under which the output
is an optimal hypothesis, the implicit assumptions of the algorithm can be
characterized (its inductive bias).
• Inductive inference is modelled by an equivalent probabilistic reasoning
system based on Bayes theorem.
MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES
Consider the problem of learning a continuous-valued target function, as arises in neural
network learning, linear regression, and polynomial curve fitting.
A straightforward Bayesian analysis will show that under certain assumptions any
learning algorithm that minimizes the squared error between the output hypothesis
predictions and the training data will output a maximum likelihood (ML) hypothesis.
Learning A Continuous-Valued Target Function
• Learner L considers an instance space X and a hypothesis space H consisting of some class of
real-valued functions defined over X, i.e., (∀ h ∈ H)[ h : X → R], and training examples of the
form <xi, di>
• The problem faced by L is to learn an unknown target function f : X → R
• A set of m training examples is provided, where the target value of each example is corrupted by
random noise drawn according to a Normal probability distribution with zero mean (di = f(xi) + ei)
• Each training example is a pair of the form (xi ,di ) where di = f (xi ) + ei .
– Here f(xi) is the noise-free value of the target function and ei is a random variable representing
the noise.
– It is assumed that the values of the ei are drawn independently and that they are
distributed according to a Normal distribution with zero mean.
• The task of the learner is to output a maximum likelihood hypothesis, or, equivalently, a MAP
hypothesis assuming all hypotheses are equally probable a priori.
Learning A Linear Function
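Assuming the training examples are mutually independent given h, we can write p(D|h) as the product of
the various p(di|h):
$$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i|h)$$
Given that the noise ei obeys a Normal distribution with zero mean and unknown variance σ², each di
must also obey a Normal distribution with variance σ² centred on the true target value f(xi). Because we are writing
the expression for p(D|h), we assume h is the correct description of f. Hence, µ = f(xi) = h(xi), and
$$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2}(d_i - h(x_i))^2}$$
It is common to maximize the less complicated logarithm instead, which is justified because ln p is a
monotonic function of p:
$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} \left[ \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}(d_i - h(x_i))^2 \right]$$
The first term in this expression is a constant independent of h and can therefore be discarded:
$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2\sigma^2}(d_i - h(x_i))^2$$
Maximizing this negative term is equivalent to minimizing the corresponding positive term:
$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} \frac{1}{2\sigma^2}(d_i - h(x_i))^2$$
Finally, discard constants that are independent of h:
$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2$$
• Thus hML is the hypothesis that minimizes the sum of the squared errors between the observed training values di and the hypothesis predictions h(xi).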
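The equivalence between maximizing the Gaussian likelihood and minimizing the squared error can be checked numerically. The sketch below is illustrative only: the data points, the candidate hypotheses h(x) = w·x, and the noise standard deviation sigma = 1.0 are all assumptions made for the demonstration.

```python
import math

def gaussian_log_likelihood(w, data, sigma):
    """ln p(D|h) for h(x) = w * x under zero-mean Normal noise."""
    total = 0.0
    for x, d in data:
        residual = d - w * x
        total += -0.5 * math.log(2 * math.pi * sigma ** 2)
        total += -residual ** 2 / (2 * sigma ** 2)
    return total

def sum_squared_error(w, data):
    """Sum over the training examples of (d_i - h(x_i))^2."""
    return sum((d - w * x) ** 2 for x, d in data)

# Hypothetical training pairs (x_i, d_i), roughly following d = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
candidates = [i / 10 for i in range(41)]   # slopes 0.0, 0.1, ..., 4.0

w_ml = max(candidates, key=lambda w: gaussian_log_likelihood(w, data, sigma=1.0))
w_ls = min(candidates, key=lambda w: sum_squared_error(w, data))
print(w_ml, w_ls)
```

Both searches select the same slope, because the log-likelihood differs from the negative squared error only by a constant and a positive scale factor.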
Use Equation (4) to substitute for P(di|h, xi) in Equation (5) to obtain
We write an expression for the maximum likelihood hypothesis
Equation (7) describes the quantity that must be maximized in order to obtain the maximum
likelihood hypothesis in our current problem setting
Gradient Search to Maximize Likelihood in a Neural Net
Derive a weight-training rule for neural network learning that seeks to maximize G(h, D) using
gradient ascent
• The gradient of G(h, D) is given by the vector of partial derivatives of G(h, D) with respect to
the various network weights that define the hypothesis h represented by the learned network
• In this case, the partial derivative of G(h, D) with respect to weight wjk from input k to unit j is
Suppose our neural network is constructed from a single layer of sigmoid units. Then,
where xijk is the kth input to unit j for the ith training example, and σ'(x) is the derivative of the sigmoid
squashing function.
Finally, substituting this expression into Equation (1), we obtain a simple expression for the
derivatives that constitute the gradient
Because we seek to maximize rather than minimize P(D|h), we perform gradient ascent rather than
gradient descent search. On each iteration of the search the weight vector is adjusted in the direction
of the gradient, using the weight update rule
where η is a small positive constant that determines the step size of the gradient ascent search.
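A gradient ascent search of this kind can be sketched for a single sigmoid unit. The sketch assumes that G(h, D) is the log-likelihood sum over i of di ln h(xi) + (1 − di) ln(1 − h(xi)), which is the usual form for this setting; the notes above do not show it explicitly. Under that assumption the gradient with respect to weight k is the sum over i of (di − h(xi)) xik. The toy data, step size, and iteration count are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    """Output h(x) of a single sigmoid unit."""
    return sigmoid(sum(w * xk for w, xk in zip(weights, x)))

def log_likelihood(weights, data):
    """Assumed G(h, D): sum of d*ln h(x) + (1 - d)*ln(1 - h(x))."""
    total = 0.0
    for x, d in data:
        h = predict(weights, x)
        total += d * math.log(h) + (1 - d) * math.log(1 - h)
    return total

def gradient_ascent_step(weights, data, eta):
    """Move the weights a step of size eta along the gradient of G."""
    grads = [0.0] * len(weights)
    for x, d in data:
        error = d - predict(weights, x)
        for k, xk in enumerate(x):
            grads[k] += error * xk
    return [w + eta * g for w, g in zip(weights, grads)]

# Toy data: the first input is a constant 1.0 that acts as a bias term.
data = [((1.0, 0.0), 0), ((1.0, 1.0), 0), ((1.0, 2.0), 1), ((1.0, 3.0), 1)]
weights = [0.0, 0.0]
history = [log_likelihood(weights, data)]
for _ in range(200):
    weights = gradient_ascent_step(weights, data, eta=0.1)
    history.append(log_likelihood(weights, data))
print(history[0], history[-1])
```

The recorded values of G rise over the iterations, which is the behaviour gradient ascent is meant to produce when the step size is small enough.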
It is interesting to compare this weight-update rule to the weight-update rule used by the
BACKPROPAGATION algorithm to minimize the sum of squared errors between predicted and
observed network outputs.
The BACKPROPAGATION update rule for output unit weights, re-expressed using our current
notation, is
MINIMUM DESCRIPTION LENGTH PRINCIPLE
• A Bayesian perspective on Occam’s razor
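• Motivated by interpreting the definition of hMAP in the light of basic concepts from
information theory. Taking log2 of the definition of hMAP, and then equivalently minimizing the negative of the result, gives
$$h_{MAP} = \arg\max_{h \in H} \left[ \log_2 P(D|h) + \log_2 P(h) \right] = \arg\min_{h \in H} \left[ -\log_2 P(D|h) - \log_2 P(h) \right] \qquad (1)$$
• This equation can be interpreted as a statement that short hypotheses are preferred, assuming
a particular representation scheme for encoding hypotheses and data.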
Introduction to a basic result of information theory
• -log2P(h): the description length of h under the optimal encoding for the hypothesis space H:
LCH(h) = −log2P(h), where CH is the optimal code for hypothesis space H.
• -log2P(D|h): the description length of the training data D given hypothesis h, under its
optimal encoding: LCD|h(D|h) = −log2P(D|h), where CD|h is the
optimal code for describing data D assuming that both the sender and receiver know the
hypothesis h.
Rewrite Equation (1) to show that hMAP is the hypothesis h that minimizes the sum given by the
description length of the hypothesis plus the description length of the data given the hypothesis:
$$h_{MAP} = \arg\min_{h \in H} \left[ L_{C_H}(h) + L_{C_{D|h}}(D|h) \right]$$
where CH and CD|h are the optimal encodings for H and for D given h.
The Minimum Description Length (MDL) principle recommends choosing the hypothesis that
minimizes the sum of these two description lengths:
$$h_{MDL} = \arg\min_{h \in H} \left[ L_{C_1}(h) + L_{C_2}(D|h) \right]$$
where codes C1 and C2 are used to represent the hypothesis and the data given the hypothesis, respectively.
The above analysis shows that if we choose C1 to be the optimal encoding of hypotheses CH, and if
we choose C2 to be the optimal encoding CD|h, then hMDL = hMAP
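The claim that hMDL = hMAP under the optimal encodings can be checked with a small numerical sketch. The priors and likelihoods below are hypothetical; the description lengths are computed as −log2 of the corresponding probabilities, as in the definitions above.

```python
import math

def description_length_bits(probability):
    """Length in bits of an event under the optimal code: -log2 P."""
    return -math.log2(probability)

# Hypothetical priors P(h) and likelihoods P(D|h) for three hypotheses.
prior = {"h1": 0.5, "h2": 0.25, "h3": 0.25}
likelihood = {"h1": 0.01, "h2": 0.08, "h3": 0.02}

# Total description length: bits for h plus bits for D given h.
total_bits = {
    h: description_length_bits(prior[h]) + description_length_bits(likelihood[h])
    for h in prior
}

h_mdl = min(total_bits, key=total_bits.get)
h_map = max(prior, key=lambda h: prior[h] * likelihood[h])
print(h_mdl, h_map, total_bits)
```

Here h1 has the shortest description on its own (1 bit), but h2 explains the data so much more compactly that its total description length is the smallest, and it is also the MAP hypothesis.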
Bayes Optimal Classifier
Example
Gibbs Algorithm
The Gibbs algorithm, often referred to as Gibbs Sampling, is a Markov Chain Monte Carlo (MCMC) technique used to sample from
complex, high-dimensional probability distributions. It is particularly useful in machine learning and statistics for problems where direct
sampling is challenging.
How it works
1. Initialization: Start with an initial guess for all variables in the distribution.
2. Iterative Sampling:
o At each step, one variable is sampled while keeping all other variables fixed.
o This is done using the conditional probability distribution of the selected variable, given
Expectation-Maximization Algorithm
The Expectation-Maximization (EM) algorithm is a powerful iterative optimization technique used to estimate
unknown parameters in probabilistic models, particularly when the data is incomplete, noisy or contains hidden
(latent) variables. It works in two steps:
E-step (Expectation Step): Using the current parameter estimates, the algorithm calculates the expected
values of the missing or hidden variables. Essentially, it assigns probabilities or "responsibilities" to different
hidden outcomes given the observed data.
M-step (Maximization Step): With these updated expectations from the E-step, the algorithm then
re-estimates the model parameters by maximizing the expected log-likelihood. This improves how well the
model explains the observed data.
These two steps are repeated until convergence, which typically means that:
The parameter values stop changing significantly, or
The log-likelihood of the observed data stops improving significantly.
At each iteration:
Calculate the posterior probability of each latent variable based on the observed data.
Compute the log-likelihood of the observed data using the current parameter estimates.
If the changes in log-likelihood or parameters are below a set threshold, stop. If not, repeat the E-step and M-step.
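The E-step, M-step, and convergence check can be sketched for a mixture of two one-dimensional Gaussians. This is a minimal illustration under stated assumptions, not a production implementation: the data, the initial means, the tolerance, and the small floor on the variance (added to avoid division by zero) are all hypothetical choices.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a one-dimensional Normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, init_means, max_iter=100, tol=1e-8):
    means = list(init_means)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: responsibility of each component for each point,
        # plus the log-likelihood under the current parameters.
        resp, ll = [], 0.0
        for x in xs:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k])
                     for k in range(2)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([j / total for j in joint])
        # Convergence check on the change in log-likelihood.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)   # assumed floor for stability
    return weights, means, variances

# Hypothetical data drawn around two well-separated centres.
xs = [0.9, 1.0, 1.1, 1.2, 0.8, 4.9, 5.0, 5.1, 5.2, 4.8]
weights, means, variances = em_two_gaussians(xs, init_means=[0.0, 6.0])
print(weights, means, variances)
```

With these well-separated points the estimated means settle near the two cluster centres. With overlapping clusters or poor initial means, the same loop may converge to a different local optimum.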
Applications
Clustering: Used in Gaussian Mixture Models (GMMs) to assign data points to clusters probabilistically.
Missing Data Imputation: Helps fill in missing values in datasets by estimating them iteratively.
Image Processing: Applied in image segmentation, denoising and restoration tasks where pixel classes are hidden.
Natural Language Processing (NLP): Used in tasks like word alignment in machine translation and topic modeling (LDA).
Hidden Markov Models (HMMs): EM’s variant, the Baum-Welch algorithm, estimates transition/emission probabilities for sequence
data.
Advantages
Monotonic improvement: Each iteration increases (or at least never decreases) the log-likelihood.
Handles incomplete data well: Works effectively even with missing or hidden variables.
Flexibility: Can be applied to many probabilistic models, not just mixtures of Gaussians.
Easy to implement: The E-step and M-step are conceptually simple and often have closed-form updates.
Disadvantages
Slow convergence: Convergence can be very gradual, especially near the optimum.
Initialization sensitive: Requires good initial parameter guesses; poor choices may yield bad solutions.
No guarantee of global best solution: Unlike some optimization methods, EM doesn’t guarantee reaching the absolute best
parameters.
Computationally intensive: For large datasets or complex models, repeated iterations can be costly.
UNIT V
Classification Models
Classification teaches a machine to sort things into categories. It learns by looking at examples with
labels (like emails marked "spam" or "not spam"). After learning, it can decide which category new
items belong to, like identifying if a new email is spam or not.
Types of Classification
1. Binary Classification
This is the simplest kind of classification. In binary classification, the goal is to sort the data into two distinct
categories. Think of it like a simple choice between two options. Imagine a system that sorts emails into
either spam or not spam. It works by looking at different features of the email like certain keywords or sender
details, and decides whether it’s spam or not. It only chooses between these two options.
2. Multiclass Classification
Here, instead of just two categories, the data needs to be sorted into more than two categories. The model picks
the one that best matches the input. Think of an image recognition system that sorts pictures of animals into
categories like cat, dog, and bird.
Basically, the machine looks at the features in the image (like shape, color, or texture) and chooses which
animal the picture is most likely to be, based on the training it received.
3. Multi-Label Classification
In multi-label classification, a single piece of data can belong to multiple categories at once. Unlike
multiclass classification where each data point belongs to only one class, multi-label classification
allows datapoints to belong to multiple classes. A movie recommendation system could tag a
movie as both action and comedy. The system checks various features (like movie plot, actors, or
genre tags) and assigns multiple labels to a single piece of data, rather than just one.
Classification works by training a model to learn patterns from labeled data, so it can
predict the category or class of new, unseen data. Here's how it works:
1. Data Collection: You start with a dataset where each item is labeled with the correct class
(for example, "cat" or "dog").
2. Feature Extraction: The system identifies features (like color, shape, or texture) that help
distinguish one class from another. These features are what the model uses to make
predictions.
3. Model Training: A classification machine learning algorithm uses the labeled data to
learn how to map the features to the correct class. It looks for patterns and relationships in
the data.
4. Model Evaluation: Once the model is trained, it's tested on new, unseen data to check
how accurately it can classify the items.
5. Prediction: After being trained and evaluated, the model can be used to predict the class
of new data based on the features it has learned.
6. Model Evaluation: Evaluating a classification model is a key step in machine learning. It
helps us check how well the model performs and how good it is at handling new, unseen
data. Depending on the problem and needs, we can use different metrics to measure its
performance.
Classification Algorithms
Linear Classifiers: Linear classifier models create a linear decision boundary between classes. They are simple
and computationally efficient. Some of the linear classification models are as follows:
Logistic Regression
Single-layer Perceptron
Non-linear Classifiers: Non-linear models create a non-linear decision boundary between classes. They can
capture more complex relationships between input features and target variable. Some of the
non-linear classification models are as follows:
K-Nearest Neighbours
Kernel SVM
Naive Bayes
Random Forests
AdaBoost
Bagging Classifier
Voting Classifier
1. Data Collection
The first step is to gather a dataset relevant to the classification problem. The data can come from
structured sources like databases or unstructured sources like text, images, or sensor readings. Ensuring that
the data is representative of the problem domain is essential for building a reliable classifier.
2. Data Preprocessing
Before training a model, data must be cleaned and prepared. Common preprocessing steps include:
Handling missing values (e.g., filling or removing missing entries)
Encoding categorical variables into numerical form (e.g., label encoding or one-hot encoding)
Normalizing or scaling features to ensure consistent ranges
Removing outliers or irrelevant data points
Preprocessing improves model accuracy and training efficiency.
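Three of the preprocessing steps listed above can be sketched in plain Python. The helper names, the choice of mean imputation and min-max scaling, and the tiny example columns are assumptions made for illustration; other choices (median imputation, standardization, label encoding) are equally common.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Rescale numbers linearly so they fall in the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Encode each category as a 0/1 vector, one position per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories

# Hypothetical feature columns.
ages = [20, None, 40, 60]
colors = ["red", "blue", "red", "green"]

ages_scaled = min_max_scale(impute_mean(ages))
colors_encoded, categories = one_hot(colors)
print(ages_scaled)
print(categories, colors_encoded)
```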
6. Model Evaluation
After training, the model’s performance is evaluated using metrics such as:
Accuracy
Precision, Recall, and F1-score
ROC-AUC for binary classification
Confusion matrix analysis
Evaluation ensures the model generalizes well and helps in identifying if further tuning is required.
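The first few metrics can be computed directly from the four confusion-matrix counts. The sketch below assumes a binary problem with the positive class encoded as 1; the label vectors are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical true labels and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(confusion_counts(y_true, y_pred), metrics)
```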
7. Hyperparameter Tuning
Adjust model hyperparameters using validation data to optimize performance. Techniques like grid
search, random search, or Bayesian optimization can help find the best settings.
8. Model Deployment
Once the model achieves satisfactory performance on test data, it can be deployed to make predictions
on new, unseen instances. Continuous monitoring is important to ensure it remains accurate as data
distributions change over time.
9. Model Maintenance
Over time, retraining the model with updated data, handling drift, and monitoring metrics are
necessary to maintain the classifier’s effectiveness in real-world scenarios.
In summary, classification learning involves a systematic process from collecting and preparing data
to selecting a model, training, evaluating, and deploying it, ensuring the classifier performs reliably on
unseen data.
Classification algorithms organize and understand complex datasets in machine learning. These algorithms are
essential for categorizing data into classes or labels, automating decision-making and pattern identification.
Classification algorithms are often used to detect email spam by analyzing email content. These algorithms
enable machines to quickly recognize spam trends and make real-time judgments, improving email security.
Some of the top-ranked machine learning algorithms for Classification are:
1. Logistic Regression
2. Decision Tree
3. Random Forest
4. Support Vector Machine (SVM)
5. Naive Bayes
6. K-Nearest Neighbors (KNN)
2. Decision Tree
Decision Trees are versatile and simple techniques for classification and regression. They recursively split the dataset into subgroups based on key criteria,
producing a tree-like structure in which the test at each internal node routes an example toward a leaf node that gives the prediction. Decision trees are easy to understand and visualize, making them useful
for decision-making. Because overfitting may occur, pruning is used to improve generalization. A decision tree can also be viewed as a tree-like model of decisions and their consequences,
including chance event outcomes, resource costs and utility.
The algorithm is used for both classification and regression tasks. It models decisions and their possible results as a tree, with branches
representing choices and leaves representing outcomes.
Features of Decision Tree
1. Tree-Like Structure: Decision Trees have a flowchart-like structure, where each internal node represents a
"test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class
label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.
2. Simple to Understand and Interpret: One of the main advantages of Decision Trees is their simplicity and
ease of interpretation. They can be visualized, which makes it easy to understand how decisions are made and
explain the reasoning behind predictions.
3. Versatility: Decision Trees can handle both numerical and categorical data and can be used for both
regression and classification tasks, making them versatile across different types of data and problems.
4. Feature Importance: Decision Trees inherently perform feature selection, giving insights into the most
significant variables for making the predictions. The features tested near the top of a tree are typically the most important ones,
providing a straightforward way to identify critical variables.
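How a tree chooses the "test" at a node can be sketched as a search over features and thresholds. The notes above do not name a splitting criterion, so the use of Gini impurity here is an assumption (entropy-based information gain is another common choice), and the four training rows are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best test
    of the form 'row[feature] <= threshold'."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue   # a split must send examples to both branches
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (f, t, score)
    return best

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
rows = [(2.0, 7.0), (3.0, 1.0), (6.0, 6.0), (7.0, 2.0)]
labels = ["no", "no", "yes", "yes"]
print(best_split(rows, labels))
```

The search picks feature 0 for the root test because it yields pure branches, which illustrates why features near the top of a tree tend to be the most informative.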
3. Random Forest
Random Forest is an ensemble learning technique that combines multiple decision trees
to improve predictive accuracy and control overfitting. By aggregating the predictions of
numerous trees, Random Forests enhance the decision-making process, making them
robust against noise and bias.
Random Forest uses numerous decision trees to increase prediction accuracy and reduce
overfitting. It constructs many trees and integrates their predictions to create a reliable
model. Diversity is added by training each tree on a random sample of the data and a random subset of the features.
Random Forests perform well on high-dimensional data, provide feature importance metrics, and resist
overfitting. Many fields use them for classification and regression.
Features of Random Forest
1. Ensemble Method: Random Forest uses the ensemble learning technique, where multiple learners
(decision trees, in this case) are trained to solve the same problem and combined to get better
results. The ensemble approach improves the model's accuracy and robustness.
2. Handling Both Types of Data: It can handle both categorical and continuous input and output
variables, making it versatile for different types of data.
3. Reduction in Overfitting: By averaging multiple trees, Random Forest reduces the risk of
overfitting, making the model more generalizable than a single decision tree.
4. Handling Missing Values: Random Forest can handle missing values. When it encounters a
missing value in a variable, it can use the median for numerical variables or the mode for categorical
variables of all samples reaching the node where the missing value is encountered.
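The ensemble idea can be sketched with very small trees. In the sketch below each "tree" is a one-level decision stump trained on a bootstrap sample and a randomly chosen feature, and the forest predicts by majority vote. Using stumps instead of full trees, the number of trees, the fixed random seed, and the toy data are all simplifying assumptions.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature):
    """Fit a one-level tree on one feature: choose the threshold that
    misclassifies the fewest training rows."""
    majority = Counter(labels).most_common(1)[0][0]
    best = (len(labels) + 1, None, majority, majority)
    for t in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= t]
        right = [y for r, y in zip(rows, labels) if r[feature] > t]
        if not left or not right:
            continue
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = (sum(y != l_lab for y in left)
                  + sum(y != r_lab for y in right))
        if errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    _, t, l_lab, r_lab = best
    # With no usable threshold, fall back to the majority label.
    return lambda row: l_lab if t is None or row[feature] <= t else r_lab

def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # bootstrap sample
        feature = rng.randrange(len(rows[0]))            # random feature
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feature))
    return forest

def forest_predict(forest, row):
    """Majority vote over the predictions of all trees."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]

# Hypothetical two-feature data with two well-separated classes.
rows = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (8.0, 9.0), (9.0, 8.0), (8.5, 8.5)]
labels = ["a", "a", "a", "b", "b", "b"]
forest = train_forest(rows, labels)
print(forest_predict(forest, (0.5, 0.5)), forest_predict(forest, (9.5, 9.5)))
```

An individual stump trained on an unlucky bootstrap sample can be wrong, but the vote across many stumps is much more stable, which is the effect the ensemble is designed to achieve.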
glands. Many axons are insulated with myelin sheath, which speeds up signal transmission.
At the end of the axon, axon terminals or synaptic boutons form junctions with target cells to communicate
signals via neurotransmitters.
Function of a Neuron
Neurons communicate through electrical and chemical signaling:
Resting Potential: Neurons maintain a negative membrane potential (around -70 mV) due to ion gradients across
ry the signal to the next neuron or effector cell. The neurotransmitters bind to receptors on the postsynaptic cell,
generating a new response.
Types of Neurons
Neurons are classified based on structure and function:
Sensory Neurons (Afferent): Carry information from sensory receptors to the central nervous system (CNS).
Motor Neurons (Efferent): Transmit commands from the CNS to muscles or glands.
Interneurons: Connect neurons within the CNS to integrate and process information.
Bipolar: One dendrite and one axon (found in sensory organs like the retina).
Unipolar: Single process that branches into axon and dendrite (found in sensory ganglia).
Hidden Layers: Each neuron in the hidden layers performs some calculation on the input, passing
the result to the next layer. The data is transformed and abstracted at each layer.
Output Layer: After passing through all the layers, the network gives its final prediction like
classifying an image as a cat or a dog.
Training and Testing:
During training, the network is shown examples like images of cats and learns to recognize patterns
in them.
After training, the network is tested on new data to check its performance. For example, to teach the network
to recognize a cat, we show it thousands of images of cats. The network processes these images
and learns to identify the features that define a cat.
Once the network has been trained, we test it by providing new images to see if it can correctly
identify cats. The network’s prediction is then compared to the actual label (whether it's a cat or not).
If it makes an incorrect prediction, the network adjusts by fine-tuning the weights of the connections
between neurons using a process called backpropagation. This involves correcting the weights
based on the difference between the predicted and actual result.
This process repeats until the network can accurately recognize a cat in an image with minimal
error. Essentially, through constant training and feedback, the network becomes better at identifying
patterns and making predictions.
Common Activation Functions in ANNs
Activation functions are important in neural networks because they introduce non-linearity and help
the network to learn complex patterns. Let's see some common activation functions used in ANNs:
1. Sigmoid Function: Outputs values between 0 and 1. It is used in binary classification tasks like
deciding if an image is a cat or not.
2. ReLU (Rectified Linear Unit): A popular choice for hidden layers, it returns the input if positive and
zero otherwise. It helps to mitigate the vanishing gradient problem.
3. Tanh (Hyperbolic Tangent) : Similar to sigmoid but outputs values between -1 and 1. It is used in
hidden layers when a broader range of outputs is needed.
4. Softmax: Converts raw outputs into probabilities used in the final layer of a network for multi-class
classification tasks.
5. Leaky ReLU: A variant of ReLU that outputs a small negative value for negative inputs instead of zero, which helps in preventing
“dead neurons” during training.
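The five functions above can be written in a few lines of Python. The slope alpha = 0.01 for Leaky ReLU is a commonly used default and is an assumption here, as is subtracting the maximum inside softmax, which is a standard trick for numerical stability and does not change the result.

```python
import math

def sigmoid(x):
    """Maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Returns the input if positive, zero otherwise."""
    return max(0.0, x)

def tanh(x):
    """Maps any real number into (-1, 1)."""
    return math.tanh(x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs are scaled by alpha instead of zeroed."""
    return x if x > 0 else alpha * x

def softmax(xs):
    """Turns a list of raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0), relu(-2.0), tanh(0.0), leaky_relu(-2.0))
print(softmax([1.0, 2.0, 3.0]))
```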
1. A set of synapses, where the i-th synapse has weight wi. A signal xi forms the input to the i-th synapse having weight wi. The value of any weight may
be positive or negative. A positive weight has an excitatory effect, while a negative weight has an inhibitory effect on the output of
the summation junction.
2. A summation junction for the input signals, each weighted by its respective synaptic weight. Because it is a linear combiner or adder
of the weighted input signals, the output of the summation junction can be expressed as follows:
$$y_{sum} = \sum_{i=1}^{n} w_i x_i$$
3. A threshold activation function (or simply the activation function, also known as squashing function) results in an output signal
only when an input signal exceeding a specific threshold value comes as an input. It is similar in behaviour to the biological neuron
which transmits the signal only when the total input signal meets the firing threshold.
Simple McCulloch-Pitts neurons can be used to design logical operations. For that purpose, the
connection weights need to be chosen correctly along with the threshold function (rather than the
threshold value of the activation function). For a better understanding, consider an
example:
John carries an umbrella if it is sunny or if it is raining. There are four possible situations, and we need to decide
in which of them John will carry the umbrella. The situations are as follows:
First scenario: It is neither raining nor sunny
Second scenario: It is not raining, but it is sunny
Third scenario: It is raining, but it is not sunny
Fourth scenario: It is raining and it is sunny
To analyse the situations using the McCulloch-Pitts neural model, I can consider the input signals as
follows:
X1: Is it raining?
X2 : Is it sunny?
So, the value of each input can be either 0 or 1. We can set the weights of both X1 and
X2 to 1 and the threshold to 1. So, the neural network model will look like:
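The umbrella example can be checked with a few lines of Python. The sketch assumes the neuron fires (outputs 1) when the weighted sum is greater than or equal to the threshold, and uses weights of 1 and a threshold of 1 as described above.

```python
def mcp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts neuron: output 1 if the weighted sum of the inputs
    reaches the threshold, otherwise 0."""
    y_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if y_sum >= threshold else 0

weights = (1, 1)   # weights for X1 (raining) and X2 (sunny)
threshold = 1

results = {}
for raining in (0, 1):
    for sunny in (0, 1):
        results[(raining, sunny)] = mcp_neuron((raining, sunny), weights, threshold)
        print(f"raining={raining} sunny={sunny} -> umbrella={results[(raining, sunny)]}")
```

The neuron outputs 0 only when it is neither raining nor sunny, so with these settings it implements a logical OR of the two inputs.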
2. Rosenblatt's Perceptron
Rosenblatt's perceptron is built around the McCulloch-Pitts neural model. The diagrammatic representation is as
follows:
The perceptron receives a set of inputs x1, x2, ..., xn. The linear combiner or adder node computes the linear
combination of the inputs applied to the synapses, with synaptic weights w1, w2, ..., wn. Then, the hard
limiter checks whether the resulting sum is positive or negative. If the input of the hard limiter node is positive,
the output is +1, and if the input is negative, the output is -1. Mathematically the hard limiter input is:
Thus, we see that for a data set with linearly separable classes, perceptrons can always be
employed to solve classification problems using decision lines (for 2-dimensional space),
decision planes (for 3-dimensional space) or decision hyperplanes (for n-dimensional space).
Appropriate values of the synaptic weights can be obtained by training a perceptron.
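Training a perceptron on a linearly separable problem can be sketched as follows. The update rule shown (add eta × target × input to the weights whenever an example is misclassified) is the classic perceptron learning rule. The bias term, the learning rate eta = 1.0, the convention that a sum of exactly zero maps to -1, and the AND data set are assumptions made for this illustration.

```python
def hard_limiter(v):
    """Output +1 for a positive input, -1 otherwise."""
    return 1 if v > 0 else -1

def net_input(weights, bias, x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def train_perceptron(samples, eta=1.0, epochs=100):
    """Perceptron learning rule: adjust the weights only on mistakes."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in samples:
            if hard_limiter(net_input(weights, bias, x)) != target:
                weights = [w + eta * target * xi for w, xi in zip(weights, x)]
                bias += eta * target
                mistakes += 1
        if mistakes == 0:   # every example is classified correctly
            break
    return weights, bias

# Logical AND with targets coded as +1 / -1 (a linearly separable problem).
samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights, bias = train_perceptron(samples)
print(weights, bias)
```

After training, the learned weights and bias define a decision line that places (1, 1) on the positive side and the other three points on the negative side.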
Multi-layer perceptron: A basic perceptron works very successfully for data sets which
possess linearly separable patterns. However, in practical situations, linear separability is an
ideal that is rarely met. This was exactly the point made by Minsky and Papert in their work in 1969. They
showed that a basic perceptron is not able to learn to compute even a simple 2-bit XOR. So,
When outputs can be directed back as inputs to the same layer or preceding layer nodes, then it results in feedback networks. Recurrent networks are feedback networks with closed loops.
The above figure shows a single recurrent network having a single neuron with feedback to itself.
4. Single-layer recurrent network
The above network is a single-layer network with a feedback connection in which the processing element's output
can be directed back to itself or to another processing element or both. A recurrent neural network is a class of
artificial neural networks where connections between nodes form a directed graph along a sequence. This allows
it to exhibit dynamic temporal behavior for a time sequence. Unlike feedforward neural networks, RNNs can use
their internal state (memory) to process sequences of inputs.
5. Multilayer recurrent network
In this type of network, processing element output can be directed to the processing element in the same layer and
in the preceding layer forming a multilayer recurrent network. They perform the same task for every element of a
sequence, with the output being dependent on the previous computations. Inputs are not needed at each time step.
The main feature of a Recurrent Neural Network is its hidden state, which captures some information about a
sequence.
Learning Process in Artificial Neural Networks
By changing the weights of the links between neurons, neural networks may acquire knowledge. The network gets provided with a labeled
dataset throughout this procedure, referred to as training, and the weights are repeatedly updated depending on any mistakes or differences
between the network's assumptions and the true labels.
Forward Propagation: As the input moves forward through the network, each neuron computes the weighted sum of its
inputs. An activation function that introduces nonlinearity into the network is then applied to
these values. Commonly used activation functions include sigmoid, ReLU, and tanh.
Loss Function: A loss function measures the difference between the output of the network and the true labels. The kind of
problem being addressed determines which loss function to use. For instance, categorical cross-entropy is appropriate for
multi-class classification, whereas mean squared error (MSE) is frequently used for regression tasks.
Backpropagation: Backpropagation is the key to learning in neural networks. It applies the chain rule
of calculus to compute the gradients of the loss function with respect to the weights of the network. The gradients give the magnitude and
direction of the weight changes needed to reduce the loss.
Gradient Descent: Once the gradients are known, an optimisation procedure such as gradient descent is used to update the
weights. The goal of gradient descent is to reach a minimum of the loss curve by iteratively adjusting the weights in the direction
opposite to the gradients. Training is often made faster and more stable by using gradient descent variants such as stochastic gradient
descent (SGD) and the Adam optimizer.
Iterative Training: The forward propagation, loss computation, backpropagation, and weight update steps are repeated for a
given number of epochs or until convergence is reached. By lowering the loss with every iteration, the network gradually
improves its predictions.
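The five steps above can be sketched end to end on the 2-bit XOR problem. This is a minimal illustration under stated assumptions, not a reference implementation: the network size (2 inputs, 3 hidden neurons, 1 output), the sigmoid activation, the MSE loss, the learning rate, the epoch count, and all function names are choices made for this example. Whether the network ends up computing XOR exactly depends on the initial weights and these settings.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, params):
    """Forward propagation: weighted sums followed by a sigmoid activation."""
    w1, b1, w2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, output

def mse(data, params):
    """Loss function: mean squared error over the whole dataset."""
    return sum((forward(x, params)[1] - t) ** 2 for x, t in data) / len(data)

def train(data, n_hidden=3, lr=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = 0.0
    history = []
    for _ in range(epochs):  # iterative training
        params = (w1, b1, w2, b2)
        history.append(mse(data, params))
        g_w1 = [[0.0] * n_in for _ in range(n_hidden)]
        g_b1 = [0.0] * n_hidden
        g_w2 = [0.0] * n_hidden
        g_b2 = 0.0
        for x, t in data:
            hidden, out = forward(x, params)
            # Backpropagation (chain rule): gradient at the output pre-activation.
            d_out = 2.0 * (out - t) / len(data) * out * (1.0 - out)
            g_b2 += d_out
            for j, h in enumerate(hidden):
                g_w2[j] += d_out * h
                # Push the gradient back through hidden neuron j.
                d_hid = d_out * w2[j] * h * (1.0 - h)
                g_b1[j] += d_hid
                for i, xi in enumerate(x):
                    g_w1[j][i] += d_hid * xi
        # Gradient descent: step against the gradient.
        w2 = [w - lr * g for w, g in zip(w2, g_w2)]
        b2 -= lr * g_b2
        b1 = [b - lr * g for b, g in zip(b1, g_b1)]
        w1 = [[w - lr * g for w, g in zip(row, g_row)]
              for row, g_row in zip(w1, g_w1)]
    params = (w1, b1, w2, b2)
    history.append(mse(data, params))
    return params, history

xor_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
params, history = train(xor_data)
print("loss before training:", history[0])
print("loss after training: ", history[-1])
```

The hidden layer is what separates this from the basic perceptron: a single layer of weights cannot represent XOR, while a network with a nonlinear hidden layer can.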
It finds similarities in user behavior and preferences to recommend products, movies or music that align with their interests.
It can identify groups of customers with similar characteristics which allows businesses to target marketing campaigns and improve
customer service more effectively.
Advantages of Unsupervised learning
It doesn’t need labeled data so we can start working with large datasets more easily and quickly.
It handles large amounts of data and reduces them to simpler forms without losing important patterns, which makes the data manageable and
efficient to work with.
It discovers patterns and relationships in the data that were previously unknown which offers valuable insights.
By analyzing unlabeled data, it shows meaningful trends and groups that help us to understand our data deeply.
Disadvantages of Unsupervised learning
Lack of clear guidance can lead to less precise results for complex problems.
After grouping the data, we may need to check and label these groupings, which can be time-consuming.
Missing data, outliers or noise in the data can easily affect the quality of the results.
Clustering
Clustering is an unsupervised machine learning technique that groups similar data points together into clusters
based on their characteristics, without using any labeled data. The objective is to ensure that data points within the
same cluster are more similar to each other than to those in different clusters, enabling the discovery of natural
groupings and hidden patterns in complex datasets.
Goal: Discover the natural grouping or structure in unlabeled data without predefined categories.
How: Data points are assigned to clusters based on similarity or distance measures.
Similarity Measures: Can include Euclidean distance, cosine similarity or other metrics depending on data type
and clustering method.
Output: Each group is assigned a cluster ID, representing shared characteristics within the cluster.
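The two similarity measures named above, and the idea of assigning each point a cluster ID, can be sketched as follows. The sample points, the two cluster centers, and the function names are made up for illustration; a real clustering method would also have to find the centers, which this sketch takes as given.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_clusters(points, centers):
    """Give each point the ID of its nearest center (Euclidean distance)."""
    return [
        min(range(len(centers)), key=lambda k: euclidean_distance(p, centers[k]))
        for p in points
    ]

# Hypothetical data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers = [(1.0, 1.5), (8.5, 9.0)]

cluster_ids = assign_clusters(points, centers)
print(cluster_ids)  # [0, 0, 1, 1]
```

Euclidean distance compares positions, while cosine similarity compares directions, so the latter is often preferred when only the relative proportions of the features matter, as with word counts.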
Types of Clustering
Let's see the types of clustering.
1. Hard Clustering: In hard clustering, each data point belongs to exactly one cluster; no
overlap is allowed. This approach assigns a clear membership, making it easier to interpret and use for definitive
segmentation tasks.
Example: If customer data is clustered into 2 segments, each customer belongs fully to either Cluster 1 or
Cluster 2, without partial memberships.
Use cases: Market segmentation, customer grouping, document clustering.
Limitations: Cannot represent ambiguity or overlap between groups; boundaries are crisp.
2. Soft Clustering: Soft clustering assigns each data point a probability or degree of membership to multiple
clusters simultaneously, allowing data points to partially belong to several groups.
Example: A data point may have a 70% membership in Cluster 1 and 30% in Cluster 2, reflecting uncertainty or
overlap in group characteristics.
Use cases: Situations with overlapping class boundaries, fuzzy categories like customer personas or medical
diagnosis.
Benefits: Captures ambiguity in data, models gradual transitions between clusters.
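The difference between the two types can be sketched by computing degrees of membership for one point and then collapsing them into a single hard label. The membership rule used here is the one from fuzzy c-means (with a fuzziness parameter of 2), which is only one possible way to obtain soft memberships; the centers, the sample point, and the function names are assumptions for this example.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def soft_memberships(point, centers, fuzziness=2.0):
    """Degrees of membership that sum to 1 (fuzzy c-means rule, fuzziness > 1)."""
    dists = [distance(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (fuzziness - 1.0)
    inverse = [d ** -power for d in dists]  # closer centers get larger weight
    total = sum(inverse)
    return [v / total for v in inverse]

def hard_assignment(memberships):
    """Hard clustering keeps only the cluster with the highest membership."""
    return max(range(len(memberships)), key=lambda k: memberships[k])

centers = [(0.0, 0.0), (10.0, 0.0)]
point = (3.0, 0.0)

memberships = soft_memberships(point, centers)
print(memberships)                   # partial membership in both clusters
print(hard_assignment(memberships))  # single cluster ID
```

The soft result keeps the information that the point lies between the two groups, while the hard result discards it in exchange for a crisp label.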
Topic modeling
Topic modeling is a technique in natural language processing (NLP) and machine learning that aims to uncover
latent thematic structures within a collection of texts. It is a machine learning technique that
automatically discovers the main themes or "topics" that represent a large collection of documents. The
goal of topic modelling is to uncover the hidden semantic structures within text data, allowing
users to organize, understand, and summarize the information in a manner that is both efficient and insightful.
A 'topic' is defined as a recurring pattern of words that best represents a theme within the documents.
Topic models are algorithms that scan the document collection to discover these topics. They provide a
way to quantify the structure of topics within the text and how these topics are related to each other.
data, including documents, articles, and social media posts, which make up 80-90% of all new enterprise data.
It allows organizations to derive valuable insights from this enormous trove of unstructured data that would
otherwise be hard to process manually.
Improving Content Organization and Retrieval: By automatically identifying the main topics within a
corpus of text, topic modelling can be used to cluster and organize large document collections, making it
easier to search, navigate, and retrieve relevant information.
Enhancing Customer Experience and Personalization: Topic modelling can be applied to customer
feedback, reviews, and social media data to uncover the key topics and sentiments
that matter to customers. This information can then be used to improve products, services, and
personalised recommendations.
Accelerating Research and Discovery: In academic and scientific domains, topic modelling has
been used to analyze large bodies of literature, identify emerging research trends, and discover connections
between disparate fields, accelerating the pace of research and innovation.
Automating Repetitive Tasks: By automatically categorizing and organizing text data based on
topics, topic modelling can help automate many time-consuming and repetitive tasks, such as customer
service ticket tagging, document classification, and content summarization.
Enabling Trend Analysis and Monitoring: Topic modelling can be used to track changes in topic
distributions over time, allowing organizations to detect emerging trends, shifts in public opinion,
and other patterns that may be relevant for strategic decision-making.
documents, while LDA is a probabilistic model that assumes a generative process for the text data.
In general, LDA is considered more flexible and robust, as it can handle a much wider variety of textual
and "is."
Punctuation Removal : Removing punctuation marks and special characters from the text.
Lemmatization: Reducing words to their base or dictionary form, to improve the consistency of the
vocabulary.
Step 3: Creating Document-Term Matrix: After preprocessing the text, the next step is to
create a document-term matrix, which represents the frequency of every word in every document. This matrix
serves as the input to the topic modelling algorithms.
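A document-term matrix of this kind can be sketched with the standard library alone. The two sample documents, the small stop-word set, and the function names are illustrative assumptions. Lemmatization is left out because it needs a dictionary-based tool, so words such as "sat" stay as they are.

```python
import string
from collections import Counter

# Illustrative stop words only; real lists are much longer.
STOP_WORDS = frozenset({"the", "is", "and", "a"})

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, remove punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in cleaned.split() if token not in stop_words]

def document_term_matrix(docs):
    """Return the sorted vocabulary and one row of word counts per document."""
    tokenized = [preprocess(doc) for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

docs = ["The cat sat.", "The dog sat, and the dog barked!"]
vocab, dtm = document_term_matrix(docs)

print(vocab)  # ['barked', 'cat', 'dog', 'sat']
print(dtm)    # [[0, 1, 0, 1], [1, 0, 2, 1]]
```

Each row is a document and each column a vocabulary word, so the entry in row 2 under "dog" records that the word occurs twice in the second document.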
Step 4: Model Fitting: Once the data is prepared, the next step is to fit the topic modelling algorithm to the
data. This involves specifying the number of topics to be discovered and running the algorithm to obtain
the topic representations.
For LSA, this entails applying Singular Value Decomposition (SVD) to the document-term matrix to extract
systems can suggest articles, books, or media that match a user's interests.
Document Classification: It helps in automatically classifying documents into predefined categories based on
their content.
Summarization: Topic modeling can assist in summarizing large collections of documents by highlighting