Understanding Intelligent Agents in AI
Understanding Intelligent Agents in AI
PART-A (2 Marks)
1. Define Artificial intelligence.
"It is a branch of computer science by which we can create intelligent
machines whichcan behave like a human, think like humans, and able to
make decisions."
2. Define Rationality.
4. Define Agent.
In artificial intelligence, an agent is a computer program or system that is
designed to perceive its environment, make decisions and take actions to
achieve a specific goal or set of goals. The agent operates autonomously,
meaning it is not directly controlled by a human operator.
6. Define payoff?
Payoff function is used for modeling human behavior. Payoff function for a
player is a mapping from the cross-product of players strategy spaces to the
players set of payoffs.
The term “rational agent,” however, is not only applied to a system. It can also
refer to a person, a company, or an application, practically anything or anyone
that makes rational decisions.
10. What is meant by informed search algorithm.
The algorithms of an informed search contain information regarding the
goal state. It helps an AI make more efficient and accurate searches. A
function obtains this data/info to estimate the closeness of a state to its goal in
the system. For example, Graph Search and Greedy Search.
Cognitive Aspects: Involves consciousness, emotions, intuition, empathy, creativity, and complex
judgment.
Learning: Humans learn through a combination of education, experience, and exposure, with the
ability to generalize knowledge and learn from fewer examples.
Adaptability: Can adapt to new and unexpected situations by generalizing knowledge and using
common sense.
Source: Machine-based, created by humans and driven by data, algorithms, and programming.
Learning: Learns and improves performance through data and algorithms, requiring large amounts of
training.
Adaptability: Limited to the data it's trained on; while it can adapt to new data, it can't truly generalize
across vastly different domains without specific programming.
Strengths: Excels at high-speed data processing, performing repetitive tasks, and recognizing
patterns in vast datasets.
Bidirectional search algorithm runs two simultaneous searches, one form initial state
called as forward-search and other from goal node called as backward-search, to find
the goal node. Bidirectional search replaces one single search graph with two small
subgraphs in which one starts the search from an initial vertex and other starts from
goal vertex. The search stops when these two graphs intersect each other.
Bidirectional search can use search techniques such as BFS, DFS, DLS, etc.
Advantages:
o Bidirectional search is fast.
o Bidirectional search requires less memory
Disadvantages:
o Implementation of the bidirectional search tree is difficult.
o In bidirectional search, one should know the goal state in advance .
Example
In the below search tree, bidirectional search algorithm is applied. This algorithm
divides one graph/tree into two sub-graphs. It starts traversing from node 1 in the
forward direction and starts from goal node 16 in the backward direction.
Completeness:
Bidirectional Search is complete if we use BFS in both searches.
Time Complexity: Time complexity of bidirectional search using BFS is O(bd ).
Space Complexity: Space complexity of bidirectional search is O(bd ).
Optimal: Bidirectional search is Optimal.
Rational Agents could be physical agents like the one described above or it could also be a
program that operates in a non-physical environment like an operating system. Imagine a bot
web site operator designed to scan Internet news sources and show the interesting items to its
users, while selling advertising space to generate revenue.
As another example, consider an online tutoring system
Observable – Full or Partial? If the agents sensors get full access then they do not need to
pre- store any information. Partial may be due to inaccuracy of sensors or incomplete
information about an environment, like limited access to enemy territory
Number of Agents – For the vacuum cleaner, it works in a single agent environment but for
driver-less taxis, every driver-less taxi is a separate agent and hence multi agent
environment
Deterministic – The number of unknowns in the environment which affect the predictability
of the environment. For example, floor
space for cleaning is mostly
deterministic, the furniture is where it is
most of the time but taxi driving on a
road is non-deterministic.
Discrete – Does the agent respond
when needed or does it have to
continuously scan the environment.
Driver-less is continuous, online tutor is
discrete
Static – How often does the
environment change. Can the agent
learn about the environment and
always do the same thing?
Episodic – If the response to a certain
precept is not dependent on the previous one i.e. it is stateless (static methods in Java)
then it is discrete. If the decision taken now influences the future decisions then it is a
sequential environment.
An intelligent agent is an autonomous entity which act upon an environment using sensors and
actuators for achieving goals. An intelligent agent may learn from the environment to achieve
their goals. A thermostat is an example of an intelligent agent.
Rational Agent:
A rational agent is an agent which has clear preference, models uncertainty, and acts in a way to
maximize its performance measure with all possible actions.
A rational agent is said to perform the right things. AI is about creating rational agents to use for
game theory and decision theory for various real-world scenarios.
For an AI agent, the rational action is most important because in AI reinforcement learning
algorithm, for each best possible action, agent gets the positive reward and for each wrong
action, an agent gets a negative reward.
Rationality:
The rationality of an agent is measured by its performance measure. Rationality can be judged on
the basis of following points:
Note: Rationality differs from Omniscience because an Omniscient agent knows the actual
outcome of its action and act accordingly, which is not possible in reality.
Structure of an AI Agent
The task of AI is to design an agent program which implements the agent function. The structure
of an intelligent agent is a combination of architecture and agent program. It can be viewed as:
Following are the main three terms involved in the structure of an AI agent:
f:P* → A
PEAS Representation
PEAS is a type of model on which an AI agent works upon. When we define an AI agent or
rational agent, then we can group its properties under PEAS representation model. It is made up
of four words:
o P: Performance measure
o E: Environment
o A: Actuators
o S: Sensors
Here performance measure is the objective for the success of an agent's behavior.
2.
o Cleanness o Room o Wheels o Camera
Vacuum
Cleaner
o Efficiency o Table o Brushes o Dirt detection
o Battery life o Wood floor o Vacuum sensor
o Security o Carpet Extractor o Cliff sensor
o Various o Bump Sensor
obstacles o Infrared Wall
Sensor
Example:
In the below tree structure, we have shown the traversing of the tree using BFS algorithm from
the root node S to goal node K. BFS search algorithm traverse in layers, so it will follow the path
which is shown by the dotted arrow, and the traversed path will be:
S---> A--->B---->C--->D---->G--->H--->E---->F---->I ---- >K
Advantages:
Space Complexity: Space complexity of BFS algorithm is given by the Memory size of frontier
which is O(bd).
Completeness: BFS is complete, which means if the shallowest goal node is at some finite
depth, then BFS will find a solution.
Optimality: BFS is optimal if path cost is a non-decreasing function of the depth of the node.
o Depth-first search isa recursive algorithm for traversing a tree or graph data structure.
o It is called the depth-first search because it starts from the root node and follows
each path to its greatest depth node before moving to the next path.
o DFS uses a stack data structure for its implementation.
o The process of the DFS algorithm is similar to the BFS algorithm.
Note: Backtracking is an algorithm technique for finding all possible solutions using recursion.
Advantage:
o DFS requires very less memory as it only needs to store a stack of the nodes on the path
from root node to the current node.
o It takes less time to reach to the goal node than BFS algorithm (if it traverses in the right
path).
Disadvantage:
o There is the possibility that many states keep re-occurring, and there is no guarantee of
finding the solution.
o DFS algorithm goes for deep down searching and sometime it may go to the infinite loop.
Completeness: DFS search algorithm is complete within finite state space as it will expand
every node within a limited search tree.
Time Complexity: Time complexity of DFS will be equivalent to the node traversed by the
algorithm. It is given by:
Where, m= maximum depth of any node and this can be much larger than d (Shallowest
solution depth)
Space Complexity: DFS algorithm needs to store only single path from the root node, hence
space complexity of DFS is equivalent to the size of the fringe set, which is O(bm).
Optimal: DFS search algorithm is non-optimal, as it may generate a large number of steps or
high cost to reach to the goal node.
Example:
In the below search tree, we have shown the flow of depth-first search, and it will follow the
order as:
It will start searching from root node S, and traverse A, then B, then D and E, after traversing E,
it will backtrack the tree as E has no other successor and still goal node is not found. After
backtracking it will traverse node C and then G, and here it will terminate as it found goal node.
5. Explain detail about properties of agent Environments.
An environment is everything in the world which surrounds the agent, but it is not a part of an
agent itself. An environment can be described as a situation in which an agent is present.
The environment is where agent lives, operate and provide the agent with something to sense and
act upon it. An environment is mostly said to be non-feministic.
Features of Environment
Environment can have various features from the point of view of an agent:
o If an agent sensor can sense or access the complete state of an environment at each
point of time then it is a fully observable environment, else it is partially observable.
o A fully observable environment is easy as there is no need to maintain the internal state
to keep track history of the world.
o An agent with no sensors in all environments then such an environment is
called as unobservable.
2. Deterministic vs Stochastic:
o If an agent's current state and selected action can completely determine the next state
of the environment, then such environment is called a deterministic environment.
o A stochastic environment is random in nature and cannot be determined completely by
an agent.
o In a deterministic, fully observable environment, agent does not need to worry
about uncertainty.
3. Episodic vs Sequential:
4. Single-agent vs Multi-agent
o If only one agent is involved in an environment, and operating by itself then such
an environment is called single agent environment.
o However, if multiple agents are operating in an environment, then such an environment
is called a multi-agent environment.
o The agent design problems in the multi-agent environment are different from single
agent environment.
5. Static vs Dynamic:
o If the environment can change itself while an agent is deliberating then such
environment is called a dynamic environment else it is called a static environment.
o Static environments are easy to deal because an agent does not need to continue
looking at the world while deciding for an action.
o However for dynamic environment, agents need to keep looking at the world at
each action.
o Taxi driving is an example of a dynamic environment whereas Crossword puzzles are an
example of a static environment.
6. Discrete vs Continuous:
o If in an environment there are a finite number of percepts and actions that can be
performed within it, then such an environment is called a discrete environment else it is
called continuous environment.
o A chess gamecomes under discrete environment as there is a finite number of moves
that can be performed.
o A self-driving car is an example of a continuous environment.
7. Known vs Unknown
o If an agent can obtain complete and accurate information about the state's
environment, then such an environment is called an Accessible environment else it is
called inaccessible.
o An empty room whose state can be defined by its temperature is an example of an
accessible [Link] about an event on earth is an example of
Inaccessible environment.
[Link] details about any two uninformed search algorithm with an example
2. Depth-Limited Search Algorithm:
A depth-limited search algorithm is similar to depth-first search with a predetermined limit. Depth-
limited search can solve the drawback of the infinite path in the Depth-first search. In this algorithm,
the node at the depth limit will treat as it has no successor nodes further.
Depth-limited search can be terminated with two Conditions of failure:
o Standard failure value: It indicates that problem does not have any solution.
o Cutoff failure value: It defines no solution for the problem within a given depth limit.
Advantages:
Depth-limited search is Memory efficient.
Disadvantages:
Example:
Completeness: DLS search algorithm is complete if the solution is above the depth-limit.
Optimal: Depth-limited search can be viewed as a special case of DFS, and it is also not optimal
even if ℓ>d.
o Uniform cost search is optimal because at every state the path with the least cost
is chosen.
Disadvantages:
o It does not care about the number of steps involve in searching and only concerned
about path cost. Due to which this algorithm may be stuck in an infinite loop.
Example:
Completeness:
Uniform-cost search is complete, such as if there is a solution, UCS will find it.
Time Complexity:
Let C* is Cost of the optimal solution, and ε is each step to get closer to the goal node. Then
the number of steps is = C*/ε+1. Here we have taken +1, as we start from state 0 and end to
C*/ε.
Space Complexity:
The same logic is for space complexity so, the worst-case space complexity of Uniform-cost
search is O(b1 + [C*/ε]).
Optimal:
Uniform-cost search is always optimal as it only selects a path with the lowest path cost.
UNIT II PROBLEM SOLVING
Heuristic search strategies – Heuristic functions. Local search and optimization problems –
local search in continuous space – search with non-deterministic actions – search in partially
observable environments – online search agents and unknown environments.
PART-A (2 Marks)
1. What is heuristic functions?
Heuristics function: Heuristic is a function which is used in
Informed Search, andit finds the most promising path.
It takes the current state of the agent as its input and
produces the estimation ofhow close agent is from the goal.
The heuristic method, however, might not always give the
best solution, but itguaranteed to find a good solution in
reasonable time.
Heuristic function estimates how close a state is to the goal.
It is represented by h(n), and it calculates the cost of an optimal
path between thepair of states. The value of the heuristic function
is always positive.
Admissibility of the heuristic
function is given as:h(n) <=
h*(n)
Here h(n) is heuristic cost, and h*(n) is the estimated cost. Hence
heuristic costshould be less than or equal to the estimated cost.
Advantages:
A* search algorithm is the best algorithm than other search
algorithms.
A* search algorithm is optimal and complete.
This algorithm can solve very complex problems.
Disadvantages
:
It does not always produce the shortest path as it mostly
based on heuristics andapproximation.
A* search algorithm has some complexity issues.
The main drawback of A* is memory requirement as
it keeps all generatednodes in the memory, so it is not
practical for various large-scale problems.
5. Write a advantages and disadvantages of greedy search.
The main advantage of the greedy method is that it is relatively easy to
implement and understand. However, there are some disadvantages to
using this method. First, the greedy method is not guaranteed to find the
best solution. Second, it can be quite slow.
Greedy Best-First Search: This algorithm expands the node that appears
closest to the goal based only on the heuristic function,
h(n)h of n
ℎ(𝑛)
𝑓(𝑛)=𝑔(𝑛)+ℎ(𝑛)
, where
g(n)g of n
𝑔(𝑛)
h(n)h of n
ℎ(𝑛)
Greedy best-first search algorithm always selects the path which appears best at that moment. It
is the combination of depth-first search and breadth-first search algorithms. It uses the heuristic
function and search. Best-first search allows us to take the advantages of both algorithms. With
the help of best-first search, at each step, we can choose the most promising node. In the best
first search algorithm, we expand the node which is closest to the goal node and the closest cost
is estimated by heuristic function, i.e.
f(n)= g(n).
Disadvantages:
o It can behave as an unguided depth-first search in the worst case scenario.
o It can get stuck in a loop as DFS.
o This algorithm is not optimal.
Example:
Consider the below search problem, and we will traverse it using greedy best-first search. At
each iteration, each node is expanded using evaluation function f(n)=h(n) , which is given in the
below table.
In this search example, we are using two lists which are OPEN and CLOSED Lists.
Following are the iteration for traversing the above example.
Hence the final solution path will be: S----> B----->F------- > G
Time Complexity: The worst case time complexity of Greedy best first search is O(bm).
Space Complexity: The worst case space complexity of Greedy best first search is
O(bm). Where, m is the maximum depth of the search space.
Complete: Greedy best-first search is also incomplete, even if the given state space is finite.
expands less search tree and provides optimal result faster. A* algorithm is similar to UCS
except that it uses g(n)+h(n) instead of g(n).
In A* search algorithm, we use search heuristic as well as the cost to reach the node. Hence we
can combine both costs as following, and this sum is called as a fitness number.
Algorithm of A* search:
Step 2: Check if the OPEN list is empty or not, if the list is empty then return failure and stops.
Step 3: Select the node from the OPEN list which has the smallest value of evaluation function
(g+h), if node n is goal node then return success and stop, otherwise
Step 4: Expand node n and generate all of its successors, and put n into the closed list. For each
successor n', check whether n' is already in the OPEN or CLOSED list, if not then compute
evaluation function for n' and place into Open list.
Step 5: Else if node n' is already in OPEN and CLOSED, then it should be attached to the back
pointer which reflects the lowest g(n') value.
Advantages:
Disadvantages:
o It does not always produce the shortest path as it mostly based on heuristics
and approximation.
o A* search algorithm has some complexity issues.
o The main drawback of A* is memory requirement as it keeps all generated nodes in the
memory, so it is not practical for various large-scale problems.
Example:
In this example, we will traverse the given graph using the A* algorithm. The heuristic value of
all states is given in the below table so we will calculate the f(n) of each state using the formula
f(n)= g(n) + h(n), where g(n) is the cost to reach any node from start state.
Solution:
Initialization: {(S, 5)}
Iteration3: {(S--> A-->C--->G, 6), (S--> A-->C--->D, 11), (S--> A-->B, 7), (S-->G, 10)}
Iteration 4 will give the final result, as S--->A--->C--->G it provides the optimal path with cost 6.
Points to remember:
o A* algorithm returns the path which occurred first, and it does not search for
all remaining paths.
o The efficiency of A* algorithm depends on the quality of heuristic.
o A* algorithm expands all nodes which satisfy the condition f(n)<="" li="">
o Admissible: the first condition requires for optimality is that h(n) should be an
admissible heuristic for A* tree search. An admissible heuristic is optimistic in nature.
o Consistency: Second required condition is consistency for only A* graph-search.
If the heuristic function is admissible, then A* tree search will always find the least cost path.
Time Complexity: The time complexity of A* search algorithm depends on heuristic function,
and the number of nodes expanded is exponential to the depth of solution d. So the time
complexity is O(b^d), where b is the branching factor.
Space Complexity: Th
.3. Explain detail about Heuristic functions in detail.
Heuristic Functions in AI: As we have already seen that an informed search make use of
heuristic functions in order to reach the goal node in a more prominent way. Therefore, there
are several pathways in a search tree to reach the goal node from the current node. The
selection of a good heuristic function matters certainly. A good heuristic function is determined
by its efficiency. More is the information about the problem, more is the processing time.
Some toy problems, such as 8-puzzle, 8-queen, tic-tac-toe, etc., can be solved more efficiently
with the help of a heuristic function. Let’s see how
Consider the following 8-puzzle problem where we have a start state and a goal state. Our task is
to slide the tiles of the current/start state and place it in an order followed in the goal state. There
can be four moves either left, right, up, or down. There can be several ways to convert the
current/start state to the goal state, but, we can use a heuristic function h(n) to solve the problem
more efficiently.
It is a technique for optimizing the mathematical problems. Hill Climbing is widely used when a
good heuristic is available.
It is also called greedy local search as it only looks to its good immediate neighbor state and not
beyond that. The steps of a simple hill-climbing algorithm are listed below:
Step 1: Evaluate the initial state. If it is the goal state, then return success and Stop.
Step 2: Loop Until a solution is found or there is no new operator left to apply.
Else if it is better than the current state, then assign a new state as a current state.
Else if not better than the current state, then return to step2.
Step 5: Exit.
5. Explain detail about Steepest-Ascent hill climbing and stochastic hill climbing algorithm.
Hill Climbing Algorithm: Hill climbing search is a local search problem. The purpose of the hill
climbing search is to climb a hill and reach the topmost peak/ point of that hill. It is based on
the heuristic search technique where the person who is climbing up on the hill estimates the
direction which will lead him to the highest peak.
State-space Landscape of Hill climbing algorithm
To understand the concept of hill climbing algorithm, consider the below landscape representing
the goal state/peak and the current state of the climber. The topographical regions shown in the
figure can be defined as:
Global Maximum: It is the highest point on the hill, which is the goal state.
Local Maximum: It is the peak higher than all other peaks but lower than the global
maximum.
Flat local maximum: It is the flat area over the hill where it has no uphill or downhill. It
is a saturated point of the hill.
Shoulder: It is also a flat area where the summit is possible.
Current state: It is the current position of the person.
Local Maxima: It is that peak of the mountain which is highest than all its neighboring
states but lower than the global maxima. It is not the goal peak because there is another peak higher than
it.
Plateau: It is a flat surface area where no uphill exists. It becomes difficult for the
climber to decide that in which direction he should move to reach the goal point.
Sometimes, the person gets lost in the flat area.
Ridges: It is a challenging problem where the person finds two or more local maxima of
the same height commonly. It becomes difficult for the person to navigate the right
point and stuck to that point itself.
6. Explain detail about Online search Agent with an example.
2. Online Search Agents and Unknown Environments:
7. Write brief about following local search algorithms: (i)Simulated annealing
(ii) Local beam search
Which leads to a solution state required to reach the goal node. But beyond these “classical
search algorithms," we have some “local search algorithms” where the path cost does not
matters, and only focus on solution-state needed to reach the goal node.
A local search algorithm completes its task by traversing on a single current node rather than
multiple paths and following the neighbors of that node generally.
Although local search algorithms are not systematic, still they have the
following two advantages:
Local search algorithms use a very little or constant amount of memory as they
operate only on a single path.
Most often, they find a reasonable solution in large or infinite state spaces where
the classical or systematic algorithms do not work.
Does the local search algorithm work for a pure optimized problem?
Yes, the local search algorithm works for pure optimized problems. A pure optimization problem
is one where all the nodes can give a solution. But the target is to find the best state out of all
according to the objective function. Unfortunately, the pure optimization problem fails to find
high-quality solutions to reach the goal state from the current state.
Note: An objective function is a function whose value is either minimized or maximized in
different contexts of the optimization problems. In the case of search algorithms, an objective
function can be the path cost for reaching the goal node, etc.
Working of a Local search algorithm
Let's understand the working of a local search algorithm with the help of an example:
Consider the below state-space landscape having both:
The local search algorithm explores the above landscape by finding the following two points:
Global Minimum: If the elevation corresponds to the cost, then the task is to find
the lowest valley, which is known as Global Minimum.
Global Maxima: If the elevation corresponds to an objective function, then it finds the
highest peak which is called as Global Maxima. It is the highest point in the valley.
Hill-climbing Search
Simulated Annealing
Local Beam Search
Hill Climbing Algorithm: Hill climbing search is a local search problem. The purpose of the hill
climbing search is to climb a hill and reach the topmost peak/ point of that hill. It is based on
the heuristic search technique where the person who is climbing up on the hill estimates the
direction which will lead him to the highest peak.
State-space Landscape of Hill climbing algorithm
To understand the concept of hill climbing algorithm, consider the below landscape representing
the goal state/peak and the current state of the climber. The topographical regions shown in the
figure can be defined as:
Global Maximum: It is the highest point on the hill, which is the goal state.
Local Maximum: It is the peak higher than all other peaks but lower than the global
maximum.
Flat local maximum: It is the flat area over the hill where it has no uphill or downhill. It
is a saturated point of the hill.
Shoulder: It is also a flat area where the summit is possible.
Current state: It is the current position of the person.
Types of Hill climbing search algorithm
There are following types of hill-climbing search:
Simple hill climbing search
Simple hill climbing is the simplest technique to climb a hill. The task is to reach the highest
peak of the mountain. Here, the movement of the climber depends on his move/steps. If he finds
his next step better than the previous one, he continues to move else remain in the same state.
This search focus only on his previous and next step.
Simple hill climbing Algorithm
Local Maxima: It is that peak of the mountain which is highest than all its neighboring
states but lower than the global maxima. It is not the goal peak because there is another peak higher than
it.
Plateau: It is a flat surface area where no uphill exists. It becomes difficult for the
climber to decide that in which direction he should move to reach the goal point.
Sometimes, the person gets lost in the flat area.
Ridges: It is a challenging problem where the person finds two or more local maxima of
the same height commonly. It becomes difficult for the person to navigate the right
point and stuck to that point itself.
Simulated Annealing
Simulated annealing is similar to the hill climbing algorithm. It works on the current situation. It
picks a random move instead of picking the best move. If the move leads to the improvement
of the current situation, it is always accepted as a step towards the solution state, else it accepts
the move having a probability less than 1. This search technique was first used in 1980 to
solve VLSI layout problems. It is also applied for factory scheduling and other large
optimization tasks.
Local Beam Search
Local beam search is quite different from random-restart search. It keeps track of k states instead
of just one. It selects k randomly generated states, and expand them at each step. If any state is a
goal state, the search stops with success. Else it selects the best k successors from the complete
list and repeats the same process. In random-restart search where each search process runs
independently, but in local beam search, the necessary information is shared between the parallel
search processes.
Disadvantages of Local Beam search
This search can suffer from a lack of diversity among the k states.
It is an expensive version of hill climbing search.
UNIT III GAME PLAYING AND CSP
Game theory – optimal decisions in games – Alpha-beta search – Monte-carlo tree search –
Stochastic games – partially observable games. Constraint satisfaction problems – constraint
propagation backtracking search for CSP – local search for CSP – structure of CSP.
PART-A (2 Marks)
1. Define game theory.
Game : Any set of circumstances that has a result dependent on
the actions of two or moredecision-makers (players)
We first consider games with two players, whom we call MAX and
MIN for reasons thatwill soon become obvious.
MAX moves first, and then they take turns moving until the game is over.
At the end of the game, points are awarded to the winning player and
penalties are givento the loser.
2. List out the types of games in AI.
Perfect information: A game with the perfect information is that in
which agents can look into the complete board. Agents have all the
information about the game, and they can see each other moves also.
Examples are Chess, Checkers, Go, etc.
1. As the tree growth becomes rapid after a few iterations, it requires a huge
amount ofmemory.
2. There is a bit of a reliability issue with Monte Carlo Tree Search. In
certain scenarios,there might be a single branch or path, that might
lead to loss against the opposition
when implemented for those turn-based games. This is mainly due to the vast
amount ofcombinations and each of the nodes might not be visited enough
number of times to understand its result or outcome in the long run.
3. MCTS algorithm needs a huge number of iterations to be able to effectively
decide the most efficient path. So, there is a bit of a speed issue there.
Dice are rolled at the beginning of a player’s turn to determine the legal moves.
Card games provide many examples of stochastic partial observability, where the
missing information is generated randomly. For example, in many games, cards
are dealt randomly at the
beginning of the game, with each player receiving a hand that is not
visible to the other players. Such games include bridge, whist, hearts, and
some forms of poker.
At first sight, it might seem that these card games are just like dice games: the
cards are dealt randomly and determine the moves available to each player,
but all the “dice” are rolled at the beginning! Even though this analogy turns
out to be incorrect, it suggests an effective algorithm: consider all possible
deals of the invisible cards; solve each one as if it were a fully observable
game; and then choose the move that has the best outcome averaged over all
the deals. Suppose that each deal s occurs with probability P(s); then the
move we want is
In AI, a decomposition tree can refer to two main concepts: a visual analysis tool that helps
users explore data hierarchically and perform root cause analysis, particularly in Power BI, and a graph theory
technique where a graph is mapped to a tree structure to simplify complex problems. The visual tool
uses artificial intelligence to automatically suggest dimensions to drill into, while the graph theory
technique is applied to algorithms for problems like NP-hard problems and constraint satisfaction
PART-B (15 Marks)
1. Explain detail about game theory and its applications.
1. Game Theory:
Game theory is basically a branch of mathematics that is used to typical strategic interaction
between different players (agents), all of which are equally rational, in a context with
predefined rules (of playing or maneuvering) and outcomes. Every player or agent is a rational
entity who is selfish and tries to maximize the reward to be obtained using a particular
strategy. All the players abide by certain rules in order to receive a predefined playoff- a
reward after a certain outcome. Hence, a GAME can be defined as a set of players, actions,
strategies, and a final playoff for which all the players are competing.
Game Theory has now become a describing factor for both Machine Learning algorithms and
many daily life situations.
Consider the SVM (Support Vector Machine) for instance. According to Game Theory, the
SVM is a game between 2 players where one player challenges the other to find the best hyper-
plane after providing the most difficult points for classification. The final playoff of this game
is a solution that will be a trade-off between the strategic abilities of both players competing.
Nash equilibrium:
Nash equilibrium can be considered the essence of Game Theory. It is basically a state, a point
of equilibrium of collaboration of multiple players in a game. Nash Equilibrium guarantees
maximum profit to each player.
Let us try to understand this with the help of Generative Adversarial Networks (GANs).
What is GAN?
It is a combination of two neural networks: the Discriminator and the Generator. The
Generator Neural Network is fed input images which it analyzes and then produces new
sample images, which are made to represent the actual input images as close as possible. Once
the images have been produced, they are sent to the Discriminator Neural Network. This neural
network judges the images sent to it and classifies them as generated images and actual input
images. If the image is classified as the original image, the DNN changes its parameters of
judging. If the image is classified as a generated image, the image is rejected and returned to
the GNN. The GNN then alters its parameters in order to improve the quality of the image
produced.
This is a competitive process which goes on until both neural networks do not require to make
any changes in their parameters and there can be no further improvement in both neural
networks. This state of no further improvement is known as NASH EQUILIBRIUM. In other
words, GAN is a 2-player competitive game where both players are continuously optimizing
themselves to find a Nash Equilibrium.
Now that we are aware of the basics of Game Theory, let us try to understand how Nash
Equilibrium is attained in a simultaneous game. There are many examples but the most famous
is the Prisoner’s Dilemma. There are some more examples such as the Closed-bag exchange
Game, the Friend or For Game, and the iterated Snowdrift Game.
In all these games, two players are involved and the final playoff is a result of a decision that
has to be made by both players. Both players have to make a choice between defection and co-
operation. If both players cooperate, the final playoff will turn out to be positive for both.
However, if both defect, the final playoff will be negative for both players. If there is a
combination of one player defecting and the other co-operating, the final playoff will be
positive for one and negative for another.
Here, Nash Equilibrium plays an important role. Only if both players jot out a strategy that
benefits each other and provide both with a positive playoff, the solution to this problem will
be optimal.
There are many more real examples and a number of pieces of code that try to solve this
dilemma. The basic essence, however, is the attainment of the Nash Equilibrium in an
uncomfortable situation.
Types of Games:
Currently, there are about 5 types of classification of games. They are as follows:
1. Zero-Sum and Non-Zero Sum Games: In non-zero-sum games, there are multiple players
and all of them have the option to gain a benefit due to any move by another player. In zero-
sum games, however, if one player earns something, the other players are bound to lose
a key playoff.
2. Simultaneous and Sequential Games: Sequential games are the more popular games where
every player is aware of the movement of another player. Simultaneous games are more
difficult as in them, the players are involved in a concurrent game. BOARD GAMES are the
perfect example of sequential games and are also referred to as turn-based or extensive-
form games.
Humans’ intellectual capacities have been engaged by games for as long as civilization has
existed, sometimes to an alarming degree. Games are an intriguing subject for AI researchers
because of their abstract character. A game’s state is simple to depict, and actors are usually
limited to a small number of actions with predetermined results. Physical games, such as
croquet and ice hockey, contain significantly more intricate descriptions, a much wider variety
of possible actions, and rather ambiguous regulations defining the legality of activities. With
the exception of robot soccer, these physical games have not piqued the AI community’s
interest.
Games are usually intriguing because they are difficult to solve. Chess, for example, has an
average branching factor of around 35, and games frequently stretch to 50 moves per player,
therefore the search tree has roughly 35100 or 10154 nodes (despite the search graph having
“only” about 1040 unique nodes). As a result, games, like the real world, necessitate the ability
to make some sort of decision even when calculating the best option is impossible.
Inefficiency is also heavily punished in games. Whereas a half-efficient implementation of A
search will merely take twice as long to complete, a chess software that is half as efficient in
utilizing its available time will almost certainly be beaten to death, all other factors being
equal. As a result of this research, a number of intriguing suggestions for making the most use
of time have emerged.
Let us start with games with two players, whom we’ll refer to as MAX and MIN for obvious
reasons. MAX is the first to move, and then they take turns until the game is finished. At the
conclusion of the game, the victorious player receives points, while the loser receives
penalties. A game can be formalized as a type of search problem that has the following
elements:
S0: The initial state of the game, which describes how it is set up at the start.
Player (s): Defines which player in a state has the move.
Actions (s): Returns a state’s set of legal moves.
Result (s, a): A transition model that defines a move’s outcome.
Terminal-Test (s): A terminal test that returns true if the game is over but false otherwise.
Terminal states are those in which the game has come to a conclusion.
Utility (s, p): A utility function (also known as a payout function or objective function )
determines the final numeric value for a game that concludes in the terminal state s for
player p. The result in chess is a win, a loss, or a draw, with values of +1, 0, or 1/2.
Backgammon’s payoffs range from 0 to +192, but certain games have a greater range of
possible outcomes. A zero-sum game is defined (confusingly) as one in which the total
reward to all players is the same for each game instance. Chess is a zero-sum game because
each game has a payoff of 0 + 1, 1 + 0, or 1/2 + 1/2. “Constant-sum” would have been a
preferable name, 22 but zero-sum is the usual term and makes sense if each participant is
charged 1.
The game tree for the game is defined by the beginning state, ACTIONS function, and
RESULT function—a tree in which the nodes are game states and the edges represent
movements. The figure below depicts a portion of the tic-tac-toe game tree (noughts and
crosses). MAX may make nine different maneuvers from his starting position. The game
alternates between MAXs setting an X and MINs placing an O until we reach leaf nodes
corresponding to terminal states, such as one player having three in a row or all of the
squares being filled. The utility value of the terminal state from the perspective of MAX is
shown by the number on each leaf node; high values are thought to be beneficial for MAX
and bad for MIN
The game tree for tic-tac-toe is relatively short, with just 9! = 362,880 terminal nodes.
However, because there are over 1040 nodes in chess, the game tree is better viewed as a
theoretical construct that cannot be realized in the actual world. But, no matter how big the
game tree is, MAX’s goal is to find a solid move. A tree that is superimposed on the whole
game tree and examines enough nodes to allow a player to identify what move to make is
referred to as a search tree.
A sequence of actions leading to a goal state—a terminal state that is a win—would be the best
solution in a typical search problem. MIN has something to say about it in an adversarial
search. MAX must therefore devise a contingent strategy that specifies M A X’s initial state
move, then MAX’s movements in the states resulting from every conceivable MIN response,
then MAX’s moves in the states resulting from every possible MIN reaction to those moves,
and so on. This is quite similar to the AND-OR search method, with MAX acting as OR and
MIN acting as AND. When playing an infallible opponent, an optimal strategy produces results
that are as least as excellent as any other plan. We’ll start by demonstrating how to find the
best plan.
We’ll move to the trivial game in the figure below since even a simple game like tic-tac-toe is
too complex for us to draw the full game tree on one page. MAX’s root node moves are
designated by the letters a1, a2, and a3. MIN’s probable answers to a1 are b1, b2, b3, and so
on. This game is over after MAX and MIN each make one move. (In game terms, this tree
consists of two half-moves and is one move deep, each of which is referred to as a ply.) The
terminal states in this game have utility values ranging from 2 to 14.
The optimal strategy can be found from the minimax value of each node, which we express as
MINIMAX, given a game tree (n). Assuming that both players play optimally from there
through the finish of the game, the utility (for MAX) of being in the corresponding state is the
node’s minimax value. The usefulness of a terminal state is obviously its minimax value.
Furthermore, if given the option, MAX prefers to shift to a maximum value state, whereas
MIN wants to move to a minimum value state. So here’s what we’ve got:
Let’s use these definitions to analyze the game tree shown in the figure above. The game’s
UTILITY function provides utility values to the terminal nodes on the bottom level. Because
the first MIN node, B, has three successor states with values of 3, 12, and 8, its minimax value
is 3. Minimax value 2 is also used by the other two MIN nodes. The root node is a MAX node,
with minimax values of 3, 2, and 2, resulting in a minimax value of 3. We can also find the
root of the minimax decision: action a1 is the best option for MAX since it leads to the highest
minimax value.
This concept of optimal MAX play requires that MIN plays optimally as well—it maximizes
MAX’s worst-case outcome. What happens if MIN isn’t performing at its best? Then it’s a
simple matter of demonstrating that MAX can perform even better. Other strategies may
outperform the minimax method against suboptimal opponents, but they will always
outperform optimal opponents.
α>=β
Let's take an example of two-player search tree to understand the working of Alpha-beta
pruning
Step 1: At the first step the, Max player will start first move from node A where α= -∞ and
β= +∞, these value of alpha and beta passed down to node B where again α= -∞ and β= +∞,
and Node B passes the same value to its child D.
Step 2: At Node D, the value of α will be calculated as its turn for Max. The value of α is
compared with firstly 2 and then 3, and the max (2, 3) = 3 will be the value of α at node D
and node value will also 3.
Step 3: Now algorithm backtrack to node B, where the value of β will change as this is a
turn of Min, Now β= +∞, will compare with the available subsequent nodes value, i.e. min
(∞, 3) = 3, hence at node B now α= -∞, and β= 3.
In the next step, algorithm traverse the next successor of Node B which is node E, and the
values of α= -∞, and β= 3 will also be passed.
Step 4: At node E, Max will take its turn, and the value of alpha will change. The current
value of alpha will be compared with 5, so max (-∞, 5) = 5, hence at node E α= 5 and β= 3,
where α>=β, so the right successor of E will be pruned, and algorithm will not traverse it,
and the value at node E will be 5.
Step 5: At next step, algorithm again backtrack the tree, from node B to node A. At node
A, the value of alpha will be changed the maximum available value is 3 as max (-∞, 3)= 3,
and β= +∞, these two values now passes to right successor of A which is Node C.
At node C, α=3 and β= +∞, and the same values will be passed on to node F.
Step 6: At node F, again the value of α will be compared with left child which is 0, and
max(3,0)= 3, and then compared with right child which is 1, and max(3,1)= 3 still α
remains 3, but the node value of F will become 1.
Step 7: Node F returns the node value 1 to node C, at C α= 3 and β= +∞, here the value of
beta will be changed, it will compare with 1 so min (∞, 1) = 1. Now at C, α=3 and β= 1,
and again it satisfies the condition α>=β, so the next child of C which is G will be pruned,
and the algorithm will not compute the entire sub-tree G.
Step 8: C now returns the value of 1 to A here the best value for A is max (3, 1) = 3.
Following is the final game tree which is the showing the nodes which are computed and
nodes which has never computed. Hence the optimal value for the maximizer is 3 for this
example.
The effectiveness of alpha-beta pruning is highly dependent on the order in which each
node is examined. Move order is an important aspect of alpha-beta pruning.
Worst ordering: In some cases, alpha-beta pruning algorithm does not prune any of the
leaves of the tree, and works exactly as minimax algorithm. In this case, it also consumes
more time because of alpha-beta factors, such a move of pruning is called worst ordering.
In this case, the best move occurs on the right side of the tree. The time complexity for
such an order is O(bm).
Ideal ordering: The ideal ordering for alpha-beta pruning occurs when lots of pruning
happens in the tree, and best moves occur at the left side of the tree. We apply DFS hence it
first search left of the tree and go deep twice as minimax algorithm in the same amount of
time. Complexity in ideal ordering is O(bm/2).
Monte Carlo Tree Search (MCTS) is a search technique in the field of Artificial Intelligence
(AI). It is a probabilistic and heuristic driven search algorithm that combines the classic tree
search implementations alongside machine learning principles of reinforcement learning.
In tree search, there’s always the possibility that the current best action is actually not the most
optimal action. In such cases, MCTS algorithm becomes useful as it continues to evaluate other
alternatives periodically during the learning phase by executing them, instead of the current
perceived optimal strategy. This is known as the ” exploration-exploitation trade-off “. It
exploits the actions and strategies that is found to be the best till now but also must continue to
explore the local space of alternative decisions and find out if they could replace the current
best.
Exploration helps in exploring and discovering the unexplored parts of the tree, which could
result in finding a more optimal path. In other words, we can say that exploration expands the
tree’s breadth more than its depth. Exploration can be useful to ensure that MCTS is not
overlooking any potentially better paths. But it quickly becomes inefficient in situations with
large number of steps or repetitions. In order to avoid that, it is balanced out by exploitation.
Exploitation sticks to a single path that has the greatest estimated value. This is a greedy
approach and this will extend the tree’s depth more than its breadth. In simple words, UCB
formula applied to trees helps to balance the exploration-exploitation trade-off by periodically
exploring relatively unexplored nodes of the tree and discovering potentially more optimal
paths than the one it is currently exploiting.
For this characteristic, MCTS becomes particularly useful in making optimal decisions in
Artificial Intelligence (AI) problems.
Selection: In this process, the MCTS algorithm traverses the current tree from the root node
using a specific strategy. The strategy uses an evaluation function to optimally select nodes
with the highest estimated value. MCTS uses the Upper Confidence Bound (UCB) formula
applied to trees as the strategy in the selection process to traverse the tree. It balances the
exploration-exploitation trade-off. During tree traversal, a node is selected based on some
parameters that return the maximum value. The parameters are characterized by the
formula that is typically used for this purpose is given below.
where;
Si = value of a node i
xi = empirical mean of a node i
C = a constant
t = total number of simulations
When traversing a tree during the selection process, the child node that returns the greatest
value from the above equation will be one that will get selected. During traversal, once a
child node is found which is also a leaf node, the MCTS jumps into the expansion step.
Expansion: In this process, a new child node is added to the tree to that node which was
optimally reached during the selection process.
Simulation: In this process, a simulation is performed by choosing moves or strategies until
a result or predefined state is achieved.
Backpropagation: After determining the value of the newly added node, the remaining tree
must be updated. So, the backpropagation process is performed, where it backpropagates
from the new node to the root node. During the process, the number of simulation stored in
each node is incremented. Also, if the new node’s simulation results in a win, then the
number of wins is also incremented.
The above steps can be visually understood by the diagram given below:
These types of algorithms are particularly useful in turn based games where there is no element
of chance in the game mechanics, such as Tic Tac Toe, Connect 4, Checkers, Chess, Go, etc.
This has recently been used by Artificial Intelligence Programs like AlphaGo, to play against
the world’s top Go players. But, its application is not limited to games only. It can be used in
any situation which is described by state-action pairs and simulations used to forecast
outcomes.
As we can see, the MCTS algorithm reduces to a very few set of functions which we can use
any choice of games or in any optimizing strategy.
1. As the tree growth becomes rapid after a few iterations, it requires a huge amount
of memory.
2. There is a bit of a reliability issue with Monte Carlo Tree Search. In certain scenarios, there
might be a single branch or path, that might lead to loss against the opposition when
implemented for those turn-based games. This is mainly due to the vast amount of
combinations and each of the nodes might not be visited enough number of times to
understand its result or outcome in the long run.
3. MCTS algorithm needs a huge number of iterations to be able to effectively decide the most
efficient path. So, there is a bit of a speed issue there.
A state-space
The notion of the solution.
Discrete Domain: It is an infinite domain which can have one state for multiple
variables. For example, a start state can be allocated infinite times for each variable.
Finite Domain: It is a finite domain which can have continuous states describing one
domain for one specific variable. It is also called a continuous domain.
Unary Constraints: It is the simplest type of constraints that restricts the value of a
single variable.
Binary Constraints: It is the constraint type which relates two variables. A value x2 will
contain a value which lies between x1 and x3.
Global Constraints: It is the constraint type which involves an arbitrary number of
variables.
Some special types of solution algorithms are used to solve the following types of
constraints:
Linear Constraints: These type of constraints are commonly used in linear
programming where each variable containing an integer value exists in linear form
only.
Non-linear Constraints: These type of constraints are used in non-linear programming
where each variable (an integer value) exists in a non-linear form.
Constraint Propagation
In local state-spaces, the choice is only one, i.e., to search for a solution. But in CSP, we have
two choices either:
Node Consistency: A single variable is said to be node consistent if all the values in the
Arc Consistency: A variable is arc consistent if every value in its domain satisfies the
binary constraints of the variables.
Path Consistency: When the evaluation of a set of two variable with respect to a third
variable can be extended over another variable, satisfying all the binary constraints. It is
similar to arc consistency.
k-consistency: This type of consistency is used to define the notion of stronger forms of propagation.
Here, we examine the k-consistency of the variables
Constraint Propagation
In local state-spaces, the choice is only one, i.e., to search for a solution. But in CSP, we have
two choices either:
Node Consistency: A single variable is said to be node consistent if all the values in the
Arc Consistency: A variable is arc consistent if every value in its domain satisfies the
binary constraints of the variables.
Path Consistency: When the evaluation of a set of two variable with respect to a third
variable can be extended over another variable, satisfying all the binary constraints. It is
similar to arc consistency.
k-consistency: This type of consistency is used to define the notion of stronger forms of
propagation. Here, we examine the k-consistency of the variables.
CSP Problems
Constraint satisfaction includes those problems which contains some constraints while solving
the problem. CSP includes the following problems:
Graph Coloring: The problem where the constraint is that no adjacent sides can have
the same color.
Sudoku Playing: The gameplay where the constraint is that no number from 0-9 can be
repeated in the same row or column.
n-queen problem: In n-queen problem, the constraint is that no queen should be
placed either diagonally, in the same row or column.
Cryptarithmetic Problem: This problem has one most important constraint that is, we
cannot assign a different digit to the same character. All digits should contain a unique
alphabet.
[Link] detail about Monte Carlo Tree search algorithm
Monte Carlo Tree Search (MCTS) is a search technique in the field of Artificial Intelligence
(AI). It is a probabilistic and heuristic driven search algorithm that combines the classic tree
search implementations alongside machine learning principles of reinforcement learning.
In tree search, there’s always the possibility that the current best action is actually not the most
optimal action. In such cases, MCTS algorithm becomes useful as it continues to evaluate other
alternatives periodically during the learning phase by executing them, instead of the current
perceived optimal strategy. This is known as the ” exploration-exploitation trade-off “. It
exploits the actions and strategies that is found to be the best till now but also must continue to
explore the local space of alternative decisions and find out if they could replace the current
best.
Exploration helps in exploring and discovering the unexplored parts of the tree, which could
result in finding a more optimal path. In other words, we can say that exploration expands the
tree’s breadth more than its depth. Exploration can be useful to ensure that MCTS is not
overlooking any potentially better paths. But it quickly becomes inefficient in situations with
large number of steps or repetitions. In order to avoid that, it is balanced out by exploitation.
Exploitation sticks to a single path that has the greatest estimated value. This is a greedy
approach and this will extend the tree’s depth more than its breadth. In simple words, UCB
formula applied to trees helps to balance the exploration-exploitation trade-off by periodically
exploring relatively unexplored nodes of the tree and discovering potentially more optimal
paths than the one it is currently exploiting.
For this characteristic, MCTS becomes particularly useful in making optimal decisions in
Artificial Intelligence (AI) problems.
Selection: In this process, the MCTS algorithm traverses the current tree from the
root node using a specific strategy. The strategy uses an evaluation function to
optimally select nodes with the highest estimated value. MCTS uses the Upper
Confidence Bound (UCB) formula applied to trees as the strategy in the selection
process to traverse the tree. It balances the exploration-exploitation trade-off.
During tree traversal, a node is selected based on some parameters that return the
maximum value. The parameters are characterized by the formula that is typically
used for this purpose is given below.
where;
Si = value of a node i
xi = empirical mean of
a node i C = a constant
t = total number of simulations
When traversing a tree during the selection process, the child node that returns the
greatest value from the above equation will be one that will get selected. During
traversal, once a child node is found which is also a leaf node, the MCTS jumps
into the expansion step.
Expansion: In this process, a new child node is added to the tree to that node which
was optimally reached during the selection process.
Simulation: In this process, a simulation is performed by choosing moves or
strategies until a result or predefined state is achieved.
Backpropagation: After determining the value of the newly added node, the
remaining tree must be updated. So, the backpropagation process is performed,
where it backpropagates from the new node to the root node. During the process,
the number of simulation stored in each node is incremented. Also, if the new
node’s simulation results in a win, then the number of wins is also incremented.
The above steps can be visually understood by the diagram given below:
UNIT IV LOGICAL REASONING
Knowledge-based agents – propositional logic – propositional theorem proving –
propositional model checking – agents based on propositional logic. First-order
logic – syntax and semantics Knowledge representation and engineering –
inferences in first-order logic – Forward chaining Backward chaining –
resolution.
PART-A (2 Marks)
1. Define reasoning.
The reasoning is the mental process of deriving logical conclusion
and making predictions from available knowledge, facts, and
beliefs.
Or we can say, "Reasoning is a way to infer facts from existing data."
It is a general process of thinking rationally, to find valid conclusions.
In artificial intelligence, the reasoning is essential so that the machine can also
think rationally as a human brain, and can perform like a human.
When a system is required to do something, that it has not been explicitly
told how to do,, it must figure out what it needs to know from what it already
knows.
Fact 1 : Robins are Birds
Fact 2 : All birds have wings
Question : Do Robins have wings?
Example:
a) It is Sunday.
b) The Sun rises from West (False proposition)
c) 3+3= 7(False proposition)
d) 5 is a prime number.
6. Define Davis-Putnam algorithm.
The Davis–Putnam algorithm was developed by Martin Davis and Hilary Putnam for checking the
validity of a first-order logic formula using a resolution-based decision procedure for propositional
logic.
7. Define hybrid agent.
A Hybrid Agent is a new kind of real estate professional that utilizes artificial intelligence to automate,
digitize and increase daily production and income.
In this type of chaining, the inference engine starts by evaluating existing facts, derivations, and
conditions before deducing new information. An endpoint (goal) is achieved through the
manipulation of knowledge that exists in the knowledge base. Forward chaining can be used in
planning, monitoring, controlling, and interpretingapplications.
12. Define backward chaining.
Backward chaining is a concept in artificial intelligence that involves
backtrackingfrom the endpoint or goal to steps that led to the endpoint.
In propositional logic for AI, common inference rules include Modus Ponens (If A implies
B, and A is true, then B is true), Modus Tollens (If A implies B, and B is false, then A is
false), and Hypothetical Syllogism (If A implies B, and B implies C, then A implies C). Other
rules are Disjunctive Syllogism (If A or B is true, and A is false, then B is true)
and Conjunction (If A is true and B is true, then A and B are true).
o An intelligent agent needs knowledge about the real world for taking decisions and
reasoning to act efficiently.
o Knowledge-based agents are those agents who have the capability of maintaining
an internal state of knowledge, reason over that knowledge, update their
knowledge after observations and take actions. These agents can represent the
world with some formal representation and act intelligently.
o Knowledge-based agents are composed of two main parts:
o Knowledge-base and
o Inference system.
Knowledge-base is required for updating knowledge for an agent to learn with experiences
and take action as per the knowledge.
Inference system
Inference means deriving new sentences from old. Inference system allows us to add a new
sentence to the knowledge base. A sentence is a proposition about the world. Inference
system applies logical rules to the KB to deduce new information.
Inference system generates new facts so that an agent can update the KB. An inference
system works mainly in two rules which are given as:
o Forward chaining
o Backward chaining
Following are three operations which are performed by KBA in order to show the intelligent behavior:
1. TELL: This operation tells the knowledge base what it perceives from the environment.
2. ASK: This operation asks the knowledge base what action it should perform.
3. Perform: It performs the selected action.
function KB-AGENT(percept):
t, a counter, initially
0, indicating time TELL(KB,
MAKE-PERCEPT-
SENTENCE(percept, t))
Action = ASK(KB, MAKE-
ACTION-QUERY(t))
TELL(KB, MAKE-ACTION-
SENTENCE(action, t))
t=t+1
return action
The knowledge-based agent takes percept as input and returns an action as output. The
agent maintains the knowledge base, KB, and it initially has some background knowledge of
the real world. It also has a counter to indicate the time for the whole process, and this
counter is initialized with zero.
Each time when the function is called, it performs its three operations:
A knowledge-based agent can be viewed at different levels which are given below:
1. Knowledge level
Knowledge level is the first level of knowledge-based agent, and in this level, we need to
specify what the agent knows, and what the agent goals are. With these specifications, we can
fix its behavior. For example, suppose an automated taxi agent needs to go from a station A
to station B, and he knows the way from A to B, so this comes at the knowledge level.
2. Logical level:
are encoded into different logics. At the logical level, an encoding of knowledge into logical
sentences occurs. At the logical level we can expect to the automated taxi agent to reach to
the destination B.
3. Implementation level:
This is the physical representation of logic and knowledge. At the implementation level agent
perform actions as per logical and knowledge level. At this level, an automated taxi agent
actually implement his knowledge and logic so that he can reach to the destination.
However, in the real world, a successful agent can be built by combining both declarative and
procedural approaches, and declarative knowledge can often be compiled into more efficient
procedural code.
2. Propositional logic
Propositional logic (PL) is the simplest form of logic where all the statements are made by
propositions. A proposition is a declarative statement which is either true or false. It is a
technique of knowledge representation in logical and mathematical form.
Example:
a) It is Sunday.
b) The Sun rises from West (False proposition)
c) 3+3= 7(False proposition)
d) 5 is a prime number.
The syntax of propositional logic defines the allowable sentences for the knowledge
representation. There are two types of Propositions:
a. Atomic Propositions
b. Compound propositions
Example:
Example:
Logical Connectives:
Logical connectives are used to connect two simpler propositions or representing a sentence
logically. We can create compound propositions with the help of logical connectives. There
are mainly five connectives, which are given as follows:
P= Rohan is intelligent,
Q= Rohan is hardworking. → P∧ Q.
In propositional logic, we need to know the truth values of propositions in all possible
scenarios. We can combine all the possible combination with logical connectives, and the
representation of these combinations in a tabular
called Truth table. Following are the truth table for all logical
connectives:
Truth table with three propositions:
We can build a proposition composing three propositions P, Q, and R. This truth table is
made-up of 8n Tuples as we have taken three proposition symbols.
Precedence of connectives:
Just like arithmetic operators, there is a precedence order for propositional connectors or
logical operators. This order should be followed while evaluating a propositional problem.
Following is the list of the precedence order for operators:
Precedence Operators
Note: For better understanding use parenthesis to make sure of the correct
interpretations. Such as ¬R∨ Q, It can be interpreted as (¬R) ∨ Q
Logical equivalence:
Logical equivalence is one of the features of propositional logic. Two propositions are said to
be logically equivalent if and only if the columns in the truth table are identical to each other.
Let's take two propositions A and B, so for logical equivalence, we can write it as A⇔B. In
below truth table we can see that column for ¬A∨ B and A→B, are identical hence A is
Equivalent to B
Properties of Operators:
o Commutativity:
o P∧ Q= Q ∧ P, or
o P ∨ Q = Q ∨ P.
o Associativity:
o (P ∧ Q) ∧ R= P ∧ (Q ∧ R),
o (P ∨ Q) ∨ R= P ∨ (Q ∨ R)
o Identity element:
o P ∧ True = P,
o P ∨ True= True.
o Distributive:
o P∧ (Q ∨ R) = (P ∧ Q) ∨ (P ∧ R).
o P ∨ (Q ∧ R) = (P ∨ Q) ∧ (P ∨ R).
o DE Morgan's Law:
o ¬ (P ∧ Q) = (¬P) ∨ (¬Q)
o ¬ (P ∨ Q) = (¬ P) ∧ (¬Q).
o Double-negation elimination:
o ¬ (¬P) = P.
o We cannot represent relations like ALL, some, or none with propositional logic. Example:
a. All the girls are intelligent.
b. Some apples are sweet.
Propositional logic has limited expressive power.
In propositional logic, we cannot describe statements in terms of their properties or logical relationships.
a. based on backtracking
b. based on local hill-climbing search
1. A complete
backtracking
algorithm
David-Putnam
algorithm
(DPLL):
DPLL embodies 3 improvements over the scheme of TT-ENTAILS?: Early termination, pure
symbol heuristic, unit clause heuristic.
Tricks that enable SAT solvers to scale up to large variable and value
problems: Component analysis, ordering, intelligent
backtracking, random restarts, clever indexing.
Local search algorithms can be applied directly to the SAT problem, provided that choose
the right evaluation function. (We can choose an evaluation function that counts the number
of unsatisfied clauses.)
These algorithms take steps in the space of complete assignments, flipping the truth value of
one symbol at a time. The space usually contains many local minima, to escape from which
various forms of randomness are required. Local search methods such as WALKSAT can be
used to find solutions. Such algorithm are sound but not complete. WALKSAT: one of the
simplest and most effective algorithms.
On every iteration, the algorithm picks an unsatisfied clause, and chooses randomly between
2 ways to pick a symbol to flip:
Either a. a “min-conflicts” step that minimizes the number of unsatisfied clauses in the new state;
Or b. a “random walk” step that picks the symbol randomly.
When the algorithm returns a model, the input
sentence is indeed satifiable; When the algorithm
Either a. The sentence is unsatisfiable;
returns failure, there are 2 possible causes:
Or b. We need to give the algorithm more time.
An overconstrained problem has many clauses relative to the number of variables and
is likely to have no solutions.
The notation CNFk(m, n) denotes a k-CNF sentence with m clauses and n symbols. (with n
variables and k literals per clause).
Given a source of random sentences, where the clauses are chosen uniformly,
independently and without replacement from among all clauses with k different literals,
which are positive or negative at random.
Hardness: problems right at the threshold > overconstrained problems > underconstrained p
Satifiability threshold conjecture: A theory says that for every k≥3, there is a threshold ratio
rk, such that as n goes to infinity, the probability that CNFk(n, rn) is satisfiable becomes 1 for
all values or r below the threshold, and 0 for all values above. (remains unproven)
atemporal variables: Symbols associated with permanent aspects of the world do not need a time
superscript.
Effect axioms: specify the outcome of an action at the next time step.
Frame problem: some information lost because the effect axioms fails to state what
remains unchanged as the result of an action.
Solution: add frame axioms explicity asserting all the propositions that remain the same.
Representation frame problem: The proliferation of frame axioms is inefficient, the set of
frame axioms will be O(mn) in a world with m different actions and n fluents.
Solution: because the world exhibits locaility (for humans each action typically changes no
more than some number k of those fluents.) Define the transition model with a set of axioms
of size O(mk) rather than size O(mn).
Inferential frame problem: The problem of projecting forward the results of a t step lan of
action in time O(kt) rather than O(nt).
Solution: change one’s focus from writing axioms about actions to writing axioms about fluents.
For each fluent F, we will have an axiom that defines the truth value of Ft+1 in terms of fluents
at time t and the action that may have occurred at time t.
Qualification problem: specifying all unusual exceptions that could cause the action to fail.
2. A hybrid agent
Hybrid agent: combines the ability to deduce various aspect of the state of the world with
condition-action rules, and with problem-solving algorithms.
At each time step, the new percept sentence is added along with all the axioms that
depend on t (such as the successor-state axioms).
Then the agent use logical inference by ASKING questions of the KB (to work out which
squares are safe and which have yet to be visited).
The main body of the agent program constructs a plan based on a decreasing priority of goals:
1. If there is a glitter, construct a plan to grab the gold, follow a route back to the initial
location and climb out of the cave;
2. Otherwise if there is no current plan, plan a route (with A* search) to the closest safe
square unvisited yet, making sure the route goes through only safe squares;
3. If there are no safe squares to explore, if still has an arrow, try to make a safe square
by shooting at one of the possible wumpus locations.
4. If this fails, look for a square to explore that is not provably unsafe.
5. If there is no such square, the mission is impossible, then retreat to the initial location and climb out of the
cave.
Weakness: The computational expense goes up as time goes by.
Belief state: Some representation of the set of all possible current state of the world. (used to replace the past history of
percepts and all their ramifications)
We use a logical sentence involving the proposition symbols associated with the current time step and the temporal
symbols.
Logical state estimation involves maintaining a logical sentence that describes the set of possible states consistent with
the observation history. Each update step requires inference using the transition model of the environment, which is built
from successor-state axioms that specify how each fluent changes.
State estimation: The process of updating the belief state as new percepts arrive.
Exact state estimation may require logical formulas whose size is exponential in the number of symbols.
One common scheme for approximate state estimation: to represent belief state as conjunctions of literals (1-CNF
formulas).
The agent simply tries to prove Xt and ¬Xt for each symbol Xt, given the belief state at t-1.
The conjunction of provable literals becomes the new belief state, and the previous belief state is discarded. (This
scheme may lose some information as time goes along.)
The set of possible states represented by the 1-CNF belief state includes all states that are in fact possible given the full
percept history. The 1-CNF belief state acts as a simple outer envelope, or conservative approximation.
Decisions within a logical agent can be made by SAT solving: finding possible models specifying future action sequences
that reach the goal. This approach works only for fully observable or sensorless environment.
SATPLAN finds models for a sentence containing the initial sate, the goal, the successor-state axioms, and the action
exclusion axioms.
(Because the agent does not know how many steps it will take to reach the goal, the algorithm tries each possible number
of steps t up to some maximum conceivable plan length Tmax.)
Precondition axioms: stating that an action occurrence requires the preconditions to be satisfied, added to avoid
generating plans with illegal actions.
Action exclusion axioms: added to avoid the creation of plans with multiple simultaneous actions that interfere with each
other.
Propositional logic does not scale to environments of unbounded size because it lacks the expressive power to deal
concisely with time, space and universal patterns of relationshipgs among objects.
In the topic of Propositional logic, we have seen that how to represent statements using propositional logic. But
unfortunately, in propositional logic, we can only represent the facts, which are either true or false. PL is not sufficient to
represent the complex sentences or natural language statements. The propositional logic has very limited expressive
power. Consider the following sentence, which we cannot represent using PL logic.
To represent the above statements, PL logic is not sufficient, so we required some more powerful logic, such as first- order
logic.
First-Order logic:
o FOL is sufficiently expressive to represent the natural language statements in a concise way.
o First-order logic is also known as Predicate logic or First-order predicate logic. First-order logic is a powerful language
that develops information about the objects in a more easy way and can also express the relationship between those
objects.
o First-order logic (like natural language) does not only assume that the world contains facts like propositional logic but
also assumes the following things in the world:
o Objects: A, B, people, numbers, colors, wars, theories, squares, pits, wumpus, ......
o Relations: It can be unary relation such as: red, round, is adjacent, or n-any relation such as: the
sister of, brother of, has color, comes between
o Function: Father of, best friend, third inning of, end of, ......
o As a natural language, first-order logic also has two main parts:
a. Syntax
b. Semantics
The syntax of FOL determines which collection of symbols is a logical expression in first-order logic. The basic
syntactic elements of first-order logic are symbols. We write statements in short-hand notation in FOL.
Variables x, y, z, a, b,....
Connectives ∧, ∨, ¬, ⇒, ⇔
Equality ==
Quantifier ∀, ∃
Atomic sentences:
o Atomic sentences are the most basic sentences of first-order logic. These sentences are formed from a predicate symbol
followed by a parenthesis with a sequence of terms.
o We can represent atomic sentences as Predicate (term1, term2, ....... , term n).
Complex Sentences:
Consider the statement: "x is an integer.", it consists of two parts, the first part x is the subject of the statement and
second part "is an integer," is known as a predicate.
Quantifiers in First-order logic:
o A quantifier is a language element which generates quantification, and quantification specifies the quantity of specimen
in the universe of discourse.
o These are the symbols that permit to determine or identify the range and scope of the variable in the logical expression.
There are two types of quantifier:
a. Universal Quantifier, (for all, everyone, everything)
b. Existential quantifier, (for some, at least one).
Universal Quantifier:
Universal quantifier is a symbol of logical representation, which specifies that the statement within its range is true for
everything or every instance of a particular thing.
o For all x
o For each x
o For every x.
Example:
Let a variable x which refers to a cat so all x can be represented in UOD as below:
∀x man(x) → drink (x, coffee).
It will be read as: There are all x where x is a man who drink coffee.
Existential Quantifier:
Existential quantifiers are the type of quantifiers, which express that the statement within its scope is true for at least one
instance of something.
It is denoted by the logical operator ∃, which resembles as inverted E. When it is used with a predicate variable then it is
called as an existential quantifier.
If x is a variable, then existential quantifier will be ∃x or ∃(x). And it will be read as:
Example:
It will be read as: There are some x where x is a boy who is intelligent.
Points to remember:
Properties of Quantifiers:
And since there are all birds who fly so it will be represented as follows.
∀x bird(x) →fly(x).
Since there is only one student who failed in Mathematics, so we will use following representation for this:
∃(x) [ student(x) → failed (x, Mathematics) ∧∀ (y) [¬(x==y) ∧ student(y) → ¬failed (x, Mathematics)].
The quantifiers interact with variables which appear in a suitable way. There are two types of variables in First-order logic
which are given below:
Free Variable: A variable is said to be a free variable in a formula if it occurs outside the scope of the quantifier.
Bound Variable: A variable is said to be a bound variable in a formula if it occurs within the scope of the quantifier.
6. Explain detail about forward and backward chaining algorithm with an example.
In artificial intelligence, forward and backward chaining is one of the important topics, but before understanding
forward and backward chaining lets first understand that from where these two terms came.
Inference engine:
The inference engine is the component of the intelligent system in artificial intelligence, which applies logical rules to
the knowledge base to infer new information from known facts. The first inference engine was part of the expert
system. Inference engine commonly proceeds in two modes, which are:
A. Forward chaining
B. Backward chaining
Horn clause and definite clause are the forms of sentences, which enables knowledge base to use a more restricted
and efficient inference algorithm. Logical inference algorithms use forward and backward chaining approaches, which
require KB in the form of the first-order definite clause.
Definite clause: A clause which is a disjunction of literals with exactly one positive literal is known as a definite
clause or strict horn clause.
Horn clause: A clause which is a disjunction of literals with at most one positive literal is known as horn clause.
Hence all the definite clauses are horn clauses.
It is equivalent to p ∧ q → k.
A. Forward Chaining
Forward chaining is also known as a forward deduction or forward reasoning method when using an inference engine.
Forward chaining is a form of reasoning which start with atomic sentences in the knowledge base and applies
inference rules (Modus Ponens) in the forward direction to extract more data until a goal is reached.
The Forward-chaining algorithm starts from known facts, triggers all rules whose premises are satisfied, and add their
conclusion to the known facts. This process repeats until the problem is solved.
Properties of Forward-Chaining:
Consider the following famous example which we will use in both approaches:
Example:
"As per the law, it is a crime for an American to sell weapons to hostile nations. Country A, an enemy of
America, has some missiles, and all the missiles were sold to it by Robert, who is an American citizen."
To solve the above problem, first, we will convert all the above facts into first-order definite clauses, and then we will
use a forward-chaining algorithm to reach the goal.
o It is a crime for an American to sell weapons to hostile nations. (Let's say p, q, and r are variables)
American (p) ∧ weapon(q) ∧ sells (p, q, r) ∧ hostile(r) → Criminal(p) ...(1)
o Country A has some missiles. ?p Owns(A, p) ∧ Missile(p). It can be written in two definite clauses by using
Existential Instantiation, introducing new Constant T1.
Owns(A, T1) ...................(2)
Missile(T1).................... (3)
o Robert is American
American(Robert).........................(8)
Step-1:
In the first step we will start with the known facts and will choose the sentences which do not have implications, such
as: American(Robert), Enemy(A, America), Owns(A, T1), and Missile(T1). All these facts will be represented as
below.
Step-2:
At the second step, we will see those facts which infer from available facts and with satisfied premises.
Rule-(1) does not satisfy premises, so it will not be added in the first iteration.
Rule-(6) is satisfied with the substitution(p/A), so Hostile(A) is added and which infers from Rule-(7).
Step-3:
At step-3, as we can check Rule-(1) is satisfied with the substitution {p/Robert, q/T1, r/A}, so we can add
Criminal(Robert) which infers all the available facts. And hence we reached our goal statement.
B. Backward Chaining:
Backward-chaining is also known as a backward deduction or backward reasoning method when using an inference
engine. A backward chaining algorithm is a form of reasoning, which starts with the goal and works backward,
chaining through rules to find known facts that support the goal.
o In backward chaining, the goal is broken into sub-goal or sub-goals to prove the facts true.
o It is called a goal-driven approach, as a list of goals decides which rules are selected and used.
o Backward -chaining algorithm is used in game theory, automated theorem proving tools, inference engines,
proof assistants, and various AI applications.
o The backward-chaining method mostly used a depth-first search strategy for proof.
Example:
In backward-chaining, we will use the same above example, and will rewrite all the rules.
o Missile(T1)
o ?p Missiles(p) ∧ Owns (A, p) → Sells (Robert, p, A) .................. (4)
o Missile(p) → Weapons (p) ........................ (5)
o Enemy(p, America) →Hostile(p) ......................... (6)
o Enemy (A, America) .......................... (7)
o American(Robert)............................. (8)
Backward-Chaining proof:
In Backward chaining, we will start with our goal predicate, which is Criminal(Robert), and then infer further rules.
Step-1:
At the first step, we will take the goal fact. And from the goal fact, we will infer other facts, and at last, we will prove
those facts true. So our goal fact is "Robert is Criminal," so following is the predicate of it.
Step-2:
At the second step, we will infer other facts form goal fact which satisfies the rules. So as we can see in Rule-1, the
goal predicate Criminal (Robert) is present with substitution {Robert/P}. So we will add all the conjunctive facts below
the first level and will replace p with Robert.
Step-4:
At step-4, we can infer facts Missile(T1) and Owns(A, T1) form Sells(Robert, T1, r) which satisfies the Rule- 4, with the
substitution of A in place of r. So these two statements are proved here.
Step-5:
At step-5, we can infer the fact Enemy(A, America) from Hostile(A) which satisfies Rule- 6. And hence all the
statements are proved true using backward chaining.
What is knowledge-engineering?
The process of constructing a knowledge-base in first-order logic is called as knowledge- engineering. In knowledge-
engineering, someone who investigates a particular domain, learns important concept of that domain, and generates
a formal representation of the objects, is known as knowledge engineer.
In this topic, we will understand the Knowledge engineering process in an electronic circuit domain, which is already
familiar. This approach is mainly suitable for creating special-purpose knowledge base.
Following are some main steps of the knowledge-engineering process. Using these steps, we will develop a
knowledge base which will allow us to reason about digital circuit (One-bit full adder) which is given below
1. Identify the task:
The first step of the process is to identify the task, and for the digital circuit, there are various reasoning tasks
At the first level or highest level, we will examine the functionality of the circuit:
At the second level, we will examine the circuit structure details such as:
In the second step, we will assemble the relevant knowledge which is required for digital circuits. So for digital circuits,
we have the following required knowledge:
3. Decide on vocabulary:
The next step of the process is to select functions, predicate, and constants to represent the circuits, terminals, signals,
and gates. Firstly we will distinguish the gates from each other and from other objects. Each gate is represented as an
object which is named by a constant, such as, Gate(X1). The functionality of each gate is determined by its type,
which is taken as constants such as AND, OR, XOR, or NOT. Circuits will be identified by a predicate: Circuit (C1).
For gate input, we will use the function In(1, X1) for denoting the first input terminal of the gate, and for output
terminal we will use Out (1, X1).
The function Arity(c, i, j) is used to denote that circuit c has i input, j output.
The connectivity between gates can be represented by predicate Connect(Out(1, X1), In(1, X1)).
We use a unary predicate On (t), which is true if the signal at a terminal is on.
To encode the general knowledge about the logic circuit, we need some following rules:
o If two terminals are connected then they have the same input signal, it can be represented as:
∀ t1, t2 Terminal (t1) ∧ Terminal (t2) ∧ Connect (t1, t2) → Signal (t1) = Signal (2).
o Signal at every terminal will have either value 0 or 1, it will be represented as:
o Output of AND gate will be zero if and only if any of its input is zero.
∀ g Gate(g) ∧ Type(g) = XOR → Signal (Out(1, g)) = 1 ⇔ Signal (In(1, g)) ≠ Signal (In(2, g)).
o All the gates in the above circuit have two inputs and one output (except NOT gate).
∀ g Gate(g) ∧ r =Type(g) ∧ (r= AND ∨r= OR ∨r= XOR) → Arity (g, 2, 1).
Now we encode problem of circuit C1, firstly we categorize the circuit and its gate components. This step is
easy if ontology about the problem is already thought. This step involves the writing simple atomics
sentences of instances of concepts, which is known as ontology.
For the given circuit C1, we can encode the problem instance in atomic sentences as below:
Since in the circuit there are two XOR, two AND, and one OR gate so atomic sentences for these gates will be:
For XOR gate: Type(x1)= XOR,
Type(X2) = XOR For AND gate:
Type(A1) = AND, Type(A2)= AND
For OR gate: Type (O1) = OR.
In this step, we will find all the possible set of values of all the terminal for the adder circuit. The first query will be:
What should be the combination of input which would generate the first output of circuit C1, as 0
and a second output to be 1?
∃ i1, i2, i3 Signal (In(1, C1))=i1 ∧ Signal (In(2, C1))=i2 ∧ Signal (In(3, C1))= i3
Now we will debug the knowledge base, and this is the last step of the complete process. In this step, we
will try to debug the issues of knowledge base.
Actual numbers
Each node in the Bayesian network has condition
probabil
distribution P(Xi |Parent(Xi) ), which determines the effect
of parent on that node.
Bayesian network is based on Joint probability
distribution a conditional probability.
So let's first understand the joint probability distribution:
Above,
P(c|x) is the posterior probability of class (c, target) given predictor (x,
attributes).
Logic for reasoning with uncertain information includes Probabilistic Reasoning, which uses probabilities to
represent belief and likelihood, and Fuzzy Logic, which handles vagueness and imprecision by allowing degrees
of truth rather than strict true/false values. Non-monotonic Reasoning extends rules to reason with
incomplete information, where conclusions can be retracted with new evidence
A causal network is a graphical model using nodes (representing variables) and directed edges (arrows) to
show cause-and-effect relationships between them. These networks visually map how changes in one
variable directly influence another, helping to understand the underlying mechanisms of a complex
system, and are used in fields like artificial intelligence, data science, and epidemiology for prediction
and decision-making
General question: What’s the whole probability distribution over variable X given evidence e, P(X |
e)?
In our discrete probability situation, the only way to answer a MAP query is to compute
the probability of x given e for all possible values of x and see which one is greatest
So, in general, we’d like to be able to compute a whole probability distribution over some variable or
To answer any query involving a conjunction of variables, sum over the variables not involved in
the query
Given the joint distribution over the variables, we can easily answer any question about the
value of a single variable by summing (or marginalizing) over the other variables.
So, in a domain with four variables, A, B, C, and D, the probability that variable D has value d is
the sum over all possible combinations of values of the other three variables of the joint
probability of all four values. This is exactly the same as the procedure we went through in the
last lecture, where to compute the probability of cavity, we added up the probability of cavity
and toothache and the probability of cavity and not toothache.
In general, we’ll use the first notation, with a single summation indexed by a list of variable
names, and a joint probability expression that mentions values of those variables. But here we
can see the completely written-out definition, just so we all know what the shorthand is
supposed to mean.
In the numerator, here, you can see that we’re only summing over variables A and C, because b
and d are instantiated in the query.
We’re going to learn a general purpose algorithm for answering these joint queries fairly
efficiently. We’ll start by looking at a very simple case to build up our intuitions, then we’ll write
down the algorithm, then we’ll apply it to a more complex case.
Here’s our very simple case. It’s a bayes net with four nodes, arranged in a chain.
So, we know from before that the probability that variable D has some value little d is the sum
over A, B, and C of the joint distribution, with d fixed.
Now, using the chain rule of Bayesian networks, we can write down the joint probability as a
product over the nodes of the probability of each node’s value given the values of its parents. So,
in this case, we get P(d|c) times P(c|b) times P(b|a) times P(a).
This expression gives us a method for answering the query, given the conditional probabilities
that are stored in the net. And this method can be applied directly to any other bayes net. But
there’s a problem with it: it requires enumerating all possible combinations of assignments to A,
B, and C, and then, for each one, multiplying the factors for each node. That’s an enormous
amount of work and we’d like to avoid it if at all possible.
So, we’ll try rewriting the expression into something that might be more efficient to evaluate.
First, we can make our summation into three separate summations, one over each variable.
Then, by distributivity of addition over multiplication, we can push the summations in, so that
the sum over A includes all the terms that mention A, but no others, and so on. It’s pretty clear
that this expression is the same as the previous one in value, but it can be evaluated more
efficiently. We’re still, eventually, enumerating all assignments to the three variables, but we’re
doing somewhat fewer multiplications than before. So this is still not completely satisfactory.
If you look, for a minute, at the terms inside the summation over A, you’ll see that we’re doing
these multiplications over for each value of C, which isn’t necessary, because they’re
independent of C. Our idea, here, is to do the multiplications once and store them for later use.
So, first, for each value of A and B, we can compute the product, generating a two dimensional
matrix.
Then, we can sum over the rows of the matrix, yielding one value of the sum for each possible
value of b.
Now, we can substitute f1 of b in for the sum over A in our previous expression. And, effectively,
we can remove node A from our diagram. Now, we express the contribution of b, which takes
the contribution of a into account, as f_1 of b.
We can continue the process in basically the same way. We can look at the summation over b
and see that the only other variable it involves is c. We can summarize those products as a set of
factors, one for each value of c. We’ll call those factors f_2 of c.
We substitute f_2 of c into the formula, remove node b from the diagram, and now we’re down to a
Given a Bayesian network, and an elimination order for the non-query variables , compute
For i = m downto 1
That was a simple special case. Now we can look at the algorithm in the general case. Let’s
assume that we’re given a Bayesian network and an ordering on the variables that aren’t fixed in
the query. We’ll come back later to the question of the influence of the order, and how we might
find a good one.
We can express the probability of the query variables as a sum over each value of each of the
non- query variables of a product over each node in the network, of the probability that that
variable has the given value given the values of its parents.
So, we’ll eliminate the variables from the inside out. Starting with variable Xm and finishing with
variable X1.
To eliminate variable Xi, we start by gathering up all of the factors that mention Xi, and removing
them from our set of factors. Let’s say there are k such factors.
Now, we make a k+1 dimensional table, indexed by Xi as well as each of the other variables that
is mentioned in our set of factors.
table. This table is our new factor, and we put a term for it back into our set
of factors. Once we’ve eliminated all the summations, we have the desired
value.
So, we start by eliminating V. We gather the two terms that mention V and see that they also
involve variable T. So, we compute the product for each value of T, and summarize those in the
factor f1 of T.
Now we can substitute that factor in for the summation, and remove the node from the network.
The next variable to be eliminated is X. There is actually only one term involving X, and it also
involves variable A. So, for each value of A, we compute the sum over X of P(x|a). But wait! We
know what this value is! If we fix a and sum over x, these probabilities have to add up to 1.
So, rather than adding another factor to our expression, we can just remove the whole sum. In
general, the only nodes that will have an influence on the probability of D are its ancestors.
Now, it’s time to eliminate S. We find that there are three terms involving S, and we gather them
into the sum. These three terms involve two other variables, B and L. So we have to make a
factor that specifies, for each value of B and L, the value of the sum of products.
Now we can substitute that factor back into our expression. We can also eliminate node S. But in
eliminating S, we’ve added a direct dependency between L and B (they used to be dependent via
S, but now the dependency is encoded explicitly in f2(b). We’ll show that in the graph by drawing
a line between the two nodes. It’s not exactly a standard directed conditional dependence, but
it’s still useful to show that they’re coupled.
Now we eliminate T. It involves two terms, which themselves involve variables A and L. So we
make a new factor f3 of A and L.
Next we eliminate L. It involves these two factors, which depend on variables A and B. So we
make a new factor, f4 of A and B, and substitute it in. We remove node L, but couple A and B.
At this point, we could just do the summations over A and B and be done. But to finish out the
Finally, to get the probability that variable D has value little d, we simply sum factor f5 over all
values of a. Yay! We did it.
Let’s see how the variable elimination algorithm performs, both in theory and in practice.
First of all, it’s pretty easy to see that it runs in time exponential in the number of variables involved
in the largest factor. Creating a factor with k variables involves making a k+1 dimensional table. If you
have b values per variable, that’s a table of size b^(k+1). To make each entry, you have to multiply at
most n numbers, where n is the number of nodes. We have to do this for each variable to be
eliminated (which is usually close to n). So we have something like time = O(n^2 b^k).
How big the factors are depends on the elimination order. You’ll see in one of the recitation exercises
just how dramatic the difference in factor sizes can be. A bad elimination order can generate huge
factors.
So, we’d like to use the elimination order that generates the smallest factors. Unfortunately, it turns
out to be NP hard to find the best elimination order.
At least, there are some fairly reasonable heuristics for choosing an elimination order. It’s usually
done dynamically. So, rather than fixing the elimination order in advance, as we suggested in the
algorithm description, you can pick the next variable to be eliminated depending on the situation. In
particular, one reasonable heuristic is to pick the variable to eliminate next that will result in the
smallest factor. This greedy approach won’t always be optimal, but it’s not usually too bad.
There is one case where Bayes net inference in general, and the variable elimination algorithm in
particular is fairly efficient, and that’s when the network is a polytree. A polytree is a network with no
cycles. That is, a network in which, for any two nodes, there is only one path between them. In a
polytree, inference is linear in the size of the network, where the size of the network is defined to be
the size of the largest conditional probability table (or exponential in the maximum number of
parents of any node). In a polytree, the optimal elimination order is to start at the root nodes, and
work downwards, always eliminating a variable that no longer has any parents. In doing so, we never
introduce additional connections into the network.
So, inference in polytrees is efficient, and even in many large non-polytree networks, it’s possible to
When the network is such that the factors are, of necessity, large, we’ll have to turn to a different class
of methods.
Bayesian belief network is key computer technology for dealing with probabilistic events and to solve a
problem which has uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables and their
conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a probability distribution, and also
use probability theory for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship between multiple events, we
need a Bayesian network. It can also be used in various tasks including prediction, anomaly detection,
diagnostics, automated insight, reasoning, time series prediction, and decision making under uncertainty.
Bayesian Network can be used for building models from data and experts opinions, and it consists of two parts:
The generalized form of Bayesian network that represents and solve decision problems under uncertain
knowledge is known as an Influence diagram.
A Bayesian network graph is made up of nodes and Arcs (directed links), where:
o Each node corresponds to the random variables, and a variable can be continuous or discrete.
o Arc or directed arrows represent the causal relationship or conditional probabilities between random
variables. These directed links or arrows connect the pair of nodes in the
graph. These links represent that one node directly influence the other node, and if there is no directed
link that means that nodes are independent with each other
o In the above diagram, A, B, C, and D are random variables represented by the nodes of
the network graph.
o If we are considering node B, which is connected with node A by a directed arrow, then
node A is called the parent of Node B.
o Node C is independent of node A.
Note: The Bayesian network graph does not contain any cyclic graph. Hence, it is known as a directed
acyclic graph or DAG
o Causal Component
o Actual numbers
Each node in the Bayesian network has condition probability distribution P(Xi |Parent(Xi) ), which determines
the effect of the parent on that node.
Bayesian network is based on Joint probability distribution and conditional probability. So let's first understand
the joint probability distribution:
If we have variables x1, x2, x3, , xn, then the probabilities of a different combination of x1, x2, x3.. xn, are known as
P[x1, x2, x3, , xn], it can be written as the following way in terms of the joint probability distribution.
In general for each variable Xi, we can write the equation as:
Let's understand the Bayesian network through an example by creating a directed acyclic graph:
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm reliably responds at
detecting a burglary but also responds for minor earthquakes. Harry has two neighbors David and Sophia, who
have taken a responsibility to inform Harry at work when they hear the alarm. David always calls Harry when he
hears the alarm, but sometimes he got confused with the phone ringing and calls at that time too. On the other
hand, Sophia likes to listen to high music, so sometimes she misses to hear the alarm. Here we would like to
compute the probability of Burglary Alarm.
Problem:
Calculate the probability that alarm has sounded, but there is neither a burglary, nor an earthquake occurred, and David and Sophia both called
the Harry.
Solution:
o The Bayesian network for the above problem is given below. The network structure is showing that
burglary and earthquake is the parent node of the alarm and directly affecting the probability of alarm's
going off, but David and Sophia's calls depend on alarm probability.
o The network is representing that our assumptions do not directly perceive the burglary and also do not
notice the minor earthquake, and they also not confer before calling.
o The conditional distributions for each node are given as conditional probabilities table or CPT.
o Each row in the CPT must be sum to 1 because all the entries in the table represent an exhaustive set of
cases for the variable.
o In CPT, a boolean variable with k boolean parents contains 2K probabilities. Hence, if there are two
parents, then CPT will contain 4 probability values
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
We can write the events of problem statement in the form of probability: P[D, S, A, B, E], can rewrite the
above probability statement using joint probability distribution:
Let's take the observed probability for the Burglary and earthquake component:
P(B= True) = 0.002, which is the probability of burglary.
The Conditional probability of David that he will call depends on the probability of Alarm.
The Conditional probability of Sophia that she calls is depending on its Parent Node "Alarm."
From the formula of joint distribution, we can write the problem statement in the form of probability distribution:
Hence, a Bayesian network can answer any query about the domain by using Joint
There are two ways to understand the semantics of the Bayesian network, which is given below:
Sampling
Another strategy, which is a theme that comes up also more and more in AI actually, is to say, well,
we didn't really want the right answer anyway. Let's try to do an approximation. And you can also
show that it's computationally hard to get an approximation that's within epsilon of the answer that
you want, but again that doesn't keep us from trying.
So, the other thing that we can do is the stochastic simulation or sampling. In sampling, what we do
is we look at the root nodes of our graph, and attached to this root node is some probability that A is
going to be true, right? Maybe it's .4. So we flip a coin that comes up heads with probability .4 and
see if we get true or false.
We flip our coin, let's say, and we get true for A -- this time. And now, given the assignment of true to
A, we look in the conditional probability table for B given A = true, and that gives us a probability for
B.
Now, we flip a coin with that probability. Say we get False. We enter that into the table.
Now, we look in the CPT for D given B and C, for the case where B is false and C is true, and we flip a
coin with that probability, in order to get a value for D.
So, there's one sample from the joint distribution of these four variables. And you can just keep
doing this, all day and all night, and generate a big pile of samples, using that algorithm. And now you
can ask various questions.
Estimate:
P*(D|A) = #D,A / #A
Let's say you want to know the probability of D given A. How would you answer - - given all the
examples -- what would you do to compute the probability of D given A? You would just count. You’d
count the number of cases in which A and D were true, and you’d divide that by the number
of cases in which A was true, and that would give you an unbiased estimate of the probability of D given A.
The
more samples, the more confidence you’d have that the estimated probability is close to the true one.
Estimation
Imagine that you want to estimate the probability of some disease given -- oh, I don't know -- spots
on your tongue and a sore toe. Somebody walks in and they have a really peculiar set of symptoms,
and you want to know what's the probability that they have some disease.
Well, if the symptoms are root nodes, it's easy. If the symptoms were root nodes, you could just
assign the root nodes to have their observed values and then simulate the rest of the network as
before.
But if the symptoms aren't root nodes then if you do naïve sampling, you would generate a giant
table of samples, and you'd have to go and look and say, gosh, how many cases do I have where
somebody has spots on their tongue and a sore toe; and the answer would be, well, maybe zero or
not very many.
There’s a technique called importance sampling, which allows you to draw examples from a
distribution that’s going to be more helpful and then reweight them so that you can still get an
unbiased estimate of the desired conditional probability. It’s a bit beyond the scope of this class to
get into the details, but it’s an important and effective idea.
Recitation Problem
• Do the variable elimination algorithm on the net below using the elimination order A,B,C (that
is, eliminate node C first). In computing P(D=d), what factors do you get?
Here’s the network we started with. We used elimination order C, B, A (we eliminated A first). Now
we’re going to explore what happens when we eliminate the variables in the opposite order. First,
work on the case we did, where we’re trying to calculate the probability that node D takes on a
particular value, little d. Remember that little d is a constant in this case. Now, do the case where
we’re trying to find the whole distribution over D, so we don’t know a particular value for little d.
Another Recitation Problem
Find an elimination order that keeps the factors small for the net below, or show that there is no
such order.
Here’s a pretty complicated graph. But notice that no node has more than 2 parents, so none of the
CPTs are huge. The question is, is this graph hard for variable elimination? More concretely, can you
find an elimination order that results only in fairly small factors? Is there an elimination order that
generates a huge factor?
Bayesian networks (or related models) are often used in computer vision, but they almost always
require sampling. What happens when you try to do variable elimination on a model like the grid
below?
1. Bayesian inference
Bayes' theorem:
Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which determines the
probability of an event with uncertain knowledge.
In probability theory, it relates the conditional probability and marginal probabilities of two random events.
Bayes' theorem was named after the British mathematician Thomas Bayes. The Bayesian inference is an
application of Bayes' theorem, which is fundamental to Bayesian statistics.
Bayes' theorem allows updating the probability prediction of an event by observing new information of the real world.
Example: If cancer corresponds to one's age then by using Bayes' theorem, we can determine the probability of
cancer more accurately with the help of age.
Bayes' theorem can be derived using product rule and conditional probability of event A with known
The above equation (a) is called as Bayes' rule or Bayes' theorem. This equation is basic of most modern AI
systems for probabilistic inference.
It shows the simple relationship between joint and conditional probabilities. Here,
P(A|B) is known as posterior, which we need to calculate, and it will be read as Probability of hypothesis A when
we have occurred an evidence B.
P(B|A) is called the likelihood, in which we consider that hypothesis is true, then we calculate the probability of
evidence.
P(A) is called the prior probability, probability of hypothesis before considering the
Where A1, A2, A3,. , An is a set of mutually exclusive and exhaustive events.
Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A). This is very useful in
cases where we have a good probability of these three terms and want to determine the fourth one. Suppose we
want to perceive the effect of some unknown cause, and want to compute that cause, then the Bayes' rule
becomes:
Example-1:
Question: what is the probability that a patient has diseases meningitis with a stiff neck?
Given Data:
A doctor is aware that disease meningitis causes a patient to have a stiff neck, and it occurs 80% of the time.
He is also aware of some more facts, which are given as follows:
Let a be the proposition that patient has stiff neck and b be the proposition that patient has meningitis. , so we
can calculate the following as:
P(a|b) = 0.8
P(b) =
1/30000
P(a)= .02
Hence, we can assume that 1 patient out of 750 patients has meningitis disease with a stiff neck.
Example-2:
Question: From a standard deck of playing cards, a single card is drawn. The probability that the card is
king is 4/52, then calculate posterior probability P(King|Face), which means the drawn face card is a king
card.
Solution:
o It is used to calculate the next step of the robot when the already executed step is given.
o Bayes' theorem is helpful in weather forecasting.
o It can solve the Monty Hall problem.
1. Probabilistic reasoning
Probabilistic reasoning:
Probabilistic reasoning is a way of knowledge representation where we apply the concept of probability to indicate
the uncertainty in knowledge. In probabilistic reasoning, we combine probability theory with logic to handle the
uncertainty.
We use probability in probabilistic reasoning because it provides a way to handle the uncertainty that is the result of
someone's laziness and ignorance.
In the real world, there are lots of scenarios, where the certainty of something is not confirmed, such as "It will rain
today," "behavior of someone for some situations," "A match between two teams or two players." These are probable
sentences for which we can assume that it will happen but not sure about it, so here we use probabilistic reasoning.
In probabilistic reasoning, there are two ways to solve problems with uncertain knowledge:
o Bayes' rule
o Bayesian Statistics
As probabilistic reasoning uses probability and related terms, so before understanding probabilistic reasoning, let's
understand some common terms:
Probability: Probability can be defined as a chance that an uncertain event will occur. It is the numerical measure of
the likelihood that an event will occur. The value of probability always remains between 0 and 1 that represent ideal
uncertainties.
We can find the probability of an uncertain event by using the below formula.
Sample space: The collection of all possible events is called sample space.
Random variables: Random variables are used to represent the events and objects in the real world.
Prior probability: The prior probability of an event is probability computed before observing new information.
Posterior Probability: The probability that is calculated after all evidence or information has taken into account. It is
a combination of prior probability and new information.
Conditional probability:
Conditional probability is a probability of occurring an event when another event has already happened.
Let's suppose, we want to calculate the event A when event B has already occurred, "the probability of A under the
If the probability of A is given and we need to find the probability of B, then it will be given as:
It can be explained by using the below Venn diagram, where B is occurred event, so sample space will be reduced to
set B, and now we can only calculate event A when event B is already occurred by dividing the probability of P(A⋀B)
by P( B ).
Example:
In a class, there are 70% of the students who like English and 40% of the students who likes English and mathematics,
and then what is the percent of students those who like English also like mathematics?
Solution:
1. Casual Networks:
A causal network is an acyclic digraph arising from an evolution of a substitution system, and
representing its history. The illustration above shows a causal network corresponding to the
rules
The figure above shows the procedure for diagrammatically creating a causal network from
a mobile automaton.
In an evolution of a multiway system, each substitution event is a vertex in a causal network.
Two events which are related by causal dependence, meaning one occurs just before the other,
have an edge between the corresponding vertices in the causal network. More precisely, the edge
is a directed edge leading from the past event to the future event.
Some causal networks are independent of the choice of evolution, and these are called causally
invariant.
1. Exact inference:
In exact inference, we analytically compute the conditional probability distribution over the
variables of interest.
But sometimes, that’s too hard to do, in which case we can use approximation techniques based on
statistical sampling
General question: What’s the whole probability distribution over variable X given evidence e, P(X |
e)?
In our discrete probability situation, the only way to answer a MAP query is to compute the
probability of x given e for all possible values of x and see which one is greatest
So, in general, we’d like to be able to compute a whole probability distribution over some variable or
To answer any query involving a conjunction of variables, sum over the variables not involved in the
query
Given the joint distribution over the variables, we can easily answer any question about the value of
a single variable by summing (or marginalizing) over the other variables.
So, in a domain with four variables, A, B, C, and D, the probability that variable D has value d is the
sum over all possible combinations of values of the other three variables of the joint probability of
all four values. This is exactly the same as the procedure we went through in the last lecture, where
to compute the probability of cavity, we added up the probability of cavity and toothache and the
probability of cavity and not toothache.
In general, we’ll use the first notation, with a single summation indexed by a list of variable names,
and a joint probability expression that mentions values of those variables. But here we can see the
completely written-out definition, just so we all know what the shorthand is supposed to mean.
In the numerator, here, you can see that we’re only summing over variables A and C, because b and
d are instantiated in the query.
We’re going to learn a general purpose algorithm for answering these joint queries fairly efficiently.
We’ll start by looking at a very simple case to build up our intuitions, then we’ll write down the
algorithm, then we’ll apply it to a more complex case.
Here’s our very simple case. It’s a bayes net with four nodes, arranged in a chain.
So, we know from before that the probability that variable D has some value little d is the sum over
A, B, and C of the joint distribution, with d fixed.
Now, using the chain rule of Bayesian networks, we can write down the joint probability as a
product over the nodes of the probability of each node’s value given the values of its parents. So, in
this case, we get P(d|c) times P(c|b) times P(b|a) times P(a).
This expression gives us a method for answering the query, given the conditional probabilities that
are stored in the net. And this method can be applied directly to any other bayes net. But there’s a
problem with it: it requires enumerating all possible combinations of assignments to A, B, and C, and
then, for each one, multiplying the factors for each node. That’s an enormous amount of work and
we’d like to avoid it if at all possible.
So, we’ll try rewriting the expression into something that might be more efficient to evaluate. First,
we can make our summation into three separate summations, one over each variable.
Then, by distributivity of addition over multiplication, we can push the summations in, so that the
sum over A includes all the terms that mention A, but no others, and so on. It’s pretty clear that this
expression is the same as the previous one in value, but it can be evaluated more efficiently. We’re
still, eventually, enumerating all assignments to the three variables, but we’re doing somewhat
fewer multiplications than before. So this is still not completely satisfactory.
If you look, for a minute, at the terms inside the summation over A, you’ll see that we’re doing these
multiplications over for each value of C, which isn’t necessary, because they’re independent of C.
Our idea, here, is to do the multiplications once and store them for later use. So, first, for each value
of A and B, we can compute the product, generating a two dimensional matrix.
Then, we can sum over the rows of the matrix, yielding one value of the sum for each possible value
of b.
Now, we can substitute f1 of b in for the sum over A in our previous expression. And, effectively, we
can remove node A from our diagram. Now, we express the contribution of b, which takes the
contribution of a into account, as f_1 of b.
We can continue the process in basically the same way. We can look at the summation over b and
see that the only other variable it involves is c. We can summarize those products as a set of factors,
one for each value of c. We’ll call those factors f_2 of c.
We substitute f_2 of c into the formula, remove node b from the diagram, and now we’re down to a
simple expression in which d is known and we have to sum over values of c.
Given a Bayesian network, and an elimination order for the non-query variables , compute
For i = m downto 1
That was a simple special case. Now we can look at the algorithm in the general case. Let’s assume
that we’re given a Bayesian network and an ordering on the variables that aren’t fixed in the query.
We’ll come back later to the question of the influence of the order, and how we might find a good
one.
We can express the probability of the query variables as a sum over each value of each of the non-
query variables of a product over each node in the network, of the probability that that variable has
the given value given the values of its parents.
So, we’ll eliminate the variables from the inside out. Starting with variable Xm and finishing with
variable X1.
To eliminate variable Xi, we start by gathering up all of the factors that mention Xi, and removing
them from our set of factors. Let’s say there are k such factors.
Now, we make a k+1 dimensional table, indexed by Xi as well as each of the other variables that is
mentioned in our set of factors.
We then sum the table over the Xi dimension, resulting in a k-dimensional table.
This table is our new factor, and we put a term for it back into our set of factors.
Once we’ve eliminated all the summations, we have the desired [Link]
more example
Here’s a more complicated example, to illustrate the variable elimination algorithm in a more
general case. We have this big network that encodes a domain for diagnosing lung disease.
(Dyspnea, as I understand it, is shortness of breath).
So, we start by eliminating V. We gather the two terms that mention V and see that they also
involve variable T. So, we compute the product for each value of T, and summarize those in the
factor f1 of T.
Now we can substitute that factor in for the summation, and remove the node from the network.
The next variable to be eliminated is X. There is actually only one term involving X, and it also
involves variable A. So, for each value of A, we compute the sum over X of P(x|a). But wait! We
know what this value is! If we fix a and sum over x, these probabilities have to add up to 1.
So, rather than adding another factor to our expression, we can just remove the whole sum. In
general, the only nodes that will have an influence on the probability of D are its ancestors.
Now, it’s time to eliminate S. We find that there are three terms involving S, and we gather them
into the sum. These three terms involve two other variables, B and L. So we have to make a factor
that specifies, for each value of B and L, the value of the sum of products.
Now we can substitute that factor back into our expression. We can also eliminate node S. But in
eliminating S, we’ve added a direct dependency between L and B (they used to be dependent via S,
but now the dependency is encoded explicitly in f2(b). We’ll show that in the graph by drawing a line
between the two nodes. It’s not exactly a standard directed conditional dependence, but it’s still
useful to show that they’re coupled.
Now we eliminate T. It involves two terms, which themselves involve variables A and L. So we make
a new factor f3 of A and L.
We can substitute in that factor and eliminate T. We’re getting close!
Next we eliminate L. It involves these two factors, which depend on variables A and B. So we make a
new factor, f4 of A and B, and substitute it in. We remove node L, but couple A and B.
At this point, we could just do the summations over A and B and be done. But to finish out the
It involves both of our remaining terms, and it seems to depend on variables A and D. However, in
this case, we’re interested in the probability of a particular value, little d of D, and so the variable d
is instantiated. Thus, we can treat it as a constant in this expression, and we only need to generate a
factor over a, which we’ll call f5 of a. And we can now, in some sense, remove D from our network
as well (because we’ve already factored it into our answer).
Finally, to get the probability that variable D has value little d, we simply sum factor f5 over all
values of a. Yay! We did it.
Let’s see how the variable elimination algorithm performs, both in theory and in practice.
First of all, it’s pretty easy to see that it runs in time exponential in the number of variables involved in
the largest factor. Creating a factor with k variables involves making a k+1 dimensional table. If you have
b values per variable, that’s a table of size b^(k+1). To make each entry, you have to multiply at most n
numbers, where n is the number of nodes. We have to do this for each variable to be eliminated (which
is usually close to n). So we have something like time = O(n^2 b^k).
How big the factors are depends on the elimination order. You’ll see in one of the recitation exercises
just how dramatic the difference in factor sizes can be. A bad elimination order can generate huge
factors.
So, we’d like to use the elimination order that generates the smallest factors. Unfortunately, it turns out
to be NP hard to find the best elimination order.
At least, there are some fairly reasonable heuristics for choosing an elimination order. It’s usually done
dynamically. So, rather than fixing the elimination order in advance, as we suggested in the algorithm
description, you can pick the next variable to be eliminated depending on the situation. In particular,
one reasonable heuristic is to pick the variable to eliminate next that will result in the smallest factor.
This greedy approach won’t always be optimal, but it’s not usually too bad.
There is one case where Bayes net inference in general, and the variable elimination algorithm in
particular is fairly efficient, and that’s when the network is a polytree. A polytree is a network with no
cycles. That is, a network in which, for any two nodes, there is only one path between them. In a
polytree, inference is linear in the size of the network, where the size of the network is defined to be the
size of the largest conditional probability table (or exponential in the maximum number of parents of
any node). In a polytree, the optimal elimination order is to start at the root nodes, and work
downwards, always eliminating a variable that no longer has any parents. In doing so, we never
introduce additional connections into the network.
So, inference in polytrees is efficient, and even in many large non-polytree networks, it’s possible to
When the network is such that the factors are, of necessity, large, we’ll have to turn to a different class
of methods.
1. Approximate inference:
Sampling
So, the other thing that we can do is the stochastic simulation or sampling. In sampling, what we do is
we look at the root nodes of our graph, and attached to this root node is some probability that A is going
to be true, right? Maybe it's .4. So we flip a coin that comes up heads with probability .4 and see if we
get true or false.
We flip our coin, let's say, and we get true for A -- this time. And now, given the assignment of true to A,
we look in the conditional probability table for B given A = true, and that gives us a probability for B.
Now, we flip a coin with that probability. Say we get False. We enter that into the table.
Now, we look in the CPT for D given B and C, for the case where B is false and C is true, and we flip a coin
with that probability, in order to get a value for D.
So, there's one sample from the joint distribution of these four variables. And you can just keep doing
this, all day and all night, and generate a big pile of samples, using that algorithm. And now you can ask
various questions.
Estimate:
P*(D|A) = #D,A / #A
Let's say you want to know the probability of D given A. How would you answer - - given all the
examples -- what would you do to compute the probability of D given A? You would just count. You’d
count the number of cases in which A and D were true, and you’d divide that by the number of cases in
which A was true, and that would give you an unbiased estimate of the probability of D given A. The
more samples, the more confidence you’d have that the estimated probability is close to the true one.
Estimation
Exactly because of the process we’re using to generate the samples, the majority of them will be the
typical cases. Oh, it's someone with a cold, someone with a cold, someone with a cold, someone with a
cold, someone with a cold, someone with malaria, someone with a cold, someone with a cold. So the
rare results are not going to come up very often. And so doing this sampling naively can make it really
hard to estimate the probability of a rare event. If it's something that happens one in ten thousand
times, well, you know for sure you're going to need, some number of tens of thousands of samples to
get even a reasonable estimate of that probability.
Imagine that you want to estimate the probability of some disease given -- oh, I don't know -- spots on
your tongue and a sore toe. Somebody walks in and they have a really peculiar set of symptoms, and
you want to know what's the probability that they have some disease.
Well, if the symptoms are root nodes, it's easy. If the symptoms were root nodes, you could just assign
the root nodes to have their observed values and then simulate the rest of the network as before.
But if the symptoms aren't root nodes then if you do naïve sampling, you would generate a giant table
of samples, and you'd have to go and look and say, gosh, how many cases do I have where somebody
has spots on their tongue and a sore toe; and the answer would be, well, maybe zero or not very many.
There’s a technique called importance sampling, which allows you to draw examples from a distribution
that’s going to be more helpful and then reweight them so that you can still get an unbiased estimate of
the desired conditional probability. It’s a bit beyond the scope of this class to get into the details, but it’s
an important and effective idea.
Recitation Problem
• Do the variable elimination algorithm on the net below using the elimination order A,B,C (that is,
eliminate node C first). In computing P(D=d), what factors do you get?
Find an elimination order that keeps the factors small for the net below, or show that there is
no such order.
Here’s a pretty complicated graph. But notice that no node has more than 2 parents, so none
of the CPTs are huge. The question is, is this graph hard for variable elimination? More
concretely, can you find an elimination order that results only in fairly small factors? Is there
an elimination order that generates a huge factor?
Bayesian networks (or related models) are often used in computer vision, but they almost
always require sampling. What happens when you try to do variable elimination on a model
like the grid below?