Syntax Analysis
Table of Contents
Syntax Errors........................................................................................................................................2
Context-Free Grammar....................................................................................................................3
Parse Tree.............................................................................................................................................4
Grammar Ambiguity...........................................................................................................................4
Types of Parsing.................................................................................................................................7
Top-Down Parsing.............................................................................................................................8
Eliminating Left Recursion..........................................................................................................8
Left Factoring................................................................................................................................10
Parse Table.....................................................................................................................................13
Predictive Parsing........................................................................................................................15
Bottom-Up Parsing...........................................................................................................................17
LR Parsers.......................................................................................................................................18
SLR Transition Diagram..............................................................................................................19
SLR Parse Table Generation.....................................................................................................21
The role of a syntax analyzer is to create a parse tree from the tokens passed to it
by the lexical analyzer.
Syntax Errors
Syntactical Errors are errors caused by invalid orders of tokens. For example,
arithmetic expressions with unbalanced brackets, two consecutive operands or
operators and the lack of a semicolon at the end of a statement are all errors that will
be caught by the syntax analyzer.
The goal of the syntax error handler is to report the presence of syntactical errors
clearly and accurately, recover from each error quickly and avoid slowing down the
entire compilation process.
There are several error recovery strategies that are attempted:
Panic Mode – Delete input symbols one by one until a symbol is found that
results in the error being resolved.
Phrase Level – The compiler designer can specify some local corrections to
attempt, such as replacing a symbol, inserting a symbol, deleting a symbol,
etc.
Error Productions – We can add grammar to the production rules that detect
potential errors.
Global Corrections – Ideally, we would like to make as few changes to the
input as possible since the more changes we make, the less likely we are to
obtain the intended code. There are algorithms that can detect which set of
changes will lead to an acceptable code with the least changes.
Context-Free Grammar
A context-free grammar consists of a set of terminals, a set of non-terminals, a
start symbol and a set of productions.
Terminals – These are the basic symbols from which the input strings are
formed.
Non-Terminals – Syntactic variables that denote sets of strings.
Start Symbol – A special non-terminal from which the generated strings will
start.
Productions – The grammar that specifies how the terminals and non-
terminals can be combined to form strings.
An example production is given below:
Here, +¿ , ¿, −¿, ¿, ¿, id and num are terminals, E is a non-terminal and E is the start
symbol.
Productions essentially tell us how something on the left-hand side can be replaced
using something on the right-hand side. This replacement process is called
derivation.
By convention, terminals use lower case letters, operators, punctuation, digits and
boldface strings while non-terminals use uppercase letters. The start symbol is
usually denoted as S. Usually, the first production’s left-hand side symbol is the start
symbol. Lowercase Greek letters may also be used to represent strings of grammar
symbols.
Parse Tree
The derivation process can be visually represented using a parse tree.
Grammar Ambiguity
There are cases where a grammar can create multiple parse trees for the same
input.
Grammars that can result in situations like this are said to be ambiguous. Suppose
we have the string ‘2+3*4’. The correct output for this is 14. However, the two parse
trees above will result in two different answers. The first one will process the input as
‘2+(3*4)’ while the second will process it as ‘(2+3)*4’.
There are no direct methods or algorithms that can eliminate the ambiguity. It is a
manual process that we need to check using different inputs. There are ways we can
go about doing this though.
Consider why we ran into the issue in the first place in the example above. The
precedence of the operators was not respected by the grammar. We should always
deal with multiplication first, but the grammar does not specify a rule for this so we
end up being confused about whether to prioritize the multiplication or the addition.
Another possible ambiguity would be for the string ‘2+3+4’. There is an operand (3)
that can be associated with either the operator on the left-hand side or the one on
the right-hand side. Depending on which we do, we end up with a different parse
tree. Even though the results are the same, the grammar is still ambiguous. In such
cases, we should choose the left-hand side. This will lead to correct results.
The way we can integrate these rules into the grammar so that it becomes
unambiguous is as follows:
1. The parse tree derivation should be left recursive, i.e., the left-most non-
terminal should be expanded more. Left-recursive grammars have the
recursive symbol (the same symbol) on the left-hand side.
2. Add non-terminals to the grammar in a way such that there is no choice but to
respect the priority of the operators. The operators with lower priority should
come first and depend on the outputs of the operators with higher priority. In
the example below, it is not possible to computer E without first computing T ,
thus preserving the priority.
Types of Parsing
There are mainly two methods of parsing a string, Top-Down Parsing (LL Parsing)
and Bottom-Up Parsing (LR Parsing). There are subcategories to each of these
types which we will be looking into later.
In LL Parsing, we scan the input from left to right and use the left-most derivation.
In LR Parsing, we scan the input from left to right and use the right-most derivation.
Each of the steps during the derivation process creates a string that is said to be in
sentential form. If the string is the result of right-most derivation, it is said to be in
right-sentential form. If the string is the result of left-most derivation, it is said to
be in left-sentential form. A sentence is a sentential form consisting only of
terminals.
Top-Down Parsing
There are two types of Top-Down Parsing, Recursive Descent Parsing and Non-
recursive Predictive Parsing.
In recursive descent parsing, we randomly pick a production rule to derive a non-
terminal. If we hit a dead end, we backtrack.
In non-recursive predictive parsing, we attempt to predict the correct production
rule to use at each stage. This is a complicated process that requires many steps.
The first step to non-recursive predictive parsing is to modify the grammar in
preparation of predictive parsing. This involves two steps, elimination of left
recursion and performing left factoring.
Eliminating Left Recursion
If we are not careful about choosing our production rules, we can fall into an infinite
recursion.
We need to eliminate this possibility without affecting the language represented by
the grammar.
Suppose we have the following grammar:
A → Aα ∨β
This grammar can trap us in the infinite recursion loop. In reality the grammar is
¿
generating sentences of the form β α . This means that the start of the grammar
must always be β . Thus, we can re-write the grammar as follows:
'
A→ β A
' '
A → ε∨α A
We can use this format to fix other grammars as well.
E → E +T ∨T
'
E→T E
' '
E → ε∨+T E
S → S 0 S 1 S∨01
'
S → 01 S
' '
S → ε ∨0 S 1 S S
Even if we have multiple productions that have left recursion in them, it is just an
expansion of the same process.
Left Factoring
Suppose we have a set of productions where the first symbol of the right-hand side
are the same, i.e., they have common prefixes.
This can cause issues since we will not know which production to use. Suppose we
have the string bE . We might first try A → bC , fail, backtrack, try A → bD , fail,
backtrack and finally get to A → bE. The repeated backtracking is happening
because of the common prefixes.
This issue occurs because we are taking the decision too early. We are deciding
which production to use based on just the first symbol when the decision is affected
by the second symbol as well. To resolve this, we can delay the point at which we
have to decide to a later stage so that the backtracking issue does not occur. This is
called left factoring.
First and Follow
Before we can dive into predictive parsing, we need to understand two functions,
first and follow. The first of a non-terminal symbol is the set of symbols that can
be the starting point for every string derived from that symbol. Suppose we have the
following grammar:
For this grammar, First ( C ) ={ d , f } and First ( A )= { b , d , f }.
For terminal symbols, the first is the symbol itself, i.e., First ( b ) =b.
One caveat to the First function is when ε appears. Suppose we have this grammar:
S → AB
A → b/ε
B→ c
In this case, if A=ε , the first symbol actually becomes the first symbol of B, so
First ( S )= {b , c }. However, First ( A )= { b , ϵ }.
The follow of a non-terminal is the set of terminals that appear immediately after
the symbol. For the grammar above, Follow (C )= { e }. However, Follow ( A )= { $ }, the
symbol used to indicate the end of the input. This is because nothing follows the
starting symbol and A does not appear anywhere else in the grammar. The follow of
any symbol can never contain ε .
In the example above, notice that Follow ( E' )=Follow ( E ). This is because the string
can be repeated multiple times. If the string isn’t followed by anything, then
Follow ( E )= { $ }, but this might not be the case.
Parse Table
The reason we need to know the first and follow of each symbol is to generate a
parse table, which will be used during predictive parsing. Suppose we have the
following grammar.
We can use these values to determine which productions to use when we find a
particular symbol. This can be determined using First ( X ) . The mapping is the parse
table.
There are several things we need to consider here. The table has the terminals as
columns and non-terminals as rows. We are also considered $ to be a terminal but
not ϵ .
For each terminal and non-terminal pair, we write down the production rule that has
the terminal as the first for the non-terminal. For example, First ( E' )=+ ¿, to the cell for
E and +¿ has the production E →+ T E .
' ' '
The only time we look at the Follow column is if the First column contains an ϵ . If it
does, then the terminals in the Follow column should have the production for the
corresponding non-terminal. For example, First ( E' )=ϵ and Follow ( E' )= { $ ,¿ }, so the
cells for $ and E' and ¿ and E' both contain the production E' → ϵ . This makes sense
since using E' → ϵ during parsing may allow us to get the $ and ¿.
Predictive Parsing
Finally, we can look into how the actual parsing process works.
Consider that we have the following grammar:
S → ( S )∨ε
For this grammar, the parse table is:
We now want to parse the string (())$ .
To do this, we use a stack. Initially, the value in the stack is S.
$ S
For each symbol we encounter in the input, we remove a terminal from the stack and
replace it with its derivation. The derived symbols are added in the reverse order.
Input: ¿
$ ¿ S ¿
If the top of the stack matches the input, we remove it and move to the next input
symbol.
$ ¿ S
Input: ¿
$ ¿ ¿ S ¿
$ ¿ ¿ S
Input: ¿
An ε production will just remove the non-terminal.
$ ¿ ¿
$ ¿
Input: ¿
There are however, various scenarios under which we cannot use this parsing
method. If a grammar causes multiple entries in the parsing table, we cannot use it.
Bottom-Up Parsing
There are four types of Bottom-Up Parsing: Operator Precedence Parsing, Simple
LR (SLR), Canonical LR (CLR) and Look Ahead LR (LALR).
During parsing, the prefix of the right-sentential form may appear in the stack. This
prefix is called the viable prefix.
A handle is a substring of the right-sentential form which matches a production rule.
Handles always appear at the top of the stack. When we replace the handle with the
non-terminal on the left-hand side of the production, the process is called handle
pruning.
LR Parsers
All of the LR parsers we will be studying work in the same way as the predictive
parser we saw earlier. The only change will be in how the parsing table is generated.
LR Parsers make shift-reduce decisions. They maintain states, which are generated
from the grammar, which are used to track where we are in the parsing process.
Shift refers to shifting to a new state, while reduce refers to applying a reduction
operation using a handle, i.e., handle pruning.
To identify if a reduction operation should be done, we check if the next input
matches FOLLOW ( A ).
Consider that we have the production rule A → XYZ . It is possible to place a dot (.) at
different positions of this rule as shown below.
A → . XYZ
A → X .YZ
A → XY . Z
A → XYZ .
Thus, this single production gives us four different states.
The dot signifies how much of the right-hand side production we have ‘seen’. The
dot allows us to identify when we have seen the entire right-hand side of the
production. A reduction should only be performed if the entire right-hand side has
been seen.
SLR Transition Diagram
The first step to understanding how the states change is to augment the production
rules. The rules below are a problem, since there are two start states. We need to
modify this so that there is only one start state.
These rules are then used to generate a transition diagram.
Each of the states in the diagram is called an LR(0) item. Each state contains an
initial production, called the kernel item. In addition, we also add the productions for
the non-terminals which are on the right of the dot. For example, the state I 0 starts
with E' →. E , so we add the production of E , which in turn allows us to add the
production of T and so on. The productions we add are called the closure items of
the original production, given by CLOSURE ( X ).
Each of the points at which there is a dot, we consider the symbol on the right of the
dot and create a new state, one where the dot is after the symbol. For example, we
see that there are two productions with . E. These two productions are used to
create the state I 1. This same process is repeated for all of the production rules.
Notice that if the symbol on the right of the dot in the new state is a non-terminal,
we need to add the closure of the production. This is not the case for terminals.
SLR Parse Table Generation
ACTION GOTO
State
Id + * ( ) $ E T F
0 s5 s4 1 2 3
1 s6 acc
2 r2 s7 r2 r2
3 r4 r4 r4 r4
4 s5 s4 8 2 3
5 r6 r6 r6 r6
6 s5 s4 9 3
7 s5 s4 10
8 s6 s11
9 r1 s7 r1 r1
10 r3 r3 r3 r3
11 r5 r5 r5 r5
The SLR Parse Table is generated using the transition diagram. For the action
column, the cell value is the new state to which we should go given the current state
and the terminal which is found on the right of the dot. For the GOTO column, we do
the same for the non-terminals.
For the ACTION column, si represents a new state and acc represents the accept
state. We also have ri values. Here, ri refers to the ith production rule. If a state has a
production where the dot is at the end of the production already, we mark all of the
terminals in FOLLOW ( X ) using that production. For example, in state 2, we have the
closure item E → T ., which comes from the second production, E → T . Thus, we place
r 2 in all of the non-terminals which are in FOLLOW ( E ).
During this processing, if we encounter multiple values in a single cell, the grammar is
ambiguous and it is not possible to parse strings for this grammar. During
processing a string, if we somehow end up in an empty cell, the string is not correct
for this grammar.