COMPILER DESIGN
SUBJECT CODE: 303105349
Ami sachin Shah, Assistant Professor
Computer Science & Engineering
CHAPTER-2
Introduction to syntax analysis
Syntax Analyzer
•Syntax Analyzer creates the syntactic structure of the given source program.
•This syntactic structure is mostly a parse tree.
•Syntax Analyzer is also known as parser.
•The syntax of a programming is described by a context-free grammar (CFG). We will use BNF
(Backus- Naur Form) notation in the description of CFGs.
•The syntax analyzer (parser) checks whether a given source program satisfies the rules implied by
a context-free grammar or not.
– If it satisfies, the parser creates the parse tree of that program.
– Otherwise the parser gives the error messages.
•A context-free grammar
– gives a precise syntactic specification of a programming language.
– the design of the grammar is an initial phase of the design of a compiler.
– a grammar can be directly converted into a parser by some tools.
Parser
• Parser works on a stream of tokens.
• The smallest item is a token.
Source token
Lexical Parser
Program parse tree
Analyzer get next token
Parsers (cont.)
• We categorize the parsers into two groups:
1. Top-Down Parser
– the parse tree is created top to bottom, starting from the root.
2. Bottom-Up Parser
– the parse is created bottom to top; starting from the leaves
• Both top-down and bottom-up parsers scan the input from left to right (one symbol at a time).
• Efficient top-down and bottom-up parsers can be implemented only for sub-classes of context-free
grammars.
– LL for top-down parsing
– LR for bottom-up parsing
Context-Free Grammars
•Inherently recursive structures of a programming language are defined by a context-free grammar.
•In a context-free grammar, we have:
– A finite set of terminals (in our case, this will be the set of tokens)
– A finite set of non-terminals (syntactic-variables)
– A finite set of productions rules in the following form
• A→α where A is a non-terminal and
α is a string of terminals and non-terminals (including the empty string)
– A start symbol (one of the non-terminal symbol)
•Example:
E→ E+E | E–E | E*E | E/E | -E
E→ (E)
E → id
Derivations
E ⇒ E+E
•E+E derives from E
– we can replace E by E+E
E ⇒ E+E ⇒ id+E ⇒ id+id
– to able to do this, we have to have a production rule E→E+E in our grammar.
•A sequence of replacements of non-terminal symbols is called a derivation of id+id from E.
αAβ ⇒
•In general a derivation step is
if there is a production rule A→γ in our grammar
αγβ where α and β are arbitrary strings of terminal and non-terminal symbols
α1 ⇒ α2 ⇒ ... ⇒ (αn derives from α1 or α1 derives αn )
⇒
αn
⇒
: derives in one step
⇒
: derives in zero or more steps
: derives in one or more steps
CFG - Terminology
•L(G) is the language of G (the language generated by G) which is a set of
sentences.
•A sentence of L(G) is a string of terminal symbols of G.
ω is a sentence of L(G) iff S ⇒ ω
•If S is the start symbol of G then
where ω is a string of terminals of G.
•If G is a context-free grammar, L(G) is a context-free language.
•Two grammars are equivalent if they produce the same language.
•S ⇒ - If α contains non-terminals, it is called as a sentential form of G.
α - If α does not contain non-terminals, it is called as a sentence of G.
Derivation Example
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
OR
•At each derivation step, we can choose any of the non-terminal in the sentential form of G for the replacement.
•If we always choose the left-most non-terminal in each derivation step, this derivation is called as left-
most derivation.
•If we always choose the right-most non-terminal in each derivation step, this derivation is called as right-
most derivation.
Left-Most and Right-Most Derivations
Left-Most Derivation
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
Right-Most Derivation
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
•We will see that the top-down parsers try to find the left-most derivation of the given source program.
•We will see that the bottom-up parsers try to find the right-most derivation of the given source
program in the reverse order.
Parse Tree
• Inner nodes of a parse tree are non-terminal symbols.
• The leaves of a parse tree are terminal symbols.
• A parse tree can be seen as a graphical representation of a derivation.
E⇒ - E
⇒- E
⇒- E
E - E (E) - E (E+E) - E
( E ) ( E )
E E E + E
⇒- ⇒-
- E - E
( E ) ( E )
(id+E) (id+id)
E + E E + E
Ambiguity
• A grammar produces more than one parse tree for a sentence
is called as an ambiguous grammar.
E
E ⇒ E+E ⇒ id+E ⇒ E + E
⇒ id+id*E ⇒ id+id*id
id+E*E id *
E E
id id
E ⇒ E*E ⇒ E+E*E ⇒
E * E
⇒ id+id*E ⇒ id+id*id
id+E*E E +
id
E
Ambiguity (cont.)
•For the most parsers, the grammar must be unambiguous.
•unambiguous grammar
🡺 unique selection of the parse tree for a sentence
•We should eliminate the ambiguity in the grammar during the design phase of the compiler.
•An unambiguous grammar should be written to eliminate the ambiguity.
•We have to prefer one of the parse trees of a sentence (generated by an ambiguous grammar)
to disambiguate that grammar to restrict to this choice.
Ambiguity (cont.)
stmt → if expr then stmt |
if expr then stmt else | otherstmts
stmt
if E1 then if E2 then S1 else S2
stmt stmt
if expr then stmt else stmt if expr then stmt
E1 if then stmt S2 E1 if expr stmt else
expr then stmt
S1
E2 S1 S2
E2
Ambiguity (cont.)
• We prefer the second parse tree (else matches with closest if).
• So, we have to disambiguate our grammar to reflect this choice.
• The unambiguous grammar will be:
stmt → matchedstmt | unmatchedstmt
matchedstmt → if expr then matchedstmt else | otherstmts
matchedstmt
unmatchedstmt → if expr then stmt |
if expr then matchedstmt else
unmatchedstmt
Ambiguity – Operator Precedence
•Ambiguous grammars (because of ambiguous operators) can be disambiguated according to the
precedence and associativity rules.
E → E+E | E*E | E^E | id | (E)
disambiguate the grammar
precedence: ^ (right to left)
* (left to right)
+ (left to right)
E → E+T | T
T → T*F | F
F → G^F | G
G → id | (E)
Left Recursion
•A grammar is left recursive if it has a non-terminal A such that there is a derivation.
A ⇒ Aα for some string α
•Top-down parsing techniques cannot handle left-recursive grammars.
•So, we have to convert our left-recursive grammar into an equivalent grammar which is not left-
recursive.
•The left-recursion may appear in a single step of the derivation (immediate left-recursion), or
may appear in more than one step of the derivation.
Immediate Left-Recursion
𝖴
A→Aα| β where β does not start with A
eliminate immediate left recursion
A → β A’
A’ → α A’ | ε an equivalent grammar
In general,
A → A α1 | ... | A αm | β1 | ... | βn where β1 ... βn do not
start with A
� eliminate immediate left recursion
A → β A | ... | β A
�
’ ’
1 n
A → α A | ... | α A | ε
’
1
’
m an equivalent grammar
’
Immediate Left-Recursion -- Example
E → E+T | T
T → T*F | F
F → id | (E)
𝖴 eliminate immediate left recursion
E → T E’
E’ → +T E’ | ε
T → F T’
T’ → *F T’ |
ε F → id |
(E)
Left-Recursion -- Problem
• A grammar cannot be immediately left-recursive, but it still can be
left-recursive.
• By just eliminating the immediate left-recursion, we may not
get a grammar which is not left-recursive.
S → Aa | b
A → Sc | d This grammar is not immediately
left-recursive,
but it is still left-recursive.
S ⇒ Aa ⇒
Sca A ⇒ Sc
or
⇒ Aac
causes to a left-recursion
So, we have to eliminate all left-recursions from our grammar
Eliminate Left-Recursion -- Algorithm
- Arrange non-terminals in some order: A1 ... An
- for i from 1 to n do {
- for j from 1 to i-1 do
{ replace each
production
Ai → Aj γ
by
Ai → α1 γ | ... | αk
γ
where Aj → α1 | ...
| αk
}
Eliminate Left-Recursion -- Example
S → Aa | b
A → Ac | Sd | f
-Order of non-terminals: S,
A for S:
- we do not enter
the inner loop.
- there is no
immediate left
recursion in S.
for A:
- Replace A → Sd with A
→ Aad | bd So, we will have A → Ac
| Aad | bd | f
- Eliminate the immediate left-recursion
in A
A → bdA’ | fA’
A’ → cA’ | adA’ | ε
Eliminate Left-Recursion – Example2
S → Aa | b
A → Ac | Sd | f
- Order of non-terminals: A, S
for A:
- we do not enter the inner loop.
- Eliminate the immediate left-recursion in A
A → SdA’ | fA’
A’ → cA’ | ε
for S:
- Replace S → Aa with S → SdA’a | fA’a
So, we will have S → SdA’a | fA’a | b
- Eliminate the immediate left-recursion in S
S → fA’aS’ | bS’
S’ → dA’aS’ | ε
So, the resulting equivalent grammar which is not left-recursive is:
S → fA’aS’ | bS’
S’ → dA’aS’ | ε
A → SdA’ | fA’
A’ → cA’ | ε
Left-Factoring
•A predictive parser (a top-down parser without backtracking) insists that the grammar must be left-
factored.
grammar 🡺 a new equivalent grammar suitable for predictive parsing
stmt → if expr then stmt else stmt |
if expr then stmt
•when we see if, we cannot now which production rule to choose to re-write stmt in the derivation.
Left-Factoring (cont.)
•In general,
A → αβ1 | where α is non-empty and the first symbols
of β1 and β2 (if they have one)are different.
αβ2
•when processing α we cannot know whether expand
A to αβ1 or
A to αβ2
•But, if we re-write the grammar as follows
A → αA’
A’ → β1 | β2 so, we can
immediately expand A to αA’
Left-Factoring -- Algorithm
•For each non-terminal A with two or more alternatives (production rules) with a common non-empty
prefix, let say
A → αβ1 | ... | αβn | γ1 | ... | γm
convert it into
A → αA’ | γ1 | ... | γm
A’ → β1 | ... | βn
Left-Factoring – Example1
𝖴
A → abB | aB | cdg | cdeB | cdfB
A → aA’ | cdg | cdeB | cdfB
𝖴
A’ → bB | B
A → aA’ | cdA’’
A’ → bB | B
A’’ → g | eB |
fB
Left-Factoring – Example2
A → ad | a | ab | abc |
𝖴
b
A → aA’ | b
𝖴
A’ → d | ε | b | bc
A → aA’ | b
A’ → d | ε | bA’’
A’’ → ε | c
Non-Context Free Language Constructs
•There are some language constructions in the programming languages which are not context-free. This
means that, we cannot write a context-free grammar for these constructions.
•L1 = { ωcω | ω is in (a|b)*} is not context-free
🡺 declaring an identifier and checking whether it is declared or not later. We cannot
do this with a context-free language. We need semantic analyzer (which is not context-free).
•L2 = {anbmcndm | n≥1 and m≥1 } is not context-free
🡺 declaring two functions (one with n parameters, the other one with
m parameters), and then calling them with actual parameters.
Top-Down Parsing
•The parse tree is created top to bottom.
•Top-down parser
– Recursive-Descent Parsing
• Backtracking is needed (If a choice of a production rule does not work, we backtrack to try other
alternatives.)
• It is a general parsing technique, but not widely used.
• Not efficient
– Predictive Parsing
• no backtracking
• efficient
• needs a special form of grammars (LL(1) grammars).
• Recursive Predictive Parsing is a special form of Recursive Descent parsing without backtracking.
• Non-Recursive (Table Driven) Predictive Parser is also known as LL(1) parser.
Recursive-Descent Parsing (uses Backtracking)
•Backtracking is needed.
•It tries to find the left-most derivation.
S → aBc
B → bc | b
S S
input: abc
a B c a B c
b c fails, backtrack b
Predictive Parser
a grammar 🡺 🡺 a grammar suitable for predictive
eliminate left parsing (a LL(1) grammar)
left recursion factor no %100 guarantee.
•When re-writing a non-terminal in a derivation step, a predictive parser can uniquely choose a
production rule by just looking the current symbol in the input string.
A → α1 | ... | αn input: ... a .......
current token
Predictive Parser (example)
stmt → if ...... |
while .... |
.. |
begin ....
..
forto ...
•When we are trying write the non-terminal stmt, if the current token is if we have to choose first
production rule. ..
•When we are trying to write the non-terminal stmt, we can uniquely choose the production rule by
just looking the current token.
•We eliminate the left recursion in the grammar, and left factor it. But it may not be suitable for
predictive parsing (not LL(1) grammar).
Recursive Predictive Parsing
•Each non-terminal corresponds to a procedure.
Ex: A → aBb(This is only the production rule for A)
proc A {
- match the current token with a, and move to the next token;
- call ‘B’;
- match the current token with b, and move to the next token;
}
Recursive Predictive Parsing (cont.)
A → aBb | bAB
proc A {
case of the current token {
‘a’: - match the current token with a, and move
to the next token;
- call ‘B’;
- match the current token with b, and move to the next token;
‘b’: - match the current token with b, and move to the next token;
- call ‘A’;
- call ‘B’;
}
}
Recursive Predictive Parsing (cont.)
•When to apply ε-productions.
A → aA | bB | ε
•If all other productions fail, we should apply an ε-production. For example, if the current token is not a or
b, we may apply the ε-production.
•Most correct choice: We should apply an ε-production for a non-terminal A when the current token is
in the follow set of A (which terminals can follow A in the sentential forms).
Recursive Predictive Parsing (Example)
•Content inside
A → aBe | cBd | C
B → bB | ε
C→f
proc C { match the current token with f,
proc A { and move to the next token; }
case of the current token {
a: - match the current token with a,
and move to the next token; proc B {
- call B; case of the current token {
- match the current token with e, b: - match the
current token with b,
and move to the next token; and move to the next token;
c: - match the current token with c, - call B
and move to the next token; e,d: do nothing
- call B; }
- and
match the to
move current token
the next with d,
token; }
f: - call C follow set of B
}
}
first set of C
Non-Recursive Predictive Parsing -- LL(1) Parser
•Non-Recursive predictive parsing is a table-driven parser.
•It is a top-down parser.
•It is also known as LL(1) Parser.
input buffer
stack Non-recursive output
Predictive Parser
Parsing Table
[Link].
ac.i
n