0% found this document useful (0 votes)
5 views30 pages

Compiler Design Notes

The document provides comprehensive revision notes on compiler design, covering key topics such as compiler phases, lexical analysis, and parsing techniques. It includes historical milestones in compiler development, various types of compilers, and detailed explanations of the compilation process, symbol tables, and intermediate representations like Three-Address Code. Additionally, it discusses the role of lexical analyzers, the use of regular expressions, and the design of lexical analyzer generators.

Uploaded by

vectorsigma389
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views30 pages

Compiler Design Notes

The document provides comprehensive revision notes on compiler design, covering key topics such as compiler phases, lexical analysis, and parsing techniques. It includes historical milestones in compiler development, various types of compilers, and detailed explanations of the compilation process, symbol tables, and intermediate representations like Three-Address Code. Additionally, it discusses the role of lexical analyzers, the use of regular expressions, and the design of lexical analyzer generators.

Uploaded by

vectorsigma389
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

COMPILER DESIGN

Complete Revision Notes


UE23CS351B | PES University
Prof. Preet Kanwal
Units 1 & 2 | Full Coverage for Exams

Coverage: Compilers, Language Processing, Compiler Phases, Symbol Table, Lexical


Analysis, Lex Tool, RE-NFA-DFA, C Grammar, Parsers, LL(1), LR(0), SLR, CLR, LALR,
Syntax Directed Definitions
UNIT 1: Compilers, Lexical Analysis & Top-Down Parsers
1.1 Introduction to Compilers
Compiler: A program that reads source code in one language and translates it into an equivalent
program in another (usually lower-level) language.
Interpreter: Directly executes source program operations. Slower to execute but provides better error
diagnostics.

History of Compilers
Year Event
1951 First practical compiler by Corrado Bohm (PhD thesis)
1952 First implemented compiler: Grace Hopper's A-0 System. Hopper coined "compiler".
1954-57 FORTRAN (FORmula TRANslation) - first widely used high-level language & complete
compiler (IBM, John W. Backus)
1969-73 C language created by Ken Thompson & Dennis Ritchie alongside Unix. B compiler
bootstrapped.
1987 GCC (GNU Compiler Collection) released by Richard Stallman - initially only C

Variants of Compilers
Variant Description Example
Self-hosting Compiler written in the same language it compiles. GCC (C), Java, Python
Can compile itself.
Bootstrapping Technique for producing a self-hosting compiler. C written in B; C++ in C
Start with core compiler in another language. (Cfront)
Native Compiler Generates code for the same platform it runs on. Turbo C
Cross Compiler Generates code for a different platform than Linux/x86 building for
where it runs. Used for embedded systems. Arduino/ARM
Transpiler Source-to-source: translates between high-level Emscripten (C++ to
languages at same abstraction level. JavaScript)
Decompiler Low-level to high-level. Inherently difficult (loses Ghidra (written in C++)
info during compilation).
JIT Compiler Compiles code just-in-time during runtime. Java Virtual Machine (JVM)
Combination of AOT compiler and interpreter.
Compiler-compiler Generates parser/compiler from formal grammar Lex, YACC
description. Parser generators use BNF/EBNF.
1.2 Language Processing System
Pipeline: Source to Executable
Stage Tool Input Output GCC Command
Preprocessing Preprocessor Source .c file .i file (macros gcc -E hello.c > hello.i
(cpp) expanded)
Compilation Compiler (gcc) .i preprocessed .s assembly file gcc -S hello.i
file
Assembly Assembler (as) .s assembly .o relocatable as hello.s -o hello.o
object file
Linking Linker (ld) .o object files + executable gcc hello.o
libraries ([Link])
Loading OS Loader executable process in ./[Link]
memory

Static vs Shared Libraries


Property Static Library (.a / .lib) Shared Library (.so / .dll)
Linking Machine code copied into executable Only a small table added; code loaded
at runtime
Executable size Larger Smaller
Version Compiled-in version always runs Latest version auto-used (no
recompile needed)
Memory Each process has own copy One copy in memory shared by all
processes
Problem Larger binaries DLL Hell (early Windows)
1.3 Phases of a Compiler
The compiler works in 7 phases. All phases have access to the Symbol Table and Error Handler.
Phase Name Input Output
1 Lexical Analysis Source characters Token stream
(Scanner)
2 Syntax Analysis (Parser) Token stream Parse Tree / AST
3 Semantic Analysis Parse Tree Annotated parse tree (type-checked)
4 Intermediate Code Annotated parse tree Three-Address Code (TAC)
Generation
5 Machine-Independent TAC Optimized IR (DAG, CSE, etc.)
Optimization
6 Code Generation Optimized IR Target assembly code
7 Machine-Dependent Assembly Optimized target code
Optimization

Front End vs Back End


Front End (Analysis) Back End (Synthesis)
Phases 1-3 + IR Generation Optimization + Code Gen
Produces IR + Symbol Table Target program from IR
Complexity O(n) or O(n log n) NP-complete in general
Automation Mostly automated (lex/yacc) Much harder to automate
Focus Recognize legal programs, report errors Instruction selection, scheduling,
register allocation

IR Optimizations (Compiler Middle End)


• Constant Folding: Evaluate constant expressions at compile time. E.g., 2+3 -> 5
• Constant Propagation: Replace variables with known constant values.
• Common Subexpression Elimination (CSE): Replace identical sub-expressions with a single
variable.
• Dead Code Elimination: Remove unreachable or unused code.

Parse Tree vs AST: Parse Tree (Concrete Syntax Tree) retains all lexemes including non-
terminals and punctuation - large and complicated. AST (Abstract Syntax Tree) retains only
information needed for code generation - no punctuation, no non-terminals. All nodes are
tokens + dummy structural nodes.
1.4 Symbol Table
Definition: A data structure created and maintained by the compiler to store information about every
identifier (variable, function, type) in the source program.

Operations
Operation Description
insert(name, type, ...) Add a new identifier with its attributes
lookup(name) Retrieve attributes of an identifier
delete(name) Remove an identifier (at end of scope)
update(name, ...) Modify attributes of an existing identifier

Implementation Options
Structure Search Time Notes
Linear List O(n) Simple but slow for large programs
Hash Table O(1) average Most commonly used; fast lookup
Tree (BST) O(log n) Ordered; useful for structured scoping
Linked lists of hash tables O(1) + scoping Used for nested scopes
Each entry stores: name, type, size, scope level, offset in memory, and any other relevant attributes.
1.5 Three-Address Code (TAC / 3AC)
An intermediate representation where each instruction has at most one operator and at most three
addresses (operands).
Form: x = y op z (at most 1 operator per instruction)
Complex expressions are broken using temporary variables t0, t1, ...

TAC Example: while (number > 0) { factorial = factorial * number; --number; }


L0: if number > 0 goto L1
goto L2
L1: t1 = factorial * number
factorial = t1
t2 = number - 1
number = t2
goto L0
L2: ...

Static Single Assignment (SSA): Each variable is assigned exactly once. Simplifies data flow
analysis. Used in modern compilers like GCC and LLVM.
1.6 Lexical Analysis
Key Definitions
Term Definition Example
Token Abstract symbol representing a type of lexical unit. <ID, 3> or <keyword, while>
Structure: <token-name, attribute-value>
Lexeme Actual sequence of characters in source that "number", "factorial", "while"
matches a pattern
Pattern Rule (regex) defining valid lexemes for a token [a-zA-Z_][a-zA-Z0-9_]* for
identifiers

Common Token Classes


Token Pattern Example Lexemes
keyword Exact string match if, while, for, int, return
id [a-zA-Z_]+[a-zA-Z0-9_]* sum, avg1, factorial
number digit+(.\ digit+)?([eE][-+]?digit+)? 10001, 3.14, 1.5e3
Comparison <|>|<=|>=|==|!= <=, !=
Assign = =
ArithOp +|-|*|/ +, *
Inc_op / ++|-- ++, --
Dec_op
Punctuation (|)|{|}|[|]|;|, ;, (, )

Role of Lexer
• Primary: Convert source into token stream by grouping characters into lexemes
• Enter lexemes representing identifiers into the symbol table
• Strip out comments and whitespace
• Track line numbers; correlate errors with source line numbers
• Expand macros
Why separate from parser? (1) Simplicity - comments removed before parser; (2) Efficiency -
specialized buffering; (3) Portability - input specifics isolated to lexer.

Buffer Pairs & Sentinels


Used for efficient character reading with lookahead:
• Buffer Pairs: Two buffers of N bytes (N = disk block size = 4096). Alternately reloaded. Two
pointers: lexeme_begin (start of current lexeme) and forward (scans ahead). When forward
reaches buffer end, reload other buffer.
• Sentinels: EOF added at end of each buffer. Avoids double-checking buffer boundaries - only
one check per character (is it EOF?). If EOF in middle of buffer = end of input. If EOF at buffer
end = reload buffer.

Lexical Errors
Error Type Example
Spelling error di{...} instead of if{...}
Unmatched string printf("hi); missing closing quote
Illegal characters printf("hi")$$; -- $$ not valid
Identifier too long External/global identifiers in C cannot exceed 31 characters
Error messages must: pinpoint the error, be specific, be understandable, and not be redundant.
1.7 Lex Tool
Lex: Reads a specification file with regular expressions and generates a C routine ([Link].c) that
performs lexical analysis. YACC handles grammar-based parsing.

Structure of a .l File
%{
/* C code copied verbatim (includes, declarations) */
%}

/* Definitions section: substitution strings */


letter [a-zA-Z_]
digit [0-9]

%%

/* Rules section: pattern { action } */


{letter}({letter}|{digit})* { printf("ID: %s\n", yytext); }
[0-9]+ { yylval = atoi(yytext); return INTEGER; }
\n { yylineno++; }
. { /* ignore */ }

%%

/* Auxiliary functions */
int yywrap() { return 1; }
int main() { yylex(); return 0; }

Predefined Lex Variables


Variable/Function Type Description
yylex() int Main entry point. Call to invoke lexer. Returns token type.
yytext char* Pointer to matched string (lexeme)
yyleng int Length of matched string
yylval union Value associated with token (passed to parser)
yyin FILE* Input file (default: stdin)
yyout FILE* Output file (default: stdout)
yylineno int Current line number (use %option yylineno)
yywrap() int Called when input exhausted. Return 1 if done, 0 for more.
ECHO macro Writes matched string to yyout. Default action for unmatched.
BEGIN condition macro Switch start condition
INITIAL condition Default start condition

Compilation and Running


lex echo.l # generate [Link].c
gcc [Link].c # compile to [Link]
./[Link] # run
./[Link] < [Link] # redirect input

Maximal Munch Rule (Most Important!)


Maximal Munch Rule (Longest Match Rule):
1. The LONGEST match wins (among all patterns that match a prefix of input).
2. Among patterns matching the SAME length, the EARLIEST rule in the file wins.
Example: Patterns: abb->1, aba->2, a->3. String "ababa" -> "aba" matches rule 2 (length 3 > length 1
for "a"), then "b" unmatched, then "a" matches rule 3. Output: 2b3
Failure case: Patterns a*->1, ab->2, bb->3. String "aabbb": maximal munch gives "aa"->1, "bb"->3, last
"b" unmatched. Without maximal munch: a->1, ab->2, bb->3.

Regular Expressions in Lex


Pattern Matches
. Any character except newline
\n Newline
\t Tab
^ Beginning of line
$ End of line
? Zero or one copy of preceding expression
* Zero or more copies
+ One or more copies
a|b a or b (alternation)
(ab)+ One or more copies of ab
[abc] One of: a, b, c
[a-z] Any letter a through z
[^ab] Anything except a or b
[ \t\n]+ Whitespace

Start Conditions
• Mechanism for conditionally activating patterns in Lex.
• Inclusive (%s): Being in condition scn ADDS those rules to the active set.
• Exclusive (%x): Being in condition scn means ONLY those rules (prefixed with <scn>) are
active.
• Use case: multiline comments, quoted strings, indentation handling (Python).
%x COMMENT
%%
"/*" BEGIN(COMMENT);
<COMMENT>"*/" BEGIN(INITIAL);
<COMMENT>. ; /* discard comment content */
<COMMENT>\n yylineno++;
1.8 Design of Lexical Analyzer Generator: RE -> NFA -> DFA
Pipeline
Lex internally converts each pattern (RE) to NFA, combines all NFAs, then converts to DFA for efficient
simulation.
Step Process Algorithm
1 Each RE pattern -> NFA Thompson's Construction
2 Combine NFAs with new start state + epsilon Union construction
transitions to each NFA's start
3 Combined NFA -> DFA Subset Construction
4 DFA used to simulate lexer Maximal munch on input

Thompson's Construction (RE to NFA)


RE Pattern NFA Construction
Single char a (start) --a--> (accept)
epsilon (start) --eps--> (accept)
r1 | r2 (union) New start with eps to both NFA starts; both NFA accepts to new accept
r1 r2 (concat) NFA1 accept merges with NFA2 start via eps
r* (Kleene star) New start & accept; eps to inner NFA; inner accept loops back; eps to new
accept

Subset Construction (NFA to DFA)


• State in DFA = SET of NFA states (epsilon-closure of reachable states)
• Start state of DFA = epsilon-closure({NFA start state})
• DFA state is accepting if any NFA state in the set is accepting
• For conflicts: if multiple accepting NFA states, use earliest pattern rule

Lookahead Operator in Lex (/)


r1/r2: Match r1 only if followed by r2. The r2 part is NOT consumed. Example: fortran/ means match
"fortran" only if not followed by alphanumeric.
1.9 C Language Grammar
P -> S
S -> Declr; S | Assign; S | if (Cond) {S} S | while (Cond) {S} S |
if (Cond) {S} else {S} S | for (Assign; Cond; UnaryExpr) {S} S |
return E; S | lambda
Declr -> Type ListVar
Type -> int | float
ListVar -> X | ListVar, X
X -> id | Assign
Assign -> id = E
Cond -> E RelOp E
RelOp -> < | > | <= | >= | == | !=
UnaryExpr -> E++ | ++E | E-- | --E

E -> E + T | E - T | T
T -> T * F | T / F | F
F -> G ^ F | G
G -> (E) | id | num

Token Counting Example


Code: while (number > 0) { factorial = factorial * number; --number; }
Tokens (17): <while>, <(>, <ID:number>, <>>, <0>, <)>, <{>, <ID:factorial>, <=>, <ID:factorial>, <*>,
<ID:number>, <;>, <-->, <ID:number>, <;>, <}>
1.10 Parsing Overview
Error Recovery Strategies
Strategy Mechanism Notes
Panic Mode Discard input until synchronizing token Simple; may skip large portions
Recovery (;, ), }) found, then restart
Phrase Level Local corrections: insert missing ;), Must be careful not to cause further
Recovery replace erroneous token errors
Error Productions Add common errors as explicit Complicates grammar
grammar productions with informative
actions
Global Correction Find minimum edit distance from Extremely hard to implement; may not
erroneous to correct program match programmer intent

Top-Down vs Bottom-Up Parsers


Property Top-Down (LL) Bottom-Up (LR)
Direction Start symbol to terminals Terminals to start symbol
Derivation Leftmost derivation Reverse rightmost derivation
Grammar Must be left-factored and left- No such restrictions needed
requirements recursion-free
Input reading Left to right Left to right
Types RDP, LL(1)/Predictive Parser LR(0), SLR(1), LALR(1), CLR(1)
Power Less powerful More powerful (LL(1) subset LR(1))
1.11 Grammar Transformations for Top-Down Parsing
1. Left Factoring
Factor out common prefixes to avoid ambiguity in first-choice prediction.
Before: S -> iCtSex | iCtS | a
After: S -> iCtSX | a X -> eS | lambda

2. Elimination of Left Recursion


Algorithm (Immediate Left Recursion):
Given: A -> A alpha1 | A alpha2 | ... | beta1 | beta2 | ...
Replace with:
A -> beta1 A' | beta2 A' | ...
A' -> alpha1 A' | alpha2 A' | ... | lambda
Classic Example:
E -> E + T | T => E -> T E'
T -> T * F | F E' -> + T E' | lambda
F -> id | num | (E) T -> F T'
T' -> * F T' | lambda
F -> id | num | (E)

For Indirect Left Recursion: Order the non-terminals. Replace each Ai in Aj productions (j < i) with
Ai's body to convert indirect to immediate, then eliminate.
1.12 Recursive Descent Parsing (RDP)
• A top-down parser that implements one recursive procedure per non-terminal.
• Parses without backtracking if grammar is suitable.
• Power: Accepts LL(*) grammars - MORE powerful than LL(1) table-driven parsers (which
accept only LL(1)).
• Grammar must be left-factored and left-recursion-free.
• Implementation language must support recursion efficiently.

RDP vs Predictive RDP (without backtracking) Predictive/LL(1) Parser


Parser
Mechanism Mutually recursive procedures, one Explicit stack + parsing table
per non-terminal
Grammar class LL(*) - more powerful LL(1) only
Lookahead Flexible (can look ahead multiple Exactly 1 token
tokens)
Implementation Manual coding required Generated from grammar
Error recovery More flexible Panic mode with SYNC tokens
1.13 FIRST and FOLLOW Sets
FIRST Set Rules
FIRST(alpha) = set of terminals that can appear as first symbol of strings derived from
alpha
Rule 1: FIRST(t) = {t} where t is any terminal
Rule 2: If X -> lambda, add lambda to FIRST(X)
Rule 3: If X -> Y1 Y2 ... Yn:
Add FIRST(Y1) to FIRST(X)
If lambda in FIRST(Y1), add FIRST(Y2), etc.
If lambda in FIRST(Yi) for ALL i, add lambda to FIRST(X)
Key: FIRST sets never contain $

FOLLOW Set Rules


FOLLOW(A) = set of terminals that can appear immediately to right of A in some
sentential form
Rule 1: FOLLOW(S) = {$} where S is start symbol
Rule 2: If A -> alpha B beta:
Add (FIRST(beta) - lambda) to FOLLOW(B)
If lambda in FIRST(beta), also add FOLLOW(A) to FOLLOW(B)
Rule 3: If A -> alpha B (B at end of production):
Add FOLLOW(A) to FOLLOW(B)
Key: FOLLOW sets never contain lambda; always contain $ for start symbol

Worked Example: Arithmetic Grammar


E -> T E' E' -> + T E' | lambda T -> F T' T' -> * F T' | lambda F ->
id | num | (E)

FIRST(F) = {id, num, (}


FIRST(T) = FIRST(F) = {id, num, (}
FIRST(E) = FIRST(T) = {id, num, (}
FIRST(E') = {+, lambda}
FIRST(T') = {*, lambda}

FOLLOW(E) = {$, )}
FOLLOW(E') = FOLLOW(E) = {$, )}
FOLLOW(T) = (FIRST(E') - lambda) + FOLLOW(E) = {+} + {$,)} = {+, $, )}
FOLLOW(T') = FOLLOW(T) = {+, $, )}
FOLLOW(F) = (FIRST(T') - lambda) + FOLLOW(T) = {*} + {+,$,)} = {*, +, $, )}
1.14 LL(1) Predictive Parsing
LL(1) Table Construction
For each production A -> alpha:
Step 1: For each terminal a in FIRST(alpha), add A->alpha to M[A, a]
Step 2: If lambda in FIRST(alpha), for each terminal b in FOLLOW(A), add A-
>alpha to M[A, b]
Step 3: If lambda in FIRST(alpha) and $ in FOLLOW(A), add A->alpha to M[A,
$]

LL(1) Parsing Algorithm


Stack: [S, $] Input: [w, $]

while ([Link] != $):


X = [Link]
a = next_input_symbol
if X is terminal:
if X == a: pop X, advance input
else: ERROR
else if M[X, a] = X -> Y1 Y2 ... Yk:
pop X
push Yk, ..., Y2, Y1 (Y1 on top)
else: ERROR

if [Link] == $ and input == $: ACCEPT

When is a Grammar NOT LL(1)?


Condition Meaning Example
FIRST(alpha) & FIRST(beta) Two alternatives A->alpha|beta A -> ab | ac (need left factoring)
not empty share first terminal
FIRST(alpha) & FOLLOW(A) When A can vanish, its follow X -> eS|lambda where e in
not empty (for A- conflicts with another option FOLLOW(X)
>alpha|lambda)
Grammar not left-factored Common prefixes cause M[A,t] S -> iCtS | iCtSeS
to have multiple entries
Left recursion present Parser enters infinite loop E -> E + T | T

LL(k) Grammars
• LL(k): Choose production after reading k lookahead symbols.
• Hierarchy: LL(k) is a strict subset of LL(k+1).
• S -> aaBb | aaCb requires k=3 to distinguish (B or C) -> LL(3) not LL(1)
• Key fact: LL(1) is strictly contained in LR(1). Every LL(1) grammar is LR(1) but not vice versa.

Error Recovery in LL(1): Panic Mode with SYNC Tokens


For each non-terminal X, add SYNC tokens under elements of FOLLOW(X) in the parsing table:
if M[X, a] == blank: // syntax error
ignore input symbol a // skip it
if M[X, a] == sync:
if X is ONLY symbol on stack: ignore a // keep X, skip input
else: pop X from stack // discard X, let ancestor handle
UNIT 2: Bottom-Up Parsers & Syntax Directed Definitions
2.1 Bottom-Up Parsing Concepts
Key Definitions
Term Definition
Handle Substring at top of stack that matches RHS of a production rule; can be
reduced.
Handle Pruning Process: iteratively find handle, reduce to LHS. Repeat until start symbol.
Viable Prefix Any prefix of a right sentential form that does not extend past the rightmost
handle. The LR(0) automaton recognizes exactly viable prefixes.
Reduction Replace TOS substring (matching RHS) with LHS of production. Bottom-up
parsing is REVERSE rightmost derivation.
Rightmost Derivation Output of bottom-up parser when w in L(G)
in Reverse

Shift-Reduce Parsing Actions


Action Description
Shift Push next input symbol onto stack; advance input pointer
Reduce Pop |RHS| symbols from stack; push LHS symbol. If reducing A->alpha (|alpha|=k),
pop 2k elements (k symbols + k states), push A, push GOTO[state, A]
Accept Stack contains only start symbol S and input is $
Error No valid action; invoke error recovery

Conflicts in Shift-Reduce Parsing


Conflict Type Cause Example Resolution in
YACC
Shift-Reduce TOS matches some RHS Dangling else: if (c) if (c) s Prefer SHIFT
AND is prefix of another else s - is else for inner or (associates else
production (can shift or outer if? with innermost if)
reduce)
Reduce-Reduce TOS matches RHS of two id could reduce to expr or Use FIRST rule
different productions parameter listed in grammar
2.2 LR Parser Types
LR(k): L=left-to-right scan, R=rightmost derivation reversed, k=lookahead tokens.
Parser Lookahead for States Power Notes
Reduce
LR(0) None - reduces entire n states Weakest Reduces blindly; no grammar
row with epsilon-prod is LR(0)
SLR(1) FOLLOW(LHS) Same n states Moderate Uses same LR(0) automaton,
as LR(0) better reduce precision
LALR(1) Precise lookahead Same n states Strong YACC generates LALR(1);
(merged LR(1) states) as LR(0) practical for real languages
CLR(1)/LR(1) Exact lookahead from Can be O(2^n) Strongest Impractical for large
LR(1) items grammars; too many states

Power Hierarchy: LR(0) ⊂ SLR(1) ⊂ LALR(1) ⊂ CLR(1)


State counts: n_LR(0) = n_SLR = n_LALR <= n_CLR
Also: LL(1) ⊂ LR(1). Every LL(1) grammar is LR(1).

LR Automaton Construction
LR(0) automaton is used by LR(0) and SLR(1). LR(1) automaton is used by LALR(1) and CLR(1).
2.3 LR(0) Parser
LR(0) Items
An LR(0) item is a production with a dot (.) on its RHS showing parsing progress.
A -> . X Y Z (nothing seen yet, expecting X next)
A -> X . Y Z (X seen, expecting Y next)
A -> X Y . Z (X, Y seen, expecting Z)
A -> X Y Z . (entire RHS seen - FINAL/KERNEL item - can REDUCE)

LR(0) Automaton Construction Algorithm


Steps to build LR(0) automaton:
1. Augment grammar: Add S' -> S as new start production
2. Number all original productions (for reduce actions)
3. Start state I0: Closure({S' -> . S})
Closure(I): If A -> alpha . B beta in I, add all B -> . gamma to I. Repeat.
Goto(I, X): Shift dot past X in all items containing . X. Take closure.
4. Repeat Goto for every symbol after every dot until no new states

LR(0) Table Construction


Situation Action
State I has transition to J on terminal ACTION[I, a] = s J (shift to state J)
a
State I has GOTO on non-terminal A GOTO[I, A] = J
to state J
State I has FINAL item A -> alpha . ACTION[I, a] = r k for ALL terminals a (entire row) - weakness
of LR(0)
State I contains S' -> S . ACTION[I, $] = accept
LR(0) Parsing Algorithm: On reduce A -> alpha (|alpha| = k): pop 2*k symbols off stack, then push A,
then push GOTO[current_state, A].

LR(0) Conflicts
Conflict Cause Example
Shift-Reduce State has both a shift item (A -> E -> T . +E and E -> T .
alpha . a beta) and a final item (B ->
gamma .)
Reduce-Reduce State has two or more final items A -> alpha . and B -> beta . both in
same state
Any grammar with Final item (A -> .) in I0 causes r/r Y -> lambda always causes LR(0) failure
epsilon productions conflict with shift items
2.4 SLR(1) Parser
Key Improvement over LR(0): Uses FOLLOW sets to restrict where reductions are placed.
SLR(1) Rule: For final item A -> alpha . in state I:
Add ACTION[I, a] = r_k ONLY for terminals a in FOLLOW(A)
(NOT the entire row as in LR(0))

SLR(1) Conflicts
Conflict Condition Solution
Shift-Reduce Shift terminal is in FOLLOW(LHS) of Not SLR(1) - need CLR/LALR
reduce
Reduce-Reduce FOLLOW(A) & FOLLOW(B) not Not SLR(1) - need CLR/LALR
empty, where A->alpha. and B-
>beta. in same state
No conflict Shift terminals and FOLLOW sets Grammar is SLR(1)
are all disjoint

Viable Prefixes
• A viable prefix is a prefix of a right sentential form that does not extend past the rightmost
handle.
• The parser stack always contains a viable prefix - never allows an invalid prefix.
• The LR(0) automaton recognizes EXACTLY the set of viable prefixes.
• This property (valid prefix property) is shared by both LL(1) and LALR(1) parsers.
2.5 CLR(1) / LR(1) Parser
LR(1) Items
An LR(1) item = [LR(0) item, lookahead set]
A -> alpha . beta , {a, b, c}

Initial item: S' -> . S , {$}

For parent item A -> alpha . B beta , a:


Child item B -> . gamma , FIRST(beta a)

LR(1) Lookahead Calculation


Lookahead Calculation Rule:
Parent: A -> alpha . B beta , a
Child: B -> . gamma , FIRST(beta a)
Where "a" is the lookahead of the parent item and "beta" is what follows B in the parent.
If beta is empty (B is at end): Child lookahead = {a} (propagated from parent)

CLR(1) Table Construction


Situation Action
Transition from I to J on terminal a ACTION[I, a] = s J
Transition from I to J on non-terminal GOTO[I, A] = J
A
Final item A -> alpha ., {b, c} in state ACTION[I, b] = r k AND ACTION[I, c] = r k (ONLY for
I lookahead tokens)
S' -> S . , {$} ACTION[I, $] = accept

CLR(1) Conflicts
Conflict Cause
Reduce-Reduce Two final items in same state with overlapping lookaheads: [A -> alpha., {a}] and
[B -> beta., {a}]
Shift-Reduce Final item [A -> alpha., {a}] and shift item [B -> gamma . a delta] in same state
Problem: CLR(1) can create O(2^n) states - impractical for large grammars. Solution: LALR(1)
2.6 LALR(1) Parser
Strategy: Build CLR(1) automaton, then MERGE states with identical LR(0) cores (same productions
with dots in same positions) but different lookaheads. Union the lookahead sets.

LALR(1) Key Properties:


- Same number of states as LR(0) automaton (n_LALR = n_LR(0))
- Better lookahead precision than SLR (uses LR(1) lookaheads, not FOLLOW
sets)
- YACC generates LALR(1) parsers
- Merging CAN introduce r/r conflicts not in CLR (but CANNOT introduce new
s/r conflicts)

LALR Conflicts
Conflict Type In LALR? Explanation
Reduce-Reduce Possible (can If merged states have final items with overlapping lookaheads
be introduced from different states: [A->a.,{a}] merged with [B->b.,{a}] gives r/r
by merging) conflict on "a"
Shift-Reduce CANNOT be S/R conflict in LALR must also exist in CLR. Merging only
introduced by unifies lookaheads, never creates new shifts.
merging

Merging Example
Suppose CLR has I4 = {A -> alpha., {=, $}} and I11 = {A -> alpha., {=}} with same LR(0) core.
After merge: I4-11 = {A -> alpha., {=, $}}. Note the merged state's transitions must also be merged.
Check: If I4 went to state X on some symbol and I11 went to state Y on same symbol, they must be X-
Y merged state.
2.7 Parser Comparison Summary
Feature LR(0) SLR(1) LALR(1) CLR(1)
Lookahead for reduce None (entire FOLLOW(A) LR(1) lookahead LR(1) lookahead
row) (precise)
Number of states n n (same as n (same as Can be >> n
LR(0)) LR(0))
Power Weakest Moderate Very strong Strongest
Practical use Theoretical only Rarely used now YACC, most Rarely (too many
parsers states)
Automaton type LR(0) LR(0) LR(1) merged LR(1)
Error detection Slowest Faster Fast Fastest
(spurious (immediate)
reductions)
Handles epsilon No Sometimes Usually Yes
grammars

Grammar Classification Reference Table


Grammar LR(0) SLR LALR CLR Notes
S->AA, A->aA|b Yes Yes Yes Yes Classic example
E->T+E|T, T->id No(s/r) Yes Yes Yes SLR resolved by
FOLLOW(E)={$}
S->A|a, A->a No(r/r) No(r/r) No(r/r) No(r/r) Ambiguous grammar - no
parser handles
S->AaAb|BbBa, A- No(r/r) No(r/r) Yes Yes SLR fails:
>lambda,B->lambda FOLLOW(A)=FOLLOW(B)
S->L=R|R, L->*R|id, R->L No(s/r) No(s/r) Yes Yes Classic SLR failure case
S->Aa|bAc|Bc|bBa, A->d, B- No(r/r) No(r/r) No(r/r) Yes LALR fails; CLR only
>d
S->dA|aB, A->bA|c, B->bB|c Yes Yes Yes Yes All parsers work
S->SS+|SS*|a Yes Yes Yes Yes Postfix notation

LL(1) vs LALR(1)
Property LL(1) LALR(1)
Parse approach Top-down, leftmost derivation Bottom-up, rightmost derivation
reversed
Stack contents Predicted (future) symbols - useful for Past symbols (what has been seen)
error repair
Error detection Earlier (no spurious reductions) May do reductions before detecting
error
Error repair Easier (stack shows expected future) Harder (stack shows past, not
expected future)
Time complexity O(n) O(n)
Property LL(1) LALR(1)
Grammar coverage Subset of LALR(1) Superset of LL(1)
2.8 Syntax Directed Definitions (SDD)
SDD = CFG + Semantic Rules. Grammar symbols have attributes; productions have associated rules
to compute attribute values.
Purpose: Semantic Analysis - compute meanings, perform type checking, build symbol table, generate
code.

Types of Attributes
Type Computed From Traversal Order Example
Synthesized Children's attributes Post-order (bottom-up) [Link] = [Link] + [Link]
(bottom-up)
Inherited Parent's and siblings' Complex (dependency- [Link] = [Link] (type
attributes based) declaration)

SDD Example: Desk Calculator


Production Semantic Rule
E -> E1 + T [Link] = [Link] + [Link]
E -> T [Link] = [Link]
T -> T1 * F [Link] = [Link] * [Link]
T -> F [Link] = [Link]
F -> num [Link] = [Link]
F -> id [Link] = [Link]

S-Attributed SDDs
S-Attributed SDD: Contains ONLY synthesized attributes. Can be evaluated by bottom-up parser in
single pass.
Evaluation: Stack-based during bottom-up parsing. On reduce A -> beta, compute [Link] from beta's
attributes on stack.

Inherited Attributes Example: Type Declarations


Production Semantic Rule
D -> T L [Link] = [Link]
T -> int [Link] = integer
T -> real [Link] = float
L -> L1, id [Link] = [Link]; addType([Link], [Link])
L -> id addType([Link], [Link])
[Link] is an inherited attribute - propagated top-down from T (which has synthesized type) to all L nodes.

Dependency Graphs
• Nodes = attribute instances in the parse tree
• Edge from c to b means b depends on c (compute c first)
• Must be a DAG (directed acyclic graph) - no cycles
• Topological sort gives valid evaluation order
To evaluate an SDD:
1. Construct parse tree for input
2. Build dependency graph (one node per attribute instance)
3. Topological sort the dependency graph
4. Evaluate semantic rules in topological order
5. Result: Annotated parse tree

SDD Examples: Binary Operations


Example Production Semantic Rule
Count 1s B -> 1 [Link] = 1
Count 1s B -> 0 [Link] = 0
Count 1s L -> L1 B [Link] = [Link] + [Link]
Binary to Decimal B -> 0/1 [Link] = 0 or 1
Binary to Decimal L -> L1 B [Link] = 2 * [Link] + [Link]
Count balanced S -> (S1) [Link] = [Link] + 1
parens
Count balanced S -> a [Link] = 0
parens
Type determination T -> [Link] [Link] = float
Type determination T -> num [Link] = integer
Type determination E -> E1 + T if [Link]==float or [Link]==float then
[Link]=float else [Link]=integer
Infix to Postfix E -> E1 + T printf("+")
Sign detection E -> -E1 if [Link]==POS then [Link]=NEG else
[Link]=POS
Sign detection E -> E1*E2 [Link] = POS if [Link]==[Link], else NEG
Quick Reference: Key Algorithms & Facts
LR Parsing Stack Operations
Shift a:
push a onto stack
push next_state onto stack
advance input

Reduce A -> alpha (|alpha| = k):


pop 2*k elements from stack (k symbols + k states)
push A onto stack
push GOTO[current_top_state, A] onto stack
DO NOT advance input

Accept: stack = [S, 1] and input = [$]

Summary: When Does Each Parser Fail?


Parser Fails When...
LR(0) (1) State has both shift item and final item (s/r conflict), OR (2) State has two or more
final items (r/r conflict), OR (3) Grammar has epsilon productions (always creates
conflicts in I0)
SLR(1) (1) Shift terminal is in FOLLOW(LHS) causing s/r conflict, OR (2) FOLLOW(A) &
FOLLOW(B) overlap for two final items in same state
LALR(1) (1) After merging CLR states, final items in merged state have overlapping
lookaheads (r/r conflict). NOTE: Merging cannot create new s/r conflicts.
CLR(1) (1) Two final items in same state with same lookahead (r/r), OR (2) Final item
lookahead matches a shift terminal (s/r). Never fails for practical deterministic
languages.

Error Detection Speed


Parser Error Detection
CLR(1) Fastest - detects error immediately with no spurious reductions
LALR(1) May perform one or a few reductions before detecting error
SLR(1) May perform more spurious reductions
LR(0) Slowest - may perform many spurious reductions (no lookahead)

Final Important Facts:


1. Ambiguous grammars: No parser (LL or LR) can handle without resolving conflicts
manually.
2. No grammar with epsilon productions is LR(0) - always causes conflict in I0.
3. YACC generates LALR(1) parsers. It resolves s/r by preferring shift, r/r by first rule.
4. In LR parsing: always keep state numbers on stack; states alternate with grammar
symbols.
5. S-attributed SDDs can be evaluated in one bottom-up pass (most practical compilers use
this).
6. Dependency graph must be acyclic (DAG) for SDD evaluation to be possible.

You might also like