Compiler Design Notes
Compiler Design Notes
History of Compilers
Year Event
1951 First practical compiler by Corrado Bohm (PhD thesis)
1952 First implemented compiler: Grace Hopper's A-0 System. Hopper coined "compiler".
1954-57 FORTRAN (FORmula TRANslation) - first widely used high-level language & complete
compiler (IBM, John W. Backus)
1969-73 C language created by Ken Thompson & Dennis Ritchie alongside Unix. B compiler
bootstrapped.
1987 GCC (GNU Compiler Collection) released by Richard Stallman - initially only C
Variants of Compilers
Variant Description Example
Self-hosting Compiler written in the same language it compiles. GCC (C), Java, Python
Can compile itself.
Bootstrapping Technique for producing a self-hosting compiler. C written in B; C++ in C
Start with core compiler in another language. (Cfront)
Native Compiler Generates code for the same platform it runs on. Turbo C
Cross Compiler Generates code for a different platform than Linux/x86 building for
where it runs. Used for embedded systems. Arduino/ARM
Transpiler Source-to-source: translates between high-level Emscripten (C++ to
languages at same abstraction level. JavaScript)
Decompiler Low-level to high-level. Inherently difficult (loses Ghidra (written in C++)
info during compilation).
JIT Compiler Compiles code just-in-time during runtime. Java Virtual Machine (JVM)
Combination of AOT compiler and interpreter.
Compiler-compiler Generates parser/compiler from formal grammar Lex, YACC
description. Parser generators use BNF/EBNF.
1.2 Language Processing System
Pipeline: Source to Executable
Stage Tool Input Output GCC Command
Preprocessing Preprocessor Source .c file .i file (macros gcc -E hello.c > hello.i
(cpp) expanded)
Compilation Compiler (gcc) .i preprocessed .s assembly file gcc -S hello.i
file
Assembly Assembler (as) .s assembly .o relocatable as hello.s -o hello.o
object file
Linking Linker (ld) .o object files + executable gcc hello.o
libraries ([Link])
Loading OS Loader executable process in ./[Link]
memory
Parse Tree vs AST: Parse Tree (Concrete Syntax Tree) retains all lexemes including non-
terminals and punctuation - large and complicated. AST (Abstract Syntax Tree) retains only
information needed for code generation - no punctuation, no non-terminals. All nodes are
tokens + dummy structural nodes.
1.4 Symbol Table
Definition: A data structure created and maintained by the compiler to store information about every
identifier (variable, function, type) in the source program.
Operations
Operation Description
insert(name, type, ...) Add a new identifier with its attributes
lookup(name) Retrieve attributes of an identifier
delete(name) Remove an identifier (at end of scope)
update(name, ...) Modify attributes of an existing identifier
Implementation Options
Structure Search Time Notes
Linear List O(n) Simple but slow for large programs
Hash Table O(1) average Most commonly used; fast lookup
Tree (BST) O(log n) Ordered; useful for structured scoping
Linked lists of hash tables O(1) + scoping Used for nested scopes
Each entry stores: name, type, size, scope level, offset in memory, and any other relevant attributes.
1.5 Three-Address Code (TAC / 3AC)
An intermediate representation where each instruction has at most one operator and at most three
addresses (operands).
Form: x = y op z (at most 1 operator per instruction)
Complex expressions are broken using temporary variables t0, t1, ...
Static Single Assignment (SSA): Each variable is assigned exactly once. Simplifies data flow
analysis. Used in modern compilers like GCC and LLVM.
1.6 Lexical Analysis
Key Definitions
Term Definition Example
Token Abstract symbol representing a type of lexical unit. <ID, 3> or <keyword, while>
Structure: <token-name, attribute-value>
Lexeme Actual sequence of characters in source that "number", "factorial", "while"
matches a pattern
Pattern Rule (regex) defining valid lexemes for a token [a-zA-Z_][a-zA-Z0-9_]* for
identifiers
Role of Lexer
• Primary: Convert source into token stream by grouping characters into lexemes
• Enter lexemes representing identifiers into the symbol table
• Strip out comments and whitespace
• Track line numbers; correlate errors with source line numbers
• Expand macros
Why separate from parser? (1) Simplicity - comments removed before parser; (2) Efficiency -
specialized buffering; (3) Portability - input specifics isolated to lexer.
Lexical Errors
Error Type Example
Spelling error di{...} instead of if{...}
Unmatched string printf("hi); missing closing quote
Illegal characters printf("hi")$$; -- $$ not valid
Identifier too long External/global identifiers in C cannot exceed 31 characters
Error messages must: pinpoint the error, be specific, be understandable, and not be redundant.
1.7 Lex Tool
Lex: Reads a specification file with regular expressions and generates a C routine ([Link].c) that
performs lexical analysis. YACC handles grammar-based parsing.
Structure of a .l File
%{
/* C code copied verbatim (includes, declarations) */
%}
%%
%%
/* Auxiliary functions */
int yywrap() { return 1; }
int main() { yylex(); return 0; }
Start Conditions
• Mechanism for conditionally activating patterns in Lex.
• Inclusive (%s): Being in condition scn ADDS those rules to the active set.
• Exclusive (%x): Being in condition scn means ONLY those rules (prefixed with <scn>) are
active.
• Use case: multiline comments, quoted strings, indentation handling (Python).
%x COMMENT
%%
"/*" BEGIN(COMMENT);
<COMMENT>"*/" BEGIN(INITIAL);
<COMMENT>. ; /* discard comment content */
<COMMENT>\n yylineno++;
1.8 Design of Lexical Analyzer Generator: RE -> NFA -> DFA
Pipeline
Lex internally converts each pattern (RE) to NFA, combines all NFAs, then converts to DFA for efficient
simulation.
Step Process Algorithm
1 Each RE pattern -> NFA Thompson's Construction
2 Combine NFAs with new start state + epsilon Union construction
transitions to each NFA's start
3 Combined NFA -> DFA Subset Construction
4 DFA used to simulate lexer Maximal munch on input
E -> E + T | E - T | T
T -> T * F | T / F | F
F -> G ^ F | G
G -> (E) | id | num
For Indirect Left Recursion: Order the non-terminals. Replace each Ai in Aj productions (j < i) with
Ai's body to convert indirect to immediate, then eliminate.
1.12 Recursive Descent Parsing (RDP)
• A top-down parser that implements one recursive procedure per non-terminal.
• Parses without backtracking if grammar is suitable.
• Power: Accepts LL(*) grammars - MORE powerful than LL(1) table-driven parsers (which
accept only LL(1)).
• Grammar must be left-factored and left-recursion-free.
• Implementation language must support recursion efficiently.
FOLLOW(E) = {$, )}
FOLLOW(E') = FOLLOW(E) = {$, )}
FOLLOW(T) = (FIRST(E') - lambda) + FOLLOW(E) = {+} + {$,)} = {+, $, )}
FOLLOW(T') = FOLLOW(T) = {+, $, )}
FOLLOW(F) = (FIRST(T') - lambda) + FOLLOW(T) = {*} + {+,$,)} = {*, +, $, )}
1.14 LL(1) Predictive Parsing
LL(1) Table Construction
For each production A -> alpha:
Step 1: For each terminal a in FIRST(alpha), add A->alpha to M[A, a]
Step 2: If lambda in FIRST(alpha), for each terminal b in FOLLOW(A), add A-
>alpha to M[A, b]
Step 3: If lambda in FIRST(alpha) and $ in FOLLOW(A), add A->alpha to M[A,
$]
LL(k) Grammars
• LL(k): Choose production after reading k lookahead symbols.
• Hierarchy: LL(k) is a strict subset of LL(k+1).
• S -> aaBb | aaCb requires k=3 to distinguish (B or C) -> LL(3) not LL(1)
• Key fact: LL(1) is strictly contained in LR(1). Every LL(1) grammar is LR(1) but not vice versa.
LR Automaton Construction
LR(0) automaton is used by LR(0) and SLR(1). LR(1) automaton is used by LALR(1) and CLR(1).
2.3 LR(0) Parser
LR(0) Items
An LR(0) item is a production with a dot (.) on its RHS showing parsing progress.
A -> . X Y Z (nothing seen yet, expecting X next)
A -> X . Y Z (X seen, expecting Y next)
A -> X Y . Z (X, Y seen, expecting Z)
A -> X Y Z . (entire RHS seen - FINAL/KERNEL item - can REDUCE)
LR(0) Conflicts
Conflict Cause Example
Shift-Reduce State has both a shift item (A -> E -> T . +E and E -> T .
alpha . a beta) and a final item (B ->
gamma .)
Reduce-Reduce State has two or more final items A -> alpha . and B -> beta . both in
same state
Any grammar with Final item (A -> .) in I0 causes r/r Y -> lambda always causes LR(0) failure
epsilon productions conflict with shift items
2.4 SLR(1) Parser
Key Improvement over LR(0): Uses FOLLOW sets to restrict where reductions are placed.
SLR(1) Rule: For final item A -> alpha . in state I:
Add ACTION[I, a] = r_k ONLY for terminals a in FOLLOW(A)
(NOT the entire row as in LR(0))
SLR(1) Conflicts
Conflict Condition Solution
Shift-Reduce Shift terminal is in FOLLOW(LHS) of Not SLR(1) - need CLR/LALR
reduce
Reduce-Reduce FOLLOW(A) & FOLLOW(B) not Not SLR(1) - need CLR/LALR
empty, where A->alpha. and B-
>beta. in same state
No conflict Shift terminals and FOLLOW sets Grammar is SLR(1)
are all disjoint
Viable Prefixes
• A viable prefix is a prefix of a right sentential form that does not extend past the rightmost
handle.
• The parser stack always contains a viable prefix - never allows an invalid prefix.
• The LR(0) automaton recognizes EXACTLY the set of viable prefixes.
• This property (valid prefix property) is shared by both LL(1) and LALR(1) parsers.
2.5 CLR(1) / LR(1) Parser
LR(1) Items
An LR(1) item = [LR(0) item, lookahead set]
A -> alpha . beta , {a, b, c}
CLR(1) Conflicts
Conflict Cause
Reduce-Reduce Two final items in same state with overlapping lookaheads: [A -> alpha., {a}] and
[B -> beta., {a}]
Shift-Reduce Final item [A -> alpha., {a}] and shift item [B -> gamma . a delta] in same state
Problem: CLR(1) can create O(2^n) states - impractical for large grammars. Solution: LALR(1)
2.6 LALR(1) Parser
Strategy: Build CLR(1) automaton, then MERGE states with identical LR(0) cores (same productions
with dots in same positions) but different lookaheads. Union the lookahead sets.
LALR Conflicts
Conflict Type In LALR? Explanation
Reduce-Reduce Possible (can If merged states have final items with overlapping lookaheads
be introduced from different states: [A->a.,{a}] merged with [B->b.,{a}] gives r/r
by merging) conflict on "a"
Shift-Reduce CANNOT be S/R conflict in LALR must also exist in CLR. Merging only
introduced by unifies lookaheads, never creates new shifts.
merging
Merging Example
Suppose CLR has I4 = {A -> alpha., {=, $}} and I11 = {A -> alpha., {=}} with same LR(0) core.
After merge: I4-11 = {A -> alpha., {=, $}}. Note the merged state's transitions must also be merged.
Check: If I4 went to state X on some symbol and I11 went to state Y on same symbol, they must be X-
Y merged state.
2.7 Parser Comparison Summary
Feature LR(0) SLR(1) LALR(1) CLR(1)
Lookahead for reduce None (entire FOLLOW(A) LR(1) lookahead LR(1) lookahead
row) (precise)
Number of states n n (same as n (same as Can be >> n
LR(0)) LR(0))
Power Weakest Moderate Very strong Strongest
Practical use Theoretical only Rarely used now YACC, most Rarely (too many
parsers states)
Automaton type LR(0) LR(0) LR(1) merged LR(1)
Error detection Slowest Faster Fast Fastest
(spurious (immediate)
reductions)
Handles epsilon No Sometimes Usually Yes
grammars
LL(1) vs LALR(1)
Property LL(1) LALR(1)
Parse approach Top-down, leftmost derivation Bottom-up, rightmost derivation
reversed
Stack contents Predicted (future) symbols - useful for Past symbols (what has been seen)
error repair
Error detection Earlier (no spurious reductions) May do reductions before detecting
error
Error repair Easier (stack shows expected future) Harder (stack shows past, not
expected future)
Time complexity O(n) O(n)
Property LL(1) LALR(1)
Grammar coverage Subset of LALR(1) Superset of LL(1)
2.8 Syntax Directed Definitions (SDD)
SDD = CFG + Semantic Rules. Grammar symbols have attributes; productions have associated rules
to compute attribute values.
Purpose: Semantic Analysis - compute meanings, perform type checking, build symbol table, generate
code.
Types of Attributes
Type Computed From Traversal Order Example
Synthesized Children's attributes Post-order (bottom-up) [Link] = [Link] + [Link]
(bottom-up)
Inherited Parent's and siblings' Complex (dependency- [Link] = [Link] (type
attributes based) declaration)
S-Attributed SDDs
S-Attributed SDD: Contains ONLY synthesized attributes. Can be evaluated by bottom-up parser in
single pass.
Evaluation: Stack-based during bottom-up parsing. On reduce A -> beta, compute [Link] from beta's
attributes on stack.
Dependency Graphs
• Nodes = attribute instances in the parse tree
• Edge from c to b means b depends on c (compute c first)
• Must be a DAG (directed acyclic graph) - no cycles
• Topological sort gives valid evaluation order
To evaluate an SDD:
1. Construct parse tree for input
2. Build dependency graph (one node per attribute instance)
3. Topological sort the dependency graph
4. Evaluate semantic rules in topological order
5. Result: Annotated parse tree